Progress in artificial intelligence has long been driven by methods that make fewer assumptions. As compute and data scale, approaches with weaker inductive biases tend to outperform those with stronger, more handcrafted constraints. This trend is especially visible in visual representation learning, which has evolved from supervised learning to weakly supervised learning and now to self-supervised learning (SSL) without human labels. Yet even modern SSL methods still depend on strong inductive biases such as data augmentations, masking, or cropping. If the trend holds, these remaining biases should become bottlenecks at scale — and according to a new paper, experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates a search for approaches that rely on fewer assumptions, leading to the introduction of Temporal Difference in Vision (TDV).
The Shift Away From Strong Assumptions
The paper, titled "You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences," was published on arXiv on June 14, 2026 by researchers Ninad Daithankar, Alexi Gladstone, Yann LeCun, and Heng Ji. The authors argue that even the best self-supervised methods today still impose significant human-designed biases. For example, contrastive learning relies on heavy data augmentations, and masked autoencoders rely on masking strategies. These assumptions, while effective, may ultimately limit performance as datasets grow. The team set out to create a method that avoids such strong biases.
How TDV Works
TDV introduces a new paradigm for self-supervised learning from video. Instead of using augmentations, masking, or cropping, TDV relies on a single causal assumption: that the past causes the future. Specifically, the method jointly trains an image encoder and a motion encoder so that the representation of the current frame plus the encoded motion equals the representation of the next frame. This simple temporal difference objective allows the model to learn visual representations purely from the natural flow of video, without any human-crafted perturbations.
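To make the objective concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the linear encoders, the squared-error loss, the choice to feed the motion encoder both frames, and every name in it (encode_image, encode_motion, temporal_difference_loss, w_img, w_mot) are assumptions made for illustration. The article only states that the current frame's representation plus the encoded motion should equal the next frame's representation.

```python
import numpy as np


def encode_image(frame, w_img):
    """Hypothetical image encoder: flatten the frame and apply a linear map."""
    return frame.reshape(-1) @ w_img


def encode_motion(frame_t, frame_next, w_mot):
    """Hypothetical motion encoder: linear map over the concatenated frame pair."""
    pair = np.concatenate([frame_t.reshape(-1), frame_next.reshape(-1)])
    return pair @ w_mot


def temporal_difference_loss(frame_t, frame_next, w_img, w_mot):
    """Mean squared error between z_t + m_t and z_{t+1}.

    The squared-error form is an assumption; the source does not give the loss.
    """
    z_t = encode_image(frame_t, w_img)
    z_next = encode_image(frame_next, w_img)
    m_t = encode_motion(frame_t, frame_next, w_mot)
    return float(np.mean((z_t + m_t - z_next) ** 2))


# Toy data: two random 8x8 "frames" and randomly initialised encoder weights.
rng = np.random.default_rng(0)
height, width, dim = 8, 8, 16
w_img = rng.normal(size=(height * width, dim)) * 0.1
w_mot = rng.normal(size=(2 * height * width, dim)) * 0.1
frame_t = rng.random((height, width))
frame_next = rng.random((height, width))

loss = temporal_difference_loss(frame_t, frame_next, w_img, w_mot)
print(f"temporal difference loss: {loss:.4f}")
```

In a real training loop both sets of weights would be updated jointly to reduce this loss over many consecutive frame pairs. Note that in this toy version, setting all weights to zero also drives the loss to zero, so a practical method needs some way to rule out such trivial solutions; the article does not describe how TDV handles this.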
Because TDV operates directly on raw video frames and their temporal relationships, it avoids the need for strong inductive biases. The authors note that this aligns with the historical trend of AI progress: methods that assume less tend to scale better with data and compute.
Experimental Validation
Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks. The paper reports performance comparable to leading self-supervised methods on tasks such as semantic segmentation and object detection, which require spatially detailed understanding of a scene rather than a single image-level label. The results suggest that TDV's weaker assumptions do not come at the cost of performance, at least on the benchmarks tested.
Broader Implications for Visual Representation Learning
The TDV paper adds to a growing body of evidence that strong inductive biases may become unnecessary as data volume increases. For enterprises that rely on computer vision, such as those in logistics, autonomous navigation, or quality inspection, this research points toward AI systems that can learn more flexibly from unlabeled video streams. While TDV is still a research method, its results on dense tasks suggest that future visual representation learning may require less manual engineering of augmentations and more reliance on temporal structure.
The authors present TDV as a foundation for representation learning without strong assumptions, potentially opening the door to more scalable and generalizable visual AI.