pith. sign in

arxiv: 2606.07770 · v1 · pith:RRCP5CGYnew · submitted 2026-06-05 · 💻 cs.LG

Contrast encodes inductive bias: separating slow noise from dynamics in predictive representation learning

Pith reviewed 2026-06-27 22:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords contrastive learningpredictive representation learningself-supervised learningdynamical systemsnoise robustnessnegative samplingJEPAinductive bias
0
0 comments X

The pith

Contrastive predictive objectives encode slow noise instead of dynamics when negatives are sampled across trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that methods such as JEPA and similar contrastive predictive learners fail on dynamical data because they preferentially encode slowly varying noise features that stay constant within each trajectory. This happens because negatives drawn from different trajectories allow the model to use those noise features as an easy predictive cue. When negatives are instead drawn from within the same trajectory, the noise cue disappears and the encoder must learn the actual state variables that govern the dynamics. Training across many trajectories then produces representations whose quality improves with trajectory length even under strong noise. The result holds for both a synthetic moving-dot task and a rigid-body pendulum dataset, indicating that the sampling choice itself sets the inductive bias.

Core claim

When negatives are sampled across trajectories, contrastive predictive objectives encode trajectory-specific slow noise rather than the latent variables that govern the observed dynamics; sampling negatives within a single trajectory removes this shortcut and forces the encoder to capture the dynamics.

What carries the argument

Negative sampling strategy in contrastive predictive objectives: drawing negatives from other trajectories lets constant-within-trajectory noise serve as a distinguishing cue, while within-trajectory sampling removes that cue.

If this is right

  • Representations learned with across-trajectory negatives will be dominated by trajectory identity rather than state variables, so downstream task performance will degrade as noise amplitude increases.
  • Representations learned with within-trajectory negatives will improve as the number and duration of trajectories grow, even when slow noise is strong.
  • The same failure mode appears in any contrastive predictive objective that draws negatives across trajectories, including SimCLR-style JEPA variants and DySIB.
  • Physical systems observed with slowly drifting experimental noise require within-trajectory negative sampling to extract interpretable dynamical representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sampling principle may apply to other self-supervised objectives that rely on temporal or trajectory-level negatives, such as masked prediction or reconstruction losses in latent space.
  • For real-world robotics or neuroscience data, experimenters could record long continuous trajectories and deliberately reuse frames from the same recording as negatives during training.
  • The result suggests a general design rule: any contrastive objective whose negatives can be distinguished by a static cue will learn that cue instead of the intended signal.

Load-bearing premise

Noise features remain approximately constant within each trajectory so they can distinguish frames across different trajectories.

What would settle it

Train the same encoder on the pendulum movies with strong slow noise; measure whether within-trajectory negative sampling yields representations whose downstream prediction error decreases with trajectory length while across-trajectory sampling does not.

Figures

Figures reproduced from arXiv: 2606.07770 by Ilya Nemenman, Paarth Gulati.

Figure 1
Figure 1. Figure 1: Datasets and architectures. The two (dataset, architecture) pairs studied here; both panels show example frames from length-T trajectories with added temporally correlated noise. (A) SimCLR￾predictor [Sobal et al., 2022] on a synthetic moving-dot dataset, with an action-conditioned predictor that maps the encoded current frame and action to the encoded next frame. (B) DySIB [Martini et al., 2026] on experi… view at source ↗
Figure 2
Figure 2. Figure 2: ATN but not WTN predictive objectives fail under fixed noise. (A) Stochastic moving￾dot dataset with the SimCLR-predictor objective. (B) Experimental pendulum videos with the DySIB objective at kz = 8 (chosen following the original DySIB study to reduce sensitivity to local optima during training; Appendix B). We add temporally correlated noise to the observations and report the probe RMSE on the true dyna… view at source ↗
Figure 3
Figure 3. Figure 3: Fixed noise breaks ATN but not WTN angular embeddings. DySIB embeddings of the pendulum dataset with a two-dimensional bottleneck (kz = 2), colored by the true pendulum angle. Each point is a sample from the variational posterior q(zt | xt−nF +1:t) for every nF -frame block in the training data. Rows: ATN (top), WTN (bottom). Columns: no added noise, changing noise, fixed noise (σG = 0.2 for both). Without… view at source ↗
Figure 4
Figure 4. Figure 4: After WTN removes the fixed-noise failure, the largest errors occur at intermediate noise correlation times. Stochastic moving-dot dataset with temporally correlated uniform noise of strength σU (colors) and correlation time τ (horizontal axis), at fixed trajectory duration T = 17. Dotted colored horizontal lines mark the changing-noise (τ → 0) probe RMSE at the corresponding σU . Black dashed line is chan… view at source ↗
Figure 5
Figure 5. Figure 5: Longer trajectories help WTN, not ATN. Stochastic moving-dot dataset with temporally correlated uniform noise of strength σU = 3.0, varying trajectory duration T and correlation time τ . (A) Probe RMSE for ATN. The fixed-noise failure region at τ ≫ T persists at every T. (B) Probe RMSE for WTN. The only remaining failure is the intermediate-correlation peak near τ ≈ T, which weakens as T grows. (C) Maximum… view at source ↗
Figure 2
Figure 2. Figure 2: (A) Moving dot: 50 pseudorandom hyperparameter combinations per point. Light fill spans [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗
read the original abstract

Self-supervised methods that learn representations and predict dynamics fully in the latent space, such as JEPA, have been shown to confuse slowly varying noise with the dynamical signals they aim to capture. Specifically, when noise features remain approximately constant within each trajectory, contrastive predictive objectives preferentially encode these features instead of the true latent variables governing the system. The learned representation then becomes dominated by trajectory-specific noise, so downstream performance degrades with noise strength and does not improve even as the number and duration of training trajectories increase. We argue that this failure is a property of the objective itself, shared by a long line of contrastive predictive objectives that sample negatives across trajectories. To illustrate this generality, we study the failure mode and its remedy in two settings: a standard SimCLR-style JEPA on a synthetic moving-dot dataset, and DySIB, a recently introduced method designed for extracting physically interpretable representations of dynamics, on movies of a rigid-body pendulum. When negatives are instead sampled within a single trajectory, the slow noise can no longer distinguish frames within that trajectory, removing the predictive shortcut. Training one encoder simultaneously on many such trajectories then forces it to encode the variables relevant for the dynamics, with longer trajectories yielding better representations even for strong slow noise. Our results point toward principles for designing contrastive predictive objectives in dynamical representation learning, especially for physical systems with noisy experimental observations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that contrastive predictive objectives (e.g., JEPA-style and DySIB) that draw negatives across trajectories encode slowly varying, trajectory-constant noise features rather than true dynamical variables; this is shown to degrade downstream performance in a manner independent of data volume. The failure is illustrated via controlled synthetic experiments on moving-dot trajectories (SimCLR-style JEPA) and pendulum movies (DySIB). Switching to within-trajectory negative sampling removes the shortcut, yielding representations that improve with trajectory length even under strong noise. The authors argue the issue is intrinsic to the objective class.

Significance. If the central mechanism holds, the work identifies a load-bearing inductive bias in a family of contrastive predictive losses used for dynamical representation learning, with direct implications for method design on noisy physical data. The controlled synthetic experiments directly illustrate both the failure mode and its removal, providing clear, falsifiable support for the proposed noise-encoding shortcut without reliance on fitted parameters or self-referential definitions.

major comments (1)
  1. [Abstract] Abstract and introduction: the assertion that the failure 'is a property of the objective itself, shared by a long line of contrastive predictive objectives that sample negatives across trajectories' rests on demonstrations in two specific architectures/datasets; without a general loss analysis, reduction to other objectives (e.g., CPC or InfoNCE variants), or proof that the minimizer must prefer constant-per-trajectory noise features under the stated noise model, the generality claim is not yet load-bearing.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the clear identification of where our generality claim requires additional support. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the assertion that the failure 'is a property of the objective itself, shared by a long line of contrastive predictive objectives that sample negatives across trajectories' rests on demonstrations in two specific architectures/datasets; without a general loss analysis, reduction to other objectives (e.g., CPC or InfoNCE variants), or proof that the minimizer must prefer constant-per-trajectory noise features under the stated noise model, the generality claim is not yet load-bearing.

    Authors: We agree that the current evidence consists of controlled demonstrations in two representative settings rather than a formal reduction or proof, and that this limits the strength of the generality assertion. The mechanism we identify is tied to the shared structure of cross-trajectory negative sampling, which permits any approximately constant-per-trajectory feature to act as a loss-minimizing shortcut. We will revise the abstract and introduction to qualify the claim as applying to the class of contrastive predictive objectives that employ cross-trajectory negatives, and we will add a short subsection providing a qualitative analysis of the InfoNCE-style loss under the slow-noise model. This analysis will show why the minimizer favors trajectory-constant features and will briefly indicate how the same logic extends to CPC and standard InfoNCE variants. These changes will be marked as partial because a complete mathematical proof for every possible variant is beyond the scope of the revision. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical illustrations on synthetic data do not reduce to fitted inputs or self-definitions

full rationale

The manuscript's central argument—that the observed failure mode is an inherent property of contrastive predictive objectives sampling negatives across trajectories—is advanced via controlled experiments on two synthetic datasets (moving dots with SimCLR-style JEPA; pendulum movies with DySIB). No equations, loss derivations, or uniqueness theorems are supplied that would allow a prediction or result to be rewritten as an input by construction. The text contains no self-citation chains invoked as load-bearing mathematical facts, no fitted parameters renamed as predictions, and no ansatzes smuggled via prior work. The demonstration is therefore self-contained against external benchmarks (the two data generators) and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that noise is slowly varying and constant within trajectories, plus the modeling choice that the two synthetic datasets are representative of the general failure mode.

axioms (1)
  • domain assumption Noise features remain approximately constant within each trajectory
    This premise enables the noise to act as a cross-trajectory distinguishing cue; it is invoked in the description of the failure mode.

pith-pipeline@v0.9.1-grok · 5775 in / 1222 out tokens · 27137 ms · 2026-06-27T22:35:48.260189+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    Thapa, R., He, B., Kjaer, M

    Joint embedding predictive architectures focus on slow features , author=. 2022 , howpublished=. 2211.10831 , archiveprefix=

  2. [2]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Removing the background by adding the background: Towards background robust self-supervised video representation learning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  3. [3]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Self-supervised video representation learning by context and motion decoupling , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  4. [4]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Motion-aware contrastive video representation learning via foreground-background merging , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  5. [5]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Tubelet-contrastive self-supervision for video-efficient generalization , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  6. [6]

    dynamic information , author=

    A deeper dive into what deep spatiotemporal networks encode: Quantifying static vs. dynamic information , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  7. [7]

    2022 , howpublished=

    Denoised MDPs: Learning world models better than the world itself , author=. 2022 , howpublished=. 2206.15477 , archiveprefix=

  8. [8]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Revisiting feature prediction for learning visual representations from video , author=. 2024 , howpublished=. 2404.08471 , archiveprefix=

  9. [9]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M. and Bardes, A. and Fan, D. and Garrido, Q. and Howes, R. and Komeili, M. and Muckley, M. and Rizvi, A. and Roberts, C. and Sinha, K. and Zholus, A. and Arnaud, S. and Gejji, A. and Martin, A. and Hogan, F. R. and Dugas, D. and Bojanowski, P. and Khalidov, V. and Labatut, P. and Massa, F. and Szafraniec, M. and Krishnakumar, K. and Li, Y. and Ma...

  10. [10]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Self-supervised learning from images with a joint-embedding predictive architecture , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  11. [11]

    Representation Learning with Contrastive Predictive Coding

    Representation learning with contrastive predictive coding , author=. 2018 , howpublished=. 1807.03748 , archiveprefix=

  12. [12]

    Advances in Neural Information Processing Systems , volume=

    What makes for good views for contrastive learning? , author=. Advances in Neural Information Processing Systems , volume=

  13. [13]

    2021 , howpublished=

    Conditional contrastive learning for improving fairness in self-supervised learning , author=. 2021 , howpublished=. 2106.02866 , archiveprefix=

  14. [14]

    Journal of Machine Learning Research , volume=

    Emergence of invariance and disentanglement in deep representations , author=. Journal of Machine Learning Research , volume=

  15. [15]

    OpenReview , volume=

    A path towards autonomous machine intelligence version 0.9.2, 2022-06-27 , author=. OpenReview , volume=

  16. [16]

    Martini, K. M. and Abdelaleem, E. and Gulati, P. and Nemenman, I. , title =. 2026 , howpublished =. 2604.24662 , archiveprefix =

  17. [17]

    LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

    LeJEPA: Provable and scalable self-supervised learning without the heuristics , author=. 2025 , howpublished=. 2511.08544 , archiveprefix=

  18. [18]

    LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels , author=. 2026 , howpublished=. 2603.19312 , archiveprefix=

  19. [19]

    Neural Computation , volume=

    Data efficiency, dimensionality reduction, and the generalized symmetric information bottleneck , author=. Neural Computation , volume=

  20. [20]

    2025 , howpublished=

    Joint embedding vs reconstruction: Provable benefits of latent space prediction for self-supervised learning , author=. 2025 , howpublished=. 2505.12477 , archiveprefix=

  21. [21]

    Nature Computational Science , volume=

    Automated discovery of fundamental variables hidden in experimental data , author=. Nature Computational Science , volume=

  22. [22]

    Cell , volume=

    Machine learning interpretable models of cell mechanics from protein images , author=. Cell , volume=

  23. [23]

    Gurevich, D. R. and Golden, M. R. and Reinbold, P. A. K. and Grigoriev, R. O. , journal=. Learning fluid physics from highly turbulent data using sparse physics-informed discovery of empirical relations (

  24. [24]

    Cell Reports Methods , volume=

    Inferring single-cell transcriptomic dynamics with structured latent gene expression dynamics , author=. Cell Reports Methods , volume=

  25. [25]

    2025 , howpublished=

    Interpretable gene network inference with nonlinear causality , author=. 2025 , howpublished=

  26. [26]

    Daniels, B. C. and Ryu, W. S. and Nemenman, I. , journal=. Automated, predictive, and interpretable inference of

  27. [27]

    Proceedings of the National Academy of Sciences , volume=

    Physics-tailored machine learning reveals unexpected physics in dusty plasmas , author=. Proceedings of the National Academy of Sciences , volume=

  28. [28]

    International Conference on Learning Representations , year=

    Towards principled representation learning from videos for reinforcement learning , author=. International Conference on Learning Representations , year=

  29. [29]

    and Misra, D

    Efroni, Y. and Misra, D. and Krishnamurthy, A. and Agarwal, A. and Langford, J. , booktitle=. Provable

  30. [30]

    and Chen, S

    Yu, J. and Chen, S. and Liu, M. and Horiuchi, N. and Braverman, V. and Xu, Z. and Haramati, D. and Balestriero, R. , year=. Why and how auxiliary tasks improve. 2509.12249 , archiveprefix=

  31. [31]

    Causal-JEPA: Learning World Models through Object-Level Latent Masking

    Nam, J. and Le Lidec, Q. and Maes, L. and LeCun, Y. and Balestriero, R. , year=. Causal-. 2602.11389 , archiveprefix=

  32. [32]

    2026 , howpublished=

    Learning invariant visual representations for planning with joint-embedding predictive world models , author=. 2026 , howpublished=. 2602.18639 , archiveprefix=

  33. [33]

    Neural Computation , volume=

    Slow feature analysis: Unsupervised learning of invariances , author=. Neural Computation , volume=

  34. [34]

    International Conference on Machine Learning , pages=

    Understanding contrastive learning requires incorporating inductive biases , author=. International Conference on Machine Learning , pages=

  35. [35]

    Advances in Neural Information Processing Systems , volume=

    Self-supervised learning with data augmentations provably isolates content from style , author=. Advances in Neural Information Processing Systems , volume=

  36. [36]

    International Conference on Machine Learning , pages=

    Understanding contrastive representation learning through alignment and uniformity on the hypersphere , author=. International Conference on Machine Learning , pages=

  37. [37]

    International Conference on Learning Representations , year=

    What should not be contrastive in contrastive learning , author=. International Conference on Learning Representations , year=

  38. [38]

    2025 , howpublished=

    Koopman invariants as drivers of emergent time-series clustering in joint-embedding predictive architectures , author=. 2025 , howpublished=. 2511.09783 , archiveprefix=

  39. [39]

    and Pereira, F

    Tishby, N. and Pereira, F. C. and Bialek, W. , title =. 37th Annual Allerton Conference on Communication, Control, and Computing , pages =

  40. [40]

    and Mosenzon, O

    Friedman, N. and Mosenzon, O. and Slonim, N. and Tishby, N. , title =. Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI) , pages =

  41. [41]

    and Fischer, I

    Alemi, A. and Fischer, I. and Dillon, J. and Murphy, K. , title =. International Conference on Learning Representations , year =

  42. [42]

    and Nemenman, I

    Abdelaleem, E. and Nemenman, I. and Martini, K. M. , title =. Journal of Machine Learning Research , volume =

  43. [43]

    and Nemenman, I

    Bialek, W. and Nemenman, I. and Tishby, N. , title =. Neural Computation , volume =

  44. [44]

    , title =

    Wiener, N. , title =

  45. [45]

    Advances in neural information processing systems , volume=

    Random features for large-scale kernel machines , author=. Advances in neural information processing systems , volume=