From Alignment to Prediction: A Study of Self-Supervised Learning and Predictive Representation Learning

Mintu Dutta; Mohendra Roy; Ritesh Vyas

arxiv: 2604.13518 · v1 · submitted 2026-04-15 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-10 13:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords self-supervised learning · predictive representation learning · JEPA · alignment · reconstruction · taxonomy · BYOL · MAE


The pith

Self-supervised learning gains a new category, Predictive Representation Learning, which predicts unobserved data components in latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Predictive Representation Learning (PRL) as a distinct approach in self-supervised learning centered on latent prediction of unobserved data parts from observed inputs. It contrasts PRL with alignment methods that match representations and reconstruction methods that rebuild inputs. The authors introduce a unified taxonomy to organize these three categories together. They position Joint-Embedding Predictive Architecture (JEPA) as a core example of PRL and back the framework with experiments comparing BYOL, MAE, and I-JEPA. The work frames PRL as a promising direction that could better capture data distributions.

Core claim

We define Predictive Representation Learning (PRL) as revolving around the latent prediction of unobserved components of data based on the observation. We propose a common taxonomy that classifies PRL alongside alignment and reconstruction-based learning approaches. Joint-Embedding Predictive Architecture (JEPA) serves as an exemplary member of this paradigm. Theoretical perspectives and open challenges are discussed, and comparative implementations of Bootstrap Your Own Latent (BYOL), Masked Autoencoders (MAE), and Image-JEPA (I-JEPA) show MAE reaching perfect similarity of 1.00 with robustness 0.55 while BYOL and I-JEPA reach accuracies of 0.98 and 0.95 with robustness scores of 0.75 and 0.78, respectively.

What carries the argument

Predictive Representation Learning (PRL), the category that revolves around latent prediction of unobserved data components to form structures predictive of the data distribution.

If this is right

JEPA can be viewed as a member of the PRL paradigm.
The taxonomy organizes alignment, reconstruction, and PRL methods under one framework.
PRL approaches may better support learning of structures that predict the data distribution.
Further work on PRL could address open challenges in self-supervised learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reported robustness scores suggest PRL methods such as I-JEPA may balance performance traits differently than reconstruction-focused ones like MAE.
Explicitly designing for latent prediction of unobserved parts could lead to architectures that handle incomplete observations more directly.
Extending the taxonomy to additional data modalities might reveal whether PRL advantages hold beyond the image experiments shown.

Load-bearing premise

The proposed distinctions between alignment, reconstruction, and predictive representation learning form a meaningful and non-overlapping taxonomy that yields new insight.

What would settle it

A demonstration that methods placed in the PRL category rely primarily on alignment or reconstruction mechanisms, or that PRL-labeled approaches show no measurable advantage in predicting unobserved data components over the other categories.

Figures

Figures reproduced from arXiv: 2604.13518 by Mintu Dutta, Mohendra Roy, Ritesh Vyas.

**Figure 1.** Architectural comparison of Contrastive, Non-Contrastive, Reconstruction-based and Predictive Representation …

**Figure 2.** Results from our custom implementation and training: comparison of augmentation similarity and occlusion robustness across BYOL, I-JEPA, and MAE. Predictive learning (I-JEPA) achieves superior robustness despite lower similarity.

**Figure 3.** New Taxonomy for SSL categorization

Original abstract

Self-supervised learning has emerged as a major technique for the task of learning from unlabeled data, where the current methods mostly revolve around alignment of representations and input recon struction. Although such approaches have demonstrated excellent performance in practice, their scope remains mostly confined to learning from observed data and does not provide much help in terms of a learning structure that is predictive of the data distribution. In this paper, we study some of the recent developments in the realm of self-supervised learning. We define a new category called Predictive Representation Learning (PRL), which revolves around the latent prediction of unobserved components of data based on the observation. We propose a common taxonomy that classifies PRL along with alignment and reconstruction-based learning approaches. Furthermore, we argue that Joint-Embedding Predictive Architecture(JEPA) can be considered as an exemplary member of this new paradigm. We further discuss theoretical perspectives and open challenges, highlighting predictive representation learning as a promising direction for future self-supervised learning research. In this study, we implemented Bootstrap Your Own Latent (BYOL), Masked Autoencoders (MAE), and Image-JEPA (I-JEPA) for comparative analysis. The results indicate that MAE achieves perfect similarity of 1.00, but exhibits relatively weak robustness of 0.55. In contrast, BYOL and I-JEPA attain accuracies of 0.98 and 0.95, with robustness scores of 0.75 and 0.78, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper mainly offers a taxonomy that slots JEPA-style methods into a new Predictive Representation Learning bucket separate from alignment and reconstruction, but the split looks porous and the supporting experiments stay thin.


The main takeaway is that this work defines Predictive Representation Learning as latent prediction of unobserved data components and groups it with alignment and reconstruction in a shared taxonomy, positioning I-JEPA as the prime example. It also runs a quick head-to-head on BYOL, MAE, and I-JEPA that reports similarity and robustness numbers. That framing and the short discussion of open challenges are the clearest pieces of new organization the paper adds. The experiments at least give concrete numbers to compare the three methods on the same axes, which can help readers see differences in practice. The theoretical perspectives section flags some directions worth watching.

The soft spots sit in the taxonomy itself and the evidence. The definition of PRL as predicting unobserved latent parts overlaps with what masked autoencoders already do when they reconstruct missing patches, yet MAE is placed firmly in the reconstruction column without a clear rule for why the two categories stay separate. The reported results mix accuracy and similarity scores without naming datasets, exact metrics, or training details, so it is difficult to judge how much they actually illustrate the claimed advantages. No formal justification or proof is given that the taxonomy avoids overlap or yields design insights beyond re-labeling.

This is the sort of conceptual paper that could interest people who follow self-supervised learning surveys and want a quick way to sort recent methods. Readers who need tight definitions or reproducible experiments will find it light. It deserves peer review so referees can press on the category boundaries and ask for fuller experimental reporting; the structure is coherent enough to make that feedback useful rather than a waste of time.

Referee Report

3 major / 3 minor

Summary. The paper claims to define a new category of Predictive Representation Learning (PRL) in self-supervised learning centered on latent prediction of unobserved data components, proposes a taxonomy that organizes PRL alongside alignment and reconstruction-based methods, positions Joint-Embedding Predictive Architecture (JEPA) as an exemplar of PRL, and reports comparative experiments with BYOL, MAE, and I-JEPA yielding metrics such as MAE similarity of 1.00 with robustness 0.55, BYOL accuracy 0.98 with robustness 0.75, and I-JEPA accuracy 0.95 with robustness 0.78.

Significance. If the taxonomy provides a meaningful, non-overlapping organization that yields new insight, it could help structure the self-supervised learning literature and direct attention toward predictive methods as a distinct research direction. The experiments offer preliminary numerical comparisons suggesting robustness differences, but the absence of formal definitions and experimental details limits the potential contribution.

major comments (3)

[Abstract] Abstract: The definition of PRL as revolving around 'the latent prediction of unobserved components of data based on the observation' does not secure a non-overlapping distinction from reconstruction-based approaches. MAE, classified as reconstruction, predicts unobserved masked inputs, yet no section provides formal definitions or analysis showing why this does not qualify as PRL under the stated criterion or why the taxonomy yields insight beyond re-labeling.
[Experimental Results] Experimental section (implied by results in abstract): The reported metrics (MAE similarity 1.00 and robustness 0.55; BYOL accuracy 0.98 and robustness 0.75; I-JEPA 0.95 and 0.78) use inconsistent measures across methods without any description of datasets, exact metric definitions, or setup. This makes the comparative claims uninterpretable and prevents assessment of whether they illustrate PRL advantages.
[Taxonomy Proposal] Taxonomy section: The common taxonomy classifying PRL with alignment and reconstruction is presented as a conceptual organization but lacks rigorous justification, formal definitions, or demonstration of non-overlap and novelty. The experiments are described separately without linking back to validate the taxonomy.

minor comments (3)

[Abstract] Abstract contains a spacing error: 'input recon struction' should read 'reconstruction'.
[Abstract] Abstract has missing space: 'Architecture(JEPA)' should be 'Architecture (JEPA)'.
[Experimental Results] The manuscript does not specify the full experimental protocol, including datasets and evaluation details, which is a clarity issue for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed and constructive feedback on our manuscript. We address each of the major comments point by point below, providing clarifications and committing to revisions where appropriate to improve the clarity and rigor of our work.


Referee: [Abstract] Abstract: The definition of PRL as revolving around 'the latent prediction of unobserved components of data based on the observation' does not secure a non-overlapping distinction from reconstruction-based approaches. MAE, classified as reconstruction, predicts unobserved masked inputs, yet no section provides formal definitions or analysis showing why this does not qualify as PRL under the stated criterion or why the taxonomy yields insight beyond re-labeling.

Authors: We thank the referee for highlighting this important point regarding potential overlap in definitions. Our intention with PRL is to emphasize prediction performed directly in the latent representation space, where the model learns to predict representations of unobserved data components (such as alternative views or masked regions in embedding space) rather than reconstructing the raw input data. For instance, JEPA predicts the embedding of a target view from the context view without any input reconstruction. In contrast, MAE uses a decoder to reconstruct the actual pixel values of masked patches. We will revise the manuscript to include formal mathematical definitions for PRL, alignment, and reconstruction categories, along with an analysis demonstrating their non-overlapping nature based on the prediction target (latent vs. input space). This taxonomy offers insight by identifying a direction focused on learning predictive models of the data manifold in representation space, which could explain differences in robustness observed in experiments. We will update the abstract accordingly.

revision: yes

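The latent-versus-input distinction drawn in this response can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear maps `W_enc`, `W_tgt`, `W_dec` and `W_pred`, the patch sizes, and the mean-pooled context vector are all assumptions chosen to keep the example small. The only point it makes is that the two losses differ in what they regress onto.

```python
import numpy as np

def reconstruction_loss(context, masked_patches, W_dec):
    """Input-space target (MAE-style): predict raw pixels of the masked patches."""
    pred_pixels = np.tile(context @ W_dec, (len(masked_patches), 1))
    return float(np.mean((pred_pixels - masked_patches) ** 2))

def latent_prediction_loss(context, masked_patches, W_pred, W_tgt):
    """Latent-space target (JEPA-style): predict embeddings of the masked patches."""
    # Targets come from a target encoder and are held fixed (no gradient in training).
    target_latents = masked_patches @ W_tgt
    pred_latents = np.tile(context @ W_pred, (len(masked_patches), 1))
    return float(np.mean((pred_latents - target_latents) ** 2))

# Toy data (hypothetical sizes): 8 patches of 16 pixels, the last 4 are masked.
rng = np.random.default_rng(0)
patch_dim, latent_dim = 16, 4
patches = rng.normal(size=(8, patch_dim))
visible, masked = patches[:4], patches[4:]

# Linear stand-ins for the networks (assumptions, not real architectures).
W_enc = rng.normal(size=(patch_dim, latent_dim)) * 0.1     # context encoder
W_tgt = W_enc.copy()                                       # target encoder, simply copied here
W_dec = rng.normal(size=(latent_dim, patch_dim)) * 0.1     # pixel decoder
W_pred = rng.normal(size=(latent_dim, latent_dim)) * 0.1   # latent predictor

# Summarise the visible patches as one context vector of shape (latent_dim,).
context = (visible @ W_enc).mean(axis=0)

pixel_loss = reconstruction_loss(context, masked, W_dec)
latent_loss = latent_prediction_loss(context, masked, W_pred, W_tgt)
print(f"pixel-space loss:  {pixel_loss:.4f}")
print(f"latent-space loss: {latent_loss:.4f}")
```

Both functions consume the same context vector and the same masked patches; only the regression target changes (pixels versus target-encoder embeddings), which is the criterion the authors propose for separating the reconstruction and PRL categories.
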
Referee: [Experimental Results] Experimental section (implied by results in abstract): The reported metrics (MAE similarity 1.00 and robustness 0.55; BYOL accuracy 0.98 and robustness 0.75; I-JEPA 0.95 and 0.78) use inconsistent measures across methods without any description of datasets, exact metric definitions, or setup. This makes the comparative claims uninterpretable and prevents assessment of whether they illustrate PRL advantages.

Authors: We agree that the experimental details and metric descriptions were inadequate in the submitted version, rendering the results difficult to interpret. The similarity metric for MAE likely refers to reconstruction fidelity or representation similarity, while accuracy for BYOL and I-JEPA may refer to downstream task performance, and robustness to some perturbation test. To address this, we will substantially expand the experimental section with complete details on the datasets used (e.g., specific benchmarks like ImageNet subsets), precise definitions and formulas for all metrics (similarity, accuracy, robustness), the full experimental setup including hyperparameters, and how these metrics demonstrate advantages or characteristics of PRL methods like I-JEPA. We will also ensure consistent evaluation across methods where possible to allow fair comparison and link the results to the proposed taxonomy.

revision: yes

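Because the paper does not define its metrics, the following is only one plausible reading of the "augmentation similarity" and "occlusion robustness" named in the Figure 2 caption, written as a hedged NumPy sketch. The definitions (mean cosine similarity between embeddings of two views, and between clean and occluded inputs), the top-left square occlusion, and the random linear `encode` function are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Row-wise cosine similarity between two (n, d) embedding arrays."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return np.sum(a * b, axis=1)

def occlude(images, size):
    """Zero out a size x size square in the top-left corner of each image."""
    out = images.copy()
    out[:, :size, :size] = 0.0
    return out

def augmentation_similarity(encode, view_a, view_b):
    """Assumed definition: mean cosine similarity between two views' embeddings."""
    return float(np.mean(cosine_similarity(encode(view_a), encode(view_b))))

def occlusion_robustness(encode, images, size=8):
    """Assumed definition: mean cosine similarity between clean and occluded embeddings."""
    return float(np.mean(cosine_similarity(encode(images), encode(occlude(images, size)))))

# Toy data and a random linear encoder standing in for a trained model.
rng = np.random.default_rng(0)
images = rng.normal(size=(32, 16, 16))
W = rng.normal(size=(16 * 16, 8))

def encode(x):
    return x.reshape(len(x), -1) @ W

# A second "view" made by adding small noise (a stand-in for real augmentation).
noisy = images + 0.1 * rng.normal(size=images.shape)
sim = augmentation_similarity(encode, images, noisy)
rob = occlusion_robustness(encode, images, size=8)
print(f"augmentation similarity: {sim:.3f}")
print(f"occlusion robustness:    {rob:.3f}")
```

Under these assumed definitions both scores lie in [-1, 1] and are computed the same way for every method, which is the kind of consistent evaluation the referee asks for.
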
Referee: [Taxonomy Proposal] Taxonomy section: The common taxonomy classifying PRL with alignment and reconstruction is presented as a conceptual organization but lacks rigorous justification, formal definitions, or demonstration of non-overlap and novelty. The experiments are described separately without linking back to validate the taxonomy.

Authors: The taxonomy is meant to categorize self-supervised learning approaches based on their core learning objective: alignment for learning invariant representations through positive pairs, reconstruction for recovering input details, and PRL for predicting latent representations of unobserved parts to capture predictive structure. We will enhance the taxonomy section with rigorous justification drawn from theoretical perspectives on what each method learns about the data distribution, provide formal definitions, and explicitly show non-overlap with concrete examples from the literature. Furthermore, we will integrate the experimental results with the taxonomy by discussing how the higher robustness of I-JEPA (as a PRL method) compared to MAE may validate the predictive approach's benefits. This will demonstrate the taxonomy's novelty and utility in structuring the field and guiding future work.

revision: yes
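For the alignment category mentioned in this response, a BYOL-style objective gives a compact example of learning invariance from a positive pair. The sketch below computes the squared error between L2-normalised vectors, written in its equivalent form 2 - 2 * cosine similarity. The arrays stand in for the online network's prediction and the target network's projection; the encoders themselves are omitted, and the toy inputs are assumptions.

```python
import numpy as np

def alignment_loss(online_pred, target_proj, eps=1e-12):
    """BYOL-style alignment: 2 - 2 * cosine similarity, averaged over the batch."""
    p = online_pred / (np.linalg.norm(online_pred, axis=-1, keepdims=True) + eps)
    z = target_proj / (np.linalg.norm(target_proj, axis=-1, keepdims=True) + eps)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=-1)))

# Toy embeddings of two augmented views of the same four inputs (hypothetical values).
rng = np.random.default_rng(0)
view_a = rng.normal(size=(4, 8))
view_b = view_a + 0.05 * rng.normal(size=(4, 8))

loss_same = alignment_loss(view_a, view_a)       # identical views
loss_close = alignment_loss(view_a, view_b)      # slightly perturbed positive pair
loss_opposite = alignment_loss(view_a, -view_a)  # maximally misaligned
print(loss_same, loss_close, loss_opposite)
```

The loss ranges from 0 (perfectly aligned pair) to 4 (opposite directions). Unlike the reconstruction and latent-prediction objectives, nothing here is predicted about an unobserved part of the input; the two views are simply pulled together.
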

Circularity Check

0 steps flagged

No significant circularity in taxonomy proposal or experiments


The paper proposes Predictive Representation Learning (PRL) as a new category revolving around latent prediction of unobserved components and offers a taxonomy classifying it alongside alignment and reconstruction approaches. This is presented as a definitional and organizational framework rather than any derived result from equations, data fits, or prior results. The experiments implementing and comparing BYOL, MAE, and I-JEPA are reported separately with their own metrics and outcomes, without the taxonomy being used to generate or force those outcomes or vice versa. No self-citations, uniqueness theorems, ansatzes, or renamings appear as load-bearing elements in the provided text, and no step reduces by construction to its own inputs. The central claims remain self-contained as conceptual contributions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the contribution is a conceptual taxonomy and limited empirical comparison whose supporting assumptions are not detailed.



Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1] A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” Proc. Advances in Neural Information Processing Systems (NeurIPS), 2018
[2] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A Simple Framework for Contrastive Learning of Visual Representations. In: ICML (2020)
[3] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum Contrast for Unsupervised Visual Representation Learning,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2020
[4] Grill, J.-B., et al.: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. In: NeurIPS (2020)
[5] X. Chen and K. He, “Exploring Simple Siamese Representation Learning,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2021
[6] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Autoencoders Are Scalable Vision Learners,” Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2022
[7] H. Bao, L. Dong, and F. Wei, “BEiT: BERT Pre-Training of Image Transformers,” Proc. Int. Conf. Learning Representations (ICLR), 2022
[8] LeCun, Y.: A Path Towards Autonomous Machine Intelligence. Meta AI White Paper (2022)
[9] Assran, M., et al.: Self-Supervised Learning from Images with a Joint Embedding Predictive Architecture. In: CVPR (2023)
[10] A. Bardes, J. Ponce, and Y. LeCun, “Video Joint-Embedding Predictive Architecture,” arXiv preprint arXiv:2401.xxxxx, 2024
[11] Chen, D., Shukor, M., Moutakanni, T., Chung, W., Yu, J., Kasarla, T., Bolourchi, A., LeCun, Y., Fung, P.: VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language. arXiv preprint arXiv:2512.10942 (2025)
[12] Skenderi, G., Li, H., Tang, J., Cristani, M.: Graph-JEPA: Graph-Level Representation Learning with Joint-Embedding Predictive Architectures. ICLR Workshop or OpenReview preprint (2024)
[13] A. Recasens, J. Carreira, L. Beyer, F. Strub, L. Kirsch, N. Savinov, M. Tschannen, A. van den Oord, and O. J. Hénaff, “V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, 2025, pp. 1–10
[14] Chen, D., Hu, J., Wei, X., Wu, E.: Denoising with a Joint-Embedding Predictive Architecture (D-JEPA). arXiv preprint arXiv:2410.03755 (2024)
[15] He, X., Sakai, S., Yuan, K., Padoy, N., Hasegawa, T., Sigal, L.: DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture. arXiv preprint arXiv:2511.17354 (2025)
[16] Ghaemi, H., Muller, E., Bakhtiari, S.: seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models. arXiv preprint arXiv:2505.03176 (2025)
[17] Z. Fei, M. Fan, and J. Huang, “A-JEPA: Joint-Embedding Predictive Architecture Can Listen,” arXiv preprint, 2023
[18] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging Properties in Self-Supervised Vision Transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)
[19] Gui, J., Chen, T., Zhang, J., Zhang, Q., Liu, Y., Wang, S., Wang, X., Huang, F.: A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2022)
[20] Jing, L., Tian, Y.: Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2020)
[21] Baevski, A., Hsu, W.-N., Xu, Q., Babu, A., Gu, J., Auli, M.: data2vec: A General Framework for Self-Supervised Learning in Speech, Vision, and Language. In: Proceedings of the International Conference on Machine Learning (ICML) (2022)
