Revisiting Feature Prediction for Learning Visual Representations from Video
Pith reviewed 2026-05-12 12:32 UTC · model grok-4.3
The pith
Predicting features across video frames produces versatile visual representations that work well when frozen on both video and image tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
V-JEPA models learn solely by predicting the encoded features of masked or future video patches from visible context, using a transformer encoder and a separate predictor. Trained on two million public videos, the largest ViT-H/16 variant achieves 81.9 percent on Kinetics-400, 72.2 percent on Something-Something-v2, and 77.9 percent on ImageNet-1K with the backbone's parameters kept frozen during evaluation.
What carries the argument
The feature prediction objective, in which visible video patches are used to forecast the high-level features of masked patches through a dedicated predictor network.
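A minimal sketch of this objective on toy data, assuming a mean-absolute-error distance (the passage above does not name the loss, so that choice is illustrative only). The function `masked_feature_loss` and the toy feature vectors are hypothetical; in a real model, `predicted` would come from the predictor network and `targets` from an encoder applied to the masked patches.

```python
from typing import List, Sequence

Vector = List[float]


def masked_feature_loss(
    predicted: Sequence[Vector],
    targets: Sequence[Vector],
    masked: Sequence[int],
) -> float:
    """Mean absolute error between predicted and target features,
    averaged over the masked patch positions only."""
    total = 0.0
    count = 0
    for idx in masked:
        for p, t in zip(predicted[idx], targets[idx]):
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0


# Toy example: 4 patches with 2-dimensional features; patches 1 and 3 are masked.
targets = [[0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 0.0]]
# Values at visible positions (0 and 2) are arbitrary: they never enter the loss.
predicted = [[9.0, 9.0], [1.0, 0.0], [9.0, 9.0], [1.0, 0.0]]

loss = masked_feature_loss(predicted, targets, masked=[1, 3])
print(loss)
```

The point of the sketch is that the error is measured in feature space and only at masked positions, so nothing is reconstructed at the pixel level.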
If this is right
- Representations learned this way transfer effectively to both motion-heavy video tasks and appearance-based image tasks without adaptation.
- Training requires only public video collections and no additional supervision sources.
- Larger transformer models benefit from the objective and produce higher downstream accuracy.
- The method removes dependence on pretrained image encoders or text data during pretraining.
Where Pith is reading between the lines
- If the claim holds, video data alone could become the dominant pretraining source for general vision backbones.
- Similar feature-prediction objectives might extend naturally to other time-series domains such as audio or sensor data.
- Testing the same models on dense prediction tasks like segmentation would clarify how much spatial detail the representations retain.
Load-bearing premise
That performance of the frozen encoder on standard benchmarks accurately reflects the general usefulness of the learned representations.
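What "frozen encoder" evaluation means in practice can be sketched as follows, under loud assumptions: `frozen_encoder` is a hypothetical stand-in for a pretrained backbone, the probe is a plain logistic-regression classifier, and the four data points are made up for illustration. The actual probe design used in the paper is not described on this page.

```python
import math
from typing import List, Sequence, Tuple


def frozen_encoder(x: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a pretrained backbone. It has no trainable
    state, so nothing in the probe training below can change it."""
    return [x[0] + x[1], x[0] - x[1]]


def train_linear_probe(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit a logistic-regression probe with SGD; only w and b are updated."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            grad = 1.0 / (1.0 + math.exp(-z)) - y  # d(loss)/dz
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(w: Sequence[float], b: float, f: Sequence[float]) -> int:
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Made-up toy inputs and labels, for illustration only.
inputs = [[0.0, 0.0], [0.2, 0.1], [1.0, 0.9], [0.8, 1.0]]
labels = [0, 0, 1, 1]

features = [frozen_encoder(x) for x in inputs]  # computed once, never updated
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
print(accuracy)
```

Because only the probe is trained, any accuracy it reaches is attributable to what the fixed features already encode, which is exactly the premise under scrutiny here.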
What would settle it
A new downstream task or dataset where a feature-prediction model underperforms a reconstruction-based or contrastive model trained on the same video data would challenge the central claim.
Original abstract
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces V-JEPA, a family of vision transformers trained exclusively with a feature-prediction objective on approximately 2 million unlabeled videos drawn from public datasets. No pretrained image encoders, text, negative samples, reconstruction losses, or other supervision are used. The central empirical claim is that the resulting frozen backbones yield versatile representations that perform competitively on both video (Kinetics-400, Something-Something-v2) and image (ImageNet-1K) downstream tasks, with the largest ViT-H/16 variant reaching 81.9%, 72.2%, and 77.9% respectively.
Significance. If the reported numbers prove reproducible under the stated protocol, the work would demonstrate that a pure feature-prediction objective on video alone can produce general-purpose visual representations competitive with contemporary self-supervised methods. This would strengthen the case for video-centric pretraining pipelines that avoid reconstruction, contrastive negatives, or external encoders, and would provide a useful baseline for future ablation studies on target generation and masking strategies.
major comments (2)
- [§4] §4 (Experimental Setup) and associated tables: the headline accuracies (e.g., 81.9% Kinetics-400) are presented without accompanying details on optimizer schedule, learning-rate values, exact video sampling strategy, train/val splits of the 2 M video corpus, or number of random seeds used for statistical significance. These omissions are load-bearing because the central claim rests entirely on the downstream frozen-backbone numbers.
- [§3.2] §3.2 (Target Generation): the description of how feature targets are obtained for the prediction objective is insufficiently precise. It is unclear whether any auxiliary network or preprocessing step is used to produce the targets, which directly affects the claim that the method is “stand-alone” and free of pretrained encoders.
minor comments (2)
- [Figure 2] Figure 2 and Table 1: axis labels and caption wording should explicitly state whether the plotted curves correspond to frozen or fine-tuned evaluation, and whether the ImageNet numbers are linear-probe or full fine-tuning.
- Notation: the symbol for the feature predictor head is introduced inconsistently across equations (3)–(5); a single, clearly defined symbol would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We will revise the manuscript to address the concerns about experimental reproducibility and the precision of the target generation description.
Point-by-point responses
- Referee: [§4] §4 (Experimental Setup) and associated tables: the headline accuracies (e.g., 81.9% Kinetics-400) are presented without accompanying details on optimizer schedule, learning-rate values, exact video sampling strategy, train/val splits of the 2 M video corpus, or number of random seeds used for statistical significance. These omissions are load-bearing because the central claim rests entirely on the downstream frozen-backbone numbers.
  Authors: We agree that these implementation details are essential for reproducibility and should have been included. In the revised manuscript we will expand §4 (and add an appendix) with the optimizer (AdamW), learning-rate schedule and values, exact video clip sampling procedure, the composition and train/val splits of the 2 M public video corpus, and the number of random seeds used for the reported results.
  Revision: yes
- Referee: [§3.2] §3.2 (Target Generation): the description of how feature targets are obtained for the prediction objective is insufficiently precise. It is unclear whether any auxiliary network or preprocessing step is used to produce the targets, which directly affects the claim that the method is “stand-alone” and free of pretrained encoders.
  Authors: We will revise §3.2 to make the target-generation procedure explicit. The feature targets are produced by applying the same V-JEPA ViT encoder (updated via exponential moving average) directly to the target patches of the input video; no auxiliary pretrained network, external encoder, or additional supervision is used at any stage. This preserves the stand-alone character of the method. We will include a clearer algorithmic description and diagram.
  Revision: yes
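The exponential-moving-average update mentioned in the second response can be sketched as below. This is a generic EMA rule, not the authors' code: the function name `ema_update`, the flat parameter lists, and the momentum value 0.9 are all assumptions chosen for illustration.

```python
from typing import List


def ema_update(target: List[float], online: List[float], momentum: float) -> List[float]:
    """Move each target-encoder parameter a small step toward the
    corresponding online-encoder parameter."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


# Hypothetical parameter vectors; a real encoder would have millions of entries.
target_params = [0.0, 0.0]
online_params = [1.0, -2.0]

# One update with an illustrative momentum of 0.9.
target_params = ema_update(target_params, online_params, momentum=0.9)
print(target_params)
```

Under this rule the target encoder is a slowly moving copy of the encoder being trained, which is consistent with the authors' statement that no external or pretrained network is involved.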
Circularity Check
No significant circularity; empirical results are self-contained
The paper describes an empirical training procedure for vision transformers using a feature-prediction objective on unlabeled video data, followed by frozen-backbone evaluation on separate downstream image and video classification benchmarks. No mathematical derivation chain, uniqueness theorem, or first-principles prediction is presented that reduces by construction to fitted parameters or self-citations. The reported metrics (e.g., 81.9% on Kinetics-400) are externally verifiable against standard datasets and protocols, with no evidence that target features, loss terms, or evaluation quantities are defined in terms of the final performance numbers. The approach is explicitly positioned as stand-alone, without reliance on pretrained encoders or negatives, confirming the results rest on independent experimental outcomes rather than tautological re-labeling of inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- ViT-H/16 architecture scale and training schedule
axioms (1)
- Domain assumption: ViT transformer blocks can be trained end-to-end with a feature-prediction loss on video patches
invented entities (1)
- V-JEPA model family (no independent evidence)
Lean theorems connected to this paper
- Cost.FunctionalEquation · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K."
- Foundation.DAlembert.Inevitability · bilinear_family_forced
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision."
- Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 31 Pith papers
- Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
  Clin-JEPA supplies a multi-phase co-training method for JEPA pretraining on EHR trajectories that achieves converging latent rollouts and improved multi-task AUROC on MIMIC-IV data.
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
  Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
- Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
  Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...
- ProteinJEPA: Latent prediction complements protein language models
  Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.
- PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
  PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
- Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference
  Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.
- Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
  Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
- Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
  A five-phase co-training framework enables stable JEPA pretraining on EHR trajectories, producing converging latent rollouts and higher multi-task AUROC than baselines on MIMIC-IV ICU data.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- Beyond Thinking: Imagining in 360° for Humanoid Visual Search
  Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
  Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
- One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
  Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
- LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations
  LaWM induces latent transitions from a learned discrete variational principle rather than an unconstrained neural predictor, yielding improved physical consistency on synthetic dynamics and robot benchmarks.
- Predictive but Not Plannable: RC-aux for Latent World Models
  RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
- Understanding Self-Supervised Learning via Latent Distribution Matching
  Self-supervised learning is cast as latent distribution matching that aligns representations to a model while enforcing uniformity, unifying multiple SSL families and proving identifiability for predictive variants ev...
- Text-Conditional JEPA for Learning Semantically Rich Visual Representations
  TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.
- Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
  Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.
- LA-Pose: Latent Action Pretraining Meets Pose Estimation
  LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...
- Exploring High-Order Self-Similarity for Video Understanding
  The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
- Human Cognition in Machines: A Unified Perspective of World Models
  The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
- Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization
  A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0
  BADAS-2.0 scales collision anticipation with a 178k-video long-tail benchmark built via active oracle selection, 7-12x faster distilled edge models, and object-centric attention heatmaps plus VLM-based textual reasoning.
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
  V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
  UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.
- Towards Effective Theory of LLMs: A Representation Learning Approach
  RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
- ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
  ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
- Sapiens2
  Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- The Global Neural World Model: Spatially Grounded Discrete Topologies for Action-Conditioned Planning
  GNWM maps environments to a discrete 2D grid with snapping to stabilize autoregressive planning and learns generalized dynamics from maximum-entropy random walks.