arxiv: 2404.08471 · v1 · submitted 2024-02-15 · 💻 cs.CV · cs.AI· cs.LG

Recognition: 3 theorem links

· Lean Theorem

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes , Quentin Garrido , Jean Ponce , Xinlei Chen , Michael Rabbat , Yann LeCun , Mahmoud Assran , Nicolas Ballas

Authors on Pith no claims yet

Pith reviewed 2026-05-12 12:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords feature predictionvideo self-supervised learningvisual representationsvision transformerunsupervised pretrainingV-JEPA

0 comments

The pith

Predicting features across video frames produces versatile visual representations that work well when frozen on both video and image tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a feature prediction objective applied only to video data can train large vision transformers into useful general representations. These V-JEPA models require no image pretraining, text, negative pairs, or pixel reconstruction. When the backbone stays frozen, the models reach strong accuracy on action recognition and static image classification. A sympathetic reader cares because this points to video as a sufficient source for learning appearance and motion features without extra supervision signals.

Core claim

V-JEPA models learn solely by predicting the encoded features of masked or future video patches from visible context using a transformer encoder and a separate predictor. Trained on two million public videos, the largest ViT-H/16 variant achieves 81.9 percent on Kinetics-400, 72.2 percent on Something-Something-v2, and 77.9 percent on ImageNet-1K with no parameter updates at evaluation time.

What carries the argument

The feature prediction objective, in which visible video patches are used to forecast the high-level features of masked patches through a dedicated predictor network.

If this is right

Representations learned this way transfer effectively to both motion-heavy video tasks and appearance-based image tasks without adaptation.
Training requires only public video collections and no additional supervision sources.
Larger transformer models benefit from the objective and produce higher downstream accuracy.
The method removes dependence on pretrained image encoders or text data during pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the claim holds, video data alone could become the dominant pretraining source for general vision backbones.
Similar feature-prediction objectives might extend naturally to other time-series domains such as audio or sensor data.
Testing the same models on dense prediction tasks like segmentation would clarify how much spatial detail the representations retain.

Load-bearing premise

That performance of the frozen encoder on standard benchmarks accurately reflects the general usefulness of the learned representations.

What would settle it

A new downstream task or dataset where a feature-prediction model underperforms a reconstruction-based or contrastive model trained on the same video data would challenge the central claim.

read the original abstract

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

V-JEPA shows a clean feature-prediction objective on raw video can produce frozen ViT backbones that hit strong numbers on both video action and image classification without extra supervision.

read the letter

The main thing to know is that this paper trains ViT models on 2 million videos using only a feature-prediction loss and reports competitive frozen-backbone results: 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K for the largest ViT-H/16. No pretrained encoders, no negatives, no reconstruction, and no fine-tuning at evaluation time. That setup is the core contribution and it lands in a useful spot for unsupervised video work. The numbers demonstrate that the learned features transfer across motion-heavy and appearance-heavy tasks without adaptation, which is the practical takeaway. The implementation appears self-contained and the evaluation covers both video and image benchmarks, which adds to the claim of versatility. Prior feature-prediction ideas exist, but the scale here plus the stand-alone recipe without mixing in other objectives marks a clear incremental step. The central empirical claim holds up internally with the stated protocol. Soft spots are mostly around missing details rather than contradictions. The abstract gives headline metrics but skips training hyperparameters, exact data splits, statistical significance, and the precise mechanism for generating target features. Those choices can shift downstream numbers, so the full paper needs to show the ablations and baselines clearly to pin down what drives the gains. Reproducibility would benefit from more protocol transparency. This paper is for groups working on scalable self-supervised video pretraining and frozen evaluation protocols. Readers focused on JEPA-style predictive objectives or comparing video-only training to image pretraining will find the results directly usable. It deserves a serious referee because the empirical evidence is sharp enough to warrant external scrutiny on the training recipe and controls, even if revisions are needed for full details.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces V-JEPA, a family of vision transformers trained exclusively with a feature-prediction objective on approximately 2 million unlabeled videos drawn from public datasets. No pretrained image encoders, text, negative samples, reconstruction losses, or other supervision are used. The central empirical claim is that the resulting frozen backbones yield versatile representations that perform competitively on both video (Kinetics-400, Something-Something-v2) and image (ImageNet-1K) downstream tasks, with the largest ViT-H/16 variant reaching 81.9%, 72.2%, and 77.9% respectively.

Significance. If the reported numbers prove reproducible under the stated protocol, the work would demonstrate that a pure feature-prediction objective on video alone can produce general-purpose visual representations competitive with contemporary self-supervised methods. This would strengthen the case for video-centric pretraining pipelines that avoid reconstruction, contrastive negatives, or external encoders, and would provide a useful baseline for future ablation studies on target generation and masking strategies.

major comments (2)

[§4] §4 (Experimental Setup) and associated tables: the headline accuracies (e.g., 81.9% Kinetics-400) are presented without accompanying details on optimizer schedule, learning-rate values, exact video sampling strategy, train/val splits of the 2 M video corpus, or number of random seeds used for statistical significance. These omissions are load-bearing because the central claim rests entirely on the downstream frozen-backbone numbers.
[§3.2] §3.2 (Target Generation): the description of how feature targets are obtained for the prediction objective is insufficiently precise. It is unclear whether any auxiliary network or preprocessing step is used to produce the targets, which directly affects the claim that the method is “stand-alone” and free of pretrained encoders.

minor comments (2)

[Figure 2] Figure 2 and Table 1: axis labels and caption wording should explicitly state whether the plotted curves correspond to frozen or fine-tuned evaluation, and whether the ImageNet numbers are linear-probe or full fine-tuning.
Notation: the symbol for the feature predictor head is introduced inconsistently across equations (3)–(5); a single, clearly defined symbol would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to address the concerns about experimental reproducibility and the precision of the target generation description.

read point-by-point responses

Referee: [§4] §4 (Experimental Setup) and associated tables: the headline accuracies (e.g., 81.9% Kinetics-400) are presented without accompanying details on optimizer schedule, learning-rate values, exact video sampling strategy, train/val splits of the 2 M video corpus, or number of random seeds used for statistical significance. These omissions are load-bearing because the central claim rests entirely on the downstream frozen-backbone numbers.

Authors: We agree that these implementation details are essential for reproducibility and should have been included. In the revised manuscript we will expand §4 (and add an appendix) with the optimizer (AdamW), learning-rate schedule and values, exact video clip sampling procedure, the composition and train/val splits of the 2 M public video corpus, and the number of random seeds used for the reported results. revision: yes
Referee: [§3.2] §3.2 (Target Generation): the description of how feature targets are obtained for the prediction objective is insufficiently precise. It is unclear whether any auxiliary network or preprocessing step is used to produce the targets, which directly affects the claim that the method is “stand-alone” and free of pretrained encoders.

Authors: We will revise §3.2 to make the target-generation procedure explicit. The feature targets are produced by applying the same V-JEPA ViT encoder (updated via exponential moving average) directly to the target patches of the input video; no auxiliary pretrained network, external encoder, or additional supervision is used at any stage. This preserves the stand-alone character of the method. We will include a clearer algorithmic description and diagram. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results are self-contained

full rationale

The paper describes an empirical training procedure for vision transformers using a feature-prediction objective on unlabeled video data, followed by frozen-backbone evaluation on separate downstream image and video classification benchmarks. No mathematical derivation chain, uniqueness theorem, or first-principles prediction is presented that reduces by construction to fitted parameters or self-citations. The reported metrics (e.g., 81.9% on Kinetics-400) are externally verifiable against standard datasets and protocols, with no evidence that target features, loss terms, or evaluation quantities are defined in terms of the final performance numbers. The approach is explicitly positioned as stand-alone, without reliance on pretrained encoders or negatives, confirming the results rest on independent experimental outcomes rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that a predictive feature objective on video is sufficient for versatile representations; many standard deep-learning hyperparameters and the ViT architecture are inherited without re-derivation.

free parameters (1)

ViT-H/16 architecture scale and training schedule
Model size and optimization details are chosen to achieve the reported numbers.

axioms (1)

domain assumption ViT transformer blocks can be trained end-to-end with a feature-prediction loss on video patches
Invoked implicitly when stating that the models are trained solely on the objective.

invented entities (1)

V-JEPA model family no independent evidence
purpose: Collection of vision transformers trained with the feature-prediction objective
New named artifact introduced to describe the trained models; no independent falsifiable prediction beyond the reported accuracies.

pith-pipeline@v0.9.0 · 5463 in / 1251 out tokens · 53011 ms · 2026-05-12T12:32:30.198185+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Foundation.DAlembert.Inevitability bilinear_family_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision.
Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
cs.LG 2026-05 unverdicted novelty 7.0

Clin-JEPA supplies a multi-phase co-training method for JEPA pretraining on EHR trajectories that achieves converging latent rollouts and improved multi-task AUROC on MIMIC-IV data.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 conditional novelty 7.0

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
cs.CV 2026-05 unverdicted novelty 7.0

Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...
ProteinJEPA: Latent prediction complements protein language models
cs.LG 2026-05 unverdicted novelty 7.0

Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference
cs.RO 2026-05 unverdicted novelty 7.0

Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 unverdicted novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
cs.LG 2026-05 unverdicted novelty 6.0

A five-phase co-training framework enables stable JEPA pretraining on EHR trajectories, producing converging latent rollouts and higher multi-task AUROC than baselines on MIMIC-IV ICU data.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 6.0

CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations
cs.LG 2026-05 unverdicted novelty 6.0

LaWM induces latent transitions from a learned discrete variational principle rather than an unconstrained neural predictor, yielding improved physical consistency on synthetic dynamics and robot benchmarks.
Predictive but Not Plannable: RC-aux for Latent World Models
cs.LG 2026-05 unverdicted novelty 6.0

RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
Understanding Self-Supervised Learning via Latent Distribution Matching
cs.LG 2026-05 unverdicted novelty 6.0

Self-supervised learning is cast as latent distribution matching that aligns representations to a model while enforcing uniformity, unifying multiple SSL families and proving identifiability for predictive variants ev...
Text-Conditional JEPA for Learning Semantically Rich Visual Representations
cs.LG 2026-05 unverdicted novelty 6.0

TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
cs.AI 2026-05 unverdicted novelty 6.0

Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.
LA-Pose: Latent Action Pretraining Meets Pose Estimation
cs.CV 2026-04 unverdicted novelty 6.0

LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...
Exploring High-Order Self-Similarity for Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization
cs.CV 2026-04 unverdicted novelty 6.0

A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.
Zero-shot World Models Are Developmentally Efficient Learners
cs.AI 2026-04 unverdicted novelty 6.0

A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0
cs.CV 2026-04 unverdicted novelty 6.0

BADAS-2.0 scales collision anticipation with a 178k-video long-tail benchmark built via active oracle selection, 7-12x faster distilled edge models, and object-centric attention heatmaps plus VLM-based textual reasoning.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI 2025-06 unverdicted novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
cs.RO 2025-05 unverdicted novelty 6.0

UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.
Towards Effective Theory of LLMs: A Representation Learning Approach
cs.LG 2026-05 unverdicted novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
cs.CV 2026-05 unverdicted novelty 5.0

ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
Sapiens2
cs.CV 2026-04 unverdicted novelty 5.0

Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
The Global Neural World Model: Spatially Grounded Discrete Topologies for Action-Conditioned Planning
cs.LG 2026-04 unverdicted novelty 4.0

GNWM maps environments to a discrete 2D grid with snapping to stabilize autoregressive planning and learns generalized dynamics from maximum-entropy random walks.

Reference graph

Works this paper leans on

268 extracted references · 268 canonical work pages · cited by 27 Pith papers · 19 internal anchors

[1]

Le, Quoc , title=

Dogus Cubuk, Ekin and Zoph, Barret and Mane, Dandelion andVasudevan, Vijay and V. Le, Quoc , title=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , year=

work page
[2]

Proceedings of the IEEE international conference on computer vision , year=

ViViT: A Video Vision Transformer , author=. Proceedings of the IEEE international conference on computer vision , year=

work page
[3]

arXiv preprint arXiv:2307.12698 , year=

MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features , author=. arXiv preprint arXiv:2307.12698 , year=

work page arXiv
[4]

arXiv preprint arXiv:2203.16527 , year=

Exploring plain vision transformer backbones for object detection , author=. arXiv preprint arXiv:2203.16527 , year=

work page arXiv
[5]

arXiv preprint arXiv:2210.01571 , year=

VICRegL: Self-Supervised Learning of Local Visual Features , author=. arXiv preprint arXiv:2210.01571 , year=

work page arXiv
[6]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016
[7]

A Theoretical Analysis of Contrastive Unsupervised Representation Learning

A theoretical analysis of contrastive unsupervised representation learning , author=. arXiv preprint arXiv:1902.09229 , year=

work page Pith review arXiv 1902
[8]

Advances in neural information processing systems , volume=

Unsupervised classifiers, mutual information and'phantom targets , author=. Advances in neural information processing systems , volume=

work page
[9]

NeurIPS , pages=

Spectral relaxation for k-means clustering , author=. NeurIPS , pages=

work page
[10]

Journal of statistical software , volume=

Spherical k-means clustering , author=. Journal of statistical software , volume=

work page
[11]

Expert systems with applications , volume=

A simple and fast algorithm for K-medoids clustering , author=. Expert systems with applications , volume=. 2009 , publisher=

work page 2009
[12]

, author=

Visualizing data using t-SNE. , author=. Journal of machine learning research , volume=

work page
[13]

2010 IEEE International Conference on Data Mining , pages=

Learning a bi-stochastic data similarity matrix , author=. 2010 IEEE International Conference on Data Mining , pages=. 2010 , organization=

work page 2010
[14]

Proceedings of the 23rd international conference on Machine learning , pages=

The uniqueness of a good optimum for k-means , author=. Proceedings of the 23rd international conference on Machine learning , pages=

work page
[15]

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=

Adapting the right measures for k-means clustering , author=. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=

work page
[16]

IEEE Transactions on Fuzzy Systems , volume=

The K -means-type algorithms versus imbalanced data distributions , author=. IEEE Transactions on Fuzzy Systems , volume=. 2012 , publisher=

work page 2012
[17]

Size matters: Cardinality-constrained clustering and outlier detection via conic optimization , author=. SIAM J. Optimization , volume=. 2019 , publisher=

work page 2019
[18]

Microsoft Research, Redmond , volume=

Constrained k-means clustering , author=. Microsoft Research, Redmond , volume=

work page
[19]

ICML , pages=

Fair k-center clustering for data summarization , author=. ICML , pages=. 2019 , organization=

work page 2019
[20]

arXiv preprint arXiv:1904.03035 , year=

Identifying and reducing gender bias in word-level language models , author=. arXiv preprint arXiv:1904.03035 , year=

work page arXiv 1904
[21]

Conference on Fairness, Accountability and Transparency , pages=

Gender shades: Intersectional accuracy disparities in commercial gender classification , author=. Conference on Fairness, Accountability and Transparency , pages=. 2018 , organization=

work page 2018
[22]

Frontiers of Information Technology & Electronic Engineering , pages=

On the principles of Parsimony and Self-consistency for the emergence of intelligence , author=. Frontiers of Information Technology & Electronic Engineering , pages=. 2022 , publisher=

work page 2022
[23]

2019 , publisher=

Cybernetics or Control and Communication in the Animal and the Machine , author=. 2019 , publisher=

work page 2019
[25]

Advances in neural information processing systems , volume=

Discriminative clustering by regularized information maximization , author=. Advances in neural information processing systems , volume=

work page
[26]

Advances in neural information processing systems , volume=

Pytorch: An imperative style, high-performance deep learning library , author=. Advances in neural information processing systems , volume=

work page
[27]

International conference on machine learning , pages=

Data-efficient image recognition with contrastive predictive coding , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020
[28]

International conference on machine learning , pages=

Learning discrete representations via information maximizing self-augmented training , author=. International conference on machine learning , pages=. 2017 , organization=

work page 2017
[29]

Computer , volume=

Self-organization in a perceptual network , author=. Computer , volume=. 1988 , publisher=

work page 1988
[30]

K., GELLY, S., LUCIC, M

On mutual information maximization for representation learning , author=. arXiv preprint arXiv:1907.13625 , year=

work page arXiv 1907
[31]

Proceedings of the annual meeting of the cognitive science society , volume=

One shot learning of simple visual concepts , author=. Proceedings of the annual meeting of the cognitive science society , volume=

work page
[32]

Artificial Intelligence and Statistics , pages=

Learning a nonlinear embedding by preserving class neighbourhood structure , author=. Artificial Intelligence and Statistics , pages=. 2007 , organization=

work page 2007
[33]

1980 , publisher=

Jean Piaget , author=. 1980 , publisher=

work page 1980
[34]

Journal of research in science teaching , volume=

Cognitive development in children: Piaget , author=. Journal of research in science teaching , volume=

work page
[35]

Synthese , pages=

Artificial intelligence and Piagetian theory , author=. Synthese , pages=. 1978 , publisher=

work page 1978
[36]

Annals of the New York Academy of Sciences , volume=

Reply to Individual and collective problems in the study of thinking , author=. Annals of the New York Academy of Sciences , volume=

work page
[37]

, author=

Biology and knowledge: An essay on the relations between organic regulations and cognitive processes. , author=. 1971 , publisher=

work page 1971
[38]

Semi-supervised learning , pages=

Entropy regularization , author=. Semi-supervised learning , pages=. 2006 , publisher=

work page 2006
[40]

arXiv preprint arXiv:2006.10029 , year=

Big self-supervised models are strong semi-supervised learners , author=. arXiv preprint arXiv:2006.10029 , year=

work page arXiv 2006
[44]

arXiv preprint arXiv:1606.04080 , year=

Matching networks for one shot learning , author=. arXiv preprint arXiv:1606.04080 , year=

work page arXiv
[45]

Available: https://arxiv.org/abs/1703.05175

Prototypical networks for few-shot learning , author=. arXiv preprint arXiv:1703.05175 , year=

work page arXiv
[46]

Optimization as a model for few-shot learning , author=

work page
[47]

Behavioral and brain sciences , volume=

Building machines that learn and think like people , author=. Behavioral and brain sciences , volume=. 2017 , publisher=

work page 2017
[48]

International Journal of Computer Vision , volume=

Imagenet large scale visual recognition challenge , author=. International Journal of Computer Vision , volume=

work page
[49]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

Deep residual learning for image recognition , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

work page
[50]

Large Batch Training of Convolutional Networks

Large batch training of convolutional networks , author=. arXiv preprint arXiv:1708.03888 , year=

work page Pith review arXiv
[51]

International conference on machine learning , pages=

On the importance of initialization and momentum in deep learning , author=. International conference on machine learning , pages=

work page
[52]

preprint arXiv:1904.12848 , year=

Unsupervised data augmentation , author=. arXiv preprint arXiv:1904.12848 , year=

work page arXiv 1904
[53]

D., Kurakin, A., Zhang, H., and Raffel, C

Fixmatch: Simplifying semi-supervised learning with consistency and confidence , author=. arXiv preprint arXiv:2001.07685 , year=

work page arXiv 2001
[54]

preprint arXiv:2003.10580 , year=

Meta pseudo labels , author=. arXiv preprint arXiv:2003.10580 , year=

work page arXiv 2003
[55]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Unsupervised feature learning via non-parametric instance discrimination , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

work page
[56]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

Self-supervised learning of pretext-invariant representations , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

work page
[57]

arXiv preprint arXiv:1803.00676 , year=

Meta-learning for semi-supervised few-shot classification , author=. arXiv preprint arXiv:1803.00676 , year=

work page arXiv
[58]

Momentum contrast for unsupervised visual representation learning

Kaiming He and Haoqi Fan and Yuxin Wu and Saining Xie and Ross Girshick , title =. arXiv preprint arXiv:1911.05722 , year =

work page arXiv 1911
[59]

Improved Baselines with Momentum Contrastive Learning

Xinlei Chen and Haoqi Fan and Ross Girshick and Kaiming He , title =. arXiv preprint arXiv:2003.04297 , year =

work page internal anchor Pith review arXiv 2003
[60]

arXiv preprint arXiv:1810.02334 , year=

Unsupervised learning via meta-learning , author=. arXiv preprint arXiv:1810.02334 , year=

work page arXiv
[61]

Exploring simple siamese representation learning

Exploring Simple Siamese Representation Learning , author=. arXiv preprint arXiv:2011.10566 , year=

work page arXiv 2011
[62]

Loshchilov, Ilya and Hutter, Frank , journal=

work page
[63]

arXiv preprint arXiv:2004.11362 , year=

Supervised Contrastive Learning , author=. arXiv preprint arXiv:2004.11362 , year=

work page arXiv 2004
[64]

IEEE transactions on pattern analysis and machine intelligence , volume=

Virtual adversarial training: a regularization method for supervised and semi-supervised learning , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2018 , publisher=

work page 2018
[65]

arXiv preprint arXiv:1903.03825 , year=

Interpolation Consistency Training for Semi-Supervised Learning , author=. arXiv preprint arXiv:1903.03825 , year=

work page arXiv 1903
[66]

Proceedings of the IEEE international conference on computer vision , pages=

S4l: Self-supervised semi-supervised learning , author=. Proceedings of the IEEE international conference on computer vision , pages=

work page
[67]

In International Conference on Machine Learning Workshop , year=

Lee, Dong-Hyun , title=. In International Conference on Machine Learning Workshop , year=

work page
[68]

, title=

Scudder, H. , title=. IEEE Transactions on Information Theory , volume=

work page
[69]

In Proceedings of the National Conference on Artificial Intelligence , year=

Riloff, Ellen , title=. In Proceedings of the National Conference on Artificial Intelligence , year=

work page
[70]

Advances in Neural Information Processing Systems , pages=

Mixmatch: A holistic approach to semi-supervised learning , author=. Advances in Neural Information Processing Systems , pages=

work page
[71]

arXiv preprint arXiv:1911.09785 , year=

ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring , author=. arXiv preprint arXiv:1911.09785 , year=

work page arXiv 1911
[72]

In 33rd Annual Meeting of the Association for Computational Linguistics , year=

Yarowsky, David , title=. In 33rd Annual Meeting of the Association for Computational Linguistics , year=

work page
[73]

arXiv preprint arXiv:1911.05371 , year=

Self-labelling via simultaneous clustering and representation learning , author=. arXiv preprint arXiv:1911.05371 , year=

work page arXiv 1911
[74]

preprint arXiv:2006.06882 , year=

Rethinking pre-training and self-training , author=. arXiv preprint arXiv:2006.06882 , year=

work page arXiv 2006
[75]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Self-training with noisy student improves imagenet classification , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[77]

arXiv preprint arXiv:2112.10740 , year=

Are Large-scale Datasets Necessary for Self-Supervised Pre-training? , author=. arXiv preprint arXiv:2112.10740 , year=

work page arXiv
[78]

Representation learning via invariant causal mechanisms

Representation learning via invariant causal mechanisms , author=. arXiv preprint arXiv:2010.07922 , year=

work page arXiv 2010
[79]

preprint arXiv:2006.10803 , year=

Supervision accelerates pre-training in contrastive semi-supervised learning of visual representations , author=. arXiv preprint arXiv:2006.10803 , year=

work page arXiv 2006
[80]

arXiv preprint arXiv:1206.6413 , year=

A convex relaxation for weakly supervised classifiers , author=. arXiv preprint arXiv:1206.6413 , year=

work page arXiv
[81]

arXiv preprint arXiv:1610.02242 , year=

Temporal ensembling for semi-supervised learning , author=. arXiv preprint arXiv:1610.02242 , year=

work page arXiv
[82]

arXiv preprint arXiv:1902.02336 , year=

Semi-supervised learning by label gradient alignment , author=. arXiv preprint arXiv:1902.02336 , year=

work page arXiv 1902
[83]

arXiv preprint arXiv:1911.09265 , year=

Enaet: Self-trained ensemble autoencoding transformations for semi-supervised learning , author=. arXiv preprint arXiv:1911.09265 , year=

work page arXiv 1911
[84]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

work page 2009
[85]

Wide Residual Networks

Wide residual networks , author=. arXiv preprint arXiv:1605.07146 , year=

work page internal anchor Pith review arXiv
[86]

Communications of the ACM , volume=

YFCC100M: The new data in multimedia research , author=. Communications of the ACM , volume=. 2016 , publisher=

work page 2016

Showing first 80 references.