pith. machine review for the scientific record. sign in

arxiv: 2404.08471 · v1 · submitted 2024-02-15 · 💻 cs.CV · cs.AI· cs.LG

Recognition: 3 theorem links

· Lean Theorem

Revisiting Feature Prediction for Learning Visual Representations from Video

Authors on Pith no claims yet

Pith reviewed 2026-05-12 12:32 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords feature predictionvideo self-supervised learningvisual representationsvision transformerunsupervised pretrainingV-JEPA
0
0 comments X

The pith

Predicting features across video frames produces versatile visual representations that work well when frozen on both video and image tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a feature prediction objective applied only to video data can train large vision transformers into useful general representations. These V-JEPA models require no image pretraining, text, negative pairs, or pixel reconstruction. When the backbone stays frozen, the models reach strong accuracy on action recognition and static image classification. A sympathetic reader cares because this points to video as a sufficient source for learning appearance and motion features without extra supervision signals.

Core claim

V-JEPA models learn solely by predicting the encoded features of masked or future video patches from visible context using a transformer encoder and a separate predictor. Trained on two million public videos, the largest ViT-H/16 variant achieves 81.9 percent on Kinetics-400, 72.2 percent on Something-Something-v2, and 77.9 percent on ImageNet-1K with no parameter updates at evaluation time.

What carries the argument

The feature prediction objective, in which visible video patches are used to forecast the high-level features of masked patches through a dedicated predictor network.

If this is right

  • Representations learned this way transfer effectively to both motion-heavy video tasks and appearance-based image tasks without adaptation.
  • Training requires only public video collections and no additional supervision sources.
  • Larger transformer models benefit from the objective and produce higher downstream accuracy.
  • The method removes dependence on pretrained image encoders or text data during pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the claim holds, video data alone could become the dominant pretraining source for general vision backbones.
  • Similar feature-prediction objectives might extend naturally to other time-series domains such as audio or sensor data.
  • Testing the same models on dense prediction tasks like segmentation would clarify how much spatial detail the representations retain.

Load-bearing premise

That performance of the frozen encoder on standard benchmarks accurately reflects the general usefulness of the learned representations.

What would settle it

A new downstream task or dataset where a feature-prediction model underperforms a reconstruction-based or contrastive model trained on the same video data would challenge the central claim.

read the original abstract

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces V-JEPA, a family of vision transformers trained exclusively with a feature-prediction objective on approximately 2 million unlabeled videos drawn from public datasets. No pretrained image encoders, text, negative samples, reconstruction losses, or other supervision are used. The central empirical claim is that the resulting frozen backbones yield versatile representations that perform competitively on both video (Kinetics-400, Something-Something-v2) and image (ImageNet-1K) downstream tasks, with the largest ViT-H/16 variant reaching 81.9%, 72.2%, and 77.9% respectively.

Significance. If the reported numbers prove reproducible under the stated protocol, the work would demonstrate that a pure feature-prediction objective on video alone can produce general-purpose visual representations competitive with contemporary self-supervised methods. This would strengthen the case for video-centric pretraining pipelines that avoid reconstruction, contrastive negatives, or external encoders, and would provide a useful baseline for future ablation studies on target generation and masking strategies.

major comments (2)
  1. [§4] §4 (Experimental Setup) and associated tables: the headline accuracies (e.g., 81.9% Kinetics-400) are presented without accompanying details on optimizer schedule, learning-rate values, exact video sampling strategy, train/val splits of the 2 M video corpus, or number of random seeds used for statistical significance. These omissions are load-bearing because the central claim rests entirely on the downstream frozen-backbone numbers.
  2. [§3.2] §3.2 (Target Generation): the description of how feature targets are obtained for the prediction objective is insufficiently precise. It is unclear whether any auxiliary network or preprocessing step is used to produce the targets, which directly affects the claim that the method is “stand-alone” and free of pretrained encoders.
minor comments (2)
  1. [Figure 2] Figure 2 and Table 1: axis labels and caption wording should explicitly state whether the plotted curves correspond to frozen or fine-tuned evaluation, and whether the ImageNet numbers are linear-probe or full fine-tuning.
  2. Notation: the symbol for the feature predictor head is introduced inconsistently across equations (3)–(5); a single, clearly defined symbol would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to address the concerns about experimental reproducibility and the precision of the target generation description.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup) and associated tables: the headline accuracies (e.g., 81.9% Kinetics-400) are presented without accompanying details on optimizer schedule, learning-rate values, exact video sampling strategy, train/val splits of the 2 M video corpus, or number of random seeds used for statistical significance. These omissions are load-bearing because the central claim rests entirely on the downstream frozen-backbone numbers.

    Authors: We agree that these implementation details are essential for reproducibility and should have been included. In the revised manuscript we will expand §4 (and add an appendix) with the optimizer (AdamW), learning-rate schedule and values, exact video clip sampling procedure, the composition and train/val splits of the 2 M public video corpus, and the number of random seeds used for the reported results. revision: yes

  2. Referee: [§3.2] §3.2 (Target Generation): the description of how feature targets are obtained for the prediction objective is insufficiently precise. It is unclear whether any auxiliary network or preprocessing step is used to produce the targets, which directly affects the claim that the method is “stand-alone” and free of pretrained encoders.

    Authors: We will revise §3.2 to make the target-generation procedure explicit. The feature targets are produced by applying the same V-JEPA ViT encoder (updated via exponential moving average) directly to the target patches of the input video; no auxiliary pretrained network, external encoder, or additional supervision is used at any stage. This preserves the stand-alone character of the method. We will include a clearer algorithmic description and diagram. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results are self-contained

full rationale

The paper describes an empirical training procedure for vision transformers using a feature-prediction objective on unlabeled video data, followed by frozen-backbone evaluation on separate downstream image and video classification benchmarks. No mathematical derivation chain, uniqueness theorem, or first-principles prediction is presented that reduces by construction to fitted parameters or self-citations. The reported metrics (e.g., 81.9% on Kinetics-400) are externally verifiable against standard datasets and protocols, with no evidence that target features, loss terms, or evaluation quantities are defined in terms of the final performance numbers. The approach is explicitly positioned as stand-alone, without reliance on pretrained encoders or negatives, confirming the results rest on independent experimental outcomes rather than tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that a predictive feature objective on video is sufficient for versatile representations; many standard deep-learning hyperparameters and the ViT architecture are inherited without re-derivation.

free parameters (1)
  • ViT-H/16 architecture scale and training schedule
    Model size and optimization details are chosen to achieve the reported numbers.
axioms (1)
  • domain assumption ViT transformer blocks can be trained end-to-end with a feature-prediction loss on video patches
    Invoked implicitly when stating that the models are trained solely on the objective.
invented entities (1)
  • V-JEPA model family no independent evidence
    purpose: Collection of vision transformers trained with the feature-prediction objective
    New named artifact introduced to describe the trained models; no independent falsifiable prediction beyond the reported accuracies.

pith-pipeline@v0.9.0 · 5463 in / 1251 out tokens · 53011 ms · 2026-05-12T12:32:30.198185+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.

  • Foundation.DAlembert.Inevitability bilinear_family_forced unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

    cs.LG 2026-05 unverdicted novelty 7.0

    Clin-JEPA supplies a multi-phase co-training method for JEPA pretraining on EHR trajectories that achieves converging latent rollouts and improved multi-task AUROC on MIMIC-IV data.

  2. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  3. Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...

  4. ProteinJEPA: Latent prediction complements protein language models

    cs.LG 2026-05 unverdicted novelty 7.0

    Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.

  5. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  6. Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference

    cs.RO 2026-05 unverdicted novelty 7.0

    Latent Bridge predicts VLM feature deltas to reduce VLM calls by 50-75% in dual-system VLA models while retaining 95-100% performance and achieving 1.65-1.73x speedup across LIBERO, RoboCasa, and ALOHA benchmarks.

  7. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  8. Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

    cs.LG 2026-05 unverdicted novelty 6.0

    A five-phase co-training framework enables stable JEPA pretraining on EHR trajectories, producing converging latent rollouts and higher multi-task AUROC than baselines on MIMIC-IV ICU data.

  9. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  10. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  11. Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  12. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  13. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  14. LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations

    cs.LG 2026-05 unverdicted novelty 6.0

    LaWM induces latent transitions from a learned discrete variational principle rather than an unconstrained neural predictor, yielding improved physical consistency on synthetic dynamics and robot benchmarks.

  15. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  16. Understanding Self-Supervised Learning via Latent Distribution Matching

    cs.LG 2026-05 unverdicted novelty 6.0

    Self-supervised learning is cast as latent distribution matching that aligns representations to a model while enforcing uniformity, unifying multiple SSL families and proving identifiability for predictive variants ev...

  17. Text-Conditional JEPA for Learning Semantically Rich Visual Representations

    cs.LG 2026-05 unverdicted novelty 6.0

    TC-JEPA conditions masked feature prediction on text captions via sparse cross-attention to produce more semantically rich visual representations and outperforms contrastive methods on fine-grained tasks.

  18. Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

    cs.AI 2026-05 unverdicted novelty 6.0

    Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.

  19. LA-Pose: Latent Action Pretraining Meets Pose Estimation

    cs.CV 2026-04 unverdicted novelty 6.0

    LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...

  20. Exploring High-Order Self-Similarity for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.

  21. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  22. Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

    cs.CV 2026-04 unverdicted novelty 6.0

    A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.

  23. Zero-shot World Models Are Developmentally Efficient Learners

    cs.AI 2026-04 unverdicted novelty 6.0

    A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

  24. Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0

    cs.CV 2026-04 unverdicted novelty 6.0

    BADAS-2.0 scales collision anticipation with a 178k-video long-tail benchmark built via active oracle selection, 7-12x faster distilled edge models, and object-centric attention heatmaps plus VLM-based textual reasoning.

  25. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  26. UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    cs.RO 2025-05 unverdicted novelty 6.0

    UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.

  27. Towards Effective Theory of LLMs: A Representation Learning Approach

    cs.LG 2026-05 unverdicted novelty 5.0

    RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

  28. ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.

  29. Sapiens2

    cs.CV 2026-04 unverdicted novelty 5.0

    Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and...

  30. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  31. The Global Neural World Model: Spatially Grounded Discrete Topologies for Action-Conditioned Planning

    cs.LG 2026-04 unverdicted novelty 4.0

    GNWM maps environments to a discrete 2D grid with snapping to stabilize autoregressive planning and learns generalized dynamics from maximum-entropy random walks.

Reference graph

Works this paper leans on

268 extracted references · 268 canonical work pages · cited by 27 Pith papers · 19 internal anchors

  1. [1]

    Le, Quoc , title=

    Dogus Cubuk, Ekin and Zoph, Barret and Mane, Dandelion andVasudevan, Vijay and V. Le, Quoc , title=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , year=

  2. [2]

    Proceedings of the IEEE international conference on computer vision , year=

    ViViT: A Video Vision Transformer , author=. Proceedings of the IEEE international conference on computer vision , year=

  3. [3]

    arXiv preprint arXiv:2307.12698 , year=

    MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features , author=. arXiv preprint arXiv:2307.12698 , year=

  4. [4]

    arXiv preprint arXiv:2203.16527 , year=

    Exploring plain vision transformer backbones for object detection , author=. arXiv preprint arXiv:2203.16527 , year=

  5. [5]

    arXiv preprint arXiv:2210.01571 , year=

    VICRegL: Self-Supervised Learning of Local Visual Features , author=. arXiv preprint arXiv:2210.01571 , year=

  6. [6]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  7. [7]

    A Theoretical Analysis of Contrastive Unsupervised Representation Learning

    A theoretical analysis of contrastive unsupervised representation learning , author=. arXiv preprint arXiv:1902.09229 , year=

  8. [8]

    Advances in neural information processing systems , volume=

    Unsupervised classifiers, mutual information and'phantom targets , author=. Advances in neural information processing systems , volume=

  9. [9]

    NeurIPS , pages=

    Spectral relaxation for k-means clustering , author=. NeurIPS , pages=

  10. [10]

    Journal of statistical software , volume=

    Spherical k-means clustering , author=. Journal of statistical software , volume=

  11. [11]

    Expert systems with applications , volume=

    A simple and fast algorithm for K-medoids clustering , author=. Expert systems with applications , volume=. 2009 , publisher=

  12. [12]

    , author=

    Visualizing data using t-SNE. , author=. Journal of machine learning research , volume=

  13. [13]

    2010 IEEE International Conference on Data Mining , pages=

    Learning a bi-stochastic data similarity matrix , author=. 2010 IEEE International Conference on Data Mining , pages=. 2010 , organization=

  14. [14]

    Proceedings of the 23rd international conference on Machine learning , pages=

    The uniqueness of a good optimum for k-means , author=. Proceedings of the 23rd international conference on Machine learning , pages=

  15. [15]

    Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=

    Adapting the right measures for k-means clustering , author=. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining , pages=

  16. [16]

    IEEE Transactions on Fuzzy Systems , volume=

    The K -means-type algorithms versus imbalanced data distributions , author=. IEEE Transactions on Fuzzy Systems , volume=. 2012 , publisher=

  17. [17]

    Size matters: Cardinality-constrained clustering and outlier detection via conic optimization , author=. SIAM J. Optimization , volume=. 2019 , publisher=

  18. [18]

    Microsoft Research, Redmond , volume=

    Constrained k-means clustering , author=. Microsoft Research, Redmond , volume=

  19. [19]

    ICML , pages=

    Fair k-center clustering for data summarization , author=. ICML , pages=. 2019 , organization=

  20. [20]

    arXiv preprint arXiv:1904.03035 , year=

    Identifying and reducing gender bias in word-level language models , author=. arXiv preprint arXiv:1904.03035 , year=

  21. [21]

    Conference on Fairness, Accountability and Transparency , pages=

    Gender shades: Intersectional accuracy disparities in commercial gender classification , author=. Conference on Fairness, Accountability and Transparency , pages=. 2018 , organization=

  22. [22]

    Frontiers of Information Technology & Electronic Engineering , pages=

    On the principles of Parsimony and Self-consistency for the emergence of intelligence , author=. Frontiers of Information Technology & Electronic Engineering , pages=. 2022 , publisher=

  23. [23]

    2019 , publisher=

    Cybernetics or Control and Communication in the Animal and the Machine , author=. 2019 , publisher=

  24. [25]

    Advances in neural information processing systems , volume=

    Discriminative clustering by regularized information maximization , author=. Advances in neural information processing systems , volume=

  25. [26]

    Advances in neural information processing systems , volume=

    Pytorch: An imperative style, high-performance deep learning library , author=. Advances in neural information processing systems , volume=

  26. [27]

    International conference on machine learning , pages=

    Data-efficient image recognition with contrastive predictive coding , author=. International conference on machine learning , pages=. 2020 , organization=

  27. [28]

    International conference on machine learning , pages=

    Learning discrete representations via information maximizing self-augmented training , author=. International conference on machine learning , pages=. 2017 , organization=

  28. [29]

    Computer , volume=

    Self-organization in a perceptual network , author=. Computer , volume=. 1988 , publisher=

  29. [30]

    K., GELLY, S., LUCIC, M

    On mutual information maximization for representation learning , author=. arXiv preprint arXiv:1907.13625 , year=

  30. [31]

    Proceedings of the annual meeting of the cognitive science society , volume=

    One shot learning of simple visual concepts , author=. Proceedings of the annual meeting of the cognitive science society , volume=

  31. [32]

    Artificial Intelligence and Statistics , pages=

    Learning a nonlinear embedding by preserving class neighbourhood structure , author=. Artificial Intelligence and Statistics , pages=. 2007 , organization=

  32. [33]

    1980 , publisher=

    Jean Piaget , author=. 1980 , publisher=

  33. [34]

    Journal of research in science teaching , volume=

    Cognitive development in children: Piaget , author=. Journal of research in science teaching , volume=

  34. [35]

    Synthese , pages=

    Artificial intelligence and Piagetian theory , author=. Synthese , pages=. 1978 , publisher=

  35. [36]

    Annals of the New York Academy of Sciences , volume=

    Reply to Individual and collective problems in the study of thinking , author=. Annals of the New York Academy of Sciences , volume=

  36. [37]

    , author=

    Biology and knowledge: An essay on the relations between organic regulations and cognitive processes. , author=. 1971 , publisher=

  37. [38]

    Semi-supervised learning , pages=

    Entropy regularization , author=. Semi-supervised learning , pages=. 2006 , publisher=

  38. [40]

    arXiv preprint arXiv:2006.10029 , year=

    Big self-supervised models are strong semi-supervised learners , author=. arXiv preprint arXiv:2006.10029 , year=

  39. [44]

    arXiv preprint arXiv:1606.04080 , year=

    Matching networks for one shot learning , author=. arXiv preprint arXiv:1606.04080 , year=

  40. [45]

    Available: https://arxiv.org/abs/1703.05175

    Prototypical networks for few-shot learning , author=. arXiv preprint arXiv:1703.05175 , year=

  41. [46]

    Optimization as a model for few-shot learning , author=

  42. [47]

    Behavioral and brain sciences , volume=

    Building machines that learn and think like people , author=. Behavioral and brain sciences , volume=. 2017 , publisher=

  43. [48]

    International Journal of Computer Vision , volume=

    Imagenet large scale visual recognition challenge , author=. International Journal of Computer Vision , volume=

  44. [49]

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

    Deep residual learning for image recognition , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

  45. [50]

    Large Batch Training of Convolutional Networks

    Large batch training of convolutional networks , author=. arXiv preprint arXiv:1708.03888 , year=

  46. [51]

    International conference on machine learning , pages=

    On the importance of initialization and momentum in deep learning , author=. International conference on machine learning , pages=

  47. [52]

    preprint arXiv:1904.12848 , year=

    Unsupervised data augmentation , author=. arXiv preprint arXiv:1904.12848 , year=

  48. [53]

    D., Kurakin, A., Zhang, H., and Raffel, C

    Fixmatch: Simplifying semi-supervised learning with consistency and confidence , author=. arXiv preprint arXiv:2001.07685 , year=

  49. [54]

    preprint arXiv:2003.10580 , year=

    Meta pseudo labels , author=. arXiv preprint arXiv:2003.10580 , year=

  50. [55]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Unsupervised feature learning via non-parametric instance discrimination , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  51. [56]

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

    Self-supervised learning of pretext-invariant representations , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

  52. [57]

    arXiv preprint arXiv:1803.00676 , year=

    Meta-learning for semi-supervised few-shot classification , author=. arXiv preprint arXiv:1803.00676 , year=

  53. [58]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He and Haoqi Fan and Yuxin Wu and Saining Xie and Ross Girshick , title =. arXiv preprint arXiv:1911.05722 , year =

  54. [59]

    Improved Baselines with Momentum Contrastive Learning

    Xinlei Chen and Haoqi Fan and Ross Girshick and Kaiming He , title =. arXiv preprint arXiv:2003.04297 , year =

  55. [60]

    arXiv preprint arXiv:1810.02334 , year=

    Unsupervised learning via meta-learning , author=. arXiv preprint arXiv:1810.02334 , year=

  56. [61]

    Exploring simple siamese representation learning

    Exploring Simple Siamese Representation Learning , author=. arXiv preprint arXiv:2011.10566 , year=

  57. [62]

    Loshchilov, Ilya and Hutter, Frank , journal=

  58. [63]

    arXiv preprint arXiv:2004.11362 , year=

    Supervised Contrastive Learning , author=. arXiv preprint arXiv:2004.11362 , year=

  59. [64]

    IEEE transactions on pattern analysis and machine intelligence , volume=

    Virtual adversarial training: a regularization method for supervised and semi-supervised learning , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2018 , publisher=

  60. [65]

    arXiv preprint arXiv:1903.03825 , year=

    Interpolation Consistency Training for Semi-Supervised Learning , author=. arXiv preprint arXiv:1903.03825 , year=

  61. [66]

    Proceedings of the IEEE international conference on computer vision , pages=

    S4l: Self-supervised semi-supervised learning , author=. Proceedings of the IEEE international conference on computer vision , pages=

  62. [67]

    In International Conference on Machine Learning Workshop , year=

    Lee, Dong-Hyun , title=. In International Conference on Machine Learning Workshop , year=

  63. [68]

    , title=

    Scudder, H. , title=. IEEE Transactions on Information Theory , volume=

  64. [69]

    In Proceedings of the National Conference on Artificial Intelligence , year=

    Riloff, Ellen , title=. In Proceedings of the National Conference on Artificial Intelligence , year=

  65. [70]

    Advances in Neural Information Processing Systems , pages=

    Mixmatch: A holistic approach to semi-supervised learning , author=. Advances in Neural Information Processing Systems , pages=

  66. [71]

    arXiv preprint arXiv:1911.09785 , year=

    ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring , author=. arXiv preprint arXiv:1911.09785 , year=

  67. [72]

    In 33rd Annual Meeting of the Association for Computational Linguistics , year=

    Yarowsky, David , title=. In 33rd Annual Meeting of the Association for Computational Linguistics , year=

  68. [73]

    arXiv preprint arXiv:1911.05371 , year=

    Self-labelling via simultaneous clustering and representation learning , author=. arXiv preprint arXiv:1911.05371 , year=

  69. [74]

    preprint arXiv:2006.06882 , year=

    Rethinking pre-training and self-training , author=. arXiv preprint arXiv:2006.06882 , year=

  70. [75]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Self-training with noisy student improves imagenet classification , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  71. [77]

    arXiv preprint arXiv:2112.10740 , year=

    Are Large-scale Datasets Necessary for Self-Supervised Pre-training? , author=. arXiv preprint arXiv:2112.10740 , year=

  72. [78]

    Representation learning via invariant causal mechanisms

    Representation learning via invariant causal mechanisms , author=. arXiv preprint arXiv:2010.07922 , year=

  73. [79]

    preprint arXiv:2006.10803 , year=

    Supervision accelerates pre-training in contrastive semi-supervised learning of visual representations , author=. arXiv preprint arXiv:2006.10803 , year=

  74. [80]

    arXiv preprint arXiv:1206.6413 , year=

    A convex relaxation for weakly supervised classifiers , author=. arXiv preprint arXiv:1206.6413 , year=

  75. [81]

    arXiv preprint arXiv:1610.02242 , year=

    Temporal ensembling for semi-supervised learning , author=. arXiv preprint arXiv:1610.02242 , year=

  76. [82]

    arXiv preprint arXiv:1902.02336 , year=

    Semi-supervised learning by label gradient alignment , author=. arXiv preprint arXiv:1902.02336 , year=

  77. [83]

    arXiv preprint arXiv:1911.09265 , year=

    Enaet: Self-trained ensemble autoencoding transformations for semi-supervised learning , author=. arXiv preprint arXiv:1911.09265 , year=

  78. [84]

    2009 , publisher=

    Learning multiple layers of features from tiny images , author=. 2009 , publisher=

  79. [85]

    Wide Residual Networks

    Wide residual networks , author=. arXiv preprint arXiv:1605.07146 , year=

  80. [86]

    Communications of the ACM , volume=

    YFCC100M: The new data in multimedia research , author=. Communications of the ACM , volume=. 2016 , publisher=

Showing first 80 references.