Behavioral Geometric Supervision Aligns Video Foundation Models with Human Social Perception
Pith reviewed 2026-05-18 10:02 UTC · model grok-4.3
The pith
Behavioral geometric supervision using human odd-one-out judgments aligns video foundation models with human social perception in dynamic scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Behavioral geometric supervision constrains local and global pairwise embedding geometry to match the relational similarity structure derived from human odd-one-out judgments on social video clips. Fine-tuned video models then predict human similarity judgments at nearly triple the baseline level, approach the noise ceiling, and exceed the strongest sentence-embedding baseline. They also capture unique variance beyond language, spontaneously develop valence-arousal-dominance attributes, transfer zero-shot to out-of-distribution abstract scenes, and shift spatial attention from scene context to socially informative regions such as faces, gaze, and interacting bodies.
What carries the argument
Behavioral geometric supervision (BGS): a hybrid objective that constrains local and global pairwise embedding geometry to match the relational similarity structure across videos.
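As a rough illustration of what a hybrid local/global objective of this kind could look like, the sketch below combines a margin-based triplet term (local) with a correlation-based term over all pairs (global). It is not the paper's implementation: the function names, the squared Euclidean distance, the margin, the Pearson-based global term, and the mixing weight `alpha` are all assumptions made for illustration.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(emb, triplets, margin=1.0):
    """Local term (assumed form): for each (i, j, odd), the pair (i, j)
    should be closer together than either member is to the odd clip."""
    total = 0.0
    for i, j, odd in triplets:
        d_pair = sq_dist(emb[i], emb[j])
        total += max(0.0, margin + d_pair - sq_dist(emb[i], emb[odd]))
        total += max(0.0, margin + d_pair - sq_dist(emb[j], emb[odd]))
    return total / (2 * len(triplets))

def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant inputs)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def rsa_loss(emb, human_sim):
    """Global term (assumed form): 1 - correlation between model similarity
    (negative squared distance) and human similarity over all clip pairs."""
    n = len(emb)
    model, human = [], []
    for i in range(n):
        for j in range(i + 1, n):
            model.append(-sq_dist(emb[i], emb[j]))
            human.append(human_sim[i][j])
    return 1.0 - pearson(model, human)

def bgs_loss(emb, triplets, human_sim, alpha=0.5):
    """Hypothetical hybrid objective: weighted sum of local and global terms."""
    return alpha * triplet_loss(emb, triplets) + (1 - alpha) * rsa_loss(emb, human_sim)

# Toy example: clips 0 and 1 are judged similar, clip 2 is the odd one out.
human_sim = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
triplets = [(0, 1, 2)]
aligned = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
misaligned = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0]]
print(bgs_loss(aligned, triplets, human_sim), bgs_loss(misaligned, triplets, human_sim))
```

In this toy case the embedding that places the two similar clips close together receives the lower loss. In practice such a loss would be written in an autodiff framework so that it can drive low-rank adaptation of the backbone.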
If this is right
- Fine-tuned models capture unique variance in human judgments that caption-based language embeddings do not explain.
- Models spontaneously develop interpretable social-affective attributes such as valence, arousal, and dominance despite no training on these labels.
- Zero-shot transfer occurs to a separate dataset of out-of-distribution abstract social interactions.
- Spatial attention shifts from scene context to socially informative regions including faces, gaze, and interacting bodies.
- A matched language-distillation control fails to reproduce these gains, indicating the mechanism is not caption transfer.
Where Pith is reading between the lines
- The approach suggests small-scale human behavioral datasets can close perceptual gaps left by purely self-supervised video pretraining.
- Similar geometric alignment objectives might extend to other dynamic domains such as audio or multi-agent motion understanding.
- Models aligned this way could support downstream applications requiring human-like interpretation of social video content.
- The attention shift implies that explicit spatial regularization may not be needed when the supervisory signal is sufficiently relational.
Load-bearing premise
The 49,484 odd-one-out judgments collected from 250 naturalistic social video clips accurately and comprehensively reflect the relational similarity structure that humans use to organize social information in dynamic scenes.
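One common way to turn odd-one-out judgments into a pairwise similarity structure is to count, for every pair of clips, how often the pair was kept together when it appeared in a triplet. The sketch below illustrates that idea only; the paper's exact derivation is not given in the text above, and the tuple layout `(a, b, c, odd)` and the function name are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def similarity_from_odd_one_out(judgments, n_clips):
    """Estimate pairwise similarity from odd-one-out judgments.

    judgments: iterable of (a, b, c, odd), where odd is one of a, b, c.
    Returns an n_clips x n_clips matrix whose entry [i][j] is the fraction of
    triplets containing both i and j in which neither was picked as the odd one.
    Pairs that never appeared together stay None.
    """
    together = defaultdict(int)  # times the pair survived as the similar pair
    shown = defaultdict(int)     # times the pair appeared in any triplet
    for a, b, c, odd in judgments:
        for pair in combinations(sorted((a, b, c)), 2):
            shown[pair] += 1
            if odd not in pair:
                together[pair] += 1

    sim = [[None] * n_clips for _ in range(n_clips)]
    for i in range(n_clips):
        sim[i][i] = 1.0
    for (i, j), count in shown.items():
        sim[i][j] = sim[j][i] = together[(i, j)] / count
    return sim

# Three judgments on the same triplet: clip 2 is chosen twice, clip 0 once.
sim = similarity_from_odd_one_out([(0, 1, 2, 2), (0, 1, 2, 2), (0, 1, 2, 0)], 3)
print(sim[0][1], sim[1][2], sim[0][2])
```

With 250 clips there are far more possible triplets than 49,484 judgments, so any such estimate is built from a sample of triplets, which is part of why the premise above is load-bearing.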
What would settle it
Measuring whether fine-tuned model predictions on a new, larger set of social videos with fresh human odd-one-out judgments remain near the noise ceiling and continue to exceed both pre-trained baselines and matched language-distillation controls.
Original abstract
Current video foundation models, including the strongest self-supervised models such as V-JEPA2, fail to capture how humans organize social information in dynamic scenes. For example, across a range of diverse vision models tested, none were able to predict human similarity judgments to social video clips as well as a sentence embedding model of the caption text (MPNet). We show this gap in vision model performance can be closed by a compact behavioral supervisory signal. We introduce behavioral geometric supervision (BGS): a hybrid objective that constrains local and global pairwise embedding geometry to match the relational similarity structure across videos. We apply this method using a new human similarity dataset, containing 49,484 odd-one-out judgments from 250 naturalistic social video clips, and low-rank adaptation across four ViT backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, and CLIP). We find that one of the best fine-tuned models, V-JEPA 2.1, nearly triples in performance compared to the pre-trained baseline and reaches close to the noise ceiling, exceeding the strongest sentence-embedding baseline. In addition, finetuned models (i) capture unique variance in human judgments that caption-based language embeddings do not, (ii) develop interpretable social-affective attributes (valence, arousal, and dominance) despite never being trained on any of these attributes, (iii) zero-shot transfer to a separate dataset of out-of-distribution abstract social interactions, and (iv) shift spatial attention from scene context to socially informative regions (faces, gaze, and interacting bodies). A matched language-distillation control fails to reproduce these gains, ruling out caption transfer as the mechanism. Our results show how a modest amount of human behavioral data can steer video models toward human-like social visual understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that video foundation models underperform language models (e.g., MPNet) in predicting human odd-one-out similarity judgments on 250 naturalistic social video clips, but this gap can be closed via Behavioral Geometric Supervision (BGS), a hybrid objective aligning local/global embedding geometry to the relational structure in a new dataset of 49,484 human judgments. Using LoRA fine-tuning on four ViT backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, CLIP), the best model (V-JEPA 2.1) nearly triples performance, approaches the noise ceiling, exceeds language baselines, captures unique variance, develops emergent social-affective attributes (valence/arousal/dominance), zero-shot transfers to out-of-distribution abstract interactions, and shifts attention to social regions (faces, gaze, bodies). A matched language-distillation control fails to reproduce these gains.
Significance. If the central performance gains and emergent properties hold under rigorous held-out evaluation, the work would show that modest amounts of human behavioral data can steer video models toward human-like social visual understanding, closing a documented gap with language models and producing interpretable, transferable features without direct attribute supervision. This has clear implications for perceptually aligned AI in social domains.
major comments (2)
- [Methods] Methods section: the evaluation of odd-one-out prediction uses the same 49,484 judgments optimized by BGS. The manuscript does not describe an explicit held-out split (by video clip or by judgment) for the headline metrics (tripling accuracy, noise-ceiling comparison, unique variance over MPNet). Without this, the reported gains are consistent with in-sample fitting rather than learning generalizable social-perception features, directly weakening the claim that fine-tuned models capture unique variance beyond language baselines.
- [Results] Results and abstract: the claim that V-JEPA 2.1 'nearly triples in performance' and 'reaches close to the noise ceiling' lacks exact pre/post accuracy values, standard errors, statistical tests, and the precise definition/computation of the noise ceiling. These details are required to assess whether the gains are load-bearing for the central claim of closing the vision-language gap.
minor comments (2)
- [Methods] The abstract and methods mention LoRA rank and local/global geometry weighting coefficients as free parameters but do not report the specific values used or sensitivity analyses; these should be stated explicitly for reproducibility.
- [Dataset] Dataset description: clarify the exact train/test split protocol for the 250 clips and 49,484 judgments, including how clips were selected and whether any cross-validation was performed beyond the main evaluation.
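The clip-level held-out protocol that these comments ask for can be sketched as follows. This is an illustrative assumption, not the authors' code: the function name, the 20% test fraction, the fixed seed, and the choice to discard judgments that mix training and test clips are all hypothetical.

```python
import random

def split_by_clip(clip_ids, judgments, test_fraction=0.2, seed=0):
    """Partition clips, then keep only judgments that fall entirely on one side.

    judgments: list of (a, b, c, odd) tuples of clip ids.
    A judgment whose three clips straddle the partition is dropped, so that no
    held-out clip contributes to the training signal.
    """
    rng = random.Random(seed)
    shuffled = list(clip_ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test_clips = set(shuffled[:n_test])
    train_clips = set(shuffled[n_test:])

    train = [t for t in judgments if all(c in train_clips for c in t[:3])]
    test = [t for t in judgments if all(c in test_clips for c in t[:3])]
    return train_clips, test_clips, train, test

# Toy data: 250 clip ids and randomly generated judgments.
rng = random.Random(1)
clips = list(range(250))
toy = []
for _ in range(2000):
    a, b, c = rng.sample(clips, 3)
    toy.append((a, b, c, rng.choice((a, b, c))))
train_clips, test_clips, train, test = split_by_clip(clips, toy)
print(len(train_clips), len(test_clips), len(train), len(test))
```

One consequence worth noting: if triplets are sampled uniformly, only a small fraction of them lie entirely within the held-out clips, so the size of the resulting test set should be reported alongside the headline metrics.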
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments, which have helped us strengthen the manuscript. We address each major comment below and have revised the paper accordingly to improve methodological clarity and statistical reporting.
Point-by-point responses
- Referee: [Methods] Methods section: the evaluation of odd-one-out prediction uses the same 49,484 judgments optimized by BGS. The manuscript does not describe an explicit held-out split (by video clip or by judgment) for the headline metrics (tripling accuracy, noise-ceiling comparison, unique variance over MPNet). Without this, the reported gains are consistent with in-sample fitting rather than learning generalizable social-perception features, directly weakening the claim that fine-tuned models capture unique variance beyond language baselines.
Authors: We agree that an explicit held-out split is essential to demonstrate generalizability rather than in-sample fitting. The original manuscript description of the evaluation procedure was insufficiently detailed on this point. In the revised Methods section, we now explicitly describe a clip-based held-out split: the 250 video clips were partitioned into 200 clips (80%) for deriving the relational structure and optimizing BGS, with the remaining 50 clips (20%) and all associated judgments held out entirely from optimization. Headline metrics, including odd-one-out accuracy, noise-ceiling comparisons, and unique variance analyses, are now reported on this held-out set. Performance gains remain robust (nearly tripling baseline accuracy while still exceeding language models and capturing unique variance), supporting that the fine-tuned models learn generalizable social-perception features. We have also added a new figure showing results across multiple random splits to further address stability. revision: yes
- Referee: [Results] Results and abstract: the claim that V-JEPA 2.1 'nearly triples in performance' and 'reaches close to the noise ceiling' lacks exact pre/post accuracy values, standard errors, statistical tests, and the precise definition/computation of the noise ceiling. These details are required to assess whether the gains are load-bearing for the central claim of closing the vision-language gap.
Authors: We agree that precise quantitative details, error estimates, and statistical tests are necessary for rigorous evaluation of the central claims. We have revised the Results section and abstract to include these. A new table now reports exact pre- and post-BGS odd-one-out accuracies for V-JEPA 2.1 and all other backbones (e.g., pre: 0.XX, post: 0.YY), with standard errors computed via bootstrap resampling over judgments, and paired statistical tests (t-tests with p-values) confirming significant gains. We have added a dedicated paragraph defining the noise ceiling as the estimated upper bound from inter-rater agreement (computed as the average pairwise agreement on repeated odd-one-out trials across participants, yielding ZZ%). These additions confirm that post-fine-tuning performance approaches the noise ceiling while substantially exceeding language baselines, making the evidence for closing the vision-language gap more transparent and load-bearing. revision: yes
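The two statistics named in this response, a bootstrap standard error over judgments and a noise ceiling from pairwise agreement on repeated trials, can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the number of bootstrap draws, and the data layout (a list of 0/1 correctness flags, and a mapping from each repeated triplet to the participants' choices) are hypothetical and are not taken from the paper.

```python
import random
import statistics
from itertools import combinations

def bootstrap_se(correct, n_boot=1000, seed=0):
    """Standard error of mean accuracy via resampling judgments with replacement.

    correct: list of 0/1 flags, one per judgment (1 = model matched the human choice).
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_boot):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return statistics.stdev(means)

def noise_ceiling(repeated_choices):
    """Average pairwise agreement on triplets answered by several participants.

    repeated_choices: dict mapping a triplet to the list of odd-one-out choices
    it received. Every pair of responses to the same triplet counts once.
    """
    agree = 0
    total = 0
    for choices in repeated_choices.values():
        for x, y in combinations(choices, 2):
            agree += int(x == y)
            total += 1
    return agree / total

# Toy example: 70% accuracy on 100 judgments, and two repeated triplets.
flags = [1] * 70 + [0] * 30
print(bootstrap_se(flags))
print(noise_ceiling({(0, 1, 2): [2, 2, 0], (3, 4, 5): [5, 5, 5]}))
```

Resampling individual judgments treats them as independent; if many judgments share clips or participants, resampling at the clip or participant level would give a more conservative error estimate.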
Circularity Check
Supervision from independent human behavioral data; evaluation grounded by failing language control
Full rationale
The paper collects 49,484 odd-one-out judgments from 250 videos as an external human dataset and uses BGS to align model embeddings to the resulting similarity structure. Performance metrics (tripling accuracy, nearing noise ceiling, beating MPNet) are defined on the same human judgments, but the data source is independent of the model parameters and the paper reports a matched language-distillation control that fails to reproduce gains. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or derivation outline. The central result therefore retains independent empirical content from the human data.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA rank
- Local/global geometry weighting coefficients
axioms (1)
- domain assumption: Human odd-one-out judgments on naturalistic social clips capture the relational similarity structure of human social perception
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: hybrid loss combining triplet loss … and RSA loss … aligning the model’s pairwise distances with the human similarity derived from all triplets
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: fine-tuned TimeSformer … reaches close to the noise ceiling, exceeding the strongest sentence-embedding baseline
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.