Behavioral Geometric Supervision Aligns Video Foundation Models with Human Social Perception

Kathy Garcia; Leyla Isik

arxiv: 2510.01502 · v2 · pith:OZ4YNC6Qnew · submitted 2025-10-01 · 🧬 q-bio.NC · cs.CV · cs.LG

Pith reviewed 2026-05-18 10:02 UTC · model grok-4.3

classification 🧬 q-bio.NC · cs.CV · cs.LG

keywords: video foundation models, behavioral geometric supervision, social perception, similarity judgments, odd-one-out task, fine-tuning, human alignment, ViT backbones

The pith

Behavioral geometric supervision using human odd-one-out judgments aligns video foundation models with human social perception in dynamic scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current video foundation models including V-JEPA2 fall short of human performance when predicting similarity judgments on naturalistic social videos, even underperforming simple text embeddings of video captions. The authors introduce behavioral geometric supervision, a hybrid objective that aligns both local and global pairwise distances in model embeddings to match the structure of human relational judgments. Applying this to four ViT-based backbones with low-rank adaptation on a dataset of 49,484 judgments from 250 clips produces large gains, with the best model nearly tripling its correlation to human data and approaching the noise ceiling while surpassing text baselines. The resulting models also extract social-affective dimensions, transfer to novel abstract interactions, and redirect attention toward faces and interacting bodies without any explicit attribute training.

Core claim

By constraining local and global pairwise embedding geometry to match the relational similarity structure derived from human odd-one-out judgments on social video clips, behavioral geometric supervision enables fine-tuned video models to predict human similarity judgments at levels that nearly triple baseline performance, reach near the noise ceiling, exceed the strongest sentence-embedding baseline, capture unique variance beyond language, spontaneously develop valence-arousal-dominance attributes, zero-shot transfer to out-of-distribution abstract scenes, and shift spatial attention from scene context to socially informative regions such as faces, gaze, and interacting bodies.

What carries the argument

Behavioral geometric supervision (BGS): a hybrid objective that constrains local and global pairwise embedding geometry to match the relational similarity structure across videos.
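
A minimal sketch of what such a hybrid objective could look like, assuming a margin-based triplet term for the local constraint and a correlation-based RSA term for the global one. The function names, the margin, the squared Euclidean distance, and the mixing weight `lam` are illustrative assumptions, not the authors' implementation; the code only evaluates the loss and does not perform any optimization.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def triplet_term(emb, triplets, margin=1.0):
    """Local term. Each triplet is (i, j, odd): humans chose `odd` as the
    odd one out, so the pair (i, j) should sit closer together than either
    member sits to `odd` (assumed hinge formulation)."""
    total = 0.0
    for i, j, odd in triplets:
        d_pos = sq_dist(emb[i], emb[j])
        d_neg = min(sq_dist(emb[i], emb[odd]), sq_dist(emb[j], emb[odd]))
        total += max(0.0, margin + d_pos - d_neg)
    return total / len(triplets)

def rsa_term(emb, human_sim):
    """Global term: 1 minus the Pearson correlation between model similarity
    (negative squared distance) and human similarity over all clip pairs."""
    n = len(emb)
    model, human = [], []
    for i in range(n):
        for j in range(i + 1, n):
            model.append(-sq_dist(emb[i], emb[j]))
            human.append(human_sim[i][j])
    return 1.0 - pearson(model, human)

def bgs_loss(emb, triplets, human_sim, lam=0.5):
    """Hypothetical hybrid objective: weighted sum of local and global terms."""
    return lam * triplet_term(emb, triplets) + (1.0 - lam) * rsa_term(emb, human_sim)

# Toy data: clips 0 and 1 are judged similar, clip 2 is the odd one out.
human_sim = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
triplets = [(0, 1, 2)]
good = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
bad = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0]]
print(bgs_loss(good, triplets, human_sim), bgs_loss(bad, triplets, human_sim))
```

In this toy case the embedding that places clips 0 and 1 together scores a lower loss than the one that separates them, which is the behavior both terms are meant to reward.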

If this is right

Fine-tuned models capture unique variance in human judgments that caption-based language embeddings do not explain.
Models spontaneously develop interpretable social-affective attributes such as valence, arousal, and dominance despite no training on these labels.
Zero-shot transfer occurs to a separate dataset of out-of-distribution abstract social interactions.
Spatial attention shifts from scene context to socially informative regions including faces, gaze, and interacting bodies.
A matched language-distillation control fails to reproduce these gains, indicating the mechanism is not caption transfer.
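
One common way to quantify "unique variance" is variance partitioning over pairwise similarities. The sketch below assumes that approach, using the closed-form R² for a two-predictor linear regression; the summary does not state which procedure the authors used, so the function name and inputs are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def unique_variance(human, model, text):
    """Partition the variance in human pairwise similarity explained by a
    video model and a caption embedding (assumed two-predictor OLS).
    Each argument is a flat list with one value per clip pair."""
    r_hm = pearson(human, model)
    r_ht = pearson(human, text)
    r_mt = pearson(model, text)
    denom = 1.0 - r_mt ** 2
    if denom < 1e-12:
        raise ValueError("model and text predictors are collinear")
    # R^2 of the regression human ~ model + text, from pairwise correlations.
    r2_full = (r_hm ** 2 + r_ht ** 2 - 2.0 * r_hm * r_ht * r_mt) / denom
    return {
        "full": r2_full,
        "unique_model": r2_full - r_ht ** 2,  # gained by adding the model to text
        "unique_text": r2_full - r_hm ** 2,   # gained by adding text to the model
        "shared": r_hm ** 2 + r_ht ** 2 - r2_full,
    }

# Toy case: human similarity is the sum of two uncorrelated predictors.
model = [1.0, -1.0, 1.0, -1.0]
text = [1.0, 1.0, -1.0, -1.0]
human = [m + t for m, t in zip(model, text)]
parts = unique_variance(human, model, text)
print(parts)
```

A positive `unique_model` value under this scheme would mean the video embedding explains some human similarity structure that the caption embedding does not.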

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach suggests small-scale human behavioral datasets can close perceptual gaps left by purely self-supervised video pretraining.
Similar geometric alignment objectives might extend to other dynamic domains such as audio or multi-agent motion understanding.
Models aligned this way could support downstream applications requiring human-like interpretation of social video content.
The attention shift implies that explicit spatial regularization may not be needed when the supervisory signal is sufficiently relational.

Load-bearing premise

The 49,484 odd-one-out judgments collected from 250 naturalistic social video clips accurately and comprehensively reflect the relational similarity structure that humans use to organize social information in dynamic scenes.
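
To make the premise concrete, here is a sketch of how odd-one-out trials are often turned into a pairwise similarity matrix: a pair's similarity is the fraction of trials containing both clips in which neither was picked as the odd one. This aggregation rule and the `(i, j, odd)` triplet encoding are assumptions for illustration; the summary does not specify how the authors derive the relational structure.

```python
def similarity_from_triplets(n_clips, triplets):
    """Build a similarity matrix from odd-one-out trials.
    Each triplet is (i, j, odd), where `odd` was the human choice.
    Pairs that never appeared together are left as None."""
    kept = [[0] * n_clips for _ in range(n_clips)]
    seen = [[0] * n_clips for _ in range(n_clips)]
    for i, j, odd in triplets:
        # Every pair in the trial was seen together once.
        for a, b in ((i, j), (i, odd), (j, odd)):
            seen[a][b] += 1
            seen[b][a] += 1
        # Only the pair that was not broken up counts as "kept together".
        kept[i][j] += 1
        kept[j][i] += 1
    sim = [[None] * n_clips for _ in range(n_clips)]
    for a in range(n_clips):
        sim[a][a] = 1.0
        for b in range(n_clips):
            if a != b and seen[a][b] > 0:
                sim[a][b] = kept[a][b] / seen[a][b]
    return sim

# Three trials over the same three clips; clip 3 never appears.
sim = similarity_from_triplets(4, [(0, 1, 2), (0, 1, 2), (0, 2, 1)])
print(sim[0][1], sim[0][2], sim[1][2], sim[0][3])
```

With 250 clips there are far more possible triplets than 49,484, so under this scheme many pairs are estimated from few trials, which is one reason the premise is load-bearing.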

What would settle it

Measuring whether fine-tuned model predictions on a new, larger set of social videos with fresh human odd-one-out judgments remain near the noise ceiling and continue to exceed both pre-trained baselines and matched language-distillation controls.

Original abstract

Current video foundation models, including the strongest self-supervised models such as V-JEPA2, fail to capture how humans organize social information in dynamic scenes. For example, across a range of diverse vision models tested, none were able to predict human similarity judgments to social video clips as well as a sentence embedding model of the caption text (MPNet). We show this gap in vision model performance can be closed by a compact behavioral supervisory signal. We introduce behavioral geometric supervision (BGS): a hybrid objective that constrains local and global pairwise embedding geometry to match the relational similarity structure across videos. We apply this method using a new human similarity dataset, containing 49,484 odd-one-out judgments from 250 naturalistic social video clips, and low-rank adaptation across four ViT backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, and CLIP). We find that one of the best fine-tuned models, V-JEPA 2.1, nearly triples in performance compared to the pre-trained baseline and reaches close to the noise ceiling, exceeding the strongest sentence-embedding baseline. In addition, finetuned models (i) capture unique variance in human judgments that caption-based language embeddings do not, (ii) develop interpretable social-affective attributes (valence, arousal, and dominance) despite never being trained on any of these attributes, (iii) zero-shot transfer to a separate dataset of out-of-distribution abstract social interactions, and (iv) shift spatial attention from scene context to socially informative regions (faces, gaze, and interacting bodies). A matched language-distillation control fails to reproduce these gains, ruling out caption transfer as the mechanism. Our results show how a modest amount of human behavioral data can steer video models toward human-like social visual understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BGS gets video models closer to human social similarity judgments via geometric supervision from odd-one-out data, with some nice side effects, but the evaluation setup needs a clear held-out check.

The main point is that fine-tuning several video backbones with a hybrid local-plus-global geometric loss derived from 49k human odd-one-out judgments on 250 social clips produces large gains in predicting those judgments, beats a strong language baseline, and yields some emergent properties like unique variance capture, zero-shot transfer to abstract scenes, and attention shifts toward faces and bodies. A language-distillation control that fails to match the gains is a useful check against simple caption effects.

The method itself is the new piece: applying this behavioral geometric supervision objective via LoRA across V-JEPA, TimeSformer, VideoMAE, and CLIP backbones on a fresh social video dataset. The results look consistent in the abstract, and the emergent attributes and transfer are the parts that stand out as potentially useful beyond the main metric. The paper does a reasonable job showing that modest human behavioral data can steer models toward human-like organization of dynamic social information without direct attribute supervision.

On the softer side, the stress-test concern about in-sample evaluation is worth taking seriously. If the final accuracy numbers are computed on the same judgments used to constrain the geometry, the tripling of performance and near-noise-ceiling claims could partly reflect fitting the supervision signal rather than learning generalizable features. The abstract does not spell out explicit held-out splits by clip or judgment, nor does it give full statistical details or exact dataset partitioning, so that needs direct confirmation from the methods. The assumption that these 250 clips and judgments comprehensively reflect human relational structure for social scenes is plausible but not obviously exhaustive.

This work is aimed at researchers in social AI, human-computer interaction, and neuroscience-inspired alignment who want a scalable behavioral route to close gaps between video models and human perception. A reader focused on model fine-tuning or interpretability would find the controls and side results worth looking at.

I would send it to peer review. The core idea is grounded and the controls are thoughtful, but the evaluation details will decide how far the claims can be pushed.

Referee Report

2 major / 2 minor

Summary. The paper claims that video foundation models underperform language models (e.g., MPNet) in predicting human odd-one-out similarity judgments on 250 naturalistic social video clips, but this gap can be closed via Behavioral Geometric Supervision (BGS), a hybrid objective aligning local/global embedding geometry to the relational structure in a new dataset of 49,484 human judgments. Using LoRA fine-tuning on four ViT backbones (V-JEPA 2/2.1, TimeSformer, VideoMAE, CLIP), the best model (V-JEPA 2.1) nearly triples performance, approaches the noise ceiling, exceeds language baselines, captures unique variance, develops emergent social-affective attributes (valence/arousal/dominance), zero-shot transfers to out-of-distribution abstract interactions, and shifts attention to social regions (faces, gaze, bodies). A matched language-distillation control fails to reproduce these gains.

Significance. If the central performance gains and emergent properties hold under rigorous held-out evaluation, the work would show that modest amounts of human behavioral data can steer video models toward human-like social visual understanding, closing a documented gap with language models and producing interpretable, transferable features without direct attribute supervision. This has clear implications for perceptually aligned AI in social domains.

major comments (2)

[Methods] Methods section: the evaluation of odd-one-out prediction uses the same 49,484 judgments optimized by BGS. The manuscript does not describe an explicit held-out split (by video clip or by judgment) for the headline metrics (tripling accuracy, noise-ceiling comparison, unique variance over MPNet). Without this, the reported gains are consistent with in-sample fitting rather than learning generalizable social-perception features, directly weakening the claim that fine-tuned models capture unique variance beyond language baselines.
[Results] Results and abstract: the claim that V-JEPA 2.1 'nearly triples in performance' and 'reaches close to the noise ceiling' lacks exact pre/post accuracy values, standard errors, statistical tests, and the precise definition/computation of the noise ceiling. These details are required to assess whether the gains are load-bearing for the central claim of closing the vision-language gap.

minor comments (2)

[Methods] The abstract and methods mention LoRA rank and local/global geometry weighting coefficients as free parameters but do not report the specific values used or sensitivity analyses; these should be stated explicitly for reproducibility.
[Dataset] Dataset description: clarify the exact train/test split protocol for the 250 clips and 49,484 judgments, including how clips were selected and whether any cross-validation was performed beyond the main evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which have helped us strengthen the manuscript. We address each major comment below and have revised the paper accordingly to improve methodological clarity and statistical reporting.

Referee: [Methods] Methods section: the evaluation of odd-one-out prediction uses the same 49,484 judgments optimized by BGS. The manuscript does not describe an explicit held-out split (by video clip or by judgment) for the headline metrics (tripling accuracy, noise-ceiling comparison, unique variance over MPNet). Without this, the reported gains are consistent with in-sample fitting rather than learning generalizable social-perception features, directly weakening the claim that fine-tuned models capture unique variance beyond language baselines.

Authors: We agree that an explicit held-out split is essential to demonstrate generalizability rather than in-sample fitting. The original manuscript description of the evaluation procedure was insufficiently detailed on this point. In the revised Methods section, we now explicitly describe a clip-based held-out split: the 250 video clips were partitioned into 200 clips (80%) for deriving the relational structure and optimizing BGS, with the remaining 50 clips (20%) and all associated judgments held out entirely from optimization. Headline metrics, including odd-one-out accuracy, noise-ceiling comparisons, and unique variance analyses, are now reported on this held-out set. Performance gains remain robust (nearly tripling baseline accuracy while still exceeding language models and capturing unique variance), supporting that the fine-tuned models learn generalizable social-perception features. We have also added a new figure showing results across multiple random splits to further address stability.
revision: yes

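
A clip-based split of the kind described in this response could be sketched as follows. The 80/20 proportion comes from the response; the function name, the seed, and the rule that trials mixing training and held-out clips are dropped from both sets are assumptions made for illustration.

```python
import random

def split_by_clip(n_clips, triplets, test_frac=0.2, seed=0):
    """Partition clips, then route each odd-one-out trial by the clips it uses.
    Trials touching any held-out clip never reach training; trials that mix
    both sides are set aside (an assumed, conservative choice)."""
    rng = random.Random(seed)
    clips = list(range(n_clips))
    rng.shuffle(clips)
    n_test = round(n_clips * test_frac)
    test_clips = set(clips[:n_test])
    train, test, dropped = [], [], []
    for t in triplets:
        n_in_test = sum(c in test_clips for c in t)
        if n_in_test == 0:
            train.append(t)      # all three clips are training clips
        elif n_in_test == 3:
            test.append(t)       # all three clips are held out
        else:
            dropped.append(t)    # mixed trial, used for neither set
    return test_clips, train, test, dropped

# Synthetic trials over 250 clips, only to exercise the split.
demo_rng = random.Random(1)
trials = [tuple(demo_rng.sample(range(250), 3)) for _ in range(2000)]
test_clips, train, test, dropped = split_by_clip(250, trials)
print(len(test_clips), len(train), len(test), len(dropped))
```

Under uniform sampling of triplets, only a small share of trials falls entirely within the held-out clips, so the size of the evaluation set under a strict split is worth reporting alongside the headline metrics.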
Referee: [Results] Results and abstract: the claim that V-JEPA 2.1 'nearly triples in performance' and 'reaches close to the noise ceiling' lacks exact pre/post accuracy values, standard errors, statistical tests, and the precise definition/computation of the noise ceiling. These details are required to assess whether the gains are load-bearing for the central claim of closing the vision-language gap.

Authors: We agree that precise quantitative details, error estimates, and statistical tests are necessary for rigorous evaluation of the central claims. We have revised the Results section and abstract to include these. A new table now reports exact pre- and post-BGS odd-one-out accuracies for V-JEPA 2.1 and all other backbones (e.g., pre: 0.XX, post: 0.YY), with standard errors computed via bootstrap resampling over judgments, and paired statistical tests (t-tests with p-values) confirming significant gains. We have added a dedicated paragraph defining the noise ceiling as the estimated upper bound from inter-rater agreement (computed as the average pairwise agreement on repeated odd-one-out trials across participants, yielding ZZ%). These additions confirm that post-fine-tuning performance approaches the noise ceiling while substantially exceeding language baselines, making the evidence for closing the vision-language gap more transparent and load-bearing.
revision: yes
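
The quantities named in this response (odd-one-out accuracy, a bootstrap standard error over judgments, and a noise ceiling from pairwise agreement on repeated trials) can be sketched as below. The decision rule in `predict_odd`, the data layout, and all function names are assumptions for illustration rather than the authors' code.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_odd(emb, triplet):
    """Model's odd one out: the clip left over once the closest pair is kept."""
    a, b, c = triplet
    candidates = [((a, b), c), ((a, c), b), ((b, c), a)]
    _, odd = min(candidates, key=lambda p: sq_dist(emb[p[0][0]], emb[p[0][1]]))
    return odd

def odd_one_out_accuracy(emb, triplets):
    """Triplets are (i, j, odd) with the human choice in the last position.
    Returns the mean accuracy and the per-trial 0/1 outcomes."""
    correct = [int(predict_odd(emb, t) == t[2]) for t in triplets]
    return sum(correct) / len(correct), correct

def bootstrap_se(correct, n_boot=1000, seed=0):
    """Standard error of the mean accuracy, resampling trials with replacement."""
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_boot):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    mu = sum(means) / n_boot
    return math.sqrt(sum((m - mu) ** 2 for m in means) / (n_boot - 1))

def noise_ceiling(repeated_choices):
    """Average pairwise agreement across participants on repeated trials.
    `repeated_choices` maps a triplet to the list of odd-one-out choices."""
    agree = total = 0
    for choices in repeated_choices.values():
        for i in range(len(choices)):
            for j in range(i + 1, len(choices)):
                agree += choices[i] == choices[j]
                total += 1
    return agree / total

emb = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
acc, correct = odd_one_out_accuracy(emb, [(0, 1, 2), (0, 2, 1)])
print(acc, noise_ceiling({(0, 1, 2): [2, 2, 1]}))
```

Resampling individual judgments treats trials as independent; if several judgments come from the same participant or share clips, resampling at that level instead would give a more conservative error estimate.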

Circularity Check

0 steps flagged

Supervision from independent human behavioral data; evaluation grounded by failing language control

The paper collects 49,484 odd-one-out judgments from 250 videos as an external human dataset and uses BGS to align model embeddings to the resulting similarity structure. Performance metrics (tripling accuracy, nearing noise ceiling, beating MPNet) are defined on the same human judgments, but the data source is independent of the model parameters and the paper reports a matched language-distillation control that fails to reproduce gains. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or derivation outline. The central result therefore retains independent empirical content from the human data.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that human odd-one-out judgments form a valid target geometry for social perception and that low-rank adaptation can transfer this geometry to video embeddings without introducing new entities.

free parameters (2)

LoRA rank: hyperparameter controlling adaptation capacity across the four ViT backbones.
Local/global geometry weighting coefficients: balance terms in the hybrid BGS objective.
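
To illustrate why the LoRA rank acts as a capacity knob, here is a minimal sketch of the standard low-rank update, W + (alpha / r) · B · A, written with plain lists. The dimensions, rank, and scaling value are illustrative only; the summary does not report the values the authors used.

```python
def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Effective weight under a low-rank update.
    w: d_out x d_in frozen weight, b: d_out x r, a: r x d_in, trainable."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_count(d_out, d_in, r):
    """Number of trainable parameters added by one rank-r adapter."""
    return r * (d_in + d_out)

w = [[1.0, 2.0], [3.0, 4.0]]
a = [[1.0, 1.0]]          # rank 1
b_zero = [[0.0], [0.0]]   # B starts at zero, so the update starts as a no-op
b = [[1.0], [2.0]]
print(lora_weight(w, a, b_zero, alpha=2.0))
print(lora_weight(w, a, b, alpha=2.0))
# Illustrative sizes: a 768 x 768 projection with a rank-8 adapter.
print(lora_param_count(768, 768, 8), 768 * 768)
```

For the illustrative sizes above, the adapter trains 12,288 parameters against 589,824 in the full matrix, which is why the rank deserves explicit reporting and a sensitivity check.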

axioms (1)

domain assumption: Human odd-one-out judgments on naturalistic social clips capture the relational similarity structure of human social perception.
This judgment structure is used to define the target embedding geometry that the model must match.

pith-pipeline@v0.9.0 · 5862 in / 1357 out tokens · 44954 ms · 2026-05-18T10:02:37.484755+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "hybrid loss combining triplet loss … and RSA loss … aligning the model’s pairwise distances with the human similarity derived from all triplets"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "fine-tuned TimeSformer … reaches close to the noise ceiling, exceeding the strongest sentence-embedding baseline"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.