RetailSMV: Exocentric vs. Egocentric Adaptation of Foundation Video World Models in Retail

Amirreza Rouhi; Anoop M. Namboodiri; Parikshit Sakurikar; Rajat Aggarwal; Sashi P. Reddi

arxiv: 2607.00310 · v1 · pith:BC6556JHnew · submitted 2026-07-01 · 💻 cs.CV · cs.AI

Pith reviewed 2026-07-02 15:30 UTC · model grok-4.3

classification: cs.CV · cs.AI

keywords: retail video · egocentric exocentric · video world models · LoRA adaptation · foundation model fine-tuning · synchronized multi-view · domain adaptation · video diffusion

The pith

Exocentric-only adaptation of a foundation video world model matches or beats combined egocentric-exocentric training on retail scenes despite using half the clips.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests which viewpoint of training data best adapts a pretrained video diffusion model to retail environments when both egocentric and exocentric synchronized clips are available. It releases RetailSMV, a dataset of over 32,000 captioned clips from five supermarkets showing staff activities such as stocking and checkout. Three matched LoRA adaptations of Cosmos3-Nano are trained under identical conditions: egocentric-only, exocentric-only, and combined. On a held-out test set with seven metrics and paired statistical tests, the exocentric-only model equals or exceeds the combined model on six metrics and is significantly stronger on LPIPS, PSNR, and DreamSim. Adding exocentric data helps an egocentric-only model while adding egocentric data hurts an exocentric-only model, with the largest benefit appearing in short-horizon rollouts.

Core claim

Exocentric-only adaptation matches or exceeds combined adaptation on six of seven point estimates and is significantly better on LPIPS, PSNR, and DreamSim, despite training on only 15,985 exocentric clips versus 32,105 for the combined case. A symmetric paired comparison shows that adding exocentric data to egocentric-only training helps while adding egocentric data to exocentric-only training hurts. The absolute adaptation gap is largest at the shortest rollout time.
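This page does not say which paired test the authors use. As a minimal sketch of what a paired protocol on a shared test set can look like, the Python below runs a two-sided sign-flip permutation test on per-clip score differences between two models. The function name, the resample count, and the synthetic scores are assumptions made for illustration; they are not the paper's protocol or its data.

```python
import random


def paired_sign_flip_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on per-clip paired differences.

    scores_a[i] and scores_b[i] must come from the same test clip.
    Returns (mean difference a - b, p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs one score per clip for both models")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= abs(observed):
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_resamples + 1)
    return observed, p_value


# Synthetic placeholder scores for 200 clips (NOT results from the paper).
data_rng = random.Random(1)
model_a = [25.0 + data_rng.gauss(0, 1) for _ in range(200)]
model_b = [a - 0.3 + data_rng.gauss(0, 0.2) for a in model_a]

mean_diff, p = paired_sign_flip_test(model_a, model_b)
print(f"mean paired difference = {mean_diff:.3f}, p = {p:.4f}")
```

Pairing matters because both models are scored on the same clips: the test works on per-clip differences, so clip-to-clip difficulty cancels out instead of inflating the variance.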

What carries the argument

RetailSMV corpus of 32,105 synchronized ego/exo retail clips paired with three matched LoRA configurations of Cosmos3-Nano (egocentric-only, exocentric-only, combined) evaluated under a strict paired statistical protocol on seven metrics.
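For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update in plain Python: a frozen weight W0 is combined with a trainable product B·A scaled by alpha / r. The matrix sizes, rank, and alpha are hypothetical values chosen for illustration; this page does not state the rank, scaling, or which layers of Cosmos3-Nano the authors adapt.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w0, lora_a, lora_b, alpha):
    """Effective weight of a LoRA-adapted layer: W0 + (alpha / r) * B @ A.

    w0:     frozen base weight, shape (d_out, d_in)
    lora_a: trainable matrix A, shape (r, d_in)
    lora_b: trainable matrix B, shape (d_out, r)
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


# Hypothetical sizes, far smaller than any real video model layer.
d_out, d_in, rank, alpha = 8, 8, 2, 4.0
rng = random.Random(0)
w0 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
lora_a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]  # common zero initialisation

adapted = lora_weight(w0, lora_a, lora_b, alpha)
trainable_params = rank * (d_in + d_out)
full_params = d_in * d_out
print(f"trainable: {trainable_params} of {full_params} weights in this layer")
```

With B initialised to zero the adapted layer starts out identical to the pretrained one, and only A and B are updated during fine-tuning. That is what makes it possible to train the three configurations from the same starting point with matched adapter capacity.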

If this is right

Exocentric data alone is sufficient or preferable for this form of retail adaptation.
Adding egocentric data to an exocentric-only model degrades performance.
The benefit of viewpoint-specific adaptation is strongest for near-term video prediction.
Multi-view training is not automatically superior when the two viewpoints are synchronized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Retail world models may benefit from deliberate viewpoint selection rather than simply collecting more data.
Similar viewpoint comparisons could be run in other embodied domains such as kitchens or warehouses to check whether exocentric preference generalizes.
Short-horizon prediction may be the most practical regime for deploying these adapted models in agent loops.

Load-bearing premise

The seven evaluation metrics serve as valid proxies for how well the adapted models will perform when used by embodied retail agents.

What would settle it

A controlled test in which the three adapted models are deployed inside physical retail robots performing the same stocking and checkout tasks and their task success rates are measured directly against the metric rankings.

Original abstract

Foundation video diffusion models are increasingly viewed as world simulators for embodied agents, yet their pretraining on internet-scale generic video leaves them poorly aligned with real-world deployment domains. We study parameter-efficient adaptation of a pretrained foundation video world model to retail scenes: when synchronized egocentric and exocentric video of the same activity are available, which viewpoint of training data produces the strongest adapted model? We introduce RetailSMV (Retail Synchronized Multi-View), a corpus of 32,105 captioned retail clips from five supermarkets with synchronized ego/exo capture from the store-staff perspective (stocking, arranging, weighing, managing supply carts, scanning at checkout), rather than the customer-centric framing of prior retail video corpora, and train three matched Low-Rank Adaptation (LoRA) configurations of Cosmos3-Nano (egocentric-only, exocentric-only, combined) under identical hyperparameters. On a 200-clip held-out test set evaluated with seven complementary metrics under a strict paired statistical protocol, exocentric-only adaptation matches or exceeds combined adaptation on six of seven point estimates and is significantly better on LPIPS, PSNR, and DreamSim, despite training on only 15,985 exocentric clips (versus 32,105 for combined). A symmetric paired comparison further shows that adding exocentric data to egocentric-only training helps while adding egocentric data to exocentric-only training hurts. The absolute adaptation gap is largest at the shortest rollout time, identifying the near-horizon prediction window as the regime in which adaptation is most beneficial.
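PSNR is one of the reported metrics, and the abstract's last claim concerns how the adaptation gap changes with rollout time. The sketch below uses the standard PSNR definition, 10·log10(MAX² / MSE), and computes a per-frame gap between an adapted and a base model. Frames are represented as flat lists of values in [0, 1], and the toy frames are constructed by hand purely to exercise the code; both choices are assumptions, and none of the numbers come from the paper.

```python
import math


def psnr(pred, target, max_value=1.0):
    """PSNR in dB between two equal-length frames given as flat lists of floats."""
    if len(pred) != len(target):
        raise ValueError("frames must have the same number of values")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_by_horizon(pred_frames, target_frames):
    """One PSNR value per rollout step (index 0 is the earliest predicted frame)."""
    return [psnr(p, t) for p, t in zip(pred_frames, target_frames)]


def adaptation_gap(adapted_curve, base_curve):
    """Per-step gain of the adapted model over the base model, in dB."""
    return [a - b for a, b in zip(adapted_curve, base_curve)]


# Hand-made toy frames (4 values each, 3 rollout steps); not real data.
target = [[0.5] * 4 for _ in range(3)]
base_pred = [[0.7] * 4 for _ in range(3)]
adapted_pred = [[0.55] * 4, [0.6] * 4, [0.7] * 4]

gap = adaptation_gap(
    psnr_by_horizon(adapted_pred, target),
    psnr_by_horizon(base_pred, target),
)
for step, g in enumerate(gap):
    print(f"rollout step {step}: gap = {g:.2f} dB")
```

Reporting the gap per step, rather than one clip-level average, is what lets a near-horizon effect be seen at all; an average over the whole rollout would blur it.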

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Exocentric-only LoRA beats or matches combined training on most video metrics with half the data, but the work stops at perceptual scores with no link to embodied agent performance.

The main thing to know is that in this retail setup, exocentric-only adaptation of Cosmos3-Nano via LoRA matches or exceeds the combined ego+exo version on six of seven point estimates and beats it significantly on LPIPS, PSNR, and DreamSim, even though it uses only 15,985 clips versus 32,105. The symmetric result that adding exocentric data helps egocentric-only training but adding egocentric data hurts exocentric-only training is also reported.

The paper introduces RetailSMV, a new corpus of synchronized multi-view retail clips from the staff perspective covering stocking, scanning, and similar actions. That is a concrete addition over prior customer-centric retail video sets. The experimental design keeps the LoRA rank, learning rate, and other hyperparameters matched across the three conditions and evaluates on a held-out 200-clip test set with a paired statistical protocol. Those choices make the comparison straightforward to interpret.

The soft spot is the evaluation itself. All claims rest on standard video-prediction metrics. The abstract positions the work as improving world models for embodied retail agents, yet there are no ablations or correlations showing that better LPIPS or DreamSim scores (both are distances, so lower is better) produce better planning success, interaction accuracy, or policy transfer in actual retail tasks. The largest adaptation gap appears at short rollouts, which is consistent with the abstract's near-horizon framing but narrows the practical scope.

This is useful for groups working on viewpoint-specific domain adaptation of video models in constrained environments. The dataset and the head-to-head numbers are new enough that a serious editor should send it to referees, with the expectation that reviewers will request downstream task results.

Referee Report

1 major / 0 minor

Summary. The paper introduces the RetailSMV dataset of 32,105 synchronized egocentric/exocentric retail clips and compares three matched LoRA adaptations of Cosmos3-Nano (ego-only, exo-only, combined) under identical hyperparameters. On a 200-clip held-out test set with seven metrics and a paired statistical protocol, it claims exo-only adaptation matches or exceeds combined training on six of seven point estimates (significantly better on LPIPS, PSNR, DreamSim) despite using only 15,985 clips, that adding exo data helps ego-only models while adding ego data hurts exo-only models, and that the adaptation gap is largest at the shortest rollout horizons.

Significance. If the results hold, the work suggests that exocentric data can be more effective than combined or egocentric data for parameter-efficient adaptation of video world models in retail, with potential efficiency gains. Strengths include the new synchronized multi-view corpus focused on staff activities, matched LoRA configurations, held-out evaluation, and paired statistical protocol. The finding that adaptation benefit is largest at short horizons is a concrete, falsifiable observation.

major comments (1)

[Abstract and held-out test set evaluation] The central claim that exocentric-only adaptation produces the 'strongest adapted model' for embodied retail agents (Abstract) rests on the untested assumption that the seven metrics (LPIPS, PSNR, DreamSim and the remaining four) are valid proxies for downstream usefulness such as planning success, interaction accuracy, or policy transfer. No ablations, correlations, or agent-task experiments linking better metric scores to improved embodied outcomes are described, so the practical implication for deployment remains unsupported.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comment on the link between our metrics and downstream embodied tasks. We address the point below and propose textual revisions to avoid overclaiming.

Referee: [Abstract and held-out test set evaluation] The central claim that exocentric-only adaptation produces the 'strongest adapted model' for embodied retail agents (Abstract) rests on the untested assumption that the seven metrics (LPIPS, PSNR, DreamSim and the remaining four) are valid proxies for downstream usefulness such as planning success, interaction accuracy, or policy transfer. No ablations, correlations, or agent-task experiments linking better metric scores to improved embodied outcomes are described, so the practical implication for deployment remains unsupported.

Authors: We acknowledge that the abstract's phrasing implies relevance to embodied retail agents and that our evaluation relies on standard video-generation metrics without direct downstream validation. These metrics (LPIPS, PSNR, DreamSim, and the others) are widely adopted in the world-model literature precisely because they quantify predictive fidelity and perceptual quality, which are prerequisites for planning and interaction. The paper's core contribution is the controlled, matched comparison of ego-only, exo-only, and combined LoRA adaptation on synchronized data; the metrics serve as the evaluation protocol for that comparison. We agree that explicit agent-task ablations or correlations would strengthen deployment claims and are absent here. We will therefore revise the abstract to state that exocentric-only adaptation is strongest with respect to the reported metrics, and we will add an explicit limitations paragraph noting the lack of downstream task experiments.

Revision: yes
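The correlation analysis that the referee asks for and the authors concede is absent could take many forms. One simple option, sketched here in Python, is a Spearman rank correlation between a perceptual metric and a downstream task success rate measured over the same set of model checkpoints or episodes. The helper names and the five placeholder data points are hypothetical; no such measurements are reported on this page.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholders: LPIPS (lower is better) and task success rate.
lpips_scores = [0.10, 0.15, 0.20, 0.25, 0.30]
task_success = [0.90, 0.80, 0.80, 0.50, 0.40]

rho = spearman(lpips_scores, task_success)
print(f"Spearman rho = {rho:.3f}")
```

A strongly negative rho for a distance metric such as LPIPS would support using it as a proxy for task success; a value near zero would mean the metric ranking says little about deployment behavior.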

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on held-out test set

The paper performs matched LoRA fine-tuning of a pretrained model on three data partitions (ego-only, exo-only, combined) drawn from RetailSMV and reports performance on a 200-clip held-out test set using seven standard perceptual metrics under paired statistics. No equations, parameter-fitting steps, or derivations are present that could reduce to self-definition or fitted-input-as-prediction. No load-bearing self-citations or uniqueness theorems are invoked. The central claim is an empirical observation about relative metric values, not a mathematical result derived from the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LoRA is an appropriate adaptation method and that the chosen metrics reflect useful model behavior; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: LoRA adaptation preserves the benefits of the pretrained Cosmos3-Nano model while allowing domain-specific improvement.
The three matched LoRA configurations presuppose that this adaptation technique is effective for the video world model.

pith-pipeline@v0.9.1-grok · 5836 in / 1148 out tokens · 35913 ms · 2026-07-02T15:30:45.307914+00:00 · methodology

