VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026
Pith reviewed 2026-05-21 05:52 UTC · model grok-4.3
The pith
VISTA wins the Ego4D short-term object interaction anticipation challenge by fusing frozen video features into a pretrained object detector.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VISTA follows a StillFast-style design. A COCO-pretrained Faster R-CNN ResNet-50 FPN generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA temporal branch extracts clip-level egocentric context. That context is injected via feature modulation and ROI-level fusion, and the fused features feed multi-head predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence. The resulting system achieves first place on the official EgoVis 2026 Ego4D STA Challenge server.
What carries the argument
Feature modulation and ROI-level context fusion that injects clip-level temporal representations from the frozen video branch directly into the spatial detection pathway.
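The report does not spell out the modulation layer, but it cites FiLM, so a scale-and-shift form is a plausible reading. The sketch below is a minimal illustration under that assumption: a clip-level context vector is projected to one scale and one shift per channel and applied to a spatial feature map. The names `film_modulate` and `linear`, the `(1 + gamma)` parameterisation, and all toy values are hypothetical and not taken from the paper.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[float]]  # [channels][flattened spatial positions]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Plain dense layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def film_modulate(feature_map: FeatureMap, context: Vector,
                  w_gamma: List[Vector], b_gamma: Vector,
                  w_beta: List[Vector], b_beta: Vector) -> FeatureMap:
    """Scale and shift each channel with values predicted from the clip context."""
    gamma = linear(context, w_gamma, b_gamma)  # per-channel scale offset
    beta = linear(context, w_beta, b_beta)     # per-channel shift
    # Assumption: (1 + gamma) keeps the identity mapping when the projection is zero.
    return [[(1.0 + g) * v + b for v in channel]
            for channel, g, b in zip(feature_map, gamma, beta)]


# Toy example: 2 channels, 3 spatial positions, 2-dimensional clip context.
fmap = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
ctx = [1.0, -1.0]
w_g = [[0.5, 0.0], [0.0, 0.5]]
w_b = [[0.0, 0.0], [1.0, 1.0]]
zeros = [0.0, 0.0]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
modulated = film_modulate(fmap, ctx, w_g, zeros, w_b, zeros)
unchanged = film_modulate(fmap, ctx, zero_w, zeros, zero_w, zeros)
```

With zero projection weights the detector features pass through untouched, which is one reason this parameterisation is convenient when a pretrained detector is conditioned on a new signal.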
If this is right
- Multi-head prediction heads simultaneously produce bounding-box refinement, noun category, verb category, time-to-contact value, and interaction confidence from the fused features.
- Ensembling complementary predictions from multiple runs improves robustness on the benchmark without changing the core architecture.
- The design separates high-resolution spatial detection on the final frame from short clip-level temporal context, allowing each component to be swapped independently.
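The five outputs listed above can be pictured as parallel linear heads reading the same fused proposal feature. The following is a hedged sketch only: the `StaPrediction` container, the additive corner offsets for box refinement, the softplus on time-to-contact, and the sigmoid on confidence are illustrative assumptions, not details confirmed by the report.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vector = List[float]
Head = Tuple[List[Vector], Vector]  # (weight rows, bias)


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def argmax(values: Vector) -> int:
    return max(range(len(values)), key=values.__getitem__)


@dataclass
class StaPrediction:
    box: Tuple[float, ...]  # refined (x1, y1, x2, y2)
    noun: int               # index of the predicted noun class
    verb: int               # index of the predicted verb class
    ttc: float              # time to contact in seconds, non-negative
    confidence: float       # interaction confidence in (0, 1)


def predict(fused: Vector, proposal: Tuple[float, ...],
            heads: Dict[str, Head]) -> StaPrediction:
    """Run every head on the same fused proposal feature."""
    deltas = linear(fused, *heads["box"])  # assumption: additive corner offsets
    ttc_raw = linear(fused, *heads["ttc"])[0]
    conf_raw = linear(fused, *heads["conf"])[0]
    return StaPrediction(
        box=tuple(p + d for p, d in zip(proposal, deltas)),
        noun=argmax(linear(fused, *heads["noun"])),
        verb=argmax(linear(fused, *heads["verb"])),
        ttc=math.log1p(math.exp(ttc_raw)),            # softplus keeps TTC >= 0
        confidence=1.0 / (1.0 + math.exp(-conf_raw)),  # sigmoid
    )


# Toy heads for a 2-dimensional fused feature, 3 nouns and 2 verbs.
toy_heads = {
    "box": ([[0.0, 0.0]] * 4, [1.0, 1.0, -1.0, -1.0]),
    "noun": ([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0]),
    "verb": ([[1.0, 0.0], [2.0, 0.0]], [0.0, 0.0]),
    "ttc": ([[0.0, 0.0]], [0.0]),
    "conf": ([[0.0, 0.0]], [0.0]),
}
pred = predict([1.0, 0.0], (10.0, 20.0, 30.0, 40.0), toy_heads)
```

Because every head reads the same input, swapping the detector or the temporal branch only changes how `fused` is produced, which matches the modularity point in the last bullet.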
Where Pith is reading between the lines
- The same modulation-plus-ROI fusion pattern could be tested on other egocentric tasks that require both precise localization and short-term timing awareness.
- Freezing the temporal encoder keeps training cost low, which may make the method practical for settings with limited labeled video data.
- Extending the input clip length while keeping the temporal branch frozen would test whether the current short-horizon context is the main performance limiter.
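To make the clip-length experiment concrete, it helps to compute which timestamps a given setting actually covers. The sketch below assumes the configuration quoted elsewhere on this page (eight observed frames at 2 FPS) as defaults, assumes the final sampled frame coincides with the last observed timestamp, and clamps at zero for clips near the start of a video. The function names and the clamping rule are hypothetical.

```python
from typing import List


def clip_timestamps(t_last: float, num_frames: int = 8, fps: float = 2.0) -> List[float]:
    """Timestamps in seconds of the observed frames, oldest first, ending at t_last."""
    if num_frames < 1 or fps <= 0:
        raise ValueError("num_frames must be >= 1 and fps must be positive")
    step = 1.0 / fps
    times = [t_last - i * step for i in range(num_frames - 1, -1, -1)]
    # Assumption: clamp so clips near the video start never reach before time zero.
    return [max(0.0, t) for t in times]


def observed_span(num_frames: int, fps: float) -> float:
    """Seconds between the first and last sampled frame."""
    return (num_frames - 1) / fps


default_clip = clip_timestamps(10.0)
longer_clip = clip_timestamps(10.0, num_frames=16)
early_clip = clip_timestamps(1.0, num_frames=4)
```

Under these assumptions the default setting spans 3.5 seconds of history, and doubling the frame count at the same rate extends that to 7.5 seconds.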
Load-bearing premise
Short-horizon temporal dynamics for accurate noun, verb, and time-to-contact prediction can be captured by injecting features from a frozen video model into a static-image detector without any fine-tuning of the temporal branch.
What would settle it
A controlled ablation on the official challenge server in which the temporal feature injection is removed entirely and the resulting drop in noun, verb, and time-to-contact metrics is measured against the original first-place score.
Original abstract
We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. It uses a COCO-pretrained Faster R-CNN ResNet-50 FPN detector on the last observed frame combined with a frozen V-JEPA 2.1 temporal branch whose clip-level features are injected via modulation and ROI-level fusion. Multi-head predictors then produce refined boxes, noun and verb labels, time-to-contact regression, and confidence scores, with ensembling applied for the submitted predictions. The central claim is that this system achieved first place on the official challenge server.
Significance. If the reported ranking holds, the work provides a concrete example of fusing frozen pre-trained video representations with a spatial detector for short-horizon egocentric anticipation. The approach illustrates a practical, low-compute way to incorporate temporal context, and the planned code release would support reproducibility and follow-on studies in video understanding.
major comments (1)
- Abstract: the assertion that 'Experimental results on the official challenge server show that VISTA achieves first place' is presented without any accompanying scores, baseline comparisons, ablation results, or error analysis. This is load-bearing for the manuscript's primary empirical claim, which rests entirely on an external server evaluation that is not documented or contextualized inside the paper.
minor comments (1)
- The description of feature modulation and ROI-level fusion remains high-level; adding a diagram or pseudocode would improve clarity of how the frozen temporal features are integrated.
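One possible form such pseudocode could take for the ROI-level step is sketched here. It assumes fusion by concatenation: each pooled proposal feature is joined with the same clip-level context vector, then passed through a linear projection and ReLU. The report does not confirm this operator; `fuse_rois`, the shapes, and the toy weights are illustrative only.

```python
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fuse_rois(roi_features: List[Vector], context: Vector,
              weight: List[Vector], bias: Vector) -> List[Vector]:
    """Attach the shared clip context to every proposal, then project with ReLU."""
    fused = []
    for roi in roi_features:
        joint = roi + context  # list concatenation: [roi dims..., context dims...]
        fused.append([max(0.0, h) for h in linear(joint, weight, bias)])
    return fused


# Toy example: two proposals with 2-d pooled features and a 1-d clip context.
rois = [[1.0, 2.0], [3.0, -4.0]]
clip_context = [0.5]
proj_w = [[1.0, 0.0, 2.0], [0.0, 1.0, -2.0]]
proj_b = [0.0, 0.0]
fused_rois = fuse_rois(rois, clip_context, proj_w, proj_b)
```

The same context vector is reused for every proposal, so the temporal branch runs once per clip regardless of how many boxes the detector proposes.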
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our technical report. We agree that the primary empirical claim requires better documentation and contextualization within the manuscript. We address the major comment below and will incorporate the suggested changes in the revised version.
Point-by-point responses
- Referee: Abstract: the assertion that 'Experimental results on the official challenge server show that VISTA achieves first place' is presented without any accompanying scores, baseline comparisons, ablation results, or error analysis. This is load-bearing for the manuscript's primary empirical claim, which rests entirely on an external server evaluation that is not documented or contextualized inside the paper.
Authors: We acknowledge that the current abstract states the first-place result without supporting numerical details or internal analysis, which limits the reader's ability to evaluate the claim independently of the external leaderboard. As this is a concise technical report focused on the winning submission to the Ego4D STA challenge, the primary evidence is the official server ranking. However, we agree this should be better contextualized. In the revision we will expand the abstract to report the specific official metrics (e.g., the exact mAP, noun/verb accuracy, and TTC error values that secured first place). We will also add a short Results section containing the official challenge scores, a comparison table against other submitted methods where leaderboard data is public, and a summary of key internal ablations (feature modulation vs. fusion, ensembling impact) that informed the final design. This will document the external evaluation inside the paper while preserving the report's brevity.
Revision: yes
Circularity Check
No significant circularity; purely empirical challenge report
The paper presents an empirical architecture (COCO-pretrained Faster R-CNN with frozen V-JEPA features injected via modulation and ROI fusion, followed by multi-head predictors and ensembling) and reports first-place ranking on the external EgoVis 2026 Ego4D STA challenge server. No mathematical derivations, equations, or self-referential definitions exist that reduce any claimed prediction or result to its own inputs by construction. The performance claim is externally validated and independent of internal modeling choices, satisfying the criteria for a self-contained empirical submission with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (1)
- ensemble weights
axioms (1)
- domain assumption: COCO-pretrained Faster R-CNN and frozen V-JEPA 2.1 features transfer adequately to egocentric short-term anticipation without temporal-branch fine-tuning.
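The ensemble weights listed as a free parameter can be illustrated with a weighted average over predictions that several runs make for the same object. This is a sketch under strong assumptions: the report does not describe its ensembling rule, the predictions are taken to be already matched across runs, and `fuse_matched` with its toy weights is hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def fuse_matched(boxes: Sequence[Box], scores: Sequence[float],
                 run_weights: Sequence[float]) -> Tuple[Box, float]:
    """Weighted average of boxes and scores that several runs assign to one object."""
    if not (len(boxes) == len(scores) == len(run_weights)):
        raise ValueError("boxes, scores and run_weights must have equal length")
    total = sum(run_weights)
    if total <= 0:
        raise ValueError("run_weights must sum to a positive value")
    norm: List[float] = [w / total for w in run_weights]  # weights now sum to 1
    box = tuple(sum(w * b[i] for w, b in zip(norm, boxes)) for i in range(4))
    score = sum(w * s for w, s in zip(norm, scores))
    return box, score


# Toy example: two runs, the first trusted three times as much as the second.
fused_box, fused_score = fuse_matched(
    boxes=[(0.0, 0.0, 10.0, 10.0), (2.0, 2.0, 12.0, 12.0)],
    scores=[0.8, 0.4],
    run_weights=[3.0, 1.0],
)
```

Normalising the weights keeps the fused score on the same scale as the inputs, so the weights only set the relative trust placed in each run.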
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context... injected... through feature modulation and ROI-level context fusion"
- IndisputableMonolith/Foundation/DimensionForcing.lean, reality_from_one_distinction [unclear]
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "samples eight observed frames at 2 FPS"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.