arxiv: 2605.20901 · v1 · pith:HE3JL23Qnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

Pith reviewed 2026-05-21 05:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords egocentric vision · short-term anticipation · object interaction prediction · feature modulation · frozen video encoder · challenge submission

The pith

VISTA wins the Ego4D short-term object interaction anticipation challenge by fusing frozen video features into a pretrained object detector.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system that takes an egocentric video up to the current moment and predicts the next human-object interaction: the future active object's bounding box, its category (noun), the verb of the action, the time until contact, and a confidence score. It starts with object proposals from an image detector that was pretrained on static images and is applied to the last observed frame, then adds short-term temporal context from a video model whose weights stay frozen. The temporal signal reaches the detector through feature modulation and region-level (ROI) fusion, after which separate heads predict each piece of the interaction output. An ensemble of such predictions is submitted to the official server. The approach matters because it shows how an off-the-shelf spatial detector can be given useful timing awareness for prediction tasks without retraining the expensive video encoder.
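To make the five-part output concrete, here is a minimal sketch of one prediction record. The class name, field names, box layout, and the use of seconds for time-to-contact are illustrative assumptions, not taken from the paper or from the challenge's submission format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class STAPrediction:
    """One anticipated interaction. All names here are hypothetical."""
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) layout
    noun: str                # category of the future active object
    verb: str                # anticipated action
    time_to_contact: float   # assumed to be in seconds
    confidence: float        # interaction confidence in [0, 1]

    def __post_init__(self) -> None:
        # Basic sanity checks so malformed predictions fail early.
        x1, y1, x2, y2 = self.box
        if x2 <= x1 or y2 <= y1:
            raise ValueError("box must satisfy x1 < x2 and y1 < y2")
        if self.time_to_contact < 0:
            raise ValueError("time_to_contact must be non-negative")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


# Made-up values, for illustration only.
example = STAPrediction(
    box=(120.0, 80.0, 260.0, 210.0),
    noun="cup",
    verb="take",
    time_to_contact=0.75,
    confidence=0.62,
)
```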

Core claim

VISTA follows a StillFast-style design. A COCO-pretrained Faster R-CNN ResNet-50 FPN generates object proposals from the last observed high-resolution frame. A frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. That context is injected into the detection pathway via feature modulation and ROI-level fusion. The fused proposal features feed multi-head predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence. The resulting system achieves first place on the official EgoVis 2026 Ego4D STA Challenge server.

What carries the argument

Feature modulation and ROI-level context fusion that injects clip-level temporal representations from the frozen video branch directly into the spatial detection pathway.
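The paper names the two injection mechanisms but the page gives no formulas. The sketch below shows one plausible reading, under stated assumptions: modulation is taken to be a FiLM-style channel-wise affine transform whose scale and shift are linear projections of the clip context, and ROI-level fusion is taken to be concatenation of the context onto each proposal feature. All function names and weights are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List, Sequence

Vector = List[float]


def linear(x: Sequence[float], weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> Vector:
    """Dense layer: y[i] = sum_j weight[i][j] * x[j] + bias[i]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film_modulate(spatial: Sequence[float], gamma: Sequence[float],
                  beta: Sequence[float]) -> Vector:
    """Assumed FiLM-style modulation: out[c] = (1 + gamma[c]) * x[c] + beta[c]."""
    return [(1.0 + g) * s + b for s, g, b in zip(spatial, gamma, beta)]


def fuse_roi(roi_feature: Sequence[float], context: Sequence[float]) -> Vector:
    """Assumed ROI-level fusion: append the clip context to the ROI feature."""
    return list(roi_feature) + list(context)


# Toy example: 2-d clip context from the frozen branch, 3-channel ROI feature.
context = [0.5, -1.0]
roi = [2.0, 4.0, 6.0]

# Hypothetical trainable projections from context to per-channel scale/shift.
w_gamma = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_beta = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
zero_bias = [0.0, 0.0, 0.0]

gamma = linear(context, w_gamma, zero_bias)
beta = linear(context, w_beta, zero_bias)
modulated = film_modulate(roi, gamma, beta)
fused = fuse_roi(modulated, context)
```

Writing the scale as `1 + gamma` means a zero projection leaves the detector features unchanged, which is a common way to keep a pretrained pathway stable at the start of training. Whether VISTA does this is not stated on this page.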

If this is right

  • Multi-head predictors simultaneously produce bounding-box refinement, noun category, verb category, time-to-contact value, and interaction confidence from the fused features.
  • Ensembling complementary predictions from multiple runs improves robustness on the benchmark without changing the core architecture.
  • The design separates high-resolution spatial detection on the final frame from short clip-level temporal context, allowing each component to be swapped independently.
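The multi-head step in the first bullet can be sketched as one linear head per output applied to the same fused feature. The activations are assumptions chosen only to keep each output in a sensible range: softmax for noun and verb, softplus for a non-negative time-to-contact, and sigmoid for confidence. The head names and parameter layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Head = Tuple[Matrix, Sequence[float]]  # (weight, bias)


def linear(x: Sequence[float], weight: Matrix, bias: Sequence[float]) -> List[float]:
    """Dense layer: y[i] = sum_j weight[i][j] * x[j] + bias[i]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def softplus(v: float) -> float:
    """Stable log(1 + exp(v)); always positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def sta_heads(feature: Sequence[float], heads: Dict[str, Head]) -> Dict[str, object]:
    """Apply five independent heads to one fused proposal feature."""
    return {
        "box_deltas": linear(feature, *heads["box"]),            # 4 refinement offsets
        "noun_probs": softmax(linear(feature, *heads["noun"])),
        "verb_probs": softmax(linear(feature, *heads["verb"])),
        "ttc": softplus(linear(feature, *heads["ttc"])[0]),
        "confidence": sigmoid(linear(feature, *heads["conf"])[0]),
    }


# Toy parameters for a 2-d feature, 3 nouns, and 2 verbs (all made up).
heads = {
    "box": ([[0.0, 0.0]] * 4, [0.1, 0.0, -0.1, 0.0]),
    "noun": ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
    "verb": ([[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0]),
    "ttc": ([[0.0, 0.0]], [0.0]),
    "conf": ([[2.0, 0.0]], [0.0]),
}
out = sta_heads([1.0, -1.0], heads)
```

Because the heads share only their input, any one of them can be retrained or replaced without touching the others, which fits the swap-friendly separation described in the last bullet.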

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modulation-plus-ROI fusion pattern could be tested on other egocentric tasks that require both precise localization and short-term timing awareness.
  • Freezing the temporal encoder keeps training cost low, which may make the method practical for settings with limited labeled video data.
  • Extending the input clip length while keeping the temporal branch frozen would test whether the current short-horizon context is the main performance limiter.

Load-bearing premise

Short-horizon temporal dynamics for accurate noun, verb, and time-to-contact prediction can be captured by injecting features from a frozen video model into a static-image detector without any fine-tuning of the temporal branch.

What would settle it

A controlled ablation on the official challenge server in which the temporal feature injection is removed entirely and the resulting drop in noun, verb, and time-to-contact metrics is measured against the original first-place score.

Figures

Figures reproduced from arXiv: 2605.20901 by Dongmei Jiang, Haoyu Zhang, Liqiang Nie, Meng Liu, Qiaohui Chu, Weili Guan, Yisen Feng.

Figure 1: Overview of VISTA. VISTA combines a still-object detection branch with a frozen V-JEPA 2.1 temporal branch. Temporal
Figure 2: Successful qualitative example. VISTA correctly local
Figure 3: Failure qualitative example. VISTA is distracted by a
Abstract

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript describes VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. It uses a COCO-pretrained Faster R-CNN ResNet-50 FPN detector on the last observed frame combined with a frozen V-JEPA 2.1 temporal branch whose clip-level features are injected via modulation and ROI-level fusion. Multi-head predictors then produce refined boxes, noun and verb labels, time-to-contact regression, and confidence scores, with ensembling applied for the submitted predictions. The central claim is that this system achieved first place on the official challenge server.

Significance. If the reported ranking holds, the work provides a concrete example of fusing frozen pre-trained video representations with a spatial detector for short-horizon egocentric anticipation. The approach illustrates a practical, low-compute way to incorporate temporal context, and the planned code release would support reproducibility and follow-on studies in video understanding.

major comments (1)
  1. Abstract: the assertion that 'Experimental results on the official challenge server show that VISTA achieves first place' is presented without any accompanying scores, baseline comparisons, ablation results, or error analysis. This is load-bearing for the manuscript's primary empirical claim, which rests entirely on an external server evaluation that is not documented or contextualized inside the paper.
minor comments (1)
  1. The description of feature modulation and ROI-level fusion remains high-level; adding a diagram or pseudocode would improve clarity of how the frozen temporal features are integrated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our technical report. We agree that the primary empirical claim requires better documentation and contextualization within the manuscript. We address the major comment below and will incorporate the suggested changes in the revised version.

Point-by-point responses
  1. Referee: Abstract: the assertion that 'Experimental results on the official challenge server show that VISTA achieves first place' is presented without any accompanying scores, baseline comparisons, ablation results, or error analysis. This is load-bearing for the manuscript's primary empirical claim, which rests entirely on an external server evaluation that is not documented or contextualized inside the paper.

    Authors: We acknowledge that the current abstract states the first-place result without supporting numerical details or internal analysis, which limits the reader's ability to evaluate the claim independently of the external leaderboard. As this is a concise technical report focused on the winning submission to the Ego4D STA challenge, the primary evidence is the official server ranking. However, we agree this should be better contextualized. In the revision we will expand the abstract to report the specific official metrics (e.g., the exact mAP, noun/verb accuracy, and TTC error values that secured first place). We will also add a short Results section containing the official challenge scores, a comparison table against other submitted methods where leaderboard data is public, and a summary of key internal ablations (feature modulation vs. fusion, ensembling impact) that informed the final design. This will document the external evaluation inside the paper while preserving the report's brevity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical challenge report

The paper presents an empirical architecture (COCO-pretrained Faster R-CNN with frozen V-JEPA features injected via modulation and ROI fusion, followed by multi-head predictors and ensembling) and reports first-place ranking on the external EgoVis 2026 Ego4D STA challenge server. No mathematical derivations, equations, or self-referential definitions exist that reduce any claimed prediction or result to its own inputs by construction. The performance claim is externally validated and independent of internal modeling choices, satisfying the criteria for a self-contained empirical submission with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on transfer-learning assumptions about pretrained models and on post-hoc ensembling whose weights are not described.

free parameters (1)
  • ensemble weights
    Weights used to combine complementary predictions are chosen after seeing validation performance.
axioms (1)
  • domain assumption: COCO-pretrained Faster R-CNN and frozen V-JEPA 2.1 features transfer adequately to egocentric short-term anticipation without temporal-branch fine-tuning.
    Explicitly stated by freezing V-JEPA and using the COCO detector directly.
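
The ensemble-weight entry in the ledger can be illustrated with a small sketch. The paper does not describe how predictions are combined, so everything here is an assumption: per-candidate scores from each model are merged by a normalized weighted average, and the weights are picked from a small grid by a caller-supplied validation metric. All names and numbers are hypothetical.

```python
from typing import Callable, List, Sequence


def ensemble_scores(per_model_scores: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Weighted average of scores; per_model_scores[m][k] is model m, candidate k."""
    if len(per_model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    n_candidates = len(per_model_scores[0])
    return [sum(norm[m] * per_model_scores[m][k] for m in range(len(norm)))
            for k in range(n_candidates)]


def select_weights(grid: Sequence[Sequence[float]],
                   per_model_scores: Sequence[Sequence[float]],
                   metric: Callable[[List[float]], float]) -> Sequence[float]:
    """Return the grid entry whose ensemble maximizes the validation metric."""
    return max(grid, key=lambda w: metric(ensemble_scores(per_model_scores, w)))


# Two models scoring the same two candidates (made-up numbers).
scores = [[0.8, 0.2], [0.6, 0.4]]
combined = ensemble_scores(scores, weights=[3.0, 1.0])

# Stand-in validation metric: prefer a first-candidate score close to 0.75.
best = select_weights([(1.0, 1.0), (3.0, 1.0), (1.0, 3.0)], scores,
                      metric=lambda s: -abs(s[0] - 0.75))
```

Weights tuned this way are fit to the validation split, which is why the ledger counts them as a free parameter.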


