VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

Dongmei Jiang; Haoyu Zhang; Liqiang Nie; Meng Liu; Qiaohui Chu; Weili Guan; Yisen Feng

VISTA wins the Ego4D short-term object interaction anticipation challenge by fusing frozen video features into a pretrained object detector.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-21 05:52 UTC pith:HE3JL23Q

Load-bearing objection: This tech report won the Ego4D challenge with off-the-shelf fusion but skimps on the supporting numbers.

arxiv 2605.20901 v1 pith:HE3JL23Q submitted 2026-05-20 cs.CV cs.AI

classification cs.CV cs.AI

keywords egocentric vision · short-term anticipation · object interaction prediction · feature modulation · frozen video encoder · challenge submission

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system that takes an egocentric video up to the current moment and predicts the next human-object interaction: the object's future bounding box, its category, the verb of the action, and the time until contact. It starts with object proposals from an image detector, pretrained on static images, applied to the last observed frame, then adds short-term temporal context from a video model whose weights stay frozen. The temporal signal reaches the detector through feature modulation and region-level fusion, after which separate heads predict each piece of the interaction output. An ensemble of such predictions is submitted to the official server. The approach matters because it shows how off-the-shelf spatial detectors can gain temporal awareness for prediction tasks without retraining the expensive video encoder.
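
The interaction output described above can be pictured as one record per predicted interaction. The sketch below is illustrative only: the class name `StaPrediction`, the field names, the box convention, and the units are assumptions, not the challenge's official submission schema. The `score` field corresponds to the confidence score mentioned in the paper's abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StaPrediction:
    """One anticipated interaction (hypothetical container, not an official schema)."""
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels
    noun: str                # category of the future active object
    verb: str                # anticipated action
    time_to_contact: float   # assumed to be in seconds
    score: float             # interaction confidence, assumed to lie in [0, 1]

    def __post_init__(self) -> None:
        x1, y1, x2, y2 = self.box
        if x2 <= x1 or y2 <= y1:
            raise ValueError("box must have positive width and height")
        if self.time_to_contact < 0:
            raise ValueError("time_to_contact must be non-negative")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must lie in [0, 1]")


def top_k(predictions: List[StaPrediction], k: int = 5) -> List[StaPrediction]:
    """Return the k highest-scoring predictions for one timestamp."""
    return sorted(predictions, key=lambda p: p.score, reverse=True)[:k]
```

Validating the record at construction time keeps malformed boxes or negative times from reaching an evaluation script.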

Core claim

VISTA follows a StillFast-style design that generates object proposals from the last observed high-resolution frame with a COCO-pretrained Faster R-CNN ResNet-50 FPN, extracts clip-level egocentric context from a frozen V-JEPA temporal branch, injects that context via feature modulation and ROI-level fusion, and feeds the result to multi-head predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence; the resulting system achieves first place on the official EgoVis 2026 Ego4D STA Challenge server.

What carries the argument

Feature modulation and ROI-level context fusion that injects clip-level temporal representations from the frozen video branch directly into the spatial detection pathway.
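
The report describes this mechanism only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes the modulation is a FiLM-style per-channel scale and shift and that ROI-level fusion is plain concatenation of the clip vector onto each pooled proposal feature. The function names, the projection matrices `w_gamma` and `w_beta`, and all tensor sizes are hypothetical.

```python
import numpy as np


def film_modulate(feature_map, clip_context, w_gamma, w_beta):
    """Scale and shift each channel of a spatial feature map using a clip vector.

    feature_map:     (C, H, W) detector features from the last observed frame
    clip_context:    (D,) pooled feature from the frozen temporal branch
    w_gamma, w_beta: (C, D) learned projections (hypothetical parameters)
    """
    gamma = w_gamma @ clip_context  # (C,) per-channel scale offset
    beta = w_beta @ clip_context    # (C,) per-channel shift
    # Using (1 + gamma) leaves the map unchanged when the projections output zeros.
    return feature_map * (1.0 + gamma)[:, None, None] + beta[:, None, None]


def fuse_roi_context(roi_features, clip_context):
    """Append the same clip vector to every pooled proposal feature.

    roi_features: (N, F) one pooled vector per object proposal
    returns:      (N, F + D) fused features for the prediction heads
    """
    n, d = roi_features.shape[0], clip_context.shape[0]
    tiled = np.broadcast_to(clip_context, (n, d))
    return np.concatenate([roi_features, tiled], axis=1)


# Toy shapes, chosen only for illustration.
rng = np.random.default_rng(0)
fmap = rng.normal(size=(256, 32, 32))
ctx = rng.normal(size=(1024,))
modulated = film_modulate(fmap, ctx, np.zeros((256, 1024)), np.zeros((256, 1024)))
fused = fuse_roi_context(rng.normal(size=(10, 512)), ctx)
```

In this sketch only the two projections would be trained; the clip vector itself comes from a frozen encoder, which matches the report's statement that the temporal branch is not fine-tuned.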

Load-bearing premise

Short-horizon temporal dynamics for accurate noun, verb, and time-to-contact prediction can be captured by injecting features from a frozen video model into a static-image detector without any fine-tuning of the temporal branch.

What would settle it

A controlled ablation on the official challenge server in which the temporal feature injection is removed entirely and the resulting drop in noun, verb, and time-to-contact metrics is measured against the original first-place score.
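
The comparison this ablation calls for reduces to per-metric differences between two runs. The helper below is a hypothetical sketch: the function name and the dictionary layout are assumptions, and it treats every metric as higher-is-better, so an error-style metric would need its sign flipped before being passed in.

```python
def metric_drops(full, ablated):
    """Absolute and relative change per metric when a component is removed.

    full, ablated: dicts mapping metric name -> score (higher is better)
    """
    if full.keys() != ablated.keys():
        raise ValueError("both runs must report the same metrics")
    report = {}
    for name, base in full.items():
        drop = base - ablated[name]
        # Relative drop is undefined when the full system scores zero.
        relative = drop / base if base else float("nan")
        report[name] = {"absolute": drop, "relative": relative}
    return report
```

A positive `absolute` value means the full system scored higher than the ablated one on that metric.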

If this is right

Multi-head prediction heads simultaneously produce bounding-box refinement, noun category, verb category, time-to-contact value, and interaction confidence from the fused features.
Ensembling complementary predictions from multiple runs improves robustness on the benchmark without changing the core architecture.
The design separates high-resolution spatial detection on the final frame from short clip-level temporal context, allowing each component to be swapped independently.
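
The multi-head arrangement in the first point above can be sketched as five small predictors that share one fused proposal feature. Everything specific here is an assumption for illustration: the class name `StaHeads`, the use of single linear layers, the softplus on time-to-contact, the sigmoid on confidence, and the random initialisation. The report does not describe the head architecture at this level of detail.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class StaHeads:
    """Five linear heads sharing one fused proposal feature (illustrative only)."""

    def __init__(self, in_dim, num_nouns, num_verbs, seed=0):
        rng = np.random.default_rng(seed)

        def init(out_dim):
            return rng.normal(scale=0.01, size=(in_dim, out_dim))

        self.w_box = init(4)
        self.w_noun = init(num_nouns)
        self.w_verb = init(num_verbs)
        self.w_ttc = init(1)
        self.w_conf = init(1)

    def __call__(self, fused):
        """fused: (N, in_dim) -> dict of per-proposal predictions."""
        conf_logit = (fused @ self.w_conf)[:, 0]
        return {
            "box_deltas": fused @ self.w_box,            # (N, 4) refinement offsets
            "noun_probs": softmax(fused @ self.w_noun),  # (N, num_nouns)
            "verb_probs": softmax(fused @ self.w_verb),  # (N, num_verbs)
            # softplus, log(1 + exp(x)), keeps time-to-contact positive
            "ttc": np.logaddexp(0.0, fused @ self.w_ttc)[:, 0],
            # sigmoid maps the confidence logit into (0, 1)
            "confidence": 1.0 / (1.0 + np.exp(-conf_logit)),
        }
```

Because all heads read the same fused feature, any change to the fusion step affects every output at once, which is why an ablation of the temporal branch would be expected to move the noun, verb, and time-to-contact metrics together.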

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modulation-plus-ROI fusion pattern could be tested on other egocentric tasks that require both precise localization and short-term timing awareness.
Freezing the temporal encoder keeps training cost low, which may make the method practical for settings with limited labeled video data.
Extending the input clip length while keeping the temporal branch frozen would test whether the current short-horizon context is the main performance limiter.
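
To make the clip-length point concrete: this page quotes the paper as sampling eight observed frames at 2 FPS. The sketch below computes the timestamps such a sampler would read, assuming the clip ends exactly at the current timestamp and that timestamps before the start of the video are clamped to zero. Both of those choices, and the function name, are assumptions.

```python
def clip_timestamps(t_now, num_frames=8, fps=2.0):
    """Timestamps (seconds) of the observed frames, oldest first, ending at t_now.

    Frames are spaced 1 / fps apart; times before the video start are clamped to 0.
    """
    if num_frames < 1 or fps <= 0:
        raise ValueError("num_frames must be >= 1 and fps must be > 0")
    step = 1.0 / fps
    stamps = [t_now - step * i for i in range(num_frames - 1, -1, -1)]
    return [max(0.0, t) for t in stamps]
```

With the defaults, the clip spans (8 - 1) / 2 = 3.5 seconds of history. Doubling `num_frames` at the same rate would extend that to 7.5 seconds without touching the frozen encoder's weights.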

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

This tech report won the Ego4D challenge with off-the-shelf fusion but skimps on the supporting numbers.

Hey, the big thing with this one is it's a tech report for winning the Ego4D STA challenge at EgoVis 2026. They got first place by running a COCO-pretrained Faster R-CNN on the last frame and feeding in frozen V-JEPA clip features via modulation and ROI fusion, then ensembling a bit. It's not breaking new ground method-wise. Just a direct use of the StillFast setup with two pretrained models and standard tricks for combining them. The pipeline for getting object proposals, adding short-term context, and predicting the interaction details is laid out plainly.

What it does right is deliver a top result on the real challenge metric and commit to open-sourcing the code. That gives the field a solid, usable baseline for egocentric anticipation work that could feed into AR or robot planning. The weak part is how little they show. No actual numbers from the server, no ablations to see if the video features matter much, no look at where predictions go wrong. You have to take the first-place claim on faith from the organizers' evaluation.

This is the sort of thing for people in the Ego4D community or folks who want to implement something quick for temporal forecasting tasks. Not so much for someone hunting for fresh concepts. I'd say send it out for review. The external validation of the ranking makes it worth archiving, but it could use some added results to hold up better.

Referee Report

1 major / 1 minor

Summary. The manuscript describes VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. It uses a COCO-pretrained Faster R-CNN ResNet-50 FPN detector on the last observed frame combined with a frozen V-JEPA 2.1 temporal branch whose clip-level features are injected via modulation and ROI-level fusion. Multi-head predictors then produce refined boxes, noun and verb labels, time-to-contact regression, and confidence scores, with ensembling applied for the submitted predictions. The central claim is that this system achieved first place on the official challenge server.

Significance. If the reported ranking holds, the work provides a concrete example of fusing frozen pre-trained video representations with a spatial detector for short-horizon egocentric anticipation. The approach illustrates a practical, low-compute way to incorporate temporal context, and the planned code release would support reproducibility and follow-on studies in video understanding.

major comments (1)

Abstract: the assertion that 'Experimental results on the official challenge server show that VISTA achieves first place' is presented without any accompanying scores, baseline comparisons, ablation results, or error analysis. This is load-bearing for the manuscript's primary empirical claim, which rests entirely on an external server evaluation that is not documented or contextualized inside the paper.

minor comments (1)

The description of feature modulation and ROI-level fusion remains high-level; adding a diagram or pseudocode would improve clarity of how the frozen temporal features are integrated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our technical report. We agree that the primary empirical claim requires better documentation and contextualization within the manuscript. We address the major comment below and will incorporate the suggested changes in the revised version.

Referee: Abstract: the assertion that 'Experimental results on the official challenge server show that VISTA achieves first place' is presented without any accompanying scores, baseline comparisons, ablation results, or error analysis. This is load-bearing for the manuscript's primary empirical claim, which rests entirely on an external server evaluation that is not documented or contextualized inside the paper.

Authors: We acknowledge that the current abstract states the first-place result without supporting numerical details or internal analysis, which limits the reader's ability to evaluate the claim independently of the external leaderboard. As this is a concise technical report focused on the winning submission to the Ego4D STA challenge, the primary evidence is the official server ranking. However, we agree this should be better contextualized. In the revision we will expand the abstract to report the specific official metrics (e.g., the exact mAP, noun/verb accuracy, and TTC error values that secured first place). We will also add a short Results section containing the official challenge scores, a comparison table against other submitted methods where leaderboard data is public, and a summary of key internal ablations (feature modulation vs. fusion, ensembling impact) that informed the final design. This will document the external evaluation inside the paper while preserving the report's brevity.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical challenge report

The paper presents an empirical architecture (COCO-pretrained Faster R-CNN with frozen V-JEPA features injected via modulation and ROI fusion, followed by multi-head predictors and ensembling) and reports first-place ranking on the external EgoVis 2026 Ego4D STA challenge server. No mathematical derivations, equations, or self-referential definitions exist that reduce any claimed prediction or result to its own inputs by construction. The performance claim is externally validated and independent of internal modeling choices, satisfying the criteria for a self-contained empirical submission with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on transfer-learning assumptions about pretrained models and on post-hoc ensembling whose weights are not described.

free parameters (1)

ensemble weights
Weights used to combine complementary predictions are chosen after seeing validation performance.
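
The report does not say how the ensemble is formed, so the following is only one plausible reading of "ensemble weights": a convex combination of confidence scores from several runs. It assumes every run scores the same list of proposals in the same order, which real detection ensembles usually have to arrange by matching boxes first. The function name and argument layout are hypothetical.

```python
def ensemble_scores(run_scores, weights):
    """Convex combination of per-proposal confidence scores from several runs.

    run_scores: list of equal-length score lists, one list per run
    weights:    one non-negative weight per run (normalised here to sum to 1)
    """
    if not run_scores or len(run_scores) != len(weights):
        raise ValueError("need at least one run and one weight per run")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    if len({len(s) for s in run_scores}) != 1:
        raise ValueError("all runs must score the same proposals")
    total = sum(weights)
    return [
        sum(w * s[i] for w, s in zip(weights, run_scores)) / total
        for i in range(len(run_scores[0]))
    ]
```

Under this reading, the free parameter flagged in the ledger is the `weights` vector. Tuning it on validation data after seeing results is what makes it a post-hoc choice.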

axioms (1)

domain assumption COCO-pretrained Faster R-CNN and frozen V-JEPA 2.1 features transfer adequately to egocentric short-term anticipation without temporal-branch fine-tuning.
Explicitly stated by freezing V-JEPA and using the COCO detector directly.

pith-pipeline@v0.9.0 · 5788 in / 1127 out tokens · 50235 ms · 2026-05-21T05:52:18.469464+00:00 · methodology

Original abstract

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

Figures

Figures reproduced from arXiv: 2605.20901 by Dongmei Jiang, Haoyu Zhang, Liqiang Nie, Meng Liu, Qiaohui Chu, Weili Guan, Yisen Feng.

**Figure 1.** Overview of VISTA. VISTA combines a still-object detection branch with a frozen V-JEPA 2.1 temporal branch. Temporal …

**Figure 2.** Successful qualitative example. VISTA correctly local…

**Figure 3.** Failure qualitative example. VISTA is distracted by a …

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context... injected... through feature modulation and ROI-level context fusion"

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · unclear
Paper passage: "samples eight observed frames at 2 FPS"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FROST-STA: Frozen Dense Features for the Ego4D Short-Term Object Interaction Anticipation
cs.CV 2026-05 unverdicted novelty 3.0
FROST-STA ranks second in the Ego4D Short-Term Object Interaction Anticipation challenge with 5.13 mAP by adapting frozen V-JEPA features with object-centric heads and ensembling.

TAP-JEPA: Frozen Future-Latent Probing and Two-Stage Score Fusion for EPIC-KITCHENS-100 Action Anticipation
cs.CV 2026-05 unverdicted novelty 3.0
TAP-JEPA applies frozen V-JEPA features, latent future prediction, and two-stage fusion of attentive probes to reach 27.91% MT5R and second place on the EK-100 action anticipation leaderboard.

