pith. sign in

arxiv: 2605.19559 · v1 · pith:T6CLO4FLnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

Pith reviewed 2026-05-20 05:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords egocentric videomultimodal large language modelschain of thought reasoninggrounded reasoningbenchmarkhand-object interactionsspatio-temporal scene graphsoperation-centric reasoning
0
0 comments X

The pith

MLLMs often reach correct answers on egocentric tasks but cite evidence that does not match the video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoCoT-Bench to measure whether multimodal large language models can reason step by step about hand-object operations in first-person videos while keeping their explanations tied to explicit visual evidence. The benchmark supplies 3,172 QA pairs drawn from 351 videos and organizes them into perception, retrospection, anticipation, and high-level reasoning groups. Construction relies on spatio-temporal scene graphs to produce questions whose answers and rationales can be checked directly against the footage, followed by human review for relevance and detail. Experiments on current models indicate persistent trouble with fine-grained egocentric details and show frequent cases in which the final answer is correct yet the supporting rationale points to the wrong objects, times, or actions.

Core claim

EgoCoT-Bench supplies 3,172 verifiable QA pairs over 351 egocentric videos together with explicit step-by-step rationale annotations. The benchmark is generated by a spatio-temporal scene graph framework that produces questions whose correct answers and rationales are directly traceable to visible hand-object interactions and state changes; human annotators then refine the items for egocentric perspective and fine-grained quality. When existing MLLMs are tested, they continue to exhibit difficulties with fine-grained operation-centric reasoning and frequently generate explanations whose cited evidence is inconsistent with the chosen answer or with the actual video content.

What carries the argument

EgoCoT-Bench benchmark whose STSG-guided generation and human refinement produce QA pairs with verifiable step-by-step rationales that can be checked against spatio-temporal video evidence.

If this is right

  • Model training must add explicit penalties when generated rationales fail to reference the correct objects or time intervals in the video.
  • Applications that rely on first-person AI guidance, such as step-by-step assistance during physical tasks, will require stronger evidence alignment before they can be trusted.
  • Evaluation of future MLLMs should report both answer correctness and rationale-video consistency rather than answer accuracy alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same consistency checks could be added to third-person video benchmarks to test whether the grounding problem is specific to the egocentric viewpoint.
  • Success on this benchmark would be a useful signal that a model can support reliable real-time coaching for manipulation tasks from the user's own camera.
  • Training objectives that force rationales to cite specific frames or objects may close the gap faster than scaling alone.

Load-bearing premise

The spatio-temporal scene graph generation process plus human refinement produces questions and rationales that accurately reflect real egocentric operations and remain verifiably grounded in the video evidence.

What would settle it

A model that achieves high accuracy on both the final answer and the consistency between its rationales and the explicit video evidence across all four task groups would show that the reported difficulties have been overcome.

Figures

Figures reproduced from arXiv: 2605.19559 by Dian Jiao, Tianwei Lin, Wenqiao Zhang, Yang Dai.

Figure 1
Figure 1. Figure 1: Overview of EgoCoT-Bench. EgoCoT-Bench is a fine-grained benchmark for grounded and verifiable operation-centric [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall statistics of EgoCoT-Bench. Top: represen [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Fine-grained radar analysis on EgoCoT-Bench. Left: answer accuracy (%) across 12 subtasks. Middle: reasoning quality [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
read the original abstract

The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability for MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from \textbf{limited grounded rationale evaluation}, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce \textbf{EgoCoT-Bench}, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graphs (STSG) guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations that are answer-correct, but have evidence that is inconsistent with the answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces EgoCoT-Bench, a benchmark for grounded and verifiable operation-centric chain-of-thought reasoning in multimodal large language models (MLLMs) on egocentric videos. It comprises 3,172 QA pairs across 351 videos organized into four task groups and 12 sub-tasks covering perception/retrospection, anticipation, and high-level reasoning. The benchmark is generated via a spatio-temporal scene graph (STSG) framework and refined by human annotators. Experiments demonstrate persistent difficulties with fine-grained egocentric reasoning and reveal frequent cases of answer-correct but evidence-inconsistent model explanations.

Significance. If the reported construction and evaluation protocols hold, EgoCoT-Bench addresses a genuine gap in egocentric video benchmarks by prioritizing verifiable grounding of rationales over answer accuracy alone. The per-task metrics and qualitative examples provide concrete support for the claims about model limitations in fine-grained operation-centric reasoning. This could serve as a useful testbed for advancing MLLM development in dynamic, first-person settings.

major comments (1)
  1. The stress-test concern about missing details on model selection, statistical significance, and inconsistency measurement does not land after review of the full manuscript; the STSG-guided generation, task breakdowns, and quality-control steps supply sufficient procedural detail to support the central benchmark claims and findings.
minor comments (3)
  1. Abstract: Consider adding a brief parenthetical note on the total number of sub-tasks when first mentioning the four task groups for quicker reader orientation.
  2. §4 (experimental setup): The criteria for selecting the specific MLLMs evaluated could be stated more explicitly to aid reproducibility and contextualize performance comparisons.
  3. Figure captions: Ensure all qualitative examples include explicit references to the corresponding STSG elements used in verification.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive assessment of EgoCoT-Bench and for recommending minor revision. We appreciate the confirmation that the manuscript supplies adequate procedural detail on the STSG-guided generation, task structure, and quality controls to support our benchmark claims and findings.

read point-by-point responses
  1. Referee: The stress-test concern about missing details on model selection, statistical significance, and inconsistency measurement does not land after review of the full manuscript; the STSG-guided generation, task breakdowns, and quality-control steps supply sufficient procedural detail to support the central benchmark claims and findings.

    Authors: We are grateful for this assessment. The manuscript details the STSG construction pipeline, the four task groups and twelve sub-tasks, the human annotation refinement protocol, and the per-task evaluation metrics (including explicit checks for answer-evidence consistency). These elements were designed precisely to enable reproducible model selection, statistical reporting, and inconsistency quantification, thereby addressing the concerns the referee references. revision: no

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with independent construction and evaluation

full rationale

This paper introduces an empirical benchmark (EgoCoT-Bench) for evaluating MLLMs on egocentric video tasks, constructed via an STSG-guided generation framework followed by human annotation for quality control. The central claims rest on dataset creation procedures, task breakdowns, and reported experimental metrics showing model difficulties with fine-grained reasoning and inconsistent rationales. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The benchmark's verifiability is supported by explicit procedural details rather than reducing to prior self-referential inputs. The derivation chain is self-contained against external benchmarks and human refinement steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the assumption that STSGs can reliably capture fine-grained hand-object interactions and that human refinement produces verifiably correct annotations without introducing new biases.

axioms (1)
  • domain assumption Human annotators can reliably ensure correctness, egocentric relevance, and fine-grained quality of generated QA pairs
    Invoked in the description of benchmark construction and refinement process.

pith-pipeline@v0.9.0 · 5824 in / 1332 out tokens · 39389 ms · 2026-05-20T05:49:29.329400+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 8 internal anchors

  1. [1]

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, and Jiankang Deng. 2025. LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training. InarXiv

  2. [2]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. 2025. Qwen3-VL Technical Report.arXiv preprint arXiv:2511.21631(2025)

  3. [3]

    Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, and Yu-Chiang Frank Wang. 2024. ReXTime: A Benchmark Suite for Reasoning- Across-Time in Videos.arXiv preprint arXiv:2406.19392(2024)

  4. [4]

    Qirui Chen, Shangzhe Di, and Weidi Xie. 2025. Grounded multi-hop videoqa in long-form egocentric videos. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 2159–2167

  5. [5]

    Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, and Xihui Liu. 2024. EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning. arXiv:2312.06722 [cs.CV] https://arxiv.org/abs/2312.06722

  6. [6]

    Sijie Cheng, Zhicheng Guo, Jingwen Wu, Kechen Fang, Peng Li, Huaping Liu, and Yang Liu. 2024. EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 14291–14302

  7. [7]

    Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong

  8. [8]

    V-star: Benchmarking video-llms on video spatio-temporal reasoning.arXiv preprint arXiv:2503.11495, 2025

    V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning. arXiv:2503.11495 [cs.CV] https://arxiv.org/abs/2503.11495

  9. [9]

    Plizzari Chiara, Tonioni Alessio, Yongqin Xian, Ace Kulshrestha, and Tombari Federico. 2025. Omnia de EgoTempo: Benchmarking Temporal Understand- ing of Multi-Modal LLMs in Egocentric Videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  10. [10]

    Yang Dai, Jianxiang An, Tianwei Lin, Hongyang He, Hongzhe Huang, Wenqiao Zhang, Zheqi Lv, Siliang Tang, and Yueting Zhuang. 2025. Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs.arXiv preprint arXiv:2506.23940(2025)

  11. [11]

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, , Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. 2022. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100.International Journal of Computer Vision (IJCV)130 (2022), 33–55. https://doi.org/10...

  12. [12]

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. 2018. Scaling Egocentric Vision: The EPIC- KITCHENS Dataset. InProceedings of the European Conference on Computer Vision (ECCV)

  13. [13]

    Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Hig- gins, Sanja Fidler, David Fouhey, and Dima Damen. 2022. EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations. InProceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks

  14. [14]

    Shangzhe Di and Weidi Xie. 2024. Grounded Question-Answering in Long Egocentric Videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12934–12943

  15. [15]

    Ragusa Francesco, Furnari Antonino, and Farinella Giovanni, Maria. 2022. MEC- CANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain. arXiv:2209.08691 [cs.CV]

  16. [16]

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. 2025. Video- mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InCVPR

  17. [17]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al

  18. [18]

    InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Ego4D: Around the World in 3,000 Hours of Egocentric Video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 18995–19012

  19. [19]

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, et al . 2024. Ego-Exo4D: Understanding Skilled Human Activ- ity from First- and Third-Person Perspectives. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ...

  20. [20]

    Sigurdsson Gunnar, A., Gupta Abhinav, Schmid Cordelia, Farhadi Ali, and Alahari Karteek. 2018. Actor and Observer: Joint Modeling of First and Third-Person Videos. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  21. [21]

    Wu Haoning, Li Dongxu, Chen Bei, and Li Junnan. 2024. LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding. arXiv:2407.15754 [cs.CV] https://arxiv.org/abs/2407.15754

  22. [22]

    Baoxiong Jia, Ting Lei, Song-Chun Zhu, and Siyuan Huang. 2022. EgoTaskQA: Understanding Human Tasks in Egocentric Videos. InThe 36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks

  23. [23]

    Mangalam Karttikeya, Akshulakov Raiymbek, and Malik Jitendra. 2023. EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding. arXiv:2308.09126 [cs.CV] https://arxiv.org/abs/2308.09126

  24. [24]

    Li Kunchang, Wang Yali, He Yinan, Li Yizhuo, Wang Yi, Liu Yi, Wang Zun, Xu Jilan, Chen Guo, Luo Ping, Wang Limin, and Qiao Yu. 2023. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark.arXiv(2023). https://arxiv.org/abs/2311.17005

  25. [25]

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. 2024. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models.arXiv preprint arXiv:2407.07895(2024)

  26. [26]

    Yin Li, Miao Liu, and James M. Rehg. 2018. In the Eye of Beholder: Joint Learn- ing of Gaze and Actions in First Person Video. InProceedings of the European Conference on Computer Vision (ECCV)

  27. [27]

    Yin Li, Miao Liu, and James M Rehg. 2018. In the eye of beholder: Joint learning of gaze and actions in first person video. InProceedings of the European conference on computer vision (ECCV). 619–635

  28. [28]

    Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, et al . 2025. Healthgpt: A medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation.arXiv preprint arXiv:2502.09838(2025)

  29. [29]

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan

  30. [30]

    InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

  31. [31]

    Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenaman- dra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, et al. 2024. OpenEQA: Embodied Question Answering in the Era of Foundation Models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16488–16498

  32. [32]

    OpenAI. 2025. GPT-5.1 Model. https://developers.openai.com/api/docs/models/ gpt-5.1. Official OpenAI API documentation; accessed 2026-03-27

  33. [33]

    OpenAI. 2025. GPT-5.2 Model. https://developers.openai.com/api/docs/models/ gpt-5.2. Official OpenAI API documentation; accessed 2026-03-27

  34. [34]

    Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, and Dima Damen. 2025. HD-EPIC: A Highly- Detailed Egocentric Video Dataset. InProceedings of ...

  35. [35]

    Qwen Team. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen. ai/blog?id=qwen3.5

  36. [36]

    Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. 2021. The MECCANO Dataset: Understanding Human-Object Inter- actions from Egocentric Videos in an Industrial-like Domain. InIEEE Winter Conference on Application of Computer Vision (W ACV). arXiv:2010.05654

  37. [37]

    Ivan Rodin, Tz-Ying Wu, Kyle Min, Sharath Nittur Sridhar, Antonino Furnari, Subarna Tripathi, and Giovanni Maria Farinella. 2025. EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. 2732–2737. Dai et al

  38. [38]

    Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. 2022. Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 21096– 21106

  39. [39]

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. 2025. Video Understanding with Large Language Models: A Survey.IEEE Transactions on Circuits and Systems for Video Technology(2025). doi:10.1109/TCSVT.2025.3566695

  40. [40]

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. 2025. InternVL3. 5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency.arXiv preprint arXiv:2508.18265(2025)

  41. [41]

    Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, Felipe Vieira Frujeri, Neel Joshi, and Marc Pollefeys. 2023. HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World. InProceedings of the IEEE/CVF International Conference on Computer Vision (I...

  42. [42]

    Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan

  43. [43]

    In Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS)

    STAR: A Benchmark for Situated Reasoning in Real-World Videos. In Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS)

  44. [44]

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. 2021. NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 9777–9786

  45. [45]

    Zhao Yilun, Xie Lujing, Zhang Haowei, Gan Guo, Long Yitao, Hu Zhiyuan, Hu Tongyan, Chen Weiyuan, Li Chuhan, Song Junyang, et al . 2025. MMVU: Measuring Expert-Level Multi-Discipline Video Understanding. arXiv:2501.12380 [cs.CV] https://arxiv.org/abs/2501.12380

  46. [46]

    Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, et al. 2025. Videorefer suite: Advancing spatial-temporal object understanding with video llm. InProceedings of the Computer Vision and Pattern Recognition Conference. 18970–18980

  47. [47]

    Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li, Jun Xiao, Lei Zhang, and Beng Chin Ooi. 2025. Pixelrefer: A unified framework for spatio-temporal object referring with arbitrary granularity.arXiv preprint arXiv:2510.23603(2025)

  48. [48]

    Yuqian Yuan, Wenqiao Zhang, Juekai Lin, Yu Zhong, Mingjian Gao, Binhe Yu, Yunqi Cao, Wentong Li, Yueting Zhuang, and Beng Chin Ooi. 2026. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation. arXiv preprint arXiv:2604.11789(2026)

  49. [49]

    Liu Yuanxin, Li Shicheng, Liu Yi, Wang Yuxiang, Ren Shuhuai, Li Lei, Chen Sishuo, Sun Xu, and Hou Lu. 2024. TempCompass: Do Video LLMs Really Understand Videos?arXiv preprint arXiv: 2403.00476(2024)

  50. [50]

    He Yuping, Huang Yifei, Chen Guo, Pei Baoqi, Xu Jilan, Lu Tong, and Pang Jiangmiao. 2025. EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs.arXiv(2025). https://arxiv.org/abs/2507.18342

  51. [51]

    Yuan Yuqian, Dang Ronghao, Li Long, Li Wentong, Jiao Dian, Li Xin, Zhao Deli, Wang Fan, Zhang Wenqiao, Xiao Jun, and Zhuang Yueting. 2025. EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?arXiv (2025). https://arxiv.org/abs/2506.05287

  52. [52]

    Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-LLaMA: An Instruction- tuned Audio-Visual Language Model for Video Understanding. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Yansong Feng and Els Lefever (Eds.). Association for Computational Linguistics, Singapore, 543–553. doi:10.18653/...

  53. [53]

    Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, et al. 2024. Hyperllava: Dynamic visual and language expert tuning for multimodal large language models.arXiv preprint arXiv:2403.13447(2024)

  54. [54]

    Wenqiao Zhang, Changshuo Liu, Lingze Zeng, Bengchin Ooi, Siliang Tang, and Yueting Zhuang. 2023. Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels. InProceedings of the IEEE/CVF International Conference on Computer Vision. 1423–1432

  55. [55]

    Wenqiao Zhang, Zheqi Lv, Hao Zhou, Jia-Wei Liu, Juncheng Li, Mengze Li, Yun- fei Li, Dongping Zhang, Yueting Zhuang, and Siliang Tang. 2024. Revisiting the domain shift and sample uncertainty in multi-source active domain trans- fer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16751–16761

  56. [56]

    Wenqiao Zhang, Lei Zhu, James Hallinan, Shengyu Zhang, Andrew Makmur, Qingpeng Cai, and Beng Chin Ooi. 2022. Boostmis: Boosting medical image semi-supervised learning with adaptive pseudo labeling and informative active annotation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 20666–20676

  57. [57]

    Yu Zhong, Tianwei Lin, Ruike Zhu, Yuqian Yuan, Haoyu Zheng, Liang Liang, Wenqiao Zhang, Feifei Shao, Haoyuan Li, Wanggui He, et al. 2026. Unified Person- alized Understanding, Generating and Editing.arXiv preprint arXiv:2601.06965 (2026)

  58. [58]

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. 2024. Mlvu: A comprehensive bench- mark for multi-task long video understanding.arXiv preprint arXiv:2406.04264 (2024)

  59. [59]

    Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, and Angela Yao. 2025. EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR). 3363–3373