pith. machine review for the scientific record.

arxiv: 2605.08974 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:27 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI
keywords video understanding · multimodal large language models · hallucinations · spatio-temporal reasoning · object tracking · benchmark

The pith

Video AI models hallucinate less when they build explicit trajectories for each object across frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that hallucinations in video-understanding models stem from weak spatio-temporal monitoring, meaning they lose track of object identities, states, and relations over time. It introduces STEMO-Bench, a dataset of verified object-centric facts that tests intermediate reasoning steps instead of accepting final answers that might be correct by chance. It then presents STEMO-Track, which breaks scenes into chunks, extracts object states, and aggregates them into trajectories for later reasoning. If this diagnosis holds, the approach would make models more consistent on questions about motion and change rather than relying on static cues or priors. The experiments show lower hallucination rates and better temporal coherence compared to existing multimodal models.
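As a concrete illustration of the chunk-then-aggregate idea, here is a minimal Python sketch of how per-chunk object states might be collected into trajectories. The data classes, the `describe_chunk` stub, and its toy output are assumptions made for illustration, not the paper's released implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectState:
    """One object's state observed within a single temporal chunk."""
    object_id: str
    chunk_index: int
    attributes: dict   # e.g. {"shirt": "yellow", "number": "10"}
    action: str        # e.g. "passes the ball"


@dataclass
class Trajectory:
    """All observed states for one object, ordered by chunk index."""
    object_id: str
    states: list = field(default_factory=list)

    def add(self, state: ObjectState) -> None:
        self.states.append(state)
        self.states.sort(key=lambda s: s.chunk_index)


def describe_chunk(chunk_index: int) -> list:
    """Stand-in for a per-chunk state extractor (an MLLM call in practice)."""
    action = "passes the ball" if chunk_index == 0 else "moves toward the goal"
    return [ObjectState("player_10", chunk_index, {"shirt": "yellow"}, action)]


def build_trajectories(num_chunks: int) -> dict:
    """Aggregate per-chunk states into one trajectory per object."""
    trajectories: dict = {}
    for i in range(num_chunks):
        for state in describe_chunk(i):
            trajectories.setdefault(state.object_id, Trajectory(state.object_id)).add(state)
    return trajectories


if __name__ == "__main__":
    for obj_id, traj in build_trajectories(num_chunks=3).items():
        print(obj_id, [(s.chunk_index, s.action) for s in traj.states])
```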

Core claim

The authors claim that decomposing video queries into object-centric sub-questions and constructing explicit trajectories via chunk-wise state extraction and temporal aggregation directly addresses the root cause of hallucinations, yielding measurable gains in reasoning consistency on dynamic scenes as measured by their new benchmark.
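To make the decomposition concrete, the sketch below shows one way a benchmark item with a target question and its object-centric sub-questions could be represented. The field names and the example content (adapted from the soccer example in Figure 5) are illustrative assumptions, not the actual STEMO-Bench schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubQuestion:
    text: str     # an intermediate, object-centric fact to check
    answer: str   # human-verified ground truth


@dataclass(frozen=True)
class BenchmarkItem:
    target_question: str
    target_answer: str
    sub_questions: tuple  # every one must be answered correctly for credit


example = BenchmarkItem(
    target_question="Does the player in the yellow shirt, number 10, score a goal?",
    target_answer="yes",
    sub_questions=(
        SubQuestion("Who receives the pass at t0?", "the player in the yellow shirt, number 10"),
        SubQuestion("What does that player do at t2?", "scores a goal"),
    ),
)
```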

What carries the argument

STEMO-Track, the object-centric framework that constructs structured object trajectories by extracting states chunk-wise and aggregating them temporally to support persistent reasoning.

If this is right

  • Models would produce fewer fabricated details about object locations and relations in moving scenes.
  • Benchmark scores would better separate genuine temporal understanding from answers correct only by coincidence.
  • Reasoning consistency would improve across sequences longer than single-frame cues allow.
  • Video question-answering systems would become more reliable for tasks involving state changes over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory-construction idea could be tested on tasks like action prediction or anomaly detection in surveillance footage.
  • Future models might embed similar explicit object memory modules as a standard component rather than post-hoc fixes.
  • Scalability to very long videos could be checked by measuring how trajectory aggregation behaves as sequence length grows.
  • Connections to object permanence in developmental psychology might suggest human-like evaluation protocols.

Load-bearing premise

The primary cause of hallucinations in dynamic video scenes is a failure of spatio-temporal monitoring that can be diagnosed and fixed by decomposing queries into object-centric sub-questions and constructing explicit trajectories.

What would settle it

Run the framework on a set of video questions whose correct answers depend on global scene context rather than individual object paths; if hallucination rates on those questions stay unchanged, the monitoring claim would not hold.
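A rough sketch of that falsification test follows, under the assumption that hallucination is measured as the fraction of answers contradicting verified references; the helper names and the tolerance value are illustrative, not taken from the paper.

```python
def hallucination_rate(predictions, references):
    """Fraction of answers that contradict the verified reference answers."""
    wrong = sum(1 for pred, ref in zip(predictions, references) if pred != ref)
    return wrong / max(len(references), 1)


def monitoring_claim_challenged(baseline_answers, trajectory_answers, references, tolerance=0.01):
    """True if adding explicit trajectories leaves hallucinations essentially
    unchanged on global-context questions, the outcome that the test above
    says would undercut the spatio-temporal monitoring claim."""
    gap = hallucination_rate(baseline_answers, references) - hallucination_rate(trajectory_answers, references)
    return abs(gap) <= tolerance
```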

Figures

Figures reproduced from arXiv: 2605.08974 by Anh Tuan Luu, Bryan Hooi, Chunyan Miao, Cong-Duy Nguyen, Khoi Le, Quynh Vo, See-kiong Ng, Shuicheng Yan, Thong Nguyen, Tri Cao.

Figure 1: Comparison of existing evaluation benchmarks and MLLM architectures against our …

Figure 2: Overview of the STEMO-Bench construction pipeline.

Figure 3: STEMO-Track divides a video into temporal chunks, extracts object states, aggregates them …

Figure 4: Qualitative analysis of target and sub-question behavior.

Figure 5: Example of the STEMO-Bench dataset. To rigorously evaluate model faithfulness, STEMO-Bench requires models to correctly answer all underlying supporting sub-questions in addition to the complex target question. This ensures the model accurately tracks temporal action sequences—from passing (t0) to scoring (t2)—and grounds specific visual attributes (e.g., “yellow shirt”, “number 10”) rather than merely gue…

Figure 6: STEMO-Bench evaluation on multi-step object interactions. Building on our faithfulness criteria, models must resolve all supporting sub-questions to demonstrate true comprehension. In this scenario, the model must accurately follow a complex sequence—from the initial display of all artifacts (t0) to cutting machine 1 (t3)—while distinguishing specific visual attributes (“number 1”, “number 2”, “cake”) to p…

Figure 7: STEMO-Bench evaluation of temporal ordering. This example further illustrates how mandatory sub-questions enforce reasoning faithfulness. To succeed, the model must track the precise sequence of events—from placing ingredients (t0) to smashing both items (t3)—and ground specific visual objects (“strawberries”, “cookies”). This verifies that the model understands the correct temporal ordering of interaction…

Figure 8: STEMO-Bench evaluation of actor disambiguation over time. This scenario highlights how STEMO-Bench tests complex temporal tracking across multiple subjects. To accurately answer the target question, the model must resolve the chronological sequence of events—from the first student raising his hand (t0) and the second student raising his hand (t1) to the eventual celebration (t3). Mandatory supporting sub-q…
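The captions above all describe the same strict scoring rule: an item counts only if the model answers the target question and every supporting sub-question correctly. A minimal sketch of that rule, with hypothetical function and argument names, might look like this.

```python
def is_faithful(target_correct: bool, sub_question_results: list) -> bool:
    """Credit an item only if the target answer and every supporting
    sub-question are correct; a right final answer with a wrong intermediate
    fact is treated as possibly correct by coincidence."""
    return target_correct and all(sub_question_results)


# Example: correct final answer, but one supporting fact is wrong.
print(is_faithful(True, [True, False, True]))  # False
```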
read the original abstract

While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that hallucinations in video MLLMs stem primarily from failures in spatio-temporal monitoring of object identities, states, and relations. It introduces STEMO-Bench, a benchmark of human-verified object-centric facts that decomposes queries into sub-questions to distinguish genuine temporal understanding from coincidental correctness. To address the exposed failures, it proposes STEMO-Track, a framework that explicitly constructs structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments are said to show that this object-centric approach significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.

Significance. If the results hold, the work provides a diagnostic benchmark and targeted engineering framework for improving reliability of video MLLMs in dynamic scenes, with credit due for the new human-verified data collection in STEMO-Bench and the explicit trajectory construction in STEMO-Track. This could help shift evaluation from final-answer accuracy to intermediate reasoning verification, though the absence of reported quantitative metrics, error bars, or external benchmark comparisons in the abstract limits immediate assessment of broader impact.

major comments (3)
  1. [Abstract and Experiments] The claim that 'extensive experiments demonstrate' significant reductions in hallucinations lacks any quantitative results, error bars, dataset details, ablation studies, or comparisons to unmodified prior video QA benchmarks, making the data-to-claim link unverifiable and the causal attribution to spatio-temporal monitoring insecure.
  2. [STEMO-Bench] The object-centric query decomposition and human-verified facts are designed precisely to expose deficits that STEMO-Track addresses via trajectory aggregation; this creates a risk that gains reflect alignment with the benchmark's structure rather than a general fix, especially without ablations isolating trajectory construction from simpler decomposition or results on standard hallucination benchmarks.
  3. [Introduction and Method] The assumption that spatio-temporal monitoring failure is the primary root cause (as opposed to reliance on local cues or priors) is load-bearing for the framework's motivation, but the benchmark design does not isolate this from other factors, weakening the justification for the chunk-wise extraction and aggregation approach.
minor comments (1)
  1. [Abstract] The phrasing 'significantly reduces hallucinated answers' should be accompanied by at least a brief summary of key metrics to support the claim without requiring readers to reach the full experiments section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the careful reading and valuable feedback on our manuscript. We address each of the major comments in detail below, indicating the revisions we plan to make.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The claim that 'extensive experiments demonstrate' significant reductions in hallucinations lacks any quantitative results, error bars, dataset details, ablation studies, or comparisons to unmodified prior video QA benchmarks, making the data-to-claim link unverifiable and the causal attribution to spatio-temporal monitoring insecure.

    Authors: We agree that the abstract would benefit from including key quantitative results to immediately substantiate the claims. The Experiments section provides detailed tables with performance metrics, including reductions in hallucinated answers and improvements in consistency, along with standard deviations, dataset statistics for STEMO-Bench, ablation studies on the components of STEMO-Track, and comparisons to state-of-the-art video MLLMs. To address the concern, we will revise the abstract to incorporate specific quantitative highlights from these experiments. This will strengthen the verifiability of our claims without altering the core findings. revision: yes

  2. Referee: [STEMO-Bench] The object-centric query decomposition and human-verified facts are designed precisely to expose deficits that STEMO-Track addresses via trajectory aggregation; this creates a risk that gains reflect alignment with the benchmark's structure rather than a general fix, especially without ablations isolating trajectory construction from simpler decomposition or results on standard hallucination benchmarks.

    Authors: This is a valid concern regarding potential benchmark overfitting. We have performed ablations in the Experiments section that isolate the contribution of the trajectory construction and aggregation steps from basic query decomposition. These show that the full STEMO-Track framework provides additional gains beyond decomposition alone. To further demonstrate generalizability, we will add evaluations on standard video hallucination and QA benchmarks in the revised manuscript, allowing readers to assess performance beyond STEMO-Bench. revision: partial

  3. Referee: [Introduction and Method] The assumption that spatio-temporal monitoring failure is the primary root cause (as opposed to reliance on local cues or priors) is load-bearing for the framework's motivation, but the benchmark design does not isolate this from other factors, weakening the justification for the chunk-wise extraction and aggregation approach.

    Authors: We appreciate the referee pointing out the need for stronger isolation of the root cause. The design of STEMO-Bench uses human-verified object-centric facts and query decomposition specifically to require persistent tracking over time, as local cues or statistical priors are insufficient for answering the sub-questions correctly (as confirmed during human verification). We will expand the Introduction and Method sections with additional examples and analysis to better illustrate how the benchmark controls for these alternative factors, thereby providing stronger motivation for the chunk-wise state extraction and temporal aggregation in STEMO-Track. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark and framework with independent experimental grounding

full rationale

The paper presents an engineering contribution consisting of a new benchmark (STEMO-Bench) for diagnosing spatio-temporal monitoring failures via object-centric query decomposition and a corresponding framework (STEMO-Track) using chunk-wise extraction and aggregation. No equations, parameter fittings, or derivations appear in the manuscript. The central claims rest on human-verified facts in the benchmark and comparative experiments against prior MLLMs, without any step that reduces by construction to its own inputs, self-citations, or renamed known results. The evaluation design is explicitly motivated by the hypothesized failure mode, but this constitutes a standard empirical hypothesis test rather than a self-referential loop. No load-bearing premise collapses to a prior self-citation or fitted quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that hallucinations arise mainly from missing spatio-temporal monitoring and that explicit trajectory construction will fix it; no free parameters are described, and the two invented entities are the newly introduced benchmark and framework rather than physical posits.

axioms (1)
  • domain assumption: Hallucinations in dynamic video scenes primarily stem from failure in spatio-temporal monitoring of object identities, states, and relations.
    Explicitly stated as the authors' argument in the abstract.
invented entities (2)
  • STEMO-Bench (no independent evidence)
    purpose: Benchmark of human-verified object-centric facts for intermediate reasoning evaluation
    Newly introduced evaluation resource
  • STEMO-Track (no independent evidence)
    purpose: Object-centric framework using chunk-wise state extraction and temporal aggregation
    Newly proposed method

pith-pipeline@v0.9.0 · 5507 in / 1368 out tokens · 44899 ms · 2026-05-12T02:27:32.675862+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 4 internal anchors

  1. [1]

    Introducing claude sonnet 4.6

    Anthropic. Introducing claude sonnet 4.6. Anthropic, 2026

  2. [2]

    Mash-vlm: Mitigating action-scene hallucination in video-llms through disentangled spatial-temporal representations

    Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, and Jinwoo Choi. Mash-vlm: Mitigating action-scene hallucination in video-llms through disentangled spatial-temporal representations. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13744–13753, 2025

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  4. [4]

    V-star: Benchmarking video-llms on video spatio-temporal reasoning

    Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video-llms on video spatio-temporal reasoning. arXiv preprint arXiv:2503.11495, 2025

  5. [5]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, et al. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

  6. [6]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In 2...

  7. [7]

    Exploring hallucination of large multimodal models in video understanding: Benchmark, analysis and mitigation, 2025

    Hongcheng Gao, Jiashu Qu, Jingyi Tang, Baolong Bi, Yue Liu, Hongyu Chen, Li Liang, Li Su, and Qingming Huang. Exploring hallucination of large multimodal models in video understanding: Benchmark, analysis and mitigation, 2025

  8. [8]

    Mitigating hallucinations in video large language models via spatiotemporal-semantic contrastive decoding

    Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Han Bao, Zonghui Wang, and Wenzhi Chen. Mitigating hallucinations in video large language models via spatiotemporal-semantic contrastive decoding. arXiv preprint arXiv:2601.22574, 2026

  9. [9]

    Gemma 4: Our most capable open models to date

    Google. Gemma 4: Our most capable open models to date. Google Blog, 2026

  10. [10]

    Gemini 3.1 pro model card

    Google DeepMind. Gemini 3.1 pro model card. Google DeepMind Model Card, 2026

  11. [11]

    Timeblind: A spatio-temporal compositionality benchmark for video llms

    Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, and Gedas Bertasius. Timeblind: A spatio-temporal compositionality benchmark for video llms. arXiv preprint arXiv:2602.00288, 2026

  12. [12]

    Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding

    Chaoyu Li, Eun Woo Im, and Pooyan Fazli. Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13723–13733, 2025

  13. [13]

    Videochat-flash: Hierarchical compression for long-context video modeling

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, and Limin Wang. Videochat-flash: Hierarchical compression for long-context video modeling. In The Fourteenth International Conference on Learning Representations, 2026

  14. [14]

    Sti-bench: Are mllms ready for precise spatial-temporal world understanding?

    Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025

  15. [15]

    Videomind: A chain-of-lora agent for temporal-grounded video reasoning

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for temporal-grounded video reasoning. arXiv preprint arXiv:2503.13444, 2025

  16. [16]

    Elv-halluc: Benchmarking semantic aggregation hallucinations in long video understanding, 2025

    Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, and Lewei Lu. Elv-halluc: Benchmarking semantic aggregation hallucinations in long video understanding, 2025

  17. [17]

    Egoschema: a diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshkulakov, and Jitendra Malik. Egoschema: a diagnostic benchmark for very long-form video language understanding. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.

  18. [18]

    Cosmos-reason2

    NVIDIA. Cosmos-reason2. NVIDIA Cosmos Documentation, 2026. Reasoning vision-language model for Physical AI and robotics

  19. [19]

    Gpt-4o system card

    OpenAI. Gpt-4o system card. OpenAI, 2024

  20. [20]

    Introducing gpt-5

    OpenAI. Introducing gpt-5. OpenAI, 2025

  21. [21]

    Qwen3.5-Omni Technical Report

    Qwen Team. Qwen3.5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

  22. [22]

    Understanding long videos in one multimodal language model pass

    Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S Ryoo. Understanding long videos with multimodal language models.arXiv preprint arXiv:2403.16998, 2024

  23. [23]

    Argus: Hallucination and omission evaluation in video-llms

    Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, and Tom Goldstein. Argus: Hallucination and omission evaluation in video-llms. In 2025 IEEE/CVF International Conference on Computer Vision (ICCV), pages 20280–20290, 2025

  24. [24]

    Traveler: A modular multi-lmm agent framework for video question-answering

    Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, and Roei Herzig. Traveler: A modular multi-lmm agent framework for video question-answering. In Proceedings of EMNLP, 2024

  25. [25]

    Smartsight: Mitigating hallucination in video-llms without compromising video understanding via temporal attention collapse

    Yiming Sun, Mi Zhang, Feifei Li, Geng Hong, and Min Yang. Smartsight: Mitigating hallucination in video-llms without compromising video understanding via temporal attention collapse. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9251–9259, 2026

  26. [26]

    Video understanding with large language models: A survey

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for ...

  27. [27]

    Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. arXiv preprint arXiv:2505.12434, 2025

  28. [28]

    Internvideo2.5: Empowering video mllms with long and rich context modeling, 2025

    Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2.5: Empowering video mllms with long and rich context modeling, 2025

  29. [29]

    Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models, 2024

    Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models, 2024

  30. [30]

    Videochat-a1: Thinking with long videos by chain-of-shot reasoning

    Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, and Yali Wang. Videochat-a1: Thinking with long videos by chain-of-shot reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 10467–10475, 2026

  31. [31]

    Mitigating hallucinations in multi-modal large language models via image token attention-guided decoding

    Xinhao Xu, Hui Chen, Mengyao Lyu, Sicheng Zhao, Yizhe Xiong, Zijia Lin, Jungong Han, and Guiguang Ding. Mitigating hallucinations in multi-modal large language models via image token attention-guided decoding. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association f...

  32. [32]

    A survey on multimodal large language models

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 12 2024

  33. [33]

    Self-chained image-language model for video localization and question answering

    Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Self-chained image-language model for video localization and question answering. In Advances in Neural Information Processing Systems, 2023

  34. [34]

    Videorefer suite: Advancing spatial-temporal object understanding with video llm

    Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, et al. Videorefer suite: Advancing spatial-temporal object understanding with video llm. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18970–18980, 2025

  35. [35]

    LLaVA-Video: Video instruction tuning with synthetic data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun MA, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025

  36. [36]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  37. [37]

    Apollo: An exploration of video understanding in large multimodal models

    Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, and Xide Xia. Apollo: An exploration of video understanding in large multimodal models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18891–18901, 2025.