Recognition: 2 theorem links · Lean Theorem
Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
Pith reviewed 2026-05-12 02:27 UTC · model grok-4.3
The pith
Video AI models hallucinate less when they build explicit trajectories for each object across frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that decomposing video queries into object-centric sub-questions and constructing explicit trajectories via chunk-wise state extraction and temporal aggregation directly addresses the root cause of hallucinations, yielding measurable gains in reasoning consistency on dynamic scenes as measured by their new benchmark.
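The decomposition step can be pictured with a short sketch. This is not the authors' code: the helper `ask_llm`, the prompt wording, and the output shape are illustrative assumptions.

```python
# Hypothetical sketch: decompose a video query into object-centric sub-questions.
# `ask_llm` stands in for any chat-completion call; prompt text and return format
# are assumptions, not the paper's implementation.
from typing import Callable, List


def decompose_query(query: str, objects: List[str], ask_llm: Callable[[str], str]) -> List[str]:
    """Return one sub-question per tracked object, covering identity, state, and relations."""
    sub_questions = []
    for obj in objects:
        prompt = (
            f"Original video question: {query}\n"
            f"Write one sub-question that can be answered only by tracking "
            f"the object '{obj}' across the whole video (its identity, state, and relations)."
        )
        sub_questions.append(ask_llm(prompt).strip())
    return sub_questions
```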
What carries the argument
STEMO-Track, the object-centric framework that constructs structured object trajectories by extracting states chunk-wise and aggregating them temporally to support persistent reasoning.
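To make the trajectory bookkeeping concrete, a minimal sketch follows, assuming 15-second chunks (the chunk length quoted in the excerpts below), a per-object state string, and a hypothetical `extract_state` call to a vision-language model; none of this is the authors' published implementation.

```python
# Hypothetical sketch of chunk-wise state extraction followed by temporal aggregation.
# The dataclass layout, chunk length, and `extract_state` helper are assumptions.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Trajectory:
    object_id: str
    states: List[Tuple[float, str]] = field(default_factory=list)  # (timestamp, state description)

    def add(self, t: float, state: str) -> None:
        self.states.append((t, state))
        self.states.sort(key=lambda ts: ts[0])  # keep temporal order for downstream reasoning


def build_trajectories(
    video_duration: float,
    object_ids: List[str],
    extract_state: Callable[[str, float, float], str],
    chunk_seconds: float = 15.0,
) -> Dict[str, Trajectory]:
    """Split the video into fixed-length chunks, extract one state per object per chunk,
    and aggregate the states into time-ordered trajectories."""
    trajectories = {oid: Trajectory(oid) for oid in object_ids}
    t = 0.0
    while t < video_duration:
        end = min(t + chunk_seconds, video_duration)
        for oid in object_ids:
            # extract_state would query a vision-language model on frames in [t, end)
            state = extract_state(oid, t, end)
            trajectories[oid].add(t, state)
        t = end
    return trajectories
```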
If this is right
- Models would produce fewer fabricated details about object locations and relations in moving scenes.
- Benchmark scores would better separate genuine temporal understanding from answers correct only by coincidence.
- Reasoning consistency would improve across sequences longer than single-frame cues allow.
- Video question-answering systems would become more reliable for tasks involving state changes over time.
Where Pith is reading between the lines
- The same trajectory-construction idea could be tested on tasks like action prediction or anomaly detection in surveillance footage.
- Future models might embed similar explicit object memory modules as a standard component rather than post-hoc fixes.
- Scalability to very long videos could be checked by measuring how trajectory aggregation behaves as sequence length grows.
- Connections to object permanence in developmental psychology might suggest human-like evaluation protocols.
Load-bearing premise
The primary cause of hallucinations in dynamic video scenes is a failure of spatio-temporal monitoring that can be diagnosed and fixed by decomposing queries into object-centric sub-questions and constructing explicit trajectories.
What would settle it
Run the framework on video questions whose correct answers depend on global scene context rather than individual object paths; if hallucination rates on those questions remain as high as before, the claim that monitoring failure is the primary cause would not hold.
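A minimal sketch of how that comparison could be scored, assuming two pre-labelled question pools and hypothetical `answer_with_framework` and `hallucinated` helpers (neither comes from the paper):

```python
# Hypothetical falsification sketch for the test above. `answer_with_framework`
# (the system under test) and `hallucinated` (a judge) are assumed helpers.
from typing import Callable, List, Tuple


def hallucination_rate(
    questions: List[str],
    answer_with_framework: Callable[[str], str],
    hallucinated: Callable[[str, str], bool],
) -> float:
    """Fraction of questions whose answers the judge flags as hallucinated."""
    flagged = sum(hallucinated(q, answer_with_framework(q)) for q in questions)
    return flagged / max(len(questions), 1)


def settle_monitoring_claim(
    path_dependent: List[str],
    global_context: List[str],
    answer_with_framework: Callable[[str], str],
    hallucinated: Callable[[str, str], bool],
) -> Tuple[float, float]:
    """Return (path-dependent rate, global-context rate). Per the test above, a
    global-context rate that stays as high as the baseline's would cut against
    the claim that monitoring failure is the primary cause."""
    return (
        hallucination_rate(path_dependent, answer_with_framework, hallucinated),
        hallucination_rate(global_context, answer_with_framework, hallucinated),
    )
```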
original abstract
While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that hallucinations in video MLLMs stem primarily from failures in spatio-temporal monitoring of object identities, states, and relations. It introduces STEMO-Bench, a benchmark of human-verified object-centric facts that decomposes queries into sub-questions to distinguish genuine temporal understanding from coincidental correctness. To address the exposed failures, it proposes STEMO-Track, a framework that explicitly constructs structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments are said to show that this object-centric approach significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.
Significance. If the results hold, the work provides a diagnostic benchmark and targeted engineering framework for improving reliability of video MLLMs in dynamic scenes, with credit due for the new human-verified data collection in STEMO-Bench and the explicit trajectory construction in STEMO-Track. This could help shift evaluation from final-answer accuracy to intermediate reasoning verification, though the absence of reported quantitative metrics, error bars, or external benchmark comparisons in the abstract limits immediate assessment of broader impact.
major comments (3)
- [Abstract and Experiments] Abstract and Experiments section: the claim that 'extensive experiments demonstrate' significant reductions in hallucinations lacks any quantitative results, error bars, dataset details, ablation studies, or comparisons to unmodified prior video QA benchmarks, making the data-to-claim link unverifiable and the causal attribution to spatio-temporal monitoring insecure.
- [STEMO-Bench] STEMO-Bench description: the object-centric query decomposition and human-verified facts are designed precisely to expose deficits that STEMO-Track addresses via trajectory aggregation; this creates a risk that gains reflect alignment with the benchmark's structure rather than a general fix, especially without ablations isolating trajectory construction from simpler decomposition or results on standard hallucination benchmarks.
- [Introduction and Method] Introduction and Method: the assumption that spatio-temporal monitoring failure is the primary root cause (as opposed to reliance on local cues or priors) is load-bearing for the framework's motivation, but the benchmark design does not isolate this from other factors, weakening the justification for the chunk-wise extraction and aggregation approach.
minor comments (1)
- [Abstract] Abstract: the phrasing 'significantly reduces hallucinated answers' should be accompanied by at least a brief summary of key metrics to support the claim without requiring readers to reach the full experiments section.
Simulated Author's Rebuttal
We sincerely thank the referee for the careful reading and valuable feedback on our manuscript. We address each of the major comments in detail below, indicating the revisions we plan to make.
point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and Experiments section: the claim that 'extensive experiments demonstrate' significant reductions in hallucinations lacks any quantitative results, error bars, dataset details, ablation studies, or comparisons to unmodified prior video QA benchmarks, making the data-to-claim link unverifiable and the causal attribution to spatio-temporal monitoring insecure.
Authors: We agree that the abstract would benefit from including key quantitative results to immediately substantiate the claims. The Experiments section provides detailed tables with performance metrics, including reductions in hallucinated answers and improvements in consistency, along with standard deviations, dataset statistics for STEMO-Bench, ablation studies on the components of STEMO-Track, and comparisons to state-of-the-art video MLLMs. To address the concern, we will revise the abstract to incorporate specific quantitative highlights from these experiments. This will strengthen the verifiability of our claims without altering the core findings. revision: yes
-
Referee: [STEMO-Bench] STEMO-Bench description: the object-centric query decomposition and human-verified facts are designed precisely to expose deficits that STEMO-Track addresses via trajectory aggregation; this creates a risk that gains reflect alignment with the benchmark's structure rather than a general fix, especially without ablations isolating trajectory construction from simpler decomposition or results on standard hallucination benchmarks.
Authors: This is a valid concern regarding potential benchmark overfitting. We have performed ablations in the Experiments section that isolate the contribution of the trajectory construction and aggregation steps from basic query decomposition. These show that the full STEMO-Track framework provides additional gains beyond decomposition alone. To further demonstrate generalizability, we will add evaluations on standard video hallucination and QA benchmarks in the revised manuscript, allowing readers to assess performance beyond STEMO-Bench. revision: partial
-
Referee: [Introduction and Method] Introduction and Method: the assumption that spatio-temporal monitoring failure is the primary root cause (as opposed to reliance on local cues or priors) is load-bearing for the framework's motivation, but the benchmark design does not isolate this from other factors, weakening the justification for the chunk-wise extraction and aggregation approach.
Authors: We appreciate the referee pointing out the need for stronger isolation of the root cause. The design of STEMO-Bench uses human-verified object-centric facts and query decomposition specifically to require persistent tracking over time, as local cues or statistical priors are insufficient for answering the sub-questions correctly (as confirmed during human verification). We will expand the Introduction and Method sections with additional examples and analysis to better illustrate how the benchmark controls for these alternative factors, thereby providing stronger motivation for the chunk-wise state extraction and temporal aggregation in STEMO-Track. revision: yes
Circularity Check
No circularity: empirical benchmark and framework with independent experimental grounding
full rationale
The paper presents an engineering contribution consisting of a new benchmark (STEMO-Bench) for diagnosing spatio-temporal monitoring failures via object-centric query decomposition and a corresponding framework (STEMO-Track) using chunk-wise extraction and aggregation. No equations, parameter fittings, or derivations appear in the manuscript. The central claims rest on human-verified facts in the benchmark and comparative experiments against prior MLLMs, without any step that reduces by construction to its own inputs, self-citations, or renamed known results. The evaluation design is explicitly motivated by the hypothesized failure mode, but this constitutes a standard empirical hypothesis test rather than a self-referential loop. No load-bearing premise collapses to a prior self-citation or fitted quantity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Hallucinations in dynamic video scenes primarily stem from a failure in spatio-temporal monitoring of object identities, states, and relations.
invented entities (2)
- STEMO-Bench: no independent evidence
- STEMO-Track: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat as orbit under generator; embed_strictMono_of_one_lt
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
we represent it as a set of persistent entities O = {o_1, ...} ... trajectory S_o = {(t, s_t^(o)) | t ∈ T} ... temporal aggregation module then links these observations across chunks
-
IndisputableMonolith/Foundation/ArrowOfTime.lean: TemporalSequence; zAtStep monotonicity
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
STEMO-Track ... chunk-wise state extraction and temporal aggregation ... 15-second chunks
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Anthropic. Introducing claude sonnet 4.6. Anthropic, 2026
work page 2026
-
[2]
Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, and Jinwoo Choi. Mash-vlm: Mitigating action-scene hallucination in video-llms through disentangled spatial-temporal representations. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13744–13753, 2025
work page 2025
-
[3]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025
work page · Pith review · arXiv 2025
-
[4]
Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video-llms on video spatio-temporal reasoning. arXiv preprint arXiv:2503.11495, 2025
-
[5]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, et al. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025
work page · Pith review · arXiv 2025
-
[6]
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In2...
work page 2025
-
[7]
Hongcheng Gao, Jiashu Qu, Jingyi Tang, Baolong Bi, Yue Liu, Hongyu Chen, Li Liang, Li Su, and Qingming Huang. Exploring hallucination of large multimodal models in video understanding: Benchmark, analysis and mitigation, 2025
work page 2025
-
[8]
Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Han Bao, Zonghui Wang, and Wenzhi Chen. Mitigating hallucinations in video large language models via spatiotemporal-semantic contrastive decoding. arXiv preprint arXiv:2601.22574, 2026
-
[9]
Gemma 4: Our most capable open models to date
Google. Gemma 4: Our most capable open models to date. Google Blog, 2026
work page 2026
-
[10]
Google DeepMind. Gemini 3.1 pro model card. Google DeepMind Model Card, 2026
work page 2026
-
[11]
Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, and Gedas Bertasius. Timeblind: A spatio-temporal compositionality benchmark for video llms. arXiv preprint arXiv:2602.00288, 2026
-
[12]
Chaoyu Li, Eun Woo Im, and Pooyan Fazli. Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13723–13733, 2025
work page 2025
-
[13]
Videochat-flash: Hierarchical compression for long-context video modeling
Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, and Limin Wang. Videochat-flash: Hierarchical compression for long-context video modeling. In The Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[14]
Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world understanding? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025
work page 2025
-
[15]
Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for temporal-grounded video reasoning. arXiv preprint arXiv:2503.13444, 2025
-
[16]
Elv-halluc: Benchmarking semantic aggregation hallucinations in long video understanding, 2025
Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, and Lewei Lu. Elv-halluc: Benchmarking semantic aggregation hallucinations in long video understanding, 2025
work page 2025
-
[17]
Egoschema: a diagnostic benchmark for very long-form video language understanding
Karttikeya Mangalam, Raiymbek Akshkulakov, and Jitendra Malik. Egoschema: a diagnostic benchmark for very long-form video language understanding. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.
work page 2023
-
[18]
NVIDIA. Cosmos-reason2. NVIDIA Cosmos Documentation, 2026. Reasoning vision-language model for Physical AI and robotics
work page 2026
- [19]
- [20]
-
[21]
Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026
work page · Pith review · arXiv 2026
-
[22]
Understanding long videos in one multimodal language model pass
Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S Ryoo. Understanding long videos with multimodal language models. arXiv preprint arXiv:2403.16998, 2024
-
[23]
Argus: Hallucination and omission evaluation in video-llms
Ruchit Rawal, Reza Shirkavand, Heng Huang, Gowthami Somepalli, and Tom Goldstein. Argus: Hallucination and omission evaluation in video-llms. In 2025 IEEE/CVF International Conference on Computer Vision (ICCV), pages 20280–20290, 2025
work page 2025
-
[24]
Traveler: A modular multi-lmm agent framework for video question-answering
Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, and Roei Herzig. Traveler: A modular multi-lmm agent framework for video question-answering. In Proceedings of EMNLP, 2024
work page 2024
-
[25]
Yiming Sun, Mi Zhang, Feifei Li, Geng Hong, and Min Yang. Smartsight: Mitigating hallucination in video-llms without compromising video understanding via temporal attention collapse. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9251–9259, 2026
work page 2026
-
[26]
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for ...
work page 2026
-
[27]
Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. arXiv preprint arXiv:2505.12434, 2025
-
[28]
Internvideo2.5: Empowering video mllms with long and rich context modeling, 2025
Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2.5: Empowering video mllms with long and rich context modeling, 2025
work page 2025
-
[29]
Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models, 2024
work page 2024
-
[30]
Videochat-a1: Thinking with long videos by chain-of-shot reasoning
Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, and Yali Wang. Videochat-a1: Thinking with long videos by chain-of-shot reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 10467–10475, 2026
work page 2026
-
[31]
Xinhao Xu, Hui Chen, Mengyao Lyu, Sicheng Zhao, Yizhe Xiong, Zijia Lin, Jungong Han, and Guiguang Ding. Mitigating hallucinations in multi-modal large language models via image token attention-guided decoding. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association f...
work page 2025
-
[32]
A survey on multimodal large language models
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, December 2024
work page 2024
-
[33]
Self-chained image-language model for video localization and question answering
Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Self-chained image-language model for video localization and question answering. In Advances in Neural Information Processing Systems, 2023
work page 2023
-
[34]
Videorefer suite: Advancing spatial-temporal object understanding with video llm
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, et al. Videorefer suite: Advancing spatial-temporal object understanding with video llm. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18970–18980, 2025
work page 2025
-
[35]
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025
work page 2025
-
[36]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025
work page · Pith review · arXiv 2025
-
[37]
Apollo: An exploration of video understanding in large multimodal models
Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, and Xide Xia. Apollo: An exploration of video understanding in large multimodal models. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18891–18901, 2025
work page 2025