{"paper":{"title":"VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Separating planning from answer authority in video agents reduces evidence misalignment.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Chenhao Qiu (1), Shien Song (1), Xin Luo (1), Xusheng Liu (1), Yechao Zhang (2) ((1) Mango TV, (2) Nanyang Technological University)","submitted_at":"2026-05-12T10:37:49Z","abstract_excerpt":"Long video question answering requires locating sparse, time-scattered visual evidence within highly redundant content. Although current MLLMs perform well on short videos, long videos introduce long-horizon search and verification, which often necessitates multi-turn, agentic interaction. We show that existing LVU agents can exhibit \"evidence misalignment\": they produce correct answers that are not supported by the retrieved or inspected evidence. To characterize this failure, we introduce two diagnostics (temporal groundedness and semantic groundedness) and use them to reveal two pressures t"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the decoupled planner-inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification... 
improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"that gating final answers on pixel-level verification will reliably eliminate evidence misalignment without introducing new failure modes in long-horizon search","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Decoupling planning from answer authority in long-video agents reduces evidence misalignment and raises accuracy to 55.1% on LVBench and 62.0% on LongVideoBench.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Separating planning from answer authority in video agents reduces evidence misalignment.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4e972db1f8caeed0e5c4157cfe1dcd6bc4e592ba9d4be10c26d12225fc373357"},"source":{"id":"2605.12571","kind":"arxiv","version":1},"verdict":{"id":"1582dad9-4c35-4c8b-ac3c-8f4d7b297f9e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:48:01.366437Z","strongest_claim":"the decoupled planner-inspector framework, which separates planning from answer authority and gates final answering on pixel-level verification... 
improves both answer accuracy and evidence alignment, achieving 55.1% on LVBench and 62.0% on LongVideoBench","one_line_summary":"Decoupling planning from answer authority in long-video agents reduces evidence misalignment and raises accuracy to 55.1% on LVBench and 62.0% on LongVideoBench.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"that gating final answers on pixel-level verification will reliably eliminate evidence misalignment without introducing new failure modes in long-horizon search","pith_extraction_headline":"Separating planning from answer authority in video agents reduces evidence misalignment."},"references":{"count":14,"sample":[{"doi":"10.48550/arxiv.2","year":2024,"title":"VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs","work_id":"ccfc3f89-c510-45f1-8a35-ed1a56c0ae5c","ref_index":1,"cited_arxiv_id":"2406.07476","is_internal_anchor":true},{"doi":"10.48550/arxiv.2509.24304","year":2017,"title":"arXiv preprint arXiv:2509.24304 (2025) 9","work_id":"4ed1536d-cea3-451a-bb12-5f4188bc89ec","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1109/cv","year":2024,"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","ref_index":4,"cited_arxiv_id":"2508.18265","is_internal_anchor":true},{"doi":"10.48550/arxiv","year":2025,"title":"URLhttps://doi.org/10.48550/arXiv","work_id":"5c2060c6-427c-4321-be22-49ccae439d80","ref_index":5,"cited_arxiv_id":"2203.14987","is_internal_anchor":true},{"doi":"10.18653/v1/2024.emnlp-","year":2025,"title":"ReAct: Synergizing Reasoning and Acting in Language 
Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","ref_index":6,"cited_arxiv_id":"2210.03629","is_internal_anchor":true}],"resolved_work":14,"snapshot_sha256":"ddbc8da9f57c1331dad69f86f3fdffd58fa22c4a9633891e3f3bd07bd883c782","internal_anchors":4},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}