{"paper":{"title":"Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding","license":"http://creativecommons.org/publicdomain/zero/1.0/","headline":"Video reasoning improves when each step anchors explicitly to specific visual objects in the frames.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bo Cheng, Genbao Xu, Nan Ma, Quanxing Zha, Soujanya Poria, Teng Wang, Wei Rao, Wenyuan Gu, Zhixuan Wu","submitted_at":"2026-04-16T06:50:20Z","abstract_excerpt":"Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounde"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, yielding accurate and interpretable multi-step decisions.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That optimizing a search-guided controller via reinforcement learning with a format reward will reliably produce grounding capability that improves compositional reasoning over object-agnostic baselines.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to 
enable multi-step object-aware reasoning in videos.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Video reasoning improves when each step anchors explicitly to specific visual objects in the frames.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9a5a68a61c873ca5566f1a3804cafce6fe6a53a1f2ff7bb83173ab35727ccbd6"},"source":{"id":"2604.14692","kind":"arxiv","version":2},"verdict":{"id":"714feb40-eea7-4a6d-b7eb-b480a879c38c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T17:30:00.450691Z","strongest_claim":"Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, yielding accurate and interpretable multi-step decisions.","one_line_summary":"Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That optimizing a search-guided controller via reinforcement learning with a format reward will reliably produce grounding capability that improves compositional reasoning over object-agnostic baselines.","pith_extraction_headline":"Video reasoning improves when each step anchors explicitly to specific visual objects in the frames."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.14692/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":56,"sample":[
{"doi":"","year":2024,"title":"A simple llm framework for long-range video question-answering,","work_id":"c0b0c9c5-0466-4223-a181-5209bd0c7b6b","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Understanding long videos in one multimodal language model pass","work_id":"a0d8f834-29b0-4597-ad47-382843695ca9","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Stimuvar: Spatiotemporal stimuli-aware video affective reasoning with multimodal large language models,","work_id":"83621f58-c138-4c3d-8c5c-f9efbec7184e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Dycoke: Dynamic compression of tokens for fast video large language models,","work_id":"6cf93b3f-d9fb-480c-8916-bb03104c64fa","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Vtimellm: Empower llm to grasp video moments,","work_id":"069d139b-0388-4f7e-91b5-de76a3623585","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":56,"snapshot_sha256":"9d6eecbd6e0680a2159ecc4cdd8c72256c80ebb3230bca373ee344d6e5f50e60","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"bed1e382e55644c1628f5545de3cced90f0996a5b6c681db8aa00abedbdaffce"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}