pith.
Pith Number

pith:YB43P2XZ

pith:2026:YB43P2XZE5OHPSQBVHOTWJD2DI
Status: not attested · not anchored · not stored · refs resolved

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Bo Cheng, Genbao Xu, Nan Ma, Quanxing Zha, Soujanya Poria, Teng Wang, Wei Rao, Wenyuan Gu, Zhixuan Wu

Video reasoning improves when each step anchors explicitly to specific visual objects in the frames.

arxiv:2604.14692 v2 · 2026-04-16 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YB43P2XZE5OHPSQBVHOTWJD2DI}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, yielding accurate and interpretable multi-step decisions.

C2 · weakest assumption

That optimizing a search-guided controller via reinforcement learning with a format reward will reliably produce grounding capability that improves compositional reasoning over object-agnostic baselines.

C3 · one-line summary

Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.

References

56 extracted · 56 resolved · 13 Pith anchors

[1] A simple LLM framework for long-range video question-answering, 2024
[2] Understanding long videos in one multimodal language model pass, 2024
[3] StimuVAR: Spatiotemporal stimuli-aware video affective reasoning with multimodal large language models, 2025
[4] DyCoke: Dynamic compression of tokens for fast video large language models, 2025
[5] VTimeLLM: Empower LLM to grasp video moments, 2024

Formal links

2 machine-checked theorem links

Receipt and verification
First computed: 2026-05-20T00:00:38.048308Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

c079b7eaf9275c77ca01a9dd3b247a1a36a188fdff3d484ffc760e4ccba23a98

Aliases

arxiv: 2604.14692 · arxiv_version: 2604.14692v2 · doi: 10.48550/arxiv.2604.14692 · pith_short_12: YB43P2XZE5OH · pith_short_16: YB43P2XZE5OHPSQB · pith_short_8: YB43P2XZ
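The identifiers on this page fit together simply. Encoding the first 16 bytes of the canonical hash (c079b7ea…) as unpadded RFC 4648 base32 yields exactly the 26-character Pith Number, and the pith_short_8/12/16 aliases are plain prefixes of that number. The sketch below reproduces both relations with the Python standard library. That the builder derives every Pith Number this way is an inference from this one record, not documented behaviour, and the helper names `pith_from_hash` and `short_aliases` are hypothetical.

```python
import base64


def pith_from_hash(hex_digest: str, n_bytes: int = 16) -> str:
    """Hypothetical helper: unpadded RFC 4648 base32 of the first n_bytes of a hex digest.

    Assumption (inferred from this record only): Pith Number == base32(sha256[:16]) without '='.
    """
    prefix = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(prefix).decode("ascii").rstrip("=")


def short_aliases(pith: str) -> dict:
    """Hypothetical helper: the pith_short_N aliases are the first N characters of the full number."""
    return {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}


canonical_hash = "c079b7eaf9275c77ca01a9dd3b247a1a36a188fdff3d484ffc760e4ccba23a98"
pith = pith_from_hash(canonical_hash)
print(pith)  # YB43P2XZE5OHPSQBVHOTWJD2DI
print(short_aliases(pith))
```

Sixteen bytes are 128 bits, which base32 packs into 26 significant characters plus six padding characters; stripping the padding gives the 26-character form shown at the top of the page.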
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YB43P2XZE5OHPSQBVHOTWJD2DI \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c079b7eaf9275c77ca01a9dd3b247a1a36a188fdff3d484ffc760e4ccba23a98
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "ab2fe4d66c30a2175231238e2d603870b78bf7ec2b23da5ebb43683443f1adf6",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/publicdomain/zero/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-04-16T06:50:20Z",
    "title_canon_sha256": "dea5e11d196ad376ab465763318eb017b53ceffc5867e5b09fc282eb4fedf5de"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2604.14692",
    "kind": "arxiv",
    "version": 2
  }
}
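The canonicalisation performed by the `python3 -c` step of the verification pipeline can also be run offline on a record you already hold. This is a minimal sketch that assumes the record is exactly the JSON object above, pasted in as a Python dict; `canonical_bytes` and `canonical_hash` are hypothetical names. Sorting keys and using the compact `(",", ":")` separators make the serialized bytes independent of key order and whitespace, which is what allows a mirror to recompute the same digest.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    # Same settings as the verification one-liner: sorted keys, no whitespace,
    # non-ASCII characters kept as UTF-8 instead of \u escapes.
    return json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def canonical_hash(record: dict) -> str:
    # SHA-256 over the canonical bytes, as a lowercase hex string.
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = {
    "metadata": {
        "abstract_canon_sha256": "ab2fe4d66c30a2175231238e2d603870b78bf7ec2b23da5ebb43683443f1adf6",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/publicdomain/zero/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-04-16T06:50:20Z",
        "title_canon_sha256": "dea5e11d196ad376ab465763318eb017b53ceffc5867e5b09fc282eb4fedf5de",
    },
    "schema_version": "1.0",
    "source": {"id": "2604.14692", "kind": "arxiv", "version": 2},
}

digest = canonical_hash(record)
print(digest)  # compare against the canonical hash published on this page
```

Reordering the keys of `record`, or any nested object, leaves `digest` unchanged, because `sort_keys=True` fixes the order before hashing.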