Pith Number

pith:JB5TOC7Q

pith:2026:JB5TOC7QDDQ53QAVZVMBXETMM7

not attested · not anchored · not stored · refs resolved

TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation

Abdul Wasi, Akhil Gorugantu, David Doermann, Mahesh Bhosale, Pengyu Yan, Vishvesh Trivedi

TRACE grounds evidence in text-searchable timelines before visual reasoning for multi-video events.

arxiv:2605.16740 v1 · 2026-05-16 · cs.CV

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{JB5TOC7QDDQ53QAVZVMBXETMM7}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open

✓ Portable graph bundle: live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
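One way a mirror could reach the same state regardless of the order in which events arrive is to sort them on a total-order key before folding them into the record. The sketch below is illustrative only: the event fields (`ts`, `id`, `field`, `value`) and the last-writer-wins rule are assumptions made for this example, not the bundle's actual schema or merge rule, and signature checking is omitted.

```python
from typing import Any


def merge_events(canonical: dict[str, Any], events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a copy of the canonical record in a fixed order.

    Hypothetical event shape: {"id", "ts", "field", "value"}.
    """
    state = dict(canonical)  # never mutate the canonical record
    # Sorting on (timestamp, event id) gives every mirror the same order,
    # so the result does not depend on arrival order.
    for ev in sorted(events, key=lambda e: (e["ts"], e["id"])):
        state[ev["field"]] = ev["value"]  # assumed last-writer-wins rule
    return state


# Made-up events, delivered newest first.
events = [
    {"id": "e2", "ts": "2026-05-21T00:00:00Z", "field": "citations", "value": 1},
    {"id": "e1", "ts": "2026-05-20T00:00:00Z", "field": "citations", "value": 0},
]
record = {"schema_version": "1.0"}
merged = merge_events(record, events)
```

Reversing the input list yields the same `merged` dictionary, which is the property a deterministic merge needs.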

Claims

C1. Strongest claim

On the MAGMaR validation split, TRACE raises macro-average MiRAGE F1 from 0.705 (unguided Qwen3-VL-30B baseline) to 0.811, with an especially strong improvement in citation recall, from 0.440 to 0.628.
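To put the reported numbers on a common scale, the following minimal sketch computes the absolute and relative gain for each metric. Only the four scores come from the claim above; the helper name `gain` is introduced here for illustration.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent), both rounded."""
    absolute = after - before
    return round(absolute, 3), round(100 * absolute / before, 1)


f1_gain = gain(0.705, 0.811)      # macro-average MiRAGE F1
recall_gain = gain(0.440, 0.628)  # citation recall
print(f1_gain, recall_gain)
# (0.106, 15.0) (0.188, 42.7)
```

In relative terms the citation-recall gain (about 43%) is roughly three times the F1 gain (about 15%).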

C2. Weakest assumption

The method assumes that OCR and object detection produce sufficiently accurate and complete structured timelines, and that a text-only LLM can reliably select query-relevant moments without missing critical visual cues not captured in text (abstract, method description paragraph).

C3. One-line summary

TRACE builds structured text timelines from videos via OCR and detection, then applies text-only LLM evidence localization before LVLM claim generation, raising MiRAGE F1 from 0.705 to 0.811 on MAGMaR.
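The idea of localizing evidence in a text timeline before any visual reasoning can be illustrated with a toy example. This is not the paper's implementation: plain word overlap stands in for the text-only LLM, the `TimelineEntry` type and `localize` function are hypothetical, and the timeline rows are made-up data rather than real OCR or detector output.

```python
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    video_id: str
    t_sec: float
    text: str  # OCR strings and detected object labels, flattened to text


def localize(timeline: list[TimelineEntry], query: str, top_k: int = 2) -> list[TimelineEntry]:
    """Rank entries by word overlap with the query (stand-in for an LLM)."""
    q = set(query.lower().split())
    scored = [(len(q & set(e.text.lower().split())), e) for e in timeline]
    scored = [(s, e) for s, e in scored if s > 0]  # drop non-matching moments
    # Highest overlap first; ties broken by video id, then timestamp.
    scored.sort(key=lambda p: (-p[0], p[1].video_id, p[1].t_sec))
    return [e for _, e in scored[:top_k]]


# Made-up timeline spanning two videos.
timeline = [
    TimelineEntry("v1", 12.0, "banner flood warning river"),
    TimelineEntry("v1", 40.0, "person car street"),
    TimelineEntry("v2", 5.5, "sign river bridge closed"),
]
hits = localize(timeline, "river flood")
```

Only the moments in `hits` would then be passed to the vision-language model for claim generation, which is the cost-saving step the summary describes.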

References

14 extracted · 14 resolved · 8 Pith anchors

[1] PP-OCR: A Practical Ultra Lightweight OCR System · arXiv:2009.09941

[2] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis · arXiv:2405.21075

[3] Verify exact arXiv ID and author list on Scholar

[4] VideoChat: Chat-Centric Video Understanding · arXiv:2305.06355

[5] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection · arXiv:2303.05499

Receipt and verification

First computed	2026-05-20T00:02:39.251465Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

487b370bf018e1ddc015cd581b926c67e63e2ed656d596b8bef951a76f70e7c4

Aliases

arxiv: 2605.16740 · arxiv_version: 2605.16740v1 · doi: 10.48550/arxiv.2605.16740 · pith_short_12: JB5TOC7QDDQ5 · pith_short_16: JB5TOC7QDDQ53QAV · pith_short_8: JB5TOC7Q
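The values on this page are consistent with the Pith Number being the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and with each `pith_short_N` alias being its first N characters. This relationship is inferred from the values shown here, not taken from a published specification; the function name `pith_from_hash` is hypothetical.

```python
import base64

# Canonical hash as listed on this record.
CANONICAL_HASH = "487b370bf018e1ddc015cd581b926c67e63e2ed656d596b8bef951a76f70e7c4"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` chars."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


pith = pith_from_hash(CANONICAL_HASH)
# Short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}
print(pith)
# JB5TOC7QDDQ53QAVZVMBXETMM7
```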

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/JB5TOC7QDDQ53QAVZVMBXETMM7 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 487b370bf018e1ddc015cd581b926c67e63e2ed656d596b8bef951a76f70e7c4
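The same canonicalization can be written as a small offline function, which is handy when the record has already been downloaded. This sketch mirrors the serialization settings of the one-liner above (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes); the function name `canonical_sha256` is introduced here for illustration.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record after serializing it in a key-order-independent form."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two spellings of the same fragment (the metadata block is omitted here,
# so this digest is not the record's canonical hash).
a = {"schema_version": "1.0", "source": {"id": "2605.16740", "kind": "arxiv", "version": 1}}
b = {"source": {"version": 1, "kind": "arxiv", "id": "2605.16740"}, "schema_version": "1.0"}
digest_a = canonical_sha256(a)
digest_b = canonical_sha256(b)
```

Because keys are sorted before hashing, `digest_a` and `digest_b` are identical. To verify this record, pass the full canonical record JSON to the function and compare the result with the canonical hash listed on the page.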

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "9a910ccaaf755069dddeda69068f9352cbc58bd13c4b3a9a96e7083cc93d7a23",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-16T01:37:10Z",
    "title_canon_sha256": "30973130461cc3c7190208868ff913d5e109498b56748e2eb099c2e3e66a2b38"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.16740",
    "kind": "arxiv",
    "version": 1
  }
}