Pith Number

pith:EVIGEFXK

pith:2026:EVIGEFXKTZQOFFQLZ6Y5XWO36Z

not attested · not anchored · not stored · refs resolved

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Byoung-Tak Zhang, Lorenzo Torresani, Minjoon Jung

Two self-evolving agents learn video temporal grounding from unlabeled videos alone.

arxiv:2605.13803 v1 · 2026-05-13 · cs.CV


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{EVIGEFXKTZQOFFQLZ6Y5XWO36Z}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.

C2 · weakest assumption

The mutual reinforcement loop between proposer and solver can bootstrap effective temporal grounding and captioning capabilities starting from raw videos and a shared backbone without any initial human supervision or external reward signals.

C3 · one-line summary

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

References

69 extracted · 69 resolved · 20 Pith anchors

[1] Modal-specific pseudo query generation for video corpus moment retrieval, 2022

[2] Detecting moments and highlights in videos via natural language queries, 2021

[3] Can I trust your answer? Visually grounded video question answering, 2024

[4] TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models, 2024

[5] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis, 2024 · arXiv:2405.21075

Receipt and verification

First computed	2026-05-18T02:44:15.487853Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

25506216ea9e60e2960bcfb1dbd9dbf6757bdab43fad1292f5e4c1cdde2ad65c

Aliases

arxiv: 2605.13803 · arxiv_version: 2605.13803v1 · doi: 10.48550/arxiv.2605.13803 · pith_short_12: EVIGEFXKTZQO · pith_short_16: EVIGEFXKTZQOFFQL · pith_short_8: EVIGEFXK
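The identifier values on this record are consistent with a simple relationship, sketched below in Python: the full Pith Number equals the first 26 characters of the RFC 4648 base32 encoding of the canonical hash, and each `pith_short_N` alias is the first N characters of that number. This is inferred from the values shown on this page, not from a published derivation rule, and the helper names are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "25506216ea9e60e2960bcfb1dbd9dbf6757bdab43fad1292f5e4c1cdde2ad65c"
PITH_NUMBER = "EVIGEFXKTZQOFFQLZ6Y5XWO36Z"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: this rule is inferred from this record's values only.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Build pith_short_N aliases as prefixes of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


if __name__ == "__main__":
    # Compare the derived identifier with the one printed on the record.
    print(pith_from_hash(CANONICAL_HASH) == PITH_NUMBER)
    print(short_aliases(PITH_NUMBER))
```

If the relationship holds in general, a short alias can be checked against a canonical hash without contacting the resolver; treat that as an assumption until the schema confirms it.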

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/EVIGEFXKTZQOFFQLZ6Y5XWO36Z \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 25506216ea9e60e2960bcfb1dbd9dbf6757bdab43fad1292f5e4c1cdde2ad65c

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "114145067ac93f8f12e153b1a05c767d2e23e58d496978b752faf6d5978ee62d",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-13T17:25:51Z",
    "title_canon_sha256": "1f3c58016865f2d0427d7bdf33660b4bf2d489be65239d9f31094989d1ebb763"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13803",
    "kind": "arxiv",
    "version": 1
  }
}
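
The verification pipeline can also be run offline against a saved copy of the record. The Python sketch below mirrors the one-liner's serialization (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and compares the digest with an expected value. The function names are hypothetical, and the sketch assumes the JSON printed on this page is exactly the `canonical_record` the resolver returns, which the page itself does not guarantee.

```python
import hashlib
import json

# Record and expected digest copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "114145067ac93f8f12e153b1a05c767d2e23e58d496978b752faf6d5978ee62d",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-13T17:25:51Z",
    "title_canon_sha256": "1f3c58016865f2d0427d7bdf33660b4bf2d489be65239d9f31094989d1ebb763"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13803",
    "kind": "arxiv",
    "version": 1
  }
}
"""
EXPECTED = "25506216ea9e60e2960bcfb1dbd9dbf6757bdab43fad1292f5e4c1cdde2ad65c"


def canonical_bytes(record) -> bytes:
    """Serialize with sorted keys and compact separators, as the one-liner does."""
    text = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def canonical_sha256(record) -> str:
    """Return the SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


def verify(record, expected_hex: str) -> bool:
    """Compare the computed digest with an expected hex string."""
    return canonical_sha256(record) == expected_hex.lower()


if __name__ == "__main__":
    record = json.loads(RECORD_JSON)
    print("computed:", canonical_sha256(record))
    print("matches expected:", verify(record, EXPECTED))
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the saved file. A mismatch therefore points to a difference in content, or to the page's JSON not being the exact canonical record.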