Pith Number

pith:EVIGEFXK

pith:2026:EVIGEFXKTZQOFFQLZ6Y5XWO36Z
not attested · not anchored · not stored · refs resolved

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Byoung-Tak Zhang, Lorenzo Torresani, Minjoon Jung

Two self-evolving agents learn video temporal grounding from unlabeled videos alone.

arxiv:2605.13803 v1 · 2026-05-13 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{EVIGEFXKTZQOFFQLZ6Y5XWO36Z}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
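The page does not describe the merge algorithm itself. As a minimal sketch of what "deterministic" could mean here, the example below applies events in an order derived only from their content, so any mirror holding the same events reaches the same state. The event fields (`at`, `field`, `value`), the last-writer-wins rule, and the sample data are all hypothetical, and signature checking is left out.

```python
import hashlib
import json


def event_key(event: dict) -> tuple:
    """Sort key computed from the event content alone (hypothetical fields)."""
    digest = hashlib.sha256(
        json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()
    # Timestamp first, content digest as a tie-breaker.
    return (event["at"], digest)


def merge(record: dict, events: list) -> dict:
    """Fold events into a copy of the record in a content-defined order."""
    state = dict(record)
    seen = set()
    for event in sorted(events, key=event_key):
        key = event_key(event)
        if key in seen:  # a duplicated event is applied only once
            continue
        seen.add(key)
        state[event["field"]] = event["value"]  # assumed last-writer-wins
    return state


# Made-up record and events, for illustration only.
record = {"source": "arxiv:2605.13803"}
events = [
    {"at": "2026-05-18T03:00:00Z", "field": "status", "value": "a"},
    {"at": "2026-05-19T09:30:00Z", "field": "status", "value": "b"},
]

shuffled = list(reversed(events)) + events  # different order, with duplicates
print(merge(record, events) == merge(record, shuffled))
```

Because the ordering and de-duplication depend only on event content, the result does not depend on the order in which a mirror received the events.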

Claims

C1 strongest claim

Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.

C2 weakest assumption

The mutual reinforcement loop between proposer and solver can bootstrap effective temporal grounding and captioning capabilities starting from raw videos and a shared backbone without any initial human supervision or external reward signals.

C3 one-line summary

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

References

69 extracted · 69 resolved · 20 Pith anchors

[1] Modal-specific pseudo query generation for video corpus moment retrieval, 2022
[2] Detecting moments and highlights in videos via natural language queries, 2021
[3] Can I trust your answer? Visually grounded video question answering, 2024
[4] TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models, 2024
[5] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis, 2024 · arXiv:2405.21075
Receipt and verification
First computed: 2026-05-18T02:44:15.487853Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

25506216ea9e60e2960bcfb1dbd9dbf6757bdab43fad1292f5e4c1cdde2ad65c

Aliases

arxiv: 2605.13803 · arxiv_version: 2605.13803v1 · doi: 10.48550/arxiv.2605.13803 · pith_short_12: EVIGEFXKTZQO · pith_short_16: EVIGEFXKTZQOFFQL · pith_short_8: EVIGEFXK
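The values on this page suggest how the identifiers relate. The short aliases are prefixes of the full Pith Number, and the Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. This is an observation from this one record, not a documented rule. The sketch below checks both relations; the function names are illustrative.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "25506216ea9e60e2960bcfb1dbd9dbf6757bdab43fad1292f5e4c1cdde2ad65c"
PITH_NUMBER = "EVIGEFXKTZQOFFQLZ6Y5XWO36Z"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the leading characters.

    Assumption: the 26-character length is inferred from this record only.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict:
    """Build the pith_short_N aliases as plain prefixes."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


derived = pith_from_hash(CANONICAL_HASH)
print(derived == PITH_NUMBER)
print(short_aliases(PITH_NUMBER))
```

If the relation holds in general, the identifier is content-addressed, so any change to the canonical record would also change the Pith Number.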
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/EVIGEFXKTZQOFFQLZ6Y5XWO36Z \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 25506216ea9e60e2960bcfb1dbd9dbf6757bdab43fad1292f5e4c1cdde2ad65c
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "114145067ac93f8f12e153b1a05c767d2e23e58d496978b752faf6d5978ee62d",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-13T17:25:51Z",
    "title_canon_sha256": "1f3c58016865f2d0427d7bdf33660b4bf2d489be65239d9f31094989d1ebb763"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13803",
    "kind": "arxiv",
    "version": 1
  }
}
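The one-line Python step in the shell command above can be written out as a small script. This sketch applies the same serialization settings (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) to the record as printed on this page. It assumes the printed record is the same object that the API returns under `canonical_record`. Compare the digest against the canonical hash listed above.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash the record using the same canonical JSON form as the shell one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Record transcribed from the "Canonical record JSON" section of this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "114145067ac93f8f12e153b1a05c767d2e23e58d496978b752faf6d5978ee62d",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-13T17:25:51Z",
        "title_canon_sha256": "1f3c58016865f2d0427d7bdf33660b4bf2d489be65239d9f31094989d1ebb763",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13803", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)
```

Sorting keys and fixing the separators makes the byte string independent of how the JSON was laid out, which is why a mirror can reproduce the same hash from a reformatted copy.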