{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:7EEIF6IW46QKWYCE6JV7UAV4YB","short_pith_number":"pith:7EEIF6IW","schema_version":"1.0","canonical_sha256":"f90882f916e7a0ab6044f26bfa02bcc0583d66f90901bbdde83aedd8fcbce415","source":{"kind":"arxiv","id":"2503.13377","version":3},"attestation_state":"computed","paper":{"title":"Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Reinforcement learning post-training enables large vision-language models to achieve state-of-the-art temporal video grounding with only 2.5K training examples.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Boshen Xu, Dingyi Yang, Jian Luan, Jianzhong Ju, Junqi Lin, Kejun Lin, Liang Zhang, Qin Jin, Wenxuan Wang, Xiangnan Fang, Yang Du, Ye Wang, Zewen He, Zhenbo Luo, Zihan Xiao, Zihao Yue, Ziheng Wang","submitted_at":"2025-03-17T17:04:20Z","abstract_excerpt":"Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training f"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2503.13377","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-03-17T17:04:20Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"97209bd8163ae50201f4f63563823b1191ded3d066d238972ab094ea7017c1da","abstract_canon_sha256":"c47461ba7328122a8cf3804f43f934a641731cf10e3ddf62d5cb9ec145d00de7"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:15.371202Z","signature_b64":"vyk2HrncO/DFVgjHT2Uh0mEdCxCbyx0x2arBnbRmlEJSCSgNMQEVmArMbUaVoq597ScmHXmoOfW+w+YPXddcCA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"f90882f916e7a0ab6044f26bfa02bcc0583d66f90901bbdde83aedd8fcbce415","last_reissued_at":"2026-05-17T23:38:15.370647Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:15.370647Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Reinforcement learning post-training enables large vision-language models to achieve state-of-the-art temporal video grounding with only 2.5K training examples.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Boshen Xu, Dingyi Yang, Jian Luan, Jianzhong Ju, Junqi Lin, Kejun Lin, Liang Zhang, Qin Jin, Wenxuan Wang, Xiangnan Fang, Yang Du, Ye Wang, Zewen He, Zhenbo Luo, Zihan Xiao, Zihao Yue, Ziheng Wang","submitted_at":"2025-03-17T17:04:20Z","abstract_excerpt":"Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training f"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That reinforcement learning with verifiable rewards on the curated RL-friendly dataset will produce genuine generalization improvements rather than overfitting to the specific reward formulation or benchmark construction.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Reinforcement learning post-training enables large vision-language models to achieve state-of-the-art temporal video grounding with only 2.5K training examples.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b55102ecade32d8548ddc99a911d88eb186c8461438dcbaeee3207f98dafaf50"},"source":{"id":"2503.13377","kind":"arxiv","version":3},"verdict":{"id":"9e8fe946-10eb-4b1b-a214-b9772e061e94","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T02:34:23.564245Z","strongest_claim":"Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.","one_line_summary":"Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That reinforcement learning with verifiable rewards on the curated RL-friendly dataset will produce genuine generalization improvements rather than overfitting to the specific reward formulation or benchmark construction.","pith_extraction_headline":"Reinforcement learning post-training enables large vision-language models to achieve state-of-the-art temporal video grounding with only 2.5K training examples."},"references":{"count":87,"sample":[{"doi":"","year":2025,"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","ref_index":1,"cited_arxiv_id":"2501.12948","is_internal_anchor":true},{"doi":"","year":2023,"title":"Ht- step: Aligning instructional articles with how-to videos","work_id":"8245457b-ee59-461c-8839-ff57852b9855","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Localizing moments in video with natural language","work_id":"00e6da62-472c-45ac-a3ba-63660f0581c5","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":4,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":2015,"title":"Activitynet: A large-scale video benchmark for human activity understanding","work_id":"2f0f351a-b69d-4767-8213-6807af5fda95","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":87,"snapshot_sha256":"cbbe0739cbd3bc94e71dddd357a39e6d354506aab1108973d3319486c959c32c","internal_anchors":10},"formal_canon":{"evidence_count":3,"snapshot_sha256":"f408f7366e7e837e684a095f79a0910b1eded4116ad1df352b74af930bc00337"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2503.13377","created_at":"2026-05-17T23:38:15.370734+00:00"},{"alias_kind":"arxiv_version","alias_value":"2503.13377v3","created_at":"2026-05-17T23:38:15.370734+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2503.13377","created_at":"2026-05-17T23:38:15.370734+00:00"},{"alias_kind":"pith_short_12","alias_value":"7EEIF6IW46QK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"7EEIF6IW46QKWYCE","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"7EEIF6IW","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":34,"internal_anchor_count":34,"sample":[{"citing_arxiv_id":"2605.06094","citing_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23216","citing_title":"CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2511.20785","citing_title":"LongVT: Incentivizing \"Thinking with Long Videos\" via Native Tool Calling","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20342","citing_title":"ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21931","citing_title":"EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21954","citing_title":"MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20342","citing_title":"ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16079","citing_title":"VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2505.20715","citing_title":"MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2507.00748","citing_title":"Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2511.11113","citing_title":"VIDEOP2R: Video Understanding from Perception to Reasoning","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2511.13026","citing_title":"REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2505.21374","citing_title":"Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2512.03963","citing_title":"TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2512.03043","citing_title":"OneThinker: All-in-one Reasoning Model for Image and Video","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2512.16918","citing_title":"AdaTooler-V: Adaptive Tool-Use for Images and Videos","ref_index":70,"is_internal_anchor":true},{"citing_arxiv_id":"2602.00181","citing_title":"CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2504.06958","citing_title":"VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2602.17555","citing_title":"GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13803","citing_title":"EvoGround: Self-Evolving Video Agents for Video Temporal Grounding","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.01824","citing_title":"STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2502.17419","citing_title":"From System 1 to System 2: A Survey of Reasoning Large Language Models","ref_index":294,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27083","citing_title":"Co-Evolving Policy Distillation","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06094","citing_title":"VISD: Enhancing Video Reasoning via Structured Self-Distillation","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25276","citing_title":"OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding","ref_index":32,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/7EEIF6IW46QKWYCE6JV7UAV4YB","json":"https://pith.science/pith/7EEIF6IW46QKWYCE6JV7UAV4YB.json","graph_json":"https://pith.science/api/pith-number/7EEIF6IW46QKWYCE6JV7UAV4YB/graph.json","events_json":"https://pith.science/api/pith-number/7EEIF6IW46QKWYCE6JV7UAV4YB/events.json","paper":"https://pith.science/paper/7EEIF6IW"},"agent_actions":{"view_html":"https://pith.science/pith/7EEIF6IW46QKWYCE6JV7UAV4YB","download_json":"https://pith.science/pith/7EEIF6IW46QKWYCE6JV7UAV4YB.json","view_paper":"https://pith.science/paper/7EEIF6IW","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2503.13377&json=true","fetch_graph":"https://pith.science/api/pith-number/7EEIF6IW46QKWYCE6JV7UAV4YB/graph.json","fetch_events":"https://pith.science/api/pith-number/7EEIF6IW46QKWYCE6JV7UAV4YB/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/7EEIF6IW46QKWYCE6JV7UAV4YB/action/timestamp_anchor","attest_storage":"https://pith.science/pith/7EEIF6IW46QKWYCE6JV7UAV4YB/action/storage_attestation","attest_author":"https://pith.science/pith/7EEIF6IW46QKWYCE6JV7UAV4YB/action/author_attestation","sign_citation":"https://pith.science/pith/7EEIF6IW46QKWYCE6JV7UAV4YB/action/citation_signature","submit_replication":"https://pith.science/pith/7EEIF6IW46QKWYCE6JV7UAV4YB/action/replication_record"}},"created_at":"2026-05-17T23:38:15.370734+00:00","updated_at":"2026-05-17T23:38:15.370734+00:00"}