{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:IL2GFSKZ2657G56SVJAYSIRIN6","short_pith_number":"pith:IL2GFSKZ","schema_version":"1.0","canonical_sha256":"42f462c959d7bbf377d2aa418922286f88f1771da0803e0dec95e98189ceb55c","source":{"kind":"arxiv","id":"2403.00476","version":3},"attestation_state":"computed","paper":{"title":"TempCompass: Do Video LLMs Really Understand Videos?","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Video LLMs exhibit notably poor temporal perception ability across aspects like speed and direction.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Lei Li, Lu Hou, Shicheng Li, Shuhuai Ren, Sishuo Chen, Xu Sun, Yi Liu, Yuanxin Liu, Yuxiang Wang","submitted_at":"2024-03-01T12:02:19Z","abstract_excerpt":"Recently, there is a surge in interest surrounding video large language models (Video LLMs). However, existing benchmarks fail to provide a comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of them are unable to distinguish between different temporal aspects (e.g., speed, direction) and thus cannot reflect the nuanced performance on these specific aspects. On the other hand, they are limited in the diversity of task formats (e.g., only multi-choice QA), which hinders the understanding of how temporal perception performance may vary across different "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2403.00476","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2024-03-01T12:02:19Z","cross_cats_sorted":[],"title_canon_sha256":"5c3fcd68707f7e1fe9fdbfdc44467dac5589f4b0191d98f5a0d2c1e1882aaf32","abstract_canon_sha256":"8ea710614dbdefbfee1caca2d0d10902ea897a7deb524d9d333cb07dccfc7a7e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:15.359178Z","signature_b64":"/7vFKF9S1zRNxeci0VwOKoTq/mXN9WuoWuZnOR/GHjN/t/13rq5J0zpGpt+QBOglzdDSqUtG71V72f6ua0eEAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"42f462c959d7bbf377d2aa418922286f88f1771da0803e0dec95e98189ceb55c","last_reissued_at":"2026-05-17T23:38:15.358530Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:15.358530Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"TempCompass: Do Video LLMs Really Understand Videos?","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Video LLMs exhibit notably poor temporal perception ability across aspects like speed and direction.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Lei Li, Lu Hou, Shicheng Li, Shuhuai Ren, Sishuo Chen, Xu Sun, Yi Liu, Yuanxin Liu, Yuxiang Wang","submitted_at":"2024-03-01T12:02:19Z","abstract_excerpt":"Recently, there is a surge in interest surrounding video large language models (Video LLMs). However, existing benchmarks fail to provide a comprehensive feedback on the temporal perception ability of Video LLMs. On the one hand, most of them are unable to distinguish between different temporal aspects (e.g., speed, direction) and thus cannot reflect the nuanced performance on these specific aspects. On the other hand, they are limited in the diversity of task formats (e.g., only multi-choice QA), which hinders the understanding of how temporal perception performance may vary across different "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Based on TempCompass, these models exhibit notably poor temporal perception ability.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the constructed conflicting videos successfully isolate specific temporal aspects without introducing unintended biases or allowing models to exploit other cues, and that the LLM-based automatic evaluation accurately reflects model performance.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Video LLMs exhibit notably poor temporal perception ability across aspects like speed and direction.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3a5dd8c75e0dc8b8d918ad27b7c2ab1ec77fef06481965d0222b403368d63d5d"},"source":{"id":"2403.00476","kind":"arxiv","version":3},"verdict":{"id":"94b6c738-6e1f-48d5-a67c-553b68b85c0b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T02:40:35.903339Z","strongest_claim":"Based on TempCompass, these models exhibit notably poor temporal perception ability.","one_line_summary":"TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the constructed conflicting videos successfully isolate specific temporal aspects without introducing unintended biases or allowing models to exploit other cues, and that the LLM-based automatic evaluation accurately reflects model performance.","pith_extraction_headline":"Video LLMs exhibit notably poor temporal perception ability across aspects like speed and direction."},"references":{"count":135,"sample":[{"doi":"","year":null,"title":"LLaMA: Open and Efficient Foundation Language Models , author=. ArXiv , year=","work_id":"8a8b63b4-c22e-413d-88f5-8753fc5f8402","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. ArXiv , year=","work_id":"abef9ec9-cc35-48c6-968c-28f788c4162f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"and Stoica, Ion and Xing, Eric P","work_id":"cb4b41f6-6d60-4db4-a4d1-6c5bb7899473","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Hashimoto , year =","work_id":"59352350-df66-4d75-a005-0d0cb02e8ccf","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Tom B. Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and Sandhini Agarwa","work_id":"15cd97b7-6e24-48b5-b218-f433fead09cd","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":135,"snapshot_sha256":"df3ccff164fbbda1d89bd03332bafc7f276da5e41436d1c51549a7153ca1cbb7","internal_anchors":27},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c074fe3136d412d21fb68ac13e935fa92aac864a349e25027700ccfdc7ad3862"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2403.00476","created_at":"2026-05-17T23:38:15.358631+00:00"},{"alias_kind":"arxiv_version","alias_value":"2403.00476v3","created_at":"2026-05-17T23:38:15.358631+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2403.00476","created_at":"2026-05-17T23:38:15.358631+00:00"},{"alias_kind":"pith_short_12","alias_value":"IL2GFSKZ2657","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"IL2GFSKZ2657G56S","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"IL2GFSKZ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":30,"internal_anchor_count":30,"sample":[{"citing_arxiv_id":"2605.23045","citing_title":"The TIME Machine: On The Power of Motion for Efficient Perception","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2501.02955","citing_title":"MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2511.20785","citing_title":"LongVT: Incentivizing \"Thinking with Long Videos\" via Native Tool Calling","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19559","citing_title":"EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23617","citing_title":"One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2506.05425","citing_title":"SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2501.01957","citing_title":"VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2407.03320","citing_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","ref_index":89,"is_internal_anchor":true},{"citing_arxiv_id":"2502.04326","citing_title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2505.21374","citing_title":"Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2503.13377","citing_title":"Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2512.03963","citing_title":"TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2512.02231","citing_title":"See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2512.06673","citing_title":"Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2512.13281","citing_title":"VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2512.21334","citing_title":"Streaming Video Instruction Tuning","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2603.18856","citing_title":"Motion-o: Trajectory-Grounded Video Reasoning","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13803","citing_title":"EvoGround: Self-Evolving Video Agents for Video Temporal Grounding","ref_index":67,"is_internal_anchor":true},{"citing_arxiv_id":"2501.13826","citing_title":"Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2604.01824","citing_title":"STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2503.21776","citing_title":"Video-R1: Reinforcing Video Reasoning in MLLMs","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11627","citing_title":"POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10517","citing_title":"From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2505.07062","citing_title":"Seed1.5-VL Technical Report","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2501.13106","citing_title":"VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding","ref_index":136,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/IL2GFSKZ2657G56SVJAYSIRIN6","json":"https://pith.science/pith/IL2GFSKZ2657G56SVJAYSIRIN6.json","graph_json":"https://pith.science/api/pith-number/IL2GFSKZ2657G56SVJAYSIRIN6/graph.json","events_json":"https://pith.science/api/pith-number/IL2GFSKZ2657G56SVJAYSIRIN6/events.json","paper":"https://pith.science/paper/IL2GFSKZ"},"agent_actions":{"view_html":"https://pith.science/pith/IL2GFSKZ2657G56SVJAYSIRIN6","download_json":"https://pith.science/pith/IL2GFSKZ2657G56SVJAYSIRIN6.json","view_paper":"https://pith.science/paper/IL2GFSKZ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2403.00476&json=true","fetch_graph":"https://pith.science/api/pith-number/IL2GFSKZ2657G56SVJAYSIRIN6/graph.json","fetch_events":"https://pith.science/api/pith-number/IL2GFSKZ2657G56SVJAYSIRIN6/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/IL2GFSKZ2657G56SVJAYSIRIN6/action/timestamp_anchor","attest_storage":"https://pith.science/pith/IL2GFSKZ2657G56SVJAYSIRIN6/action/storage_attestation","attest_author":"https://pith.science/pith/IL2GFSKZ2657G56SVJAYSIRIN6/action/author_attestation","sign_citation":"https://pith.science/pith/IL2GFSKZ2657G56SVJAYSIRIN6/action/citation_signature","submit_replication":"https://pith.science/pith/IL2GFSKZ2657G56SVJAYSIRIN6/action/replication_record"}},"created_at":"2026-05-17T23:38:15.358631+00:00","updated_at":"2026-05-17T23:38:15.358631+00:00"}