{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:F5ML3ZOQZMUC5AV5HCUUZNQI6T","short_pith_number":"pith:F5ML3ZOQ","schema_version":"1.0","canonical_sha256":"2f58bde5d0cb282e82bd38a94cb608f4ef63deee0c73b7a33b59f7092a56a960","source":{"kind":"arxiv","id":"2503.21755","version":2},"attestation_state":"computed","paper":{"title":"VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"VBench-2.0 introduces a benchmark that tests video generation models for intrinsic faithfulness to physical laws, human anatomy, and commonsense.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Dian Zheng, Fan Zhang, Hongbo Liu, Jingwen He, Kai Zou, Lulu Gu, Wei-Shi Zheng, Yinan He, Yuanhan Zhang, Yu Qiao, Ziqi Huang, Ziwei Liu","submitted_at":"2025-03-27T17:57:01Z","abstract_excerpt":"Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increas"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2503.21755","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-03-27T17:57:01Z","cross_cats_sorted":[],"title_canon_sha256":"4d572520ce77819b8c4ae41359fddabd170c5d9a37cb2cfec0b4517dcc41b34a","abstract_canon_sha256":"c68e17eaaa7dd1d5795a73093939a039493a00a463003e3cc3084962130eca80"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:22.166634Z","signature_b64":"B1poH29Xyt49qh80/tDeJomRZ0pbxuiuW6lyeloCCuVJwG3Tk+KMKSEFhhdTew9QU2124ea9WEYQrOE/9dIvAA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"2f58bde5d0cb282e82bd38a94cb608f4ef63deee0c73b7a33b59f7092a56a960","last_reissued_at":"2026-05-17T23:39:22.165900Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:22.165900Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"VBench-2.0 introduces a benchmark that tests video generation models for intrinsic faithfulness to physical laws, human anatomy, and commonsense.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Dian Zheng, Fan Zhang, Hongbo Liu, Jingwen He, Kai Zou, Lulu Gu, Wei-Shi Zheng, Yinan He, Yuanhan Zhang, Yu Qiao, Ziqi Huang, Ziwei Liu","submitted_at":"2025-03-27T17:57:01Z","abstract_excerpt":"Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increas"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That integration of SOTA VLMs, LLMs, and anomaly detection methods, validated by human annotations, will reliably measure intrinsic faithfulness without introducing new biases or missing subtle violations of physical and commonsense rules.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"VBench-2.0 introduces a benchmark that tests video generation models for intrinsic faithfulness to physical laws, human anatomy, and commonsense.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"09a8b2daea40c0c4dad42f39773ce27530bf573ea8d4fc6ce93473ba77384971"},"source":{"id":"2503.21755","kind":"arxiv","version":2},"verdict":{"id":"f50c705c-2fd6-4b3e-bd41-0e34a403590a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T18:37:01.650654Z","strongest_claim":"To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities.","one_line_summary":"VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That integration of SOTA VLMs, LLMs, and anomaly detection methods, validated by human annotations, will reliably measure intrinsic faithfulness without introducing new biases or missing subtle violations of physical and commonsense rules.","pith_extraction_headline":"VBench-2.0 introduces a benchmark that tests video generation models for intrinsic faithfulness to physical laws, human anatomy, and commonsense."},"references":{"count":96,"sample":[{"doi":"","year":2023,"title":"Magicedit: High-fidelity and temporally coherent video editing","work_id":"809e7a2a-81ca-495f-a5e0-ca72ba719aa7","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Stable video diffusion: A novel ap- proach to image-to-video generation.arXiv preprint arXiv:2308.09592, 2023","work_id":"ce4d5b88-8e62-4f9b-81c2-2521f978ec41","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"TokenFlow: Consistent Diffusion Features for Consistent Video Editing","work_id":"2e967b53-6386-4564-b468-ffd540817064","ref_index":3,"cited_arxiv_id":"2307.10373","is_internal_anchor":true},{"doi":"","year":2023,"title":"Inve: Interactive neural video editing,","work_id":"2f964c30-a7ac-4097-97f8-79f34d438b04","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Videdit: Zero-shot and spatially aware text-driven video editing,","work_id":"cd6d4d95-bc1d-4d8e-8c88-5b387f84ffb8","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":96,"snapshot_sha256":"e7dcbfecd26b2c27c6aa4b91d19adffd62a4120bcc6ab08a028ced2935a10768","internal_anchors":22},"formal_canon":{"evidence_count":3,"snapshot_sha256":"cea78f7dc9164844abd8988e65306b6794394e9b5008db2c2e88c65879ca7594"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2503.21755","created_at":"2026-05-17T23:39:22.166012+00:00"},{"alias_kind":"arxiv_version","alias_value":"2503.21755v2","created_at":"2026-05-17T23:39:22.166012+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2503.21755","created_at":"2026-05-17T23:39:22.166012+00:00"},{"alias_kind":"pith_short_12","alias_value":"F5ML3ZOQZMUC","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"F5ML3ZOQZMUC5AV5","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"F5ML3ZOQ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":46,"internal_anchor_count":46,"sample":[{"citing_arxiv_id":"2605.23699","citing_title":"CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23271","citing_title":"EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2512.01843","citing_title":"PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04727","citing_title":"Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08503","citing_title":"Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14382","citing_title":"Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20731","citing_title":"TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14382","citing_title":"Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16003","citing_title":"Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18303","citing_title":"PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18365","citing_title":"GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18396","citing_title":"NEWTON: Agentic Planning for Physically Grounded Video Generation","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19957","citing_title":"World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19728","citing_title":"Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2511.00503","citing_title":"Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models","ref_index":111,"is_internal_anchor":true},{"citing_arxiv_id":"2512.09299","citing_title":"VABench: A Comprehensive Benchmark for Audio-Video Generation","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2601.10632","citing_title":"CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos","ref_index":114,"is_internal_anchor":true},{"citing_arxiv_id":"2602.07775","citing_title":"Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion","ref_index":113,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13669","citing_title":"EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2603.18636","citing_title":"Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19092","citing_title":"RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15199","citing_title":"EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14269","citing_title":"PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14278","citing_title":"KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14382","citing_title":"Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation","ref_index":43,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/F5ML3ZOQZMUC5AV5HCUUZNQI6T","json":"https://pith.science/pith/F5ML3ZOQZMUC5AV5HCUUZNQI6T.json","graph_json":"https://pith.science/api/pith-number/F5ML3ZOQZMUC5AV5HCUUZNQI6T/graph.json","events_json":"https://pith.science/api/pith-number/F5ML3ZOQZMUC5AV5HCUUZNQI6T/events.json","paper":"https://pith.science/paper/F5ML3ZOQ"},"agent_actions":{"view_html":"https://pith.science/pith/F5ML3ZOQZMUC5AV5HCUUZNQI6T","download_json":"https://pith.science/pith/F5ML3ZOQZMUC5AV5HCUUZNQI6T.json","view_paper":"https://pith.science/paper/F5ML3ZOQ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2503.21755&json=true","fetch_graph":"https://pith.science/api/pith-number/F5ML3ZOQZMUC5AV5HCUUZNQI6T/graph.json","fetch_events":"https://pith.science/api/pith-number/F5ML3ZOQZMUC5AV5HCUUZNQI6T/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/F5ML3ZOQZMUC5AV5HCUUZNQI6T/action/timestamp_anchor","attest_storage":"https://pith.science/pith/F5ML3ZOQZMUC5AV5HCUUZNQI6T/action/storage_attestation","attest_author":"https://pith.science/pith/F5ML3ZOQZMUC5AV5HCUUZNQI6T/action/author_attestation","sign_citation":"https://pith.science/pith/F5ML3ZOQZMUC5AV5HCUUZNQI6T/action/citation_signature","submit_replication":"https://pith.science/pith/F5ML3ZOQZMUC5AV5HCUUZNQI6T/action/replication_record"}},"created_at":"2026-05-17T23:39:22.166012+00:00","updated_at":"2026-05-17T23:39:22.166012+00:00"}