{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:C32FX3ZNYSA4UYVQ3E2E3C6UAA","short_pith_number":"pith:C32FX3ZN","schema_version":"1.0","canonical_sha256":"16f45bef2dc481ca62b0d9344d8bd40035eaae2c01e698f7c62fd2e37dd2db93","source":{"kind":"arxiv","id":"2503.20523","version":1},"attestation_state":"computed","paper":{"title":"GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GAIA-2 generates high-resolution multi-camera driving videos from structured inputs like vehicle dynamics, agent positions, and road semantics.","cross_cats":["cs.AI","cs.RO"],"primary_cat":"cs.CV","authors_text":"Anthony Hu, Elahe Arani, George Fedoseev, Gianluca Corrado, Jamie Shotton, Lloyd Russell, Lorenzo Bertoni","submitted_at":"2025-03-26T13:11:35Z","abstract_excerpt":"Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2503.20523","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-03-26T13:11:35Z","cross_cats_sorted":["cs.AI","cs.RO"],"title_canon_sha256":"da0c2c4ad3f32678ee5c887c2e1236f46fc1af634d54bd819e57d92fa8043270","abstract_canon_sha256":"fb599b61127c4e4ed010c5302ce50dd0aec525513754619b7e36da42bc535029"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.408059Z","signature_b64":"XjsWYMGNjjOi5eYBgmlrnyL8LivVxJB1JIaljFaQxrBNDTefPFjcsKbErslr30SXqaa2dcNn5q+pgq/QtgeiDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"16f45bef2dc481ca62b0d9344d8bd40035eaae2c01e698f7c62fd2e37dd2db93","last_reissued_at":"2026-05-17T23:38:52.407613Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.407613Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GAIA-2 generates high-resolution multi-camera driving videos from structured inputs like vehicle dynamics, agent positions, and road semantics.","cross_cats":["cs.AI","cs.RO"],"primary_cat":"cs.CV","authors_text":"Anthony Hu, Elahe Arani, George Fedoseev, Gianluca Corrado, Jamie Shotton, Lloyd Russell, Lorenzo Bertoni","submitted_at":"2025-03-26T13:11:35Z","abstract_excerpt":"Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the generated videos are sufficiently realistic, consistent, and free of artifacts to serve as effective training data for autonomous driving systems without introducing biases or failures when transferred to real vehicles.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GAIA-2 is a controllable latent diffusion world model that produces spatiotemporally consistent multi-view videos for autonomous driving simulation across diverse geographies.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"GAIA-2 generates high-resolution multi-camera driving videos from structured inputs like vehicle dynamics, agent positions, and road semantics.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ec5bedd139e49967a7040a66a5699b7a21c64d84bc56e00b50e24f66cf6a27ff"},"source":{"id":"2503.20523","kind":"arxiv","version":1},"verdict":{"id":"6b0b4c6c-f4d3-422a-9f1e-a00111b8d1e2","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T13:44:17.546267Z","strongest_claim":"GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments.","one_line_summary":"GAIA-2 is a controllable latent diffusion world model that produces spatiotemporally consistent multi-view videos for autonomous driving simulation across diverse geographies.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the generated videos are sufficiently realistic, consistent, and free of artifacts to serve as effective training data for autonomous driving systems without introducing biases or failures when transferred to real vehicles.","pith_extraction_headline":"GAIA-2 generates high-resolution multi-camera driving videos from structured inputs like vehicle dynamics, agent positions, and road semantics."},"references":{"count":52,"sample":[{"doi":"","year":2014,"title":"D. P. Kingma and M. Welling. Auto-encoding variational bayes.Proceedings of the International Conference on Learning Representations (ICLR) , 2014","work_id":"cdef2b5c-2c62-403f-b5ab-73923a82dc96","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Cosmos World Foundation Model Platform for Physical AI","work_id":"a2dba24c-318d-476a-8b21-4289c265810c","ref_index":2,"cited_arxiv_id":"2501.03575","is_internal_anchor":true},{"doi":"","year":2017,"title":"van den Oord, O","work_id":"f2b547b5-0061-4045-b0df-7bb17d07e755","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2021","work_id":"1cd163f3-4f30-4e48-a7b4-718334826ad9","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"GAIA-1: A Generative World Model for Autonomous Driving","work_id":"313484e6-a442-4522-8e19-d07e502844a8","ref_index":5,"cited_arxiv_id":"2309.17080","is_internal_anchor":true}],"resolved_work":52,"snapshot_sha256":"a949cb6e0ced50325d9127fde993e0e603ebb9517cff8ee25afb56966bb06950","internal_anchors":2},"formal_canon":{"evidence_count":2,"snapshot_sha256":"89ae83d22cc36270b3cf44eb377fcb7aa70f49f9bd04200753483d11be8459f0"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2503.20523","created_at":"2026-05-17T23:38:52.407678+00:00"},{"alias_kind":"arxiv_version","alias_value":"2503.20523v1","created_at":"2026-05-17T23:38:52.407678+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2503.20523","created_at":"2026-05-17T23:38:52.407678+00:00"},{"alias_kind":"pith_short_12","alias_value":"C32FX3ZNYSA4","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"C32FX3ZNYSA4UYVQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"C32FX3ZN","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":32,"internal_anchor_count":32,"sample":[{"citing_arxiv_id":"2605.11596","citing_title":"HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21061","citing_title":"Grounding Driving VLA via Inverse Kinematics","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10959","citing_title":"Ozone: A Unified Platform for Transportation Research","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18137","citing_title":"Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15391","citing_title":"PanoWorld: Geometry-Consistent Panoramic Video World Modeling","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2508.17588","citing_title":"HERO: Hierarchical Extrapolation and Refresh for Efficient World Models","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2510.24718","citing_title":"Generative View Stitching","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2512.20563","citing_title":"LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2512.23421","citing_title":"DriveLaW:Unifying Planning and Video Generation in a Latent Driving World","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2512.21714","citing_title":"AstraNav-World: World Model for Foresight Control and Consistency","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2602.06949","citing_title":"DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2601.20540","citing_title":"Advancing Open-source World Models","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2508.05635","citing_title":"Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2509.24527","citing_title":"Training Agents Inside of Scalable World Models","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13591","citing_title":"Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2603.28489","citing_title":"Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms","ref_index":172,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11596","citing_title":"HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27448","citing_title":"LA-Pose: Latent Action Pretraining Meets Pose Estimation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10858","citing_title":"Is Your Driving World Model an All-Around Player?","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21914","citing_title":"VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18468","citing_title":"Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01896","citing_title":"Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01694","citing_title":"Latent State Design for World Models under Sufficiency Constraints","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12857","citing_title":"Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11707","citing_title":"Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction","ref_index":60,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/C32FX3ZNYSA4UYVQ3E2E3C6UAA","json":"https://pith.science/pith/C32FX3ZNYSA4UYVQ3E2E3C6UAA.json","graph_json":"https://pith.science/api/pith-number/C32FX3ZNYSA4UYVQ3E2E3C6UAA/graph.json","events_json":"https://pith.science/api/pith-number/C32FX3ZNYSA4UYVQ3E2E3C6UAA/events.json","paper":"https://pith.science/paper/C32FX3ZN"},"agent_actions":{"view_html":"https://pith.science/pith/C32FX3ZNYSA4UYVQ3E2E3C6UAA","download_json":"https://pith.science/pith/C32FX3ZNYSA4UYVQ3E2E3C6UAA.json","view_paper":"https://pith.science/paper/C32FX3ZN","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2503.20523&json=true","fetch_graph":"https://pith.science/api/pith-number/C32FX3ZNYSA4UYVQ3E2E3C6UAA/graph.json","fetch_events":"https://pith.science/api/pith-number/C32FX3ZNYSA4UYVQ3E2E3C6UAA/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/C32FX3ZNYSA4UYVQ3E2E3C6UAA/action/timestamp_anchor","attest_storage":"https://pith.science/pith/C32FX3ZNYSA4UYVQ3E2E3C6UAA/action/storage_attestation","attest_author":"https://pith.science/pith/C32FX3ZNYSA4UYVQ3E2E3C6UAA/action/author_attestation","sign_citation":"https://pith.science/pith/C32FX3ZNYSA4UYVQ3E2E3C6UAA/action/citation_signature","submit_replication":"https://pith.science/pith/C32FX3ZNYSA4UYVQ3E2E3C6UAA/action/replication_record"}},"created_at":"2026-05-17T23:38:52.407678+00:00","updated_at":"2026-05-17T23:38:52.407678+00:00"}