{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:YN5B5WVAMMGQXJPXQUYQ6VVC5W","short_pith_number":"pith:YN5B5WVA","schema_version":"1.0","canonical_sha256":"c37a1edaa0630d0ba5f785310f56a2edacfb47d547c81c6bb6bb843d8a586954","source":{"kind":"arxiv","id":"2402.10885","version":3},"attestation_state":"computed","paper":{"title":"3D Diffuser Actor: Policy Diffusion with 3D Scene Representations","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"A diffusion policy that denoises 3D robot pose trajectories from tokenized scene features, language, and proprioception sets new performance records on standard robot benchmarks.","cross_cats":["cs.AI","cs.CV","cs.LG"],"primary_cat":"cs.RO","authors_text":"Katerina Fragkiadaki, Nikolaos Gkanatsios, Tsung-Wei Ke","submitted_at":"2024-02-16T18:43:02Z","abstract_excerpt":"Diffusion policies are conditional diffusion models that learn robot action distributions conditioned on the robot and environment state. They have recently shown to outperform both deterministic and alternative action distribution learning formulations. 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have shown to generalize better than their 2D counterparts across camera viewpoints. 
We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer tha"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2402.10885","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","primary_cat":"cs.RO","submitted_at":"2024-02-16T18:43:02Z","cross_cats_sorted":["cs.AI","cs.CV","cs.LG"],"title_canon_sha256":"af756eca498b5c4efcbd0b56e1ba0bd8267b043217f346ce988806667ee0678d","abstract_canon_sha256":"c890707274bd65d77bb4eacf6733d6faa9952bd72f7e0ae9fd9a02d63d3cba2e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:12.920442Z","signature_b64":"69QoAJfdudKrVqfLXfr+vHBN3ZjY0+hcsjcMWwAkwvOH1iJ5CKm9SzqPVXNovjTlJM8sehfxYU9swFbXQ7OuCQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c37a1edaa0630d0ba5f785310f56a2edacfb47d547c81c6bb6bb843d8a586954","last_reissued_at":"2026-05-17T23:38:12.919758Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:12.919758Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"3D Diffuser Actor: Policy Diffusion with 3D Scene Representations","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"A diffusion policy that denoises 3D robot pose trajectories from tokenized scene features, language, and proprioception sets new performance records on standard robot benchmarks.","cross_cats":["cs.AI","cs.CV","cs.LG"],"primary_cat":"cs.RO","authors_text":"Katerina Fragkiadaki, 
Nikolaos Gkanatsios, Tsung-Wei Ke","submitted_at":"2024-02-16T18:43:02Z","abstract_excerpt":"Diffusion policies are conditional diffusion models that learn robot action distributions conditioned on the robot and environment state. They have recently shown to outperform both deterministic and alternative action distribution learning formulations. 3D robot policies use 3D scene feature representations aggregated from a single or multiple camera views using sensed depth. They have shown to generalize better than their 2D counterparts across camera viewpoints. We unify these two lines of work and present 3D Diffuser Actor, a neural policy equipped with a novel 3D denoising transformer tha"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"3D Diffuser Actor sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA on a multi-view setup and an absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it improves over the current SOTA by a 9% relative increase.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 3D scene features aggregated from depth images remain sufficiently accurate and viewpoint-invariant when camera placement or lighting changes in ways not seen during training.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"3D Diffuser Actor unifies diffusion policies with 3D scene features to set new state-of-the-art results on RLBench and CALVIN robot benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A diffusion policy that denoises 3D robot pose trajectories from tokenized scene features, language, and proprioception sets new performance records on standard robot 
benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"614b51fef3a1137ee9632e4500d42d3709f7ccd42fa972880d20f8ddca6c8c47"},"source":{"id":"2402.10885","kind":"arxiv","version":3},"verdict":{"id":"2513b553-d607-4398-b189-932cba79db62","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T21:56:13.325864Z","strongest_claim":"3D Diffuser Actor sets a new state-of-the-art on RLBench with an absolute performance gain of 18.1% over the current SOTA on a multi-view setup and an absolute gain of 13.1% on a single-view setup. On the CALVIN benchmark, it improves over the current SOTA by a 9% relative increase.","one_line_summary":"3D Diffuser Actor unifies diffusion policies with 3D scene features to set new state-of-the-art results on RLBench and CALVIN robot benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 3D scene features aggregated from depth images remain sufficiently accurate and viewpoint-invariant when camera placement or lighting changes in ways not seen during training.","pith_extraction_headline":"A diffusion policy that denoises 3D robot pose trajectories from tokenized scene features, language, and proprioception sets new performance records on standard robot benchmarks."},"references":{"count":120,"sample":[{"doi":"","year":2016,"title":"Generative adversarial imitation learning","work_id":"fd44aba9-c117-44eb-bb35-f653404cb280","ref_index":1,"cited_arxiv_id":"1606.03476","is_internal_anchor":true},{"doi":"","year":2022,"title":"Y. Tsurumine and T. Matsubara. 
Goal-aware generative adversarial imitation learning from imperfect demonstration for robotic cloth manipulation, 2022","work_id":"61e50760-e7da-415e-87bd-86c2f49b486a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Multi-Modal Imitation Learning from Unstructured Demonstrations using Generative Adversarial Nets","work_id":"3dc910b2-2fab-4166-8515-d71eeb08b512","ref_index":4,"cited_arxiv_id":"1705.10479","is_internal_anchor":true},{"doi":"","year":2022,"title":"N. M. M. Shafiullah, Z. J. Cui, A. Altanzaya, and L. Pinto. Behavior transformers: Cloning k modes with one stone, 2022","work_id":"77a24d72-f06e-4a9f-8b37-9dff40dad0a6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"T. Pearce, T. Rashid, A. Kanervisto, D. Bignell, M. Sun, R. Georgescu, S. V. Macua, S. Z. Tan, I. Momennejad, K. Hofmann, and S. Devlin. Imitating human behaviour with diffusion models, 2023","work_id":"6ac53630-0210-4ef1-b738-06c4723c23aa","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":120,"snapshot_sha256":"dc464668ccdb406a8ddba3d84af3edf2f710745b3951693a4d10217f68aa8b1c","internal_anchors":16},"formal_canon":{"evidence_count":2,"snapshot_sha256":"4a90424be26704b8356db32459bf7342bc59f4b53c9f7df1df43ccc4540c1723"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2402.10885","created_at":"2026-05-17T23:38:12.919897+00:00"},{"alias_kind":"arxiv_version","alias_value":"2402.10885v3","created_at":"2026-05-17T23:38:12.919897+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2402.10885","created_at":"2026-05-17T23:38:12.919897+00:00"},{"alias_kind":"pith_short_12","alias_value":"YN5B5WVAMMGQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"YN5B5
WVAMMGQXJPX","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"YN5B5WVA","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2506.14135","citing_title":"GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2405.14093","citing_title":"A Survey on Vision-Language-Action Models for Embodied AI","ref_index":113,"is_internal_anchor":true},{"citing_arxiv_id":"2509.19102","citing_title":"FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20856","citing_title":"DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21258","citing_title":"Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21460","citing_title":"HITL-D: Human In The Loop Diffusion Assisted Shared Control","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15536","citing_title":"SkiP: When to Skip and When to Refine for Efficient Robot Manipulation","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2509.21723","citing_title":"VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2511.13312","citing_title":"EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14058","citing_title":"What Matters in Building Vision-Language-Action Models for 
Generalist Robots","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2507.04447","citing_title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2601.07060","citing_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2503.10631","citing_title":"HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2502.05855","citing_title":"DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13428","citing_title":"SID: Sliding into Distribution for Robust Few-Demonstration Manipulation","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12369","citing_title":"GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14803","citing_title":"Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09989","citing_title":"StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2410.06158","citing_title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21914","citing_title":"VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2511.04831","citing_title":"Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot 
Learning","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11386","citing_title":"ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2506.18088","citing_title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05673","citing_title":"Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05656","citing_title":"SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation","ref_index":28,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/YN5B5WVAMMGQXJPXQUYQ6VVC5W","json":"https://pith.science/pith/YN5B5WVAMMGQXJPXQUYQ6VVC5W.json","graph_json":"https://pith.science/api/pith-number/YN5B5WVAMMGQXJPXQUYQ6VVC5W/graph.json","events_json":"https://pith.science/api/pith-number/YN5B5WVAMMGQXJPXQUYQ6VVC5W/events.json","paper":"https://pith.science/paper/YN5B5WVA"},"agent_actions":{"view_html":"https://pith.science/pith/YN5B5WVAMMGQXJPXQUYQ6VVC5W","download_json":"https://pith.science/pith/YN5B5WVAMMGQXJPXQUYQ6VVC5W.json","view_paper":"https://pith.science/paper/YN5B5WVA","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2402.10885&json=true","fetch_graph":"https://pith.science/api/pith-number/YN5B5WVAMMGQXJPXQUYQ6VVC5W/graph.json","fetch_events":"https://pith.science/api/pith-number/YN5B5WVAMMGQXJPXQUYQ6VVC5W/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/YN5B5WVAMMGQXJPXQUYQ6VVC5W/action/timestamp_anchor","attest_storage":"https://pith.science/pith/YN5B5WVAMMGQXJPXQUYQ6VVC5W/action/storage_attestation","attest_author":"https://pith.science/pith/YN5B5WVAMMGQXJPXQUYQ6VVC5W/action/a
uthor_attestation","sign_citation":"https://pith.science/pith/YN5B5WVAMMGQXJPXQUYQ6VVC5W/action/citation_signature","submit_replication":"https://pith.science/pith/YN5B5WVAMMGQXJPXQUYQ6VVC5W/action/replication_record"}},"created_at":"2026-05-17T23:38:12.919897+00:00","updated_at":"2026-05-17T23:38:12.919897+00:00"}