{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2021:PRRIYH5HWHYI4NRKDLGTZEH65O","short_pith_number":"pith:PRRIYH5H","schema_version":"1.0","canonical_sha256":"7c628c1fa7b1f08e362a1acd3c90feeb8e2077a23376125208c7e9d086d0d3ae","source":{"kind":"arxiv","id":"2111.08897","version":3},"attestation_state":"computed","paper":{"title":"ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data","license":"http://creativecommons.org/licenses/by/4.0/","headline":"ARKitScenes is the largest indoor RGB-D dataset captured with widely available mobile LiDAR sensors and includes laser-scanned depth plus manual 3D bounding box labels.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Afshin Dehghan, Arik Schwartz, Brandon Joffe, Daniel Kurz, Elad Shulman, Gilad Baruch, Peter Fu, Tal Dimry, Thomas Gebauer, Yuri Feigin, Zhuoyuan Chen","submitted_at":"2021-11-17T04:27:01Z","abstract_excerpt":"Scene understanding is an active research area. Commercial depth sensors, such as Kinect, have enabled the release of several RGB-D datasets over the past few years which spawned novel methods in 3D scene understanding. More recently with the launch of the LiDAR sensor in Apple's iPads and iPhones, high quality RGB-D data is accessible to millions of people on a device they commonly use. This opens a whole new era in scene understanding for the Computer Vision community as well as app developers. The fundamental research in scene understanding together with the advances in machine learning can"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2111.08897","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2021-11-17T04:27:01Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"1d9c2faad2ac5b1cb29b77b3c1a171793a4e2c0591f490e1224e248e99d896f1","abstract_canon_sha256":"32764f7a0c144c7bd2dab1ad2004ee0ea600580838bd74cdfe964d9fa0b5599c"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.744260Z","signature_b64":"FlcH4cpFWg3MCSDxcIpYcZjH1omaqKCarwZ7fdUw/jyg+FdOUoZSTnFiPTZGrWfBsV1LvAmGr1rpsV1HFOoYDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"7c628c1fa7b1f08e362a1acd3c90feeb8e2077a23376125208c7e9d086d0d3ae","last_reissued_at":"2026-05-17T23:38:52.743728Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.743728Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data","license":"http://creativecommons.org/licenses/by/4.0/","headline":"ARKitScenes is the largest indoor RGB-D dataset captured with widely available mobile LiDAR sensors and includes laser-scanned depth plus manual 3D bounding box labels.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Afshin Dehghan, Arik Schwartz, Brandon Joffe, Daniel Kurz, Elad Shulman, Gilad Baruch, Peter Fu, Tal Dimry, Thomas Gebauer, Yuri Feigin, Zhuoyuan Chen","submitted_at":"2021-11-17T04:27:01Z","abstract_excerpt":"Scene understanding is an active research area. Commercial depth sensors, such as Kinect, have enabled the release of several RGB-D datasets over the past few years which spawned novel methods in 3D scene understanding. More recently with the launch of the LiDAR sensor in Apple's iPads and iPhones, high quality RGB-D data is accessible to millions of people on a device they commonly use. This opens a whole new era in scene understanding for the Computer Vision community as well as app developers. The fundamental research in scene understanding together with the advances in machine learning can"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"ARKitScenes is not only the first RGB-D dataset captured with a now widely available depth sensor, but to our best knowledge, it also is the largest indoor scene understanding data released.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the mobile RGB-D captures, laser-scanned depth maps, and manual 3D bounding box labels are sufficiently accurate and representative of real-world indoor scenes to push state-of-the-art methods on the two downstream tasks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ARKitScenes is the largest real-world indoor RGB-D dataset captured with mobile LiDAR, including high-resolution depth maps and 3D furniture bounding box annotations for advancing object detection and depth upsampling.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"ARKitScenes is the largest indoor RGB-D dataset captured with widely available mobile LiDAR sensors and includes laser-scanned depth plus manual 3D bounding box labels.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3c387f667a4e6222b478ef02458f17c5b950c9d51c3ae7e0ff30d381ea12702a"},"source":{"id":"2111.08897","kind":"arxiv","version":3},"verdict":{"id":"e7b663bd-2e04-4d98-a9db-f5bda035c67a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T10:41:47.251267Z","strongest_claim":"ARKitScenes is not only the first RGB-D dataset captured with a now widely available depth sensor, but to our best knowledge, it also is the largest indoor scene understanding data released.","one_line_summary":"ARKitScenes is the largest real-world indoor RGB-D dataset captured with mobile LiDAR, including high-resolution depth maps and 3D furniture bounding box annotations for advancing object detection and depth upsampling.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the mobile RGB-D captures, laser-scanned depth maps, and manual 3D bounding box labels are sufficiently accurate and representative of real-world indoor scenes to push state-of-the-art methods on the two downstream tasks.","pith_extraction_headline":"ARKitScenes is the largest indoor RGB-D dataset captured with widely available mobile LiDAR sensors and includes laser-scanned depth plus manual 3D bounding box labels."},"references":{"count":46,"sample":[{"doi":"","year":2019,"title":"3d-sis: 3d semantic instance segmentation of rgb-d scans","work_id":"539e33f7-2816-4473-843a-b15021c485ff","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Gspn: Generative shape proposal network for 3d instance segmentation in point cloud","work_id":"07e7b214-eaae-4bd1-b860-e357578e92e9","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Sgpn: Similarity group proposal network for 3d point cloud instance segmentation","work_id":"57c8f838-5ec0-4798-b9a0-f7aa85147d09","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Deep hough voting for 3d object detection in point clouds","work_id":"c0b97321-80f0-4ac6-8485-2c4888fa836e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Qi, Xinlei Chen, and Leonidas J","work_id":"d4ca096c-778f-41fa-896e-6997ba6bd3f6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":46,"snapshot_sha256":"60741609255dbc446c82bc40a32e889b0254d68727f86a88d05548cbd9c0368d","internal_anchors":2},"formal_canon":{"evidence_count":2,"snapshot_sha256":"21450e658b5260556d840098a83bc3ce8df5ac576c70db6ba09db9bfcf857134"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2111.08897","created_at":"2026-05-17T23:38:52.743818+00:00"},{"alias_kind":"arxiv_version","alias_value":"2111.08897v3","created_at":"2026-05-17T23:38:52.743818+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2111.08897","created_at":"2026-05-17T23:38:52.743818+00:00"},{"alias_kind":"pith_short_12","alias_value":"PRRIYH5HWHYI","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"PRRIYH5HWHYI4NRK","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"PRRIYH5H","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":38,"internal_anchor_count":38,"sample":[{"citing_arxiv_id":"2605.07287","citing_title":"SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16258","citing_title":"IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22570","citing_title":"VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22020","citing_title":"ForeSplat: Optimization-Aware Foresight for Feed-Forward 3D Gaussian Splatting","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04128","citing_title":"JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21131","citing_title":"UniT: Unified Geometry Learning with Group Autoregressive Transformer","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16258","citing_title":"IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20165","citing_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2505.20279","citing_title":"VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2510.00978","citing_title":"A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2511.11232","citing_title":"DoReMi: Bridging 3D Domains via Topology-Aware Domain-Representation Mixture of Experts","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2511.16567","citing_title":"POMA-3D: The Point Map Way to 3D Scene Understanding","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2512.09373","citing_title":"FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2512.17817","citing_title":"Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2508.10934","citing_title":"ViPE: Video Pose Engine for 3D Geometric Perception","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04415","citing_title":"Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2507.11539","citing_title":"Streaming 4D Visual Geometry Transformer","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04385","citing_title":"ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2603.13091","citing_title":"Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2603.17980","citing_title":"Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27437","citing_title":"SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02546","citing_title":"Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2507.13347","citing_title":"$\\pi^3$: Permutation-Equivariant Visual Geometry Learning","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10106","citing_title":"ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09899","citing_title":"Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection","ref_index":25,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/PRRIYH5HWHYI4NRKDLGTZEH65O","json":"https://pith.science/pith/PRRIYH5HWHYI4NRKDLGTZEH65O.json","graph_json":"https://pith.science/api/pith-number/PRRIYH5HWHYI4NRKDLGTZEH65O/graph.json","events_json":"https://pith.science/api/pith-number/PRRIYH5HWHYI4NRKDLGTZEH65O/events.json","paper":"https://pith.science/paper/PRRIYH5H"},"agent_actions":{"view_html":"https://pith.science/pith/PRRIYH5HWHYI4NRKDLGTZEH65O","download_json":"https://pith.science/pith/PRRIYH5HWHYI4NRKDLGTZEH65O.json","view_paper":"https://pith.science/paper/PRRIYH5H","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2111.08897&json=true","fetch_graph":"https://pith.science/api/pith-number/PRRIYH5HWHYI4NRKDLGTZEH65O/graph.json","fetch_events":"https://pith.science/api/pith-number/PRRIYH5HWHYI4NRKDLGTZEH65O/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/PRRIYH5HWHYI4NRKDLGTZEH65O/action/timestamp_anchor","attest_storage":"https://pith.science/pith/PRRIYH5HWHYI4NRKDLGTZEH65O/action/storage_attestation","attest_author":"https://pith.science/pith/PRRIYH5HWHYI4NRKDLGTZEH65O/action/author_attestation","sign_citation":"https://pith.science/pith/PRRIYH5HWHYI4NRKDLGTZEH65O/action/citation_signature","submit_replication":"https://pith.science/pith/PRRIYH5HWHYI4NRKDLGTZEH65O/action/replication_record"}},"created_at":"2026-05-17T23:38:52.743818+00:00","updated_at":"2026-05-17T23:38:52.743818+00:00"}