{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:HS5S2APJEFBPZE5OGTD5PAFGWU","short_pith_number":"pith:HS5S2APJ","schema_version":"1.0","canonical_sha256":"3cbb2d01e92142fc93ae34c7d780a6b5155da6d2bc87b6c8db06e493bb4d329c","source":{"kind":"arxiv","id":"2505.23747","version":1},"attestation_state":"computed","paper":{"title":"Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Spatial-MLLM equips multimodal language models with stronger 3D spatial reasoning using only 2D image and video inputs.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CV","authors_text":"Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan","submitted_at":"2025-05-29T17:59:04Z","abstract_excerpt":"Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understandi"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2505.23747","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-05-29T17:59:04Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"f3bf03a7423e285470a8a9b66de09bad5f8da3292e1f37ffdaba5dd24a172cd6","abstract_canon_sha256":"147e0d7fc614f806400cd4c3204facd9d7188c981e2bb33eef44a872e1a7b4bd"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.490665Z","signature_b64":"ZGOBgO4Wnyt0JQYUQ2nwImSoqw7p/as1jAusCNV56A8gX/mskEMIB5ZVeAlcbjDD/htmzB7DWJRNHw9ZWqjmAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"3cbb2d01e92142fc93ae34c7d780a6b5155da6d2bc87b6c8db06e493bb4d329c","last_reissued_at":"2026-05-17T23:38:48.490187Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.490187Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Spatial-MLLM equips multimodal language models with stronger 3D spatial reasoning using only 2D image and video inputs.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CV","authors_text":"Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan","submitted_at":"2025-05-29T17:59:04Z","abstract_excerpt":"Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understandi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"our spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"that initializing a spatial encoder from the backbone of a feed-forward visual geometry foundation model will reliably extract usable 3D structure features from purely 2D image or video inputs without any 3D supervision","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Spatial-MLLM equips multimodal language models with stronger 3D spatial reasoning using only 2D image and video inputs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"34ef909fdb9faa426dfc176dea00e801f3d26d0aae596bb863cc382771694c27"},"source":{"id":"2505.23747","kind":"arxiv","version":1},"verdict":{"id":"0371eabb-c042-4858-9b50-d798baf2a849","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:31:09.946115Z","strongest_claim":"our spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks","one_line_summary":"Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"that initializing a spatial encoder from the backbone of a feed-forward visual geometry foundation model will reliably extract usable 3D structure features from purely 2D image or video inputs without any 3D supervision","pith_extraction_headline":"Spatial-MLLM equips multimodal language models with stronger 3D spatial reasoning using only 2D image and video inputs."},"references":{"count":71,"sample":[{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning,","work_id":"2aa7036b-9bcf-4f86-9e62-24ceaf7eaea7","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,","work_id":"9e1df70c-c5c8-459c-a56f-57934f6fd012","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”NeurIPS, 2024","work_id":"179f5e44-6c7c-4d41-833a-07c4934c1327","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context","work_id":"80e3e977-f1bb-4c83-8d0c-1ab0a0c5c3f1","ref_index":4,"cited_arxiv_id":"2403.05530","is_internal_anchor":true},{"doi":"","year":2024,"title":"GPT-4o System Card","work_id":"f37bf1c7-4964-4e56-9762-d20da8d9009f","ref_index":5,"cited_arxiv_id":"2410.21276","is_internal_anchor":true}],"resolved_work":71,"snapshot_sha256":"ef71c75076133de46a9772759062ec25ae45903f6e0db99ba9a89d0437c298f8","internal_anchors":21},"formal_canon":{"evidence_count":2,"snapshot_sha256":"cc2ba9c2d17a48c092bf35e3b73d66b74e462d2389c374e9b88259aa2142b9f9"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2505.23747","created_at":"2026-05-17T23:38:48.490283+00:00"},{"alias_kind":"arxiv_version","alias_value":"2505.23747v1","created_at":"2026-05-17T23:38:48.490283+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2505.23747","created_at":"2026-05-17T23:38:48.490283+00:00"},{"alias_kind":"pith_short_12","alias_value":"HS5S2APJEFBP","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"HS5S2APJEFBPZE5O","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"HS5S2APJ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":31,"internal_anchor_count":31,"sample":[{"citing_arxiv_id":"2512.10719","citing_title":"SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22558","citing_title":"GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22536","citing_title":"SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15876","citing_title":"Unlocking Dense Metric Depth Estimation in VLMs","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15876","citing_title":"Unlocking Dense Metric Depth Estimation in VLMs","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18018","citing_title":"See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20165","citing_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16899","citing_title":"LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2505.20279","citing_title":"VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2506.09082","citing_title":"AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models","ref_index":96,"is_internal_anchor":true},{"citing_arxiv_id":"2507.07982","citing_title":"Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21471","citing_title":"SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2602.04476","citing_title":"Vision-aligned Latent Reasoning for Multi-modal Large Language Model","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2602.11635","citing_title":"Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2603.03944","citing_title":"SCP: Spatial Causal Prediction in Video","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2603.17980","citing_title":"Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08991","citing_title":"PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03318","citing_title":"EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11462","citing_title":"SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26341","citing_title":"SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10106","citing_title":"ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10588","citing_title":"Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13321","citing_title":"Why MLLMs Struggle to Determine Object Orientations","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10789","citing_title":"ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11331","citing_title":"Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale","ref_index":74,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU","json":"https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU.json","graph_json":"https://pith.science/api/pith-number/HS5S2APJEFBPZE5OGTD5PAFGWU/graph.json","events_json":"https://pith.science/api/pith-number/HS5S2APJEFBPZE5OGTD5PAFGWU/events.json","paper":"https://pith.science/paper/HS5S2APJ"},"agent_actions":{"view_html":"https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU","download_json":"https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU.json","view_paper":"https://pith.science/paper/HS5S2APJ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2505.23747&json=true","fetch_graph":"https://pith.science/api/pith-number/HS5S2APJEFBPZE5OGTD5PAFGWU/graph.json","fetch_events":"https://pith.science/api/pith-number/HS5S2APJEFBPZE5OGTD5PAFGWU/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU/action/timestamp_anchor","attest_storage":"https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU/action/storage_attestation","attest_author":"https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU/action/author_attestation","sign_citation":"https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU/action/citation_signature","submit_replication":"https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU/action/replication_record"}},"created_at":"2026-05-17T23:38:48.490283+00:00","updated_at":"2026-05-17T23:38:48.490283+00:00"}