Pith Number

pith:HS5S2APJ

pith:2025:HS5S2APJEFBPZE5OGTD5PAFGWU
not attested · not anchored · not stored · refs resolved

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

Spatial-MLLM equips multimodal language models with stronger 3D spatial reasoning using only 2D image and video inputs.

arxiv:2505.23747 v1 · 2025-05-29 · cs.CV · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{HS5S2APJEFBPZE5OGTD5PAFGWU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

our spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks

C2 · weakest assumption

that initializing a spatial encoder from the backbone of a feed-forward visual geometry foundation model will reliably extract usable 3D structure features from purely 2D image or video inputs without any 3D supervision

C3 · one line summary

Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.

References

71 extracted · 71 resolved · 21 Pith anchors

[1] Flamingo: a visual language model for few-shot learning, 2022
[2] Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023
[3] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," NeurIPS, 2024
[4] Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context 2024 · arXiv:2403.05530
[5] GPT-4o System Card 2024 · arXiv:2410.21276

Formal links

2 machine-checked theorem links

Cited by

31 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:48.490187Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

3cbb2d01e92142fc93ae34c7d780a6b5155da6d2bc87b6c8db06e493bb4d329c

Aliases

arxiv: 2505.23747 · arxiv_version: 2505.23747v1 · doi: 10.48550/arxiv.2505.23747 · pith_short_12: HS5S2APJEFBP · pith_short_16: HS5S2APJEFBPZE5O · pith_short_8: HS5S2APJ
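
For this record, the full Pith Number and its short aliases can be reproduced from the canonical hash: base32-encoding the raw SHA-256 bytes and keeping the first 26 characters yields the identifier, and each `pith_short_N` alias is its first N characters. The sketch below only demonstrates that observed relationship. That the builder derives identifiers this way in general is an assumption, and the function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as published on this record page.
CANONICAL_HASH = "3cbb2d01e92142fc93ae34c7d780a6b5155da6d2bc87b6c8db06e493bb4d329c"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes (RFC 4648 alphabet) and keep a prefix.

    Assumption: the 26-character length is inferred from this record only.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


full = pith_number_from_hash(CANONICAL_HASH)

# Short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)  # HS5S2APJEFBPZE5OGTD5PAFGWU
for name, value in aliases.items():
    print(name, value)
```

Because the aliases are prefixes, a shorter alias can match more than one record in principle; the full identifier or the canonical hash is the safer key for lookups.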
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/HS5S2APJEFBPZE5OGTD5PAFGWU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 3cbb2d01e92142fc93ae34c7d780a6b5155da6d2bc87b6c8db06e493bb4d329c
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "147e0d7fc614f806400cd4c3204facd9d7188c981e2bb33eef44a872e1a7b4bd",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-05-29T17:59:04Z",
    "title_canon_sha256": "f3bf03a7423e285470a8a9b66de09bad5f8da3292e1f37ffdaba5dd24a172cd6"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2505.23747",
    "kind": "arxiv",
    "version": 1
  }
}
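
The canonicalization step of the shell one-liner above can also be sketched as a standalone Python script: serialize the record with sorted keys and compact separators, then take the SHA-256 of the UTF-8 bytes. The record literal below is copied from this page. Whether it is byte-for-byte what the API returns under `canonical_record` is an assumption, so the script reports a match rather than asserting one. The function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

# Hash published on this page under "Canonical hash".
EXPECTED = "3cbb2d01e92142fc93ae34c7d780a6b5155da6d2bc87b6c8db06e493bb4d329c"

# Canonical record as displayed on this page (assumed complete).
record = {
    "metadata": {
        "abstract_canon_sha256": "147e0d7fc614f806400cd4c3204facd9d7188c981e2bb33eef44a872e1a7b4bd",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-05-29T17:59:04Z",
        "title_canon_sha256": "f3bf03a7423e285470a8a9b66de09bad5f8da3292e1f37ffdaba5dd24a172cd6",
    },
    "schema_version": "1.0",
    "source": {"id": "2505.23747", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj: dict) -> str:
    """Hash the JSON form with sorted keys and no insignificant whitespace."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the digest independent of key order and whitespace in the source JSON, which is what lets a mirror recompute the same value from its own copy.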