Pith Number

pith:ZQIGTOMZ

pith:2025:ZQIGTOMZVV6NZV46VFDBKNA5GR
not attested · not anchored · not stored · refs resolved

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Jian Guan, Junfei Wu, Kaituo Feng, Liang Wang, Qiang Liu, Shu Wu, Tieniu Tan, Wei Wu

Vision-language models improve spatial reasoning by drawing boxes and lines on images during thinking.

arxiv:2506.09965 v2 · 2025-06-11 · cs.CV · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{ZQIGTOMZVV6NZV46VFDBKNA5GR}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.

C2 · weakest assumption

That basic drawing operations (annotating bounding boxes and drawing auxiliary lines) can be learned and used by LVLMs to achieve precise geometric understanding and continuous spatial tracking without specialized external perception tools.

C3 · one line summary

VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.

References

82 extracted · 82 resolved · 11 Pith anchors

[1] Self-RAG: Learning to retrieve, generate, and critique through self-reflection 2024
[2] Qwen2.5-VL Technical Report 2025 · arXiv:2502.13923
[3] Spatial cognition and the brain 2008
[4] Spatialbot: Precise spatial understanding with vision language models 2025
[5] Spatialvlm: Endowing vision-language models with spatial reasoning capabilities 2024

Formal links

1 machine-checked theorem link

Cited by

24 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:15.062127Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984

Aliases

arxiv: 2506.09965 · arxiv_version: 2506.09965v2 · doi: 10.48550/arxiv.2506.09965 · pith_short_12: ZQIGTOMZVV6N · pith_short_16: ZQIGTOMZVV6NZV46 · pith_short_8: ZQIGTOMZ
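The short aliases above are prefixes of the full Pith Number, and the Pith Number itself appears to be a prefix of the Base32 encoding of the canonical hash. This page does not document that derivation, so the Python sketch below is only an observation checked against this one record. The 26-character length and the helper name `pith_number_from_hash` are assumptions, not part of any published Pith API.

```python
import base64

# Canonical hash exactly as printed on this record.
CANONICAL_HASH = "cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes (RFC 4648 alphabet) and keep a prefix.

    Assumption: the identifier is the first `length` Base32 characters.
    This matches the record shown here but is not a documented rule.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


pith_number = pith_number_from_hash(CANONICAL_HASH)

# The pith_short_N aliases are the first N characters of the identifier.
short_aliases = {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}

print(pith_number)
for name, value in short_aliases.items():
    print(name, value)
```

For this record the result agrees with the identifier and the three short aliases listed above; other records may follow a different rule.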
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "17d4e617e19f707f2010efdcf41dccb3a32e052c43e04f103239214ff95a16ae",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-06-11T17:41:50Z",
    "title_canon_sha256": "49c6f98b018f769c5ca15f31125e0aa5c5cda43b0936e558bf06194b35f34ade"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2506.09965",
    "kind": "arxiv",
    "version": 2
  }
}
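
The shell one-liner in the Agent API section can be unrolled into a short Python sketch: serialize the record with sorted keys and compact separators, encode it as UTF-8, and take the SHA-256 digest. The record literal is copied from the JSON above, and the function name `canonical_sha256` is hypothetical. Whether the printed digest equals the published canonical hash depends on the served record matching this copy exactly, so no expected value is asserted here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same canonical form as the one-liner:
    sorted keys, no extra whitespace, non-ASCII kept as-is, UTF-8 bytes."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "17d4e617e19f707f2010efdcf41dccb3a32e052c43e04f103239214ff95a16ae",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-06-11T17:41:50Z",
        "title_canon_sha256": "49c6f98b018f769c5ca15f31125e0aa5c5cda43b0936e558bf06194b35f34ade",
    },
    "schema_version": "1.0",
    "source": {"id": "2506.09965", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror happens to store the fields, which is what lets independent copies recompute the same value.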