Pith Number

pith:ZQIGTOMZ

pith:2025:ZQIGTOMZVV6NZV46VFDBKNA5GR

not attested · not anchored · not stored · refs resolved

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Jian Guan, Junfei Wu, Kaituo Feng, Liang Wang, Qiang Liu, Shu Wu, Tieniu Tan, Wei Wu

Vision-language models improve spatial reasoning by drawing boxes and lines on images during thinking.

arxiv:2506.09965 v2 · 2025-06-11 · cs.CV · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{ZQIGTOMZVV6NZV46VFDBKNA5GR}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.

C2 · weakest assumption

That basic drawing operations (annotating bounding boxes and drawing auxiliary lines) can be learned and used by LVLMs to achieve precise geometric understanding and continuous spatial tracking without specialized external perception tools.

C3 · one line summary

VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.

References

82 extracted · 82 resolved · 11 Pith anchors

[1] Self-RAG: Learning to retrieve, generate, and critique through self-reflection 2024

[2] Qwen2.5-VL Technical Report 2025 · arXiv:2502.13923

[3] Spatial cognition and the brain 2008

[4] Spatialbot: Precise spatial understanding with vision language models 2025

[5] Spatialvlm: Endowing vision-language models with spatial reasoning capabilities 2024

Formal links

1 machine-checked theorem link

Cited by

24 papers in Pith

Gen-Searcher: Reinforcing Agentic Search for Image Generation

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

Interaction Locality in Hierarchical Recursive Reasoning

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Receipt and verification

First computed	2026-05-17T23:38:15.062127Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984

Aliases

arxiv: 2506.09965 · arxiv_version: 2506.09965v2 · doi: 10.48550/arxiv.2506.09965 · pith_short_12: ZQIGTOMZVV6N · pith_short_16: ZQIGTOMZVV6NZV46 · pith_short_8: ZQIGTOMZ
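
The short aliases are prefixes of the full identifier. The full identifier in turn matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. This is an observation drawn from the values on this page, not a documented rule of the Pith schema. The sketch below checks it; the helper name `pith_from_hash` and the default length of 26 are assumptions made for illustration.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984"
PITH_NUMBER = "ZQIGTOMZVV6NZV46VFDBKNA5GR"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed derivation: base32-encode the raw digest and keep a prefix."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Compare the derived identifier with the one shown on the page.
print(pith_from_hash(CANONICAL_HASH) == PITH_NUMBER)

# The pith_short_8 / _12 / _16 aliases are shorter prefixes of the same string.
for n in (8, 12, 16):
    print(n, pith_from_hash(CANONICAL_HASH, n))
```

If the observation holds for other records, an identifier could be recomputed from a canonical hash alone, but that would need to be confirmed against the schema before relying on it.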

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984
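
The same check can be written as a small Python function, which makes the canonicalization step explicit: keys sorted, compact separators, no ASCII escaping, UTF-8 bytes, then SHA-256. This is a minimal sketch mirroring the one-liner above. The function names `canonical_sha256` and `verify` are illustrative, and fetching the record is left to the caller.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify(record: dict, expected_hex: str) -> bool:
    """Return True when the recomputed digest equals the published one."""
    return canonical_sha256(record) == expected_hex.lower()


# Toy records (not the full canonical record): key order does not affect the digest.
a = {"source": {"kind": "arxiv", "id": "2506.09965"}, "schema_version": "1.0"}
b = {"schema_version": "1.0", "source": {"id": "2506.09965", "kind": "arxiv"}}
print(canonical_sha256(a) == canonical_sha256(b))
```

To check this record, pass the parsed `canonical_record` object and the canonical hash listed on this page to `verify`.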

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "17d4e617e19f707f2010efdcf41dccb3a32e052c43e04f103239214ff95a16ae",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-06-11T17:41:50Z",
    "title_canon_sha256": "49c6f98b018f769c5ca15f31125e0aa5c5cda43b0936e558bf06194b35f34ade"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2506.09965",
    "kind": "arxiv",
    "version": 2
  }
}