pith:ZQIGTOMZ
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
Vision-language models improve spatial reasoning by drawing boxes and lines on images during thinking.
arxiv:2506.09965 v2 · 2025-06-11 · cs.CV · cs.AI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{ZQIGTOMZVV6NZV46VFDBKNA5GR}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
Basic drawing operations (annotating bounding boxes and drawing auxiliary lines) can be learned and used by LVLMs to achieve precise geometric understanding and continuous spatial tracking without specialized external perception tools.
VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.
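To make the two drawing operations named in the claims concrete, here is a minimal sketch of box annotation and auxiliary-line drawing on a character grid. It is not the authors' implementation: the names `blank_canvas`, `draw_box`, and `draw_line` are hypothetical, the grid stands in for an image, and the line routine is the standard Bresenham algorithm chosen only for illustration.

```python
from typing import List

Canvas = List[List[str]]  # hypothetical stand-in for an image: rows of single characters


def blank_canvas(width: int, height: int) -> Canvas:
    """Create an empty grid filled with '.' cells."""
    return [["." for _ in range(width)] for _ in range(height)]


def draw_box(canvas: Canvas, x0: int, y0: int, x1: int, y1: int, mark: str = "#") -> None:
    """Outline an axis-aligned box with inclusive corners (x0, y0) and (x1, y1)."""
    for x in range(x0, x1 + 1):
        canvas[y0][x] = mark
        canvas[y1][x] = mark
    for y in range(y0, y1 + 1):
        canvas[y][x0] = mark
        canvas[y][x1] = mark


def draw_line(canvas: Canvas, x0: int, y0: int, x1: int, y1: int, mark: str = "*") -> None:
    """Draw a straight line between two cells using Bresenham's algorithm."""
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        canvas[y0][x0] = mark
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


# Annotate two objects with boxes, then connect their centers with an auxiliary line.
canvas = blank_canvas(10, 6)
draw_box(canvas, 1, 1, 3, 3)
draw_box(canvas, 6, 2, 8, 4)
draw_line(canvas, 2, 2, 7, 3)
for row in canvas:
    print("".join(row))
```

In this sketch the line overwrites box edges where they cross, which keeps the code short; a real renderer would more likely draw on separate layers or in distinct colors.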
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:15.062127Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "17d4e617e19f707f2010efdcf41dccb3a32e052c43e04f103239214ff95a16ae",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-06-11T17:41:50Z",
    "title_canon_sha256": "49c6f98b018f769c5ca15f31125e0aa5c5cda43b0936e558bf06194b35f34ade"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2506.09965",
    "kind": "arxiv",
    "version": 2
  }
}
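The same check can be run offline against a saved copy of the record. The sketch below mirrors the serialization used in the shell one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) and hashes the result with SHA-256. The function name `canonical_sha256` is a hypothetical helper, and the sketch assumes the record shown above is exactly what the API returns under `canonical_record`.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Local copy of the canonical record listed above.
record = {
    "metadata": {
        "abstract_canon_sha256": "17d4e617e19f707f2010efdcf41dccb3a32e052c43e04f103239214ff95a16ae",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-06-11T17:41:50Z",
        "title_canon_sha256": "49c6f98b018f769c5ca15f31125e0aa5c5cda43b0936e558bf06194b35f34ade",
    },
    "schema_version": "1.0",
    "source": {"id": "2506.09965", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye against the canonical hash listed on this page
```

Because the keys are sorted before hashing, the digest does not depend on the order in which fields appear in the saved file. Any change to a value, including whitespace inside a string, produces a different digest.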