{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:ZQIGTOMZVV6NZV46VFDBKNA5GR","short_pith_number":"pith:ZQIGTOMZ","schema_version":"1.0","canonical_sha256":"cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984","source":{"kind":"arxiv","id":"2506.09965","version":2},"attestation_state":"computed","paper":{"title":"Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"Vision-language models improve spatial reasoning by drawing boxes and lines on images during thinking.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Jian Guan, Junfei Wu, Kaituo Feng, Liang Wang, Qiang Liu, Shu Wu, Tieniu Tan, Wei Wu","submitted_at":"2025-06-11T17:41:50Z","abstract_excerpt":"As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2506.09965","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","primary_cat":"cs.CV","submitted_at":"2025-06-11T17:41:50Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"49c6f98b018f769c5ca15f31125e0aa5c5cda43b0936e558bf06194b35f34ade","abstract_canon_sha256":"17d4e617e19f707f2010efdcf41dccb3a32e052c43e04f103239214ff95a16ae"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:15.062806Z","signature_b64":"nNKpXqQTpdER9mTK6inaYsl3xqV1kllBoz471cf00j88MmkjI41bcIULmReUGhtyp6OmmTt9texZBhgmwWHuAQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"cc1069b999ad7cdcd79ea94615341d3443697b464b87cd791eda08e027553984","last_reissued_at":"2026-05-17T23:38:15.062127Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:15.062127Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"Vision-language models improve spatial reasoning by drawing boxes and lines on images during thinking.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Jian Guan, Junfei Wu, Kaituo Feng, Liang Wang, Qiang Liu, Shu Wu, Tieniu Tan, Wei Wu","submitted_at":"2025-06-11T17:41:50Z","abstract_excerpt":"As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That basic drawing operations (annotating bounding boxes and drawing auxiliary lines) can be learned and used by LVLMs to achieve precise geometric understanding and continuous spatial tracking without specialized external perception tools.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Vision-language models improve spatial reasoning by drawing boxes and lines on images during thinking.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3bf2a0083579bee24f385826283e3fe21b7a9747c7c929c43ba0939380ffd02e"},"source":{"id":"2506.09965","kind":"arxiv","version":2},"verdict":{"id":"47369373-6a09-416d-b94f-d11b2d96a3cb","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T04:53:42.203896Z","strongest_claim":"our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, involving maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.","one_line_summary":"VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That basic drawing operations (annotating bounding boxes and drawing auxiliary lines) can be learned and used by LVLMs to achieve precise geometric understanding and continuous spatial tracking without specialized external perception tools.","pith_extraction_headline":"Vision-language models improve spatial reasoning by drawing boxes and lines on images during thinking."},"references":{"count":82,"sample":[{"doi":"","year":2024,"title":"Self-RAG: Learning to retrieve, generate, and critique through self-reflection","work_id":"da3d632e-f0c3-464e-9cae-ad09201f96eb","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":2,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":2008,"title":"Spatial cognition and the brain","work_id":"f820e4fa-cc03-42e8-aab3-bd12faf6dc43","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Spatialbot: Precise spatial understanding with vision language models, 2025","work_id":"2b053aa0-1df8-4728-aa94-69ac92bbf110","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Spatialvlm: Endowing vision-language models with spatial reasoning capabilities","work_id":"62d217f3-07e5-4f8c-8252-21b4514bbae2","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":82,"snapshot_sha256":"6ce627b12ae6db4de84fc4bad84824023d061cdac82ba10da7f4980c0fe2dcdc","internal_anchors":11},"formal_canon":{"evidence_count":1,"snapshot_sha256":"2acacb8b411c1ec02311b29a916a4620c9b9f7637127543c945c6ca860977935"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2506.09965","created_at":"2026-05-17T23:38:15.062232+00:00"},{"alias_kind":"arxiv_version","alias_value":"2506.09965v2","created_at":"2026-05-17T23:38:15.062232+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2506.09965","created_at":"2026-05-17T23:38:15.062232+00:00"},{"alias_kind":"pith_short_12","alias_value":"ZQIGTOMZVV6N","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"ZQIGTOMZVV6NZV46","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"ZQIGTOMZ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":24,"internal_anchor_count":24,"sample":[{"citing_arxiv_id":"2603.28767","citing_title":"Gen-Searcher: Reinforcing Agentic Search for Image Generation","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2511.20785","citing_title":"LongVT: Incentivizing \"Thinking with Long Videos\" via Native Tool Calling","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22558","citing_title":"GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20784","citing_title":"Interaction Locality in Hierarchical Recursive Reasoning","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16079","citing_title":"VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18740","citing_title":"Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20165","citing_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2512.03043","citing_title":"OneThinker: All-in-one Reasoning Model for Image and Video","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2512.16918","citing_title":"AdaTooler-V: Adaptive Tool-Use for Images and Videos","ref_index":71,"is_internal_anchor":true},{"citing_arxiv_id":"2603.01070","citing_title":"How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27494","citing_title":"Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2603.28767","citing_title":"Gen-Searcher: Reinforcing Agentic Search for Image Generation","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03318","citing_title":"EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2503.21776","citing_title":"Video-R1: Reinforcing Video Reasoning in MLLMs","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09449","citing_title":"SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10588","citing_title":"Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22875","citing_title":"SketchVLM: Vision language models can annotate images to explain thoughts and guide users","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19945","citing_title":"Visual Reasoning through Tool-supervised Reinforcement Learning","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11025","citing_title":"Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07296","citing_title":"OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07148","citing_title":"Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09712","citing_title":"LAST: Leveraging Tools as Hints to Enhance Spatial Reasoning for Multimodal Large Language Models","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17385","citing_title":"SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18484","citing_title":"XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments","ref_index":101,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR","json":"https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR.json","graph_json":"https://pith.science/api/pith-number/ZQIGTOMZVV6NZV46VFDBKNA5GR/graph.json","events_json":"https://pith.science/api/pith-number/ZQIGTOMZVV6NZV46VFDBKNA5GR/events.json","paper":"https://pith.science/paper/ZQIGTOMZ"},"agent_actions":{"view_html":"https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR","download_json":"https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR.json","view_paper":"https://pith.science/paper/ZQIGTOMZ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2506.09965&json=true","fetch_graph":"https://pith.science/api/pith-number/ZQIGTOMZVV6NZV46VFDBKNA5GR/graph.json","fetch_events":"https://pith.science/api/pith-number/ZQIGTOMZVV6NZV46VFDBKNA5GR/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR/action/timestamp_anchor","attest_storage":"https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR/action/storage_attestation","attest_author":"https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR/action/author_attestation","sign_citation":"https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR/action/citation_signature","submit_replication":"https://pith.science/pith/ZQIGTOMZVV6NZV46VFDBKNA5GR/action/replication_record"}},"created_at":"2026-05-17T23:38:15.062232+00:00","updated_at":"2026-05-17T23:38:15.062232+00:00"}