arxiv: 2605.20743 · v1 · pith:7ODBBSTVnew · submitted 2026-05-20 · 💻 cs.CV · cs.CL

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

Pith reviewed 2026-05-21 06:02 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords geometry reasoning · vision-language models · constraint engine · GeoGebra · Propose-Draw-Verify loop · construction fidelity · spatial reasoning

The pith

Vision-language models improve geometry problem solving by interacting with a constraint engine to verify their drawings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can externalize their geometric reasoning through a Propose-Draw-Verify loop with the GeoGebra constraint engine. The loop turns latent inferences into explicit, checkable canvas states in which geometric relations are enforced algebraically. This matters because intermediate reasoning steps in current methods cannot be verified. Because the engine feeds back exact measurements and relations, the model can refine its reasoning from grounded observations rather than guesses. The abstract reports canvases that pass 95.9% of predicate-level construction checks on GeoGoal and outcome accuracy gains of up to 4.1% on planar and 16.4% on solid benchmarks.

Core claim

Draw2Think recasts geometric reasoning as agentic interaction with the GeoGebra constraint engine. In the Propose-Draw-Verify loop, hypotheses are externalized onto an executable canvas, exact geometric quantities are measured, and structured observations are fed back to the model, allowing subsequent reasoning to proceed from checked canvas state grounded by the shared workspace.
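The loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `ToyEngine` is a hypothetical stand-in for a constraint engine and is not the GeoGebra API, `propose` is a scripted stand-in for the frozen VLM, and the action tuples and observation fields are invented here rather than taken from the paper. The scripted model first proposes a midpoint over an undefined point, the engine rejects it, and the next proposal is repaired from that observation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToyEngine:
    """Hypothetical stand-in for a constraint engine (not the GeoGebra API)."""
    points: dict = field(default_factory=dict)

    def draw(self, action):
        """Apply one action; reject it if it refers to undefined objects."""
        kind, name, *args = action
        if kind == "point":
            x, y = args
            self.points[name] = (float(x), float(y))
        elif kind == "midpoint":
            a, b = args
            missing = [p for p in (a, b) if p not in self.points]
            if missing:
                return {"ok": False, "error": f"undefined point(s): {missing}"}
            (ax, ay), (bx, by) = self.points[a], self.points[b]
            # The midpoint is defined algebraically from its parents.
            self.points[name] = ((ax + bx) / 2, (ay + by) / 2)
        else:
            return {"ok": False, "error": f"unknown tool: {kind}"}
        return {"ok": True, "created": name, "coords": self.points[name]}

    def distance(self, a, b):
        (ax, ay), (bx, by) = self.points[a], self.points[b]
        return math.hypot(bx - ax, by - ay)


def propose(observations):
    """Scripted stand-in for the VLM: choose the next action from feedback."""
    plan = [("point", "A", 0, 0), ("point", "B", 6, 0), ("point", "C", 0, 8)]
    done = sum(1 for o in observations if o["ok"])
    if done < len(plan):
        return plan[done]
    if done == len(plan):
        # The first attempt refers to an undefined point D; once the engine
        # rejects it, the repaired proposal uses C instead.
        rejected = bool(observations) and not observations[-1]["ok"]
        return ("midpoint", "M", "B", "C" if rejected else "D")
    return None  # construction finished


def propose_draw_verify(max_steps=10):
    engine, observations = ToyEngine(), []
    for step in range(max_steps):
        action = propose(observations)          # Propose
        if action is None:
            break
        obs = engine.draw(action)               # Draw
        obs["step"], obs["action"] = step, action
        if obs["ok"] and "M" in engine.points:  # Verify: measure on the canvas
            obs["AM"] = engine.distance("A", "M")
        observations.append(obs)
    return engine, observations


engine, observations = propose_draw_verify()
```

In this toy run the fourth observation is the engine's rejection, and the fifth carries the measured length `AM` of the median to the hypotenuse of a 6-8-10 right triangle, which is 5.0.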

What carries the argument

The Propose-Draw-Verify loop, which externalizes hypotheses onto a constraint-checked evolving canvas and measures exact geometric quantities for feedback.

Load-bearing premise

The vision-language model can reliably interpret the structured observations from the constraint engine and use them to improve reasoning without introducing new errors.

What would settle it

An experiment in which the Propose-Draw-Verify loop yields no accuracy improvement over baseline methods without engine interaction, or in which construction pass rates turn out to be low.

Figures

Figures reproduced from arXiv: 2605.20743 by Jiawei Du, Joey Tianyi Zhou, Juncheng Hu, Xin Zhang.

Figure 1: Paradigms for externalizing intermediate geometry. Prior routes externalize intermediate geometry as visual artifacts, textual traces, or executable scripts. Draw2Think adds a constraint-agentic harness: a frozen VLM selects typed ToolSpecs, the GeoGebra engine updates an engine-valid canvas state, and structured observations return after each action. The distinction is less about externalizing state than …
Figure 2: Mechanism comparison on MathVista/290. Baseline applies alternate-interior-angles and …
Figure 3: (a) Faithful canvases (n=215) dominate unfaithful (n=41) at three Ti-match thresholds. (b) Structural Ti plateau at ∼88% (tolerance-invariant); numerical Ti climb to the engine-exact SR (dotted) as tolerance approaches ∼0.1% rel …

Original abstract

Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at https://draw2think.github.io/
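The gap between the 95.9% predicate-level and 84.0% strict problem-level figures is easier to read with the two aggregations side by side. The sketch below assumes one plausible reading, which the abstract does not spell out: the predicate-level rate pools every checked predicate across problems, while a problem passes strictly only if all of its predicates pass. The function name and the toy input are hypothetical and do not reproduce GeoGoal data.

```python
def construction_rates(results):
    """results: one list of booleans per problem, one boolean per predicate.

    Returns (predicate_level_rate, strict_problem_level_rate).
    """
    total_predicates = sum(len(problem) for problem in results)
    passed_predicates = sum(sum(problem) for problem in results)
    # Strict: a problem counts only if every one of its predicates holds.
    strict_passes = sum(1 for problem in results if problem and all(problem))
    return passed_predicates / total_predicates, strict_passes / len(results)


# Toy input: three problems, nine predicates, one failed predicate.
toy_results = [
    [True, True, True],
    [True, False, True, True],
    [True, True],
]
predicate_rate, strict_rate = construction_rates(toy_results)
```

Here a single failed predicate lowers the pooled rate only slightly (8 of 9) but fails its whole problem under the strict criterion (2 of 3), so the strict rate can never exceed the share of fully correct canvases.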

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Draw2Think, a framework that recasts geometric reasoning in vision-language models as agentic interaction with the GeoGebra constraint engine via a Propose-Draw-Verify loop. Hypotheses are externalized to an executable canvas, exact geometric quantities are measured, and structured observations are fed back so that subsequent reasoning proceeds from checked state. The central claims are high construction fidelity (95.9% predicate-level and 84.0% strict problem-level on GeoGoal), outcome accuracy gains (up to 4.1% planar / 16.4% solid), and rendering scores (68.2% strict / 90.5% relaxed on GenExam-math).

Significance. If the empirical claims hold after proper controls, the work supplies a concrete mechanism for making intermediate geometric states auditable and constraint-satisfying rather than latent, which could improve reliability of VLM-based geometry solvers. The separation of model-level Construction Fidelity from engine-level Measurement Faithfulness is a useful conceptual contribution.

major comments (2)
  1. [Abstract / Experimental Evaluation] Abstract and results sections: performance numbers (95.9% predicate-level checks, 84.0% strict problem-level, up to 4.1%/16.4% accuracy gains, 68.2%/90.5% rendering scores) are stated without any description of baselines, statistical significance tests, error bars, or train/test splits, so the central claim of improvement rests on incompletely reported evidence.
  2. [Method / Experiments] Propose-Draw-Verify loop (and any associated ablation or analysis sections): no experiment isolates whether the VLM actually consumes and incorporates the returned structured observations (predicate strings, numeric measurements) versus merely benefiting from the presence of a canvas. No traces of model input/output, no count of cases where the model hallucinates a relation contradicting engine state, and no ablation removing the Verify feedback are provided, leaving the load-bearing assumption that feedback improves subsequent reasoning unsupported.
minor comments (1)
  1. [Method] Clarify the precise string format and encoding of the structured observations that are appended to the VLM prompt after each Verify step.
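To make the minor comment concrete, here is one way a structured observation could be serialized before being appended to the prompt. This is purely a hypothetical format: the field names (`step`, `action`, `ok`, `measurements`, `relations`, `error`) and the choice of single-line JSON are assumptions for illustration, since the encoding the paper uses is not described on this page.

```python
import json


def format_observation(step, action, ok, measurements, relations, error=None):
    """Serialize one engine observation as a single deterministic JSON line.

    All field names are hypothetical; they are not taken from the paper.
    """
    payload = {
        "step": step,
        "action": action,
        "ok": ok,
        # Round so the prompt text is stable across platforms.
        "measurements": {k: round(v, 6) for k, v in measurements.items()},
        "relations": sorted(relations),
    }
    if error is not None:
        payload["error"] = error
    return json.dumps(payload, sort_keys=True)


line = format_observation(
    step=4,
    action=["midpoint", "M", "B", "C"],
    ok=True,
    measurements={"AM": 5.0},
    relations=["Midpoint(M, B, C)"],
)
```

A fixed key order and fixed rounding would make traces directly comparable across runs, which is what the second major comment asks for when it requests model input/output traces.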

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and have made revisions to improve the clarity and completeness of our experimental reporting and analysis.

Point-by-point responses
  1. Referee: [Abstract / Experimental Evaluation] Abstract and results sections: performance numbers (95.9% predicate-level checks, 84.0% strict problem-level, up to 4.1%/16.4% accuracy gains, 68.2%/90.5% rendering scores) are stated without any description of baselines, statistical significance tests, error bars, or train/test splits, so the central claim of improvement rests on incompletely reported evidence.

    Authors: We agree that the abstract could benefit from more context on the evaluation setup. The full paper details the baselines (including direct VLM prompting and other externalization methods) in Section 4.1, with results averaged over multiple runs and reported with standard deviations as error bars. Train/test splits follow the standard partitions of GeoGoal, GeoQA, and GenExam-math as described in Section 3. To make this more prominent, we will revise the abstract to briefly note the comparative evaluation and add explicit references to the statistical reporting in the results section. revision: yes

  2. Referee: [Method / Experiments] Propose-Draw-Verify loop (and any associated ablation or analysis sections): no experiment isolates whether the VLM actually consumes and incorporates the returned structured observations (predicate strings, numeric measurements) versus merely benefiting from the presence of a canvas. No traces of model input/output, no count of cases where the model hallucinates a relation contradicting engine state, and no ablation removing the Verify feedback are provided, leaving the load-bearing assumption that feedback improves subsequent reasoning unsupported.

    Authors: This is a valid point regarding the need for more direct evidence on the role of the Verify feedback. While the manuscript includes qualitative examples of the loop in Figure 3 and Section 3.2, and quantitative gains over baselines that lack the full loop, we did not include a specific ablation removing only the Verify step or input/output traces. We will add an ablation study comparing the full Propose-Draw-Verify to a Propose-Draw variant without feedback, along with sample traces of model reasoning before and after verification in the revised manuscript. This will better support the claim that the structured observations are incorporated. revision: yes
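The reporting promised in both responses (per-variant means with standard deviations, plus a comparison of the full loop against a Propose-Draw variant without feedback) could be summarized along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, runs are assumed to be paired by seed, and the accuracy values are synthetic placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize(accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return mean(accuracies), stdev(accuracies)


def paired_gain(full, ablated):
    """Mean and sample std of per-seed differences (full minus ablated)."""
    if len(full) != len(ablated):
        raise ValueError("runs must be paired by seed")
    diffs = [f - a for f, a in zip(full, ablated)]
    return mean(diffs), stdev(diffs)


# Synthetic placeholder accuracies for three seeds (NOT from the paper).
placeholder_full = [0.70, 0.72, 0.71]      # Propose-Draw-Verify
placeholder_ablated = [0.66, 0.69, 0.66]   # Propose-Draw, no Verify feedback

full_mean, full_std = summarize(placeholder_full)
gain_mean, gain_std = paired_gain(placeholder_full, placeholder_ablated)
```

Reporting the paired difference rather than two separate means isolates the effect of the Verify feedback from seed-to-seed variation, which is the quantity the second major comment is after.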

Circularity Check

0 steps flagged

No circularity: evaluations rest on external benchmarks and independent measurements

full rationale

The paper describes an agentic Propose-Draw-Verify loop that interacts with the GeoGebra constraint engine to externalize geometric hypotheses and obtain structured observations. All reported metrics—95.9% predicate-level and 84.0% problem-level construction checks on GeoGoal, accuracy gains on planar/solid benchmarks, and rendering scores on GenExam-math—are obtained by direct comparison against held-out external test sets and ground-truth constructions. No parameters are fitted to the target outcomes inside the paper, no equations reduce the claimed improvements to quantities defined by the same loop, and no self-citations supply the load-bearing justification for the core results. The framework is therefore evaluated against independent standards rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that VLMs can generate valid drawing actions and correctly interpret engine feedback; no free parameters or new invented entities are introduced.

axioms (1)
  • domain assumption: Vision-language models can generate drawing actions and interpret structured feedback from a geometry constraint engine to refine reasoning.
    Implicit in the description of the Propose-Draw-Verify loop and the claim that feedback improves subsequent steps.




Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · 9 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin, et al. Qwen2.5-VL technical report...

  2. [2]

    Automated theorem proving in GeoGebra: Current achievements

    Francisco Botana, Markus Hohenwarter, Predrag Janiˇci´c, Zoltán Kovács, Ivan Petrovi´c, Tomás Recio, and Simon Weitzhofer. Automated theorem proving in GeoGebra: Current achievements. Journal of Automated Reasoning, 55(4):339–360, 2015

  3. [3]

    Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward.arXiv preprint arXiv:2601.05073, 2026

    Jianlong Chen, Daocheng Fu, Shengze Xu, Jiawei Chen, Yuan Feng, Yue Yang, Junchi Yan, Hongyuan Zha, and Renqiu Xia. Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward.arXiv preprint arXiv:2601.05073, 2026

  4. [4]

    GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric Xing, and Liang Lin. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513– 523, 2021. doi: 10.18653/v1/2021.findings-acl.46. URL https://aclanthology.org/ 2021.fi...

  5. [5]

    UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression

    Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Pro- cessing, pages 3313–3323, 2022. doi: 10.18653/v1/2022.emnlp-main.218. URL https: //aclanthology.org/2...

  6. [6]

    Toward effective tool-integrated reasoning via self-evolved preference learning

    Yifei Chen, Guanting Dong, and Zhicheng Dou. Toward effective tool-integrated reasoning via self-evolved preference learning. InICLR, 2026. URL https://openreview.net/ forum?id=mNeitRAdWV

  7. [7]

    Trinh, et al

    Yuri Chervonyi, Trieu H. Trinh, et al. Gold-medalist performance in solving olympiad geometry with AlphaGeometry2.Journal of Machine Learning Research, 26(241):1–39, 2025. URL https://www.jmlr.org/papers/volume26/25-1654/25-1654.pdf

  8. [8]

    Tool-star: Empowering LLM-brained multi-tool reasoner via reinforcement learning.arXiv preprint arXiv:2505.16410, 2025

    Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. Tool-star: Empowering LLM-brained multi-tool reasoner via reinforcement learning.arXiv preprint arXiv:2505.16410, 2025

  9. [9]

    Trustgeogen: Formal-verified data engine for trustworthy multi-modal geometric problem solving.arXiv preprint arXiv:2504.15780, 2026

    Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, and Bo Zhang. Trustgeogen: Formal-verified data engine for trustworthy multi-modal geometric problem solving.arXiv preprint arXiv:2504.15780, 2026

  10. [10]

    GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

    Yumeng Fu, Jiayin Zhu, Lingling Zhang, Wenjun Wu, Bo Zhao, Shaoxuan Ma, Yushun Zhang, and Jun Liu. GeoLaux: A benchmark for evaluating MLLMs’ geometry performance on long-step problems requiring auxiliary lines.arXiv preprint arXiv:2508.06226, 2025

  11. [11]

    Gemini 2.5: Our most intelligent AI model, 2025

    Google DeepMind. Gemini 2.5: Our most intelligent AI model, 2025. URL https://blog.google/technology/google-deepmind/ gemini-model-thinking-updates-march-2025/

  12. [12]

    Gemini 3 Flash: Frontier intelligence built for speed, 2025

    Google DeepMind. Gemini 3 Flash: Frontier intelligence built for speed, 2025. URLhttps://blog.google/products-and-platforms/products/gemini/ gemini-3-flash/

  13. [13]

    Welcome Gemma 4: Frontier multimodal intelligence on device, 2026

    Google DeepMind. Welcome Gemma 4: Frontier multimodal intelligence on device, 2026. URLhttps://huggingface.co/blog/gemma4

  14. [14]

    ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning

    Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, et al. ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. InICLR, 2026. URL https: //openreview.net/forum?id=mB3vxfrQZM. 10

  15. [15]

    Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020,

    Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, and Jing Zhang. GeoVLMath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020, 2025

  16. [16]

    OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computa...

  17. [17]

    Geogebra

    Markus Hohenwarter and Judith Hohenwarter. Geogebra. https://www.geogebra.org. Dynamic Mathematics Software

  18. [18]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024

    Yushi Hu, Weijia Shi, et al. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InNeurIPS, pages 139348–139379, 2024. arXiv:2406.09403

  19. [19]

    Socratic-geo: Synthetic data generation and geometric reasoning via multi-agent interaction.arXiv preprint arXiv:2602.03414, 2026

    Zhengbo Jiao, Shaobo Wang, Zifan Zhang, et al. Socratic-geo: Synthetic data generation and geometric reasoning via multi-agent interaction.arXiv preprint arXiv:2602.03414, 2026

  20. [20]

    GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

    Jinwoong Kim, Rui Yang, and Huishuai Zhang. GeoBuildBench: A benchmark for interactive and executable geometry construction from natural language.arXiv preprint arXiv:2605.13167, 2026

  21. [21]

    Giac and GeoGebra – improved Gröbner basis computations

    Zoltán Kovács and Bernard Parisse. Giac and GeoGebra – improved Gröbner basis computations. InComputer Algebra and Polynomials, volume 8942 ofLNCS, pages 126–138. Springer, 2015

  22. [22]

    Reasoning and tool-use compete in agentic RL: From quantifying interference to disentangled tuning.arXiv preprint arXiv:2602.00994, 2026

    Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, and Tieying Zhang. Reasoning and tool-use compete in agentic RL: From quantifying interference to disentangled tuning.arXiv preprint arXiv:2602.00994, 2026

  23. [23]

    In-the-flow agentic system optimization for effective planning and tool use

    Zhuofeng Li, Haoxiang Zhang, et al. In-the-flow agentic system optimization for effective planning and tool use. InICLR, 2026. URL https://openreview.net/forum?id= Mf5AleTUVK. Oral

  24. [24]

    Understanding tool-integrated reasoning.arXiv preprint arXiv:2508.19201, 2025

    Heng Lin and Zhongwen Xu. Understanding tool-integrated reasoning.arXiv preprint arXiv:2508.19201, 2025

  25. [25]

    Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

    Jiahang Lin, Shichun Liu, Chengjun Pan, et al. Agentic harness engineering: Observability- driven automatic evolution of coding-agent harnesses.arXiv preprint arXiv:2604.25850, 2026

  26. [26]

    Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, and Kevin P. Murphy. AutoHarness: improving LLM agents by automatically synthesizing a code harness.arXiv preprint arXiv:2603.03329, 2026

  27. [27]

    Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, et al. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. InACL, pages 6774–6786, 2021. doi: 10.18653/ v1/2021.acl-long.528. URLhttps://aclanthology.org/2021.acl-long.528/

  28. [28]

    MathVista: Evaluating mathe- matical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathe- matical reasoning of foundation models in visual contexts. InICLR, 2024. URL https: //openreview.net/forum?id=KUNzEQMWU7. Oral

  29. [29]

    Thinking with visual primitives

    Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, et al. Thinking with visual primitives. 2026. DeepSeek-AI

  30. [30]

    From narrow to panoramic vision: Attention- guided cold-start reshapes multimodal reasoning

    Ruilin Luo, Chufan Shi, Yizhen Zhang, et al. From narrow to panoramic vision: Attention- guided cold-start reshapes multimodal reasoning. InICLR, 2026. URL https:// openreview.net/forum?id=4tsfY0lI1w

  31. [31]

    Geogram- bench: Benchmarking the geometric program reasoning in modern llms

    Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, and Yong Wu. Geogram- bench: Benchmarking the geometric program reasoning in modern llms. InICLR, 2026. URL https://openreview.net/forum?id=MrJoBgN1VO. 11

  32. [32]

    A survey of deep learning for geometry problem solving.arXiv preprint arXiv:2507.11936, 2025

    Jianzhe Ma, Wenxuan Wang, and Qin Jin. A survey of deep learning for geometry problem solving.arXiv preprint arXiv:2507.11936, 2025

  33. [33]

    GPT-4o system card, 2024

    OpenAI. GPT-4o system card, 2024. URL https://openai.com/index/ gpt-4o-system-card/

  34. [34]

    Introducing o3 and o4-mini, 2025

    OpenAI. Introducing o3 and o4-mini, 2025. URL https://openai.com/index/ introducing-o3-and-o4-mini/

  35. [35]

    Enhancing the geometric problem-solving ability of multimodal LLMs via symbolic-neural integration.arXiv preprint arXiv:2504.12773, 2025

    Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, and Feng Ma. Enhancing the geometric problem-solving ability of multimodal LLMs via symbolic-neural integration.arXiv preprint arXiv:2504.12773, 2025

  36. [36]

    GeoDRL: A self-learning framework for geometry problem solving using reinforcement learning in deductive reasoning

    Shuai Peng, Di Fu, Yijun Liang, Liangcai Gao, and Zhi Tang. GeoDRL: A self-learning framework for geometry problem solving using reinforcement learning in deductive reasoning. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13468–13480,

  37. [37]

    URL https://aclanthology.org/ 2023.findings-acl.850/

    doi: 10.18653/v1/2023.findings-acl.850. URL https://aclanthology.org/ 2023.findings-acl.850/

  38. [38]

    AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning

    Bowen Ping, Minnan Luo, Zhuohang Dang, Chenxi Wang, and Chengyou Jia. AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning. InICLR, 2026. URLhttps://openreview.net/forum?id=PVtZnUh04m

  39. [39]

    SMART: Self-aware agent for tool overuse mitigation

    Cheng Qian, Emre Can Acikgoz, et al. SMART: Self-aware agent for tool overuse mitigation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 4604–4621,

  40. [40]

    URL https://aclanthology.org/ 2025.findings-acl.239/

    doi: 10.18653/v1/2025.findings-acl.239. URL https://aclanthology.org/ 2025.findings-acl.239/

  41. [41]

    We-math 2.0: A versatile mathbook system for incentivizing visual mathematical reasoning

    Runqi Qiao, Qiuna Tan, et al. We-math 2.0: A versatile mathbook system for incentivizing visual mathematical reasoning. InICLR, 2026. URL https://openreview.net/forum?id= I7fTPLT8A9

  42. [42]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick et al. Toolformer: Language models can teach themselves to use tools. In NeurIPS, pages 68539–68551, 2023. URL https://openreview.net/forum?id= Yacmpz84TH. Oral

  43. [43]

    Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. InThe Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview...

  44. [44]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  45. [45]

    MathCanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning.arXiv preprint arXiv:2510.14958, 2025

    Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, and Hongsheng Li. MathCanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning.arXiv preprint arXiv:2510.14958, 2025

  46. [46]

    Newclid: A user-friendly replacement for AlphaGeometry.arXiv preprint arXiv:2411.11938, 2024

    Vladmir Sicca, Tianxiang Xia, Mathïs Fédérico, Philip John Gorinski, Simon Frieder, and Shangling Jui. Newclid: A user-friendly replacement for AlphaGeometry.arXiv preprint arXiv:2411.11938, 2024

  47. [47]

    Canvas-of-thought: Grounding reasoning via mutable structured states.arXiv preprint arXiv:2602.10494, 2026

    Lingzhuang Sun, Yuxia Zhu, Ruitong Liu, et al. Canvas-of-thought: Grounding reasoning via mutable structured states.arXiv preprint arXiv:2602.10494, 2026

  48. [48]

    Math blind: Failures in diagram understanding undermine reasoning in MLLMs

    Yanpeng Sun, Shan Zhang, et al. Math blind: Failures in diagram understanding undermine reasoning in MLLMs. InICLR, 2026. URL https://openreview.net/forum?id= RtvmTxdQV9. 12

  49. [49]

    Acting less is reasoning more! teaching model to act efficiently.arXiv preprint arXiv:2504.14870, 2025

    Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently.arXiv preprint arXiv:2504.14870, 2025

  50. [50]

    SOLIDGEO: Measuring multimodal spatial math reasoning in solid geometry.arXiv preprint arXiv:2505.21177, 2025

    Peijie Wang, Chao Yang, et al. SOLIDGEO: Measuring multimodal spatial math reasoning in solid geometry.arXiv preprint arXiv:2505.21177, 2025

  51. [51]

    GeometryZero: Advancing Geometry Solving via Group Contrastive Policy Optimization

    Yikun Wang, Yibin Wang, Dianyi Wang, Zimian Peng, Qipeng Guo, Dacheng Tao, and Jiaqi Wang. GeometryZero: Advancing geometry solving via group contrastive policy optimization. arXiv preprint arXiv:2506.07160, 2025

  52. [52]

    GenExam: A multidisciplinary text-to-image exam

    Zhaokai Wang, Penghao Yin, et al. GenExam: A multidisciplinary text-to-image exam. In ICML, 2026

  53. [53]

    Le, and Denny Zhou

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, volume 35, pages 24824–24837, 2022. URL https://openreview. net/forum?id=_VjQlMeSB_J

  54. [54]

    GGBench: A geometric generative reasoning benchmark for unified multimodal models.arXiv preprint arXiv:2511.11134, 2025

    Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, and Cheng Tan. GGBench: A geometric generative reasoning benchmark for unified multimodal models.arXiv preprint arXiv:2511.11134, 2025

  55. [55]

    Think-augmented function calling: Improving LLM parameter accuracy through embedded reasoning.arXiv preprint arXiv:2601.18282, 2026

    Lei Wei, Xiao Peng, Jinpeng Ou, and Bin Wang. Think-augmented function calling: Improving LLM parameter accuracy through embedded reasoning.arXiv preprint arXiv:2601.18282, 2026

  56. [56]

    GeoSketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation.arXiv preprint arXiv:2509.22460, 2025

    Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu, Zhiyang Teng, Xiaozhang Liu, and Hanmeng Liu. GeoSketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation.arXiv preprint arXiv:2509.22460, 2025

  57. [57]

    NesyGeo: A neuro-symbolic framework for multimodal geometric reasoning data generation

    Weiming Wu, Zi-Kang Wang, Jin Ye, Zhi Zhou, Yu-Feng Li, and Lan-Zhe Guo. NesyGeo: A neuro-symbolic framework for multimodal geometric reasoning data generation. In2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview.net/forum?id= t4tIV04qUp. arXiv:2505.17121

  58. [58]

    Causal-R: A causal-reasoning geometry problem solver for optimized solution exploration

    Wenjun Wu, Lingling Zhang, Bo Zhao, Muye Huang, Qianying Wang, and Jun Liu. Causal-R: A causal-reasoning geometry problem solver for optimized solution exploration. InNeurIPS,

  59. [59]

    URLhttps://openreview.net/forum?id=eRgYGhFRgZ

  60. [60]

    Geo-code: A code framework for reverse code generation from geometric images based on two-stage multi-agent evolution.arXiv preprint arXiv:2602.07749, 2026

    Zhenyu Wu, Yanxi Long, Jian Li, and Hua Huang. Geo-code: A code framework for reverse code generation from geometric images based on two-stage multi-agent evolution.arXiv preprint arXiv:2602.07749, 2026

  61. [61]

    GeoSense: Evaluating identification and application of geometric principles in multimodal reasoning.arXiv preprint arXiv:2504.12597, 2025

    Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, and Bo Zheng. GeoSense: Evaluating identification and application of geometric principles in multimodal reasoning.arXiv preprint arXiv:2504.12597, 2025

  62. [62]

    Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning. arXiv preprint arXiv:2509.23292, 2025

    Ningning Xu, Yuxuan Jiang, et al. Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning. arXiv preprint arXiv:2509.23292, 2025

  63. [63]

    Geosdf: Plane geometry diagram synthesis via signed distance field. arXiv preprint arXiv:2506.13492, 2025

    Chengrui Zhang, Maizhen Ning, et al. Geosdf: Plane geometry diagram synthesis via signed distance field. arXiv preprint arXiv:2506.13492, 2025

  64. [64]

    Fuse, reason and verify: Geometry problem solving with parsed clauses from diagram. arXiv preprint arXiv:2407.07327, 2024

    Ming-Liang Zhang, Zhong-Zhi Li, Fei Yin, Liang Lin, and Cheng-Lin Liu. Fuse, reason and verify: Geometry problem solving with parsed clauses from diagram. arXiv preprint arXiv:2407.07327, 2024

  65. [65]

    MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In ECCV, 2024. doi: 10.1007/978-3-031-73242-3_10. arXiv:2403.14624

  66. [66]

    How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

    Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong, et al. How RL unlocks the aha moment in geometric interleaved reasoning. arXiv preprint arXiv:2603.01070, 2026

  67. [67]

    Achieving olympia-level geometry large language model agent via complexity boosting reinforcement learning

    Haiteng Zhao, Junhao Shen, Yiming Zhang, et al. Achieving olympia-level geometry large language model agent via complexity boosting reinforcement learning. In ICLR, 2026. URL https://openreview.net/forum?id=1sffPGGQyT

  68. [68]

    Pi-GPS: Enhancing geometry problem solving by unleashing the power of diagrammatic information

    Junbo Zhao, Ting Zhang, Jiayu Sun, Mi Tian, and Hua Huang. Pi-GPS: Enhancing geometry problem solving by unleashing the power of diagrammatic information. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1526–1536, 2025

  69. [69]

    Towards geometry problem solving in the large model era: A survey

    Yurui Zhao, Xiang Wang, Jiahong Liu, Irwin King, and Zhitao Huang. Towards geometry problem solving in the large model era: A survey. In 2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview.net/forum?id=8o2hHIXrzV. arXiv:2506.02690

  70. [70]

    Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, Congming Zheng, Jiachen Zhu, et al. Externalization in LLM agents: A unified review of memory, skills, protocols and harness engineering. arXiv preprint arXiv:2604.08224, 2026.

    A Positioning in the Broader Agentic Landscape Th...

  71. [71]

    Semantic

    T2I = text-to-image; MLLM = multimodal large language model. Base model shown in gray below each closed-source method; secondary per-vendor variants commented in the source for reference.

    Closed-source Models (Method: Strict, Relaxed)
    Draw2Think (Gemini 3 Flash Preview | Dec. 2025): 68.2, 90.5
    Nano Banana 2 (Gemini 3.1 Flash Image | Feb. 2026): 56.3, 87.8
    Nano Banana P...

  72. [72]

    Identify parallel: arrows on QN, PO give QN ∥ PO

  73. [73]

    Apply BPT: in △MPO, MQ/QP = MN/NO

  74. [74]

    Substitute: MQ = 5, QP = x, MN = 6, NO = 3.6 ⇒ 5/x = 6/3.6

  75. [75]

    Solve: 6x = 18 ⇒ x = 3. Answer. QP = 3. Correct; each quantity is asserted in text without an external verifier.

    Problem Question. Find QP. Choices. (A) 2 (B) 3 (C) 5 (D) 6 Expected answer. B (QP = 3)

    AutoGPS: proof-graph search. Stepwise Reasoning Process for Noise Data, borrowed intact from AutoGPS (ICLR 2026). Step 1: Known facts: start ⟹ x = PQ, 6 = MN, 3 +...

  76. [76]

    The altitude DB ⊥ AC bisects AC, hence AB = BC = x

    Tick marks give AD = CD = 32, so △ADC is isosceles. The altitude DB ⊥ AC bisects AC, hence AB = BC = x

  77. [77]

    In right △DBC: cos ∠C = BC/CD, so BC = 32 cos 54° ≈ 32 · 0.5878 ≈ 18.81

  78. [78]

    x ≈ 18.8. Answer: A.

    Causal-R: causal-graph deduction, two equivalent 2-step solutions, reproduced verbatim from CausalR (NeurIPS 2025). Solution 1. Step 1: Use Isosceles Triangle Theorem. ∵ BD = BC ∴ ∠BDA = ∠BCA = 54°. Step 2: Use Cosine of Triangle. ∵ ∠BAD = 90° ∴ AD = BD × cos(∠BDA) ∴ AD = 32 × cos(54°) = 18.8. Solution 2. Step 1: Use Cosine of Triangle. ∵ ∠BAC =...

  79. [79]

    Identify dimensions: width w = AB = 11 cm; height h = BF = 11 cm; length l = GF = 15 cm

  80. [80]

    3D Pythagorean theorem: y = √(w² + l² + h²)
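The worked steps captured in entries 72 to 80 (the proportionality step MQ/QP = MN/NO, the cosine step BC = 32 cos 54°, and the 3D Pythagorean step) are each asserted in text. As a minimal sketch of how such quantities could be recomputed numerically, the Python below re-derives them from the values quoted above. The function names are hypothetical and chosen only for illustration; this is plain arithmetic and is not the paper's GeoGebra-based Propose-Draw-Verify pipeline.

```python
import math


def bpt_unknown_segment(mq: float, mn: float, no: float) -> float:
    """Solve MQ/QP = MN/NO for QP (hypothetical helper)."""
    return mq * no / mn


def adjacent_side(hypotenuse: float, angle_deg: float) -> float:
    """Adjacent leg of a right triangle: hypotenuse * cos(angle)."""
    return hypotenuse * math.cos(math.radians(angle_deg))


def space_diagonal(w: float, l: float, h: float) -> float:
    """Diagonal of a rectangular box: sqrt(w^2 + l^2 + h^2)."""
    return math.sqrt(w**2 + l**2 + h**2)


# Values quoted in the worked steps above.
qp = bpt_unknown_segment(mq=5, mn=6, no=3.6)   # 5/x = 6/3.6
bc = adjacent_side(hypotenuse=32, angle_deg=54)  # BC = 32 cos 54°
y = space_diagonal(w=11, l=15, h=11)             # w = h = 11 cm, l = 15 cm

print(f"QP = {qp:.2f}, BC = {bc:.2f}, y = {y:.2f}")
```

A numeric check like this only confirms the arithmetic of a stated step; it does not confirm that the step's geometric premises (for example, that QN is parallel to PO) actually hold in the diagram.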

Showing first 80 references.