Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
Pith reviewed 2026-05-21 06:02 UTC · model grok-4.3
The pith
Vision-language models improve geometry problem solving by interacting with a constraint engine to verify their drawings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Draw2Think recasts geometric reasoning as agentic interaction with the GeoGebra constraint engine. In the Propose-Draw-Verify loop, hypotheses are externalized onto an executable canvas, exact geometric quantities are measured, and structured observations are fed back to the model, so that subsequent reasoning proceeds from a checked canvas state grounded in the shared workspace.
What carries the argument
The Propose-Draw-Verify loop, which externalizes hypotheses onto a constraint-checked evolving canvas and measures exact geometric quantities for feedback.
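As a rough illustration of the loop's control flow, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `ToyCanvas`, `propose`, and `propose_draw_verify` are hypothetical names, the canvas is a plain dictionary rather than GeoGebra, and the "model" is a bisection stub rather than a vision-language model. It only shows how a measured quantity can be fed back to steer the next proposal.

```python
import math


class ToyCanvas:
    """Hypothetical stand-in for a constraint engine; this is NOT the GeoGebra API."""

    def __init__(self):
        self.points = {}

    def draw(self, name, x, y):
        # Create or overwrite a named point on the canvas.
        self.points[name] = (x, y)

    def measure_distance(self, a, b):
        # Return the distance between two named points.
        return math.dist(self.points[a], self.points[b])


def propose(low, high):
    """Stub for the model: guess the midpoint of the current range for C's y-coordinate."""
    return (low + high) / 2.0


def propose_draw_verify(target=5.0, tol=1e-9, max_steps=100):
    """Place C on the line x = 3 so that |AC| equals `target`, with A at the origin."""
    canvas = ToyCanvas()
    canvas.draw("A", 0.0, 0.0)
    low, high = 0.0, 10.0
    trace = []
    for step in range(max_steps):
        y = propose(low, high)                        # Propose
        canvas.draw("C", 3.0, y)                      # Draw
        measured = canvas.measure_distance("A", "C")  # Verify
        observation = {"step": step, "AC": measured,
                       "ok": abs(measured - target) <= tol}
        trace.append(observation)
        if observation["ok"]:
            break
        # Feed the observation back: narrow the range used by the next proposal.
        if measured < target:
            low = y
        else:
            high = y
    return canvas, trace


canvas, trace = propose_draw_verify()
print(len(trace), canvas.points["C"], trace[-1]["ok"])
```

The design point is that the stopping condition depends on a value measured from the canvas, not on what the proposer believes it drew.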
Load-bearing premise
The vision-language model can reliably interpret the structured observations from the constraint engine and use them to improve reasoning without introducing new errors.
What would settle it
An experiment in which the Propose-Draw-Verify loop yields no accuracy improvement over baseline methods without engine interaction, or in which its construction pass rates turn out to be low.
Original abstract
Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at https://draw2think.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Draw2Think, a framework that recasts geometric reasoning in vision-language models as agentic interaction with the GeoGebra constraint engine via a Propose-Draw-Verify loop. Hypotheses are externalized to an executable canvas, exact geometric quantities are measured, and structured observations are fed back so that subsequent reasoning proceeds from checked state. The central claims are high construction fidelity (95.9% predicate-level and 84.0% strict problem-level on GeoGoal), outcome accuracy gains (up to 4.1% planar / 16.4% solid), and rendering scores (68.2% strict / 90.5% relaxed on GenExam-math).
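The gap between the predicate-level and strict problem-level construction rates can be made concrete with a small Python sketch. It assumes that "strict" means every predicate check of a problem must pass, which the summary above does not spell out; the function name `pass_rates` and the toy data are hypothetical and are not results from the paper.

```python
def pass_rates(results):
    """results: one list of booleans per problem, one boolean per predicate check."""
    total_predicates = sum(len(problem) for problem in results)
    passed_predicates = sum(sum(problem) for problem in results)
    # Assumed definition: a problem passes strictly only if all its predicates pass.
    strict_passed = sum(1 for problem in results if all(problem))
    return passed_predicates / total_predicates, strict_passed / len(results)


# Toy data for illustration only (three problems, nine predicate checks).
toy = [[True, True, True], [True, False, True, True], [True, True]]
predicate_rate, strict_rate = pass_rates(toy)
print(f"predicate-level: {predicate_rate:.3f}, strict problem-level: {strict_rate:.3f}")
```

Under this assumed definition a single failed predicate fails its whole problem, so the strict rate can never exceed the predicate-level rate by much and is usually lower.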
Significance. If the empirical claims hold after proper controls, the work supplies a concrete mechanism for making intermediate geometric states auditable and constraint-satisfying rather than latent, which could improve reliability of VLM-based geometry solvers. The separation of model-level Construction Fidelity from engine-level Measurement Faithfulness is a useful conceptual contribution.
major comments (2)
- [Abstract / Experimental Evaluation] Abstract and results sections: performance numbers (95.9% predicate-level checks, 84.0% strict problem-level, up to 4.1%/16.4% accuracy gains, 68.2%/90.5% rendering scores) are stated without any description of baselines, statistical significance tests, error bars, or train/test splits, so the central claim of improvement rests on incompletely reported evidence.
- [Method / Experiments] Propose-Draw-Verify loop (and any associated ablation or analysis sections): no experiment isolates whether the VLM actually consumes and incorporates the returned structured observations (predicate strings, numeric measurements) versus merely benefiting from the presence of a canvas. No traces of model input/output, no count of cases where the model hallucinates a relation contradicting engine state, and no ablation removing the Verify feedback are provided, leaving the load-bearing assumption that feedback improves subsequent reasoning unsupported.
minor comments (1)
- [Method] Clarify the precise string format and encoding of the structured observations that are appended to the VLM prompt after each Verify step.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and have made revisions to improve the clarity and completeness of our experimental reporting and analysis.
-
Referee: [Abstract / Experimental Evaluation] Abstract and results sections: performance numbers (95.9% predicate-level checks, 84.0% strict problem-level, up to 4.1%/16.4% accuracy gains, 68.2%/90.5% rendering scores) are stated without any description of baselines, statistical significance tests, error bars, or train/test splits, so the central claim of improvement rests on incompletely reported evidence.
Authors: We agree that the abstract could benefit from more context on the evaluation setup. The full paper details the baselines (including direct VLM prompting and other externalization methods) in Section 4.1, with results averaged over multiple runs and reported with standard deviations as error bars. Train/test splits follow the standard partitions of GeoGoal, GeoQA, and GenExam-math as described in Section 3. To make this more prominent, we will revise the abstract to briefly note the comparative evaluation and add explicit references to the statistical reporting in the results section.
Revision: yes
-
Referee: [Method / Experiments] Propose-Draw-Verify loop (and any associated ablation or analysis sections): no experiment isolates whether the VLM actually consumes and incorporates the returned structured observations (predicate strings, numeric measurements) versus merely benefiting from the presence of a canvas. No traces of model input/output, no count of cases where the model hallucinates a relation contradicting engine state, and no ablation removing the Verify feedback are provided, leaving the load-bearing assumption that feedback improves subsequent reasoning unsupported.
Authors: This is a valid point regarding the need for more direct evidence on the role of the Verify feedback. While the manuscript includes qualitative examples of the loop in Figure 3 and Section 3.2, and quantitative gains over baselines that lack the full loop, we did not include a specific ablation removing only the Verify step or input/output traces. We will add an ablation study comparing the full Propose-Draw-Verify to a Propose-Draw variant without feedback, along with sample traces of model reasoning before and after verification in the revised manuscript. This will better support the claim that the structured observations are incorporated.
Revision: yes
Circularity Check
No circularity: evaluations rest on external benchmarks and independent measurements
The paper describes an agentic Propose-Draw-Verify loop that interacts with the GeoGebra constraint engine to externalize geometric hypotheses and obtain structured observations. All reported metrics—95.9% predicate-level and 84.0% problem-level construction checks on GeoGoal, accuracy gains on planar/solid benchmarks, and rendering scores on GenExam-math—are obtained by direct comparison against held-out external test sets and ground-truth constructions. No parameters are fitted to the target outcomes inside the paper, no equations reduce the claimed improvements to quantities defined by the same loop, and no self-citations supply the load-bearing justification for the core results. The framework is therefore evaluated against independent standards rather than by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Vision-language models can generate drawing actions and interpret structured feedback from a geometry constraint engine to refine reasoning.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: reality_from_one_distinction (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop...
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Measurement Faithfulness is the complementary engine-level property: because accepted objects are stored as algebraic relations and resolved by GeoGebra’s embedded Giac CAS using Gröbner-basis elimination
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin, et al. Qwen2.5-VL technical report...
- [2] Francisco Botana, Markus Hohenwarter, Predrag Janičić, Zoltán Kovács, Ivan Petrović, Tomás Recio, and Simon Weitzhofer. Automated theorem proving in GeoGebra: Current achievements. Journal of Automated Reasoning, 55(4):339–360, 2015.
- [3] Jianlong Chen, Daocheng Fu, Shengze Xu, Jiawei Chen, Yuan Feng, Yue Yang, Junchi Yan, Hongyuan Zha, and Renqiu Xia. Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward. arXiv preprint arXiv:2601.05073, 2026.
- [4] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric Xing, and Liang Lin. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513–523, 2021. doi: 10.18653/v1/2021.findings-acl.46. URL https://aclanthology.org/2021.fi...
- [5] Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3313–3323, 2022. doi: 10.18653/v1/2022.emnlp-main.218. URL https://aclanthology.org/2...
- [6] Yifei Chen, Guanting Dong, and Zhicheng Dou. Toward effective tool-integrated reasoning via self-evolved preference learning. In ICLR, 2026. URL https://openreview.net/forum?id=mNeitRAdWV
- [7] Yuri Chervonyi, Trieu H. Trinh, et al. Gold-medalist performance in solving olympiad geometry with AlphaGeometry2. Journal of Machine Learning Research, 26(241):1–39, 2025. URL https://www.jmlr.org/papers/volume26/25-1654/25-1654.pdf
- [8] Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. Tool-star: Empowering LLM-brained multi-tool reasoner via reinforcement learning. arXiv preprint arXiv:2505.16410, 2025.
- [9] Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, and Bo Zhang. Trustgeogen: Formal-verified data engine for trustworthy multi-modal geometric problem solving. arXiv preprint arXiv:2504.15780, 2026.
- [10] Yumeng Fu, Jiayin Zhu, Lingling Zhang, Wenjun Wu, Bo Zhao, Shaoxuan Ma, Yushun Zhang, and Jun Liu. GeoLaux: A benchmark for evaluating MLLMs’ geometry performance on long-step problems requiring auxiliary lines. arXiv preprint arXiv:2508.06226, 2025.
- [11] Google DeepMind. Gemini 2.5: Our most intelligent AI model, 2025. URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
- [12] Google DeepMind. Gemini 3 Flash: Frontier intelligence built for speed, 2025. URL https://blog.google/products-and-platforms/products/gemini/gemini-3-flash/
- [13] Google DeepMind. Welcome Gemma 4: Frontier multimodal intelligence on device, 2026. URL https://huggingface.co/blog/gemma4
- [14] Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, et al. ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. In ICLR, 2026. URL https://openreview.net/forum?id=mB3vxfrQZM
- [15] Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, and Jing Zhang. GeoVLMath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation. arXiv preprint arXiv:2510.11020, 2025.
- [16] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computa...
- [17]
- [18] Yushi Hu, Weijia Shi, et al. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In NeurIPS, pages 139348–139379, 2024. arXiv:2406.09403
- [19] Zhengbo Jiao, Shaobo Wang, Zifan Zhang, et al. Socratic-geo: Synthetic data generation and geometric reasoning via multi-agent interaction. arXiv preprint arXiv:2602.03414, 2026.
- [20] Jinwoong Kim, Rui Yang, and Huishuai Zhang. GeoBuildBench: A benchmark for interactive and executable geometry construction from natural language. arXiv preprint arXiv:2605.13167, 2026.
- [21] Zoltán Kovács and Bernard Parisse. Giac and GeoGebra – improved Gröbner basis computations. In Computer Algebra and Polynomials, volume 8942 of LNCS, pages 126–138. Springer, 2015.
- [22] Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, and Tieying Zhang. Reasoning and tool-use compete in agentic RL: From quantifying interference to disentangled tuning. arXiv preprint arXiv:2602.00994, 2026.
- [23] Zhuofeng Li, Haoxiang Zhang, et al. In-the-flow agentic system optimization for effective planning and tool use. In ICLR, 2026. URL https://openreview.net/forum?id=Mf5AleTUVK. Oral.
- [24] Heng Lin and Zhongwen Xu. Understanding tool-integrated reasoning. arXiv preprint arXiv:2508.19201, 2025.
- [25] Jiahang Lin, Shichun Liu, Chengjun Pan, et al. Agentic harness engineering: Observability-driven automatic evolution of coding-agent harnesses. arXiv preprint arXiv:2604.25850, 2026.
- [26]
- [27] Pan Lu, Ran Gong, Shibiao Jiang, et al. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In ACL, pages 6774–6786, 2021. doi: 10.18653/v1/2021.acl-long.528. URL https://aclanthology.org/2021.acl-long.528/
- [28] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024. URL https://openreview.net/forum?id=KUNzEQMWU7. Oral.
- [29] Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, et al. Thinking with visual primitives. 2026. DeepSeek-AI.
- [30] Ruilin Luo, Chufan Shi, Yizhen Zhang, et al. From narrow to panoramic vision: Attention-guided cold-start reshapes multimodal reasoning. In ICLR, 2026. URL https://openreview.net/forum?id=4tsfY0lI1w
- [31] Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, and Yong Wu. Geogrambench: Benchmarking the geometric program reasoning in modern llms. In ICLR, 2026. URL https://openreview.net/forum?id=MrJoBgN1VO
- [32] Jianzhe Ma, Wenxuan Wang, and Qin Jin. A survey of deep learning for geometry problem solving. arXiv preprint arXiv:2507.11936, 2025.
- [33] OpenAI. GPT-4o system card, 2024. URL https://openai.com/index/gpt-4o-system-card/
- [34] OpenAI. Introducing o3 and o4-mini, 2025. URL https://openai.com/index/introducing-o3-and-o4-mini/
- [35] Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, and Feng Ma. Enhancing the geometric problem-solving ability of multimodal LLMs via symbolic-neural integration. arXiv preprint arXiv:2504.12773, 2025.
- [36] Shuai Peng, Di Fu, Yijun Liang, Liangcai Gao, and Zhi Tang. GeoDRL: A self-learning framework for geometry problem solving using reinforcement learning in deductive reasoning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13468–13480, 2023. doi: 10.18653/v1/2023.findings-acl.850. URL https://aclanthology.org/2023.findings-acl.850/
- [38] Bowen Ping, Minnan Luo, Zhuohang Dang, Chenxi Wang, and Chengyou Jia. AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning. In ICLR, 2026. URL https://openreview.net/forum?id=PVtZnUh04m
- [39] Cheng Qian, Emre Can Acikgoz, et al. SMART: Self-aware agent for tool overuse mitigation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 4604–4621, 2025. doi: 10.18653/v1/2025.findings-acl.239. URL https://aclanthology.org/2025.findings-acl.239/
- [41] Runqi Qiao, Qiuna Tan, et al. We-math 2.0: A versatile mathbook system for incentivizing visual mathematical reasoning. In ICLR, 2026. URL https://openreview.net/forum?id=I7fTPLT8A9
- [42] Timo Schick et al. Toolformer: Language models can teach themselves to use tools. In NeurIPS, pages 68539–68551, 2023. URL https://openreview.net/forum?id=Yacmpz84TH. Oral.
- [43] Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In The Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview...
- [44] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [45] Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, and Hongsheng Li. MathCanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning. arXiv preprint arXiv:2510.14958, 2025.
- [46] Vladmir Sicca, Tianxiang Xia, Mathïs Fédérico, Philip John Gorinski, Simon Frieder, and Shangling Jui. Newclid: A user-friendly replacement for AlphaGeometry. arXiv preprint arXiv:2411.11938, 2024.
- [47] Lingzhuang Sun, Yuxia Zhu, Ruitong Liu, et al. Canvas-of-thought: Grounding reasoning via mutable structured states. arXiv preprint arXiv:2602.10494, 2026.
- [48] Yanpeng Sun, Shan Zhang, et al. Math blind: Failures in diagram understanding undermine reasoning in MLLMs. In ICLR, 2026. URL https://openreview.net/forum?id=RtvmTxdQV9
- [49] Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently. arXiv preprint arXiv:2504.14870, 2025.
- [50] Peijie Wang, Chao Yang, et al. SOLIDGEO: Measuring multimodal spatial math reasoning in solid geometry. arXiv preprint arXiv:2505.21177, 2025.
- [51] Yikun Wang, Yibin Wang, Dianyi Wang, Zimian Peng, Qipeng Guo, Dacheng Tao, and Jiaqi Wang. GeometryZero: Advancing geometry solving via group contrastive policy optimization. arXiv preprint arXiv:2506.07160, 2025.
- [52] Zhaokai Wang, Penghao Yin, et al. GenExam: A multidisciplinary text-to-image exam. In ICML, 2026.
- [53] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, volume 35, pages 24824–24837, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J
- [54] Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, and Cheng Tan. GGBench: A geometric generative reasoning benchmark for unified multimodal models. arXiv preprint arXiv:2511.11134, 2025.
- [55] Lei Wei, Xiao Peng, Jinpeng Ou, and Bin Wang. Think-augmented function calling: Improving LLM parameter accuracy through embedded reasoning. arXiv preprint arXiv:2601.18282, 2026.
- [56] Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu, Zhiyang Teng, Xiaozhang Liu, and Hanmeng Liu. GeoSketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation. arXiv preprint arXiv:2509.22460, 2025.
- [57] Weiming Wu, Zi-Kang Wang, Jin Ye, Zhi Zhou, Yu-Feng Li, and Lan-Zhe Guo. NesyGeo: A neuro-symbolic framework for multimodal geometric reasoning data generation. In 2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview.net/forum?id=t4tIV04qUp. arXiv:2505.17121
- [58] Wenjun Wu, Lingling Zhang, Bo Zhao, Muye Huang, Qianying Wang, and Jun Liu. Causal-R: A causal-reasoning geometry problem solver for optimized solution exploration. In NeurIPS, 2025. URL https://openreview.net/forum?id=eRgYGhFRgZ
- [60] Zhenyu Wu, Yanxi Long, Jian Li, and Hua Huang. Geo-code: A code framework for reverse code generation from geometric images based on two-stage multi-agent evolution. arXiv preprint arXiv:2602.07749, 2026.
- [61] Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, and Bo Zheng. GeoSense: Evaluating identification and application of geometric principles in multimodal reasoning. arXiv preprint arXiv:2504.12597, 2025.
- [62] Ningning Xu, Yuxuan Jiang, et al. Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning. arXiv preprint arXiv:2509.23292, 2025.
- [63] Chengrui Zhang, Maizhen Ning, et al. Geosdf: Plane geometry diagram synthesis via signed distance field. arXiv preprint arXiv:2506.13492, 2025.
- [64] Ming-Liang Zhang, Zhong-Zhi Li, Fei Yin, Liang Lin, and Cheng-Lin Liu. Fuse, reason and verify: Geometry problem solving with parsed clauses from diagram. arXiv preprint arXiv:2407.07327, 2024.
- [65] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In ECCV, 2024. doi: 10.1007/978-3-031-73242-3_10. arXiv:2403.14624.
- [66] Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong, et al. How RL unlocks the aha moment in geometric interleaved reasoning. arXiv preprint arXiv:2603.01070, 2026.
- [67] Haiteng Zhao, Junhao Shen, Yiming Zhang, et al. Achieving olympia-level geometry large language model agent via complexity boosting reinforcement learning. In ICLR, 2026. URL https://openreview.net/forum?id=1sffPGGQyT
- [68] Junbo Zhao, Ting Zhang, Jiayu Sun, Mi Tian, and Hua Huang. Pi-GPS: Enhancing geometry problem solving by unleashing the power of diagrammatic information. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1526–1536, 2025.
- [69] Yurui Zhao, Xiang Wang, Jiahong Liu, Irwin King, and Zhitao Huang. Towards geometry problem solving in the large model era: A survey. In 2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview.net/forum?id=8o2hHIXrzV. arXiv:2506.02690
- [70] Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, Congming Zheng, Jiachen Zhu, et al. Externalization in LLM agents: A unified review of memory, skills, protocols and harness engineering. arXiv preprint arXiv:2604.08224, 2026.
A Positioning in the Broader Agentic Landscape Th...
- [71] T2I = text-to-image; MLLM = multimodal large language model. Base model shown in gray below each closed-source method; secondary per-vendor variants commented in the source for reference.
Closed-source Models
Method (base model) | Strict | Relaxed
Draw2Think (Gemini 3 Flash Preview, Dec. 2025) | 68.2 | 90.5
Nano Banana 2 (Gemini 3.1 Flash Image, Feb. 2026) | 56.3 | 87.8
Nano Banana P...
- [72] Identify parallel: arrows on QN, PO give QN ∥ PO.
- [73] Apply BPT: in △MPO, MQ/QP = MN/NO.
- [74] Substitute: MQ = 5, QP = x, MN = 6, NO = 3.6 ⇒ 5/x = 6/3.6.
- [75] Solve: 6x = 18 ⇒ x = 3. Answer. QP = 3. Correct; each quantity is asserted in text without an external verifier.
Problem. Question. Find QP. Choices. (A) 2 (B) 3 (C) 5 (D) 6. Expected answer. B (QP = 3).
AutoGPS: proof-graph search. Stepwise Reasoning Process for Noise Data, borrowed intact from AutoGPS (ICLR 2026). Step 1: Known facts: start ⇒ x = PQ, 6 = MN, 3 +...
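The proportion in the first worked solution for QP can be checked with exact rational arithmetic. The short Python sketch below assumes that BPT refers to the Basic Proportionality Theorem, MQ/QP = MN/NO, and uses only the four values quoted in that solution; the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Values quoted in the worked solution; QP is the unknown x.
MQ = Fraction(5)
MN = Fraction(6)
NO = Fraction("3.6")

# MQ / QP = MN / NO  =>  QP = MQ * NO / MN
QP = MQ * NO / MN
print(QP)  # → 3
```

Using `Fraction` keeps 3.6 exact, so the result is the integer 3 with no floating-point rounding, matching choice (B).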
- [76] Tick marks give AD = CD = 32, so △ADC is isosceles. The altitude DB ⊥ AC bisects AC, hence AB = BC = x.
- [77] In right △DBC: cos ∠C = BC/CD, so BC = 32 cos 54° ≈ 32 · 0.5878 ≈ 18.81.
- [78] x ≈ 18.8. Answer: A.
Causal-R: causal-graph deduction. Two equivalent 2-step solutions, reproduced verbatim from Causal-R (NeurIPS 2025).
Solution 1. Step 1: Use Isosceles Triangle Theorem. ∵ BD = BC ∴ ∠BDA = ∠BCA = 54°. Step 2: Use Cosine of Triangle. ∵ ∠BAD = 90° ∴ AD = BD × cos(∠BDA) ∴ AD = 32 × cos(54°) = 18.8.
Solution 2. Step 1: Use Cosine of Triangle. ∵ ∠BAC =...
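Both the baseline solution and the Causal-R solutions reduce to the same product, 32 × cos(54°). A minimal Python check of that arithmetic, using only the side length and angle quoted above (the variable names are illustrative):

```python
import math

hypotenuse = 32.0                 # CD in the first solution, BD in the Causal-R solutions
angle = math.radians(54.0)        # the 54° base angle, converted to radians

# Adjacent side = hypotenuse * cos(angle)
x = hypotenuse * math.cos(angle)
print(round(x, 2))  # → 18.81
```

Note that `math.cos` expects radians, so the conversion step is required; passing 54 directly would give a wrong value.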
- [79] Identify dimensions: width w = AB = 11 cm; height h = BF = 11 cm; length l = GF = 15 cm.
- [80] 3D Pythagorean theorem: y = √(w² + l² + h²).
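The space diagonal implied by the two steps above can be evaluated directly. This Python sketch plugs in the quoted dimensions; the numeric result is computed here and is not quoted from the source, which breaks off before stating it.

```python
import math

w, h, l = 11.0, 11.0, 15.0  # width AB, height BF, length GF, in cm

# 3D Pythagorean theorem: y = sqrt(w^2 + l^2 + h^2)
y = math.sqrt(w**2 + l**2 + h**2)
print(round(y, 2))  # → 21.61
```

Here w² + l² + h² = 121 + 225 + 121 = 467, so y = √467 cm.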