Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

Jiawei Du; Joey Tianyi Zhou; Juncheng Hu; Xin Zhang

arxiv: 2605.20743 · v1 · pith:7ODBBSTVnew · submitted 2026-05-20 · 💻 cs.CV · cs.CL

Pith reviewed 2026-05-21 06:02 UTC · model grok-4.3

keywords geometry reasoning · vision-language models · constraint engine · GeoGebra · Propose-Draw-Verify loop · construction fidelity · spatial reasoning

The pith

Vision-language models improve geometry problem solving by interacting with a constraint engine to verify their drawings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can externalize their geometric reasoning by using a Propose-Draw-Verify loop with a constraint engine like GeoGebra. This turns latent inferences into explicit, checkable canvas states where geometric relations are enforced algebraically. A sympathetic reader would care because it addresses the unverifiability of intermediate reasoning steps in current methods. By feeding back exact measurements and relations, the model can refine its thinking based on grounded observations rather than guesses. This leads to higher construction fidelity and better outcomes on geometry benchmarks.

Core claim

Draw2Think recasts geometric reasoning as agentic interaction with the GeoGebra constraint engine. In the Propose-Draw-Verify loop, hypotheses are externalized onto an executable canvas, exact geometric quantities are measured, and structured observations are fed back to the model, allowing subsequent reasoning to proceed from checked canvas state grounded by the shared workspace.
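
The control flow of such a loop can be sketched in a few lines. This is only an illustration under stated assumptions: `ToyCanvas`, `propose`, and `run_loop` are hypothetical names, the canvas is a toy point store rather than GeoGebra, and a fixed script stands in for the VLM. It shows the shape of the cycle (propose an action, apply it to the canvas, return a structured observation), not the paper's implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToyCanvas:
    """Toy stand-in for an engine canvas; it only stores named points."""
    points: dict = field(default_factory=dict)

    def draw_point(self, name, x, y):
        # Reject an invalid action rather than silently overwriting state.
        if name in self.points:
            raise ValueError(f"point {name} already exists")
        self.points[name] = (float(x), float(y))

    def distance(self, a, b):
        (x1, y1), (x2, y2) = self.points[a], self.points[b]
        return math.hypot(x2 - x1, y2 - y1)


def propose(step, observations):
    """Hypothetical policy: replays a fixed script where a VLM would be queried."""
    script = [("A", 0, 0), ("B", 3, 0), ("B", 1, 1), ("C", 3, 4)]
    return script[step] if step < len(script) else None


def run_loop(max_steps=10):
    canvas, observations = ToyCanvas(), []
    for step in range(max_steps):
        action = propose(step, observations)          # Propose
        if action is None:
            break
        name = action[0]
        try:
            canvas.draw_point(*action)                # Draw
        except ValueError as err:
            # A rejected action is still reported back as an observation.
            observations.append({"ok": False, "error": str(err)})
            continue
        # Verify: measure distances from the new point to every earlier point.
        measured = {
            f"{other}{name}": canvas.distance(other, name)
            for other in canvas.points
            if other != name
        }
        observations.append({"ok": True, "added": name, "distances": measured})
    return canvas, observations


canvas, observations = run_loop()
```

In this toy run the third scripted action tries to redefine point B and is rejected, so the observation list records a failure that a policy could react to on its next proposal.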

What carries the argument

The Propose-Draw-Verify loop, which externalizes hypotheses onto a constraint-checked evolving canvas and measures exact geometric quantities for feedback.

Load-bearing premise

The vision-language model can reliably interpret the structured observations from the constraint engine and use them to improve reasoning without introducing new errors.

What would settle it

An experiment in which the Propose-Draw-Verify loop yields no accuracy improvement over baseline methods without engine interaction, or in which construction pass rates turn out to be low.

Figures

Figures reproduced from arXiv: 2605.20743 by Jiawei Du, Joey Tianyi Zhou, Juncheng Hu, Xin Zhang.

**Figure 1.** Paradigms for externalizing intermediate geometry. Prior routes externalize intermediate geometry as visual artifacts, textual traces, or executable scripts. Draw2Think adds a constraint-agentic harness: a frozen VLM selects typed ToolSpecs, the GeoGebra engine updates an engine-valid canvas state, and structured observations return after each action. The distinction is less about externalizing state than …

**Figure 2.** Mechanism comparison on MathVista/290. Baseline applies alternate-interior-angles and …

**Figure 3.** (a) Faithful canvases (n=215) dominate unfaithful (n=41) at three Ti-match thresholds. (b) Structural Ti plateau at ∼88% (tolerance-invariant); numerical Ti climb to the engine-exact SR (dotted) as tolerance approaches ∼0.1% rel …

Original abstract

Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at https://draw2think.github.io/
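
The gap between the predicate-level and strict problem-level numbers is easier to read with the two aggregations side by side. The sketch below assumes, as one plausible reading that this page does not confirm, that the predicate-level score is the fraction of all individual predicate checks that pass and that the strict score counts a problem only when every one of its predicates passes. The function name and the toy data are hypothetical.

```python
def construction_pass_rates(results):
    """results: one list of booleans per problem, one boolean per predicate check."""
    total_predicates = sum(len(checks) for checks in results)
    passed_predicates = sum(sum(checks) for checks in results)
    # Strict: a problem counts only if all of its predicate checks pass.
    strict_passed = sum(1 for checks in results if checks and all(checks))
    return passed_predicates / total_predicates, strict_passed / len(results)


# Toy data (not from the paper): four problems, ten predicate checks in total.
toy_results = [
    [True, True, True],
    [True, False],
    [True, True],
    [True, True, True],
]
predicate_rate, strict_rate = construction_pass_rates(toy_results)
```

Under these assumptions a single failed predicate lowers the predicate-level rate only slightly but fails its whole problem, which is why the strict rate is always the lower of the two.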

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Draw2Think adds a live GeoGebra constraint loop to VLM geometry work and reports strong construction fidelity, but the gains depend on an untested claim that the model actually uses the engine feedback.

The main takeaway is that this paper gives VLMs a way to externalize geometry reasoning through repeated interaction with a constraint engine instead of relying on latent text or images. The Propose-Draw-Verify cycle keeps an executable canvas that enforces relations and returns exact measurements, which is a clear step past one-shot scripts or pixel renders. That setup makes construction fidelity and measurement faithfulness checkable separately, and the reported 95.9% predicate-level and 84.0% problem-level passes on GeoGoal show the canvas usually ends up correct. The accuracy lifts on planar and solid benchmarks are smaller but point in the right direction, and the rendering scores on GenExam-math are usable. The framework itself is described clearly enough that someone could try to reimplement the loop.

The soft spot is the missing link between the engine output and the model's next step. There are no ablations that isolate the structured observations from just having the canvas present, no examples of how the VLM parses the predicate strings or numbers, and no count of cases where the model states something the engine state contradicts. Without those, the central claim that the verify step improves reasoning rests on an assumption that may not hold. The numbers also lack baseline details, error bars, or split information, so the size of the real improvement is hard to judge.

This is worth a look for groups building reliable spatial or math-reasoning agents. A reader who cares about verifiable externalization in VLMs will find the interaction pattern useful even if the gains need more support. I would send it to peer review so the authors can add the ablations and full experimental controls.

Referee Report

2 major / 1 minor

Summary. The paper presents Draw2Think, a framework that recasts geometric reasoning in vision-language models as agentic interaction with the GeoGebra constraint engine via a Propose-Draw-Verify loop. Hypotheses are externalized to an executable canvas, exact geometric quantities are measured, and structured observations are fed back so that subsequent reasoning proceeds from checked state. The central claims are high construction fidelity (95.9% predicate-level and 84.0% strict problem-level on GeoGoal), outcome accuracy gains (up to 4.1% planar / 16.4% solid), and rendering scores (68.2% strict / 90.5% relaxed on GenExam-math).

Significance. If the empirical claims hold after proper controls, the work supplies a concrete mechanism for making intermediate geometric states auditable and constraint-satisfying rather than latent, which could improve reliability of VLM-based geometry solvers. The separation of model-level Construction Fidelity from engine-level Measurement Faithfulness is a useful conceptual contribution.

major comments (2)

[Abstract / Experimental Evaluation] Abstract and results sections: performance numbers (95.9% predicate-level checks, 84.0% strict problem-level, up to 4.1%/16.4% accuracy gains, 68.2%/90.5% rendering scores) are stated without any description of baselines, statistical significance tests, error bars, or train/test splits, so the central claim of improvement rests on incompletely reported evidence.
[Method / Experiments] Propose-Draw-Verify loop (and any associated ablation or analysis sections): no experiment isolates whether the VLM actually consumes and incorporates the returned structured observations (predicate strings, numeric measurements) versus merely benefiting from the presence of a canvas. No traces of model input/output, no count of cases where the model hallucinates a relation contradicting engine state, and no ablation removing the Verify feedback are provided, leaving the load-bearing assumption that feedback improves subsequent reasoning unsupported.
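
The contradiction count requested in the second major comment could be computed along the following lines. This is a sketch under assumptions: it supposes that relations stated by the model have already been extracted into predicate tuples and that the engine state is available as a predicate-to-truth-value mapping. Neither representation is described on this page, and `audit_claims` is a hypothetical helper.

```python
def audit_claims(model_claims, engine_state):
    """Split model-stated relations into contradicted and unverifiable ones."""
    contradicted, unverifiable = [], []
    for predicate, claimed in model_claims.items():
        if predicate not in engine_state:
            # The engine holds no value for this relation, so it cannot be checked.
            unverifiable.append(predicate)
        elif engine_state[predicate] != claimed:
            contradicted.append(predicate)
    return contradicted, unverifiable


# Toy example (not from the paper).
engine_state = {
    ("parallel", "AB", "CD"): True,
    ("perpendicular", "AB", "EF"): False,
}
model_claims = {
    ("parallel", "AB", "CD"): True,
    ("perpendicular", "AB", "EF"): True,   # conflicts with the engine state
    ("equal_length", "AB", "CD"): True,    # never measured by the engine
}
contradicted, unverifiable = audit_claims(model_claims, engine_state)
```

Reporting contradicted and unverifiable claims separately keeps a genuine conflict with the engine apart from a relation that was simply never measured.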

minor comments (1)

[Method] Clarify the precise string format and encoding of the structured observations that are appended to the VLM prompt after each Verify step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and have made revisions to improve the clarity and completeness of our experimental reporting and analysis.

Referee: [Abstract / Experimental Evaluation] Abstract and results sections: performance numbers (95.9% predicate-level checks, 84.0% strict problem-level, up to 4.1%/16.4% accuracy gains, 68.2%/90.5% rendering scores) are stated without any description of baselines, statistical significance tests, error bars, or train/test splits, so the central claim of improvement rests on incompletely reported evidence.

Authors: We agree that the abstract could benefit from more context on the evaluation setup. The full paper details the baselines (including direct VLM prompting and other externalization methods) in Section 4.1, with results averaged over multiple runs and reported with standard deviations as error bars. Train/test splits follow the standard partitions of GeoGoal, GeoQA, and GenExam-math as described in Section 3. To make this more prominent, we will revise the abstract to briefly note the comparative evaluation and add explicit references to the statistical reporting in the results section. revision: yes
Referee: [Method / Experiments] Propose-Draw-Verify loop (and any associated ablation or analysis sections): no experiment isolates whether the VLM actually consumes and incorporates the returned structured observations (predicate strings, numeric measurements) versus merely benefiting from the presence of a canvas. No traces of model input/output, no count of cases where the model hallucinates a relation contradicting engine state, and no ablation removing the Verify feedback are provided, leaving the load-bearing assumption that feedback improves subsequent reasoning unsupported.

Authors: This is a valid point regarding the need for more direct evidence on the role of the Verify feedback. While the manuscript includes qualitative examples of the loop in Figure 3 and Section 3.2, and quantitative gains over baselines that lack the full loop, we did not include a specific ablation removing only the Verify step or input/output traces. We will add an ablation study comparing the full Propose-Draw-Verify to a Propose-Draw variant without feedback, along with sample traces of model reasoning before and after verification in the revised manuscript. This will better support the claim that the structured observations are incorporated. revision: yes
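
An ablation of this kind is typically scored on the same problems for both variants, which allows a paired comparison. The sketch below is illustrative only: the per-problem outcomes are toy values, not results from the paper, and `paired_comparison` is a hypothetical helper.

```python
def paired_comparison(full_loop, no_feedback):
    """Compare per-problem correctness of two variants run on the same problems."""
    if len(full_loop) != len(no_feedback):
        raise ValueError("both variants must be scored on the same problems")
    n = len(full_loop)
    pairs = list(zip(full_loop, no_feedback))
    return {
        "acc_full": sum(full_loop) / n,
        "acc_no_feedback": sum(no_feedback) / n,
        # Discordant pairs: problems solved by exactly one of the two variants.
        "only_full": sum(1 for f, b in pairs if f and not b),
        "only_no_feedback": sum(1 for f, b in pairs if b and not f),
    }


# Toy outcomes (not from the paper): True means the final answer was correct.
full_loop = [True, True, False, True, True, False, True, True]
no_feedback = [True, False, False, True, False, False, True, True]
summary = paired_comparison(full_loop, no_feedback)
```

The discordant counts show whether an accuracy difference comes from problems that only the feedback variant solves, which is the evidence the referee asks for.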

Circularity Check

0 steps flagged

No circularity: evaluations rest on external benchmarks and independent measurements

The paper describes an agentic Propose-Draw-Verify loop that interacts with the GeoGebra constraint engine to externalize geometric hypotheses and obtain structured observations. All reported metrics—95.9% predicate-level and 84.0% problem-level construction checks on GeoGoal, accuracy gains on planar/solid benchmarks, and rendering scores on GenExam-math—are obtained by direct comparison against held-out external test sets and ground-truth constructions. No parameters are fitted to the target outcomes inside the paper, no equations reduce the claimed improvements to quantities defined by the same loop, and no self-citations supply the load-bearing justification for the core results. The framework is therefore evaluated against independent standards rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that VLMs can generate valid drawing actions and correctly interpret engine feedback; no free parameters or new invented entities are introduced.

axioms (1)

domain assumption: Vision-language models can generate drawing actions and interpret structured feedback from a geometry constraint engine to refine reasoning.
Implicit in the description of the Propose-Draw-Verify loop and the claim that feedback improves subsequent steps.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop...

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

Measurement Faithfulness is the complementary engine-level property: because accepted objects are stored as algebraic relations and resolved by GeoGebra’s embedded Giac CAS using Gröbner-basis elimination

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · 9 internal anchors

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin, et al. Qwen2.5-VL technical report...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Automated theorem proving in GeoGebra: Current achievements

Francisco Botana, Markus Hohenwarter, Predrag Janiˇci´c, Zoltán Kovács, Ivan Petrovi´c, Tomás Recio, and Simon Weitzhofer. Automated theorem proving in GeoGebra: Current achievements. Journal of Automated Reasoning, 55(4):339–360, 2015

work page 2015
[3]

Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward.arXiv preprint arXiv:2601.05073, 2026

Jianlong Chen, Daocheng Fu, Shengze Xu, Jiawei Chen, Yuan Feng, Yue Yang, Junchi Yan, Hongyuan Zha, and Renqiu Xia. Milestones over outcome: Unlocking geometric reasoning with sub-goal verifiable reward.arXiv preprint arXiv:2601.05073, 2026

work page arXiv 2026
[4]

GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning

Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric Xing, and Liang Lin. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513– 523, 2021. doi: 10.18653/v1/2021.findings-acl.46. URL https://aclanthology.org/ 2021.fi...

work page doi:10.18653/v1/2021.findings-acl.46 2021
[5]

UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression

Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Pro- cessing, pages 3313–3323, 2022. doi: 10.18653/v1/2022.emnlp-main.218. URL https: //aclanthology.org/2...

work page doi:10.18653/v1/2022.emnlp-main.218 2022
[6]

Toward effective tool-integrated reasoning via self-evolved preference learning

Yifei Chen, Guanting Dong, and Zhicheng Dou. Toward effective tool-integrated reasoning via self-evolved preference learning. InICLR, 2026. URL https://openreview.net/ forum?id=mNeitRAdWV

work page 2026
[7]

Trinh, et al

Yuri Chervonyi, Trieu H. Trinh, et al. Gold-medalist performance in solving olympiad geometry with AlphaGeometry2.Journal of Machine Learning Research, 26(241):1–39, 2025. URL https://www.jmlr.org/papers/volume26/25-1654/25-1654.pdf

work page 2025
[8]

Tool-star: Empowering LLM-brained multi-tool reasoner via reinforcement learning.arXiv preprint arXiv:2505.16410, 2025

Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, and Ji-Rong Wen. Tool-star: Empowering LLM-brained multi-tool reasoner via reinforcement learning.arXiv preprint arXiv:2505.16410, 2025

work page arXiv 2025
[9]

Trustgeogen: Formal-verified data engine for trustworthy multi-modal geometric problem solving.arXiv preprint arXiv:2504.15780, 2026

Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, and Bo Zhang. Trustgeogen: Formal-verified data engine for trustworthy multi-modal geometric problem solving.arXiv preprint arXiv:2504.15780, 2026

work page arXiv 2026
[10]

GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

Yumeng Fu, Jiayin Zhu, Lingling Zhang, Wenjun Wu, Bo Zhao, Shaoxuan Ma, Yushun Zhang, and Jun Liu. GeoLaux: A benchmark for evaluating MLLMs’ geometry performance on long-step problems requiring auxiliary lines.arXiv preprint arXiv:2508.06226, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Gemini 2.5: Our most intelligent AI model, 2025

Google DeepMind. Gemini 2.5: Our most intelligent AI model, 2025. URL https://blog.google/technology/google-deepmind/ gemini-model-thinking-updates-march-2025/

work page 2025
[12]

Gemini 3 Flash: Frontier intelligence built for speed, 2025

Google DeepMind. Gemini 3 Flash: Frontier intelligence built for speed, 2025. URLhttps://blog.google/products-and-platforms/products/gemini/ gemini-3-flash/

work page 2025
[13]

Welcome Gemma 4: Frontier multimodal intelligence on device, 2026

Google DeepMind. Welcome Gemma 4: Frontier multimodal intelligence on device, 2026. URLhttps://huggingface.co/blog/gemma4

work page 2026
[14]

ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning

Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, et al. ThinkMorph: Emergent properties in multimodal interleaved chain-of-thought reasoning. InICLR, 2026. URL https: //openreview.net/forum?id=mB3vxfrQZM. 10

work page 2026
[15]

Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020,

Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, and Jing Zhang. GeoVLMath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020, 2025

work page arXiv 2025
[16]

OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computa...

work page 2024
[17]

Geogebra

Markus Hohenwarter and Judith Hohenwarter. Geogebra. https://www.geogebra.org. Dynamic Mathematics Software

work page
[18]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024

Yushi Hu, Weijia Shi, et al. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InNeurIPS, pages 139348–139379, 2024. arXiv:2406.09403

work page arXiv 2024
[19]

Socratic-geo: Synthetic data generation and geometric reasoning via multi-agent interaction.arXiv preprint arXiv:2602.03414, 2026

Zhengbo Jiao, Shaobo Wang, Zifan Zhang, et al. Socratic-geo: Synthetic data generation and geometric reasoning via multi-agent interaction.arXiv preprint arXiv:2602.03414, 2026

work page arXiv 2026
[20]

GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

Jinwoong Kim, Rui Yang, and Huishuai Zhang. GeoBuildBench: A benchmark for interactive and executable geometry construction from natural language.arXiv preprint arXiv:2605.13167, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Giac and GeoGebra – improved Gröbner basis computations

Zoltán Kovács and Bernard Parisse. Giac and GeoGebra – improved Gröbner basis computations. InComputer Algebra and Polynomials, volume 8942 ofLNCS, pages 126–138. Springer, 2015

work page 2015
[22]

Reasoning and tool-use compete in agentic RL: From quantifying interference to disentangled tuning.arXiv preprint arXiv:2602.00994, 2026

Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, and Tieying Zhang. Reasoning and tool-use compete in agentic RL: From quantifying interference to disentangled tuning.arXiv preprint arXiv:2602.00994, 2026

work page arXiv 2026
[23]

In-the-flow agentic system optimization for effective planning and tool use

Zhuofeng Li, Haoxiang Zhang, et al. In-the-flow agentic system optimization for effective planning and tool use. InICLR, 2026. URL https://openreview.net/forum?id= Mf5AleTUVK. Oral

work page 2026
[24]

Understanding tool-integrated reasoning.arXiv preprint arXiv:2508.19201, 2025

Heng Lin and Zhongwen Xu. Understanding tool-integrated reasoning.arXiv preprint arXiv:2508.19201, 2025

work page arXiv 2025
[25]

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Jiahang Lin, Shichun Liu, Chengjun Pan, et al. Agentic harness engineering: Observability- driven automatic evolution of coding-agent harnesses.arXiv preprint arXiv:2604.25850, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, and Kevin P. Murphy. AutoHarness: improving LLM agents by automatically synthesizing a code harness.arXiv preprint arXiv:2603.03329, 2026

work page arXiv 2026
[27]

Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning

Pan Lu, Ran Gong, Shibiao Jiang, et al. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. InACL, pages 6774–6786, 2021. doi: 10.18653/ v1/2021.acl-long.528. URLhttps://aclanthology.org/2021.acl-long.528/

work page 2021
[28]

MathVista: Evaluating mathe- matical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathe- matical reasoning of foundation models in visual contexts. InICLR, 2024. URL https: //openreview.net/forum?id=KUNzEQMWU7. Oral

work page 2024
[29]

Thinking with visual primitives

Ruijie Lu, Yiyang Ma, Xiaokang Chen, Lingxiao Luo, Zhiyu Wu, Zizheng Pan, Xingchao Liu, et al. Thinking with visual primitives. 2026. DeepSeek-AI

work page 2026
[30]

From narrow to panoramic vision: Attention- guided cold-start reshapes multimodal reasoning

Ruilin Luo, Chufan Shi, Yizhen Zhang, et al. From narrow to panoramic vision: Attention- guided cold-start reshapes multimodal reasoning. InICLR, 2026. URL https:// openreview.net/forum?id=4tsfY0lI1w

work page 2026
[31]

Geogram- bench: Benchmarking the geometric program reasoning in modern llms

Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, and Yong Wu. Geogram- bench: Benchmarking the geometric program reasoning in modern llms. InICLR, 2026. URL https://openreview.net/forum?id=MrJoBgN1VO. 11

work page 2026
[32]

A survey of deep learning for geometry problem solving.arXiv preprint arXiv:2507.11936, 2025

Jianzhe Ma, Wenxuan Wang, and Qin Jin. A survey of deep learning for geometry problem solving.arXiv preprint arXiv:2507.11936, 2025

work page arXiv 2025
[33]

GPT-4o system card, 2024

OpenAI. GPT-4o system card, 2024. URL https://openai.com/index/ gpt-4o-system-card/

work page 2024
[34]

Introducing o3 and o4-mini, 2025

OpenAI. Introducing o3 and o4-mini, 2025. URL https://openai.com/index/ introducing-o3-and-o4-mini/

work page 2025
[35]

Enhancing the geometric problem-solving ability of multimodal LLMs via symbolic-neural integration.arXiv preprint arXiv:2504.12773, 2025

Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, and Feng Ma. Enhancing the geometric problem-solving ability of multimodal LLMs via symbolic-neural integration.arXiv preprint arXiv:2504.12773, 2025

work page arXiv 2025
[36]

GeoDRL: A self-learning framework for geometry problem solving using reinforcement learning in deductive reasoning

Shuai Peng, Di Fu, Yijun Liang, Liangcai Gao, and Zhi Tang. GeoDRL: A self-learning framework for geometry problem solving using reinforcement learning in deductive reasoning. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13468–13480,

work page 2023
[37]

URL https://aclanthology.org/ 2023.findings-acl.850/

doi: 10.18653/v1/2023.findings-acl.850. URL https://aclanthology.org/ 2023.findings-acl.850/

work page doi:10.18653/v1/2023.findings-acl.850 2023
[38]

AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning

Bowen Ping, Minnan Luo, Zhuohang Dang, Chenxi Wang, and Chengyou Jia. AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning. InICLR, 2026. URLhttps://openreview.net/forum?id=PVtZnUh04m

work page 2026
[39]

SMART: Self-aware agent for tool overuse mitigation

Cheng Qian, Emre Can Acikgoz, et al. SMART: Self-aware agent for tool overuse mitigation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 4604–4621,

work page 2025
[40]

URL https://aclanthology.org/ 2025.findings-acl.239/

doi: 10.18653/v1/2025.findings-acl.239. URL https://aclanthology.org/ 2025.findings-acl.239/

work page doi:10.18653/v1/2025.findings-acl.239 2025
[41]

We-math 2.0: A versatile mathbook system for incentivizing visual mathematical reasoning

Runqi Qiao, Qiuna Tan, et al. We-math 2.0: A versatile mathbook system for incentivizing visual mathematical reasoning. InICLR, 2026. URL https://openreview.net/forum?id= I7fTPLT8A9

work page 2026
[42]

Toolformer: Language models can teach themselves to use tools

Timo Schick et al. Toolformer: Language models can teach themselves to use tools. In NeurIPS, pages 68539–68551, 2023. URL https://openreview.net/forum?id= Yacmpz84TH. Oral

work page 2023
[43]

Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. InThe Thirty-Eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview...

work page 2024
[44]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

MathCanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning.arXiv preprint arXiv:2510.14958, 2025

Weikang Shi, Aldrich Yu, Rongyao Fang, Houxing Ren, Ke Wang, Aojun Zhou, Changyao Tian, Xinyu Fu, Yuxuan Hu, Zimu Lu, Linjiang Huang, Si Liu, Rui Liu, and Hongsheng Li. MathCanvas: Intrinsic visual chain-of-thought for multimodal mathematical reasoning.arXiv preprint arXiv:2510.14958, 2025

work page arXiv 2025
[46]

Newclid: A user-friendly replacement for AlphaGeometry.arXiv preprint arXiv:2411.11938, 2024

Vladmir Sicca, Tianxiang Xia, Mathïs Fédérico, Philip John Gorinski, Simon Frieder, and Shangling Jui. Newclid: A user-friendly replacement for AlphaGeometry.arXiv preprint arXiv:2411.11938, 2024

work page arXiv 2024
[47]

Canvas-of-thought: Grounding reasoning via mutable structured states.arXiv preprint arXiv:2602.10494, 2026

Lingzhuang Sun, Yuxia Zhu, Ruitong Liu, et al. Canvas-of-thought: Grounding reasoning via mutable structured states.arXiv preprint arXiv:2602.10494, 2026

work page arXiv 2026
[48]

Math blind: Failures in diagram understanding undermine reasoning in MLLMs

Yanpeng Sun, Shan Zhang, et al. Math blind: Failures in diagram understanding undermine reasoning in MLLMs. InICLR, 2026. URL https://openreview.net/forum?id= RtvmTxdQV9. 12

work page 2026
[49]

Acting less is reasoning more! teaching model to act efficiently.arXiv preprint arXiv:2504.14870, 2025

Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently.arXiv preprint arXiv:2504.14870, 2025

work page arXiv 2025
[50]

SOLIDGEO: Measuring multimodal spatial math reasoning in solid geometry.arXiv preprint arXiv:2505.21177, 2025

Peijie Wang, Chao Yang, et al. SOLIDGEO: Measuring multimodal spatial math reasoning in solid geometry.arXiv preprint arXiv:2505.21177, 2025

work page arXiv 2025
[51]

GeometryZero: Advancing Geometry Solving via Group Contrastive Policy Optimization

Yikun Wang, Yibin Wang, Dianyi Wang, Zimian Peng, Qipeng Guo, Dacheng Tao, and Jiaqi Wang. GeometryZero: Advancing geometry solving via group contrastive policy optimization. arXiv preprint arXiv:2506.07160, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52] Zhaokai Wang, Penghao Yin, et al. GenExam: A multidisciplinary text-to-image exam. In ICML, 2026
[53] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, volume 35, pages 24824–24837, 2022. URL https://openreview.net/forum?id=_VjQlMeSB_J
[54] Jingxuan Wei, Caijun Jia, Xi Bai, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Lijun Wu, and Cheng Tan. GGBench: A geometric generative reasoning benchmark for unified multimodal models. arXiv preprint arXiv:2511.11134, 2025
[55] Lei Wei, Xiao Peng, Jinpeng Ou, and Bin Wang. Think-augmented function calling: Improving LLM parameter accuracy through embedded reasoning. arXiv preprint arXiv:2601.18282, 2026
[56] Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu, Zhiyang Teng, Xiaozhang Liu, and Hanmeng Liu. GeoSketch: A neural-symbolic approach to geometric multimodal reasoning with auxiliary line construction and affine transformation. arXiv preprint arXiv:2509.22460, 2025
[57] Weiming Wu, Zi-Kang Wang, Jin Ye, Zhi Zhou, Yu-Feng Li, and Lan-Zhe Guo. NesyGeo: A neuro-symbolic framework for multimodal geometric reasoning data generation. In 2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview.net/forum?id=t4tIV04qUp. arXiv:2505.17121
[58] Wenjun Wu, Lingling Zhang, Bo Zhao, Muye Huang, Qianying Wang, and Jun Liu. Causal-R: A causal-reasoning geometry problem solver for optimized solution exploration. In NeurIPS, 2025. URL https://openreview.net/forum?id=eRgYGhFRgZ
[60] Zhenyu Wu, Yanxi Long, Jian Li, and Hua Huang. Geo-code: A code framework for reverse code generation from geometric images based on two-stage multi-agent evolution. arXiv preprint arXiv:2602.07749, 2026
[61] Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, Jun Song, and Bo Zheng. GeoSense: Evaluating identification and application of geometric principles in multimodal reasoning. arXiv preprint arXiv:2504.12597, 2025
[62] Ningning Xu, Yuxuan Jiang, et al. Learning how to use tools, not just when: Pattern-aware tool-integrated reasoning. arXiv preprint arXiv:2509.23292, 2025
[63] Chengrui Zhang, Maizhen Ning, et al. Geosdf: Plane geometry diagram synthesis via signed distance field. arXiv preprint arXiv:2506.13492, 2025
[64] Ming-Liang Zhang, Zhong-Zhi Li, Fei Yin, Liang Lin, and Cheng-Lin Liu. Fuse, reason and verify: Geometry problem solving with parsed clauses from diagram. arXiv preprint arXiv:2407.07327, 2024
[65] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In ECCV, 2024. doi: 10.1007/978-3-031-73242-3_10. arXiv:2403.14624
[66] Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong, et al. How RL unlocks the aha moment in geometric interleaved reasoning. arXiv preprint arXiv:2603.01070, 2026
[67] Haiteng Zhao, Junhao Shen, Yiming Zhang, et al. Achieving olympia-level geometry large language model agent via complexity boosting reinforcement learning. In ICLR, 2026. URL https://openreview.net/forum?id=1sffPGGQyT
[68] Junbo Zhao, Ting Zhang, Jiayu Sun, Mi Tian, and Hua Huang. Pi-GPS: Enhancing geometry problem solving by unleashing the power of diagrammatic information. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1526–1536, 2025
[69] Yurui Zhao, Xiang Wang, Jiahong Liu, Irwin King, and Zhitao Huang. Towards geometry problem solving in the large model era: A survey. In 2nd AI for Math Workshop @ ICML 2025, 2025. URL https://openreview.net/forum?id=8o2hHIXrzV. arXiv:2506.02690
[70] Chenyu Zhou, Huacan Chai, Wenteng Chen, Zihan Guo, Rong Shan, Yuanyi Song, Tianyi Xu, Yingxuan Yang, Aofan Yu, Weiming Zhang, Congming Zheng, Jiachen Zhu, et al. Externalization in LLM agents: A unified review of memory, skills, protocols and harness engineering. arXiv preprint arXiv:2604.08224, 2026
Table notes: T2I = text-to-image; MLLM = multimodal large language model. Base model shown in gray below each closed-source method; secondary per-vendor variants commented in the source for reference.

Closed-source Models
Method | Base model | Strict | Relaxed
Draw2Think | Gemini 3 Flash Preview (Dec. 2025) | 68.2 | 90.5
Nano Banana 2 | Gemini 3.1 Flash Image (Feb. 2026) | 56.3 | 87.8
Nano Banana P...
1. Identify parallel: arrows on QN and PO give $QN \parallel PO$.
2. Apply BPT: in $\triangle MPO$, $\frac{MQ}{QP} = \frac{MN}{NO}$.
3. Substitute: $MQ = 5$, $QP = x$, $MN = 6$, $NO = 3.6$, so $\frac{5}{x} = \frac{6}{3.6}$.
4. Solve: $6x = 18 \Rightarrow x = 3$.

Answer. $QP = 3$. Correct; each quantity is asserted in text without an external verifier.

Problem. Question. Find QP. Choices. (A) 2 (B) 3 (C) 5 (D) 6. Expected answer. B (QP = 3).

AutoGPS: proof-graph search. Stepwise Reasoning Process for Noise Data, borrowed intact from AutoGPS (ICLR 2026). Step 1: Known facts: start ⟹ x = PQ, 6 = MN, 3 +...
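The proportion in the text-only solution above can be checked numerically. The following is a minimal sketch and not code from the paper; the helper name `solve_bpt_segment` is hypothetical, and exact rational arithmetic is used only to avoid floating-point noise.

```python
from fractions import Fraction

def solve_bpt_segment(mq, mn, no):
    """Solve MQ/QP = MN/NO for QP (hypothetical helper, for illustration only)."""
    # Cross-multiplying the proportion gives QP = MQ * NO / MN.
    return Fraction(mq) * Fraction(no) / Fraction(mn)

# Values from the worked example: MQ = 5, MN = 6, NO = 3.6.
qp = solve_bpt_segment("5", "6", "3.6")
print(qp)  # 3
```

The result matches choice (B). A check like this only confirms the arithmetic; it does not confirm that QN is actually parallel to PO in the diagram, which is the kind of premise the text-only solution asserts without a verifier.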
1. Tick marks give $AD = CD = 32$, so $\triangle ADC$ is isosceles. The altitude $DB \perp AC$ bisects $AC$, hence $AB = BC = x$.
2. In right $\triangle DBC$: $\cos \angle C = BC/CD$, so $BC = 32 \cos 54^\circ \approx 32 \cdot 0.5878 \approx 18.81$.
3. $x \approx 18.8$. Answer: A.

Causal-R: causal-graph deduction. Two equivalent 2-step solutions, reproduced verbatim from CausalR (NeurIPS 2025).
Solution 1. Step 1: Use Isosceles Triangle Theorem. ∵ BD = BC ∴ ∠BDA = ∠BCA = 54°. Step 2: Use Cosine of Triangle. ∵ ∠BAD = 90° ∴ AD = BD × cos(∠BDA) ∴ AD = 32 × cos(54°) = 18.8.
Solution 2. Step 1: Use Cosine of Triangle. ∵ ∠BAC =...
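The cosine step above is easy to reproduce. This is a small sketch under the stated values (hypotenuse 32, angle 54°); the function name `adjacent_side` is an assumption made for illustration and does not come from the paper.

```python
import math

def adjacent_side(hypotenuse, angle_deg):
    """Side adjacent to angle_deg in a right triangle (illustrative helper)."""
    # cos(angle) = adjacent / hypotenuse, so adjacent = hypotenuse * cos(angle).
    return hypotenuse * math.cos(math.radians(angle_deg))

# Values from the worked example: CD = 32 and angle C = 54 degrees.
bc = adjacent_side(32, 54)
print(round(bc, 2))  # 18.81
```

Rounded to one decimal place this gives 18.8, in line with the answer reported in both solutions.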
1. Identify dimensions: width $w = AB = 11$ cm; height $h = BF = 11$ cm; length $l = GF = 15$ cm.
2. 3D Pythagorean theorem: $y = \sqrt{w^2 + l^2 + h^2}$.
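Plugging the listed dimensions into the 3D Pythagorean formula can be sketched as follows. The source excerpt stops at the formula, so the numeric value here is a computation under those stated dimensions, not a result quoted from the paper; `space_diagonal` is a hypothetical name.

```python
import math

def space_diagonal(w, l, h):
    """Space diagonal of a rectangular box: sqrt(w^2 + l^2 + h^2)."""
    return math.sqrt(w**2 + l**2 + h**2)

# Dimensions from the worked example, in centimetres: w = 11, l = 15, h = 11.
y = space_diagonal(11, 15, 11)
print(round(y, 2))  # 21.61
```

Here $w^2 + l^2 + h^2 = 121 + 225 + 121 = 467$, so $y = \sqrt{467} \approx 21.61$ cm.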
