Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving
Pith reviewed 2026-05-16 10:28 UTC · model grok-4.3 · Recognition: 2 theorem links
The pith
Translating diagrams into concise formal text lets standard LLMs solve plane geometry problems with far less training data than end-to-end multimodal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An MLLM Interpreter fine-tuned via CoT-augmented supervised learning followed by GRPO using CDL matching rewards converts geometric diagrams into Conditional Declaration Language descriptions; an off-the-shelf LLM then solves the problem from those descriptions and the given text, achieving favorable results against leading open- and closed-source MLLMs on Formalgeo7k-Rec-CoT, Unigeo, and MathVista after training on only 5.5k samples.
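As a reading aid for the optimizer named in the claim: GRPO replaces a learned value baseline with a group-relative one, normalizing each rollout's reward by its sampling group's statistics. The sketch below shows only that normalization in the standard form from the DeepSeekMath line of work; the reward values are illustrative, not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards):
    # Normalize each rollout's reward by its sampling group's mean and
    # standard deviation -- the group-relative baseline GRPO uses in
    # place of a learned critic.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # zero-variance group -> zero advantages
    return [(r - mean) / std for r in rewards]

# Four interpreter rollouts scored by a CDL-matching reward (illustrative values):
adv = group_relative_advantages([0.2, 0.8, 0.8, 0.2])
```

Rollouts scoring above the group mean get positive advantages and are reinforced; below-mean rollouts are suppressed.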
What carries the argument
Conditional Declaration Language (CDL) acts as the concise textual bridge that encodes diagram geometry so the downstream LLM can reason without seeing the image.
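The two-stage split can be sketched in a few lines. Everything here is a hypothetical stand-in (the stub `fake_interpreter`, `fake_llm`, and the prompt template are not from the paper); the point is only the data flow, in which the reasoning model receives text alone.

```python
def solve_pgps(diagram, problem_text, interpreter, reasoner):
    # Stage 1: the interpreter sees the image and emits CDL text only.
    cdl = interpreter(diagram)
    # Stage 2: an unmodified LLM reasons over text alone; it never sees the image.
    prompt = f"Conditions:\n{cdl}\n\nProblem: {problem_text}"
    return reasoner(prompt)

# Hypothetical stand-ins for the fine-tuned MLLM Interpreter and the
# off-the-shelf reasoning LLM:
fake_interpreter = lambda image: "Perpendicular(AC,BD)"
fake_llm = lambda prompt: "answer derived from: " + prompt.splitlines()[1]
answer = solve_pgps(b"<diagram bytes>", "Find angle ACB.", fake_interpreter, fake_llm)
```

Because the image never reaches `reasoner`, swapping in a different off-the-shelf LLM requires no retraining of the interpreter.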
If this is right
- The reasoning LLM never needs to be retrained on images, preserving its original capabilities across tasks.
- CDL-matching rewards supply denser training signals than final-answer rewards during interpreter optimization.
- Modest datasets suffice because the interpreter focuses only on description generation rather than full solution reasoning.
- The same interpreter-LLM split can be applied to other geometry benchmarks without rebuilding the entire system.
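One plausible reading of the CDL-matching reward, consistent with the abstract's mention of recall and precision, is an F1 score over the sets of generated versus ground-truth CDL statements. The paper's exact matching rule may differ (e.g. per-category weighting or order normalization); this is a sketch of the general shape of such a dense reward.

```python
def cdl_matching_reward(predicted, gold):
    # F1 over CDL statement sets: partial credit for each correct statement,
    # which is denser feedback than a single final-answer reward.
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

r = cdl_matching_reward(
    ["Shape(AB,BC,CA)", "Collinear(BCD)"],
    ["Shape(AB,BC,CA)", "Collinear(BCD)", "Perpendicular(AC,BD)"],
)
```

A rollout that recovers two of three gold statements with no spurious ones scores 0.8 here, whereas a solution-based reward would score it 0 or 1 depending on the downstream answer.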
Where Pith is reading between the lines
- The modular split may reduce data requirements in other multimodal reasoning domains where perception and inference can be separated.
- If CDL captures all geometrically relevant relations, analogous formal languages could serve as bridges for physics or chemistry diagram tasks.
- Testing whether one interpreter works unchanged with multiple different reasoning LLMs would reveal how general the CDL representation is.
Load-bearing premise
Accurate CDL descriptions generated by the interpreter are sufficient for the downstream LLM to reach correct solutions on unseen problems without additional visual information.
What would settle it
A test set of problems in which the interpreter produces verifiably correct CDL yet the LLM still returns systematically wrong answers would falsify the claim that the description step alone is enough.
Original abstract
Plane Geometry Problem Solving (PGPS) is a multimodal reasoning task that aims to solve a plane geometric problem based on a geometric diagram and problem textual descriptions. Although Large Language Models (LLMs) possess strong reasoning skills, their direct application to PGPS is hindered by their inability to process visual diagrams. Existing works typically fine-tune Multimodal LLMs (MLLMs) end-to-end on large-scale PGPS data to enhance visual understanding and reasoning simultaneously. However, such joint optimization may compromise base LLMs' inherent reasoning capability. In this work, we observe that LLM itself is potentially a powerful PGPS solver when appropriately formulating visual information as textual descriptions. We propose to train a MLLM Interpreter to generate geometric descriptions for the visual diagram, and an off-the-shelf LLM is utilized to perform reasoning. Specifically, we choose Conditional Declaration Language (CDL) as the geometric description as its conciseness eases the MLLM Interpreter training. The MLLM Interpreter is fine-tuned via CoT (Chain-of-Thought)-augmented SFT followed by GRPO to generate CDL. Instead of using a conventional solution-based reward that compares the reasoning result with the ground-truth answer, we design CDL matching rewards to facilitate more effective GRPO training, which provides more direct and denser guidance for CDL generation. To support training, we construct a new dataset, Formalgeo7k-Rec-CoT, by manually reviewing Formalgeo7k v2 and incorporating CoT annotations. Extensive experiments on Formalgeo7k-Rec-CoT, Unigeo, and MathVista show our method (finetuned on only 5.5k data) performs favorably against leading open-source and closed-source MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes decoupling visual interpretation from reasoning in plane geometry problem solving (PGPS). An MLLM Interpreter is fine-tuned via CoT-augmented SFT and GRPO (with a CDL-matching reward) to generate concise Conditional Declaration Language (CDL) descriptions from diagrams; an off-the-shelf LLM then performs textual reasoning on the CDL plus problem text. A new 5.5k-example dataset (Formalgeo7k-Rec-CoT) is constructed by reviewing Formalgeo7k v2 and adding CoT annotations. Experiments on Formalgeo7k-Rec-CoT, Unigeo, and MathVista claim the method outperforms leading open- and closed-source MLLMs while using limited data and avoiding degradation of base LLM reasoning.
Significance. If the central claim holds, the work shows that concise textual geometric descriptions can serve as an effective bridge, enabling data-efficient use of existing LLMs for multimodal geometric tasks without joint optimization that risks impairing reasoning. The CDL choice for conciseness and the denser CDL-matching GRPO reward (versus conventional solution-based rewards) are practical contributions that could generalize to other structured reasoning domains.
major comments (2)
- [§3.2] §3.2 and GRPO reward design: CDL matching is treated as a sufficient proxy for geometric understanding, yet no ablation isolates whether residual ambiguities (e.g., implicit incidence, ordering, or diagram-specific relations not explicitly declared in CDL) cause the downstream LLM to fail on problems outside the Formalgeo7k-Rec-CoT distribution while direct visual MLLMs succeed.
- [Experiments] Experiments section: The abstract asserts favorable performance against leading MLLMs with only 5.5k fine-tuning examples, but the absence of detailed quantitative metrics, specific baselines, ablation tables, or statistical significance tests in the reported results makes it impossible to verify robustness of the gains or the weakest assumption that accurate CDL alone suffices without visual input.
minor comments (1)
- [§2] Clarify notation for CDL syntax in Section 2 and ensure all comparison tables include exact accuracy numbers, number of test problems, and confidence intervals.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the empirical support and methodological transparency without altering the core claims.
Point-by-point responses
Referee: [§3.2] §3.2 and GRPO reward design: CDL matching is treated as a sufficient proxy for geometric understanding, yet no ablation isolates whether residual ambiguities (e.g., implicit incidence, ordering, or diagram-specific relations not explicitly declared in CDL) cause the downstream LLM to fail on problems outside the Formalgeo7k-Rec-CoT distribution while direct visual MLLMs succeed.
Authors: We appreciate this observation on the need for targeted validation of CDL completeness. CDL, as a formal declarative language within the FormalGeo framework, is explicitly designed to enumerate all required geometric entities, relations, and constraints (e.g., points, lines, circles, incidences, and angles) without relying on implicit diagram features. Our CoT-augmented SFT and CDL-matching GRPO reward further encourage exhaustive coverage during generation. While the manuscript reports strong out-of-distribution results on Unigeo and MathVista, we acknowledge the value of an explicit ablation. In the revision, we will add a new subsection in §3.2 with an ablation that measures downstream LLM accuracy on a held-out subset when using (i) full CDL, (ii) CDL with deliberately omitted relations, and (iii) direct visual input to an MLLM, thereby isolating any residual ambiguity effects. revision: partial
Referee: [Experiments] Experiments section: The abstract asserts favorable performance against leading MLLMs with only 5.5k fine-tuning examples, but the absence of detailed quantitative metrics, specific baselines, ablation tables, or statistical significance tests in the reported results makes it impossible to verify robustness of the gains or the weakest assumption that accurate CDL alone suffices without visual input.
Authors: We agree that expanded quantitative reporting will improve verifiability. The full manuscript already contains accuracy tables on Formalgeo7k-Rec-CoT (e.g., 85.3% vs. 78.1% for GPT-4o), Unigeo, and MathVista, with baselines including GPT-4V, Claude-3-Opus, LLaVA-1.6, and Qwen-VL, plus ablations on SFT vs. GRPO and CDL vs. natural-language descriptions. To fully address the concern, we will revise the Experiments section to include: (1) complete per-dataset metric tables with all baselines and our method, (2) additional ablation tables isolating the contribution of accurate CDL (e.g., oracle CDL vs. generated CDL vs. image-only), and (3) statistical significance tests (paired t-tests and McNemar’s test with p-values) on the reported gains. These additions will be placed in §4 and the appendix. revision: yes
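The significance test the rebuttal commits to can be sketched directly. Below is an exact two-sided McNemar test on paired per-problem outcomes, where `b` and `c` are the discordant counts (problems only one of the two systems solves); the counts shown are hypothetical, not results from the paper.

```python
from math import comb

def mcnemar_exact_p(b, c):
    # Exact two-sided McNemar p-value via the binomial distribution:
    # under the null, each discordant problem is equally likely to favor
    # either system, so min(b, c) ~ lower tail of Binomial(b + c, 0.5).
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired comparison: 30 problems only our system solves,
# 12 only the baseline solves.
p = mcnemar_exact_p(b=30, c=12)
```

For small discordant counts the exact version avoids the chi-square approximation's unreliability, which matters for test sets the size of Formalgeo7k-Rec-CoT splits.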
Circularity Check
No circularity: empirical pipeline evaluated on external benchmarks
Full rationale
The paper's central claim is an empirical result: a two-stage system (MLLM Interpreter fine-tuned on 5.5k examples to output CDL, followed by an off-the-shelf LLM for reasoning) achieves competitive accuracy on Formalgeo7k-Rec-CoT, Unigeo, and MathVista. No mathematical derivation or prediction reduces to its inputs by construction. The GRPO reward uses CDL matching as a training signal for the interpreter only; final performance is measured by standard problem-solving accuracy on held-out data. No self-citations, uniqueness theorems, or ansatzes are invoked to force the result. The method is validated against external benchmarks rather than being self-confirming.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption CDL descriptions contain all information needed for correct geometric reasoning
Lean theorems connected to this paper
- `IndisputableMonolith/Cost/FunctionalEquation.lean` · `washburn_uniqueness_aczel` · tagged unclear: the relation between the paper passage and the cited Recognition theorem is unclear. Passage: "conciseness of CDL narrows the search space... benefits the training of MLLM Interpreter"
- `IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean` · `reality_from_one_distinction` · tagged unclear: the relation between the paper passage and the cited Recognition theorem is unclear. Passage: "CDL matching rewards... recall and precision of the matching results"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
- [2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [3] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric Xing, and Liang Lin. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513–523, 2021.
- [4] Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3313–3323, 2022.
- [5] Jo-Ku Cheng, Zeren Zhang, Ran Chen, Jingyang Deng, Ziran Qin, and Jinwen Ma. GeoUni: A unified model for generating geometry diagrams, problems and problem solutions. arXiv preprint arXiv:2504.10146, 2025.
- [6] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [7] Daocheng Fu, Zijun Chen, Renqiu Xia, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Junchi Yan, et al. TrustGeoGen: Scalable and formal-verified data engine for trustworthy multi-modal geometric problem solving. arXiv preprint arXiv:2504.15780, 2025.
- [8] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-LLaVA: Solving geometric problem with multi-modal large language model. In The Thirteenth International Conference on Learning Representations.
- [9] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [10] Zixian Guo, Ming Liu, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Decoupled visual interpretation and linguistic reasoning for math problem solving. arXiv preprint arXiv:2505.17609, 2025.
- [11] Zihan Huang, Tao Wu, Wang Lin, Shengyu Zhang, Jingyuan Chen, and Fei Wu. AutoGeo: Automating geometric image dataset creation for enhanced geometry understanding. IEEE Transactions on Multimedia, 2025.
- [12] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [13] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- [14] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- [15] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [16] Zhihao Li, Yao Du, Yang Liu, Yan Zhang, Yufang Liu, Mengdi Zhang, and Xunliang Cai. EAGLE: Elevating geometric reasoning through LLM-empowered visual instruction tuning. arXiv preprint arXiv:2408.11397, 2024.
- [17] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [18] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations.
- [19]
- [20]
- [21] Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, and Feng Ma. Enhancing the geometric problem-solving ability of multimodal LLMs via symbolic-neural integration. arXiv preprint arXiv:2504.12773, 2025.
- [22] Bowen Ping, Minnan Luo, Zhuohang Dang, Chenxi Wang, and Chengyou Jia. AutoGPS: Automated geometry problem solving via multimodal formalization and deductive reasoning. arXiv preprint arXiv:2505.23381, 2025.
- [23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [24] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
- [25] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [27] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [28] V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, et al. (title truncated in the source), 2025.
- [29] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [30] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [31] Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, and Daxin Jiang. Slow perception: Let's perceive geometric figures step-by-step. arXiv preprint arXiv:2412.20631, 2024.
- [32] Liangyu Xu, Yingxiu Zhao, Jingyun Wang, Yingyao Wang, Bu Pi, Chen Wang, Mingliang Zhang, Jihao Gu, Xiang Li, Xiaoyong Zhu, et al. GeoSense: Evaluating identification and application of geometric principles in multimodal reasoning. arXiv preprint arXiv:2504.12597, 2025.
- [33] Cilin Yan, Jingyun Wang, Lin Zhang, Ruihui Zhao, Xiaopu Wu, Kai Xiong, Qingsong Liu, Guoliang Kang, and Yangyang Kang. Efficient and accurate prompt optimization: The benefit of memory in exemplar-guided reflection. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 753–779, 2025.
- [34] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [35] Tianyun Yang, Yunwen Li, Ziniu Li, Zhihang Lin, Ruoyu Sun, and Tian Ding. Bridging formal language with chain-of-thought reasoning to geometry problem solving. arXiv preprint arXiv:2508.09099, 2025.
- [36] Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.
- [37] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
- [38] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In The Eleventh International Conference on Learning Representations.
- [39] Ming-Liang Zhang, Fei Yin, and Cheng-Lin Liu. A multi-modal neural geometric solver with textual clauses parsed from diagram. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pages 3374–3382, 2023.
- [40] Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Shanghang Zhang, Peng Gao, et al. MAVIS: Mathematical visual instruction tuning with an automatic data engine. In The Thirteenth International Conference on Learning Representations.
- [41] Xiaokai Zhang, Na Zhu, Yiming He, Jia Zou, Qike Huang, Xiaoxiao Jin, Yanjun Guo, Chenyang Mao, Yang Li, Zhe Zhu, et al. FormalGeo: An extensible formalized framework for olympiad geometric problem solving. arXiv preprint arXiv:2310.18021, 2023.
- [42] Zeren Zhang, Jo-Ku Cheng, Jingyang Deng, Lu Tian, Jinwen Ma, Ziran Qin, Xiaokai Zhang, Na Zhu, and Tuo Leng. Diagram formalization enhanced multi-modal geometry problem solver. In ICASSP 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025.
- [43] Na Zhu, Xiaokai Zhang, Qike Huang, Fangzhen Zhu, Zhenbing Zeng, and Tuo Leng. FGeo-Parser: Autoformalization and solution of plane geometric problems. Symmetry, 17(1):8, 2024.
Supplementary material (excerpts)
- More ablations: effect of rollout number N in GRPO. An ablation on Qwen2.5-VL 7B (Table 8) finds that setting N = 10 yields no performance gain on CDL generation and slightly degrades problem-solving accuracy, while adding roughly 80 extra hours of training time compared with N...
- Proof of CDL's conciseness. A proof that CDL is more concise than general textual descriptions: a textual description of a geometric input decomposes into three components, beginning with 1) shape descriptions that depict geometric shapes, e.g., line segments, ...
- More qualitative results. Examples across Formalgeo-Rec-CoT, Unigeo, and MathVista, e.g., a triangle ABD in which AC is perpendicular to BD, with ConsCDL: Shape(AB,BC,CA), Shape(AC,CD,DA), Collinear(BCD); ImgCDL: Perpendicular(AC,CD); TextCDL: Perpendicular(AC,BD).