arxiv: 2605.16371 · v1 · pith:BQPXED4Bnew · submitted 2026-05-10 · 💻 cs.CV · cs.AI

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

Pith reviewed 2026-05-20 22:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords geometric reasoning · multimodal models · neuro-symbolic synthesis · chain-of-thought data · symbolic verification · dataset generation · vision-language models · diagram understanding

The pith

A neuro-symbolic engine generates 127K symbolically verified geometric questions and diagrams to train multimodal models on precise diagram reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the GeoSym Engine as a way to automatically create large volumes of training data for geometric reasoning in large multimodal models. It combines a type-conditional grammar to define problem structures with an analytic solver that produces exact symbolic ground truths and step-by-step reasoning chains. These elements feed a rendering pipeline that outputs high-resolution diagrams paired with verifiable question-answer data. The resulting GeoSym127K dataset, when used for supervised fine-tuning (SFT), produces targeted gains on tasks that depend on reading diagrams and maintaining long reasoning sequences. The authors also show that starting reinforcement learning from these SFT checkpoints raises the performance ceiling compared with zero-shot RL, that is, RL applied without the SFT stage.

Core claim

We propose the GeoSym Engine, an automated and scalable neuro-symbolic framework that leverages a type-conditional grammar and an analytic SymGT Solver to derive exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs, and demonstrate through supervised fine-tuning that the data drives concentrated improvements on diagram-dependent and multi-step geometry tasks.

What carries the argument

The GeoSym Engine carries the argument: a type-conditional grammar generates geometric problem instances, and an analytic SymGT Solver computes exact symbolic ground truths together with verified Chain-of-Thought sequences.
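
The grammar-plus-solver pattern can be illustrated with a toy sketch. This is not the authors' engine: the two problem types, the `Exact` class, the `GRAMMAR` table, and all function names are hypothetical, chosen only to show how a problem type can condition both the sampled parameters and the closed-form solver that yields an exact answer.

```python
import random
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Exact:
    """Exact value coeff * pi**pi_power with a rational coefficient."""
    coeff: Fraction
    pi_power: int = 0  # 0 for rational answers, 1 for multiples of pi

    def __str__(self):
        return f"{self.coeff}*pi" if self.pi_power else str(self.coeff)


def sample_sector(rng):
    # Central angle is pi/k, so the answer stays an exact multiple of pi.
    return {"r": rng.randint(1, 9), "k": rng.choice([2, 3, 4, 6])}


def solve_sector(p):
    # Sector area = (1/2) * r^2 * theta with theta = pi / k.
    return Exact(Fraction(p["r"] ** 2, 2 * p["k"]), pi_power=1)


def sample_right_triangle(rng):
    return {"a": rng.randint(1, 9), "b": rng.randint(1, 9)}


def solve_right_triangle(p):
    # Area of a right triangle with legs a and b.
    return Exact(Fraction(p["a"] * p["b"], 2))


# Hypothetical "type-conditional" table: each type owns its sampler,
# question template, and analytic solver.
GRAMMAR = {
    "sector_area": (
        sample_sector,
        "A sector has radius {r} and central angle pi/{k}. Find its area.",
        solve_sector,
    ),
    "right_triangle_area": (
        sample_right_triangle,
        "A right triangle has legs {a} and {b}. Find its area.",
        solve_right_triangle,
    ),
}


def generate(problem_type, rng):
    sampler, template, solver = GRAMMAR[problem_type]
    params = sampler(rng)
    return {
        "type": problem_type,
        "params": params,
        "question": template.format(**params),
        "answer": solver(params),
    }
```

Calling `generate("sector_area", random.Random(0))` returns a question string together with an exact answer object. Because the answer is derived from the sampled parameters rather than from a model, it cannot drift from the question it belongs to.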

If this is right

  • Supervised fine-tuning of Qwen3-VL-8B on the dataset produces absolute gains of 22.21 percentage points on the MathVerse Vision-Only subset and 6.19 percentage points on WeMath (reaching 61.52%).
  • The improvements concentrate on diagram-dependent and multi-step geometry tasks while reducing long-horizon logic fragmentation.
  • Initializing Reinforcement Learning with Verifiable Rewards (RLVR) from the structural SFT checkpoints, rather than running zero-shot RL, raises the performance ceiling.
  • The approach yields verifiable exact-match signals that support robust scaling of reasoning synthesis.
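
An exact-match signal of the kind the last bullet refers to can be sketched as follows. The accepted answer syntax (rationals, decimals, and rational multiples of `pi`) and the names `parse_exact` and `exact_match_reward` are assumptions made for illustration; the paper's actual answer format and matcher are not described on this page.

```python
import re
from fractions import Fraction

# Hypothetical answer syntax: optional sign, optional number, optional "pi",
# optional integer denominator, e.g. "3/4", "0.75", "pi/6", "2*pi/3".
_ANSWER = re.compile(r"([+-]?)(\d+(?:\.\d+)?)?\*?(pi)?(?:/(\d+))?")


def parse_exact(text):
    """Return (rational coefficient, is_multiple_of_pi) or None if unparseable."""
    m = _ANSWER.fullmatch(text.replace(" ", ""))
    if m is None:
        return None
    sign, num, pi, den = m.groups()
    if num is None and pi is None:
        return None  # nothing but a sign or a bare denominator
    value = Fraction(num or "1")
    if den is not None:
        if int(den) == 0:
            return None
        value /= int(den)
    if sign == "-":
        value = -value
    return value, pi is not None


def exact_match_reward(prediction, ground_truth):
    """Binary reward: 1.0 only when both answers denote the same exact value."""
    pred = parse_exact(prediction)
    truth = parse_exact(ground_truth)
    return 1.0 if pred is not None and pred == truth else 0.0
```

In this sketch `exact_match_reward("0.75", "3/4")` is 1.0 because both parse to the same rational, while a decimal approximation such as `"0.52"` earns 0.0 against `"pi/6"`. That strictness is what makes the signal deterministic.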

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grammar-plus-solver pattern could be applied to other visual reasoning domains such as physics diagrams or algebraic geometry where exact symbolic answers are computable.
  • Larger datasets produced by the same engine might further close the gap between open and closed models on complex diagram tasks.
  • The verifiable reward structure could be reused for online data filtering or active learning loops that prioritize hard multi-step examples.
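
The filtering idea in the last bullet can be made concrete with a small sketch. It assumes each question has been attempted several times and scored with a binary exact-match reward; the function names, the `max_rate` threshold, and the choice to drop never-solved questions are all illustrative assumptions, not part of the paper.

```python
def pass_rate(rewards):
    """Fraction of rollouts that earned the binary reward."""
    return sum(rewards) / len(rewards)


def select_hard(rollout_rewards, max_rate=0.5):
    """Keep questions solved at least once but no more often than max_rate.

    rollout_rewards maps a question id to a list of 0/1 rewards.
    Returns (question id, pass rate) pairs, hardest first.
    """
    kept = []
    for qid, rewards in rollout_rewards.items():
        if not rewards:
            continue  # no rollouts, nothing to estimate
        rate = pass_rate(rewards)
        if 0.0 < rate <= max_rate:
            kept.append((qid, rate))
    return sorted(kept, key=lambda pair: pair[1])
```

Questions with a pass rate of exactly 0 or 1 are dropped here because every rollout then receives the same reward, which gives a group-based learner nothing to compare. A curriculum that wants to keep unsolved items for later could relax the lower bound.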

Load-bearing premise

The generated symbolic ground truths and CoT pairs contain no systematic errors from the grammar or solver and transfer to real-world diagrams without distribution shift or overfitting to the synthetic rendering style.

What would settle it

Evaluate the fine-tuned model and the baseline on a test set of real photographed or hand-drawn geometry diagrams paired with the same questions. If the fine-tuned model shows no accuracy gain over the baseline, or exhibits new systematic errors, the transfer claim fails.

Figures

Figures reproduced from arXiv: 2605.16371 by Benyou Wang, Jingjing Bai, Jing Yang, Jinhao Jing, Jinwei Liang, Lewei Lu, Por Lip Yee, Prayag Tiwari, Qiannian Zhao, Shawn Chen, Zhan Su, Zheng Ma.

Figure 3: GeoSym Instruct Dataset Overview. (a-b) Distributions of total tokens per instance and difficulty scores, demonstrating the dataset’s broad logical depth and text-rich reasoning chains. (c) A hierarchical nested ring chart illustrating the proportion of different geometric types (inner ring) and subtypes (outer ring), with core overall statistics embedded in the center. From this stratified generative pool…
Figure 5: (Top Row) identifies 3–5 SFT epochs and 100 GRPO steps as the optimal training sweet spot. Critically, initializing GRPO from SFT checkpoints substantially elevates the performance ceiling compared to zero-shot RL, demonstrating that foundational neuro-symbolic alignment is a prerequisite for maximizing RL efficacy. Exhaustive data logs for these ablations are deferred to Appendix E.4 (Tables 16 and 17). …
Figure 6: Visualization of rendering quality and topological diversity across GeoSym127K. The dataset covers extreme geometric variations including multi-hop translations, complex shaded regions, and precision-aligned vertices. Every diagram is generated from exact mathematical coordinates, ensuring zero visual hallucination during model training. To further contextualize GeoSym127K among existing geometry-oriented …
Figure 7: Comparison of representative geometry-related datasets in terms of diagram image, caption availability, …
Figure 8: GeoSym127K Instruct Dataset Example.
Figure 9: GeoSym127K Instruct Dataset Example.
Figure 10: GeoSym127K Instruct Dataset Example.
Figure 11: GeoSym127K Instruct Dataset Example.
Figure 12: Pass Rate Trend. Monotonic decline in accuracy as D_total increases. (Plot axis label: Micro-Level Verification Pass Rate (%).)
Figure 13: Hierarchical Subtype Distribution. The double-ring charts illustrate the dataset composition across the Entry, Hard, and Expert levels. The inner rings denote the macro-categories (Angle, Length, Area), while the outer rings break down the specific problem subtypes. (Panel title recovered from the plot: Entry Level Breakdown; the remaining panel data is truncated in the source.)
Figure 14: Failure Mode: Minor CoT Hallucination in the Generative Pipeline. Although the final Ground Truth (π/6) is mathematically sound, the MLLM’s generated CoT exhibits a severe logical breakdown. The model misinterprets the topological definition of the arc, falsely claims a geometric contradiction, and hallucinates a central angle of 180° before inexplicably outputting the correct answer. This highlights the …
Figure 15: Failure Mode 2: Proprietary Model Breakdown on GeoSym-Bench. This case highlights a "logical shortcut" hallucination. While Gemini-3-Pro correctly parses the text instructions, it fails to perform the spatial reasoning required to distinguish the intersection I2 from the vertex A. In the actual manifold, I2 is a secondary derivation derived from the overlapping transformed parallelograms. By falsely assum…
Figure 16: Failure Mode 3: Spatial Misalignment and Vertex Hallucination. In this nested geometry task, Gemini-3-Pro perfectly executes the algebraic scaling logic but severely misinterprets the visual topology. The model hallucinates that segments JK and EF connect the outer corners of the parallelograms, ignoring the explicit visual evidence showing they lie on the inner horizontal boundaries. This "blindness" to …
Figure 17: Distribution of GeoSym-Bench samples by type and subtype. The inner ring represents the main …
Original abstract

Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.
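
The abstract's link between GRPO and "deterministic exact-match signals" can be illustrated with the group-relative advantage that GRPO computes over several sampled answers to the same question. The sketch below is a generic illustration rather than the authors' training code: the function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rollouts for the same question.

    rewards: list of scalar rewards, one per sampled answer (here 0.0 or 1.0).
    eps: small constant (an assumption) that avoids division by zero when
         every rollout earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With one correct answer out of four, the correct rollout gets a positive advantage and the three wrong ones a smaller negative advantage. When all rollouts agree, every advantage is zero, so a question that is always or never solved contributes no learning signal in that step.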

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the GeoSym Engine, a neuro-symbolic framework leveraging a type-conditional grammar and analytic SymGT Solver to generate the GeoSym127K dataset (51K high-resolution images, 127K questions with symbolic ground truths, 55K answer-verified CoT pairs) along with the expert-curated GeoSym-Bench (511 samples). Experiments demonstrate that SFT on this data yields +22.21% absolute gain on MathVerse Vision-Only for Qwen3-VL-8B and +6.19% on WeMath, with further gains from GRPO-based RLVR initialized from SFT checkpoints; code and data are released.

Significance. If the symbolic ground truths are verifiably correct and transfer without substantial distribution shift, the work provides a scalable, reproducible pipeline for creating mathematically precise training data that targets visual hallucinations and long-horizon reasoning failures in LMMs. The public release of datasets and code is a clear strength supporting reproducibility and follow-on research. The reported concentration of gains on diagram-dependent tasks is consistent with the motivating hypothesis.

major comments (2)
  1. §3.2 (SymGT Solver description): No error rates, sample-wise human verification, or cross-validation against independent geometry libraries (e.g., SymPy or GeoGebra) are reported for the analytic solver outputs across the 127K questions. Because the central performance attribution rests on the claim of 'exact symbolic ground truths' and 'answer-verified CoT pairs,' even modest systematic solver failures would mean SFT reinforces incorrect reasoning rather than mitigating fragmentation.
  2. §5 (Experimental results): The reported benchmark improvements lack details on train/validation/test splits of GeoSym127K, statistical significance testing (e.g., multiple random seeds or confidence intervals), or controls for synthetic-style artifacts versus real diagram distribution shift. These omissions make it difficult to confirm that the +22.21% and +6.19% gains are robust and attributable to the symbolic supervision rather than dataset-specific overfitting.
minor comments (2)
  1. Abstract and §1: The phrase 'long-horizon logic fragmentation' is used without a precise definition or citation; a short formalization would improve clarity for readers outside the immediate subfield.
  2. Figure 1 and rendering pipeline description: The caption and text could more explicitly state the resolution and anti-aliasing settings used for the 51K images to allow exact reproduction.
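
The cross-validation requested in major comment 1 amounts to computing the same quantity by two independent routes and requiring exact agreement. The referee names SymPy and GeoGebra; the sketch below substitutes a standard-library check, comparing a shoelace-formula area with Heron's formula in squared form so that everything stays in exact rational arithmetic. The function names are hypothetical and the triangle-only scope is a simplification.

```python
from fractions import Fraction


def shoelace_area(points):
    """Exact polygon area from vertex coordinates (shoelace formula)."""
    total = Fraction(0)
    n = len(points)
    for i in range(n):
        (x1, y1), (x2, y2) = points[i], points[(i + 1) % n]
        total += Fraction(x1) * y2 - Fraction(x2) * y1
    return abs(total) / 2


def heron_16_area_squared(a, b, c):
    """16 * area^2 of triangle abc from squared side lengths (Heron's formula).

    Uses 16 K^2 = (A + B + C)^2 - 2 (A^2 + B^2 + C^2), where A, B, C are the
    squared side lengths, so no square roots are needed.
    """
    def dist2(p, q):
        return (Fraction(p[0]) - q[0]) ** 2 + (Fraction(p[1]) - q[1]) ** 2

    A, B, C = dist2(b, c), dist2(a, c), dist2(a, b)
    return (A + B + C) ** 2 - 2 * (A * A + B * B + C * C)


def cross_check(a, b, c):
    """True when the two independent area computations agree exactly."""
    area = shoelace_area([a, b, c])
    return 16 * area ** 2 == heron_16_area_squared(a, b, c)
```

A disagreement between the two routes would flag a bug in at least one of them. The same pattern extends to checking a solver's output against an external library, which is the form of evidence the referee asks for.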

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on verification of the SymGT Solver and robustness of the reported gains. We respond to each major comment below and will incorporate revisions to address the concerns.

Point-by-point responses
  1. Referee: §3.2 (SymGT Solver description): No error rates, sample-wise human verification, or cross-validation against independent geometry libraries (e.g., SymPy or GeoGebra) are reported for the analytic solver outputs across the 127K questions. Because the central performance attribution rests on the claim of 'exact symbolic ground truths' and 'answer-verified CoT pairs,' even modest systematic solver failures would mean SFT reinforces incorrect reasoning rather than mitigating fragmentation.

    Authors: The SymGT Solver relies on deterministic analytic procedures grounded in Euclidean geometry axioms and exact symbolic algebra, ensuring correctness by construction for all problems generated from the type-conditional grammar. We acknowledge that the original manuscript did not include quantitative error analysis or external cross-validation. In the revision we will add a dedicated validation subsection reporting results on a 1,000-question random sample: (i) expert manual verification of 200 cases, (ii) consistency checks against SymPy for all algebraic sub-expressions, and (iii) a statement of the observed zero-error rate on the sampled set. This directly mitigates the risk of reinforcing incorrect reasoning. revision: yes
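
    A zero-error result on a sample bounds the true error rate but does not prove it is zero. As a rough guide under the assumption of independent random sampling, the one-sided upper confidence bound after observing no failures in n checks is 1 - (1 - confidence)^(1/n), which is close to the familiar "rule of three" value 3/n at 95% confidence. The helper name below is hypothetical.

```python
def zero_failure_upper_bound(n, confidence=0.95):
    """Upper bound on the failure rate after n independent checks with 0 failures.

    Solves (1 - p)^n = 1 - confidence for p: the largest failure rate that
    would still produce n clean checks with probability 1 - confidence.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of checks")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)
```

    For the sample sizes mentioned in the response, 200 clean manual checks bound the error rate at roughly 1.5%, and 1,000 clean automated checks at roughly 0.3%. Applied to 127K questions, those bounds still allow for hundreds to a couple of thousand incorrect labels.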

  2. Referee: §5 (Experimental results): The reported benchmark improvements lack details on train/validation/test splits of GeoSym127K, statistical significance testing (e.g., multiple random seeds or confidence intervals), or controls for synthetic-style artifacts versus real diagram distribution shift. These omissions make it difficult to confirm that the +22.21% and +6.19% gains are robust and attributable to the symbolic supervision rather than dataset-specific overfitting.

    Authors: GeoSym127K was used in its entirety for SFT; no internal train/validation split was held out because evaluation targeted generalization to the fixed external benchmarks MathVerse and WeMath. We agree that statistical significance and distribution-shift controls strengthen the claims. In the revision we will (i) rerun SFT with three random seeds and report mean ± std, (ii) add a short analysis showing that gains remain concentrated on diagram-dependent subsets even after controlling for question length, and (iii) include a qualitative comparison of model behavior on synthetic versus real diagrams from the benchmarks. These additions will clarify that the improvements arise from the symbolic supervision rather than overfitting to synthetic artifacts. revision: yes
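
    The "mean ± std over three seeds" reporting promised in item (i) can be sketched as below. The scores in the usage note are placeholders chosen for easy arithmetic, not results from the paper, and the function names are hypothetical.

```python
from statistics import mean, stdev


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) of per-seed benchmark scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    return mean(scores), stdev(scores)


def format_summary(scores):
    """Render per-seed scores as 'mean ± std' with two decimals."""
    mu, sigma = summarize_seeds(scores)
    return f"{mu:.2f} ± {sigma:.2f}"
```

    With placeholder scores `[60.0, 61.0, 62.0]`, `format_summary` returns `"61.00 ± 1.00"`. With only three seeds the spread estimate is itself noisy, so a gain should be read as robust only when it clearly exceeds the reported deviation.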

Circularity Check

0 steps flagged

No significant circularity; pipeline uses independent external benchmarks

Full rationale

The derivation chain consists of a type-conditional grammar and SymGT Solver producing symbolic ground truths, followed by dataset construction and SFT/RLVR training, with gains reported on external benchmarks (MathVerse Vision-Only, WeMath). These benchmarks are independent of the synthetic generation process and not fitted or redefined within the paper. No equations, self-citations, or ansatzes reduce the central claims to inputs by construction. The approach is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the correctness of the type-conditional grammar and SymGT Solver for producing exact symbolic ground truths; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: The type-conditional grammar and analytic SymGT Solver produce exact symbolic ground truths without systematic errors.
    Invoked to justify the quality of the 127K questions and 55K CoT pairs.

pith-pipeline@v0.9.0 · 5880 in / 1397 out tokens · 25431 ms · 2026-05-20T22:39:34.353217+00:00 · methodology
