Recognition: no theorem link
Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models
Pith reviewed 2026-05-13 20:32 UTC · model grok-4.3
The pith
Procedural generation of 200,000 synthetic diagrams lets a fine-tuned vision-language model segment referred geometric elements at 49 percent IoU, up from under 1 percent zero-shot.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A fully automated procedural data engine generates more than 200,000 geometry diagrams complete with pixel-perfect segmentation masks and linguistically diverse referring expressions. Domain-specific fine-tuning of a vision-language model such as Florence-2 on this corpus produces 49 percent IoU and 85 percent Buffered IoU on geometric diagrams, compared with less than 1 percent IoU in the zero-shot regime. The authors also define Buffered IoU to better capture localization quality on thin geometric structures and present the pipeline as the foundation for Artificial General Teachers that deliver visually grounded explanations.
What carries the argument
Fully automated procedural data engine that synthesizes geometry diagrams, pixel-perfect masks, and referring expressions to train referring image segmentation models on abstract, textureless schematics.
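The paper does not publish the engine's code. As a toy illustration of the recipe it describes (sample geometric primitives, rasterize a pixel mask, emit a templated referring expression), a minimal sketch might look like the following; the function names, templates, and canvas size are illustrative, not the authors':

```python
import random

def rasterize_segment(p, q):
    """Pixel mask (as a set of (row, col) pairs) for the segment p -> q.

    Coarse parametric sampling; a real engine would use Bresenham's
    algorithm or an anti-aliased renderer plus thresholding.
    """
    (x0, y0), (x1, y1) = p, q
    pixels = set()
    for t in range(101):
        x = x0 + (x1 - x0) * t / 100
        y = y0 + (y1 - y0) * t / 100
        pixels.add((round(y), round(x)))
    return pixels

# Illustrative template families; the paper's actual expression
# templates are not specified in the text.
TEMPLATES = [
    "the segment from {a} to {b}",
    "line {a}{b}",
    "the side joining point {a} and point {b}",
]

def make_example(seed=0, size=64):
    """One (diagram, mask, referring expression) training triple."""
    rng = random.Random(seed)
    a, b = "A", "B"
    p = (rng.randrange(size), rng.randrange(size))
    q = (rng.randrange(size), rng.randrange(size))
    return {
        "points": {a: p, b: q},
        "mask": rasterize_segment(p, q),
        "expression": rng.choice(TEMPLATES).format(a=a, b=b),
    }
```

Because labels fall out of the construction itself, the mask is exact by definition, which is what makes the "zero manual annotation" claim possible.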
If this is right
- Zero-shot vision-language models produce near-zero accuracy on geometric diagrams because of the shift from textured photographs to line drawings.
- Fine-tuning on the procedurally generated corpus raises both standard IoU and the geometry-aware Buffered IoU by large margins.
- Buffered IoU penalizes thin-line errors less harshly than standard IoU and therefore ranks models more consistently with human judgment of diagram quality.
- The same engine can supply unlimited labeled examples for any geometric construction without further annotation cost.
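The exact buffering rule is not given in the reviewed text. One plausible reading, in which both binary masks are dilated by a small pixel radius before computing IoU, can be sketched as follows; the radius value and the square (Chebyshev) structuring element are assumptions:

```python
def buffered_iou(pred, gt, radius=2):
    """IoU after dilating both masks (sets of (row, col) pixels) by
    `radius` pixels under the Chebyshev distance.

    This is a guess at the paper's Buffered IoU; the true buffering
    rule and radius are unspecified in the text.
    """
    def dilate(pixels, r):
        return {(y + dy, x + dx)
                for (y, x) in pixels
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)}
    p, g = dilate(pred, radius), dilate(gt, radius)
    union = p | g
    return len(p & g) / len(union) if union else 1.0

# A 1-pixel-thick line predicted one row off its ground truth scores
# 0.0 under standard IoU but 2/3 under this buffered variant:
gt = {(10, x) for x in range(20)}
pred = {(11, x) for x in range(20)}
assert len(pred & gt) / len(pred | gt) == 0.0
assert abs(buffered_iou(pred, gt, radius=2) - 2 / 3) < 1e-9
```

The example shows why thin structures make plain IoU brutal: a one-pixel offset on a one-pixel line zeroes the intersection entirely, even though a human would call the prediction essentially correct.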
Where Pith is reading between the lines
- The same procedural approach could generate training data for other abstract visual domains such as physics free-body diagrams or electrical schematics.
- Pairing the segmentation model with a language model would allow tutoring systems to generate both verbal steps and visual pointers inside the same diagram.
- Measuring performance on diagrams drawn by students rather than by the procedural engine would reveal how robust the learned grounding remains under natural drawing variation.
Load-bearing premise
The synthetic diagrams and referring expressions generated by the procedural engine are sufficiently representative of real geometry education materials and natural language queries to enable effective transfer.
What would settle it
Running the fine-tuned model on a collection of scanned real textbook geometry diagrams and measuring whether mean IoU stays above 30 percent would directly test whether performance transfers beyond the synthetic distribution.
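That proposed test reduces to a mean-IoU threshold over a set of real diagrams. A sketch, assuming masks are available as pixel sets and noting that the 30 percent bar comes from this review rather than from the paper:

```python
def mean_iou(pairs):
    """Mean IoU over (pred, gt) mask pairs, each a set of (row, col) pixels."""
    scores = [len(p & g) / len(p | g) for p, g in pairs if p | g]
    return sum(scores) / len(scores) if scores else 0.0

def passes_transfer_bar(pairs, threshold=0.30):
    # threshold is a heuristic suggested by the review, not a value
    # reported or endorsed by the paper.
    return mean_iou(pairs) >= threshold
```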
Original abstract
We study visual explanation in geometry education as a Referring Image Segmentation (RIS) problem: given a diagram and a natural language description, the task is to produce a pixel-level mask for the referred geometric element. However, existing RIS models trained on natural image benchmarks such as RefCOCO fail catastrophically on geometric diagrams due to the fundamental domain shift between photographic scenes and abstract, textureless schematics. To address the absence of suitable training data, we present a fully automated procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-perfect segmentation masks and linguistically diverse referring expressions, requiring zero manual annotation. We further propose domain-specific fine-tuning of vision-language models (VLMs), demonstrating that a fine-tuned Florence-2 achieves 49% IoU and 85% Buffered IoU (BIoU), compared to <1% IoU in zero-shot settings. We introduce Buffered IoU, a geometry-aware evaluation metric that accounts for thin-structure localization, and show that it better reflects true segmentation quality than standard IoU. Our results establish a foundation for building Artificial General Teachers (AGTs) capable of providing visually grounded, step-by-step explanations of geometry problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper frames visual grounding in geometry education as a referring image segmentation task and introduces a fully automated procedural engine that generates over 200,000 synthetic diagrams with pixel-perfect masks and diverse referring expressions. It fine-tunes Florence-2 on this data to report 49% IoU and 85% Buffered IoU on a held-out synthetic test split (versus <1% zero-shot), introduces Buffered IoU as a geometry-aware metric for thin structures, and positions the work as a foundation for Artificial General Teachers.
Significance. The procedural data engine and Buffered IoU metric represent useful engineering contributions that could scale training data for educational VLMs if the synthetic distribution proves representative. However, the reported performance gains are obtained exclusively on in-distribution synthetic test data, so the significance for real-world geometry education remains conditional on untested transfer.
major comments (3)
- [Experiments / Results] All headline numbers (49% IoU, 85% BIoU) are measured on a test split drawn from the same procedural generator used for the 200k training diagrams. No cross-domain evaluation on authentic textbook or classroom diagrams is reported, leaving the synthetic-to-real transfer (the central premise for AGT utility) unverified.
- [Data generation] The manuscript provides no quantitative description of the sampling distribution over diagram types, line thicknesses, label placements, or referring-expression templates used to produce the 200k diagrams. This absence prevents assessment of dataset diversity and potential mode collapse.
- [Evaluation protocol] No ablation studies on fine-tuning hyperparameters, no comparisons against other VLMs, and no error bars or multiple-run statistics accompany the reported IoU/BIoU figures, making it impossible to judge whether the gains are robust or sensitive to random seeds.
minor comments (2)
- [Abstract] The phrase 'zero manual annotation' is technically correct but could explicitly note that human effort was still required to design the procedural rules and templates.
- [Metric definition] The exact buffering radius and implementation details for Buffered IoU are not provided in a reproducible form (e.g., pseudocode or parameter values).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the manuscript's claims regarding data diversity, evaluation robustness, and real-world applicability.
Point-by-point responses
-
Referee: [Experiments / Results] All headline numbers (49% IoU, 85% BIoU) are measured on a test split drawn from the same procedural generator used for the 200k training diagrams. No cross-domain evaluation on authentic textbook or classroom diagrams is reported, leaving the synthetic-to-real transfer (the central premise for AGT utility) unverified.
Authors: We acknowledge that all reported metrics are computed on in-distribution synthetic data, which limits direct claims about real-world transfer. This choice was intentional to first validate the procedural engine and the Buffered IoU metric in a controlled setting. We agree that synthetic-to-real transfer is essential for the AGT premise. In the revised version we will add a preliminary cross-domain experiment using a small collection of authentic textbook diagrams (approximately 200 images) with manually annotated masks, report the corresponding IoU/BIoU numbers, and include a dedicated discussion of observed domain gaps and mitigation strategies. revision: yes
-
Referee: [Data generation] The manuscript provides no quantitative description of the sampling distribution over diagram types, line thicknesses, label placements, or referring-expression templates used to produce the 200k diagrams. This absence prevents assessment of dataset diversity and potential mode collapse.
Authors: We agree that a quantitative characterization of the sampling process is necessary. The engine samples diagram complexity uniformly (1–6 primitives), line thickness from {1,2,3,4,5} pixels, label placement offsets from a discrete set of 8 directions, and referring expressions from 12 template families with controlled synonym substitution. In the revision we will insert a new table that reports the exact parameter ranges, sampling probabilities, and resulting empirical frequencies for each category, together with a brief analysis of coverage and potential mode-collapse risks. revision: yes
-
Referee: [Evaluation protocol] No ablation studies on fine-tuning hyperparameters, no comparisons against other VLMs, and no error bars or multiple-run statistics accompany the reported IoU/BIoU figures, making it impossible to judge whether the gains are robust or sensitive to random seeds.
Authors: We accept that the current single-run results are insufficient to demonstrate robustness. The revised manuscript will include (i) an ablation table varying learning rate, batch size, and number of epochs, (ii) a comparison against two additional VLMs (LLaVA-1.5 and a fine-tuned Segment Anything Model variant) under identical data conditions, and (iii) mean and standard deviation of IoU/BIoU computed over three independent random seeds. These additions will be placed in a new “Ablation and Robustness” subsection. revision: yes
Circularity Check
No circularity: empirical performance measured on independently generated synthetic test data
full rationale
The paper describes a procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-perfect masks and referring expressions, followed by fine-tuning of Florence-2 and reporting of IoU/BIoU metrics on a held-out test split from the same generator. No equations, derivations, or fitted parameters are presented that reduce the reported performance numbers to quantities defined inside the paper by construction. The introduction of Buffered IoU is an explicit new metric definition rather than a self-referential claim. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central result is therefore an ordinary empirical measurement on self-generated but independently sampled data, making the derivation self-contained with no reduction to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: synthetic diagrams and referring expressions sufficiently represent real geometry education materials.
invented entities (1)
- Buffered IoU metric (no independent evidence)
Reference graph
Works this paper leans on
- [2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- [3] Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric P. Xing, and Liang Lin. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 513-523. Association for Computational Linguistics, August 2021.
- [4] Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. arXiv preprint arXiv:2212.02746, 2022.
- [5] Alexey Dosovitskiy et al. CARLA: An open urban driving simulator. In Conference on Robot Learning (CoRL), 2017.
- [6] Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-LLaVA: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023.
- [7] Susan Goldin-Meadow, Howard Nusbaum, Spencer Kelly, and Susan Cook. Explaining math: Gesturing lightens the load. Psychological Science, 12:516-522, 2001. doi:10.1111/1467-9280.00395.
- [8] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
- [9] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In European Conference on Computer Vision (ECCV), 2016.
- [10] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of EMNLP 2014, pp. 787-798, Doha, Qatar, October 2014. Association for Computational Linguistics.
- [11] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9579-9589, June 2024.
- [12] Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan L. Yuille. Recurrent multimodal interaction for referring image segmentation. In IEEE International Conference on Computer Vision (ICCV), 2017.
- [13] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
- [14] Pan Lu et al. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
- [15] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- [16] Richard E. Mayer. Multimedia Learning. Cambridge University Press, 2002.
- [17] Richard E. Mayer. The past, present, and future of the cognitive theory of multimedia learning. Educational Psychology Review, 36, 2024. URL https://api.semanticscholar.org/CorpusID:267130958.
- [18] Hai Nguyen-Truong, E-Ro Nguyen, Tuan-Anh Vu, Minh-Triet Tran, Binh-Son Hua, and Sai-Kit Yeung. Vision-aware text features in referring image segmentation: From object understanding to context understanding. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), pp. 4988-4998, February 2025.
- [19] Yuan Shi et al. Math-LLaVA: Boosting visual mathematical reasoning for large language models. arXiv preprint arXiv:2401.XXXX, 2024.
- [20] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. CRIS: CLIP-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11686-11695, June 2022.
- [21] Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4818-4829, June 2024.
- [22] Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip H. S. Torr. LAVT: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18155-18165, 2022.
- [23] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions. In European Conference on Computer Vision (ECCV), pp. 69-85. Springer, 2016.
- [24] Yuhui Zhang et al. MathVerse: A benchmark for multi-modal math reasoning. arXiv preprint arXiv:2403.XXXX, 2024a.
- [25] Yuhui Zhang et al. MAVIS: Math-aware visual instruction tuning. arXiv preprint arXiv:2404.XXXX, 2024b.
- [26] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. PubLayNet: Largest dataset ever for document layout analysis. In Proceedings of ICDAR, 2019.