pith. machine review for the scientific record.

arxiv: 2604.06079 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords graphics program synthesis · TikZ code generation · reinforcement learning · multimodal large language models · scientific visualization · self-consistency · benchmark evaluation · code synthesis
0 comments

The pith

Dual self-consistency reinforcement learning lets an 8B model generate accurate TikZ code for scientific graphics and outperform much larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the difficulty of turning static scientific images into precise editable TikZ code by identifying two main barriers: poor data quality in existing image-code pairs and weak evaluation standards. It closes these gaps with a large dataset of executable pairs built through an execution-focused engine and a benchmark that tests both visual match and logical structure. The central technique applies a dual self-consistency reinforcement learning loop that uses round-trip verification to discard inconsistent or degenerate code. This training produces a compact model that exceeds the performance of proprietary systems and far larger open models on the new benchmark. If the approach holds, scientific diagrams could be routinely converted into modifiable code across many fields without needing giant models.

Core claim

The paper shows that an Execution-Centric Data Engine can produce SciTikZ-230K, a corpus of strictly executable and visually aligned image-TikZ pairs spanning eleven disciplines, and that Dual Self-Consistency Reinforcement Learning with Round-Trip Verification can optimize a model to penalize degenerate outputs and raise overall consistency. The resulting SciTikZer-8B reaches state-of-the-art performance, surpassing Gemini-2.5-Pro and Qwen3-VL-235B-A22B-Instruct on both visual fidelity and structural logic.
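A minimal sketch of the executability gate this claim presupposes, assuming a stock pdflatex toolchain. The `standalone` wrapper mirrors the preamble visible in the paper's appendix prompts, but `compiles` and `filter_executable` are hypothetical helpers, and the actual engine additionally attempts MLLM-guided remediation of failing code rather than discarding it.

```python
import os
import subprocess
import tempfile

# Standalone wrapper matching the preamble used in the paper's appendix prompts.
TEMPLATE = r"""\documentclass[border=5pt]{standalone}
\usepackage{tikz}
%s
"""

def compiles(tikz_code: str, timeout_s: int = 60) -> bool:
    """Return True iff the TikZ snippet compiles to a PDF under pdflatex."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "fig.tex"), "w") as f:
            f.write(TEMPLATE % tikz_code)
        try:
            result = subprocess.run(
                ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "fig.tex"],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0 and os.path.exists(os.path.join(tmp, "fig.pdf"))

def filter_executable(pairs):
    """Keep only (image, tikz_code) pairs whose code actually executes."""
    return [(img, code) for img, code in pairs if compiles(code)]
```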

What carries the argument

A Dual Self-Consistency Reinforcement Learning optimization paradigm that applies Round-Trip Verification to penalize degenerate code and raise self-consistency in the generated TikZ programs.
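The abstract does not spell out the reward, so the following is one plausible instantiation of a dual objective rather than the authors' formula: `render`, `embed_image`, and `regenerate` are assumed callables, and the equal weighting and floor penalty are invented for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def dsc_reward(code, target_emb, render, embed_image, regenerate):
    """Dual self-consistency reward, schematically.

    Term 1 (visual alignment): the rendering of `code` should match the
    target image. Term 2 (round-trip consistency): code regenerated from
    that rendering should render to the same picture, which penalizes
    degenerate programs that compile but do not encode the image.
    """
    rendered = render(code)                   # assumed to return None on compile failure
    if rendered is None:
        return -1.0                           # hard floor for non-executable code
    visual = cosine(embed_image(rendered), target_emb)

    round_trip_code = regenerate(rendered)    # back-translate image -> code
    re_rendered = render(round_trip_code)
    if re_rendered is None:
        consistency = -1.0                    # round trip collapsed: degenerate output
    else:
        consistency = cosine(embed_image(re_rendered), embed_image(rendered))

    return 0.5 * visual + 0.5 * consistency   # equal weights are an assumption
```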

If this is right

  • High-quality executable image-TikZ datasets can be scaled across scientific domains to support further model training.
  • Multifaceted benchmarks that separately score visual alignment and logical structure can become standard for graphics synthesis.
  • Reinforcement learning with consistency verification can improve code output from multimodal models without requiring extreme parameter counts.
  • Generated TikZ code can be used directly for precise rendering and editing of complex hierarchical scientific schematics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same round-trip verification pattern could be applied to synthesize code for other diagram languages or plotting libraries.
  • Targeted consistency reinforcement may prove more efficient than raw scale for narrow-domain code generation tasks.
  • Automated conversion of figures from published papers into editable code could become practical for reproducibility checks.
  • Testing the trained model on figures extracted from recent journal articles would show whether the benchmark gains transfer to real research outputs.

Load-bearing premise

The Execution-Centric Data Engine produces image-TikZ pairs that are strictly executable and visually aligned, and the round-trip verification step in reinforcement learning improves consistency without introducing new biases or degenerate solutions.
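That premise is testable in isolation. Below is a toy stand-in for the visual-alignment half, assuming same-size raster inputs; the paper scores alignment with perceptual metrics (SigLIP, LPIPS, DreamSim, per Figure 5), so this normalized cross-correlation gate and its 0.9 threshold are illustrative only.

```python
import numpy as np

def visually_aligned(rendered: np.ndarray, target: np.ndarray,
                     threshold: float = 0.9) -> bool:
    """Toy alignment gate: normalized cross-correlation of pixel intensities.

    A crude stand-in for the perceptual scores the paper reports; the
    threshold is an arbitrary assumption, not the paper's criterion.
    """
    if rendered.shape != target.shape:
        return False  # a real pipeline would rasterize both to a common size
    a = rendered.astype(np.float64).ravel()
    b = target.astype(np.float64).ravel()
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float(np.mean(a * b)) >= threshold
```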

What would settle it

Direct evaluation of SciTikZer-8B on SciTikZ-Bench showing lower scores than Gemini-2.5-Pro or Qwen3-VL-235B on combined visual-fidelity and structural-logic metrics would disprove the claimed superiority.
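Stated as a predicate, the falsification test is mechanical. A minimal sketch, assuming both benchmark axes are normalized to [0, 1] and combined with equal weight; the aggregation is our assumption, not the paper's:

```python
def combined(visual: float, structural: float, w: float = 0.5) -> float:
    """Fold visual fidelity and structural logic into one score."""
    return w * visual + (1.0 - w) * structural

def superiority_holds(scores: dict) -> bool:
    """`scores` maps model name -> (visual, structural) on SciTikZ-Bench.

    The paper's claim is falsified if any baseline ties or beats
    SciTikZer-8B on the combined metric.
    """
    champion = combined(*scores["SciTikZer-8B"])
    return all(combined(v, s) < champion
               for name, (v, s) in scores.items() if name != "SciTikZer-8B")
```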

Figures

Figures reproduced from arXiv: 2604.06079 by Honglin Lin, Juekai Lin, Lijun Wu, Sijing Li, Tianwei Lin, Wenqiao Zhang, Xiaoyang Wang, Yun Zhu, Zheng Liu.

Figure 1
Figure 1: Challenges in graphics program synthesis. Current open-source MLLMs struggle with the strict constraints of TikZ, exhibiting critical issues such as syntax hallucinations, dependency omissions, and geometric misalignments. Web-scraped data suffers from intrinsic noise, and existing synthetic corpora are often low-quality, with cluttered layouts and limited structural coherence. This results in models that frequ… view at source ↗
Figure 2
Figure 2: Overview of the Execution-Centric Data Engine. Beyond passive filtering, it uses MLLM-guided remediation to correct compilation faults and distill misaligned pairs, preserving diversity under strict visual-program alignment. view at source ↗
Figure 3
Figure 3: Overview of SciTikZ-230K Dataset. The chart illustrates the hierarchical distribution of our … view at source ↗
Figure 4
Figure 4: Overview of the SciTikZ Framework. The pipeline initializes with SFT and Curriculum Data Selection. Core DSC-RL integrates Visual Alignment for pixel precision and Self-Consistency Refinement via symbolic back-translation. view at source ↗
Figure 5
Figure 5: Impact of Data Curation. Curated data enhances both executability and visual fidelity in 4B and 8B models compared with raw data and DaTikZ-v3. [Bar-chart panels: Succ., SigLIP, LPIPS, and DreamSim scores for 4B and 8B models] view at source ↗
Figure 6
Figure 6: Ablation of Dual Self-Consistency (DSC). Comparison between standard RL and our full DSC method. view at source ↗
Figure 7
Figure 7: Case Analysis. A qualitative comparison of rendered TikZ code generated by different models against the Ground Truth. We evaluate SciTikZer-8B against SOTA baselines, including Gemini-2.5-Pro, DeTikZify-v2.5-8B, and Qwen3-VL-Instruct-32B. view at source ↗
Figure 8
Figure 8: Example of diagnostic remediation for compilation errors. view at source ↗
Figure 9
Figure 9: Representative samples from SciTikZ-Bench across three difficulty tiers. The benchmark … view at source ↗
Figure 10
Figure 10: Qualitative comparison of diagnostic remediation across 10 diverse cases. We evaluate … view at source ↗
Figure 11
Figure 11: Detailed analysis of program synthesis for a circuit schematic. We compare SciTikZer … view at source ↗
read the original abstract

Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to address data quality and evaluation gaps in graphics program synthesis by introducing an Execution-Centric Data Engine that produces the SciTikZ-230K dataset of executable image-TikZ pairs across 11 disciplines, the SciTikZ-Bench for assessing structural and visual fidelity, and a Dual Self-Consistency Reinforcement Learning paradigm that uses Round-Trip Verification to train SciTikZer-8B, which reportedly achieves SOTA by outperforming Gemini-2.5-Pro and Qwen3-VL-235B-A22B-Instruct.

Significance. If the closed-loop verification proves robust, the work would be significant by supplying a large-scale executable dataset and multifaceted benchmark that the field currently lacks, plus a novel RL objective for boosting self-consistency in code generation. These resources could enable more reliable training and assessment of MLLMs for precise scientific visualization reverse-engineering.

major comments (2)
  1. [§3] §3 (Execution-Centric Data Engine and SciTikZ-Bench): both the 230K training set and the benchmark are generated by the same engine, so any systematic rendering mismatch (coordinate drift, missing labels, style artifacts) would be invisible to Round-Trip Verification yet would inflate all reported metrics; the manuscript must show that verification detects such misalignments.
  2. [§4] §4 (Dual Self-Consistency RL and Round-Trip Verification): the claim that the RL objective reliably penalizes degenerate yet compilable solutions and boosts true visual fidelity lacks supporting ablations or analysis demonstrating it avoids semantic drift; without external real-world figures or human preference studies, the SOTA claim on SciTikZ-Bench rests on an untested internal loop.
minor comments (1)
  1. [Abstract] Abstract: the SOTA claim is stated without quantitative metrics, error bars, or comparison details; including them would let readers assess the headline result immediately.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We address each major comment below and outline the revisions we plan to make to improve the clarity and robustness of our work.

read point-by-point responses
  1. Referee: [§3] §3 (Execution-Centric Data Engine and SciTikZ-Bench): both the 230K training set and the benchmark are generated by the same engine, so any systematic rendering mismatch (coordinate drift, missing labels, style artifacts) would be invisible to Round-Trip Verification yet would inflate all reported metrics; the manuscript must show that verification detects such misalignments.

    Authors: We agree that this is an important consideration to ensure the reliability of our closed-loop system. The Execution-Centric Data Engine incorporates multiple checks for executability and visual alignment during generation. To directly address the concern, we will include in the revised manuscript an experiment that injects synthetic misalignments (such as coordinate drifts and missing elements) into TikZ code and demonstrates that the Round-Trip Verification successfully identifies and filters these cases, preventing them from affecting the metrics; a minimal sketch of such an injection follows this rebuttal. revision: yes

  2. Referee: [§4] §4 (Dual Self-Consistency RL and Round-Trip Verification): the claim that the RL objective reliably penalizes degenerate yet compilable solutions and boosts true visual fidelity lacks supporting ablations or analysis demonstrating it avoids semantic drift; without external real-world figures or human preference studies, the SOTA claim on SciTikZ-Bench rests on an untested internal loop.

    Authors: We thank the referee for highlighting this. Our manuscript does include ablation studies in §4 comparing the Dual Self-Consistency RL against baselines and variants, showing consistent gains in both structural accuracy and visual similarity metrics on SciTikZ-Bench. These ablations help demonstrate the objective's effectiveness in penalizing degenerate solutions. However, we acknowledge that further analysis specifically targeting semantic drift would be beneficial. We will expand the ablation section in the revision to include additional metrics and case studies illustrating how the RL avoids semantic inconsistencies. Regarding external real-world figures and human preference studies, while SciTikZ-Bench draws from diverse scientific domains to simulate real-world scenarios, comprehensive human evaluations are resource-intensive and were not conducted in this work. We believe the multifaceted benchmark provides a strong proxy for visual fidelity. revision: partial

standing simulated objections not resolved
  • Comprehensive human preference studies on external real-world figures, which would require significant additional experimental resources beyond the scope of this revision.
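The injection experiment promised in rebuttal point 1 could be prototyped as below. The coordinate regex, drift magnitudes, and the `verifier` hook are illustrative assumptions, not the authors' protocol; the point is that drifted code still compiles yet should score lower under sound Round-Trip Verification.

```python
import random
import re

# Matches explicit numeric coordinates like (1.5,-2) in TikZ source.
COORD = re.compile(r"\((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\)")

def inject_coordinate_drift(tikz: str, dx: float = 0.5, dy: float = 0.5,
                            p: float = 0.3) -> str:
    """Shift a fraction p of explicit (x, y) coordinates in TikZ source.

    The perturbed code still compiles but no longer matches the original
    image, so a sound verifier should score the perturbed pair lower.
    """
    def drift(m: re.Match) -> str:
        if random.random() > p:
            return m.group(0)
        x, y = float(m.group(1)), float(m.group(2))
        return f"({x + dx:.2f},{y + dy:.2f})"
    return COORD.sub(drift, tikz)

# Expected outcome, with verifier(image, code) a hypothetical scoring hook:
# assert verifier(image, tikz) > verifier(image, inject_coordinate_drift(tikz))
```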

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core contributions—an Execution-Centric Data Engine for generating SciTikZ-230K, the SciTikZ-Bench, and the Dual Self-Consistency RL paradigm with Round-Trip Verification—are presented as empirical engineering steps leading to a trained model whose performance is measured on the benchmark. No equations, derivations, or claims reduce by construction to their own inputs (e.g., no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations of uniqueness theorems). The SOTA claim is an external empirical outcome relative to other models, not a tautological restatement of the training procedure. The internal generation of data and benchmark raises separate validity questions but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; full text required for ledger construction.

pith-pipeline@v0.9.0 · 5567 in / 1143 out tokens · 50746 ms · 2026-05-10T19:40:36.798845+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV · 2026-04 · unverdicted · novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

63 extracted references · 32 canonical work pages · cited by 1 Pith paper · 16 internal anchors

  1. [1]

    Claude Sonnet 4.5 system card. https://www.anthropic.com/system-cards, 2025

    Anthropic. Claude Sonnet 4.5 system card. https://www.anthropic.com/system-cards, 2025

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  5. [5]

    AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ

    Jonas Belouadi, Anne Lauscher, and Steffen Eger. AutomaTikZ: Text-guided synthesis of scientific vector graphics with TikZ. arXiv preprint arXiv:2310.00367, 2023

  6. [6]

    DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

    Jonas Belouadi, Simone Ponzetto, and Steffen Eger. DeTikZify: Synthesizing graphics programs for scientific figures and sketches with TikZ. Advances in Neural Information Processing Systems, 37:85074–85108, 2024

  7. [7]

    Tikzero: Zero-shot text-guided graphics program synthesis

    Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, and Simone Ponzetto. Tikzero: Zero-shot text-guided graphics program synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17793–17806, 2025

  8. [8]

    Learning to synthesize graphics programs for geometric artworks

    Qi Bing, Chaoyi Zhang, and Weidong Cai. Learning to synthesize graphics programs for geometric artworks. InInternational Conference on Pattern Recognition, pages 259–274. Springer, 2025

  9. [9]

    Nougat: Neural Optical Understanding for Academic Documents

    Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents, 2023. arXiv preprint arXiv:2308.13418

  10. [10]

    Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback.arXiv preprint arXiv:2507.20766, 2025

    Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, and Yu Qiao. Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback.arXiv preprint arXiv:2507.20766, 2025

  11. [11]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  12. [12]

    Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  13. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  14. [14]

    Image-to-markup generation with coarse-to-fine attention

    Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M Rush. Image-to-markup generation with coarse-to-fine attention. InInternational Conference on Machine Learning, pages 980–989. PMLR, 2017

  15. [15]

    CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code

    Aryaz Eghbali and Michael Pradel. CrystalBLEU: precisely and efficiently measuring the similarity of code. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–12, 2022

  16. [16]

    GRIT: Teaching MLLMs to Think with Images

    Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images.arXiv preprint arXiv:2505.15879, 2025

  17. [17]

    Rlef: Grounding code llms in execution feedback with reinforcement learning

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning.arXiv preprint arXiv:2410.02089, 2024

  18. [18]

    Chartllama: A multimodal llm for chart understanding and generation, 2023

    Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. Chartllama: A multimodal llm for chart understanding and generation, 2023

  19. [19]

    Dual learning for machine translation.Advances in neural information processing systems, 29, 2016

    Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation.Advances in neural information processing systems, 29, 2016

  20. [20]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

  21. [21]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  22. [22]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  23. [23]

    Self-training large language models for improved visual program synthesis with visual reinforcement

    Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, and Manmohan Chandraker. Self-training large language models for improved visual program synthesis with visual reinforcement. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14344–14353, 2024

  24. [24]

    Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

  25. [25]

    Tikzilla: Scaling text-to-tikz with high-quality data and reinforcement learning

    TikZilla: Scaling text-to-TikZ with high-quality data and reinforcement learning

  26. [26]

    Metal: A multi-agent framework for chart generation with test-time scaling, 2025

    Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, and Nanyun Peng. Metal: A multi-agent framework for chart generation with test-time scaling, 2025. URL https://arxiv.org/abs/2502.17651

  27. [27]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  28. [28]

    Differentiable vector graphics rasterization for editing and learning.ACM Transactions on Graphics (TOG), 39(6):1–15, 2020

    Tzu-Mao Li, Michal Lukáč, Michaël Gharbi, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG), 39(6):1–15, 2020

  29. [29]

    Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods, 2026

    Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, and Lijun Wu. Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods, 2026. URL https://arxiv.org/abs/2601.21821

  30. [30]

    Deplot: One-shot visual language reasoning by plot-to-table translation

    Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. Deplot: One-shot visual language reasoning by plot-to-table translation. InFindings of the Association for Computational Linguistics: ACL 2023, pages 10381–10399, 2023

  31. [31]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Haoyu Liu, Daya Guo, Junzhao Zheng, J.L. Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.13602, 2024

  32. [32]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR). OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  33. [33]

    Chart2code53: A large-scale diverse and complex dataset for enhancing chart-to-code generation

    Tianhao Niu, Yiming Cui, Baoxin Wang, Xiao Xu, Xin Yao, Qingfu Zhu, Dayong Wu, Shijin Wang, and Wanxiang Che. Chart2code53: A large-scale diverse and complex dataset for enhancing chart-to-code generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15839–15855, 2025

  34. [34]

    Image2struct: A benchmark for evaluating vision-language models in extracting structured information from images, 2024

    Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, and Percy Liang. Image2struct: A benchmark for evaluating vision-language models in extracting structured information from images, 2024

  35. [35]

    Rendering-Aware Reinforcement Learning for Vector Graphics Generation

    Juan A Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, et al. Rendering-aware reinforcement learning for vector graphics generation, 2025. URL https://arxiv.org/abs/2505.20793

  36. [36]

    Sketch2diagram: Generating vector diagrams from hand-drawn sketches

    Itsumi Saito, Haruto Yoshida, and Keisuke Sakaguchi. Sketch2diagram: Generating vector diagrams from hand-drawn sketches. In13th International Conference on Learning Representations, ICLR 2025, pages 52825– 52847. International Conference on Learning Representations, ICLR, 2025

  37. [37]

    Vispath: Automated visualization code synthesis via multi-path reasoning and feedback-driven optimization.arXiv e-prints, pages arXiv–2502, 2025

    Wonduk Seo, Seungyong Lee, Daye Kang, Zonghao Yuan, and Seunghyun Lee. Vispath: Automated visualization code synthesis via multi-path reasoning and feedback-driven optimization.arXiv e-prints, pages arXiv–2502, 2025

  38. [38]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  39. [39]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  40. [40]

    Eed: Extended edit distance measure for machine translation

    Peter Stanchev, Weiyue Wang, and Hermann Ney. Eed: Extended edit distance measure for machine translation. InProceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 514–520, 2019

  41. [41]

    Openthinkimg: Learning to think with images via visual tool reinforcement learning.arXiv preprint arXiv:2505.08617, 2025

    Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning, 2025. URL https://arxiv.org/abs/2505.08617

  42. [42]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023

  43. [43]

    Graph drawing in TikZ

    Till Tantau. Graph drawing in TikZ. In International Symposium on Graph Drawing, pages 517–528. Springer, 2012

  44. [44]

    Mathcoder-vl: Bridging vision and code for enhanced multimodal mathematical reasoning.arXiv preprint arXiv:2505.10557, 2025

    Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, et al. Mathcoder-vl: Bridging vision and code for enhanced multimodal mathematical reasoning.arXiv preprint arXiv:2505.10557, 2025

  45. [45]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  46. [46]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  47. [47]

    Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots

    Chengyue Wu, Zhixuan Liang, Yixiao Ge, Qiushan Guo, Zeyu Lu, Jiahao Wang, Ying Shan, and Ping Luo. Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 3006–3028, 2025

  48. [48]

    Davinci: Reinforcing visual-structural syntax in mllms for generalized scientific diagram parsing

    ZENG Xingchen, Zhewei Su, Hengming Zhang, Juyong Jiang, Jiazhi Xia, and Wei Zeng. Davinci: Reinforcing visual-structural syntax in mllms for generalized scientific diagram parsing. InThe Fourteenth International Conference on Learning Representations

  49. [49]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  50. [50]

    Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation.arXiv preprint arXiv:2406.09961, 2024

    Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, et al. Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation.arXiv preprint arXiv:2406.09961, 2024

  51. [51]

    Matplotagent: Method and evaluation for llm-based agentic scientific data visualization

    Zhiyu Yang, Zihan Zhou, Shuo Wang, Xin Cong, Xu Han, Yukun Yan, Zhenghao Liu, Zhixing Tan, Pengyuan Liu, Dong Yu, et al. Matplotagent: Method and evaluation for llm-based agentic scientific data visualization. arXiv preprint arXiv:2402.11453, 2024

  52. [52]

    A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning

    Hiroshi Yoshihara, Taiki Yamaguchi, and Yuichi Inoue. A practical two-stage recipe for mathematical llms: Maximizing accuracy with sft and efficiency with reinforcement learning. arXiv preprint arXiv:2507.08267, 2025

  53. [53]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  54. [54]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  55. [55]

    Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

    Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, and Lin Ma. Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

  56. [56]

    Chartedit: How far are mllms from automating chart analysis? evaluating mllms’ capability via chart editing

    Xuanle Zhao, Xuexin Liu, Haoyue Yang, Xianzhen Luo, Fanhu Zeng, Jianling Li, Qi Shi, and Chi Chen. Chartedit: How far are mllms from automating chart analysis? evaluating mllms’ capability via chart editing. arXiv preprint arXiv:2505.11935, 2025

  57. [57]

    ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation

    Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ChartCoder: Advancing multimodal large language model for chart-to-code generation. arXiv preprint arXiv:2501.06598, 2025

  58. [58]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024

  59. [59]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Association for Computational Linguistics. URL http://arxiv.org/abs/2403.13372

  60. [60]

    EasyR1: An efficient, scalable, multi-modality RL training framework. https://github.com/hiyouga/EasyR1, 2025

    Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. EasyR1: An efficient, scalable, multi-modality RL training framework. https://github.com/hiyouga/EasyR1, 2025

  61. [61]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  62. [62]

    Self-contained: remove/disable external dependencies (\includegraphics, \input, .bib, file paths)

    Layout fix (only if needed). If standalone wrapping shifts layout/clipping, apply minimal local fixes (e.g., border, baseline, missing libraries, minimal macro/color defs). 5. Self-contained. Remove/disable external dependencies (\includegraphics, \input, .bib, file paths). Image reference. <IMAGE START> {image} <IMAGE END> Code to standardize. <CODE START> {co…

  63. [63]

    correctness

    Minimal edits.Fix only what the error indicates (e.g., missing packages/commands, missing files, fragment wrappers). Keep coordinates and drawing commands unchanged whenever possible. Compilation error excerpt. <ERROR START> {error} <ERROR END> Code to repair. <CODE START> {code} <CODE END> \documentclass[border=5pt]{standalone} \usepackage{tikz} \usetikz...