pith. machine review for the scientific record.

arxiv: 2604.06079 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI

Recognition: 2 Lean theorem links

Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords graphics program synthesis · TikZ code generation · reinforcement learning · multimodal large language models · scientific visualization · self-consistency · benchmark evaluation · code synthesis
0 comments

The pith

Dual self-consistency reinforcement learning lets an 8B model generate accurate TikZ code for scientific graphics and outperform much larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the difficulty of turning static scientific images into precise editable TikZ code by identifying two main barriers: poor data quality in existing image-code pairs and weak evaluation standards. It closes these gaps with a large dataset of executable pairs built through an execution-focused engine and a benchmark that tests both visual match and logical structure. The central technique applies a dual self-consistency reinforcement learning loop that uses round-trip verification to discard inconsistent or degenerate code. This training produces a compact model that exceeds the performance of proprietary systems and far larger open models on the new benchmark. If the approach holds, scientific diagrams could be routinely converted into modifiable code across many fields without needing giant models.

Core claim

The paper shows that an Execution-Centric Data Engine can produce SciTikZ-230K, a corpus of strictly executable and visually aligned image-TikZ pairs spanning eleven disciplines, and that Dual Self-Consistency Reinforcement Learning with Round-Trip Verification can optimize a model to penalize degenerate outputs and raise overall consistency. The resulting SciTikZer-8B reaches state-of-the-art performance, surpassing Gemini-2.5-Pro and Qwen3-VL-235B-A22B-Instruct on both visual fidelity and structural logic.
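A minimal sketch of the executability gate this claim presupposes, assuming a stock pdflatex toolchain. The `standalone` wrapper mirrors the preamble visible in the paper's appendix prompts, but `compiles` and `filter_executable` are hypothetical helpers, and the actual engine additionally attempts MLLM-guided remediation of failing code rather than discarding it.

```python
import os
import subprocess
import tempfile

# Standalone wrapper matching the preamble used in the paper's appendix prompts.
TEMPLATE = r"""\documentclass[border=5pt]{standalone}
\usepackage{tikz}
%s
"""

def compiles(tikz_code: str, timeout_s: int = 60) -> bool:
    """Return True iff the TikZ snippet compiles to a PDF under pdflatex."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "fig.tex"), "w") as f:
            f.write(TEMPLATE % tikz_code)
        try:
            result = subprocess.run(
                ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "fig.tex"],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0 and os.path.exists(os.path.join(tmp, "fig.pdf"))

def filter_executable(pairs):
    """Keep only (image, tikz_code) pairs whose code actually executes."""
    return [(img, code) for img, code in pairs if compiles(code)]
```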

What carries the argument

A Dual Self-Consistency Reinforcement Learning optimization paradigm that applies Round-Trip Verification to penalize degenerate code and raise self-consistency in the generated TikZ programs.
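The abstract does not spell out the reward, so the following is one plausible instantiation of a dual objective rather than the authors' formula: `render`, `embed_image`, and `regenerate` are assumed callables, and the equal weighting and floor penalty are invented for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def dsc_reward(code, target_emb, render, embed_image, regenerate):
    """Dual self-consistency reward, schematically.

    Term 1 (visual alignment): the rendering of `code` should match the
    target image. Term 2 (round-trip consistency): code regenerated from
    that rendering should render to the same picture, which penalizes
    degenerate programs that compile but do not encode the image.
    """
    rendered = render(code)                   # assumed to return None on compile failure
    if rendered is None:
        return -1.0                           # hard floor for non-executable code
    visual = cosine(embed_image(rendered), target_emb)

    round_trip_code = regenerate(rendered)    # back-translate image -> code
    re_rendered = render(round_trip_code)
    if re_rendered is None:
        consistency = -1.0                    # round trip collapsed: degenerate output
    else:
        consistency = cosine(embed_image(re_rendered), embed_image(rendered))

    return 0.5 * visual + 0.5 * consistency   # equal weights are an assumption
```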

If this is right

  • High-quality executable image-TikZ datasets can be scaled across scientific domains to support further model training.
  • Multifaceted benchmarks that separately score visual alignment and logical structure can become standard for graphics synthesis.
  • Reinforcement learning with consistency verification can improve code output from multimodal models without requiring extreme parameter counts.
  • Generated TikZ code can be used directly for precise rendering and editing of complex hierarchical scientific schematics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same round-trip verification pattern could be applied to synthesize code for other diagram languages or plotting libraries.
  • Targeted consistency reinforcement may prove more efficient than raw scale for narrow-domain code generation tasks.
  • Automated conversion of figures from published papers into editable code could become practical for reproducibility checks.
  • Testing the trained model on figures extracted from recent journal articles would show whether the benchmark gains transfer to real research outputs.

Load-bearing premise

The Execution-Centric Data Engine produces image-TikZ pairs that are strictly executable and visually aligned, and the round-trip verification step in reinforcement learning improves consistency without introducing new biases or degenerate solutions.
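That premise is testable in isolation. Below is a toy stand-in for the visual-alignment half, assuming same-size raster inputs; the paper scores alignment with perceptual metrics (SigLIP, LPIPS, DreamSim, per Figure 5), so this normalized cross-correlation gate and its 0.9 threshold are illustrative only.

```python
import numpy as np

def visually_aligned(rendered: np.ndarray, target: np.ndarray,
                     threshold: float = 0.9) -> bool:
    """Toy alignment gate: normalized cross-correlation of pixel intensities.

    A crude stand-in for the perceptual scores the paper reports; the
    threshold is an arbitrary assumption, not the paper's criterion.
    """
    if rendered.shape != target.shape:
        return False  # a real pipeline would rasterize both to a common size
    a = rendered.astype(np.float64).ravel()
    b = target.astype(np.float64).ravel()
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float(np.mean(a * b)) >= threshold
```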

What would settle it

Direct evaluation of SciTikZer-8B on SciTikZ-Bench showing lower scores than Gemini-2.5-Pro or Qwen3-VL-235B on combined visual-fidelity and structural-logic metrics would disprove the claimed superiority.
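Stated as a predicate, the falsification test is mechanical. A minimal sketch, assuming both benchmark axes are normalized to [0, 1] and combined with equal weight; the aggregation is our assumption, not the paper's:

```python
def combined(visual: float, structural: float, w: float = 0.5) -> float:
    """Fold visual fidelity and structural logic into one score."""
    return w * visual + (1.0 - w) * structural

def superiority_holds(scores: dict) -> bool:
    """`scores` maps model name -> (visual, structural) on SciTikZ-Bench.

    The paper's claim is falsified if any baseline ties or beats
    SciTikZer-8B on the combined metric.
    """
    champion = combined(*scores["SciTikZer-8B"])
    return all(combined(v, s) < champion
               for name, (v, s) in scores.items() if name != "SciTikZer-8B")
```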

Figures

Figures reproduced from arXiv: 2604.06079 by Honglin Lin, Juekai Lin, Lijun Wu, Sijing Li, Tianwei Lin, Wenqiao Zhang, Xiaoyang Wang, Yun Zhu, Zheng Liu.

Figure 1
Figure 1: Challenges in graphics program synthesis. Current open-source MLLMs struggle with the strict constraints of TikZ, exhibiting critical issues such as syntax hallucinations, dependency omissions, and geometric misalignments. Web-scraped data suffers from intrinsic noise, and existing synthetic corpora are often low-quality, with cluttered layouts and limited structural coherence. This results in models that frequ… view at source ↗
Figure 2
Figure 2: Overview of the Execution-Centric Data Engine. Beyond passive filtering, it uses MLLM-guided remediation to correct compilation faults and distill misaligned pairs, preserving diversity under strict visual-program alignment. view at source ↗
Figure 3
Figure 3: Overview of SciTikZ-230K Dataset. The chart illustrates the hierarchical distribution of our … view at source ↗
Figure 4
Figure 4: Overview of the SciTikZ Framework. The pipeline initializes with SFT and Curriculum Data Selection. Core DSC-RL integrates Visual Alignment for pixel precision and Self-Consistency Refinement via symbolic back-translation. view at source ↗
Figure 5
Figure 5: Impact of Data Curation. Curated data enhances both executability and visual fidelity in 4B and 8B models compared with raw data and DaTikZ-v3. [Bar-chart panels: Succ., SigLIP, LPIPS, and DreamSim scores for 4B and 8B models] view at source ↗
Figure 6
Figure 6: Ablation of Dual Self-Consistency (DSC). Comparison between standard RL and our full DSC method. view at source ↗
Figure 7
Figure 7: Case Analysis. A qualitative comparison of rendered TikZ code generated by different models against the Ground Truth. We evaluate SciTikZer-8B against SOTA baselines, including Gemini-2.5-Pro, DeTikZify-v2.5-8B, and Qwen3-VL-Instruct-32B. view at source ↗
Figure 8
Figure 8: Example of diagnostic remediation for compilation errors. view at source ↗
Figure 9
Figure 9: Representative samples from SciTikZ-Bench across three difficulty tiers. The benchmark … view at source ↗
Figure 10
Figure 10: Qualitative comparison of diagnostic remediation across 10 diverse cases. We evaluate … view at source ↗
Figure 11
Figure 11: Detailed analysis of program synthesis for a circuit schematic. We compare SciTikZer … view at source ↗
read the original abstract

Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lack strict executability and reliable visual alignment; (2) Evaluation Gap: a lack of benchmarks for both structural and visual fidelity. To address these, we present a closed-loop framework featuring: SciTikZ-230K, a large-scale, high-quality dataset from our Execution-Centric Data Engine covering 11 diverse scientific disciplines; SciTikZ-Bench, a multifaceted benchmark spanning from basic geometric constructs to intricate hierarchical schematics to evaluate both visual fidelity and structural logic. To further broaden the scope of visual-code optimization methodology, we introduce a novel Dual Self-Consistency Reinforcement Learning optimization paradigm, which utilizes Round-Trip Verification to penalize degenerate code and boost overall self-consistency. Empowered by these, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary giants like Gemini-2.5-Pro and massive models like Qwen3-VL-235B-A22B-Instruct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to address data quality and evaluation gaps in graphics program synthesis by introducing an Execution-Centric Data Engine that produces the SciTikZ-230K dataset of executable image-TikZ pairs across 11 disciplines, the SciTikZ-Bench for assessing structural and visual fidelity, and a Dual Self-Consistency Reinforcement Learning paradigm that uses Round-Trip Verification to train SciTikZer-8B, which reportedly achieves SOTA by outperforming Gemini-2.5-Pro and Qwen3-VL-235B-A22B-Instruct.

Significance. If the closed-loop verification proves robust, the work would be significant by supplying a large-scale executable dataset and multifaceted benchmark that the field currently lacks, plus a novel RL objective for boosting self-consistency in code generation. These resources could enable more reliable training and assessment of MLLMs for precise scientific visualization reverse-engineering.

major comments (2)
  1. [§3] §3 (Execution-Centric Data Engine and SciTikZ-Bench): both the 230K training set and the benchmark are generated by the same engine, so any systematic rendering mismatch (coordinate drift, missing labels, style artifacts) would be invisible to Round-Trip Verification yet would inflate all reported metrics; the manuscript must show that verification detects such misalignments.
  2. [§4] §4 (Dual Self-Consistency RL and Round-Trip Verification): the claim that the RL objective reliably penalizes degenerate yet compilable solutions and boosts true visual fidelity lacks supporting ablations or analysis demonstrating it avoids semantic drift; without external real-world figures or human preference studies, the SOTA claim on SciTikZ-Bench rests on an untested internal loop.
minor comments (1)
  1. [Abstract] Abstract: the SOTA claim is stated without quantitative metrics, error bars, or comparison details; including them would let readers assess the headline result immediately.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their detailed and insightful comments on our manuscript. We address each major comment below and outline the revisions we plan to make to improve the clarity and robustness of our work.

read point-by-point responses
  1. Referee: [§3] §3 (Execution-Centric Data Engine and SciTikZ-Bench): both the 230K training set and the benchmark are generated by the same engine, so any systematic rendering mismatch (coordinate drift, missing labels, style artifacts) would be invisible to Round-Trip Verification yet would inflate all reported metrics; the manuscript must show that verification detects such misalignments.

    Authors: We agree that this is an important consideration to ensure the reliability of our closed-loop system. The Execution-Centric Data Engine incorporates multiple checks for executability and visual alignment during generation. To directly address the concern, we will include in the revised manuscript an experiment that injects synthetic misalignments (such as coordinate drifts and missing elements) into TikZ code and demonstrates that the Round-Trip Verification successfully identifies and filters these cases, preventing them from affecting the metrics; a minimal sketch of such an injection follows this rebuttal. revision: yes

  2. Referee: [§4] §4 (Dual Self-Consistency RL and Round-Trip Verification): the claim that the RL objective reliably penalizes degenerate yet compilable solutions and boosts true visual fidelity lacks supporting ablations or analysis demonstrating it avoids semantic drift; without external real-world figures or human preference studies, the SOTA claim on SciTikZ-Bench rests on an untested internal loop.

    Authors: We thank the referee for highlighting this. Our manuscript does include ablation studies in §4 comparing the Dual Self-Consistency RL against baselines and variants, showing consistent gains in both structural accuracy and visual similarity metrics on SciTikZ-Bench. These ablations help demonstrate the objective's effectiveness in penalizing degenerate solutions. However, we acknowledge that further analysis specifically targeting semantic drift would be beneficial. We will expand the ablation section in the revision to include additional metrics and case studies illustrating how the RL avoids semantic inconsistencies. Regarding external real-world figures and human preference studies, while SciTikZ-Bench draws from diverse scientific domains to simulate real-world scenarios, comprehensive human evaluations are resource-intensive and were not conducted in this work. We believe the multifaceted benchmark provides a strong proxy for visual fidelity. revision: partial

standing simulated objections not resolved
  • Comprehensive human preference studies on external real-world figures, which would require significant additional experimental resources beyond the scope of this revision.
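The injection experiment promised in rebuttal point 1 could be prototyped as below. The coordinate regex, drift magnitudes, and the `verifier` hook are illustrative assumptions, not the authors' protocol; the point is that drifted code still compiles yet should score lower under sound Round-Trip Verification.

```python
import random
import re

# Matches explicit numeric coordinates like (1.5,-2) in TikZ source.
COORD = re.compile(r"\((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\)")

def inject_coordinate_drift(tikz: str, dx: float = 0.5, dy: float = 0.5,
                            p: float = 0.3) -> str:
    """Shift a fraction p of explicit (x, y) coordinates in TikZ source.

    The perturbed code still compiles but no longer matches the original
    image, so a sound verifier should score the perturbed pair lower.
    """
    def drift(m: re.Match) -> str:
        if random.random() > p:
            return m.group(0)
        x, y = float(m.group(1)), float(m.group(2))
        return f"({x + dx:.2f},{y + dy:.2f})"
    return COORD.sub(drift, tikz)

# Expected outcome, with verifier(image, code) a hypothetical scoring hook:
# assert verifier(image, tikz) > verifier(image, inject_coordinate_drift(tikz))
```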

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core contributions—an Execution-Centric Data Engine for generating SciTikZ-230K, the SciTikZ-Bench, and the Dual Self-Consistency RL paradigm with Round-Trip Verification—are presented as empirical engineering steps leading to a trained model whose performance is measured on the benchmark. No equations, derivations, or claims reduce by construction to their own inputs (e.g., no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations of uniqueness theorems). The SOTA claim is an external empirical outcome relative to other models, not a tautological restatement of the training procedure. The internal generation of data and benchmark raises separate validity questions but does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities; full text required for ledger construction.

pith-pipeline@v0.9.0 · 5567 in / 1143 out tokens · 50746 ms · 2026-05-10T19:40:36.798845+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

    cs.CV · 2026-04 · unverdicted · novelty 3.0

    This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...

Reference graph

Works this paper leans on

63 extracted references · 32 canonical work pages · cited by 1 Pith paper · 16 internal anchors

  1. [1]

    Claude Sonnet 4.5 system card. https://www.anthropic.com/system-cards, 2025

    Anthropic. Claude Sonnet 4.5 system card. https://www.anthropic.com/system-cards, 2025

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

  3. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  5. [5]

    AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ

    Jonas Belouadi, Anne Lauscher, and Steffen Eger. AutomaTikZ: Text-guided synthesis of scientific vector graphics with TikZ. arXiv preprint arXiv:2310.00367, 2023

  6. [6]

    DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

    Jonas Belouadi, Simone Ponzetto, and Steffen Eger. DeTikZify: Synthesizing graphics programs for scientific figures and sketches with TikZ. Advances in Neural Information Processing Systems, 37:85074–85108, 2024

  7. [7]

    Tikzero: Zero-shot text-guided graphics program synthesis

    Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, and Simone Ponzetto. Tikzero: Zero-shot text-guided graphics program synthesis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17793–17806, 2025

  8. [8]

    Learning to synthesize graphics programs for geometric artworks

    Qi Bing, Chaoyi Zhang, and Weidong Cai. Learning to synthesize graphics programs for geometric artworks. InInternational Conference on Pattern Recognition, pages 259–274. Springer, 2025

  9. [9]

    Nougat: Neural Optical Understanding for Academic Documents

    Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents, 2023. arXiv preprint arXiv:2308.13418

  10. [10]

    Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback.arXiv preprint arXiv:2507.20766, 2025

    Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, and Yu Qiao. Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback.arXiv preprint arXiv:2507.20766, 2025

  11. [11]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  12. [12]

    Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual- linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  13. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  14. [14]

    Image-to-markup generation with coarse-to-fine attention

    Yuntian Deng, Anssi Kanervisto, Jeffrey Ling, and Alexander M Rush. Image-to-markup generation with coarse-to-fine attention. InInternational Conference on Machine Learning, pages 980–989. PMLR, 2017

  15. [15]

    CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code

    Aryaz Eghbali and Michael Pradel. CrystalBLEU: precisely and efficiently measuring the similarity of code. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–12, 2022

  16. [16]

    GRIT: Teaching MLLMs to Think with Images

    Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images.arXiv preprint arXiv:2505.15879, 2025

  17. [17]

    Rlef: Grounding code llms in execution feedback with reinforcement learning

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning.arXiv preprint arXiv:2410.02089, 2024

  18. [18]

    Chartllama: A multimodal llm for chart understanding and generation, 2023

    Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. Chartllama: A multimodal llm for chart understanding and generation, 2023

  19. [19]

    Dual learning for machine translation.Advances in neural information processing systems, 29, 2016

    Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation.Advances in neural information processing systems, 29, 2016

  20. [20]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

  21. [21]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  22. [22]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  23. [23]

    Self-training large language models for improved visual program synthesis with visual reinforcement

    Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, and Manmohan Chandraker. Self-training large language models for improved visual program synthesis with visual reinforcement. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14344–14353, 2024

  24. [24]

    Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Advances in neural information processing systems, 25, 2012

  25. [25]

    Tikzilla: Scaling text-to-tikz with high-quality data and reinforcement learning

    TikZilla: Scaling text-to-TikZ with high-quality data and reinforcement learning

  26. [26]

    Metal: A multi-agent framework for chart generation with test-time scaling, 2025

    Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, and Nanyun Peng. Metal: A multi-agent framework for chart generation with test-time scaling, 2025. URL https://arxiv.org/abs/2502.17651

  27. [27]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  28. [28]

    Differentiable vector graphics rasterization for editing and learning.ACM Transactions on Graphics (TOG), 39(6):1–15, 2020

    Tzu-Mao Li, Michal Lukáč, Michaël Gharbi, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG), 39(6):1–15, 2020

  29. [29]

    Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods, 2026

    Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, and Lijun Wu. Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods, 2026. URL https://arxiv.org/abs/2601.21821

  30. [30]

    Deplot: One-shot visual language reasoning by plot-to-table translation

    Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. Deplot: One-shot visual language reasoning by plot-to-table translation. InFindings of the Association for Computational Linguistics: ACL 2023, pages 10381–10399, 2023

  31. [31]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Haoyu Liu, Daya Guo, Junzhao Zheng, J.L. Wang, et al. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.13602, 2024

  32. [32]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR). OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  33. [33]

    Chart2code53: A large-scale diverse and complex dataset for enhancing chart-to-code generation

    Tianhao Niu, Yiming Cui, Baoxin Wang, Xiao Xu, Xin Yao, Qingfu Zhu, Dayong Wu, Shijin Wang, and Wanxiang Che. Chart2code53: A large-scale diverse and complex dataset for enhancing chart-to-code generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15839–15855, 2025

  34. [34]

    Image2struct: A benchmark for evaluating vision-language models in extracting structured information from images, 2024

    Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, and Percy Liang. Image2struct: A benchmark for evaluating vision-language models in extracting structured information from images, 2024

  35. [35]

    Rendering-Aware Reinforcement Learning for Vector Graphics Generation

    Juan A Rodriguez, Haotian Zhang, Abhay Puri, Aarash Feizi, Rishav Pramanik, Pascal Wichmann, Arnab Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, et al. Rendering-aware reinforcement learning for vector graphics generation, 2025. URL https://arxiv.org/abs/2505.20793

  36. [36]

    Sketch2diagram: Generating vector diagrams from hand-drawn sketches

    Itsumi Saito, Haruto Yoshida, and Keisuke Sakaguchi. Sketch2diagram: Generating vector diagrams from hand-drawn sketches. In13th International Conference on Learning Representations, ICLR 2025, pages 52825– 52847. International Conference on Learning Representations, ICLR, 2025

  37. [37]

    Vispath: Automated visualization code synthesis via multi-path reasoning and feedback-driven optimization.arXiv e-prints, pages arXiv–2502, 2025

    Wonduk Seo, Seungyong Lee, Daye Kang, Zonghao Yuan, and Seunghyun Lee. Vispath: Automated visualization code synthesis via multi-path reasoning and feedback-driven optimization.arXiv e-prints, pages arXiv–2502, 2025

  38. [38]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  39. [39]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  40. [40]

    Eed: Extended edit distance measure for machine translation

    Peter Stanchev, Weiyue Wang, and Hermann Ney. Eed: Extended edit distance measure for machine translation. InProceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 514–520, 2019

  41. [41]

    Openthinkimg: Learning to think with images via visual tool reinforcement learning.arXiv preprint arXiv:2505.08617, 2025

    Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning, 2025. URL https://arxiv.org/abs/2505.08617

  42. [42]

    Vipergpt: Visual inference via python execution for reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11888–11898, 2023

  43. [43]

    Graph drawing in TikZ

    Till Tantau. Graph drawing in TikZ. In International Symposium on Graph Drawing, pages 517–528. Springer, 2012

  44. [44]

    Mathcoder-vl: Bridging vision and code for enhanced multimodal mathematical reasoning.arXiv preprint arXiv:2505.10557, 2025

    Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, et al. Mathcoder-vl: Bridging vision and code for enhanced multimodal mathematical reasoning.arXiv preprint arXiv:2505.10557, 2025

  45. [45]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  46. [46]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  47. [47]

    Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots

    Chengyue Wu, Zhixuan Liang, Yixiao Ge, Qiushan Guo, Zeyu Lu, Jiahao Wang, Ying Shan, and Ping Luo. Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 3006–3028, 2025

  48. [48]

    Davinci: Reinforcing visual-structural syntax in mllms for generalized scientific diagram parsing

    ZENG Xingchen, Zhewei Su, Hengming Zhang, Juyong Jiang, Jiazhi Xia, and Wei Zeng. Davinci: Reinforcing visual-structural syntax in mllms for generalized scientific diagram parsing. InThe Fourteenth International Conference on Learning Representations

  49. [49]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  50. [50]

    Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation.arXiv preprint arXiv:2406.09961, 2024

    Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, et al. Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation.arXiv preprint arXiv:2406.09961, 2024

  51. [51]

    Matplotagent: Method and evaluation for llm-based agentic scientific data visualization

    Zhiyu Yang, Zihan Zhou, Shuo Wang, Xin Cong, Xu Han, Yukun Yan, Zhenghao Liu, Zhixing Tan, Pengyuan Liu, Dong Yu, et al. Matplotagent: Method and evaluation for llm-based agentic scientific data visualization. arXiv preprint arXiv:2402.11453, 2024

  52. [52]

    A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning

    Hiroshi Yoshihara, Taiki Yamaguchi, and Yuichi Inoue. A practical two-stage recipe for mathematical llms: Maximizing accuracy with sft and efficiency with reinforcement learning. arXiv preprint arXiv:2507.08267, 2025

  53. [53]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  54. [54]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  55. [55]

    Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

    Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, and Lin Ma. Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning.arXiv preprint arXiv:2511.00391, 2025

  56. [56]

    Chartedit: How far are mllms from automating chart analysis? evaluating mllms’ capability via chart editing

    Xuanle Zhao, Xuexin Liu, Haoyue Yang, Xianzhen Luo, Fanhu Zeng, Jianling Li, Qi Shi, and Chi Chen. Chartedit: How far are mllms from automating chart analysis? evaluating mllms’ capability via chart editing. arXiv preprint arXiv:2505.11935, 2025

  57. [57]

    ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation

    Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ChartCoder: Advancing multimodal large language model for chart-to-code generation. arXiv preprint arXiv:2501.06598, 2025

  58. [58]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. LlamaFactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024

  59. [59]

    LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

    Association for Computational Linguistics. URL http://arxiv.org/abs/2403.13372

  60. [60]

    EasyR1: An efficient, scalable, multi-modality RL training framework. https://github.com/hiyouga/EasyR1, 2025

    Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. EasyR1: An efficient, scalable, multi-modality RL training framework. https://github.com/hiyouga/EasyR1, 2025

  61. [61]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  62. [62]

    Self-contained: remove/disable external dependencies (\includegraphics, \input, .bib, file paths)

    Layout fix (only if needed). If standalone wrapping shifts layout/clipping, apply minimal local fixes (e.g., border, baseline, missing libraries, minimal macro/color defs). 5. Self-contained. Remove/disable external dependencies (\includegraphics, \input, .bib, file paths). Image reference. <IMAGE START> {image} <IMAGE END> Code to standardize. <CODE START> {co…

  63. [63]

    correctness

    Minimal edits.Fix only what the error indicates (e.g., missing packages/commands, missing files, fragment wrappers). Keep coordinates and drawing commands unchanged whenever possible. Compilation error excerpt. <ERROR START> {error} <ERROR END> Code to repair. <CODE START> {code} <CODE END> \documentclass[border=5pt]{standalone} \usepackage{tikz} \usetikz...