pith. machine review for the scientific record.

arxiv: 2604.22192 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 12:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords chart-to-code generation · vision-language models · data-centric training · tri-perspective tuning · inquiry-driven reinforcement learning · multimodal alignment · information invariance · atomic QA verification

The pith

CharTide decouples chart-to-code training into visual-perception, code-logic, and fusion streams and adds invariance-based verification, letting 7B models surpass GPT-4o.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that simply scaling chart-code pairs conflates visual perception with program logic and limits how well models can use multimodal data. It constructs a 2M-sample dataset split into three streams, visual perception, pure-text code logic, and their fusion, trains on each, and then adds an Inquiry-Driven RL stage that treats alignment as verification. In that stage a frozen Inspector checks whether generated charts answer the same atomic visual questions as the originals, enforcing information invariance. Experiments across ChartMimic, Plot2Code, and ChartX show the resulting 7B and 8B models beat open-source baselines and GPT-4o while remaining competitive with GPT-5.

Core claim

A data-centric redesign that first builds separate visual, logic, and fusion training streams and then replaces heuristic scoring with atomic-QA verification grounded in information invariance produces chart-to-code models that, at 7B scale, outperform both specialized open-source systems and GPT-4o on standard benchmarks.

What carries the argument

Tri-Perspective Tuning that explicitly splits training data into visual-perception, pure-text code-logic, and modality-fusion streams, combined with Inquiry-Driven RL that uses a frozen inspector to verify generated charts via atomic QA tasks under the principle of information invariance.
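
A minimal sketch of how that verification loop could compute its reward, working only from the abstract's description: Inspector, render_chart, and every signature below are hypothetical stand-ins, not the paper's code. The reward gates on executability, then scores the fraction of atomic questions answered identically on the original chart and the rendered generation.

    # Hedged sketch of the Inquiry-Driven reward; all names are hypothetical.
    class Inspector:
        """Stand-in for the frozen Inspector VLM."""
        def answer(self, question: str, chart_image) -> str:
            raise NotImplementedError  # frozen VLM inference would go here

    def invariance_reward(original_image, generated_code, questions,
                          inspector, render_chart):
        """Scalar reward in [0, 1] under the information-invariance principle."""
        try:
            generated_image = render_chart(generated_code)  # e.g. exec the plotting code
        except Exception:
            return 0.0  # non-executable code earns no reward
        agree = sum(
            inspector.answer(q, original_image) == inspector.answer(q, generated_image)
            for q in questions
        )
        return agree / max(len(questions), 1)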

If this is right

  • A 7B model trained only on supervised data can exceed specialized chart-to-code baselines.
  • Verification via atomic QA yields objective reward signals without relying on VLM scoring or rigid rule matching.
  • The same invariance principle can be applied to other generation tasks where perceptual fidelity must be preserved.
  • Smaller open-source models become competitive with frontier closed models on this narrow domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decoupling approach may transfer to other multimodal generation problems that currently suffer from entangled perception and reasoning.
  • Atomic-QA verification could serve as a general post-training filter for reducing visual hallucinations in code or diagram outputs.
  • If the invariance signal proves stable across chart styles, the method could scale to larger unlabeled chart collections without additional human annotation.

Load-bearing premise

Explicitly separating visual perception, code logic, and fusion during training, then enforcing answer consistency across original and generated charts, produces better multimodal alignment than standard joint training without creating new inconsistencies.

What would settle it

A controlled ablation that trains the same 7B model on merged homogeneous chart-code pairs instead of the three separate streams and measures whether accuracy on ChartMimic drops below the reported CharTide level.
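
Sketched as a hedged harness below, the control isolates a single variable, stream decoupling; train_sft, evaluate_chartmimic, and the data handles are placeholders, not the authors' code.

    # Hypothetical ablation harness; only the data construction differs.
    def run_ablation(base_model, tri_streams, merged_pairs,
                     train_sft, evaluate_chartmimic):
        decoupled = train_sft(base_model, data=tri_streams)  # visual + logic + fusion
        merged = train_sft(base_model, data=merged_pairs)    # same 2M samples, one stream
        return {
            "tri_perspective": evaluate_chartmimic(decoupled),
            "homogeneous": evaluate_chartmimic(merged),
        }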

Figures

Figures reproduced from arXiv: 2604.22192 by Alex Jinpeng Wang, Anxiang Zeng, Jiayi Hu, Kuang He, Peng Hou, Ping Yu, Rui Yan, Xiangxi Zheng, Yuan Yao.

Figure 1
Figure 1: Overview of the Data-Centric CharTide Framework. (Top) Tri-Perspective SFT explicitly decouples data streams to break the performance plateau. (Bottom) Inquiry-Driven RL replaces subjective scoring with objective, fact-based verification. view at source ↗
Figure 2
Figure 2: The detailed pipeline of CharTide. The pipeline consists of two stages: (1) Tri-Perspective SFT constructs three complementary data streams, including Visual Perception, Code Logic, and Modality Fusion, to distill multi-dimensional capabilities into the foundational model; (2) Inquiry-Driven RL aligns the model using a hybrid verification loop, where a frozen Inspector provides objective semantic rewards. … view at source ↗
Figure 3
Figure 3: Training dynamics of the Inquiry-Driven RL phase. The Train Reward, Pass Rate, and Consistency Reward exhibit synchronized upward trajectories, indicating stable and effective optimization. view at source ↗
Figure 4
Figure 4: Reward Hacking Analysis. The continuous rise in rewards normalized by pass rate confirms that the model improves visual fidelity per executable sample, distinct from merely exploiting execution rates. view at source ↗
Figure 5
Figure 5: Qualitative comparison on challenging chart types from ChartMimic Benchmark. We compare CharTide against various baselines on topologically complex and information-dense samples. While baselines often suffer from structural collapse or omit fine-grained visual details, our model achieves high-fidelity reproduction aligned with the ground truth. We provide additional visualization examples in Appendix D.3. view at source ↗
Figure 6
Figure 6: Qualitative comparison of visual similarity matching. We display sample pairs from the SFT filtering stage. The scores assigned by WebSSL-1B (shown above each pair) correlate strongly with the visual and structural consistency observed by human judges, effectively identifying high-quality reproductions while rejecting structural hallucinations. view at source ↗
Figure 7
Figure 7: Worst-case leakage check. We display the 6 ChartMimic test samples with the highest similarity scores against the chartcap training set. The retrieved training samples (right) visually resemble the queries (left) in style and layout but differ in specific content and data values, confirming no direct leakage exists. view at source ↗
Figure 8
Figure 8: Average-case leakage check. We visualize 6 uniformly sampled ChartMimic images and their nearest neighbors in the chartcap training set, showing significant semantic and visual differences. view at source ↗
Figure 9
Figure 9. Figure 9: More visualization results on ChartMimic benchmark. view at source ↗
Figure 10
Figure 10. Figure 10: More visualization results on ChartMimic benchmark. view at source ↗
Figure 11
Figure 11. Figure 11: More visualization results on ChartMimic benchmark. view at source ↗
Original abstract

Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CharTide, a data-centric framework for chart-to-code generation in VLMs. It constructs a 2M-sample dataset via Tri-Perspective Tuning that decouples training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines with only supervised data. It then reformulates alignment as Inquiry-Driven RL using a frozen Inspector to verify generated charts via atomic QA tasks grounded in information invariance (consistent answers to identical visual queries on original vs. generated charts). Experiments on ChartMimic, Plot2Code, and ChartX report that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.

Significance. If the performance gains are robustly attributable to the proposed data redesign rather than scale or unisolated factors, the work would advance data-centric methods for precise multimodal generation tasks. It provides a concrete alternative to homogeneous pair scaling and introduces verifiable, non-heuristic rewards via external inspection, which could generalize to other chart or diagram generation settings where visual fidelity and code correctness must be jointly enforced.

major comments (3)
  1. [Experiments / abstract] The central claim that Tri-Perspective Tuning enables a 7B model to surpass specialized baselines 'using only supervised data' (abstract) is load-bearing but unsupported by controls. No experiment is described that trains the identical base 7B model on an equivalently sized (2M) set of standard, non-decoupled homogeneous chart-code pairs using ordinary supervised fine-tuning. Without this ablation, gains cannot be isolated from dataset volume, the Inquiry-Driven RL component, or the Inspector's QA rewards.
  2. [Inquiry-Driven RL framework] § on Inquiry-Driven RL and information invariance: the claim that the frozen Inspector provides 'objective' and 'verifiable' rewards via atomic QA accuracy rests on the assumption that the Inspector is fully independent and that answer consistency across original/generated charts directly measures generation quality. No details are given on Inspector training data, potential overlap with evaluation benchmarks, or how false positives/negatives in QA are handled, which directly affects whether the RL signal is unbiased.
  3. [Experiments] Table/figure reporting results on ChartMimic, Plot2Code, ChartX: the manuscript provides no information on baseline re-implementations (e.g., whether open-source models were fine-tuned on the same 2M data or used off-the-shelf), statistical significance tests, variance across runs, or checks for data leakage between the constructed 2M dataset and the three evaluation benchmarks. These omissions make it impossible to assess whether the reported outperformance of CharTide-7B/8B over GPT-4o is reproducible or attributable to the proposed method.
minor comments (2)
  1. [Introduction / Method] The abstract and method description introduce several new terms (Tri-Perspective Tuning, Inquiry-Driven Evolution/RL, Inspector) without a concise summary table or diagram showing how the three streams and the RL loop interact; a single overview figure would improve readability.
  2. [Inquiry-Driven RL] Notation for the information-invariance reward (e.g., how atomic QA accuracy is aggregated into a scalar reward) is not formalized; an equation or pseudocode block would clarify the exact computation.
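
One plausible formalization of the aggregation minor comment 2 asks for, offered as this report's reading rather than the paper's notation: with original chart x, candidate code \hat{c} rendered to chart \hat{x}, frozen Inspector A, and atomic questions q_1, ..., q_N,

    r(\hat{c}) = \mathbb{1}\!\left[\operatorname{exec}(\hat{c})\right] \cdot \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ A(q_i, \hat{x}) = A(q_i, x) \right]

so the reward is zero for non-executable code and otherwise equals the fraction of atomic questions answered consistently across the two charts.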

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important gaps in experimental controls, methodological transparency, and reporting that we address below. We have revised the manuscript to incorporate the requested ablations, details, and statistical information.

Point-by-point responses
  1. Referee: [Experiments / abstract] The central claim that Tri-Perspective Tuning enables a 7B model to surpass specialized baselines 'using only supervised data' (abstract) is load-bearing but unsupported by controls. No experiment is described that trains the identical base 7B model on an equivalently sized (2M) set of standard, non-decoupled homogeneous chart-code pairs using ordinary supervised fine-tuning. Without this ablation, gains cannot be isolated from dataset volume, the Inquiry-Driven RL component, or the Inspector's QA rewards.

    Authors: We agree that the absence of this specific control ablation leaves the contribution of Tri-Perspective Tuning incompletely isolated. In the revised manuscript we have added the requested experiment: the identical 7B base model was trained on the same 2M samples using standard homogeneous chart-code pairs and ordinary supervised fine-tuning. Results show that this homogeneous-SFT baseline underperforms the tri-perspective version, indicating that the performance advantage is not solely due to dataset scale. We have updated the abstract and added the ablation table and discussion in the experiments section. Note that the abstract claim refers specifically to the supervised stage; the Inquiry-Driven RL stage is presented separately. revision: yes

  2. Referee: [Inquiry-Driven RL framework] § on Inquiry-Driven RL and information invariance: the claim that the frozen Inspector provides 'objective' and 'verifiable' rewards via atomic QA accuracy rests on the assumption that the Inspector is fully independent and that answer consistency across original/generated charts directly measures generation quality. No details are given on Inspector training data, potential overlap with evaluation benchmarks, or how false positives/negatives in QA are handled, which directly affects whether the RL signal is unbiased.

    Authors: We acknowledge that additional details are required to substantiate the objectivity of the Inspector. In the revised manuscript we have expanded the Inquiry-Driven RL section with: (i) the Inspector's training data (a separately constructed chart-QA corpus with no samples from the 2M training set or the three evaluation benchmarks), (ii) explicit verification of zero overlap with ChartMimic, Plot2Code, and ChartX via embedding similarity and manual inspection, and (iii) mitigation of QA false positives/negatives through multi-query consistency thresholds and majority voting across atomic questions. These additions clarify that the reward signal is derived from an independent, frozen model and is grounded in information invariance. revision: yes

  3. Referee: [Experiments] Table/figure reporting results on ChartMimic, Plot2Code, ChartX: the manuscript provides no information on baseline re-implementations (e.g., whether open-source models were fine-tuned on the same 2M data or used off-the-shelf), statistical significance tests, variance across runs, or checks for data leakage between the constructed 2M dataset and the three evaluation benchmarks. These omissions make it impossible to assess whether the reported outperformance of CharTide-7B/8B over GPT-4o is reproducible or attributable to the proposed method.

    Authors: We agree that these reporting omissions hinder reproducibility assessment. In the revised manuscript we have added: (i) explicit statements on baseline implementations (open-source models were fine-tuned on the identical 2M dataset for fair comparison; GPT-4o and GPT-5 results are reported from the original APIs without fine-tuning), (ii) statistical significance via paired t-tests (p < 0.05) across three independent runs with different random seeds, (iii) standard deviation and variance reported for all metrics, and (iv) data-leakage analysis (deduplication via exact match and semantic similarity thresholds confirming no overlap between the 2M dataset and the evaluation benchmarks). These changes are incorporated into the experimental setup and results sections. revision: yes
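
A minimal sketch of the embedding-based leakage check described in point (iv) and visualized in Figures 7 and 8, assuming a generic embed function and an illustrative 0.95 cosine threshold; neither detail comes from the paper.

    # Hedged leakage-check sketch: flag test samples whose nearest training
    # neighbor exceeds a similarity threshold. embed() and 0.95 are assumptions.
    import numpy as np

    def leakage_check(test_imgs, train_imgs, embed, threshold=0.95):
        test = np.stack([embed(x) for x in test_imgs])    # (n_test, d)
        train = np.stack([embed(x) for x in train_imgs])  # (n_train, d)
        test = test / np.linalg.norm(test, axis=1, keepdims=True)
        train = train / np.linalg.norm(train, axis=1, keepdims=True)
        sims = test @ train.T                             # cosine similarity matrix
        best = sims.argmax(axis=1)
        return [(i, int(j), float(sims[i, j]))
                for i, j in enumerate(best) if sims[i, j] >= threshold]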

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks and independent inspector

full rationale

The paper's core claims rest on constructing a 2M-sample dataset via explicit decoupling into three streams and using a frozen Inspector for atomic QA-based rewards in RL. No equations, self-definitional loops, or fitted parameters are presented that reduce the reported outperformance to the inputs by construction. Experiments are conducted on external benchmarks (ChartMimic, Plot2Code, ChartX) against open baselines and GPT models, with the Inspector described as independent and frozen. No self-citation chains or ansatz smuggling are evident in the provided text that would make the derivation tautological. The absence of an internal ablation does not constitute circularity under the defined criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claims depend on the effectiveness of the newly introduced data-decoupling strategy and the information-invariance principle for verification; the abstract details no explicit free parameters.

axioms (2)
  • domain assumption Decoupling training into visual perception, pure-text code logic, and modality fusion streams enables better leveraging of multimodal supervision than homogeneous chart-code pairs.
    This underpins the Tri-Perspective Tuning strategy described as the first contribution.
  • domain assumption A downstream model should yield consistent answers to identical visual queries across both original and generated charts (information invariance principle).
    This grounds the Inquiry-Driven RL framework and the use of the inspector for reward signals.
invented entities (3)
  • Tri-Perspective Tuning no independent evidence
    purpose: Construct 2M-sample dataset by explicitly decoupling training into three streams
    New data construction method introduced to address conflation of visual perception and program logic.
  • Inquiry-Driven RL no independent evidence
    purpose: Reformulate alignment as data verification using atomic QA tasks instead of heuristic scoring
    New RL framework based on information invariance for providing verifiable rewards.
  • Inspector no independent evidence
    purpose: Frozen model to objectively verify generated charts through atomic QA tasks and provide reward signals based on answer accuracy
    Component used to move beyond rigid rule matching or VLM scoring.

pith-pipeline@v0.9.0 · 5576 in / 1687 out tokens · 56625 ms · 2026-05-08T12:35:45.129184+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 28 canonical work pages · 9 internal anchors

  3. [3]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, and 45 others. 2025a. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, and 8 others. 2025b. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923

  5. [5]

    Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, and Lin Ma. 2025a. Breaking the sft plateau: Multimodal structured reinforcement learning for chart-to-code generation. arXiv preprint arXiv:2508.13587

  6. [6]

    Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Yufeng Zhong, and Lin Ma. 2025b. Chart-r1: Chain-of-thought supervision and reinforcement for advanced chart reasoner. Preprint, arXiv:2507.15509. https://arxiv.org/abs/2507.15509

  7. [7]

    Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, and Yu Qiao. 2025c. Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback. Preprint, arXiv:2507.20766. https://arxiv.org/abs/2507.20766

  8. [8]

    DeepSeek-AI. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948. https://arxiv.org/abs/2501.12948

  9. [9]

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, and 1 others. 2024. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM international conference on multimedia, pages 11198--11201

  10. [10]

    David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, and 1 others. 2025. Scaling language-free visual representation learning. arXiv preprint arXiv:2504.01017

  11. [11]

    Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Bohua Chen, Yi Su, Dongping Chen, Siyuan Wu, Xing Zhou, and 1 others. 2025. Webcode2m: A real-world dataset for code generation from webpage designs. In Proceedings of the ACM on Web Conference 2025, pages 1834--1845

  12. [12]

    Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, and Hanwang Zhang. 2023. Chartllama: A multimodal llm for chart understanding and generation. arXiv preprint arXiv:2311.16483

  13. [13]

    Wei He, Zhiheng Xi, Wanxu Zhao, Xiaoran Fan, Yiwen Ding, Zifei Shan, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Distill visual chart reasoning ability from llms to mllms. arXiv preprint arXiv:2410.18798

  14. [14]

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. 2025. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749

  15. [15]

    Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, and Furu Wei. 2025a. Viscodex: Unified multimodal code generation via merging vision and coding models. arXiv preprint arXiv:2508.09945

  16. [16]

    Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R Lyu, and Xiangyu Yue. 2025b. Screencoder: Advancing visual-to-code generation for front-end automation via modular multimodal agents. arXiv preprint arXiv:2507.22827

  17. [17]

    Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Zexue He, Shafiq Abedin, Jennifer Sun, Ben Wiesel, Eli Schwartz, Ahmed Nassar, Bo Wu, and 1 others. 2025. Chartgen: Scaling chart understanding via code-guided synthetic chart generation. arXiv preprint arXiv:2507.19492

  18. [18]

    Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, and 1 others. 2024. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124

  19. [19]

    Junyoung Lim, Jaewoo Ahn, and Gunhee Kim. 2025. Chartcap: Mitigating hallucination of dense chart captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13171--13182

  20. [20]

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pages 2263--2279

  21. [21]

    Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, and 1 others. 2025. Chartqapro: A more diverse and challenging benchmark for chart question answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 191...

  22. [22]

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, and 1 others. 2025. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365

  23. [23]

    Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, and Lijuan Wang. 2025. Point-rft: Improving multimodal reasoning with visually grounded reinforcement finetuning. arXiv preprint arXiv:2505.19702

  24. [24]

    Tianhao Niu, Yiming Cui, Baoxin Wang, Xiao Xu, Xin Yao, Qingfu Zhu, Dayong Wu, Shijin Wang, and Wanxiang Che. 2025. Chart2code53: A large-scale diverse and complex dataset for enhancing chart-to-code generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15839--15855

  25. [25]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, and 1 others. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193

  26. [26]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

  27. [27]

    Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. 2025. Design2code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pag...

  28. [28]

    Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, and Xiaodong He. 2025. Chartmaster: Advancing chart-to-code generation with real-world charts and chart similarity reinforcement learning. arXiv preprint arXiv:2508.17608

  29. [29]

    Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu. 2025. Divide-and-conquer: Generating ui code from screenshots. Proceedings of the ACM on Software Engineering, 2(FSE):2099--2122

  30. [30]

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, and 1 others. 2025. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265

  31. [31]

    Chengyue Wu, Zhixuan Liang, Yixiao Ge, Qiushan Guo, Zeyu Lu, Jiahao Wang, Ying Shan, and Ping Luo. 2025. Plot2code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3006--3028

  32. [32]

    Renqiu Xia, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Botian Shi, Junchi Yan, and Bo Zhang. 2025. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning. IEEE Transactions on Image Processing

  33. [33]

    Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, and Dahua Lin. 2025. Caprl: Stimulating dense image caption capabilities via reinforcement learning. arXiv preprint arXiv:2509.22647

  34. [34]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025a. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  35. [35]

    Cheng Yang, Chufan Shi, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, and 1 others. 2024. Chartmimic: Evaluating lmm's cross-modal reasoning capability via chart-to-code generation. arXiv preprint arXiv:2406.09961

  36. [36]

    Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. 2025b. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615

  37. [37]

    Sukmin Yun, Rusiru Thushara, Mohammad Bhat, Yongxin Wang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, and 1 others. 2024. Web2code: A large-scale webpage-to-code dataset and evaluation framework for multimodal llms. Advances in neural information processing systems, 37:112134--112157

  38. [38]

    Yuheng Zha, Kun Zhou, Yujia Wu, Yushu Wang, Jie Feng, Zhi Xu, Shibo Hao, Zhengzhong Liu, Eric P Xing, and Zhiting Hu. 2025. Vision-g1: Towards general vision language reasoning with multi-domain data curation. arXiv preprint arXiv:2508.12680

  39. [39]

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975--11986

  40. [40]

    Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, and 1 others. 2025a. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827

  41. [41]

    Zhihan Zhang, Yixin Cao, and Lizi Liao. 2025b. Boosting chart-to-code generation in mllm via dual preference-guided refinement. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 11032--11041

  42. [42]

    Xuanle Zhao, Deyang Jiang, Zhixiong Zeng, Lei Chen, Haibo Qiu, Jing Huang, Yufeng Zhong, Liming Zheng, Yilin Cao, and Lin Ma. 2025a. Vincicoder: Unifying multimodal code generation via coarse-to-fine visual reinforcement learning. arXiv preprint arXiv:2511.00391

  43. [43]

    Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. 2025b. Chartcoder: Advancing multimodal large language model for chart-to-code generation. arXiv preprint arXiv:2501.06598