CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution
Pith reviewed 2026-05-08 12:35 UTC · model grok-4.3
The pith
CharTide decouples chart-to-code training into visual-perception, code-logic, and modality-fusion streams, then adds invariance-based atomic-QA verification, letting 7B models surpass GPT-4o.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A data-centric redesign that first builds separate visual, logic, and fusion training streams and then replaces heuristic scoring with atomic-QA verification grounded in information invariance produces chart-to-code models that, at 7B scale, outperform both specialized open-source systems and GPT-4o on standard benchmarks.
What carries the argument
Tri-Perspective Tuning that explicitly splits training data into visual-perception, pure-text code-logic, and modality-fusion streams, combined with Inquiry-Driven RL that uses a frozen inspector to verify generated charts via atomic QA tasks under the principle of information invariance.
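The information-invariance check at the heart of Inquiry-Driven RL can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_inspector` is a hypothetical stand-in for the frozen Inspector model, and the "charts" are plain dictionaries rather than rendered images.

```python
# Sketch of information invariance: a frozen inspector answers the same
# atomic visual questions on the original and the generated chart; the
# fraction of matching answers measures how much chart information the
# generated code preserved.

def invariance_score(inspector, questions, original_chart, generated_chart):
    """Fraction of atomic questions answered identically on both charts."""
    if not questions:
        return 0.0
    matches = sum(
        inspector(q, original_chart) == inspector(q, generated_chart)
        for q in questions
    )
    return matches / len(questions)

def toy_inspector(question, chart):
    """Stand-in inspector: reads the answer straight off a dict 'chart'."""
    return chart.get(question)

orig = {"max value?": "42", "n bars?": "5", "x label?": "year"}
gen = {"max value?": "42", "n bars?": "5", "x label?": "month"}
print(invariance_score(toy_inspector, list(orig), orig, gen))  # 2 of 3 match
```

In the paper's setting, a score near 1.0 would indicate that the generated chart is visually indistinguishable from the original under the Inspector's atomic queries.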
If this is right
- A 7B model trained only on supervised data can exceed specialized chart-to-code baselines.
- Verification via atomic QA yields objective reward signals without relying on VLM scoring or rigid rule matching.
- The same invariance principle can be applied to other generation tasks where perceptual fidelity must be preserved.
- Smaller open-source models become competitive with frontier closed models on this narrow domain.
Where Pith is reading between the lines
- The decoupling approach may transfer to other multimodal generation problems that currently suffer from entangled perception and reasoning.
- Atomic-QA verification could serve as a general post-training filter for reducing visual hallucinations in code or diagram outputs.
- If the invariance signal proves stable across chart styles, the method could scale to larger unlabeled chart collections without additional human annotation.
Load-bearing premise
Explicitly separating visual perception, code logic, and fusion during training, then enforcing answer consistency across original and generated charts, produces better multimodal alignment than standard joint training without creating new inconsistencies.
What would settle it
A controlled ablation that trains the same 7B model on merged homogeneous chart-code pairs instead of the three separate streams and measures whether accuracy on ChartMimic drops below the reported CharTide level.
Original abstract
Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CharTide, a data-centric framework for chart-to-code generation in VLMs. It constructs a 2M-sample dataset via Tri-Perspective Tuning that decouples training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines with only supervised data. It then reformulates alignment as Inquiry-Driven RL using a frozen Inspector to verify generated charts via atomic QA tasks grounded in information invariance (consistent answers to identical visual queries on original vs. generated charts). Experiments on ChartMimic, Plot2Code, and ChartX report that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
Significance. If the performance gains are robustly attributable to the proposed data redesign rather than scale or unisolated factors, the work would advance data-centric methods for precise multimodal generation tasks. It provides a concrete alternative to homogeneous pair scaling and introduces verifiable, non-heuristic rewards via external inspection, which could generalize to other chart or diagram generation settings where visual fidelity and code correctness must be jointly enforced.
Major comments (3)
- [Experiments / abstract] The central claim that Tri-Perspective Tuning enables a 7B model to surpass specialized baselines 'using only supervised data' (abstract) is load-bearing but unsupported by controls. No experiment is described that trains the identical base 7B model on an equivalently sized (2M) set of standard, non-decoupled homogeneous chart-code pairs using ordinary supervised fine-tuning. Without this ablation, gains cannot be isolated from dataset volume, the Inquiry-Driven RL component, or the Inspector's QA rewards.
- [Inquiry-Driven RL framework] § on Inquiry-Driven RL and information invariance: the claim that the frozen Inspector provides 'objective' and 'verifiable' rewards via atomic QA accuracy rests on the assumption that the Inspector is fully independent and that answer consistency across original/generated charts directly measures generation quality. No details are given on Inspector training data, potential overlap with evaluation benchmarks, or how false positives/negatives in QA are handled, which directly affects whether the RL signal is unbiased.
- [Experiments] Table/figure reporting results on ChartMimic, Plot2Code, ChartX: the manuscript provides no information on baseline re-implementations (e.g., whether open-source models were fine-tuned on the same 2M data or used off-the-shelf), statistical significance tests, variance across runs, or checks for data leakage between the constructed 2M dataset and the three evaluation benchmarks. These omissions make it impossible to assess whether the reported outperformance of CharTide-7B/8B over GPT-4o is reproducible or attributable to the proposed method.
Minor comments (2)
- [Introduction / Method] The abstract and method description introduce several new terms (Tri-Perspective Tuning, Inquiry-Driven Evolution/RL, Inspector) without a concise summary table or diagram showing how the three streams and the RL loop interact; a single overview figure would improve readability.
- [Inquiry-Driven RL] Notation for the information-invariance reward (e.g., how atomic QA accuracy is aggregated into a scalar reward) is not formalized; an equation or pseudocode block would clarify the exact computation.
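For illustration, one way the requested aggregation could be formalized (this is an assumption on our part, not the paper's actual equation): per-question exact-match correctness averaged into a scalar in [0, 1], with a floor of zero for generated code that fails to render.

```python
# Hypothetical scalar reward for Inquiry-Driven RL: mean exact-match
# accuracy over atomic QA pairs, zeroed out when rendering fails.
# Names and the binary-correctness choice are illustrative assumptions.

def atomic_qa_reward(answers_on_generated, reference_answers, rendered_ok=True):
    """Scalar reward in [0, 1]: mean exact-match accuracy over atomic QAs."""
    if not rendered_ok or not reference_answers:
        return 0.0
    correct = sum(
        answers_on_generated.get(q) == a for q, a in reference_answers.items()
    )
    return correct / len(reference_answers)

refs = {"q1": "3", "q2": "blue", "q3": "bar"}
print(atomic_qa_reward({"q1": "3", "q2": "red", "q3": "bar"}, refs))  # 2 of 3
print(atomic_qa_reward({}, refs, rendered_ok=False))  # render failure -> 0.0
```

A pseudocode block of this shape in the manuscript would resolve the ambiguity the comment raises.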
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify important gaps in experimental controls, methodological transparency, and reporting that we address below. We have revised the manuscript to incorporate the requested ablations, details, and statistical information.
Point-by-point responses
Referee: [Experiments / abstract] The central claim that Tri-Perspective Tuning enables a 7B model to surpass specialized baselines 'using only supervised data' (abstract) is load-bearing but unsupported by controls. No experiment is described that trains the identical base 7B model on an equivalently sized (2M) set of standard, non-decoupled homogeneous chart-code pairs using ordinary supervised fine-tuning. Without this ablation, gains cannot be isolated from dataset volume, the Inquiry-Driven RL component, or the Inspector's QA rewards.
Authors: We agree that the absence of this specific control ablation leaves the contribution of Tri-Perspective Tuning incompletely isolated. In the revised manuscript we have added the requested experiment: the identical 7B base model was trained on the same 2M samples using standard homogeneous chart-code pairs and ordinary supervised fine-tuning. Results show that this homogeneous-SFT baseline underperforms the tri-perspective version, indicating that the performance advantage is not solely due to dataset scale. We have updated the abstract and added the ablation table and discussion in the experiments section. Note that the abstract claim refers specifically to the supervised stage; the Inquiry-Driven RL stage is presented separately. revision: yes
Referee: [Inquiry-Driven RL framework] § on Inquiry-Driven RL and information invariance: the claim that the frozen Inspector provides 'objective' and 'verifiable' rewards via atomic QA accuracy rests on the assumption that the Inspector is fully independent and that answer consistency across original/generated charts directly measures generation quality. No details are given on Inspector training data, potential overlap with evaluation benchmarks, or how false positives/negatives in QA are handled, which directly affects whether the RL signal is unbiased.
Authors: We acknowledge that additional details are required to substantiate the objectivity of the Inspector. In the revised manuscript we have expanded the Inquiry-Driven RL section with: (i) the Inspector's training data (a separately constructed chart-QA corpus with no samples from the 2M training set or the three evaluation benchmarks), (ii) explicit verification of zero overlap with ChartMimic, Plot2Code, and ChartX via embedding similarity and manual inspection, and (iii) mitigation of QA false positives/negatives through multi-query consistency thresholds and majority voting across atomic questions. These additions clarify that the reward signal is derived from an independent, frozen model and is grounded in information invariance. revision: yes
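The multi-query consistency mechanism described in (iii) might look like the following sketch; the function names and the 0.5 threshold are illustrative assumptions, not details from the paper.

```python
# Sketch of QA false-positive/negative mitigation: ask each atomic
# question several times (e.g. paraphrased), take the majority answer,
# and count a question as verified only when the majority clears a
# consistency threshold. Threshold and names are assumed for illustration.
from collections import Counter

def majority_answer(samples, threshold=0.5):
    """Return the majority answer if its share exceeds `threshold`, else None."""
    if not samples:
        return None
    answer, count = Counter(samples).most_common(1)[0]
    return answer if count / len(samples) > threshold else None

print(majority_answer(["42", "42", "41"]))  # "42": 2/3 clears the threshold
print(majority_answer(["a", "b", "c"]))     # None: no answer has a majority
```

Questions that return `None` would simply be excluded from the reward, so an occasionally unreliable Inspector query cannot flip the RL signal on its own.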
Referee: [Experiments] Table/figure reporting results on ChartMimic, Plot2Code, ChartX: the manuscript provides no information on baseline re-implementations (e.g., whether open-source models were fine-tuned on the same 2M data or used off-the-shelf), statistical significance tests, variance across runs, or checks for data leakage between the constructed 2M dataset and the three evaluation benchmarks. These omissions make it impossible to assess whether the reported outperformance of CharTide-7B/8B over GPT-4o is reproducible or attributable to the proposed method.
Authors: We agree that these reporting omissions hinder reproducibility assessment. In the revised manuscript we have added: (i) explicit statements on baseline implementations (open-source models were fine-tuned on the identical 2M dataset for fair comparison; GPT-4o and GPT-5 results are reported from the original APIs without fine-tuning), (ii) statistical significance via paired t-tests (p < 0.05) across three independent runs with different random seeds, (iii) standard deviation and variance reported for all metrics, and (iv) data-leakage analysis (deduplication via exact match and semantic similarity thresholds confirming no overlap between the 2M dataset and the evaluation benchmarks). These changes are incorporated into the experimental setup and results sections. revision: yes
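The leakage check described in (iv) could be sketched as below. A token-level Jaccard similarity stands in for the embedding-based semantic similarity the authors mention, and the 0.9 threshold is an assumed value.

```python
# Illustrative data-leakage check: flag a benchmark sample if it exactly
# matches a training sample or exceeds a similarity threshold against one.
# Jaccard over whitespace tokens is a cheap stand-in for real embeddings.

def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def leaked(train_samples, bench_samples, sim_threshold=0.9):
    """Return benchmark samples flagged as overlapping the training set."""
    train_exact = set(train_samples)
    return [
        s
        for s in bench_samples
        if s in train_exact
        or any(jaccard(s, t) >= sim_threshold for t in train_samples)
    ]

train = ["plt.bar(x, y)", "plt.plot(t, v)"]
bench = ["plt.bar(x, y)", "ax.scatter(u, w)"]
print(leaked(train, bench))  # only the exact duplicate is flagged
```

An empty flagged list over the full 2M-sample/benchmark cross-product is what the rebuttal's "no overlap" claim would amount to.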
Circularity Check
No significant circularity; claims rest on external benchmarks and independent inspector
Full rationale
The paper's core claims rest on constructing a 2M-sample dataset via explicit decoupling into three streams and using a frozen Inspector for atomic QA-based rewards in RL. No equations, self-definitional loops, or fitted parameters are presented that reduce the reported outperformance to the inputs by construction. Experiments are conducted on external benchmarks (ChartMimic, Plot2Code, ChartX) against open baselines and GPT models, with the Inspector described as independent and frozen. No self-citation chains or ansatz smuggling are evident in the provided text that would make the derivation tautological. The absence of an internal ablation does not constitute circularity under the defined criteria.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Decoupling training into visual perception, pure-text code logic, and modality fusion streams enables better leveraging of multimodal supervision than homogeneous chart-code pairs.
- Domain assumption: A downstream model should yield consistent answers to identical visual queries across both original and generated charts (the information-invariance principle).
Invented entities (3)
- Tri-Perspective Tuning: no independent evidence
- Inquiry-Driven RL: no independent evidence
- Inspector: no independent evidence
Reference graph
Works this paper leans on
- [3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, and 45 others. 2025a. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.
- [4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, and 8 others. 2025b. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
- [7] Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, and Yu Qiao. 2025c. Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback. Preprint, arXiv:2507.20766.
- [8] DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Preprint, arXiv:2501.12948.
- [9] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, and 1 others. 2024. VLMEvalKit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198--11201.
- [11] Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Bohua Chen, Yi Su, Dongping Chen, Siyuan Wu, Xing Zhou, and 1 others. 2025. WebCode2M: A real-world dataset for code generation from webpage designs. In Proceedings of the ACM on Web Conference 2025, pages 1834--1845.
- [14] Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. 2025. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749.
- [18] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, and 1 others. 2024. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.
- [19] Junyoung Lim, Jaewoo Ahn, and Gunhee Kim. 2025. ChartCap: Mitigating hallucination of dense chart captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13171--13182.
- [20] Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263--2279.
- [21] Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, and 1 others. 2025. ChartQAPro: A more diverse and challenging benchmark for chart question answering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 191...
- [22] Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, and 1 others. 2025. MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365.
- [24] Tianhao Niu, Yiming Cui, Baoxin Wang, Xiao Xu, Xin Yao, Qingfu Zhu, Dayong Wu, Shijin Wang, and Wanxiang Che. 2025. Chart2Code53: A large-scale diverse and complex dataset for enhancing chart-to-code generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15839--15855.
- [25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, and 1 others. 2023. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
- [26] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [27] Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. 2025. Design2Code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pag...
- [29] Yuxuan Wan, Chaozheng Wang, Yi Dong, Wenxuan Wang, Shuqing Li, Yintong Huo, and Michael Lyu. 2025. Divide-and-conquer: Generating UI code from screenshots. Proceedings of the ACM on Software Engineering, 2(FSE):2099--2122.
- [30] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, and 1 others. 2025. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265.
- [31] Chengyue Wu, Zhixuan Liang, Yixiao Ge, Qiushan Guo, Zeyu Lu, Jiahao Wang, Ying Shan, and Ping Luo. 2025. Plot2Code: A comprehensive benchmark for evaluating multi-modal large language models in code generation from scientific plots. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3006--3028.
- [32] Renqiu Xia, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Botian Shi, Junchi Yan, and Bo Zhang. 2025. ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning. IEEE Transactions on Image Processing.
- [34] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025a. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [37] Sukmin Yun, Rusiru Thushara, Mohammad Bhat, Yongxin Wang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, and 1 others. 2024. Web2Code: A large-scale webpage-to-code dataset and evaluation framework for multimodal LLMs. Advances in Neural Information Processing Systems, 37:112134--112157.
- [39] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975--11986.
- [41] Zhihan Zhang, Yixin Cao, and Lizi Liao. 2025b. Boosting chart-to-code generation in MLLM via dual preference-guided refinement. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 11032--11041.