pith. machine review for the scientific record.

arxiv: 2510.23642 · v2 · submitted 2025-10-24 · 💻 cs.SE · cs.AI · cs.CL · cs.PL

VisCoder2: Building Multi-Language Visualization Coding Agents

Pith reviewed 2026-05-18 04:10 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.PL
keywords visualization code generation · multi-language coding agents · self-debugging LLMs · executable code benchmarks · supervised fine-tuning datasets · LLM for data visualization

The pith

A new 679K-sample dataset with multi-turn correction dialogues trains VisCoder2 models to generate executable visualization code across 12 languages, reaching 82.4 percent pass rate at 32B scale with self-debug.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds VisCode-Multi-679K, a supervised collection of 679K validated visualization samples spanning 12 programming languages and including multi-turn correction dialogues. It pairs this with VisPlotBench, a new evaluation set that tests both single-round generation and iterative self-debug protocols with rendered outputs. Models fine-tuned on the dataset, called VisCoder2, are shown to beat strong open-source baselines and approach GPT-4.1 performance. Further gains come from allowing the models to debug their own outputs over multiple rounds, with the largest lift appearing in symbolic and compiler-dependent languages. The work matters because earlier visualization agents were restricted to narrow language coverage and single-shot generation, limiting their use in realistic workflows.

Core claim

VisCode-Multi-679K supplies 679K executable visualization samples across 12 languages together with multi-turn correction dialogues; VisPlotBench supplies systematic tasks with execution and rendering checks for both initial generation and self-debug; VisCoder2 models trained on the dataset outperform open-source baselines, approach proprietary performance, and reach an overall 82.4 percent execution pass rate at the 32B scale once iterative self-debug is applied, with especially strong results in symbolic or compiler-dependent languages.
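The generate, execute, and self-debug protocol named in the claim can be sketched as below. This is a minimal illustration and not the paper's harness: `run_python`, `self_debug`, and the `generate(task, previous_code, error)` callback are hypothetical names, the sketch handles Python only, and the default of three correction rounds is taken from the figure captions reproduced on this page.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple

# Hypothetical model interface: (task, previous_code, error_message) -> new code.
Generator = Callable[[str, Optional[str], Optional[str]], str]


def run_python(code: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """Run code in a fresh interpreter and return (succeeded, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    return proc.returncode == 0, proc.stderr


def self_debug(task: str, generate: Generator, max_rounds: int = 3) -> Tuple[bool, int]:
    """Return (passed, rounds_used); rounds_used is 0 if the first attempt runs."""
    code = generate(task, None, None)  # initial, single-round generation
    for round_idx in range(max_rounds + 1):
        ok, err = run_python(code)
        if ok:
            return True, round_idx
        if round_idx == max_rounds:
            break
        # Feed the failing code and its error output back to the model.
        code = generate(task, code, err)
    return False, max_rounds
```

A result of `(True, 0)` corresponds to a successful initial generation, `(True, k)` to a failure resolved in round `k` of self-debug, and `(False, 3)` to a task that still fails after three rounds. Note that this loop only checks that the code exits cleanly; it says nothing about whether the rendered figure is correct.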

What carries the argument

The VisCode-Multi-679K dataset: validated, executable samples paired with multi-turn correction dialogues, which together supply supervised training signal for both initial code generation and self-correction across 12 languages.

If this is right

  • VisCoder2 models significantly outperform strong open-source baselines on VisPlotBench.
  • Iterative self-debug produces additional gains that push overall execution pass rate to 82.4 percent at the 32B scale.
  • Performance improvements are largest for symbolic and compiler-dependent languages.
  • The combination of large-scale supervised data and multi-round correction protocols enables models to approach the performance of proprietary systems such as GPT-4.1.
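To make the bookkeeping behind these bullets concrete, the sketch below tallies per-language execution pass rates before and after self-debug and reports the lift. The `Result` record and `pass_rates` function are hypothetical names, and the sample records in the usage note are made up for illustration; they are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Result(NamedTuple):
    language: str
    passed_initial: bool       # code ran on the first attempt
    passed_after_debug: bool   # code ran after at most N self-debug rounds


def pass_rates(results: Iterable[Result]) -> Dict[str, Dict[str, float]]:
    """Per-language pass rate for initial generation, with self-debug, and the lift."""
    counts = defaultdict(lambda: [0, 0, 0])  # [total, initial passes, final passes]
    for r in results:
        c = counts[r.language]
        c[0] += 1
        c[1] += r.passed_initial
        # A task that passed initially still counts as passing under self-debug.
        c[2] += r.passed_initial or r.passed_after_debug
    return {
        lang: {"initial": i / n, "self_debug": d / n, "lift": (d - i) / n}
        for lang, (n, i, d) in counts.items()
    }
```

With the made-up records `Result("Python", True, True)`, `Result("Python", False, True)`, `Result("LaTeX", False, False)`, and `Result("LaTeX", False, True)`, both languages show a lift of 0.5, with Python moving from 0.5 to 1.0 and LaTeX from 0.0 to 0.5.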

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dataset-plus-self-debug pattern could be applied to other code-generation domains that require rendering or execution feedback, such as data analysis scripts or web front-end code.
  • The multi-turn correction dialogues may reduce the need for external tool use in agent loops, lowering latency for interactive visualization tools.
  • If the benchmark tasks prove narrower than real user needs, the reported pass rates may overestimate performance on open-ended or domain-specific visualization requests.
  • Scaling the approach to larger models or mixing in human preference data could further close the remaining gap to the strongest closed models.

Load-bearing premise

That the VisCode-Multi-679K dataset and VisPlotBench tasks represent the distribution of real, practical multi-language visualization requests, and that benchmark execution pass rate predicts actual user utility.

What would settle it

A side-by-side test in which human users issue natural-language visualization requests in the 12 languages, the model produces code, and independent judges measure both execution success and visual match to intent without additional human editing.

Figures

Figures reproduced from arXiv: 2510.23642 by Fei Yuan, Jiaqi Deng, Jiarong Liang, Kai Zou, Ping Nie, Songcheng Cai, Wenhu Chen, Xiangchao Chen, Xiang Yue, Yuansheng Ni, Zhiheng Lyu.

Figure 1: Overview of VisCoder2. We present three components: 1)…
Figure 2: Data construction pipeline for VisCode-Multi-679K. We collect code blocks across twelve…
Figure 3: Overview of VisPlotBench. The benchmark covers eight visualization languages and con…
Figure 4: Distribution of fine-grained visualization types in VisPlotBench. Tasks are organized into…
Figure 5: Example of a successful generation in Python (ID: 1). The model generates code that executes successfully and produces a plot consistent with the ground truth.
Figure 6: Example of a failed generation in Python (ID: 69), where the initial code raises a ValueError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
Figure 7: Example of a failed generation in Python (ID: 115), where the initial code raises a AttributeError and is still failed after three rounds self-debug.
Figure 8: Example of a successful generation in Vega-Lite (ID: 18). The model generates code that executes successfully and produces a plot consistent with the ground truth.
Figure 9: Example of a failed generation in Vega-Lite (ID: 14), where the initial code raises a TypeError and is resolved in the second round of self-debug, resulting in a corrected plot that matches the intended semantics.
Figure 10: Example of a failed generation in Vega-Lite (ID: 50), where the initial code raises a TypeError and is still failed after three rounds self-debug.
Figure 11: Example of a successful generation in Lilypond (ID: 15). The model generates code that executes successfully and produces a plot consistent with the ground truth.
Figure 12: Example of a failed generation in Lilypond (ID: 13), where the initial code raises a SyntaxError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
Figure 13: Example of a failed generation in Lilypond (ID: 48), where the initial code raises a TypeError and is still failed after three rounds self-debug.
Figure 14: Example of a successful generation in Mermaid (ID: 19). The model generates code that executes successfully and produces a plot consistent with the ground truth.
Figure 15: Example of a failed generation in Mermaid (ID: 88), where the initial code raises a SyntaxError and is resolved in the second round of self-debug, resulting in a corrected plot that matches the intended semantics.
Figure 16: Example of a failed generation in Mermaid (ID: 80), where the initial code raises a AttributeError and is still failed after three rounds self-debug.
Figure 17: Example of a successful generation in SVG (ID: 52). The model generates code that executes successfully and produces a plot consistent with the ground truth.
Figure 18: Example of a failed generation in SVG (ID: 42), where the initial code raises a ExPatError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
Figure 19: Example of a failed generation in SVG (ID: 12), where the initial code raises a ParseError and is still failed after three rounds self-debug.
Figure 20: Example of a successful generation in LaTeX (ID: 33). The model generates code that executes successfully and produces a plot consistent with the ground truth.
Figure 21: Example of a failed generation in…
Figure 22: Example of a failed generation in LaTeX (ID: 97), where the initial code raises a NameError and is still failed after three rounds self-debug.
Figure 23: Example of a successful generation in Asymptote (ID: 31). The model generates code that executes successfully and produces a plot consistent with the ground truth.
Figure 24: Example of a failed generation in Asymptote (ID: 53), where the initial code raises a NameError and is resolved in the third round of self-debug, resulting in a corrected plot that matches the intended semantics.
Figure 25: Example of a failed generation in Asymptote (ID: 79), where the initial code raises a TypeError and is still failed after three rounds self-debug.
Figure 26: Example of a successful generation in HTML (ID: 6). The model generates code that executes successfully and produces a plot consistent with the ground truth.
Figure 27: Example of a failed generation in HTML (ID: 9), where the initial code raises a ImportError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
Figure 28: Example of a failed generation in HTML (ID: 85), where the initial code raises a TypeError and is still failed after three rounds self-debug.

Original abstract

Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4.1, with further gains from iterative self-debug, reaching 82.4% overall execution pass rate at the 32B scale, particularly in symbolic or compiler-dependent languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces three resources for visualization coding agents: VisCode-Multi-679K, a supervised dataset of 679K validated executable visualization samples with multi-turn correction dialogues across 12 programming languages; VisPlotBench, a benchmark featuring executable tasks, rendered outputs, and protocols for initial generation plus multi-round self-debug; and VisCoder2, a family of models trained on the dataset. Experiments claim that VisCoder2 significantly outperforms strong open-source baselines, approaches proprietary models such as GPT-4.1, and reaches an 82.4% overall execution pass rate at the 32B scale (with further gains from iterative self-debug), particularly in symbolic or compiler-dependent languages.

Significance. If the central claims hold, the large-scale multi-language dataset and benchmark would constitute useful community resources for training and evaluating visualization coding agents, addressing gaps in language coverage and iterative correction. The reported performance numbers, if robust, would indicate meaningful progress toward reliable multi-language agents. The work supplies concrete, reusable artifacts rather than purely theoretical advances.

major comments (1)
  1. [Abstract and VisPlotBench description] The headline result (82.4% execution pass rate at 32B with self-debug) and the claim of practical multi-language utility rest on execution success alone. The benchmark is described as featuring 'executable tasks, rendered outputs', yet the reported metric is solely execution pass rate; no automated visual similarity metric, perceptual hash, pixel-level comparison, or human review of rendered plots is mentioned. In a 12-language setting that includes symbolic and compiler-dependent languages, clean execution can still yield empty, mis-scaled, or semantically incorrect figures. This is load-bearing for the central claim that the numbers demonstrate effective visualization agents.
minor comments (1)
  1. [Abstract] The abstract would benefit from explicitly stating the number of tasks in VisPlotBench and the exact open-source baselines used for comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for acknowledging the potential value of our dataset, benchmark, and models. Below we respond point by point to the major comment on evaluation metrics.

Point-by-point responses
  1. Referee: [Abstract and VisPlotBench description] The headline result (82.4% execution pass rate at 32B with self-debug) and the claim of practical multi-language utility rest on execution success alone. The benchmark is described as featuring 'executable tasks, rendered outputs', yet the reported metric is solely execution pass rate; no automated visual similarity metric, perceptual hash, pixel-level comparison, or human review of rendered plots is mentioned. In a 12-language setting that includes symbolic and compiler-dependent languages, clean execution can still yield empty, mis-scaled, or semantically incorrect figures. This is load-bearing for the central claim that the numbers demonstrate effective visualization agents.

    Authors: We appreciate the referee highlighting this important aspect of our evaluation. VisPlotBench does include rendered outputs for each task to support potential visual inspection, as noted in the benchmark description. We selected execution pass rate as the primary automated metric because it provides an objective, scalable, and language-agnostic signal of whether the generated code runs to completion and produces a visualization artifact without runtime errors. This is especially relevant for our 12-language setting, where many symbolic or compiler-dependent languages (e.g., R, Julia, C++) have strict requirements for successful plotting calls. We acknowledge, however, that execution success alone does not guarantee visual quality, correct scaling, or semantic appropriateness of the resulting figure, and that empty or misleading plots remain possible. In the revised manuscript we will expand the VisPlotBench and evaluation sections to explicitly discuss this limitation, add qualitative examples of rendered outputs from VisCoder2 and baselines, and note that future extensions could incorporate automated visual similarity measures where cross-language standardization is feasible. We believe the reported execution rates still reflect meaningful progress on reliable multi-language code generation, but agree that visual fidelity is a valuable complementary dimension for fuller validation of visualization agents.

    revision: partial
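
The gap the referee describes, code that runs cleanly yet renders nothing useful, can be illustrated with a very small render check. This is a sketch under stated assumptions and not something the paper or rebuttal describes: `is_blank` and `pixel_agreement` are hypothetical helpers, images are assumed to be already decoded into rows of grayscale values (0 to 255), and raw pixel agreement is only a crude stand-in for a real visual similarity measure.

```python
from typing import Sequence

Grid = Sequence[Sequence[int]]  # rows of grayscale pixel values, 0..255


def is_blank(img: Grid, tolerance: int = 0) -> bool:
    """True if every pixel lies within `tolerance` of every other (e.g. an empty canvas)."""
    flat = [p for row in img for p in row]
    if not flat:
        return True
    return max(flat) - min(flat) <= tolerance


def pixel_agreement(a: Grid, b: Grid, tolerance: int = 0) -> float:
    """Fraction of positions where two same-sized images differ by at most `tolerance`."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have identical dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        return 1.0
    same = sum(
        1
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
        if abs(pa - pb) <= tolerance
    )
    return same / total
```

A check like `is_blank` would catch the "empty figure" failure mode at almost no cost, while anything resembling semantic correctness needs a stronger comparison against the reference render or human review.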

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with independent measurement

full rationale

The paper presents an empirical ML contribution: construction of VisCode-Multi-679K dataset, VisPlotBench benchmark, and training of VisCoder2 models, followed by direct measurement of execution pass rates on the benchmark. No mathematical derivation chain, fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations appear in the reported results. The headline 82.4% figure is a measured outcome on held-out tasks rather than a quantity obtained by construction from the training data or prior self-references. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; the central claims rest on the assumption that the collected dataset is high-quality and executable and that the benchmark tasks reflect real usage. No explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5769 in / 1113 out tokens · 27062 ms · 2026-05-18T04:10:34.851236+00:00 · methodology


