VisCoder2: Building Multi-Language Visualization Coding Agents
Pith reviewed 2026-05-18 04:10 UTC · model grok-4.3
The pith
A new 679K-sample dataset with multi-turn correction dialogues trains VisCoder2 models to generate executable visualization code across 12 languages, reaching 82.4 percent pass rate at 32B scale with self-debug.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VisCode-Multi-679K supplies 679K executable visualization samples across 12 languages together with multi-turn correction dialogues; VisPlotBench supplies systematic tasks with execution and rendering checks for both initial generation and self-debug; VisCoder2 models trained on the dataset outperform open-source baselines, approach proprietary performance, and reach an overall 82.4 percent execution pass rate at the 32B scale once iterative self-debug is applied, with especially strong results in symbolic or compiler-dependent languages.
What carries the argument
VisCode-Multi-679K dataset of validated executable samples paired with multi-turn correction dialogues that supplies supervised training for both initial code generation and self-correction across 12 languages.
If this is right
- VisCoder2 models significantly outperform strong open-source baselines on VisPlotBench.
- Iterative self-debug produces additional gains that push overall execution pass rate to 82.4 percent at the 32B scale.
- Performance improvements are largest for symbolic and compiler-dependent languages.
- The combination of large-scale supervised data and multi-round correction protocols enables models to approach the performance of proprietary systems such as GPT-4.1.
Where Pith is reading between the lines
- The same dataset-plus-self-debug pattern could be applied to other code-generation domains that require rendering or execution feedback, such as data analysis scripts or web front-end code.
- The multi-turn correction dialogues may reduce the need for external tool use in agent loops, lowering latency for interactive visualization tools.
- If the benchmark tasks prove narrower than real user needs, the reported pass rates may overestimate performance on open-ended or domain-specific visualization requests.
- Scaling the approach to larger models or mixing in human preference data could further close the remaining gap to the strongest closed models.
Load-bearing premise
The VisCode-Multi-679K dataset and VisPlotBench tasks represent the distribution of real practical multi-language visualization requests and that benchmark execution pass rate predicts actual user utility.
What would settle it
A side-by-side test in which human users issue natural-language visualization requests in the 12 languages, the model produces code, and independent judges measure both execution success and visual match to intent without additional human editing.
Figures
read the original abstract
Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4.1, with further gains from iterative self-debug, reaching 82.4% overall execution pass rate at the 32B scale, particularly in symbolic or compiler-dependent languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces three resources for visualization coding agents: VisCode-Multi-679K, a supervised dataset of 679K validated executable visualization samples with multi-turn correction dialogues across 12 programming languages; VisPlotBench, a benchmark featuring executable tasks, rendered outputs, and protocols for initial generation plus multi-round self-debug; and VisCoder2, a family of models trained on the dataset. Experiments claim that VisCoder2 significantly outperforms strong open-source baselines, approaches proprietary models such as GPT-4.1, and reaches an 82.4% overall execution pass rate at the 32B scale (with further gains from iterative self-debug), particularly in symbolic or compiler-dependent languages.
Significance. If the central claims hold, the large-scale multi-language dataset and benchmark would constitute useful community resources for training and evaluating visualization coding agents, addressing gaps in language coverage and iterative correction. The reported performance numbers, if robust, would indicate meaningful progress toward reliable multi-language agents. The work supplies concrete, reusable artifacts rather than purely theoretical advances.
major comments (1)
- [Abstract and VisPlotBench description] Abstract / VisPlotBench description: the headline result (82.4% execution pass rate at 32B with self-debug) and the claim of practical multi-language utility rest on execution success alone. The benchmark is described as featuring 'executable tasks, rendered outputs' yet the reported metric is solely execution pass rate; no automated visual similarity metric, perceptual hash, pixel-level comparison, or human review of rendered plots is mentioned. In a 12-language setting that includes symbolic and compiler-dependent languages, clean execution can still yield empty, mis-scaled, or semantically incorrect figures. This is load-bearing for the central claim that the numbers demonstrate effective visualization agents.
minor comments (1)
- [Abstract] The abstract would benefit from explicitly stating the number of tasks in VisPlotBench and the exact open-source baselines used for comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the potential value of our dataset, benchmark, and models. Below we respond point by point to the major comment on evaluation metrics.
read point-by-point responses
-
Referee: [Abstract and VisPlotBench description] Abstract / VisPlotBench description: the headline result (82.4% execution pass rate at 32B with self-debug) and the claim of practical multi-language utility rest on execution success alone. The benchmark is described as featuring 'executable tasks, rendered outputs' yet the reported metric is solely execution pass rate; no automated visual similarity metric, perceptual hash, pixel-level comparison, or human review of rendered plots is mentioned. In a 12-language setting that includes symbolic and compiler-dependent languages, clean execution can still yield empty, mis-scaled, or semantically incorrect figures. This is load-bearing for the central claim that the numbers demonstrate effective visualization agents.
Authors: We appreciate the referee highlighting this important aspect of our evaluation. VisPlotBench does include rendered outputs for each task to support potential visual inspection, as noted in the benchmark description. We selected execution pass rate as the primary automated metric because it provides an objective, scalable, and language-agnostic signal of whether the generated code runs to completion and produces a visualization artifact without runtime errors. This is especially relevant for our 12-language setting, where many symbolic or compiler-dependent languages (e.g., R, Julia, C++) have strict requirements for successful plotting calls. We acknowledge, however, that execution success alone does not guarantee visual quality, correct scaling, or semantic appropriateness of the resulting figure, and that empty or misleading plots remain possible. In the revised manuscript we will expand the VisPlotBench and evaluation sections to explicitly discuss this limitation, add qualitative examples of rendered outputs from VisCoder2 and baselines, and note that future extensions could incorporate automated visual similarity measures where cross-language standardization is feasible. We believe the reported execution rates still reflect meaningful progress on reliable multi-language code generation, but agree that visual fidelity is a valuable complementary dimension for fuller validation of visualization agents. revision: partial
Circularity Check
No circularity: empirical benchmark results with independent measurement
full rationale
The paper presents an empirical ML contribution: construction of VisCode-Multi-679K dataset, VisPlotBench benchmark, and training of VisCoder2 models, followed by direct measurement of execution pass rates on the benchmark. No mathematical derivation chain, fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations appear in the reported results. The headline 82.4% figure is a measured outcome on held-out tasks rather than a quantity obtained by construction from the training data or prior self-references. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experiments show that VisCoder2 significantly outperforms strong open-source baselines ... reaching 82.4% overall execution pass rate at the 32B scale, particularly in symbolic or compiler-dependent languages.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
VisCode-Multi-679K ... 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
URLhttps://arxiv.org/abs/2502.02737. Nan Chen, Yuge Zhang, Jiahang Xu, Kan Ren, and Yuqing Yang. Viseval: A benchmark for data visualization in the era of large language models.IEEE Transactions on Visualization and Com- puter Graphics, 2024. Xinyun Chen, Maxwell Lin, Nathanael Sch ¨arli, and Denny Zhou. Teaching large language models to self-debug.arXiv ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
URLhttps://arxiv.org/abs/2409.17146. Victor Dibia. Lida: A tool for automatic generation of grammar-agnostic visualizations and info- graphics using large language models.ArXiv preprint, abs/2303.02927, 2023. URLhttps: //arxiv.org/abs/2303.02927. 12 Nuno Fachada, Daniel Fernandes, Carlos M. Fernandes, Bruno D. Ferreira-Saraiva, and Jo ˜ao P. Matos-Carvalh...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Qwen2.5-Coder Technical Report
URLhttps://arxiv.org/abs/2409.12186. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023. Hao Li, Haoxiang Zhang, and Ahmed E Hassan. The rise of ai teammates in software engineering (se) 3.0: How auto...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Setup (state the language and rendering context, including any tools or libs implied)
-
[5]
- If the code is not data-driven: summarize the visible content of the image
Data/Visual Description - If the code is data-driven: summarize the inputs the code relies on and any shaping operations. - If the code is not data-driven: summarize the visible content of the image
-
[6]
Data Generation (the data-generation lines copied verbatim, or “None” if not applicable)
- [7]
-
[8]
Each part must start on a new line, numbered 1 through 5
Style Description (describe appearance and layout without naming language constructs). Each part must start on a new line, numbered 1 through 5. Use plain text only; no markdown. Code: {code} Image: Instruct Generation Prompt:CoSyn-400K Model: GPT-4.1 #FOR DATA-DRIVEN LANGUAGES #LANGUAGE= [Python,Vega-Lite,HTML,LilyPond,Mermaid] You are given a{LANGUAGE}c...
-
[9]
Setup (state the{LANGUAGE}and its rendering context, including any tools or specification frameworks implied)
-
[10]
Data/Content Description (summarize the input fields, entities, or content the code relies on, including any shaping or transformation operations)
- [11]
-
[12]
Each part must start on a new line, numbered 1 through 4
Style Description (describe appearance and layout without naming language constructs). Each part must start on a new line, numbered 1 through 4. Use plain text only; no markdown. Code: {code} Image: 19 Instruct Generation Prompt: CoSyn-400K Model: GPT-4.1 #FOR NONE DATA-DRIVEN LANGUAGES #LANGUAGE= [Asymptote,SVG] You are given a{LANGUAGE}code snippet that...
-
[13]
Setup (state the{LANGUAGE}and its rendering context)
-
[14]
Visual Elements (summarize the visible components of the image)
- [15]
-
[16]
Style Description (describe appearance and layout without naming language constructs). Each part must start on a new line, numbered 1 through 4. Use plain text only; no markdown. Code: {code} Image: 20 A.2 PROMPTUSED INVISPLOTBENCH Task & Style Description Generation Prompt: Model: GPT-4.1 #LANGUAGE= [Python,Vega-Lite,HTML,LilyPond,Mermaid,Asymptote,HTML,...
work page 2000
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.