Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Dae Yon Hwang; Freda Shi; Jesse C. Cresswell; Yifan Jiang

arXiv: 2605.27311 · v1 · pith:EK5SK6XRnew · submitted 2026-05-26 · cs.CL · cs.CV

Pith reviewed 2026-06-29 17:42 UTC · model grok-4.3

classification: cs.CL, cs.CV

keywords: counterfactual charts, vision-language models, chart question answering, visual reasoning evaluation, generalization failures, shortcut detection, benchmark construction

The pith

Counterfactual charts show vision-language models often fail to generalize after correctly answering the original chart.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes counterfactual charts that keep the question fixed while varying the underlying chart and its answer. Chartographer reverse-engineers charts into code, validates the reconstruction, generates controlled variants, and computes new answers from the same logic. Testing on existing datasets reveals that high single-chart accuracy frequently hides inability to handle new visual reasoning demands. This approach distinguishes genuine visual reasoning from reliance on shortcuts or prior familiarity.

Core claim

Chartographer reverse-engineers charts into executable code, validates reconstruction fidelity, generates seed-controlled counterfactual variants, and derives new answers from that code. Evaluations of proprietary and open-source VLMs show that they often fail to generalize after answering the original chart correctly, with failures most common when the updated charts require novel visual reasoning pathways.

What carries the argument

Chartographer framework that reverse-engineers charts into executable code to generate and validate counterfactual variants while preserving question validity.
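The idea of a seed-controlled counterfactual family can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `ChartSpec`, `answer_question`, `make_variant`, and `build_family` are hypothetical names, the question is fixed to "Which category has the highest value?", and the rendering and validation steps of the actual pipeline are omitted. It is not the authors' implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartSpec:
    """Hypothetical stand-in for the semantic data behind a bar chart."""
    categories: tuple
    values: tuple


def answer_question(spec: ChartSpec) -> str:
    """Executable QA logic for the fixed question
    'Which category has the highest value?'."""
    best = max(range(len(spec.values)), key=lambda i: spec.values[i])
    return spec.categories[best]


def make_variant(spec: ChartSpec, seed: int) -> ChartSpec:
    """Resample the values with a seeded RNG so the variant is reproducible."""
    rng = random.Random(seed)
    # Sample distinct integers so the maximum is never tied.
    new_values = tuple(rng.sample(range(1, 101), k=len(spec.values)))
    return ChartSpec(spec.categories, new_values)


def build_family(spec: ChartSpec, seeds):
    """Return (seed, variant, answer) triples; the question never changes."""
    family = []
    for seed in seeds:
        variant = make_variant(spec, seed)
        family.append((seed, variant, answer_question(variant)))
    return family


original = ChartSpec(("A", "B", "C"), (3, 9, 5))
print(answer_question(original))  # answer for the original chart
for seed, variant, answer in build_family(original, seeds=[0, 1, 2]):
    print(seed, variant.values, answer)
```

Because the answer is recomputed from the same function for every variant, the ground truth stays consistent with the data by construction, and rerunning with the same seed reproduces the same variant.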

If this is right

Single-chart accuracy overestimates a model's ability to perform visual reasoning across chart variations.
Failures concentrate on charts that demand reasoning pathways different from those used on the original.
Existing chart QA benchmarks can be extended with counterfactual variants to isolate shortcut use from true generalization.
Robustness to chart variation can be established only by evaluation, and plausibly improved only by training, that explicitly varies the visual elements while holding the question fixed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Benchmark designers could embed counterfactual generation directly into dataset creation to reduce the chance of models learning fixed chart templates.
The same reverse-engineering step might be applied to other visual domains, such as diagrams or maps, to create analogous generalization tests.
If the performance gap persists across many model scales, it points to a structural limitation in how current VLMs encode chart structure rather than a simple data-coverage issue.

Load-bearing premise

The reverse-engineered code faithfully reconstructs the original chart and the generated variants preserve question validity while introducing genuinely new visual reasoning requirements without exploitable artifacts.

What would settle it

A controlled test where models that fail on counterfactual charts are re-tested on minimal chart edits that keep the same visual reasoning pathway and the same answer; consistent success on those minimal edits would indicate the original failures stem from something other than the intended new reasoning demand.

Figures

Figures reproduced from arXiv: 2605.27311 by Dae Yon Hwang, Freda Shi, Jesse C. Cresswell, Yifan Jiang.

**Figure 1.** The CHARTOGRAPHER pipeline for constructing counterfactual chart-question families. Starting from a source chart QA example, the pipeline reconstructs semantic chart data and chart-rendering code, iteratively revises the render until accepted, generates seed-controlled counterfactual variants, and recomputes answers with executable QA logic. The resulting families test whether success on an original chart-…

**Figure 2.** Example where GPT-5.4-mini solves the reconstruction (right), but not the original (left). The reconstruc…

**Figure 3.** Distribution of conditional variant accuracy

**Figure 4.** ChartMuseum CVA by reasoning type and model group. Bars show mean CVA for each group, with individual model scores overlaid. Reasoning-type names follow the ChartMuseum terminology.

… the chart, either source, or their combination. Figure 4 reports ChartMuseum CVA by category; Appendix B.4 gives the per-model breakdown. The best generalization occurs when it is possible to rely on text only, followed by V…

**Figure 5.** Claude Sonnet 4.6 responses for the first case

**Figure 6.** ChartMuseum failure case studies illustrating update outcomes. Each row shows the original chart, …

**Figure 7.** Additional examples where GPT-5.4-mini gives a correct answer on the reconstruction (right), but not the …

**Figure 8.** Additional reasoning traces for the first case

**Figure 9.** Counterfactual variant responses for the met…

**Figure 10.** CharXiv case study: trajectory tracking in risk curves.

**Figure 11.** CharXiv case study: trajectory comparison across economic subplots.

**Figure 12.** CharXiv case study: spatial step counting.

**Figure 13.** CharXiv case study: spatial line-marker binding.

**Figure 14.** CharXiv case study: legend-bound bar comparison.

**Figure 15.** CharXiv case study: paired bar matching.

**Figure 16.** CharXiv case study: thresholded bar counting.

**Figure 17.** CharXiv case study: thresholded scatter-point counting.

Original abstract

Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to correctly answer, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts where the chart-question task remains fixed, but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find failures are most prevalent when updated charts require novel visual reasoning pathways.
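The figure captions on this page refer to conditional variant accuracy (CVA) but the page does not define it. The sketch below assumes one plausible reading: for each chart family where the model answered the original correctly, take the fraction of counterfactual variants it also answered correctly, then average over those families. The function name, the record layout, and the averaging choice are all assumptions and may differ from the paper's definition.

```python
def conditional_variant_accuracy(families):
    """Mean per-family variant accuracy, restricted to families whose
    original chart was answered correctly (assumed reading of CVA).

    families: list of dicts with
        'original_correct': bool
        'variant_correct':  list of bool, one per counterfactual variant
    Returns None when no family qualifies.
    """
    per_family = []
    for fam in families:
        # Skip families the model failed on the original, or with no variants.
        if not fam["original_correct"] or not fam["variant_correct"]:
            continue
        hits = sum(fam["variant_correct"])
        per_family.append(hits / len(fam["variant_correct"]))
    if not per_family:
        return None
    return sum(per_family) / len(per_family)


results = [
    {"original_correct": True, "variant_correct": [True, False]},
    {"original_correct": False, "variant_correct": [True, True]},
    {"original_correct": True, "variant_correct": [True, True]},
]
print(conditional_variant_accuracy(results))  # 0.75
```

Conditioning on original correctness is what separates this from plain variant accuracy: the second family is ignored even though the model got both of its variants right.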

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Chartographer gives a workable pipeline for counterfactual chart QA via code reverse-engineering, but the claims about revealing real generalization failures rest on fidelity checks whose results are not reported.

The paper's core move is turning existing chart QA examples into code, then using that code to produce controlled variants where the question stays the same but the visual data and answer change. This is a direct way to test whether VLMs are actually doing the visual work or just pattern-matching the original chart.

What stands out is the end-to-end framing: reverse-engineer, validate the reconstruction, generate seed-controlled counterfactuals, and derive fresh answers from the executable logic. Applying the method to prior datasets and measuring how often models lose accuracy on the variants is a sensible diagnostic step. The reported pattern, that success on the original often does not carry over, especially when the variant forces a new visual operation, is in line with what other robustness tests have shown.

The soft spot is the missing quantitative backbone for the fidelity claim. The abstract and method sketch do not report numbers on how close the reconstructed chart is to the original (pixel error, data table match, or human judgment), nor do they show error rates on the answer derivation step. Without those, it is hard to rule out that some of the observed failures come from rendering quirks or answer leakage rather than the intended change in reasoning path. The stress-test note on this point holds up on the given description.
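One of the fidelity measures mentioned above, a data table match, could be computed along the following lines. This is a hedged sketch rather than anything reported in the paper: `table_match_rate`, the `(series, x)` key layout, and the 2% relative tolerance are assumptions chosen for illustration.

```python
import math


def table_match_rate(original, reconstructed, rel_tol=0.02):
    """Fraction of original data cells recovered within a relative tolerance.

    original, reconstructed: dicts mapping (series, x) -> numeric value.
    A cell missing from the reconstruction counts as a mismatch.
    """
    if not original:
        raise ValueError("original table is empty")
    matched = 0
    for key, value in original.items():
        if key not in reconstructed:
            continue
        if math.isclose(value, reconstructed[key], rel_tol=rel_tol, abs_tol=1e-9):
            matched += 1
    return matched / len(original)


source_table = {("A", 2020): 10.0, ("A", 2021): 20.0,
                ("B", 2020): 5.0, ("B", 2021): 8.0}
rebuilt_table = {("A", 2020): 10.1, ("A", 2021): 20.0, ("B", 2020): 6.0}
print(table_match_rate(source_table, rebuilt_table))  # 0.5
```

A score like this only covers the data layer. It says nothing about rendering differences such as colors, tick placement, or legend position, which would need an image-level comparison or human judgment.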

The work is aimed at groups building or auditing multimodal benchmarks. It is worth sending to peer review so the implementation details and any validation metrics can be checked; the idea itself is clear enough to merit that step even if the current evidence is still light.

Referee Report

2 major / 2 minor

Summary. The paper introduces Chartographer, a framework that reverse-engineers charts from existing QA datasets into executable code, validates reconstruction, generates seed-controlled counterfactual variants, and derives new answers via the code. It applies this to chart QA benchmarks and evaluates VLMs, claiming that counterfactuals expose generalization failures hidden by single-chart accuracy, with failures most common when variants require novel visual reasoning pathways.

Significance. If the fidelity of reverse-engineering and the substantive novelty of the introduced reasoning pathways can be demonstrated, the framework would strengthen evaluation of visual reasoning in VLMs by mitigating shortcut learning and background-knowledge exploitation. The executable-code approach for answer derivation is a positive feature for reproducibility and controlled variation.

major comments (2)

[Chartographer framework] The Chartographer framework description states that reconstruction fidelity is validated, but no quantitative metrics (e.g., data-value extraction accuracy, structural similarity index, or perceptual distance between original and reconstructed charts) are reported. This is load-bearing for the central claim, as unquantified fidelity leaves open the possibility that observed VLM failures arise from generation artifacts rather than reasoning limitations.
[Evaluation and results] The results on variation sensitivity claim that failures are most prevalent when updated charts require novel visual reasoning pathways, yet no metric, control condition, or analysis is provided to establish that the pathways are genuinely novel (as opposed to reordering of the same visual primitives). This directly affects the interpretation of the differential failure rates.

minor comments (2)

[Abstract] The abstract summarizes high-level findings without any numerical results (e.g., failure rates or variation-sensitivity scores); including one or two key statistics would better ground the claims.
Notation for the seed-controlled generation and the executable QA logic could be clarified with a small pseudocode example or explicit variable definitions to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we plan to make.

Referee: [Chartographer framework] The Chartographer framework description states that reconstruction fidelity is validated, but no quantitative metrics (e.g., data-value extraction accuracy, structural similarity index, or perceptual distance between original and reconstructed charts) are reported. This is load-bearing for the central claim, as unquantified fidelity leaves open the possibility that observed VLM failures arise from generation artifacts rather than reasoning limitations.

Authors: We agree that providing quantitative metrics for reconstruction fidelity is essential to substantiate our claims and rule out generation artifacts. Although the manuscript describes the validation process, specific numerical results were not included. In the revised manuscript, we will add quantitative metrics, including data-value extraction accuracy, structural similarity measures, and perceptual distance metrics, along with details on how they were computed.
Revision: yes

Referee: [Evaluation and results] The results on variation sensitivity claim that failures are most prevalent when updated charts require novel visual reasoning pathways, yet no metric, control condition, or analysis is provided to establish that the pathways are genuinely novel (as opposed to reordering of the same visual primitives). This directly affects the interpretation of the differential failure rates.

Authors: This is a valid concern. Our classification of novel visual reasoning pathways was based on examining the modifications to the underlying chart generation code and the resulting visual changes. We did not provide a formal quantitative metric or control condition in the original submission. We will expand the analysis section to include more detailed examples and a categorization of the types of changes (e.g., altering data distributions vs. changing chart types), which we believe supports the novelty claim. However, a fully objective metric for 'novel reasoning pathways' may require additional methodological development.
Revision: partial
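The categorization of change types proposed in the second response could be prototyped roughly as follows. This is only a sketch under stated assumptions: the record layout, the two tags (`chart_type`, `data`), and the helper names `change_tags` and `failure_rate_by_tag` are hypothetical, and the tags are a coarse proxy that does not by itself establish that a reasoning pathway is novel.

```python
from collections import defaultdict


def change_tags(original, variant):
    """Tag which aspects differ between an original chart and a variant."""
    tags = set()
    if original["chart_type"] != variant["chart_type"]:
        tags.add("chart_type")
    if original["values"] != variant["values"]:
        tags.add("data")
    return frozenset(tags)


def failure_rate_by_tag(records):
    """records: iterable of (original, variant, model_correct) triples.
    Returns {tag: share of variants carrying that tag that the model failed}."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for original, variant, correct in records:
        for tag in change_tags(original, variant):
            totals[tag] += 1
            if not correct:
                fails[tag] += 1
    return {tag: fails[tag] / totals[tag] for tag in totals}


base = {"chart_type": "bar", "values": [1, 2, 3]}
records = [
    (base, {"chart_type": "bar", "values": [3, 2, 1]}, True),
    (base, {"chart_type": "line", "values": [1, 2, 3]}, False),
    (base, {"chart_type": "line", "values": [3, 2, 1]}, False),
]
print(failure_rate_by_tag(records))
```

A variant that changes both aspects contributes to both tags, so the per-tag rates are not independent; a stricter analysis would compare variants that change exactly one aspect.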

Circularity Check

0 steps flagged

No circularity: methodological framework with no derivations or fitted predictions

The paper introduces Chartographer as a procedural framework for reverse-engineering charts into code, generating counterfactual variants, and evaluating VLMs on them. No equations, parameters, or predictive models are defined or fitted. The central claims rest on empirical observations from applying the framework to existing datasets rather than any self-referential derivation, self-citation chain, or renaming of known results. The method is self-contained as a generation and evaluation pipeline whose validity can be checked externally via fidelity metrics or human inspection, with no load-bearing step reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review contains no mathematical derivations, fitted parameters, or postulated entities.

