arxiv: 2605.07415 · v2 · pith:X7IS4N2Bnew · submitted 2026-05-08 · 💻 cs.CV · cs.CL

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

Pith reviewed 2026-05-11 01:44 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords chart referring expression grounding · multi-target referring · code-driven synthesis · pixel-accurate masks · instance segmentation · multimodal models · benchmark · chart understanding

The pith

A new benchmark and code-driven synthesis pipeline improve referring expression grounding on charts with multiple targets and diverse clues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a more complete benchmark for chart referring expression grounding that handles multiple target instances, various localization forms beyond bounding boxes, a range of referring clues, and multiple chart types. Existing multimodal models exhibit large performance gaps on this benchmark. The authors also present a code-driven synthesis pipeline that generates pixel-accurate instance masks by leveraging the alignment between plotting code and rendered chart elements, then train an instance segmentation model on these masks and integrate it into a multimodal grounding system.

Core claim

The authors introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. They further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. Training an instance segmentation model with the synthesized masks and integrating it into a general-purpose multimodal grounding framework produces a system that consistently outperforms baselines on the benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

What carries the argument

The code-driven synthesis pipeline, which exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities.
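To make the idea of code-to-pixel alignment concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline and does not use Matplotlib: the `Bar` primitive and the `render_with_masks` rasterizer are hypothetical stand-ins. The point it illustrates is that when the same program both draws an element and records the pixels it wrote, the instance mask is exact by construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bar:
    """Hypothetical plotting primitive: one axis-aligned bar, in pixel units."""
    label: str
    x0: int      # first column (inclusive)
    x1: int      # last column (exclusive)
    height: int  # number of rows, growing up from the bottom edge


def render_with_masks(bars, width, height):
    """Rasterize bars and record which pixels each bar produced.

    Returns (image, masks). image is a list of rows of 0/1 ink values with
    row 0 at the top; masks maps each bar label to its set of (row, col).
    """
    image = [[0] * width for _ in range(height)]
    masks = {}
    for bar in bars:
        pixels = set()
        for row in range(height - bar.height, height):
            for col in range(bar.x0, bar.x1):
                image[row][col] = 1      # draw the pixel
                pixels.add((row, col))   # and remember who drew it
        masks[bar.label] = pixels
    return image, masks


bars = [Bar("A", 1, 3, 2), Bar("B", 4, 6, 4), Bar("C", 7, 9, 1)]
image, masks = render_with_masks(bars, width=10, height=5)

# Every inked pixel in the image belongs to exactly one recorded instance.
ink = {(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v}
```

In this toy setting the union of the per-bar masks equals the inked pixels of the image, so no manual annotation is needed. A real renderer adds anti-aliasing, overlap, and clipping, which this sketch deliberately ignores.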

If this is right

  • Localization of fine chart elements can shift from bounding boxes to pixel-accurate masks.
  • Multi-instance target references become tractable in chart grounding tasks.
  • Performance improves across a wider variety of chart types and referring clue types.
  • The trained system transfers to grounding tasks on real charts drawn from ChartQA.
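The first point can be illustrated with a small numeric example. The sketch below, which uses hypothetical helper names and a made-up 20-pixel diagonal line, compares a thin element's pixel mask with its tight bounding box to show how little of the box the element actually occupies.

```python
def bounding_box(mask):
    """Tight box (row0, col0, row1, col1), inclusive, around a pixel set."""
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    return min(rows), min(cols), max(rows), max(cols)


def box_pixels(box):
    """All pixels covered by an inclusive box."""
    row0, col0, row1, col1 = box
    return {(r, c) for r in range(row0, row1 + 1) for c in range(col0, col1 + 1)}


def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# A one-pixel-wide diagonal segment, standing in for a plotted line series.
line_mask = {(i, i) for i in range(20)}
box = box_pixels(bounding_box(line_mask))
coverage = iou(line_mask, box)  # fraction of the box that is really the line
```

Here the box spans 400 pixels while the line occupies 20 of them, so the box overlaps the true element with an IoU of 0.05. For thin or curved chart elements, a box therefore says little about where the element actually is.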

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The synthesis approach could extend to other structured visualization types such as diagrams or infographics.
  • Improved chart grounding may benefit downstream applications like automated chart question answering.
  • The benchmark could act as a targeted test for spatial reasoning in vision-language models focused on data visualizations.

Load-bearing premise

The code-driven synthesis pipeline produces masks that faithfully match real rendered charts and the benchmark's distribution of clues and chart types reflects practical use cases.

What would settle it

The claim would be undermined if a pixel-level comparison between synthesized masks and manually annotated instances on real rendered charts showed systematic misalignment, or if the trained model showed no performance gain on a held-out real-chart test set with new clue distributions.
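A pixel-level comparison of this kind could be sketched as follows. The function names, the instance ids, and the toy data are all assumptions made for illustration; the paper does not describe such a check. The sketch reports the mean IoU across matched instances and the mean centroid offset, since a consistent non-zero offset is one simple sign of systematic misalignment.

```python
def mask_iou(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def centroid(mask):
    """Mean (row, col) of a non-empty pixel set."""
    n = len(mask)
    return (sum(r for r, _ in mask) / n, sum(c for _, c in mask) / n)


def fidelity_report(synth, annotated):
    """Compare masks instance by instance.

    Both arguments map an instance id to a non-empty set of (row, col) pixels.
    """
    shared = sorted(synth.keys() & annotated.keys())
    if not shared:
        raise ValueError("no instance ids in common")
    ious, shifts = [], []
    for key in shared:
        ious.append(mask_iou(synth[key], annotated[key]))
        (sr, sc), (ar, ac) = centroid(synth[key]), centroid(annotated[key])
        shifts.append((sr - ar, sc - ac))
    n = len(shared)
    return {
        "mean_iou": sum(ious) / n,
        "mean_shift": (sum(s[0] for s in shifts) / n, sum(s[1] for s in shifts) / n),
    }


def rect(r0, c0, r1, c1):
    """Pixel set for a rectangle with exclusive upper bounds."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Toy data: the annotated masks sit one pixel to the right of the synthetic ones.
synth = {"bar_0": rect(0, 0, 4, 4), "bar_1": rect(0, 6, 4, 10)}
annotated = {"bar_0": rect(0, 1, 4, 5), "bar_1": rect(0, 7, 4, 11)}
report = fidelity_report(synth, annotated)
```

In the toy data both bars have an IoU of 0.6 and the same one-column offset. That pattern suggests a consistent shift rather than random annotation noise.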

Figures

Figures reproduced from arXiv: 2605.07415 by Qingfu Zhu, Tianhao Niu, Wanxiang Che, Xuan Dong, Ziyu Han.

Figure 1
Comparison between ChartREG++ (c) and prior benchmarks. Prior work (a), such as RefChartQA [18] and ChartLens [15], evaluates attribution-aware chart question answering, while (b) ChartRef [17] evaluates the ability to link natural language to chart image elements. In these benchmarks, referred targets are mostly identified from textual/location cues in the expression or simple ranking cues in the data, a…
Figure 2
Distributions of dataset complexity and taxonomy. Top: (left) target image complexity measured by the number of lines in the corresponding plotting code; (middle) complexity of referring expressions measured by sentence length; (right) distribution of the number of referred target instances per query (shown only for multi-target samples). Bottom: (left) distribution of referring cue types; (right) distri…
Figure 3
Proposed pipeline for multi-granularity instance masks with fine-grained chart-element labels. We start from large-scale Matplotlib plotting code collected from the web or synthesized at scale, and trace each plotting API call to the rendered Artist objects together with their associated metadata. Using the Artist hierarchy, we construct a multi-granularity Artist-to-visual mapping that links code-level prim…
Figure 4
Qualitative cases between our method and existing methods. … bounding box so that the box covers the target point. This requires an extra step of imagining/predicting which point pair will form a covering box, which can fail even when the selected points are close to the target. In contrast, our method directly provides candidate point instances (as masks) on the polyline; therefore, the MLLM can select the ta…
Figure 5
Break-down analysis results. We conduct a more fine-grained quantitative analysis on different subsets of our benchmark using our model from Sec. 5.2. Results are shown in the supplementary material. Effect of chart complexity: we measure chart complexity by the plotting-code length. As shown in …
Figure 6
ChartLens modification example.
Figure 7
Data referring clue example.
Figure 8
Visual referring clue example. Panel labels: Subplot titles and positions; Legend entry and positions; Non-data axis tick values and positions; Text annotations directly on chart; Axis labels. Example expressions: "the plotted line series along with its markers representing Average Temperature in the legend"; "All vertical bars positioned above the x-tick label 'WSDMS'"; "all vertical bars in the upper panel of the figure"; "The polar bar sector directly i…"
Figure 9
Visual referring clue example.
Figure 10
Referring target element example. Panel labels: PolarLinePoints, Fill, Errorbar, Fill_between_density, Treemap, BoxPlot_Boxpatch.
Figure 11
Referring target element example.
Figure 12
Referring target element example.
Original abstract

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing benchmarks related to chart referring expression grounding remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
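The multi-target setting in the abstract raises the question of how a set of predicted masks is scored against a set of referred targets. The abstract does not state the benchmark's metric, so the following is only one plausible scheme, with hypothetical function names and an assumed IoU threshold of 0.5: greedily pair predictions with targets by descending IoU, then report set-level precision, recall, and F1.

```python
def mask_iou(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def match_targets(predicted, targets, threshold=0.5):
    """Greedy one-to-one matching of predicted masks to target masks.

    Returns (precision, recall, f1). The threshold is an assumption.
    """
    pairs = sorted(
        ((mask_iou(p, t), i, j)
         for i, p in enumerate(predicted)
         for j, t in enumerate(targets)),
        reverse=True,
    )
    used_pred, used_tgt = set(), set()
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_tgt:
            continue  # each mask may be matched at most once
        used_pred.add(i)
        used_tgt.add(j)
    hits = len(used_tgt)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(targets) if targets else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def rect(r0, c0, r1, c1):
    """Pixel set for a rectangle with exclusive upper bounds."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Three referred bars; the prediction recovers two of them exactly.
targets = [rect(0, 0, 4, 2), rect(0, 3, 4, 5), rect(0, 6, 4, 8)]
predicted = [rect(0, 0, 4, 2), rect(0, 6, 4, 8)]
precision, recall, f1 = match_targets(predicted, targets)
```

With two of three targets found and no spurious predictions, precision is 1.0, recall is 2/3, and F1 is 0.8. Greedy matching keeps the sketch short; an optimal assignment (for example the Hungarian algorithm) can give a different pairing when masks overlap.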

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces ChartREG++, a new benchmark for referring expression grounding on charts that supports multiple localization forms, multiple target instances, diverse referring clues, and a wide range of chart types. It describes a code-driven synthesis pipeline to create pixel-accurate instance masks by leveraging plotting programs, trains an instance segmentation model on these masks, and integrates it into a multimodal large model framework for grounding. The resulting system is claimed to outperform baselines on the proposed benchmark and to generalize well to a real-chart grounding benchmark derived from ChartQA.

Significance. If the synthetic data pipeline is shown to produce faithful representations of real charts and the generalization results are robust, this work could significantly advance the field of chart understanding in vision-language models by providing a more comprehensive benchmark and an improved grounding method. The approach of using code for precise mask generation is a promising direction for data synthesis in structured visual domains.

major comments (3)
  1. The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.
  2. The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.
  3. There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and supporting our claims. We address each major comment below and have made revisions to the manuscript where appropriate to strengthen the presentation.

Point-by-point responses
  1. Referee: The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.

    Authors: We agree that the abstract would benefit from more concrete details to allow readers to better assess our claims. In the revised manuscript, we have updated the abstract to include key quantitative results, such as the mIoU scores of our model versus baselines on the ChartREG++ benchmark and the generalization performance on the ChartQA-derived set. We have also expanded the error analysis section in the main paper to provide supporting evidence for the outperformance and generalization observations.
    revision: yes

  2. Referee: The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.

    Authors: This is a fair and important observation regarding the strength of our generalization results. Our code-driven pipeline generates pixel-accurate masks by construction for the synthetic charts through direct use of plotting primitives. For the real-chart generalization, we have added a new discussion subsection that includes qualitative comparisons of synthetic versus real chart visuals to support the similarity assumption. However, we do not provide quantitative fidelity metrics such as IoU against human-annotated masks on real charts, as this would require a separate annotation effort beyond the scope of the current work. We have accordingly moderated the language around the generalization claims to reflect this limitation.
    revision: partial

  3. Referee: There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.

    Authors: We appreciate this suggestion for greater transparency. In the revised manuscript, we have substantially expanded the relevant section (now including a dedicated subsection and accompanying table) to describe the conversion process: original ChartQA questions were adapted by identifying multi-element references and reformulating them as referring expressions with varied clues. We also report the distribution statistics for chart types (e.g., proportions of bar, line, pie, and scatter charts) and referring clue categories (textual, data-rank, positional, etc.) in the test set. These additions demonstrate alignment with diverse real-world chart scenarios.
    revision: yes
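The distribution statistics mentioned in the third response are simple to compute once each test sample carries a chart type and a clue category. The sketch below assumes a hypothetical record layout (`chart_type` and `clue` fields) and invented sample data. It does not reproduce any figures from the paper.

```python
from collections import Counter


def distribution(samples, field):
    """Share of each value of `field` across samples, most common first."""
    counts = Counter(sample[field] for sample in samples)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.most_common()}


# Invented records; field names and values are illustrative only.
samples = [
    {"chart_type": "bar", "clue": "textual"},
    {"chart_type": "bar", "clue": "data-rank"},
    {"chart_type": "line", "clue": "positional"},
    {"chart_type": "pie", "clue": "textual"},
]
chart_types = distribution(samples, "chart_type")
clues = distribution(samples, "clue")
```

Reporting both marginals, and ideally the joint table of chart type against clue category, would let a reader judge how far the test set departs from the synthetic benchmark's distribution.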

Circularity Check

0 steps flagged

No significant circularity; derivation relies on new benchmark construction and external generalization test.

full rationale

The paper introduces a novel benchmark and code-driven synthesis pipeline that generates instance masks from plotting programs, then trains and evaluates an instance segmentation model on this data. Performance is reported on the synthetic benchmark and a separately constructed ChartQA-derived real-chart set. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described chain. The central claims rest on empirical outperformance against baselines under the same evaluation protocol and on an external distribution, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the work relies on standard computer vision and multimodal techniques without introducing new postulated entities.

