ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

Qingfu Zhu; Tianhao Niu; Wanxiang Che; Xuan Dong; Ziyu Han

arxiv: 2605.07415 · v2 · pith:X7IS4N2Bnew · submitted 2026-05-08 · 💻 cs.CV · cs.CL

Tianhao Niu, Ziyu Han, Xuan Dong, Qingfu Zhu, Wanxiang Che

Pith reviewed 2026-05-11 01:44 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords chart referring expression grounding · multi-target referring · code-driven synthesis · pixel accurate masks · instance segmentation · multimodal models · benchmark · chart understanding

The pith

A new benchmark and code-driven synthesis pipeline improve referring expression grounding on charts with multiple targets and diverse clues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a more complete benchmark for chart referring expression grounding that handles multiple target instances, various localization forms beyond bounding boxes, a range of referring clues, and multiple chart types. Existing multimodal models exhibit large performance gaps on this benchmark. The authors also present a code-driven synthesis pipeline that generates pixel-accurate instance masks by leveraging the alignment between plotting code and rendered chart elements, then train an instance segmentation model on these masks and integrate it into a multimodal grounding system.

Core claim

The authors introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues and diverse chart types. They further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. Training an instance segmentation model with the synthesized masks and integrating it into a general-purpose multimodal grounding framework produces a system that consistently outperforms baselines on the benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

What carries the argument

The code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel accurate instance masks across chart element types and granularities.
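
To make the code-to-pixel alignment concrete, here is a minimal sketch, not the authors' pipeline. The Figure 3 caption indicates that the pipeline traces Matplotlib plotting calls to rendered Artist objects; the sketch below assumes a much simpler strategy (render each Artist alone on a white background and keep the non-white pixels). The helper name `artist_masks` and the toy bar chart are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless raster backend, no display needed
import matplotlib.pyplot as plt
import numpy as np


def artist_masks(fig, artists):
    """Return one boolean mask per artist by rendering each artist alone.

    Assumption (ours, for illustration): the figure background is opaque
    white and nothing else is drawn, so every non-white pixel belongs to
    the single visible artist. Occlusion between artists is ignored.
    """
    masks = []
    for target in artists:
        for artist in artists:
            artist.set_visible(artist is target)
        fig.canvas.draw()
        rgba = np.asarray(fig.canvas.buffer_rgba()).copy()
        masks.append((rgba[..., :3] != 255).any(axis=-1))
    for artist in artists:
        artist.set_visible(True)  # restore the original chart
    return masks


# Toy chart: three bars of increasing height.
fig, ax = plt.subplots(figsize=(4, 3), dpi=100, facecolor="white")
bars = ax.bar([0, 1, 2], [1.0, 2.0, 3.0], width=0.8, color="tab:blue")
ax.set_axis_off()  # hide ticks, spines and the axes background

masks = artist_masks(fig, list(bars))
areas = [int(m.sum()) for m in masks]  # pixel count per bar instance
print(areas)
```

Because each mask comes from the renderer itself, no manual annotation is involved. A production pipeline would also need to handle overlapping elements, text, and composite artists, which this sketch does not attempt.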

If this is right

Localization of fine chart elements can shift from bounding boxes to pixel-accurate masks.
Multi-instance target references become tractable in chart grounding tasks.
Performance improves across a wider variety of chart types and referring clue types.
The trained system transfers to grounding tasks on real charts drawn from ChartQA.
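
The first consequence above, moving from boxes to masks, can be illustrated with a small worked example. This is a sketch under our own assumptions: pixel sets stand in for chart elements, and `box_fill_ratio` is a hypothetical helper measuring how much of an element's tightest bounding box the element actually covers.

```python
def bounding_box(pixels):
    """Tightest axis-aligned box (row0, col0, row1, col1) around a pixel set."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)


def box_fill_ratio(pixels):
    """Fraction of the bounding box that is covered by the element itself."""
    r0, c0, r1, c1 = bounding_box(pixels)
    box_area = (r1 - r0 + 1) * (c1 - c0 + 1)
    return len(pixels) / box_area


# A thin diagonal line segment, one pixel wide, 50 pixels long.
diagonal_line = {(i, i) for i in range(50)}
# A solid vertical bar, 40 rows by 10 columns.
solid_bar = {(r, c) for r in range(40) for c in range(10)}

print(box_fill_ratio(diagonal_line))  # 0.02
print(box_fill_ratio(solid_bar))      # 1.0
```

For the solid bar the box is an exact description, while for the thin line 98% of the box is background, which is the kind of imprecision a pixel mask avoids.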

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The synthesis approach could extend to other structured visualization types such as diagrams or infographics.
Improved chart grounding may benefit downstream applications like automated chart question answering.
The benchmark could act as a targeted test for spatial reasoning in vision-language models focused on data visualizations.

Load-bearing premise

The code-driven synthesis pipeline produces masks that faithfully match real rendered charts and the benchmark's distribution of clues and chart types reflects practical use cases.

What would settle it

The premise would be undermined if a pixel-level comparison between synthesized masks and manually annotated real rendered chart instances showed systematic misalignment, or if the trained model showed no performance gain on a held-out real-chart test set with new clue distributions.
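
One way such a pixel-level comparison could be scored is intersection-over-union (IoU) between each synthesized mask and its human-annotated counterpart. The sketch below is illustrative only: the tiny masks are made up, and the function names `mask_iou` and `mean_iou` are our own, not taken from the paper.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equally sized lists of rows."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (synthesized, annotated) mask pairs."""
    return sum(mask_iou(s, h) for s, h in pairs) / len(pairs)


synth = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
human = [[0, 1, 1, 1],   # annotator included one extra pixel
         [0, 1, 1, 0]]
shifted = [[0, 0, 1, 1],  # same shape, shifted one column right
           [0, 0, 1, 1]]

print(mask_iou(synth, human))    # 0.8
print(mask_iou(synth, shifted))  # 2 shared pixels out of 6, about 0.33
print(mean_iou([(synth, human), (synth, shifted)]))
```

A consistently low mean IoU, or a consistent directional shift like the second pair, would be the "systematic misalignment" described above.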

Figures

Figures reproduced from arXiv: 2605.07415 by Qingfu Zhu, Tianhao Niu, Wanxiang Che, Xuan Dong, Ziyu Han.

**Figure 1.** Comparison between ChartREG++ (c) and prior benchmarks. Prior work (a), such as RefChartQA [18] and ChartLens [15], evaluates attribution-aware chart question answering, while (b) ChartRef [17] evaluates the ability to link natural language to chart image elements. In these benchmarks, referred targets are mostly identified from textual/location cues in the expression or simple ranking cues in the data, a…

**Figure 2.** Distributions of dataset complexity and taxonomy. Top: (left) target image complexity measured by the number of lines in the corresponding plotting code; (middle) complexity of referring expressions measured by sentence length; (right) distribution of the number of referred target instances per query (shown only for multi-target samples). Bottom: (left) distribution of referring cue types; (right) distri…

**Figure 3.** Proposed pipeline for multi-granularity instance masks with fine-grained chart-element labels. We start from large-scale Matplotlib plotting code collected from the web or synthesized at scale, and trace each plotting API call to the rendered Artist objects together with their associated metadata. Using the Artist hierarchy, we construct a multi-granularity Artist-to-visual mapping that links code-level prim…

**Figure 4.** Qualitative cases between our method and existing methods. … bounding box so that the box covers the target point. This requires an extra step of imagining or predicting which point pair will form a covering box, which can fail even when the selected points are close to the target. In contrast, our method directly provides candidate point instances (as masks) on the polyline, so the MLLM can select the ta…

**Figure 5.** Breakdown analysis results. We conduct a more fine-grained quantitative analysis on different subsets of our benchmark using our model in Sec. 5.2. Results are shown in the supplementary material. Effect of chart complexity: we measure chart complexity by the plotting-code length. As shown in …

**Figure 6.** ChartLens modification example. … targets required by the question. As shown in …

**Figure 7.** Data referring clue example.

**Figure 8.** Visual referring clue example. Clue categories shown: subplot titles and positions; legend entries and positions; non-data axis tick values and positions; text annotations directly on the chart; axis labels. Example expressions: "the plotted line series along with its markers representing Average Temperature in the legend"; "All vertical bars positioned above the x-tick label 'WSDMS'"; "all vertical bars in the upper panel of the figure"; "The polar bar sector directly i…"

**Figure 9.** Visual referring clue example.

**Figure 10.** Referring target element example. Element types shown: PolarLinePoints, Fill, Errorbar, Fill_between_density, Treemap, BoxPlot_Boxpatch.

**Figure 11.** Referring target element example.

**Figure 12.** Referring target element example.

Original abstract

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding-related benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a benchmark and code-driven synthesis pipeline for multi-target chart grounding with diverse cues, but the outperformance and real-chart generalization claims lack any supporting numbers or fidelity checks.

The core advance is a benchmark that explicitly targets gaps in prior chart referring expression work: multi-target references, pixel masks instead of just boxes, language clues beyond text or rank, and broader chart types. The synthesis pipeline that derives instance masks from plotting code is a practical move, since it uses the built-in alignment between code and rendered elements to get accurate labels without manual work. That part could save effort and improve label quality for chart primitives like bars or legends. Training a segmentation model on the synthetic data and folding it into a multimodal grounding setup is a reasonable integration step. The abstract notes that existing chart benchmarks are narrow, and this one tries to widen them systematically.

The stress-test concern lands: the generalization claim to a ChartQA-derived real-chart set rests on the untested assumption that synthetic masks and clue distributions transfer without big domain shift. No IoU checks against human annotations on real charts, no details on how ChartQA questions were turned into multi-target expressions, and no metrics at all for the claimed outperformance. Without those, it's difficult to tell how much the results actually move the needle.

The work stays focused on the technical gaps it identifies and does not overclaim in the abstract itself. This is the kind of paper that would interest people building multimodal models for data visualization or structured images. The benchmark construction alone could be worth examining if the details hold up. I would send it for peer review so the methods and results can be checked properly.

Referee Report

3 major / 0 minor

Summary. The paper introduces ChartREG++, a new benchmark for referring expression grounding on charts that supports multiple localization forms, multiple target instances, diverse referring clues, and a wide range of chart types. It describes a code-driven synthesis pipeline to create pixel-accurate instance masks by leveraging plotting programs, trains an instance segmentation model on these masks, and integrates it into a multimodal large model framework for grounding. The resulting system is claimed to outperform baselines on the proposed benchmark and to generalize well to a real-chart grounding benchmark derived from ChartQA.

Significance. If the synthetic data pipeline is shown to produce faithful representations of real charts and the generalization results are robust, this work could significantly advance the field of chart understanding in vision-language models by providing a more comprehensive benchmark and an improved grounding method. The approach of using code for precise mask generation is a promising direction for data synthesis in structured visual domains.

major comments (3)

The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.
The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.
There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.
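
Since the comments above turn on missing metrics, it may help to see what a multi-target score could look like. The following is a hypothetical sketch and not the paper's evaluation protocol, which this page does not describe: predicted and ground-truth instances are pixel sets, pairs are matched greedily by IoU, and the set-level F1 is reported. The function names and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """IoU of two instances represented as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def multi_target_f1(predicted, ground_truth, threshold=0.5):
    """Greedy one-to-one matching by IoU, then F1 over matched instances."""
    candidates = sorted(
        ((iou(p, g), i, j)
         for i, p in enumerate(predicted)
         for j, g in enumerate(ground_truth)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining pairs overlap too little to count
        if i in used_pred or j in used_gt:
            continue  # each instance may be matched only once
        used_pred.add(i)
        used_gt.add(j)
    matched = len(used_gt)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)
    recall = matched / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


def bar(col):
    """Toy bar instance: 10 rows tall, 3 columns wide, starting at `col`."""
    return {(r, c) for r in range(10) for c in range(col, col + 3)}


ground_truth = [bar(0), bar(5), bar(10)]  # expression refers to three bars
predicted = [bar(0), bar(10)]             # model found only two of them

print(multi_target_f1(predicted, ground_truth))  # about 0.8
```

A score of this kind penalizes both missed targets and spurious ones, which a single-box accuracy metric cannot express.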

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and supporting our claims. We address each major comment below and have made revisions to the manuscript where appropriate to strengthen the presentation.

Referee: The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.

Authors: We agree that the abstract would benefit from more concrete details to allow readers to better assess our claims. In the revised manuscript, we have updated the abstract to include key quantitative results, such as the mIoU scores of our model versus baselines on the ChartREG++ benchmark and the generalization performance on the ChartQA-derived set. We have also expanded the error analysis section in the main paper to provide supporting evidence for the outperformance and generalization observations.
revision: yes

Referee: The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.

Authors: This is a fair and important observation regarding the strength of our generalization results. Our code-driven pipeline generates pixel-accurate masks by construction for the synthetic charts through direct use of plotting primitives. For the real-chart generalization, we have added a new discussion subsection that includes qualitative comparisons of synthetic versus real chart visuals to support the similarity assumption. However, we do not provide quantitative fidelity metrics such as IoU against human-annotated masks on real charts, as this would require a separate annotation effort beyond the scope of the current work. We have accordingly moderated the language around the generalization claims to reflect this limitation.
revision: partial

Referee: There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.

Authors: We appreciate this suggestion for greater transparency. In the revised manuscript, we have substantially expanded the relevant section (now including a dedicated subsection and accompanying table) to describe the conversion process: original ChartQA questions were adapted by identifying multi-element references and reformulating them as referring expressions with varied clues. We also report the distribution statistics for chart types (e.g., proportions of bar, line, pie, and scatter charts) and referring clue categories (textual, data-rank, positional, etc.) in the test set. These additions demonstrate alignment with diverse real-world chart scenarios.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on new benchmark construction and external generalization test.

The paper introduces a novel benchmark and code-driven synthesis pipeline that generates instance masks from plotting programs, then trains and evaluates an instance segmentation model on this data. Performance is reported on the synthetic benchmark and a separately constructed ChartQA-derived real-chart set. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described chain. The central claims rest on empirical outperformance against baselines under the same evaluation protocol and on an external distribution, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the work relies on standard computer vision and multimodal techniques without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5526 in / 1173 out tokens · 39811 ms · 2026-05-11T01:44:53.476341+00:00 · methodology
