Recognition: no theorem link
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse Referring Clues and Multi-Target Referring
Pith reviewed 2026-05-11 01:44 UTC · model grok-4.3
The pith
A new benchmark and code-driven synthesis pipeline improve referring expression grounding on charts with multiple targets and diverse clues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. They further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. Training an instance segmentation model on the synthesized masks and integrating it into a general-purpose multimodal grounding framework produces a system that consistently outperforms baselines on the benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
What carries the argument
The code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities.
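The abstract does not spell out how the pipeline is implemented, so the sketch below is only an illustration of the idea it names, assuming matplotlib as the plotting backend: because the plotting program knows exactly which artist draws which element, each primitive can be re-rendered in isolation on a transparent canvas and its non-transparent pixels read off as a pixel-accurate instance mask. The helper names (render_to_array, instance_masks) are hypothetical, not the authors' code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                               # off-screen rendering
import matplotlib.pyplot as plt


def render_to_array(fig):
    """Rasterize a figure and return an (H, W, 4) RGBA uint8 array."""
    fig.canvas.draw()
    return np.asarray(fig.canvas.buffer_rgba())


def instance_masks(fig, artists):
    """One boolean mask per artist, obtained by rendering it in isolation."""
    for ax in fig.axes:
        ax.set_xlim(ax.get_xlim())                  # freeze limits so renders stay aligned
        ax.set_ylim(ax.get_ylim())
        ax.set_axis_off()                           # drop ticks, spines, labels
        ax.patch.set_alpha(0.0)                     # transparent axes background
    fig.patch.set_alpha(0.0)                        # transparent figure background
    masks = []
    for target in artists:
        for a in artists:
            a.set_visible(a is target)              # show only the target primitive
        rgba = render_to_array(fig)
        masks.append(rgba[..., 3] > 0)              # non-transparent pixels form the mask
    for a in artists:
        a.set_visible(True)                         # restore the original chart
    return masks


if __name__ == "__main__":
    fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
    bars = ax.bar(["A", "B", "C"], [3, 5, 2])
    chart = render_to_array(fig)                    # the image a grounding model would see
    masks = instance_masks(fig, list(bars))         # pixel-accurate per-bar masks
    print(chart.shape, [int(m.sum()) for m in masks])
```

Whatever the authors' actual implementation, the premise is the same: the mask comes from the renderer itself, so no human annotation or detection step is needed to reach pixel accuracy on synthetic charts.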
If this is right
- Localization of fine chart elements can shift from bounding boxes to pixel-accurate masks (see the sketch after this list).
- Multi-instance target references become tractable in chart grounding tasks.
- Performance improves across a wider variety of chart types and referring clue types.
- The trained system transfers to grounding tasks on real charts drawn from ChartQA.
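On the first point, the precision gap is easy to make concrete: for a thin element such as a line-plot stroke, an axis-aligned box is almost entirely background. A minimal numpy sketch, using a hypothetical one-pixel diagonal stroke as a stand-in for a fine chart element:

```python
import numpy as np

# Hypothetical 60x60 raster containing a 1-pixel-wide diagonal stroke,
# standing in for a fine chart element such as a line-plot segment.
mask = np.zeros((60, 60), dtype=bool)
idx = np.arange(60)
mask[idx, idx] = True

# Tightest axis-aligned bounding box around the stroke.
ys, xs = np.nonzero(mask)
box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)

print(mask.sum() / box_area)   # ~0.017: over 98% of the box is background
```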
Where Pith is reading between the lines
- The synthesis approach could extend to other structured visualization types such as diagrams or infographics.
- Improved chart grounding may benefit downstream applications like automated chart question answering.
- The benchmark could act as a targeted test for spatial reasoning in vision-language models focused on data visualizations.
Load-bearing premise
The code-driven synthesis pipeline produces masks that faithfully match real rendered charts and the benchmark's distribution of clues and chart types reflects practical use cases.
What would settle it
A pixel-level comparison between synthesized masks and manually annotated real rendered chart instances would show systematic misalignment, or the trained model would show no performance gain on a held-out real-chart test set with new clue distributions.
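The first half of that test reduces to per-instance intersection-over-union between synthesized and human-annotated masks. A minimal sketch, assuming matched boolean masks at the same resolution; synth_masks and human_masks are hypothetical placeholders, not artifacts released with the paper:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                                  # both empty: count as perfect agreement
    return float(np.logical_and(pred, gt).sum()) / float(union)

# Hypothetical usage: synth_masks and human_masks would be matched lists of
# (H, W) boolean arrays for the same chart instances.
# ious = [mask_iou(p, g) for p, g in zip(synth_masks, human_masks)]
# print(np.mean(ious))                              # systematically low values => misalignment
```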
Original abstract
Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding-related benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume a single or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChartREG++, a new benchmark for referring expression grounding on charts that supports multiple localization forms, multiple target instances, diverse referring clues, and a wide range of chart types. It describes a code-driven synthesis pipeline to create pixel-accurate instance masks by leveraging plotting programs, trains an instance segmentation model on these masks, and integrates it into a multimodal large model framework for grounding. The resulting system is claimed to outperform baselines on the proposed benchmark and to generalize well to a real-chart grounding benchmark derived from ChartQA.
Significance. If the synthetic data pipeline is shown to produce faithful representations of real charts and the generalization results are robust, this work could significantly advance the field of chart understanding in vision-language models by providing a more comprehensive benchmark and an improved grounding method. The approach of using code for precise mask generation is a promising direction for data synthesis in structured visual domains.
Major comments (3)
- The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.
- The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.
- There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and supporting our claims. We address each major comment below and have made revisions to the manuscript where appropriate to strengthen the presentation.
Point-by-point responses
- Referee: The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.
Authors: We agree that the abstract would benefit from more concrete details to allow readers to better assess our claims. In the revised manuscript, we have updated the abstract to include key quantitative results, such as the mIoU scores of our model versus baselines on the ChartREG++ benchmark and the generalization performance on the ChartQA-derived set. We have also expanded the error analysis section in the main paper to provide supporting evidence for the outperformance and generalization observations. revision: yes
- Referee: The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.
Authors: This is a fair and important observation regarding the strength of our generalization results. Our code-driven pipeline generates pixel-accurate masks by construction for the synthetic charts through direct use of plotting primitives. For the real-chart generalization, we have added a new discussion subsection that includes qualitative comparisons of synthetic versus real chart visuals to support the similarity assumption. However, we do not provide quantitative fidelity metrics such as IoU against human-annotated masks on real charts, as this would require a separate annotation effort beyond the scope of the current work. We have accordingly moderated the language around the generalization claims to reflect this limitation. revision: partial
- Referee: There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.
Authors: We appreciate this suggestion for greater transparency. In the revised manuscript, we have substantially expanded the relevant section (now including a dedicated subsection and accompanying table) to describe the conversion process: original ChartQA questions were adapted by identifying multi-element references and reformulating them as referring expressions with varied clues. We also report the distribution statistics for chart types (e.g., proportions of bar, line, pie, and scatter charts) and referring clue categories (textual, data-rank, positional, etc.) in the test set. These additions demonstrate alignment with diverse real-world chart scenarios. revision: yes
Circularity Check
No significant circularity; derivation relies on new benchmark construction and external generalization test.
Full rationale
The paper introduces a novel benchmark and a code-driven synthesis pipeline that generates instance masks from plotting programs, then trains and evaluates an instance segmentation model on this data. Performance is reported on the synthetic benchmark and on a separately constructed ChartQA-derived real-chart set. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or the described chain. The central claims rest on empirical outperformance against baselines under the same evaluation protocol and on an external distribution, so the argument is anchored to external benchmarks rather than to its own constructions.