Recognition: no theorem link
ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse Referring Clues and Multi-Target Referring
Pith reviewed 2026-05-11 01:44 UTC · model grok-4.3
The pith
A new benchmark and code-driven synthesis pipeline improve referring expression grounding on charts with multiple targets and diverse clues.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. They further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. Training an instance segmentation model on the synthesized masks and integrating it into a general-purpose multimodal grounding framework produces a system that consistently outperforms baselines on the benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
What carries the argument
The code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities.
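The abstract does not spell out how the pipeline is implemented, so the sketch below is only an illustration of the idea it names, assuming matplotlib as the plotting backend: because the plotting program knows exactly which artist draws which element, each primitive can be re-rendered in isolation on a transparent canvas and its non-transparent pixels read off as a pixel-accurate instance mask. The helper names (render_to_array, instance_masks) are hypothetical, not the authors' code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                               # off-screen rendering
import matplotlib.pyplot as plt


def render_to_array(fig):
    """Rasterize a figure and return an (H, W, 4) RGBA uint8 array."""
    fig.canvas.draw()
    return np.asarray(fig.canvas.buffer_rgba())


def instance_masks(fig, artists):
    """One boolean mask per artist, obtained by rendering it in isolation."""
    for ax in fig.axes:
        ax.set_xlim(ax.get_xlim())                  # freeze limits so renders stay aligned
        ax.set_ylim(ax.get_ylim())
        ax.set_axis_off()                           # drop ticks, spines, labels
        ax.patch.set_alpha(0.0)                     # transparent axes background
    fig.patch.set_alpha(0.0)                        # transparent figure background
    masks = []
    for target in artists:
        for a in artists:
            a.set_visible(a is target)              # show only the target primitive
        rgba = render_to_array(fig)
        masks.append(rgba[..., 3] > 0)              # non-transparent pixels form the mask
    for a in artists:
        a.set_visible(True)                         # restore the original chart
    return masks


if __name__ == "__main__":
    fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
    bars = ax.bar(["A", "B", "C"], [3, 5, 2])
    chart = render_to_array(fig)                    # the image a grounding model would see
    masks = instance_masks(fig, list(bars))         # pixel-accurate per-bar masks
    print(chart.shape, [int(m.sum()) for m in masks])
```

Whatever the authors' actual implementation, the premise is the same: the mask comes from the renderer itself, so no human annotation or detection step is needed to reach pixel accuracy on synthetic charts.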
If this is right
- Localization of fine chart elements can shift from bounding boxes to pixel-accurate masks (see the sketch after this list).
- Multi-instance target references become tractable in chart grounding tasks.
- Performance improves across a wider variety of chart types and referring clue types.
- The trained system transfers to grounding tasks on real charts drawn from ChartQA.
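On the first point, the precision gap is easy to make concrete: for a thin element such as a line-plot stroke, an axis-aligned box is almost entirely background. A minimal numpy sketch, using a hypothetical one-pixel diagonal stroke as a stand-in for a fine chart element:

```python
import numpy as np

# Hypothetical 60x60 raster containing a 1-pixel-wide diagonal stroke,
# standing in for a fine chart element such as a line-plot segment.
mask = np.zeros((60, 60), dtype=bool)
idx = np.arange(60)
mask[idx, idx] = True

# Tightest axis-aligned bounding box around the stroke.
ys, xs = np.nonzero(mask)
box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)

print(mask.sum() / box_area)   # ~0.017: over 98% of the box is background
```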
Where Pith is reading between the lines
- The synthesis approach could extend to other structured visualization types such as diagrams or infographics.
- Improved chart grounding may benefit downstream applications like automated chart question answering.
- The benchmark could act as a targeted test for spatial reasoning in vision-language models focused on data visualizations.
Load-bearing premise
The code-driven synthesis pipeline produces masks that faithfully match real rendered charts and the benchmark's distribution of clues and chart types reflects practical use cases.
What would settle it
A pixel-level comparison between synthesized masks and manually annotated real rendered chart instances would show systematic misalignment, or the trained model would show no performance gain on a held-out real-chart test set with new clue distributions.
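The first half of that test reduces to per-instance intersection-over-union between synthesized and human-annotated masks. A minimal sketch, assuming matched boolean masks at the same resolution; synth_masks and human_masks are hypothetical placeholders, not artifacts released with the paper:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                                  # both empty: count as perfect agreement
    return float(np.logical_and(pred, gt).sum()) / float(union)

# Hypothetical usage: synth_masks and human_masks would be matched lists of
# (H, W) boolean arrays for the same chart instances.
# ious = [mask_iou(p, g) for p, g in zip(synth_masks, human_masks)]
# print(np.mean(ious))                              # systematically low values => misalignment
```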
Original abstract
Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding-related benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume a single or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChartREG++, a new benchmark for referring expression grounding on charts that supports multiple localization forms, multiple target instances, diverse referring clues, and a wide range of chart types. It describes a code-driven synthesis pipeline to create pixel-accurate instance masks by leveraging plotting programs, trains an instance segmentation model on these masks, and integrates it into a multimodal large model framework for grounding. The resulting system is claimed to outperform baselines on the proposed benchmark and to generalize well to a real-chart grounding benchmark derived from ChartQA.
Significance. If the synthetic data pipeline is shown to produce faithful representations of real charts and the generalization results are robust, this work could significantly advance the field of chart understanding in vision-language models by providing a more comprehensive benchmark and an improved grounding method. The approach of using code for precise mask generation is a promising direction for data synthesis in structured visual domains.
Major comments (3)
- The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.
- The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.
- There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and supporting our claims. We address each major comment below and have made revisions to the manuscript where appropriate to strengthen the presentation.
Point-by-point responses
- Referee: The abstract claims consistent outperformance and good generalization but provides no specific metrics, baseline comparisons, error analysis, or quantitative results, which makes it difficult to evaluate the strength and reliability of these claims.
Authors: We agree that the abstract would benefit from more concrete details to allow readers to better assess our claims. In the revised manuscript, we have updated the abstract to include key quantitative results, such as the mIoU scores of our model versus baselines on the ChartREG++ benchmark and the generalization performance on the ChartQA-derived set. We have also expanded the error analysis section in the main paper to provide supporting evidence for the outperformance and generalization observations. revision: yes
- Referee: The central generalization claim to the ChartQA-derived benchmark depends on the unvalidated assumption that the synthetic masks from the plotting-code pipeline accurately match real rendered charts; no quantitative fidelity metrics (e.g., IoU with human annotations) are mentioned, which is load-bearing for the practical utility of the results.
Authors: This is a fair and important observation regarding the strength of our generalization results. Our code-driven pipeline generates pixel-accurate masks by construction for the synthetic charts through direct use of plotting primitives. For the real-chart generalization, we have added a new discussion subsection that includes qualitative comparisons of synthetic versus real chart visuals to support the similarity assumption. However, we do not provide quantitative fidelity metrics such as IoU against human-annotated masks on real charts, as this would require a separate annotation effort beyond the scope of the current work. We have accordingly moderated the language around the generalization claims to reflect this limitation. revision: partial
- Referee: There is insufficient detail on the process of converting ChartQA questions into multi-target referring expressions and on the distribution of clues and chart types in this test set, raising questions about whether it reflects real-world use cases and thus whether the generalization is meaningful.
Authors: We appreciate this suggestion for greater transparency. In the revised manuscript, we have substantially expanded the relevant section (now including a dedicated subsection and accompanying table) to describe the conversion process: original ChartQA questions were adapted by identifying multi-element references and reformulating them as referring expressions with varied clues. We also report the distribution statistics for chart types (e.g., proportions of bar, line, pie, and scatter charts) and referring clue categories (textual, data-rank, positional, etc.) in the test set. These additions demonstrate alignment with diverse real-world chart scenarios. revision: yes
Circularity Check
No significant circularity; derivation relies on new benchmark construction and external generalization test.
Full rationale
The paper introduces a novel benchmark and a code-driven synthesis pipeline that generates instance masks from plotting programs, then trains and evaluates an instance segmentation model on this data. Performance is reported on the synthetic benchmark and on a separately constructed ChartQA-derived real-chart set. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or the described chain. The central claims rest on empirical outperformance against baselines under the same evaluation protocol and on an external distribution, so the argument is anchored to external benchmarks rather than to its own constructions.