ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

Fan Liu; Hao Liu; Kun Wang; Liqiang Nie; Ruping Cao; Yupeng Hu; Zhiran Li

arxiv: 2606.10640 · v1 · pith:ZVWRWE4Knew · submitted 2026-06-09 · 💻 cs.CV

ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

Hao Liu , Ruping Cao , Kun Wang , Zhiran Li , Fan Liu , Yupeng Hu , Liqiang Nie This is my paper

Pith reviewed 2026-06-27 13:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords chart understandingdata extractionsummary generationdual-branch frameworkCSV correctionOCR groundingfactual refinementchallenge solution

0 comments

The pith

A dual-branch system corrects chart data extraction and refines factual summaries from images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ChartLens to solve the chart understanding task of recovering structured data and producing faithful natural-language summaries from chart images. It introduces two modules that work together: one verifies and fixes the extracted CSV structure using awareness of chart layout, while the other refines generated summaries by keeping key textual and numerical details from the image via OCR. The authors combine this with model adaptation and evidence grounding, reporting that the approach yields more reliable data recovery and fewer factual errors in summaries. On the challenge test set this produces an overall score of 69.10 and first place in Track 2.

Core claim

ChartLens consists of a Structure-Aware CSV Verification and Correction module that checks and fixes extracted chart data for structural consistency and a Text-Retention-Guided Summary Refinement module that preserves critical evidence during summary generation; together with model adaptation and OCR grounding these steps improve both structured data accuracy and summary factuality on chart images.

What carries the argument

Dual-branch framework with SAVC (Structure-Aware CSV Verification and Correction) for data reliability and TRSR (Text-Retention-Guided Summary Refinement) for evidence preservation.

If this is right

Structured chart data becomes more consistent after verification and correction steps.
Summaries retain more of the original numerical and textual evidence from the chart image.
The combined pipeline outperforms single-branch baselines on both data recovery and factuality metrics.
OCR-assisted grounding reduces hallucinated values in the final output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same correction-and-retention pattern could apply to other image-to-text tasks that require both structured output and narrative fidelity.
Performance may depend on the quality of the initial OCR step; weaker OCR would limit how much evidence TRSR can retain.
If the test distribution shifts away from the training charts, the correction modules might require retraining to maintain the reported score.

Load-bearing premise

The SAVC and TRSR modules produce reliable corrections on the challenge test distribution without adding new factual errors.

What would settle it

Independent re-evaluation of the released system on the same test charts yields a score below 69.10 or shows introduced factual mismatches in the generated summaries.

Figures

Figures reproduced from arXiv: 2606.10640 by Fan Liu, Hao Liu, Kun Wang, Liqiang Nie, Ruping Cao, Yupeng Hu, Zhiran Li.

**Figure 2.** Figure 2: DataMFM Track 2 evaluates chart understanding with [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of ChartLens. (a) The pipeline uses parallel branches for CSV generation and summary generation. (b) Granite-Vision [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Successful case visualization. The correction strategy [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Failure case visualization. The model reads several nu [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

read the original abstract

In this report, we present our champion solution for the DataMFM Challenge Track 2: Chart Understanding. This track requires models to recover structured chart data and generate faithful natural-language summaries from chart images. To address the complementary requirements of accurate data extraction and factual narration, we propose ChartLens, a dual-branch framework for chart data correction and summary refinement. ChartLens consists of two key modules: Structure-Aware CSV Verification and Correction (SAVC) and Text-Retention-Guided Summary Refinement (TRSR). SAVC improves the reliability of structured data extraction through verification and correction, while TRSR enhances summary generation by preserving critical textual and numerical evidence from charts. By combining model adaptation, correction-based generation, and OCR-assisted evidence grounding, ChartLens improves both structured data recovery and summary factuality. On the test set, our final system achieves an overall score of 69.10 and ranks first in Track 2, demonstrating its effectiveness for accurate chart understanding. Our code will be released at: https://github.com/iLearn-Lab/CVPRW26-ChartLens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Competition report on a dual-branch chart system that won its track but supplies no experiments or comparisons to back the module claims.

read the letter

The paper reports first place in the DataMFM Challenge Track 2 with an overall score of 69.10 using ChartLens, a dual-branch setup with SAVC for structure-aware CSV correction and TRSR for text-retention-guided summary refinement. It combines model adaptation, correction-based generation, and OCR grounding to handle both structured data recovery and factual summaries from chart images.

The work is straightforward engineering for the specific task. Naming the two modules and tying them to the dual requirements of accuracy and factuality is a reasonable way to organize the solution. The commitment to release code at the GitHub link is the clearest positive, as it lets others inspect or reuse the implementation.

The soft spots are exactly where the stress-test note points. The abstract gives no ablation results, no baseline comparisons, no error rates on the corrections, and no examples showing that SAVC or TRSR avoid introducing new factual mistakes. Without those, the attribution of the win to the framework stays unverified. The central assumption that the modules generalize reliably on the held-out test distribution is not supported by any numbers here.

This is mainly of interest to groups already working on the same competition or on narrow chart-understanding pipelines in document AI. It does not introduce a new technique or result that would stand on its own outside that setting.

I would not send it to peer review as a research paper. It reads as a competition solution note; any serious review would require the missing quantitative evidence first.

Referee Report

2 major / 0 minor

Summary. The manuscript presents ChartLens, a dual-branch framework for the DataMFM Challenge Track 2 on chart understanding. It consists of Structure-Aware CSV Verification and Correction (SAVC) and Text-Retention-Guided Summary Refinement (TRSR) modules, combined with model adaptation, correction-based generation, and OCR-assisted evidence grounding. The central claim is that this approach improves structured data recovery and summary factuality, achieving an overall score of 69.10 and first place on the test set.

Significance. If substantiated with detailed evidence, the result would indicate that targeted verification/correction and evidence-preservation modules can enhance both data extraction accuracy and factual consistency in chart-to-text tasks. The stated intention to release code at the provided GitHub link supports reproducibility.

major comments (2)

[Abstract] Abstract: The claim that ChartLens 'improves both structured data recovery and summary factuality' and achieves first place with score 69.10 is unsupported by any quantitative results, ablation studies, baseline comparisons, error analysis, or module-specific metrics within the manuscript.
[Abstract] Abstract: The descriptions of SAVC ('verification and correction') and TRSR ('preserving critical textual and numerical evidence') provide no error rates, failure cases, or evidence that the modules avoid introducing new factual inaccuracies on the held-out test distribution, which is required to attribute the reported score to the dual-branch framework.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the current version lacks sufficient quantitative support for the claims and will revise it to include additional analyses and evidence.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that ChartLens 'improves both structured data recovery and summary factuality' and achieves first place with score 69.10 is unsupported by any quantitative results, ablation studies, baseline comparisons, error analysis, or module-specific metrics within the manuscript.

Authors: We acknowledge that the provided manuscript text focuses primarily on the high-level description and final challenge score. To substantiate the claims of improvement, the revised manuscript will incorporate ablation studies, baseline comparisons, and module-specific metrics evaluated on the validation set. revision: yes
Referee: [Abstract] Abstract: The descriptions of SAVC ('verification and correction') and TRSR ('preserving critical textual and numerical evidence') provide no error rates, failure cases, or evidence that the modules avoid introducing new factual inaccuracies on the held-out test distribution, which is required to attribute the reported score to the dual-branch framework.

Authors: We agree that error rates, failure cases, and evidence against introducing new inaccuracies are needed for proper attribution. The revision will add an error analysis section with validation-set metrics and examples demonstrating the modules' contributions. However, detailed analysis on the held-out test distribution is constrained by the unavailability of per-instance ground truth beyond the aggregate score. revision: partial

standing simulated objections not resolved

Detailed per-instance error rates and failure case analysis specifically on the held-out test distribution, due to the challenge setup not providing public ground-truth labels for individual test examples.

Circularity Check

0 steps flagged

No significant circularity; empirical competition report with no derivations

full rationale

The paper is a systems description of a dual-branch framework (SAVC and TRSR modules) for a data challenge, reporting an empirical test score of 69.10 and first-place ranking. No equations, derivations, fitted parameters, predictions, or self-citations appear in the provided text. The central claim rests on external challenge evaluation rather than any internal reduction to inputs by construction. This is the most common honest finding for competition reports.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an applied engineering system without mathematical derivations, free parameters, axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5741 in / 1114 out tokens · 20999 ms · 2026-06-27T13:56:21.479976+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction
cs.CV 2026-06 unverdicted novelty 3.0

ParseFixer combines full-page backbone parsing with agentic selective multimodal correction to reach third place (score 61.78) in the DataMFM Challenge Track 1.

Reference graph

Works this paper leans on

46 extracted references · 1 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025. 4, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Doc-researcher: A unified system for multi- modal document parsing and deep research

Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, et al. Doc-researcher: A unified system for multi- modal document parsing and deep research. InProceedings of the ACM Web Conference 2026, pages 2349–2360, 2026. 1

2026
[3]

Gemini 3.5 flash model card.https: / / deepmind

Google DeepMind. Gemini 3.5 flash model card.https: / / deepmind . google / models / model - cards / gemini-3-5-flash/, 2026. Accessed: 2026-06-06. 3, 6

2026
[4]

Lora: Low- rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low- rank adaptation of large language models. InInternational Conference on Learning Representations, pages 1–20, 2022. 3, 4, 6

2022
[5]

Video moment localization via deep cross-modal hash- ing.IEEE Transactions on Image Processing, 30:4667– 4677, 2021

Yupeng Hu, Meng Liu, Xiaobin Su, Zan Gao, and Liqiang Nie. Video moment localization via deep cross-modal hash- ing.IEEE Transactions on Image Processing, 30:4667– 4677, 2021. 1

2021
[6]

Coarse-to-fine semantic align- ment for cross-modal moment localization.IEEE Transac- tions on Image Processing, 30:5933–5943, 2021

Yupeng Hu, Liqiang Nie, Meng Liu, Kun Wang, Yinglong Wang, and Xian-Sheng Hua. Coarse-to-fine semantic align- ment for cross-modal moment localization.IEEE Transac- tions on Image Processing, 30:5933–5943, 2021

2021
[7]

Semantic collaborative learning for cross-modal mo- ment localization.ACM Transactions on Information Sys- tems, 42(2):1–26, 2023

Yupeng Hu, Kun Wang, Meng Liu, Haoyu Tang, and Liqiang Nie. Semantic collaborative learning for cross-modal mo- ment localization.ACM Transactions on Information Sys- tems, 42(2):1–26, 2023. 1

2023
[8]

Visual self-paced iterative learning for un- supervised temporal action localization.ACM Transactions on Multimedia Computing, Communications and Applica- tions, 2026

Yupeng Hu, Han Jiang, Hao Liu, Kun Wang, Haoyu Tang, and Liqiang Nie. Visual self-paced iterative learning for un- supervised temporal action localization.ACM Transactions on Multimedia Computing, Communications and Applica- tions, 2026. 1

2026
[9]

From a glance to a boundary: Uncertainty- aware distillation for glance-supervised video moment local- ization.IEEE Transactions on Multimedia, 2026

Yupeng Hu, Hao Liu, Kun Wang, Ruping Cao, Yinwei Wei, and Liqiang Nie. From a glance to a boundary: Uncertainty- aware distillation for glance-supervised video moment local- ization.IEEE Transactions on Multimedia, 2026. 1

2026
[10]

From pixels to insights: A survey on automatic chart understanding in the era of large foundation models.IEEE Transactions on Knowledge and Data Engineering, 37(5): 2550–2568, 2024

Kung-Hsiang Huang, Hou Pong Chan, May Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, and Heng Ji. From pixels to insights: A survey on automatic chart understanding in the era of large foundation models.IEEE Transactions on Knowledge and Data Engineering, 37(5): 2550–2568, 2024. 1

2024
[11]

Granite 4.1 language models.https:// huggingface.co/blog/ibm- granite/granit- 4-1, 2026

IBM Research. Granite 4.1 language models.https:// huggingface.co/blog/ibm- granite/granit- 4-1, 2026. Accessed: 2026-04-28. 3, 4, 6

2026
[12]

Chartnet: A million-scale, high-quality multimodal dataset for robust chart understanding

Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, et al. Chartnet: A million-scale, high-quality multimodal dataset for robust chart understanding. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 15922–15932, 202...

2026
[13]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 1

2022
[14]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1

2023
[15]

Efficient document parsing via parallel token prediction

Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, and Zang Li. Efficient document parsing via parallel token prediction. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2763–2772, 2026. 1

2026
[16]

Dcount: Decoupled spatial perception and attribute discrimination for referring expression counting

Ming Li, Yupeng Hu, Yinwei Wei, Hao Liu, Haocong Wang, and Weili Guan. Dcount: Decoupled spatial perception and attribute discrimination for referring expression counting. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 5306–5315, 2025. 1

2025
[17]

Mist: Towards multi-dimensional implicit bias and stereotype evaluation of llms via theory of mind.arXiv e- prints, pages arXiv–2506, 2025

Yanlin Li, Hao Liu, Huimin Liu, Yinwei Wei, and Yupeng Hu. Mist: Towards multi-dimensional implicit bias and stereotype evaluation of llms via theory of mind.arXiv e- prints, pages arXiv–2506, 2025. 1

2025
[18]

Unim: A unified any-to- any interleaved multimodal benchmark

Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Congyue Zhou, Weijie Zheng, Yushen Yan, Shengqiong Wu, et al. Unim: A unified any-to- any interleaved multimodal benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15902–15911, 2026. 1

2026
[19]

Gaming for boundary: Elastic localization for frame- supervised video moment retrieval

Hao Liu, Yupeng Hu, Kun Wang, Yinwei Wei, and Liqiang Nie. Gaming for boundary: Elastic localization for frame- supervised video moment retrieval. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 917–926,
[20]

Curmim: Curricu- lum masked image modeling

Hao Liu, Kun Wang, Yudong Han, Haocong Wang, Yupeng Hu, Chunxiao Wang, and Liqiang Nie. Curmim: Curricu- lum masked image modeling. InICASSP 2025-2025 IEEE International Conference on Acoustics, page 2041, 2025. 1

2025
[21]

GPT-5.5 System Card.https://openai

OpenAI. GPT-5.5 System Card.https://openai. com/index/gpt- 5- 5- system- card/, 2026. Ac- cessed: 2026-06-06. 4, 6

2026
[22]

Omnidocbench: Benchmarking di- verse pdf document parsing with comprehensive annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking di- verse pdf document parsing with comprehensive annotations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24838–24848, 2025. 1

2025
[23]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 1

2021
[24]

Explicit granularity and implicit scale corre- spondence learning for point-supervised video moment lo- calization

Kun Wang, Hao Liu, Lirong Jie, Zixu Li, Yupeng Hu, and Liqiang Nie. Explicit granularity and implicit scale corre- spondence learning for point-supervised video moment lo- calization. InProceedings of the 32nd ACM International Conference on Multimedia, pages 9214–9223, 2024. 1

2024
[25]

Redundancy mitigation: Towards accurate and efficient image-text retrieval.IEEE Transactions on Circuits and Sys- tems for Video Technology, 2025

Kun Wang, Yupeng Hu, Hao Liu, Lirong Jie, and Liqiang Nie. Redundancy mitigation: Towards accurate and efficient image-text retrieval.IEEE Transactions on Circuits and Sys- tems for Video Technology, 2025. 1

2025
[26]

Cross-modal representation shift refinement for point- supervised video moment retrieval.ACM Transactions on Information Systems, 44(3):1–30, 2026

Kun Wang, Yupeng Hu, Hao Liu, Jiang Shao, and Liqiang Nie. Cross-modal representation shift refinement for point- supervised video moment retrieval.ACM Transactions on Information Systems, 44(3):1–30, 2026. 1

2026
[27]

Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural In- formation Processing Systems, 37:113569–113697, 2024

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sad- hika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural In- formation Processing Systems, 37:113569–113697, 2024. 1

2024
[28]

Dkdm: Data-free knowledge dis- tillation for diffusion models with any architecture

Qianlong Xiang, Miao Zhang, Yuzhang Shang, Jianlong Wu, Yan Yan, and Liqiang Nie. Dkdm: Data-free knowledge dis- tillation for diffusion models with any architecture. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2965, 2025. 1

2025
[29]

Tina: Text-free inversion at- tack for unlearned text-to-image diffusion models

Qianlong Xiang, Miao Zhang, Haoyu Zhang, Kun Wang, Junhui Hou, and Liqiang Nie. Tina: Text-free inversion at- tack for unlearned text-to-image diffusion models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 30076–30086, 2026. 1

2026
[30]

Chartmoe: Mixture of di- versely aligned expert connector for chart understanding

Zhengzhuo Xu, Bowen Qu, Yiyan Qi, Sinan Du, Chengjin Xu, Chun Yuan, and Jian Guo. Chartmoe: Mixture of di- versely aligned expert connector for chart understanding. InInternational Conference on Learning Representations, pages 78550–78572, 2025. 1

2025
[31]

Cogcn: co-occurring item- aware gcn for recommendation.Neural computing and ap- plications, 35(36):25107–25120, 2023

Xinxiao Zhao, Fan Liu, Hao Liu, Mingzhu Xu, Haoyu Tang, Xueqing Li, and Yupeng Hu. Cogcn: co-occurring item- aware gcn for recommendation.Neural computing and ap- plications, 35(36):25107–25120, 2023. 1

2023
[32]

Chartcoder: Advancing multimodal large language model for chart-to-code gener- ation

Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. Chartcoder: Advancing multimodal large language model for chart-to-code gener- ation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7333–7348, 2025. 1 ChartLens: A Dual-Branch Framework for Chart ...

2025
[33]

Variables enclosed by braces, such as imagename,baseline csv, andocr reference, are replaced with instance-specific inputs during inference

Prompt Details This supplementary material provides the prompt templates used in ChartLens. Variables enclosed by braces, such as imagename,baseline csv, andocr reference, are replaced with instance-specific inputs during inference. 1.1. Direct Generation Prompt Direct Generation Prompt You are a chart understanding assistant. You will receive one chart i...
[34]

A structured CSV table that recovers the chart data
[35]

csv": "

A concise natural-language summary grounded in the chart. CSV requirements: * The CSV must have a header row. * Preserve visible categories, legends, series, and numerical values. * Use plain numeric cells whenever possible. * Put units in column names when needed. * Do not invent invisible rows, columns, labels, sources, or values. Summary requirements: ...
[36]

Compare every candidate CSV against the image
[37]

Select the candidate that best matches the image
[38]

Improve the selected CSV when it has small or moderate mistakes
[39]

csv": "

If all candidates are clearly inconsistent with the image, regenerate a new CSV from the image. Important decision rules: * Do NOT simply copy the best candidate. * If a candidate has the right table structure but a few values or labels are wrong, keep its structure and correct those values or labels. * If a candidate misses visible categories or visible ...
[40]

Make the smallest possible edits to the original summary
[41]

Preserve the original sentence order, paragraph structure, chart type, main trend descriptions, and numeric content unless the image clearly shows they are wrong
[42]

If the chart contains visible text such as title, subtitle, axis label, legend label, category label, source, note, or publisher, use the exact wording from the chart
[43]

Do NOT replace visible chart text with synonyms or expanded paraphrases
[44]

Trust the image over OCR when they conflict
[45]

Do not add invisible facts, invisible labels, or unsupported numeric values
[46]

summary":

Do not rewrite the whole summary unnecessarily. Return only one valid JSON object: { "summary": "..." } Final output requirements: * The summary must be a single paragraph string. * No Markdown. * No code fence. * No explanation outside JSON

[1] [1]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025. 4, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Doc-researcher: A unified system for multi- modal document parsing and deep research

Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, et al. Doc-researcher: A unified system for multi- modal document parsing and deep research. InProceedings of the ACM Web Conference 2026, pages 2349–2360, 2026. 1

2026

[3] [3]

Gemini 3.5 flash model card.https: / / deepmind

Google DeepMind. Gemini 3.5 flash model card.https: / / deepmind . google / models / model - cards / gemini-3-5-flash/, 2026. Accessed: 2026-06-06. 3, 6

2026

[4] [4]

Lora: Low- rank adaptation of large language models

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low- rank adaptation of large language models. InInternational Conference on Learning Representations, pages 1–20, 2022. 3, 4, 6

2022

[5] [5]

Video moment localization via deep cross-modal hash- ing.IEEE Transactions on Image Processing, 30:4667– 4677, 2021

Yupeng Hu, Meng Liu, Xiaobin Su, Zan Gao, and Liqiang Nie. Video moment localization via deep cross-modal hash- ing.IEEE Transactions on Image Processing, 30:4667– 4677, 2021. 1

2021

[6] [6]

Coarse-to-fine semantic align- ment for cross-modal moment localization.IEEE Transac- tions on Image Processing, 30:5933–5943, 2021

Yupeng Hu, Liqiang Nie, Meng Liu, Kun Wang, Yinglong Wang, and Xian-Sheng Hua. Coarse-to-fine semantic align- ment for cross-modal moment localization.IEEE Transac- tions on Image Processing, 30:5933–5943, 2021

2021

[7] [7]

Semantic collaborative learning for cross-modal mo- ment localization.ACM Transactions on Information Sys- tems, 42(2):1–26, 2023

Yupeng Hu, Kun Wang, Meng Liu, Haoyu Tang, and Liqiang Nie. Semantic collaborative learning for cross-modal mo- ment localization.ACM Transactions on Information Sys- tems, 42(2):1–26, 2023. 1

2023

[8] [8]

Visual self-paced iterative learning for un- supervised temporal action localization.ACM Transactions on Multimedia Computing, Communications and Applica- tions, 2026

Yupeng Hu, Han Jiang, Hao Liu, Kun Wang, Haoyu Tang, and Liqiang Nie. Visual self-paced iterative learning for un- supervised temporal action localization.ACM Transactions on Multimedia Computing, Communications and Applica- tions, 2026. 1

2026

[9] [9]

From a glance to a boundary: Uncertainty- aware distillation for glance-supervised video moment local- ization.IEEE Transactions on Multimedia, 2026

Yupeng Hu, Hao Liu, Kun Wang, Ruping Cao, Yinwei Wei, and Liqiang Nie. From a glance to a boundary: Uncertainty- aware distillation for glance-supervised video moment local- ization.IEEE Transactions on Multimedia, 2026. 1

2026

[10] [10]

From pixels to insights: A survey on automatic chart understanding in the era of large foundation models.IEEE Transactions on Knowledge and Data Engineering, 37(5): 2550–2568, 2024

Kung-Hsiang Huang, Hou Pong Chan, May Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, and Heng Ji. From pixels to insights: A survey on automatic chart understanding in the era of large foundation models.IEEE Transactions on Knowledge and Data Engineering, 37(5): 2550–2568, 2024. 1

2024

[11] [11]

Granite 4.1 language models.https:// huggingface.co/blog/ibm- granite/granit- 4-1, 2026

IBM Research. Granite 4.1 language models.https:// huggingface.co/blog/ibm- granite/granit- 4-1, 2026. Accessed: 2026-04-28. 3, 4, 6

2026

[12] [12]

Chartnet: A million-scale, high-quality multimodal dataset for robust chart understanding

Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, et al. Chartnet: A million-scale, high-quality multimodal dataset for robust chart understanding. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 15922–15932, 202...

2026

[13] [13]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 1

2022

[14] [14]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1

2023

[15] [15]

Efficient document parsing via parallel token prediction

Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, and Zang Li. Efficient document parsing via parallel token prediction. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2763–2772, 2026. 1

2026

[16] [16]

Dcount: Decoupled spatial perception and attribute discrimination for referring expression counting

Ming Li, Yupeng Hu, Yinwei Wei, Hao Liu, Haocong Wang, and Weili Guan. Dcount: Decoupled spatial perception and attribute discrimination for referring expression counting. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 5306–5315, 2025. 1

2025

[17] [17]

Mist: Towards multi-dimensional implicit bias and stereotype evaluation of llms via theory of mind.arXiv e- prints, pages arXiv–2506, 2025

Yanlin Li, Hao Liu, Huimin Liu, Yinwei Wei, and Yupeng Hu. Mist: Towards multi-dimensional implicit bias and stereotype evaluation of llms via theory of mind.arXiv e- prints, pages arXiv–2506, 2025. 1

2025

[18] [18]

Unim: A unified any-to- any interleaved multimodal benchmark

Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Congyue Zhou, Weijie Zheng, Yushen Yan, Shengqiong Wu, et al. Unim: A unified any-to- any interleaved multimodal benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15902–15911, 2026. 1

2026

[19] [19]

Gaming for boundary: Elastic localization for frame- supervised video moment retrieval

Hao Liu, Yupeng Hu, Kun Wang, Yinwei Wei, and Liqiang Nie. Gaming for boundary: Elastic localization for frame- supervised video moment retrieval. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 917–926,

[20] [20]

Curmim: Curricu- lum masked image modeling

Hao Liu, Kun Wang, Yudong Han, Haocong Wang, Yupeng Hu, Chunxiao Wang, and Liqiang Nie. Curmim: Curricu- lum masked image modeling. InICASSP 2025-2025 IEEE International Conference on Acoustics, page 2041, 2025. 1

2025

[21] [21]

GPT-5.5 System Card.https://openai

OpenAI. GPT-5.5 System Card.https://openai. com/index/gpt- 5- 5- system- card/, 2026. Ac- cessed: 2026-06-06. 4, 6

2026

[22] [22]

Omnidocbench: Benchmarking di- verse pdf document parsing with comprehensive annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking di- verse pdf document parsing with comprehensive annotations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24838–24848, 2025. 1

2025

[23] [23]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 1

2021

[24] [24]

Explicit granularity and implicit scale corre- spondence learning for point-supervised video moment lo- calization

Kun Wang, Hao Liu, Lirong Jie, Zixu Li, Yupeng Hu, and Liqiang Nie. Explicit granularity and implicit scale corre- spondence learning for point-supervised video moment lo- calization. InProceedings of the 32nd ACM International Conference on Multimedia, pages 9214–9223, 2024. 1

2024

[25] [25]

Redundancy mitigation: Towards accurate and efficient image-text retrieval.IEEE Transactions on Circuits and Sys- tems for Video Technology, 2025

Kun Wang, Yupeng Hu, Hao Liu, Lirong Jie, and Liqiang Nie. Redundancy mitigation: Towards accurate and efficient image-text retrieval.IEEE Transactions on Circuits and Sys- tems for Video Technology, 2025. 1

2025

[26] [26]

Cross-modal representation shift refinement for point- supervised video moment retrieval.ACM Transactions on Information Systems, 44(3):1–30, 2026

Kun Wang, Yupeng Hu, Hao Liu, Jiang Shao, and Liqiang Nie. Cross-modal representation shift refinement for point- supervised video moment retrieval.ACM Transactions on Information Systems, 44(3):1–30, 2026. 1

2026

[27] [27]

Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural In- formation Processing Systems, 37:113569–113697, 2024

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sad- hika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural In- formation Processing Systems, 37:113569–113697, 2024. 1

2024

[28] [28]

Dkdm: Data-free knowledge dis- tillation for diffusion models with any architecture

Qianlong Xiang, Miao Zhang, Yuzhang Shang, Jianlong Wu, Yan Yan, and Liqiang Nie. Dkdm: Data-free knowledge dis- tillation for diffusion models with any architecture. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2965, 2025. 1

2025

[29] [29]

Tina: Text-free inversion at- tack for unlearned text-to-image diffusion models

Qianlong Xiang, Miao Zhang, Haoyu Zhang, Kun Wang, Junhui Hou, and Liqiang Nie. Tina: Text-free inversion at- tack for unlearned text-to-image diffusion models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 30076–30086, 2026. 1

2026

[30] [30]

Chartmoe: Mixture of di- versely aligned expert connector for chart understanding

Zhengzhuo Xu, Bowen Qu, Yiyan Qi, Sinan Du, Chengjin Xu, Chun Yuan, and Jian Guo. Chartmoe: Mixture of di- versely aligned expert connector for chart understanding. InInternational Conference on Learning Representations, pages 78550–78572, 2025. 1

2025

[31] [31]

Cogcn: co-occurring item- aware gcn for recommendation.Neural computing and ap- plications, 35(36):25107–25120, 2023

Xinxiao Zhao, Fan Liu, Hao Liu, Mingzhu Xu, Haoyu Tang, Xueqing Li, and Yupeng Hu. Cogcn: co-occurring item- aware gcn for recommendation.Neural computing and ap- plications, 35(36):25107–25120, 2023. 1

2023

[32] [32]

Chartcoder: Advancing multimodal large language model for chart-to-code gener- ation

Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, and Maosong Sun. Chartcoder: Advancing multimodal large language model for chart-to-code gener- ation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7333–7348, 2025. 1 ChartLens: A Dual-Branch Framework for Chart ...

2025

[33] [33]

Variables enclosed by braces, such as imagename,baseline csv, andocr reference, are replaced with instance-specific inputs during inference

Prompt Details This supplementary material provides the prompt templates used in ChartLens. Variables enclosed by braces, such as imagename,baseline csv, andocr reference, are replaced with instance-specific inputs during inference. 1.1. Direct Generation Prompt Direct Generation Prompt You are a chart understanding assistant. You will receive one chart i...

[34] [34]

A structured CSV table that recovers the chart data

[35] [35]

csv": "

A concise natural-language summary grounded in the chart. CSV requirements: * The CSV must have a header row. * Preserve visible categories, legends, series, and numerical values. * Use plain numeric cells whenever possible. * Put units in column names when needed. * Do not invent invisible rows, columns, labels, sources, or values. Summary requirements: ...

[36] [36]

Compare every candidate CSV against the image

[37] [37]

Select the candidate that best matches the image

[38] [38]

Improve the selected CSV when it has small or moderate mistakes

[39] [39]

csv": "

If all candidates are clearly inconsistent with the image, regenerate a new CSV from the image. Important decision rules: * Do NOT simply copy the best candidate. * If a candidate has the right table structure but a few values or labels are wrong, keep its structure and correct those values or labels. * If a candidate misses visible categories or visible ...

[40] [40]

Make the smallest possible edits to the original summary

[41] [41]

Preserve the original sentence order, paragraph structure, chart type, main trend descriptions, and numeric content unless the image clearly shows they are wrong

[42] [42]

If the chart contains visible text such as title, subtitle, axis label, legend label, category label, source, note, or publisher, use the exact wording from the chart

[43] [43]

Do NOT replace visible chart text with synonyms or expanded paraphrases

[44] [44]

Trust the image over OCR when they conflict

[45] [45]

Do not add invisible facts, invisible labels, or unsupported numeric values

[46] [46]

summary":

Do not rewrite the whole summary unnecessarily. Return only one valid JSON object: { "summary": "..." } Final output requirements: * The summary must be a single paragraph string. * No Markdown. * No code fence. * No explanation outside JSON