ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction

Fan Liu; Hao Liu; Kun Wang; LeKai Yu; Ruping Cao; Yupeng Hu; Zhiran Li

arxiv: 2606.11977 · v1 · pith:ROCLUYODnew · submitted 2026-06-10 · 💻 cs.CV

ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction

LeKai Yu , Hao Liu , Kun Wang , Zhiran Li , Ruping Cao , Fan Liu , Yupeng Hu This is my paper

Pith reviewed 2026-06-27 09:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords document parsingmarkdown recoverymultimodal correctionagentic frameworkimage to structured textselective repairbackbone parsingcomputer vision

0 comments

The pith

Selective multimodal correction after a stable backbone parser improves recovery of key document elements without rewriting reliable predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ParseFixer to turn document page images into structured Markdown by first generating an initial output with a full-page backbone parser and then applying corrections only to selected high-value failures. This two-stage design seeks to maintain the stability of the initial parse while using an agentic process to detect and repair important errors through verification and rollback. A sympathetic reader would care because many real-world systems rely on accurate extraction of both text and layout from images, and a method that avoids blanket rewrites could reduce unnecessary changes. If the selective correction works as described, it would mean that targeted multimodal fixes can raise overall quality without risking the parts the backbone already handled correctly.

Core claim

ParseFixer consists of Full-Page Backbone Parsing that produces initial Markdown outputs and Agentic Selective Correction that detects high-value parsing failures and repairs them through a verify-and-rollback correction process, thereby improving the recovery of key document elements without rewriting reliable backbone predictions.

What carries the argument

The Agentic Selective Correction module, which identifies important parsing failures in the backbone output and executes targeted multimodal repairs with rollback to avoid introducing new errors.

If this is right

Initial backbone outputs remain stable because only selected failures are addressed rather than the entire page.
Recovery of key document elements increases because corrections focus on high-value issues identified by the agentic process.
Reliable predictions from the backbone are preserved because the correction includes a rollback mechanism.
The overall output quality improves when the verify-and-rollback steps succeed on the chosen failures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The selective approach could reduce the computational cost of full re-parsing by limiting multimodal calls to only problematic regions.
Similar selective correction patterns might apply to other image-to-structure tasks where some output parts are already accurate.
If the failure detection generalizes, the framework could be tested on documents with tables or figures that the backbone typically mishandles.

Load-bearing premise

The agentic module can reliably spot which parsing failures are worth fixing and correct them without creating new mistakes in the process.

What would settle it

A controlled comparison on the same test images showing that the selective correction step produces the same or lower overall quality scores than the backbone parser alone, or that corrected sections contain more errors than the original backbone output.

Figures

Figures reproduced from arXiv: 2606.11977 by Fan Liu, Hao Liu, Kun Wang, LeKai Yu, Ruping Cao, Yupeng Hu, Zhiran Li.

**Figure 2.** Figure 2: Motivation of our solution for DataMFM Track 1. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of ParseFixer. Given document page images, ParseFixer first uses FBP to generate initial Markdown outputs and [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative examples of page-level failure modes ad [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

read the original abstract

In this report, we present our third-place solution for the DataMFM Challenge Track 1: Document Parsing. This track requires models to recover structured Markdown documents from document page images while preserving textual content and document structure. To address the complementary requirements of accurate content recovery and faithful structure reconstruction, we propose ParseFixer, an agentic framework for backbone parsing and selective correction. ParseFixer consists of two key modules: Full-Page Backbone Parsing (FBP) and Agentic Selective Correction (ASC). FBP produces stable initial Markdown outputs with MinerU2.5 Pro, while ASC detects high-value parsing failures and repairs them through a verify-and-rollback correction process. By placing selective multimodal correction after open-source backbone parsing, ParseFixer improves the recovery of key document elements without rewriting reliable backbone predictions. On the test set, our final system achieves an overall score of 61.78 and ranks third in Track 1, demonstrating its effectiveness for accurate document parsing. Our code will be released at: https://github.com/iLearn-Lab/CVPRW26-ParseFixer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ParseFixer is a competition report that wires MinerU2.5 Pro to an agentic correction loop and claims third place, but the text gives no baseline or ablation to show the loop adds anything.

read the letter

The paper describes a third-place entry for the DataMFM Challenge Track 1 on document parsing. The system runs MinerU2.5 Pro as a full-page backbone to produce initial Markdown, then applies an Agentic Selective Correction module that tries to detect high-value failures and repair them with a verify-and-rollback step. The stated goal is to fix key elements while leaving reliable backbone output untouched. The final test-set score is 61.78.

The practical pattern is clear enough: selective multimodal correction after a stable open-source parser. Releasing the code at the GitHub link is also useful for anyone who wants to replicate the pipeline.

The main gap is the missing measurements. The abstract reports only the combined score and ranking. It does not give the backbone-only score on the same test set, any ablation that removes the correction module, or any breakdown of which errors were fixed versus introduced. Without those numbers the claim that selective correction improves recovery stays unverified.

This work is aimed at practitioners who build document parsing systems and might want an engineering template for agentic post-processing. Readers looking for new methods, formal analysis, or quantified gains will not find them here.

I would not bring this to a reading group. It does not deserve peer review in its current form because the central benefit of the correction module is not demonstrated by any direct comparison or error analysis.

Referee Report

2 major / 1 minor

Summary. The manuscript presents ParseFixer as the authors' third-place entry in the DataMFM Challenge Track 1 on document parsing. The system comprises two modules: Full-Page Backbone Parsing (FBP) that generates initial Markdown from page images using MinerU2.5 Pro, and Agentic Selective Correction (ASC) that identifies high-value failures and applies verify-and-rollback multimodal repairs. The central claim is that selective correction after backbone parsing improves recovery of key elements without rewriting reliable predictions; the final test-set overall score is reported as 61.78.

Significance. If the selective-correction benefit can be quantified, the framework would illustrate a practical way to augment open-source parsers by targeting only high-value errors. The explicit commitment to release code at the cited GitHub repository is a clear strength for reproducibility.

major comments (2)

[Abstract] Abstract: The claim that ASC 'improves the recovery of key document elements without rewriting reliable backbone predictions' is unsupported by any reported comparison to the MinerU2.5 Pro baseline score on the same test set or by ablation results that isolate the ASC contribution to the 61.78 score.
[Abstract] Abstract: No definition of the 'overall score,' per-element error breakdown, rollback statistics, or analysis showing that ASC repairs failures without introducing new errors is supplied, leaving the weakest assumption (reliable identification and net-positive repair) unverified.

minor comments (1)

[Abstract] The abstract would be clearer if it briefly stated the competition metric used for the 61.78 score.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and agree that the manuscript requires additional evidence to support the claims made in the abstract.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that ASC 'improves the recovery of key document elements without rewriting reliable backbone predictions' is unsupported by any reported comparison to the MinerU2.5 Pro baseline score on the same test set or by ablation results that isolate the ASC contribution to the 61.78 score.

Authors: We agree that the current manuscript does not include a direct comparison to the MinerU2.5 Pro baseline or ablations isolating the ASC contribution. The reported 61.78 is the challenge test-set score for the full system. In the revised manuscript we will add the baseline score on the same test set and an ablation study quantifying the selective correction benefit. revision: yes
Referee: [Abstract] Abstract: No definition of the 'overall score,' per-element error breakdown, rollback statistics, or analysis showing that ASC repairs failures without introducing new errors is supplied, leaving the weakest assumption (reliable identification and net-positive repair) unverified.

Authors: We will explicitly define the overall score according to the DataMFM Challenge metric in the revision. We will also add per-element error breakdowns, rollback statistics, and analysis of the verify-and-rollback process demonstrating that repairs are net-positive and do not introduce new errors. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical system report with external competition score

full rationale

The paper describes an engineering system (FBP using MinerU2.5 Pro + ASC module) and reports its third-place result on an external competition test set (score 61.78). No equations, parameters, or derivations are present. No self-citations, fitted inputs renamed as predictions, or self-definitional steps exist. The central claim rests on the reported competition outcome rather than any internal reduction to the paper's own inputs. This is the expected non-finding for a competition report lacking mathematical content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering submission to a document-parsing challenge. No mathematical derivations, free parameters, axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.1-grok · 5740 in / 1017 out tokens · 21777 ms · 2026-06-27T09:56:43.172974+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 6 canonical work pages · 5 internal anchors

[1]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 4, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. Paddleocr-vl-1.5: Towards a multi- task 0.9 b vlm for robust in-the-wild document parsing.arXiv preprint arXiv:2601.21957, 2026. 1, 6

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Doc-researcher: A unified system for multi- modal document parsing and deep research

Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, et al. Doc-researcher: A unified system for multi- modal document parsing and deep research. InProceedings of the ACM Web Conference 2026, pages 2349–2360, 2026. 1

2026
[5]

Glm-ocr technical report.arXiv preprint arXiv:2603.10910, 2026

Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, et al. Glm-ocr technical report.arXiv preprint arXiv:2603.10910, 2026. 1

work page arXiv 2026
[6]

Video moment localization via deep cross-modal hash- ing.IEEE Transactions on Image Processing, 30:4667– 4677, 2021

Yupeng Hu, Meng Liu, Xiaobin Su, Zan Gao, and Liqiang Nie. Video moment localization via deep cross-modal hash- ing.IEEE Transactions on Image Processing, 30:4667– 4677, 2021. 1

2021
[7]

Coarse-to-fine semantic align- ment for cross-modal moment localization.IEEE Transac- tions on Image Processing, 30:5933–5943, 2021

Yupeng Hu, Liqiang Nie, Meng Liu, Kun Wang, Yinglong Wang, and Xian-Sheng Hua. Coarse-to-fine semantic align- ment for cross-modal moment localization.IEEE Transac- tions on Image Processing, 30:5933–5943, 2021

2021
[8]

Semantic collaborative learning for cross-modal mo- ment localization.ACM Transactions on Information Sys- tems, 42(2):1–26, 2023

Yupeng Hu, Kun Wang, Meng Liu, Haoyu Tang, and Liqiang Nie. Semantic collaborative learning for cross-modal mo- ment localization.ACM Transactions on Information Sys- tems, 42(2):1–26, 2023. 1

2023
[9]

Visual self-paced iterative learning for un- supervised temporal action localization.ACM Transactions on Multimedia Computing, Communications and Applica- tions, 2026

Yupeng Hu, Han Jiang, Hao Liu, Kun Wang, Haoyu Tang, and Liqiang Nie. Visual self-paced iterative learning for un- supervised temporal action localization.ACM Transactions on Multimedia Computing, Communications and Applica- tions, 2026. 1

2026
[10]

From a glance to a boundary: Uncertainty- aware distillation for glance-supervised video moment local- ization.IEEE Transactions on Multimedia, 2026

Yupeng Hu, Hao Liu, Kun Wang, Ruping Cao, Yinwei Wei, and Liqiang Nie. From a glance to a boundary: Uncertainty- aware distillation for glance-supervised video moment local- ization.IEEE Transactions on Multimedia, 2026. 1

2026
[11]

Chartnet: A million-scale, high-quality multimodal dataset for robust chart understanding

Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, et al. Chartnet: A million-scale, high-quality multimodal dataset for robust chart understanding. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 15922–15932, 2026. 1

2026
[12]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 1

2022
[13]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1

2023
[14]

Dcount: Decoupled spatial perception and attribute discrimination for referring expression counting

Ming Li, Yupeng Hu, Yinwei Wei, Hao Liu, Haocong Wang, and Weili Guan. Dcount: Decoupled spatial perception and attribute discrimination for referring expression counting. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 5306–5315, 2025. 1

2025
[15]

Mist: Towards multi-dimensional implicit bias and stereotype evaluation of llms via theory of mind.arXiv e- prints, pages arXiv–2506, 2025

Yanlin Li, Hao Liu, Huimin Liu, Yinwei Wei, and Yupeng Hu. Mist: Towards multi-dimensional implicit bias and stereotype evaluation of llms via theory of mind.arXiv e- prints, pages arXiv–2506, 2025. 1

2025
[16]

Unim: A unified any-to- any interleaved multimodal benchmark

Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Congyue Zhou, Weijie Zheng, Yushen Yan, Shengqiong Wu, et al. Unim: A unified any-to- any interleaved multimodal benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15902–15911, 2026. 1

2026
[17]

Gaming for boundary: Elastic localization for frame- supervised video moment retrieval

Hao Liu, Yupeng Hu, Kun Wang, Yinwei Wei, and Liqiang Nie. Gaming for boundary: Elastic localization for frame- supervised video moment retrieval. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 917–926,
[18]

Curmim: Curricu- lum masked image modeling

Hao Liu, Kun Wang, Yudong Han, Haocong Wang, Yupeng Hu, Chunxiao Wang, and Liqiang Nie. Curmim: Curricu- lum masked image modeling. InICASSP 2025-2025 IEEE International Conference on Acoustics, page 2041, 2025. 1

2025
[19]

ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

Hao Liu, Ruping Cao, Kun Wang, Zhiran Li, Fan Liu, Yu- peng Hu, and Liqiang Nie. Chartlens: A dual-branch frame- work for chart data correction and factual summary refine- ment.arXiv preprint arXiv:2606.10640, 2026. 1

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

GPT-5.5 System Card.https://openai

OpenAI. GPT-5.5 System Card.https://openai. com/index/gpt- 5- 5- system- card/, 2026. Ac- cessed: 2026-06-06. 4, 5, 6, 7

2026
[21]

Omnidocbench: Benchmarking di- verse pdf document parsing with comprehensive annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking di- verse pdf document parsing with comprehensive annotations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24838–24848, 2025. 1, 5

2025
[22]

Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, et al. Mineru2. 5-pro: Pushing the limits of data-centric document parsing at scale.arXiv preprint arXiv:2604.04771, 2026. 2, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Explicit granularity and implicit scale corre- spondence learning for point-supervised video moment lo- calization

Kun Wang, Hao Liu, Lirong Jie, Zixu Li, Yupeng Hu, and Liqiang Nie. Explicit granularity and implicit scale corre- spondence learning for point-supervised video moment lo- calization. InProceedings of the 32nd ACM International Conference on Multimedia, pages 9214–9223, 2024. 1

2024
[24]

Redundancy mitigation: Towards accurate and efficient image-text retrieval.IEEE Transactions on Circuits and Sys- tems for Video Technology, 2025

Kun Wang, Yupeng Hu, Hao Liu, Lirong Jie, and Liqiang Nie. Redundancy mitigation: Towards accurate and efficient image-text retrieval.IEEE Transactions on Circuits and Sys- tems for Video Technology, 2025. 1

2025
[25]

Cross-modal representation shift refinement for point- supervised video moment retrieval.ACM Transactions on Information Systems, 44(3):1–30, 2026

Kun Wang, Yupeng Hu, Hao Liu, Jiang Shao, and Liqiang Nie. Cross-modal representation shift refinement for point- supervised video moment retrieval.ACM Transactions on Information Systems, 44(3):1–30, 2026. 1

2026
[26]

Dkdm: Data-free knowledge dis- tillation for diffusion models with any architecture

Qianlong Xiang, Miao Zhang, Yuzhang Shang, Jianlong Wu, Yan Yan, and Liqiang Nie. Dkdm: Data-free knowledge dis- tillation for diffusion models with any architecture. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2965, 2025. 1

2025
[27]

Tina: Text-free inversion at- tack for unlearned text-to-image diffusion models

Qianlong Xiang, Miao Zhang, Haoyu Zhang, Kun Wang, Junhui Hou, and Liqiang Nie. Tina: Text-free inversion at- tack for unlearned text-to-image diffusion models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 30076–30086, 2026. 1

2026
[28]

Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, and Xiang Bai. Om- niparser v2: Structured-points-of-thought for unified visual text parsing and its generality to multimodal large language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026. 1

2026
[29]

Cogcn: co-occurring item- aware gcn for recommendation.Neural computing and ap- plications, 35(36):25107–25120, 2023

Xinxiao Zhao, Fan Liu, Hao Liu, Mingzhu Xu, Haoyu Tang, Xueqing Li, and Yupeng Hu. Cogcn: co-occurring item- aware gcn for recommendation.Neural computing and ap- plications, 35(36):25107–25120, 2023. 1 ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction Supplementary Material

2023
[30]

Prompt Details This supplementary material provides the prompt templates used in ParseFixer for DataMFM Challenge Track 1: Docu- ment Parsing.All prompts follow the same principle: output Markdown or the required element format only, preserve visible document content, avoid explanations or summaries, and reject unsupported additions through verification a...
[31]

- Do not output explanations

Output Markdown only. - Do not output explanations. - Do not output analysis. - Do not wrap the answer in ‘‘‘markdown or any code fence. - Do not include metadata such as filename, page number, or confidence unless it is visibly part of the document content
[32]

- Paragraphs must be separated by a blank line

Text: - Use standard Markdown syntax. - Paragraphs must be separated by a blank line. - Preserve the original text as faithfully as possible. - Do not invent missing or unreadable content
[33]

- Display/block formulas must be wrapped with $$...$$

Formulas: - Use LaTeX syntax. - Display/block formulas must be wrapped with $$...$$. - Inline formulas must be wrapped with $ ...$. - Do not use \\[...\\] or \$...\$ delimiters
[34]

- Standard Markdown pipe tables are also acceptable for simple tables

Tables: - Prefer HTML <table> format, especially for complex tables or merged cells. - Standard Markdown pipe tables are also acceptable for simple tables. - Preserve row/column structure as accurately as possible. - Use rowspan/colspan in HTML tables when cell merging is visible
[35]

- For multi-column pages, read each column in the correct order

Reading order: - Arrange all elements in natural reading order. - For multi-column pages, read each column in the correct order. - Keep captions close to their corresponding figures/tables when visible
[36]

# ". - Use

Headings: - If a line is visually a title, section heading, chapter heading, category heading, or subsection heading, prefix it with "# ". - Use "# " for all heading lines, not "##" or "###". - Do not add "# " to normal body paragraphs
[37]

Lists: - Use Markdown bullet or numbered lists when the document visibly contains lists
[38]

- Do not invent a description for a figure if no caption or relevant text is visible

Figures and captions: - Transcribe visible captions. - Do not invent a description for a figure if no caption or relevant text is visible
[39]

Headers, footers, and page numbers: - Omit repeated decorative headers, footers, and page numbers unless they are meaningful document content
[40]

Unreadable text: - If text is too blurry or unreadable, omit it rather than guessing
[41]

No text found

Empty / non-document image: - If the image contains no readable document text, table, formula, caption, or meaningful document content, output an empty Markdown file. - Do not write explanations such as "No text found", "No readable content", or "The image contains no text". """ USER_PROMPT = """Convert this document page image into Markdown. Remember: - ...

[1] [1]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 4, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. Paddleocr-vl-1.5: Towards a multi- task 0.9 b vlm for robust in-the-wild document parsing.arXiv preprint arXiv:2601.21957, 2026. 1, 6

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

Doc-researcher: A unified system for multi- modal document parsing and deep research

Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, et al. Doc-researcher: A unified system for multi- modal document parsing and deep research. InProceedings of the ACM Web Conference 2026, pages 2349–2360, 2026. 1

2026

[5] [5]

Glm-ocr technical report.arXiv preprint arXiv:2603.10910, 2026

Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, et al. Glm-ocr technical report.arXiv preprint arXiv:2603.10910, 2026. 1

work page arXiv 2026

[6] [6]

Video moment localization via deep cross-modal hash- ing.IEEE Transactions on Image Processing, 30:4667– 4677, 2021

Yupeng Hu, Meng Liu, Xiaobin Su, Zan Gao, and Liqiang Nie. Video moment localization via deep cross-modal hash- ing.IEEE Transactions on Image Processing, 30:4667– 4677, 2021. 1

2021

[7] [7]

Coarse-to-fine semantic align- ment for cross-modal moment localization.IEEE Transac- tions on Image Processing, 30:5933–5943, 2021

Yupeng Hu, Liqiang Nie, Meng Liu, Kun Wang, Yinglong Wang, and Xian-Sheng Hua. Coarse-to-fine semantic align- ment for cross-modal moment localization.IEEE Transac- tions on Image Processing, 30:5933–5943, 2021

2021

[8] [8]

Semantic collaborative learning for cross-modal mo- ment localization.ACM Transactions on Information Sys- tems, 42(2):1–26, 2023

Yupeng Hu, Kun Wang, Meng Liu, Haoyu Tang, and Liqiang Nie. Semantic collaborative learning for cross-modal mo- ment localization.ACM Transactions on Information Sys- tems, 42(2):1–26, 2023. 1

2023

[9] [9]

Visual self-paced iterative learning for un- supervised temporal action localization.ACM Transactions on Multimedia Computing, Communications and Applica- tions, 2026

Yupeng Hu, Han Jiang, Hao Liu, Kun Wang, Haoyu Tang, and Liqiang Nie. Visual self-paced iterative learning for un- supervised temporal action localization.ACM Transactions on Multimedia Computing, Communications and Applica- tions, 2026. 1

2026

[10] [10]

From a glance to a boundary: Uncertainty- aware distillation for glance-supervised video moment local- ization.IEEE Transactions on Multimedia, 2026

Yupeng Hu, Hao Liu, Kun Wang, Ruping Cao, Yinwei Wei, and Liqiang Nie. From a glance to a boundary: Uncertainty- aware distillation for glance-supervised video moment local- ization.IEEE Transactions on Multimedia, 2026. 1

2026

[11] [11]

Chartnet: A million-scale, high-quality multimodal dataset for robust chart understanding

Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, et al. Chartnet: A million-scale, high-quality multimodal dataset for robust chart understanding. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 15922–15932, 2026. 1

2026

[12] [12]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 1

2022

[13] [13]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1

2023

[14] [14]

Dcount: Decoupled spatial perception and attribute discrimination for referring expression counting

Ming Li, Yupeng Hu, Yinwei Wei, Hao Liu, Haocong Wang, and Weili Guan. Dcount: Decoupled spatial perception and attribute discrimination for referring expression counting. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 5306–5315, 2025. 1

2025

[15] [15]

Mist: Towards multi-dimensional implicit bias and stereotype evaluation of llms via theory of mind.arXiv e- prints, pages arXiv–2506, 2025

Yanlin Li, Hao Liu, Huimin Liu, Yinwei Wei, and Yupeng Hu. Mist: Towards multi-dimensional implicit bias and stereotype evaluation of llms via theory of mind.arXiv e- prints, pages arXiv–2506, 2025. 1

2025

[16] [16]

Unim: A unified any-to- any interleaved multimodal benchmark

Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Congyue Zhou, Weijie Zheng, Yushen Yan, Shengqiong Wu, et al. Unim: A unified any-to- any interleaved multimodal benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15902–15911, 2026. 1

2026

[17] [17]

Gaming for boundary: Elastic localization for frame- supervised video moment retrieval

Hao Liu, Yupeng Hu, Kun Wang, Yinwei Wei, and Liqiang Nie. Gaming for boundary: Elastic localization for frame- supervised video moment retrieval. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 917–926,

[18] [18]

Curmim: Curricu- lum masked image modeling

Hao Liu, Kun Wang, Yudong Han, Haocong Wang, Yupeng Hu, Chunxiao Wang, and Liqiang Nie. Curmim: Curricu- lum masked image modeling. InICASSP 2025-2025 IEEE International Conference on Acoustics, page 2041, 2025. 1

2025

[19] [19]

ChartLens: A Dual-Branch Framework for Chart Data Correction and Factual Summary Refinement

Hao Liu, Ruping Cao, Kun Wang, Zhiran Li, Fan Liu, Yu- peng Hu, and Liqiang Nie. Chartlens: A dual-branch frame- work for chart data correction and factual summary refine- ment.arXiv preprint arXiv:2606.10640, 2026. 1

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

GPT-5.5 System Card.https://openai

OpenAI. GPT-5.5 System Card.https://openai. com/index/gpt- 5- 5- system- card/, 2026. Ac- cessed: 2026-06-06. 4, 5, 6, 7

2026

[21] [21]

Omnidocbench: Benchmarking di- verse pdf document parsing with comprehensive annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, et al. Omnidocbench: Benchmarking di- verse pdf document parsing with comprehensive annotations. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24838–24848, 2025. 1, 5

2025

[22] [22]

Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, et al. Mineru2. 5-pro: Pushing the limits of data-centric document parsing at scale.arXiv preprint arXiv:2604.04771, 2026. 2, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [23]

Explicit granularity and implicit scale corre- spondence learning for point-supervised video moment lo- calization

Kun Wang, Hao Liu, Lirong Jie, Zixu Li, Yupeng Hu, and Liqiang Nie. Explicit granularity and implicit scale corre- spondence learning for point-supervised video moment lo- calization. InProceedings of the 32nd ACM International Conference on Multimedia, pages 9214–9223, 2024. 1

2024

[24] [24]

Redundancy mitigation: Towards accurate and efficient image-text retrieval.IEEE Transactions on Circuits and Sys- tems for Video Technology, 2025

Kun Wang, Yupeng Hu, Hao Liu, Lirong Jie, and Liqiang Nie. Redundancy mitigation: Towards accurate and efficient image-text retrieval.IEEE Transactions on Circuits and Sys- tems for Video Technology, 2025. 1

2025

[25] [25]

Cross-modal representation shift refinement for point- supervised video moment retrieval.ACM Transactions on Information Systems, 44(3):1–30, 2026

Kun Wang, Yupeng Hu, Hao Liu, Jiang Shao, and Liqiang Nie. Cross-modal representation shift refinement for point- supervised video moment retrieval.ACM Transactions on Information Systems, 44(3):1–30, 2026. 1

2026

[26] [26]

Dkdm: Data-free knowledge dis- tillation for diffusion models with any architecture

Qianlong Xiang, Miao Zhang, Yuzhang Shang, Jianlong Wu, Yan Yan, and Liqiang Nie. Dkdm: Data-free knowledge dis- tillation for diffusion models with any architecture. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2965, 2025. 1

2025

[27] [27]

Tina: Text-free inversion at- tack for unlearned text-to-image diffusion models

Qianlong Xiang, Miao Zhang, Haoyu Zhang, Kun Wang, Junhui Hou, and Liqiang Nie. Tina: Text-free inversion at- tack for unlearned text-to-image diffusion models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 30076–30086, 2026. 1

2026

[28] [28]

Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, and Xiang Bai. Om- niparser v2: Structured-points-of-thought for unified visual text parsing and its generality to multimodal large language models.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026. 1

2026

[29] [29]

Cogcn: co-occurring item- aware gcn for recommendation.Neural computing and ap- plications, 35(36):25107–25120, 2023

Xinxiao Zhao, Fan Liu, Hao Liu, Mingzhu Xu, Haoyu Tang, Xueqing Li, and Yupeng Hu. Cogcn: co-occurring item- aware gcn for recommendation.Neural computing and ap- plications, 35(36):25107–25120, 2023. 1 ParseFixer: An Agentic Framework for Document Parsing via Selective Multimodal Correction Supplementary Material

2023

[30] [30]

Prompt Details This supplementary material provides the prompt templates used in ParseFixer for DataMFM Challenge Track 1: Docu- ment Parsing.All prompts follow the same principle: output Markdown or the required element format only, preserve visible document content, avoid explanations or summaries, and reject unsupported additions through verification a...

[31] [31]

- Do not output explanations

Output Markdown only. - Do not output explanations. - Do not output analysis. - Do not wrap the answer in ‘‘‘markdown or any code fence. - Do not include metadata such as filename, page number, or confidence unless it is visibly part of the document content

[32] [32]

- Paragraphs must be separated by a blank line

Text: - Use standard Markdown syntax. - Paragraphs must be separated by a blank line. - Preserve the original text as faithfully as possible. - Do not invent missing or unreadable content

[33] [33]

- Display/block formulas must be wrapped with $$...$$

Formulas: - Use LaTeX syntax. - Display/block formulas must be wrapped with $$...$$. - Inline formulas must be wrapped with $ ...$. - Do not use \\[...\\] or \\(...\\) delimiters

[34] [34]

- Standard Markdown pipe tables are also acceptable for simple tables

Tables: - Prefer HTML <table> format, especially for complex tables or merged cells. - Standard Markdown pipe tables are also acceptable for simple tables. - Preserve row/column structure as accurately as possible. - Use rowspan/colspan in HTML tables when cell merging is visible

[35] [35]

- For multi-column pages, read each column in the correct order

Reading order: - Arrange all elements in natural reading order. - For multi-column pages, read each column in the correct order. - Keep captions close to their corresponding figures/tables when visible

[36] [36]

# ". - Use

Headings: - If a line is visually a title, section heading, chapter heading, category heading, or subsection heading, prefix it with "# ". - Use "# " for all heading lines, not "##" or "###". - Do not add "# " to normal body paragraphs

[37] [37]

Lists: - Use Markdown bullet or numbered lists when the document visibly contains lists

[38] [38]

- Do not invent a description for a figure if no caption or relevant text is visible

Figures and captions: - Transcribe visible captions. - Do not invent a description for a figure if no caption or relevant text is visible

[39] [39]

Headers, footers, and page numbers: - Omit repeated decorative headers, footers, and page numbers unless they are meaningful document content

[40] [40]

Unreadable text: - If text is too blurry or unreadable, omit it rather than guessing

[41] [41]

No text found

Empty / non-document image: - If the image contains no readable document text, table, formula, caption, or meaningful document content, output an empty Markdown file. - Do not write explanations such as "No text found", "No readable content", or "The image contains no text". """ USER_PROMPT = """Convert this document page image into Markdown. Remember: - ...