OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models

Jianqiang Wan; Jun Tang; Sibo Song; Wenqing Cheng; Wenwen Yu; Xiang Bai; Yuliang Liu; Zhibo Yang

arxiv: 2502.16161 · v2 · submitted 2025-02-22 · 💻 cs.CV · cs.CL

Pith reviewed 2026-05-23 01:41 UTC · model grok-4.3

keywords: visual text parsing · structured points of thought · text spotting · key information extraction · table recognition · layout analysis · multimodal large language models · document understanding

The pith

OmniParser V2 uses one Structured-Points-of-Thought prompting schema to unify text spotting, key information extraction, table recognition and layout analysis in a single encoder-decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace separate task-specific architectures and loss functions for visually-situated text parsing with one universal model. It introduces SPOT prompting schemas that enforce a shared input-output representation and objective across the four tasks. A sympathetic reader would care because current document pipelines remain fragmented by heterogeneous targets and schemas. The work reports state-of-the-art or competitive numbers on eight datasets and shows the same schemas improve performance when plugged into multimodal large language models.

Core claim

OmniParser V2 demonstrates that Structured-Points-of-Thought (SPOT) prompting schemas allow a single encoder-decoder architecture, objective, and input-output format to handle text spotting, key information extraction, table recognition, and layout analysis, removing the need for task-specific components while matching or exceeding prior results on eight datasets and extending successfully to multimodal large language models.

What carries the argument

Structured-Points-of-Thought (SPOT) prompting schemas, which supply a unified representation that collapses the four VsTP tasks into one encoder-decoder pipeline.
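As a rough illustration of what a shared input-output format could look like, the sketch below serializes targets from all four tasks into one token sequence behind a task prompt. Everything in it is hypothetical: the task tags, the `<x_…>` and `<y_…>` coordinate tokens, the bin count, and the names `Element`, `quantize`, and `build_target` are illustrative assumptions, not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical task tags; the paper's actual prompt vocabulary is not given here.
TASK_PROMPTS = {
    "text_spotting": "<spot>",
    "kie": "<kie>",
    "table_recognition": "<table>",
    "layout_analysis": "<layout>",
}

NUM_BINS = 1000  # assumed coordinate quantization resolution


@dataclass
class Element:
    x: float    # normalized x coordinate in [0, 1]
    y: float    # normalized y coordinate in [0, 1]
    label: str  # task-dependent tag, e.g. an entity type or a layout class


def quantize(value: float, bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate to a discrete bin index in [0, bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * bins), bins - 1)


def build_target(task: str, elements: List[Element]) -> List[str]:
    """Serialize any task's elements into one shared token format."""
    tokens = [TASK_PROMPTS[task]]
    for el in elements:
        tokens += [f"<x_{quantize(el.x)}>", f"<y_{quantize(el.y)}>", f"<{el.label}>"]
    tokens.append("</s>")
    return tokens
```

For example, `build_target("kie", [Element(0.25, 0.5, "total")])` returns `["<kie>", "<x_250>", "<y_500>", "<total>", "</s>"]`. The point of the sketch is only that one decoder vocabulary and one sequence loss can, in principle, serve every task once the targets share a serialization.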

If this is right

Task-specific architectures and loss functions become unnecessary for the four evaluated VsTP tasks.
A single training and inference pipeline replaces four separate workflows.
The same SPOT schemas improve visual text parsing when inserted into multimodal large language models.
Evaluation across eight datasets confirms the approach reaches state-of-the-art or competitive accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

SPOT may allow rapid addition of new document tasks by writing only new prompt templates rather than new model heads.
The unified format could reduce memory and compute costs in production document systems that currently run multiple models in parallel.
Because SPOT works inside existing multimodal LLMs, the technique might transfer to other multimodal reasoning settings that currently require task-specific fine-tuning.

Load-bearing premise

The diverse targets and schemas of the four tasks can be captured by one shared prompting format and loss without measurable degradation relative to dedicated models.

What would settle it

A direct comparison on any of the eight datasets in which the unified OmniParser V2 model falls more than a few points behind a published task-specific baseline on text spotting, key information extraction, table recognition, or layout analysis.
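The check described above can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function name `datasets_falling_behind`, the default gap of 3.0 points (one reading of "a few points"), and the convention that higher scores are better are all choices made here for illustration, not details from the paper.

```python
from typing import Dict, List


def datasets_falling_behind(
    unified: Dict[str, float],
    baseline: Dict[str, float],
    max_gap: float = 3.0,  # assumed threshold for "a few points"
) -> List[str]:
    """Return datasets where the unified model trails the task-specific
    baseline by more than max_gap points (higher score assumed better).
    Datasets missing from either mapping are ignored."""
    return sorted(
        name
        for name, score in unified.items()
        if name in baseline and baseline[name] - score > max_gap
    )


# Placeholder names and scores for illustration only; not reported results.
unified_scores = {"dataset_a": 80.0, "dataset_b": 70.0}
baseline_scores = {"dataset_a": 81.0, "dataset_b": 75.0}
flagged = datasets_falling_behind(unified_scores, baseline_scores)  # ["dataset_b"]
```

A non-empty result on real per-dataset scores would be the kind of evidence that counts against the unification claim.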

Original abstract

Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individual tasks. This leads to modal isolation and complex workflows due to the diversified targets and heterogeneous schemas. In this paper, we introduce OmniParser V2, a universal model that unifies VsTP typical tasks, including text spotting, key information extraction, table recognition, and layout analysis, into a unified framework. Central to our approach is the proposed Structured-Points-of-Thought (SPOT) prompting schemas, which improves model performance across diverse scenarios by leveraging a unified encoder-decoder architecture, objective, and input & output representation. SPOT eliminates the need for task-specific architectures and loss functions, significantly simplifying the processing pipeline. Our extensive evaluations across four tasks on eight different datasets show that OmniParser V2 achieves state-of-the-art or competitive results in VsTP. Additionally, we explore the integration of SPOT within a multimodal large language model structure, further enhancing visual text parsing capabilities on four tasks, thereby confirming the generality of SPOT prompting technique. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OmniParser V2 introduces SPOT prompting to unify four VsTP tasks in one encoder-decoder but supplies no numbers to show whether accuracy holds across all of them.

The main takeaway is that this work proposes Structured-Points-of-Thought prompting schemas so that a single encoder-decoder with one objective can cover text spotting, key information extraction, table recognition, and layout analysis at once. The authors also plug the same schemas into a multimodal LLM to test generality. That direction makes sense for cutting down on separate pipelines in document systems. The paper clearly names the current problem of task-specific architectures and heterogeneous schemas, and the code link is useful for anyone who wants to inspect the implementation. The SPOT idea itself is presented as the concrete novelty rather than a routine reuse of prior techniques.

The soft spot is exactly what the stress-test note flags: the abstract asserts SOTA or competitive results on eight datasets but gives no tables, baselines, ablations, or per-task scores. Without those numbers it is impossible to check whether the unified representation preserves performance on every task or whether one of them degrades. If table recognition or spotting drops, the drop-in replacement claim does not hold. The full manuscript is referenced, but the supplied text stops at the abstract, so the performance evidence stays out of reach here.

This is for readers working on practical document AI pipelines who care about reducing engineering overhead across VsTP tasks. A serious referee should see it if the full version contains the missing comparisons and ablations, because a working unification would still be worth knowing about even if the gains are incremental rather than transformative.

Referee Report

2 major / 1 minor

Summary. The paper introduces OmniParser V2, a universal encoder-decoder model for visually-situated text parsing (VsTP) that unifies four tasks—text spotting, key information extraction, table recognition, and layout analysis—via Structured-Points-of-Thought (SPOT) prompting schemas. It claims this single architecture, objective, and I/O representation eliminates task-specific designs while achieving SOTA or competitive results on eight datasets; it further explores SPOT integration into multimodal LLMs to demonstrate generality. Code is released.

Significance. If the performance claims are substantiated, the work would be significant for simplifying VsTP pipelines through unification, reducing modal isolation and complex workflows. The open-sourcing of code at https://github.com/AlibabaResearch/AdvancedLiterateMachinery is a clear strength that supports reproducibility.

major comments (2)

[Abstract] The assertion that 'OmniParser V2 achieves state-of-the-art or competitive results in VsTP' across four tasks and eight datasets is made without any quantitative tables, baseline comparisons, per-task metrics, error bars, or ablation details. This directly undermines verification of the load-bearing claim that the unified SPOT model serves as a drop-in replacement without performance degradation on any task.
[Evaluations (presumed §4)] The central unification premise (that one encoder-decoder + SPOT prompting matches task-specific performance on text spotting, KIE, table recognition, and layout analysis) requires explicit per-task comparisons against specialized baselines; without them in the evaluations section, the 'no systematic drop' condition remains unverified.

minor comments (1)

[Abstract] The abstract would benefit from naming the eight datasets and briefly indicating the four tasks' metrics to allow readers to immediately gauge the scope of the claimed results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the feedback. We address the two major comments below by clarifying the location and content of the quantitative evidence already present in the manuscript.

Referee: [Abstract] The assertion that 'OmniParser V2 achieves state-of-the-art or competitive results in VsTP' across four tasks and eight datasets is made without any quantitative tables, baseline comparisons, per-task metrics, error bars, or ablation details. This directly undermines verification of the load-bearing claim that the unified SPOT model serves as a drop-in replacement without performance degradation on any task.

Authors: Abstracts are concise summaries by convention and do not include tables or detailed metrics. The full quantitative evidence (tables with baseline comparisons, per-task metrics across all eight datasets, error bars, and ablation studies) is provided in Section 4. These results substantiate SOTA or competitive performance with no systematic degradation for the unified model.
Revision: no

Referee: [Evaluations (presumed §4)] The central unification premise (that one encoder-decoder + SPOT prompting matches task-specific performance on text spotting, KIE, table recognition, and layout analysis) requires explicit per-task comparisons against specialized baselines; without them in the evaluations section, the 'no systematic drop' condition remains unverified.

Authors: Section 4 already contains explicit per-task comparisons of the unified encoder-decoder + SPOT model against specialized baselines for text spotting, KIE, table recognition, and layout analysis on the eight datasets. The reported results confirm competitive or superior performance without systematic drops. We can add cross-references or emphasis for clarity if the referee finds the presentation insufficient.
Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical model proposal with external dataset benchmarks

The paper introduces OmniParser V2 and SPOT prompting as a unified encoder-decoder framework for four VsTP tasks, with claims resting on experimental results across eight datasets rather than any mathematical derivation chain. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The unification claim is presented as an empirical outcome verified against external benchmarks, making the work self-contained without reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review is based solely on the abstract; no numerical free parameters or additional axioms are identifiable from the provided text.

axioms (1)

domain assumption: A single encoder-decoder architecture with one objective and one input-output representation can handle the four distinct VsTP tasks without task-specific modifications.
Invoked when the abstract states that SPOT eliminates the need for task-specific architectures and loss functions.

invented entities (1)

Structured-Points-of-Thought (SPOT) prompting schemas · no independent evidence
purpose: To provide a unified input-output representation that improves performance across diverse VsTP scenarios
Newly introduced as the central technical contribution of the paper.

pith-pipeline@v0.9.0 · 5818 in / 1362 out tokens · 52396 ms · 2026-05-23T01:41:51.396117+00:00 · methodology

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
cs.AI · 2025-12 · accept · novelty 8.0

MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG · 2026-05 · unverdicted · novelty 7.0

BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG · 2026-05 · unverdicted · novelty 7.0

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
cs.CV · 2026-04 · unverdicted · novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
cs.CV · 2026-05 · unverdicted · novelty 6.0

AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.

InstructTable: Improving Table Structure Recognition Through Instructions
cs.CV · 2026-04 · unverdicted · novelty 6.0

InstructTable combines instruction-guided pre-training on structural patterns with visual fine-tuning and a template-free synthetic data generator (TME) to reach state-of-the-art table structure recognition on public ...

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 5 Pith papers

[1]

Platypus: A generalized specialist model for reading text in various forms,

P. Wang, Z. Li, J. Tang, H. Zhong, F. Huang, Z. Yang, and C. Yao, “Platypus: A generalized specialist model for reading text in various forms,” in ECCV, 2024, pp. 1–15.

work page 2024
[2]

Rico: A mobile app dataset for building data-driven design applications,

B. Deka, Z. Huang, C. Franzen, J. Hibschman, D. Afergan, Y. Li, J. Nichols, and R. Kumar, “Rico: A mobile app dataset for building data-driven design applications,” in UIST, 2017, pp. 845–854

work page 2017
[3]

Icdar 2013 robust reading competition,

D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. M. Romeu, D. F. Mota, J. Almazán, and L.-P. de las Heras, “Icdar 2013 robust reading competition,” in ICDAR, 2013, pp. 1484–1493

work page 2013
[4]

Towards end-to-end unified scene text detection and layout analysis,

S. Long, S. Qin, D. Panteleev, A. Bissacco, Y. Fujii, and M. Raptis, “Towards end-to-end unified scene text detection and layout analysis,” in CVPR, 2022, pp. 1049–1059

work page 2022
[5]

Textocr: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text,

A. Singh, G. Pang, M. Toh, J. Huang, W. Galuba, and T. Hassner, “Textocr: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text,” in CVPR, 2021, pp. 8802–8812

work page 2021
[6]

Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting,

Y. Liu, C. Shen, L. Jin, T. He, P. Chen, C. Liu, and H. Chen, “Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting,” TPAMI, vol. 44, pp. 8048–8064, 2021

work page 2021
[7]

Open images v5 text annotation and yet another mask text spotter,

I. Krylov, S. Nosov, and V. Sovrasov, “Open images v5 text annotation and yet another mask text spotter,” in Asian Conference on Machine Learning, 2021, pp. 379–389

work page 2021
[8]

Doclaynet: A large human-annotated dataset for document-layout segmentation,

B. Pfitzmann, C. Auer, M. Dolfi, A. S. Nassar, and P. Staar, “Doclaynet: A large human-annotated dataset for document-layout segmentation,” in KDD, 2022, pp. 3743–3751

work page 2022
[9]

Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt,

N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon et al., “Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt,” in ICDAR, vol. 1, 2017, pp. 1454–1459

work page 2017
[10]

Coco-text: Dataset and benchmark for text detection and recognition in natural images,

A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, “Coco-text: Dataset and benchmark for text detection and recognition in natural images,” arXiv: Comp. Res. Repository, pp. 1–8, 2016

work page 2016
[11]

Vision grid transformer for document layout analysis,

C. Da, C. Luo, Q. Zheng, and C. Yao, “Vision grid transformer for document layout analysis,” in ICCV, 2023, pp. 19462–19472

work page 2023
[12]

Ocr-free document understanding transformer,

G. Kim, T. Hong, M. Yim, J. Nam, J. Park, J. Yim, W. Hwang, S. Yun, D. Han, and S. Park, “Ocr-free document understanding transformer,” in ECCV, 2022, pp. 498–517

work page 2022
[13]

Publaynet: largest dataset ever for document layout analysis,

X. Zhong, J. Tang, and A. J. Yepes, “Publaynet: largest dataset ever for document layout analysis,” in ICDAR, 2019, pp. 1015–1022

work page 2019
[14]

Laion-5b: An open large-scale dataset for training next generation image-text models,

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, “Laion-5b: An open large-scale dataset for training next generation image-text models,” in NeurIPS, 2022, pp. 25 278 – 2529

work page 2022
[15]

Conditional text image generation with diffusion models,

Y. Zhu, Z. Li, T. Wang, M. He, and C. Yao, “Conditional text image generation with diffusion models,” in CVPR, 2023, pp. 14235–14244

work page 2023
[16]

Towards unified scene text spotting based on sequence generation,

T. Kil, S. Kim, S. Seo, Y. Kim, and D. Kim, “Towards unified scene text spotting based on sequence generation,” in CVPR, 2023, pp. 15223–15232

work page 2023
[17]

Challenges in end-to-end neural scientific table recognition,

Y. Deng, D. Rosenberg, and G. Mann, “Challenges in end-to-end neural scientific table recognition,” in ICDAR, 2019, pp. 894–901

work page 2019
[18]

Tablebank: Table benchmark for image-based table detection and recognition,

M. Li, L. Cui, S. Huang, F. Wei, M. Zhou, and Z. Li, “Tablebank: Table benchmark for image-based table detection and recognition,” in Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 1918–1925

work page 2020
[19]

An open approach towards the benchmarking of table structure recognition systems,

A. Shahab, F. Shafait, T. Kieninger, and A. Dengel, “An open approach towards the benchmarking of table structure recognition systems,” in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, 2010, pp. 113–120

work page 2010
[20]

Icdar 2019 competition on table detection and recognition (ctdar),

L. Gao, Y. Huang, H. Déjean, J.-L. Meunier, Q. Yan, Y. Fang, F. Kleber, and E. Lang, “Icdar 2019 competition on table detection and recognition (ctdar),” in ICDAR, 2019, pp. 1510–1515

work page 2019
[21]

Parsing table structures in the wild,

R. Long, W. Wang, N. Xue, F. Gao, Z. Yang, Y. Wang, and G.-S. Xia, “Parsing table structures in the wild,” in ICCV, 2021, pp. 944–952

work page 2021
[22]

Visual understanding of complex table structures from document images,

S. Raja, A. Mondal, and C. Jawahar, “Visual understanding of complex table structures from document images,” in WACV, 2022, pp. 2543–2552

work page 2022
[23]

Icdar 2013 table competition,

M. Göbel, T. Hassan, E. Oro, and G. Orsi, “Icdar 2013 table competition,” in ICDAR, 2013, pp. 1449–1453

work page 2013
[24]

Complicated table structure recognition,

Z. Chi, H. Huang, H.-D. Xu, H. Yu, W. Yin, and X.-L. Mao, “Complicated table structure recognition,” arXiv: Comp. Res. Repository, pp. 1–9, 2019

work page 2019
[25]

Pubtables-1m: Towards comprehensive table extraction from unstructured documents,

B. Smock, R. Pesala, and R. Abraham, “Pubtables-1m: Towards comprehensive table extraction from unstructured documents,” in CVPR, 2022, pp. 4634–4642

work page 2022
[26]

Image-based table recognition: data, model, and evaluation,

X. Zhong, E. ShafieiBavani, and A. Jimeno Yepes, “Image-based table recognition: data, model, and evaluation,” in ECCV, 2020, pp. 564–580

work page 2020
[27]

Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context,

X. Zheng, D. Burdick, L. Popa, X. Zhong, and N. X. R. Wang, “Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context,” in WACV, 2021, pp. 697–706

work page 2021
[28]

Pingan-vcgroup’s solution for icdar 2021 competition on scientific literature parsing task b: table recognition to html,

J. Ye, X. Qi, Y. He, Y. Chen, D. Gu, P. Gao, and R. Xiao, “Pingan-vcgroup’s solution for icdar 2021 competition on scientific literature parsing task b: table recognition to html,” arXiv: Comp. Res. Repository, pp. 1–8, 2021

work page 2021
[29]

Improving table structure recognition with visual-alignment sequential coordinate modeling,

Y. Huang, N. Lu, D. Chen, Y. Li, Z. Xie, S. Zhu, L. Gao, and W. Peng, “Improving table structure recognition with visual-alignment sequential coordinate modeling,” in CVPR, 2023, pp. 11134–11143

work page 2023
[30]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. W...

work page 2025
