
arxiv: 2606.29905 · v1 · pith:TL6CFYDDnew · submitted 2026-06-29 · 💻 cs.CV

StrucTab: A Structured Optimization Framework for Table Parsing

Pith reviewed 2026-06-30 06:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords table parsing · table structure recognition · vision-language models · reinforcement learning · reward decomposition · document image analysis · structured output

The pith

StrucTab improves table parsing accuracy by decomposing the task into subtasks and using separate rewards for validity, structure, and content in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Table parsing converts images of tables into machine-readable structured data, a process that demands both spatial layout understanding and text recognition. Standard vision-language models often train directly on final outputs and struggle with unstable rewards when using reinforcement learning. StrucTab instead breaks the work into human-like subtasks such as counting rows and columns and detecting merged cells, then unifies them via sequential reasoning. It optimizes this process with Uni-TabRL, which supplies three distinct reward signals rather than one combined score. The result is state-of-the-art performance on existing benchmarks and clear gains on the new TableVerse-5K dataset of challenging real-world tables.

Core claim

StrucTab is a table parsing model learned through intermediate structural supervision and reward decomposition. At the modeling level, the parsing process is decomposed into subtasks such as row-column counting and merged-cell analysis and progressively unified through a sequential reasoning strategy. At the optimization level, Uni-TabRL supplies decomposed rewards (validity, structure, and content) that deliver stable and informative signals. At the evaluation level, TableVerse-5K provides a large-scale benchmark of diverse real-world tables. Experiments confirm state-of-the-art results across public benchmarks and substantial gains on TableVerse-5K.

What carries the argument

Uni-TabRL, a unified reinforcement learning framework that supplies three separate reward components (validity, structure, content) to guide a sequential reasoning strategy over subtasks such as row-column counting and merged-cell analysis.
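The three reward components can be illustrated with a minimal sketch. The abstract does not define how Uni-TabRL scores validity, structure, or content, so everything below is an assumption made for illustration: the names `TableReader`, `parse`, and `decomposed_reward` are hypothetical, validity is taken to mean balanced table tags, and structure and content are scored with a generic sequence-similarity ratio rather than whatever metric the paper uses.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collect cells as [text, rowspan, colspan] and track tag balance."""

    TRACKED = {"table", "tr", "td", "th"}

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cells per <tr>
        self.stack = []       # currently open tracked tags
        self.balanced = True  # False once a close tag does not match

    def handle_starttag(self, tag, attrs):
        if tag not in self.TRACKED:
            return
        self.stack.append(tag)
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            spans = [int(a.get(k) or 1) for k in ("rowspan", "colspan")]
            self.rows[-1].append(["", *spans])

    def handle_endtag(self, tag):
        if tag not in self.TRACKED:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False

    def handle_data(self, data):
        in_cell = self.stack and self.stack[-1] in ("td", "th")
        if in_cell and self.rows and self.rows[-1]:
            self.rows[-1][-1][0] += data.strip()


def parse(html):
    """Return (rows, is_valid) for one HTML table string."""
    reader = TableReader()
    try:
        reader.feed(html)
        reader.close()
    except ValueError:  # e.g. a non-numeric rowspan
        return [], False
    return reader.rows, reader.balanced and not reader.stack and bool(reader.rows)


def decomposed_reward(pred_html, ref_html):
    """Score a prediction on three separate axes, each in [0, 1] (assumed design)."""
    pred, valid = parse(pred_html)
    ref, _ = parse(ref_html)
    if not valid:  # malformed output earns nothing on any axis
        return {"validity": 0.0, "structure": 0.0, "content": 0.0}

    def layout(rows):  # span pattern only, text ignored
        return [tok for row in rows for tok in ["tr", *((rs, cs) for _, rs, cs in row)]]

    def text(rows):  # cell text only, spans ignored
        return "\x1f".join(cell[0] for row in rows for cell in row)

    return {
        "validity": 1.0,
        "structure": SequenceMatcher(None, layout(pred), layout(ref)).ratio(),
        "content": SequenceMatcher(None, text(pred), text(ref)).ratio(),
    }
```

With a reference such as `<table><tr><td colspan="2">A</td></tr><tr><td>1</td><td>2</td></tr></table>`, a prediction that drops the `colspan` keeps full validity and content scores but loses structure score, which is the kind of separation a decomposed reward is meant to expose.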

If this is right

  • Explicit subtask decomposition allows the model to handle complex merged-cell and irregular layouts that direct end-to-end supervision misses.
  • Decomposed rewards reduce reward ambiguity and stabilize reinforcement learning updates for structured output tasks.
  • The sequential reasoning strategy produces intermediate outputs that can be inspected or corrected independently.
  • Performance gains on TableVerse-5K indicate the method scales to diverse real-world table formats beyond existing benchmarks.
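The intermediate outputs mentioned in the bullets above (row and column counts, merged cells) can in principle be derived from an annotated table. The sketch below assumes cells are already available as `(text, rowspan, colspan)` tuples; the function name `structural_targets` and the output dictionary are hypothetical and are not taken from the paper.

```python
def structural_targets(rows):
    """Derive row/column counts and merged cells from a cell list.

    rows: list of rows, each a list of (text, rowspan, colspan) cells.
    """
    occupied = set()   # grid slots already claimed by some cell
    merged = []        # (row, col, rowspan, colspan) for spanning cells
    for r, row in enumerate(rows):
        c = 0
        for _text, rowspan, colspan in row:
            while (r, c) in occupied:   # skip slots held by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            if rowspan > 1 or colspan > 1:
                merged.append((r, c, rowspan, colspan))
            c += colspan
    n_rows = 1 + max((r for r, _ in occupied), default=-1)
    n_cols = 1 + max((c for _, c in occupied), default=-1)
    return {"rows": n_rows, "cols": n_cols, "merged": merged}


example = [
    [("Name", 2, 1), ("Score", 1, 2)],
    [("Math", 1, 1), ("Art", 1, 1)],
    [("Ann", 1, 1), ("90", 1, 1), ("85", 1, 1)],
]
targets = structural_targets(example)
```

Because each target is a small, checkable value, an inspector can compare the predicted counts against these before looking at the full HTML output. The sketch does not guard against overlapping spans, which a real annotation pipeline would need to handle.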

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition pattern could be tested on other structured document tasks such as form extraction or chart understanding.
  • If the three reward components prove additive, the framework might reduce the volume of fully annotated table images needed for training.
  • Extending the subtask list to include cell-type classification or alignment checks could further tighten the optimization loop.

Load-bearing premise

That breaking the task into subtasks and splitting the reward into validity, structure, and content signals will produce clearer and more stable training than direct supervision without introducing new ambiguities or instabilities.

What would settle it

Training a standard vision-language model with a single combined reward instead of the three decomposed rewards, then measuring on TableVerse-5K whether its accuracy matches or exceeds the reported StrucTab results. A match would undercut the claim that reward decomposition drives the gains.
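A toy calculation shows what the single-reward baseline gives up. The weights and scores below are invented for illustration and the helper names `combined` and `differs_on` are hypothetical; the point is only that two different failure modes can collapse to the same scalar.

```python
KEYS = ("validity", "structure", "content")


def combined(reward, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three components (assumed combination rule)."""
    return sum(w * reward[k] for w, k in zip(weights, KEYS)) / sum(weights)


def differs_on(a, b):
    """Components on which two reward vectors disagree."""
    return [k for k in KEYS if a[k] != b[k]]


# Hypothetical rollouts: one gets the layout wrong, the other the text.
wrong_layout = {"validity": 1.0, "structure": 0.5, "content": 1.0}
wrong_text = {"validity": 1.0, "structure": 1.0, "content": 0.5}

same_scalar = combined(wrong_layout) == combined(wrong_text)
separated_by = differs_on(wrong_layout, wrong_text)
```

Under equal weights both rollouts receive the same combined score, so a single-reward learner cannot tell which aspect to fix, while the component vector still separates them. Whether that difference matters in practice is exactly what the proposed ablation would measure.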

Figures

Figures reproduced from arXiv: 2606.29905 by Binghong Wu, Can Ma, Chengquan Zhang, Gengluo Li, Han Hu, Hao Feng, Huawen Shen, Pengyuan Lyu, Shangpin Peng, Weinong Wang, Xingyu Wan, Yu Zhou, Zhuotao Tian.

Figure 1
Figure 1: StrucTab: A table parsing framework that overcomes the limitations of (a) current VLM-based end-to-end paradigms through three key innovations: (b) Modeling via human-inspired sequential reasoning; (c) Optimization with Uni-TabRL using decomposed, fine-grained rewards; and (d) Evaluation on TableVerse-5K, a large-scale, challenging benchmark of diverse real-world tables.
Figure 2
Figure 2: Limitations of existing RL reward designs for table parsing.
Figure 3
Figure 3: Structured Optimization Framework for StrucTab.
Figure 4
Figure 4: (a) Providing structural cues (e.g., row–column counts) explicitly improves parsing performance on OmniDocBench. (b) Correcting the first structural error reduces subsequent structural errors while improving downstream generation confidence.
3.1 Human-Inspired Structural Decomposition: Inspired by the explicit reasoning process of humans, we hypothesize that intermediate structural signals play an importa…
Figure 5
Figure 5: Uni-TabRL Implementation Pipeline. Details are in the Appendix.
3.2 Curriculum Training with Sequential Reasoning: To bridge structural decomposition and end-to-end parsing, we design a curriculum training strategy with sequential reasoning. In the first stage of pretraining, row-column counting, merged-cell analysis, and HTML parsing are optimized as three independent question–answer tasks. This stage foc…
Figure 6
Figure 6: TableVerse-5K Construction Pipeline.
Data Collection: All table images in TableVerse-5K are manually collected from diverse real-world sources, including academic papers, posters, reports, and technical documents [32]. The tables are manually cropped to ensure accurate localization. The dataset covers both Chinese and English, spanning printed tables, photographed tables, and handwritten content, thereby r…
Figure 7
Figure 7: Qualitative Comparisons on TableVerse-5K.
read the original abstract

Table parsing aims to convert table images into structured, machine-readable representations, a task requiring the joint perception of complex spatial layouts and textual content. While recent vision-language models (VLMs) enable end-to-end parsing, they typically rely on direct supervision of the final output, thereby bypassing the explicit intermediate reasoning that is crucial for understanding complex table structures. Furthermore, attempts to optimize these models using reinforcement learning (RL) are often hindered by unstable or ambiguous reward designs, limiting potential performance gains. To address these limitations, we propose StrucTab, a table parsing model learned through intermediate structural supervision and reward decomposition. At the modeling level, by decomposing the parsing process into human-inspired subtasks, such as row-column counting and merged-cell analysis, StrucTab progressively unifies them through a sequential reasoning strategy. At the optimization level, we introduce Uni-TabRL, a unified RL framework that leverages decomposed rewards (validity, structure, and content) to provide stable and informative optimization signals. Finally, at the evaluation level, we present TableVerse-5K, a large-scale, challenging benchmark encompassing diverse, real-world table scenarios. Extensive experiments demonstrate the state-of-the-art performance of StrucTab across all evaluated public benchmarks and significant improvements on TableVerse-5K, validating the effectiveness of explicit structural modeling and decomposed reward optimization. Code and benchmark are publicly available at https://github.com/VirtualLUOUCAS/StrucTab.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces StrucTab, a table parsing model that decomposes the task into human-inspired subtasks (row-column counting, merged-cell analysis) unified via sequential reasoning. It proposes Uni-TabRL, an RL framework using decomposed rewards (validity, structure, content) for stable optimization signals, and introduces the TableVerse-5K benchmark for diverse real-world tables. The central claim is that this yields SOTA performance on public benchmarks and significant improvements on TableVerse-5K.

Significance. If the experimental results hold, the combination of explicit intermediate structural supervision and decomposed RL rewards could improve reliability of VLM-based table parsing over direct supervision, particularly for complex layouts. The public release of code and the new benchmark adds value for the community by enabling reproducibility and further testing.

major comments (2)
  1. [Abstract] Abstract: The assertion of 'state-of-the-art performance' and 'significant improvements' on TableVerse-5K is load-bearing for the contribution, yet the text supplies no quantitative results, baselines, error bars, ablation details, or dataset statistics to support these claims.
  2. [Abstract] Abstract: The claim that Uni-TabRL's decomposed rewards (validity/structure/content) supply 'stable and informative optimization signals' superior to direct supervision is central, but no analysis, alignment check, or empirical verification is provided that the linear combination avoids reward conflicts or high-variance gradients in merged-cell or irregular-layout cases.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'human-inspired subtasks' is used without elaboration on how the decomposition is operationalized or how the sequential reasoning strategy is implemented in the model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the abstract's support for our central claims. We will revise accordingly by incorporating quantitative results and additional analysis on reward decomposition.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion of 'state-of-the-art performance' and 'significant improvements' on TableVerse-5K is load-bearing for the contribution, yet the text supplies no quantitative results, baselines, error bars, ablation details, or dataset statistics to support these claims.

    Authors: We agree the abstract would benefit from explicit quantitative support. In the revision we will add concise numerical results (e.g., absolute gains on TableVerse-5K versus baselines), reference the error bars and ablation tables already present in Sections 4–5, and include brief dataset statistics for TableVerse-5K.

    revision: yes

  2. Referee: [Abstract] Abstract: The claim that Uni-TabRL's decomposed rewards (validity/structure/content) supply 'stable and informative optimization signals' superior to direct supervision is central, but no analysis, alignment check, or empirical verification is provided that the linear combination avoids reward conflicts or high-variance gradients in merged-cell or irregular-layout cases.

    Authors: The main experiments already show performance gains from the decomposed rewards. To directly address the request for verification, we will add a short analysis (new subsection or appendix) that reports reward-component alignment statistics and gradient-variance measurements on merged-cell and irregular-layout subsets, comparing the linear combination against direct-supervision baselines.

    revision: yes
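One way the promised reward-component alignment statistics could be computed is sketched below. This is an assumption about what such an analysis might look like, not the authors' procedure: `pearson` and `alignment_report` are hypothetical helpers, the rollout values are made up, and gradient variance is left out because it would require the model itself.

```python
from statistics import mean, pvariance

KEYS = ("validity", "structure", "content")


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def alignment_report(rollouts):
    """Pairwise correlations and variances over a batch of reward vectors."""
    cols = {k: [r[k] for r in rollouts] for k in KEYS}
    pairs = {
        (a, b): pearson(cols[a], cols[b])
        for i, a in enumerate(KEYS)
        for b in KEYS[i + 1:]
    }
    totals = [sum(r[k] for k in KEYS) for r in rollouts]
    return {
        "correlation": pairs,
        "variance": {k: pvariance(v) for k, v in cols.items()},
        "sum_variance": pvariance(totals),
    }


# Made-up batch in which structure and content pull in opposite directions.
batch = [
    {"validity": 1.0, "structure": 0.2, "content": 0.8},
    {"validity": 1.0, "structure": 0.4, "content": 0.6},
    {"validity": 1.0, "structure": 0.6, "content": 0.4},
    {"validity": 1.0, "structure": 0.8, "content": 0.2},
]
report = alignment_report(batch)
```

A strongly negative correlation between two components, as in this contrived batch, is the reward conflict the referee asks about: the summed reward barely varies even though the individual components do.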

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experiments

full rationale

The paper introduces StrucTab with decomposed subtasks and Uni-TabRL with decomposed rewards (validity/structure/content) as novel contributions, then validates via experiments on public benchmarks plus the new TableVerse-5K. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. Central claims reduce to reported experimental outcomes rather than any self-referential construction. This is the standard non-circular case for an applied ML framework paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach implicitly relies on standard VLM and RL assumptions whose details are not stated.



Reference graph

Works this paper leans on

69 extracted references · 39 canonical work pages · 18 internal anchors

  1. [1] Bai, S., Cai, Y., et al.: Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631 (2025)
  2. [2] Bai, S., Chen, K., Liu, X., et al.: Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923 (2025)
  3. [3] Blecher, L., Cucurull Preixens, G., Scialom, T., Stojnic, R.: Nougat: Neural optical understanding for academic documents. In: Proceedings of the International Conference on Learning Representations (2024)
  4. [4] Chen, X., Li, S., Zhu, X., Chen, Y., Yang, F., Fang, C., Qu, L., Xu, X., Wei, H., Wu, M.: Logics-Parsing Technical Report. arXiv preprint arXiv:2509.19760 (2025)
  5. [5] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
  6. [6] Cui, C., Sun, T., Liang, S., Gao, T., Zhang, Z., Liu, J., Wang, X., Zhou, C., Liu, H., Lin, M., et al.: PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528 (2025)
  7. [7] Cui, C., Sun, T., Lin, M., Gao, T., Zhang, Y., Liu, J., Wang, X., Zhang, Z., Zhou, C., Liu, H., et al.: PaddleOCR 3.0 Technical Report. arXiv preprint arXiv:2507.05595 (2025)
  8. [8] Du, Y., Chen, P., Ying, X., Chen, Z.: DocPTBench: Benchmarking end-to-end photographed document parsing and translation. arXiv preprint arXiv:2511.18434 (2025)
  9. [9] Du, Y., Chen, Z., Su, Y., Jia, C., Jiang, Y.G.: Instruction-guided scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(4), 2723–2738 (2025)
  10. [10] Du, Y., Chen, Z., Xie, H., Jia, C., Jiang, Y.G.: SVTRv2: CTC beats encoder-decoder models in scene text recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 20147–20156 (2025)
  11. [11] Du, Y., Li, C., Guo, R., Cui, C., Liu, W., Zhou, J., Lu, B., Yang, Y., Liu, Q., Hu, X., et al.: PP-OCRv2: Bag of tricks for ultra lightweight OCR system. arXiv preprint arXiv:2109.03144 (2021)
  12. [12] Fu, L., Kuang, Z., Song, J., Huang, M., Yang, B., Li, Y., Zhu, L., Luo, Q., Wang, X., Lu, H., et al.: OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv preprint arXiv:2501.00321 (2024)
  13. [13] Huang, Y., Lu, N., Chen, D., Li, Y., Xie, Z., Zhu, S., Gao, L., Peng, W.: Improving table structure recognition with visual-alignment sequential coordinate modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023)
  14. [14] Khang, M., Hong, T.: TFLOP: table structure recognition framework with layout pointer mechanism. arXiv preprint arXiv:2501.11800 (2025)
  15. [15] Lai, X., Tian, Z., Chen, Y., Yang, S., Peng, X., Jia, J.: Step-DPO: Step-wise preference optimization for long-chain reasoning of LLMs. arXiv preprint arXiv:2406.18629 (2024)
  16. [16] Lei, F., Meng, J., Huang, Y., Chen, T., Zhang, Y., He, S., Zhao, J., Liu, K.: Reasoning-Table: Exploring reinforcement learning for table reasoning. arXiv preprint arXiv:2506.01710 (2025)
  17. [17] Li, G., Lyu, P., Zhang, C., Shen, H., Wu, L., Wan, X., Zeng, G., Hu, H., Ma, C., Zhou, Y.: Towards real-world document parsing via realistic scene synthesis and document-aware training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 23709–23719 (2026)
  18. [18] Li, G., Peng, S., Wan, X., Zhang, C., Feng, H., Xu, X., Wu, P., Li, B., Ding, Z., Liu, Y., Ye, Y., Yang, Y., Shu, Z., Yan, G., Li, Z., Ma, C., Wang, W., Zhou, Y., Hu, H.: Chronicles-OCR: A cross-temporal perception benchmark for the evolutionary trajectory of Chinese characters. arXiv preprint arXiv:2605.11960 (2026)
  19. [19] Li, Y., Yang, G., Liu, H., Wang, B., Zhang, C.: dots.ocr: Multilingual document layout parsing in a single vision-language model. arXiv preprint arXiv:2512.02498 (2025)
  20. [20] Li, Z., Wei, J., Shen, Z., Ma, C., Wu, Y., Zhou, Y.: PACM: Position-aware cross-modality decoder for handwritten mathematical expression recognition. In: Proceedings of the International Conference on Document Analysis and Recognition. pp. 96–114 (2025)
  21. [21] Li, Z., Liu, Y., Liu, Q., Ma, Z., Zhang, Z., Zhang, S., Guo, Z., Zhang, J., Wang, X., Bai, X.: MonkeyOCR: Document parsing with a Structure-Recognition-Relation triplet paradigm. arXiv preprint arXiv:2506.05218 (2025)
  22. [22] Lin, W., Sun, Z., Ma, C., Li, M., Wang, J., Sun, L., Huo, Q.: TSRFormer: Table structure recognition with transformers. In: Proceedings of the ACM International Conference on Multimedia (2022)
  23. [23] Liu, H., Li, X., Liu, B., Jiang, D., Liu, Y., Ren, B.: Neural collaborative graph machines for table structure recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4533–4542 (2022)
  24. [24] Liu, Y., Zhao, Z., Tian, L., Wang, H., Ye, X., You, Y., Yu, Z., Wu, C., Xiao, Z., Yu, Y., et al.: POINTS-Reader: Distillation-free adaptation of vision-language models for document conversion. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1576–1601 (2025)
  25. [25] Long, R., Wang, W., Xue, N., Gao, F., Yang, Z., Wang, Y., Xia, G.S.: Parsing table structures in the wild. In: Proceedings of the IEEE International Conference on Computer Vision (2021)
  26. [26] Lyu, J., Wang, W., Yang, D., Zhong, J., Zhou, Y.: Arbitrary reading order scene text spotter with local semantics guidance. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 5919–5927 (2025)
  27. [27] Nassar, A., Livathinos, N., Lysak, M., Staar, P.: TableFormer: Table structure understanding with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2022)
  28. [28] Niu, J., Liu, Z., Gu, Z., Wang, B., Ouyang, L., Zhao, Z., Chu, T., He, T., Wu, F., Zhang, Q., et al.: MinerU2.5: A decoupled vision-language model for efficient high-resolution document parsing. arXiv preprint arXiv:2509.22186 (2025)
  29. [29] OpenAI: GPT-4o System Card (2024), https://cdn.openai.com/gpt-4o-system-card.pdf, accessed: 2026-04-20
  30. [30] Ouyang, L., Qu, Y., Zhou, H., Zhu, J., Zhang, R., Lin, Q., Wang, B., Zhao, Z., Jiang, M., Zhao, X., et al.: OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025)
  31. [31] Paruchuri, V.: Marker (2025), https://github.com/datalab-to/marker, accessed: 2026-04-20
  32. [32] Peng, S., Li, G., Wan, X., Zhang, C., Feng, H., Wu, B., Shen, H., Wang, W., Cai, Z., Tian, Z., Hu, H., Ma, C., Zhou, Y.: ChartArena: Benchmarking chart parsing across languages, scenarios, and formats. arXiv preprint arXiv:2606.01348 (2026)
  33. [33] Peng, S., Wang, W., Tian, Z., Yang, S., Wu, X., Xu, H., Zhang, C., Isobe, T., Hu, B., Zhang, M.: Uni-DPO: A unified paradigm for dynamic preference optimization of LLMs. arXiv preprint arXiv:2506.10054 (2025)
  34. [34] Peng, S., Yang, S., Jiang, L., Tian, Z.: Mitigating object hallucinations via sentence-level early intervention. In: Proceedings of the IEEE International Conference on Computer Vision (2025)
  35. [35] Peng, S., Chakravarthy, A., Lee, S., Wang, X., Balasubramaniyan, R., Chau, D.H.: UniTable: Towards a unified framework for table recognition via self-supervised pretraining. arXiv preprint arXiv:2403.04822 (2024)
  36. [36] Poznanski, J., Rangapur, A., Borchardt, J., Dunkelberger, J., Huff, R., Lin, D., Wilhelm, C., Lo, K., Soldaini, L.: olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv preprint arXiv:2502.18443 (2025)
  37. [37] Poznanski, J., Soldaini, L., Lo, K.: olmOCR 2: Unit test rewards for document OCR. arXiv preprint arXiv:2510.19817 (2025)
  38. [38] Raja, S., Mondal, A., Jawahar, C.: Table structure recognition using top-down and bottom-up cues. In: Proceedings of the European Conference on Computer Vision. pp. 70–86 (2020)
  39. [39] Seed, B.: Seed1.8 Model Card: Towards generalized real-world agency (2025), https://github.com/ByteDance-Seed/Seed-1.8/blob/main/Seed-1.8-Modelcard.pdf, accessed: 2026-04-20
  40. [40] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
  41. [41] Shen, H., Gao, X., Wei, J., Qiao, L., Zhou, Y., Li, Q., Cheng, Z.: Divide Rows and Conquer Cells: Towards structure recognition for large tables. In: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. pp. 1369–1377 (2023)
  42. [42] Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: OpenAI GPT-5 System Card. arXiv preprint arXiv:2601.03267 (2025)
  43. [43] Taghadouini, S., Cavaillès, A., Aubertin, B.: LightOnOCR: A 1B end-to-end multilingual vision-language model for state-of-the-art OCR. arXiv preprint arXiv:2601.14251 (2026)
  44. [44] Team, H.V., Lyu, P., Wan, X., Li, G., Peng, S., Wang, W., Wu, L., Shen, H., Zhou, Y., Tang, C., et al.: HunyuanOCR Technical Report. arXiv preprint arXiv:2511.19575 (2025)
  45. [45] Team, K., Bai, T., Bai, Y., Bao, Y., Cai, S., Cao, Y., Charles, Y., Che, H., Chen, C., Chen, G., et al.: Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276 (2026)
  46. [46] Wang, A.L., Tang, J., Liao, L., Feng, H., Liu, Q., Fei, X., Lu, J., Wang, H., Liu, H., Liu, Y., et al.: WildDoc: How far are we from achieving comprehensive and robust document understanding in the wild? In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (2025)
  47. [47] Wang, B., Wu, B., Li, W., Fang, M., Huang, Z., Huang, J., Wang, H., Liang, Y., Chen, L., Chu, W., et al.: Infinity Parser: Layout aware reinforcement learning for scanned document parsing. arXiv preprint arXiv:2506.03197 (2025)
  48. [48] Wang, B., Xu, C., Zhao, X., Ouyang, L., Wu, F., Zhao, Z., Xu, R., Liu, K., Qu, Y., Shang, F., et al.: MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839 (2024)
  49. [49] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
  50. [50] Wei, H., Kong, L., Chen, J., Zhao, L., Ge, Z., Yang, J., Sun, J., Han, C., Zhang, X.: Vary: Scaling up the vision vocabulary for large vision-language models. In: Proceedings of the European Conference on Computer Vision (2024)
  51. [51] Wei, H., Liu, C., Chen, J., Wang, J., Kong, L., Xu, Y., Ge, Z., Zhao, L., Sun, J., Peng, Y., et al.: General OCR theory: Towards OCR-2.0 via a unified end-to-end model. arXiv preprint arXiv:2409.01704 (2024)
  52. [52] Wei, H., Sun, Y., Li, Y.: DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234 (2025)
  53. [53] Wei, H., Sun, Y., Li, Y.: DeepSeek-OCR 2: Visual causal flow. arXiv preprint arXiv:2601.20552 (2026)
  54. [54] Xing, H., Gao, F., Long, R., Bu, J., Zheng, Q., Li, L., Yao, C., Yu, Z.: LORE: Logical location regression network for table structure recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence (2023)
  55. [55] Yang, F., Hu, L., Liu, X., Huang, S., Gu, Z.: A large-scale dataset for end-to-end table recognition in the wild. Scientific Data 10(1), 110 (2023)
  56. [56] Yang, S., Huang, Q., Yuan, J., Zha, L., Tang, K., Yang, Y., Wang, N., Wei, Y., Li, L., Ye, W., et al.: TableGPT-R1: Advancing tabular reasoning through reinforcement learning. arXiv preprint arXiv:2512.20312 (2025)
  57. [57] Yang, Z., Chen, L., Cohan, A., Zhao, Y.: Table-R1: Inference-time scaling for table reasoning. arXiv preprint arXiv:2505.23621 (2025)
  58. [58] Yang, Z., Tang, J., Li, Z., Wang, P., Wan, J., Zhong, H., Liu, X., Yang, M., Wang, P., Bai, S., et al.: CC-OCR: A comprehensive and challenging OCR benchmark for evaluating large multimodal models in literacy. In: Proceedings of the IEEE International Conference on Computer Vision (2025)
  59. [59] Zhang, J., Liu, Y., Wu, Z., Pang, G., Ye, Z., Zhong, Y., Ma, J., Wei, T., Xu, H., Chen, W., et al.: MonkeyOCR v1.5 Technical Report: Unlocking robust document parsing for complex patterns. arXiv preprint arXiv:2511.10390 (2025)
  60. [60] Zhang, J., Wang, B., Zhang, Q., Wu, F., Wen, Z., Lu, J., Shan, J., Zhao, Z., Yang, S., Wang, Z., et al.: TRivia: Self-supervised fine-tuning of vision-language models for table recognition. arXiv preprint arXiv:2512.01248 (2025)
  61. [61] Zhang, M., Press, O., Merrill, W., Liu, A., Smith, N.A.: How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534 (2023)
  62. [62] Zhang, Q., Wang, B., Huang, V.S.J., Zhang, J., Wang, Z., Liang, H., He, C., Zhang, W.: Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction. arXiv preprint arXiv:2410.21169 (2024)
  63. [63] Zhang, Z., Hu, P., Ma, J., Du, J., Zhang, J., Yin, B., Yin, B., Liu, C.: SEMv2: Table separation line detection based on instance segmentation. Pattern Recognition 149, 110279 (2024)
  64. [64] Zhang, Z., Zhang, J., Du, J., Wang, F.: Split, embed and merge: An accurate table structure recognizer. Pattern Recognition 126, 108565 (2022)
  65. [65] Zhao, W., Feng, H., Liu, Q., Tang, J., Wei, S., Wu, B., Liao, L., Ye, Y., Liu, H., Zhou, W., et al.: TabPedia: Towards comprehensive visual table understanding with concept synergy. In: Proceedings of Advances in Neural Information Processing Systems. vol. 37, pp. 7185–7212 (2024)
  66. [66] Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global Table Extractor (GTE): A framework for joint table identification and cell structure recognition using visual context. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (2021)
  67. [67] Zhong, W., Feng, X., Zhao, L., Li, Q., Huang, L., Gu, Y., Ma, W., Xu, Y., Qin, B.: Investigating and mitigating the multimodal hallucination snowballing in large vision-language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2024)
  68. [68] Zhong, X., ShafieiBavani, E., Jimeno Yepes, A.: Image-based table recognition: data, model, and evaluation. In: Proceedings of the European Conference on Computer Vision (2020)
  69. [69] Zhong, Y., Chen, L., Zeng, Z., Zhao, X., Jiang, D., Zheng, L., Huang, J., Qiu, H., Shi, P., Yang, S., et al.: Reading or Reasoning? format decoupled reinforcement learning for document OCR. arXiv preprint arXiv:2601.08834 (2025)