pith. sign in

arxiv: 2604.22880 · v1 · submitted 2026-04-24 · 💻 cs.CL

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Pith reviewed 2026-05-08 11:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords document OCRLaTeX reconstructioncompilable LaTeXPDF to LaTeXreinforcement learningstructural fidelityscientific documentsverifiable rewards
0
0 comments X

The pith

Reinforcement learning with LaTeX unit-test rewards produces more structurally consistent and compilable PDF-to-LaTeX reconstructions than supervised fine-tuning alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that standard document OCR approaches discard the structural and executable properties required for scientific LaTeX, resulting in outputs that break section order, float placement, and cross-references. It introduces TexOCR-Bench to jointly score transcription accuracy, structural faithfulness, and end-to-end compilability, plus a large training corpus for the task. The authors then train a 2B model with supervised fine-tuning followed by reinforcement learning whose rewards come directly from unit tests that check whether the generated LaTeX compiles and maintains referential integrity. If the central claim holds, automated conversion of research PDFs into reusable, editable source code would become reliable enough to support archiving, editing, and downstream scientific workflows without constant manual repair.

Core claim

Existing frontier OCR systems frequently violate key document invariants when reconstructing scientific PDFs as LaTeX, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability. Training a 2B-parameter model on the introduced TexOCR-Train corpus with supervised fine-tuning followed by reinforcement learning that uses verifiable rewards from LaTeX unit tests yields consistent gains over SFT alone, especially on structural and compilation metrics, as measured by the multi-dimensional suite in TexOCR-Bench.

What carries the argument

Reinforcement learning with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity.

If this is right

  • Reconstructed documents maintain ordered sections and proper float placement at higher rates than SFT-only models.
  • Cross-reference and label integrity improves, enabling navigation and editing without broken links.
  • End-to-end compilability rates rise, reducing the need for post-processing fixes in scientific publishing pipelines.
  • Downstream usability increases for tasks that require valid, executable LaTeX rather than plain text output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The verifiable-reward technique could transfer to other markup or code-generation tasks where syntactic validity provides an automatic success signal.
  • Combining unit-test rewards with limited human review might address edge cases the current tests miss.
  • Larger training corpora built the same way could further narrow the remaining gap to human-level reconstruction quality.

Load-bearing premise

The LaTeX unit tests used for rewards and the multi-dimensional evaluation suite in TexOCR-Bench fully capture the invariants and usability properties required for real-world scientific documents without additional human judgment or domain-specific edge cases.

What would settle it

A collection of recent scientific papers with held-out ground-truth LaTeX sources, run through the trained model and checked for whether compilation success rate and reference validity exceed those achieved by SFT-only or other frontier baselines on the identical set.

Figures

Figures reproduced from arXiv: 2604.22880 by Chengye Wang, Lin Fu, Yilun Zhao, Zexi Kuang.

Figure 1
Figure 1. Figure 1: Overview of our work, covering data synthesis, view at source ↗
Figure 2
Figure 2. Figure 2: Effect of group size K in RLVR. We report the average benchmark score as a function of K, with shaded regions indicating the standard deviation across evaluation samples. Bench to evaluate executable, structure-faithful re￾construction at scale, and TEXOCR-Train to en￾able training with page-aligned supervision. Using unit-test style, verifiable rewards, we show that optimizing directly for functional corr… view at source ↗
Figure 3
Figure 3. Figure 3: Example of paragraph truncation and missing text. view at source ↗
Figure 4
Figure 4. Figure 4: Example of paragraph truncation and missing text. view at source ↗
Figure 5
Figure 5. Figure 5: Example of mathematical formula mismatches. view at source ↗
Figure 6
Figure 6. Figure 6: Example of mathematical formula mismatches. view at source ↗
Figure 7
Figure 7. Figure 7: Example of table structure corruption view at source ↗
Figure 8
Figure 8. Figure 8: Example of table structure corruption view at source ↗
Figure 9
Figure 9. Figure 9: Example of citation and reference errors. view at source ↗
Figure 10
Figure 10. Figure 10: Example of citation and reference errors. view at source ↗
Figure 11
Figure 11. Figure 11: Example of compilation failures. Error Type: Compilation Failures Original Latex AI can play a moderator role to make human feedback more constructive with a positive tone. \\textbf{Implication \\#3:} To capitalize on the potential of AI tools in SE processes, future AI tools should incorporate personalization features that align feedback delivery with individual preferences. In addition, AI can play the … view at source ↗
Figure 12
Figure 12. Figure 12: Example of compilation failures view at source ↗
read the original abstract

Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that existing document OCR systems for scientific PDFs frequently violate key invariants (consistent section structure, correct float placement, valid label-reference links), leading to poor compilability. It introduces TexOCR-Bench, a multi-dimensional benchmark assessing transcription fidelity, structural faithfulness, and end-to-end compilability, along with TexOCR-Train, a large-scale corpus. A 2B-parameter model (TexOCR) is trained via SFT followed by RL with rewards from independent LaTeX unit tests; experiments across 21 frontier models show RL yields consistent gains over SFT on structural and compilation metrics.

Significance. If the results hold, the work advances OCR toward producing executable LaTeX rather than plain text or Markdown, addressing a practical need in scientific publishing. The use of verifiable rewards from LaTeX unit tests is a clear strength, providing objective, non-circular signals for both training and evaluation. The benchmark and corpus are potentially valuable community resources. The significance is reduced by the lack of evidence that the automated metrics align with semantic fidelity or real-world usability, but the core technical approach is sound and extensible.

major comments (3)
  1. [§5 (Experiments)] §5 (Experiments): The claims of consistent RL improvements over SFT on structural and compilation metrics across 21 models are presented without reporting train/test splits for TexOCR-Train or TexOCR-Bench, any statistical significance tests, or controls for post-hoc model selection. This is load-bearing for the central experimental claim and leaves the reliability of the gains unclear.
  2. [§3 (TexOCR-Bench)] §3 (TexOCR-Bench): The multi-dimensional evaluation suite relies on automated unit tests for invariants and compilability, but provides no human evaluation, inter-annotator agreement, or correlation analysis between the metrics and expert judgments of semantic correctness or scientific readability. This directly affects the adequacy of the proxies for the claimed downstream usability.
  3. [§4 (Training Methodology)] §4 (Training Methodology): While the RL rewards are derived from independent LaTeX unit tests (a positive design choice avoiding circularity), the manuscript gives only high-level descriptions of the tests for section consistency, float placement, and label-reference links. Concrete specification of the test suite is needed to assess coverage and enable reproduction.
minor comments (2)
  1. [Abstract] The abstract states results on '21 frontier models' but does not list them or their selection criteria; adding a table or appendix reference would improve transparency.
  2. Ensure first-use definitions for acronyms (SFT, RL) and new terms (TexOCR-Bench, TexOCR-Train) appear consistently in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the paper's experimental rigor and reproducibility. We address each major comment point-by-point below, with clear indications of planned revisions.

read point-by-point responses
  1. Referee: §5 (Experiments): The claims of consistent RL improvements over SFT on structural and compilation metrics across 21 models are presented without reporting train/test splits for TexOCR-Train or TexOCR-Bench, any statistical significance tests, or controls for post-hoc model selection. This is load-bearing for the central experimental claim and leaves the reliability of the gains unclear.

    Authors: We agree that explicit reporting of splits, statistical tests, and selection controls is necessary to support the reliability of the RL gains. In the revised manuscript, we will add these details to §5: TexOCR-Train uses an 80/20 document-ID-based split to avoid leakage, while TexOCR-Bench is a fully held-out set of recent arXiv pages; we will report p-values from paired statistical tests (e.g., Wilcoxon signed-rank) for all RL-vs-SFT comparisons across the 21 models; and we will clarify that hyperparameter tuning occurred on a validation subset prior to test-set evaluation, with no post-hoc selection on test results. revision: yes

  2. Referee: §3 (TexOCR-Bench): The multi-dimensional evaluation suite relies on automated unit tests for invariants and compilability, but provides no human evaluation, inter-annotator agreement, or correlation analysis between the metrics and expert judgments of semantic correctness or scientific readability. This directly affects the adequacy of the proxies for the claimed downstream usability.

    Authors: We acknowledge that the automated proxies, while directly tied to the invariants highlighted in the paper, would benefit from explicit discussion of their relation to human judgments of usability. In the revision we will expand §3 with a dedicated limitations subsection that includes qualitative examples illustrating how violations detected by the unit tests correspond to semantic and readability issues in reconstructed documents. We will also note the value of future human studies with inter-annotator agreement for broader validation of downstream usability, without claiming such a study is already present. revision: partial

  3. Referee: §4 (Training Methodology): While the RL rewards are derived from independent LaTeX unit tests (a positive design choice avoiding circularity), the manuscript gives only high-level descriptions of the tests for section consistency, float placement, and label-reference links. Concrete specification of the test suite is needed to assess coverage and enable reproduction.

    Authors: We agree that concrete specifications are required for reproducibility. In the revised §4 we will include detailed algorithmic descriptions of each unit test, accompanied by pseudocode and representative LaTeX input/output examples for section-hierarchy consistency, float-placement validation, and label-reference link resolution. These additions will allow readers to assess test coverage and fully reproduce the reward signals. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent unit tests and held-out benchmark

full rationale

The paper trains TexOCR on TexOCR-Train via SFT then RL, where rewards come from separate LaTeX unit tests that enforce compilability and referential integrity. Evaluation occurs on the distinct TexOCR-Bench using a multi-dimensional suite (transcription, structure, compilability). No equations, fitted parameters renamed as predictions, or self-citation chains reduce the reported improvements or invariant violations to quantities defined by the same inputs. Comparative results across 21 models are measured externally rather than by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 2 invented entities

The central claim rests on the new benchmark and corpus being representative and on LaTeX compilation serving as a sufficient proxy for document quality; no external validation of these choices is mentioned.

free parameters (1)
  • 2B model scale
    Chosen architecture size for the TexOCR model.
axioms (1)
  • domain assumption LaTeX unit tests reliably detect violations of document invariants such as section consistency and reference integrity.
    Used to generate verifiable rewards during RL.
invented entities (2)
  • TexOCR-Bench no independent evidence
    purpose: Multi-dimensional benchmark for transcription, structure, and compilability
    Newly introduced evaluation suite.
  • TexOCR-Train no independent evidence
    purpose: Large-scale training corpus for page-to-LaTeX reconstruction
    Newly introduced dataset.

pith-pipeline@v0.9.0 · 5494 in / 1486 out tokens · 78550 ms · 2026-05-08T11:54:18.954601+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

  1. [1]

    Lightonocr: A 1b end-to-end multilingual vision-language model for state-of-the-art ocr.arXiv preprint arXiv:2601.14251,

    Lightonocr: A 1b end-to-end multilingual vision-language model for state-of-the-art ocr.ArXiv, abs/2601.14251. Qwen Team. 2025. Qwen3 technical report.Preprint, arXiv:2505.09388. Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yan- jie Liang, Ling Chen, Wei Chu, and Yuan Qi

  2. [2]

    Infinity parser: Layout aware reinforcement learning for scanned document parsing.Preprint, arXiv:2506.03197. Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weiju Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dong- sheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, and 24 o...

  3. [3]

    Figure 3

    Calamari - a high-performance tensorflow- based deep learning package for optical character recognition.Preprint, arXiv:1807.02004. Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, Yao Hu, Boxiang Zhou, Jian Wu, Yongji Wu, Wenxin Yu, Yingmiao Liu, Yuhao Huang, Manjie Xu, Gang Liu, and 3 oth...

  4. [4]

    If the page is thefirst page of the paper, use \title{} for the title, \author{} for the author information (if any), and \begin{abstract} xxx \end{abstract} for the abstract, then begin the main text using\section

  5. [5]

    Pezeshkpour & Hruschka, 2023

    Do not use any packages or preamble: Output only the LaTeX code corresponding to the body content. If the beginning of the page is the continuation of a previous section, convert that text as well without omission. 4.Handling figures and tables: • Name all figure files asfigure_x.pdf, wherexis the figure’s ID from the original document. • If a figure cont...

  6. [6]

    Keep the original meaning and convert it as is

    Incomplete content at the top or bottom of the page: Do not modify or complete it. Keep the original meaning and convert it as is. 8.If the page contains part of the References section: • Convert all reference entries on the page into BibTeX format only (e.g., @article{Smith_2020, ...} ). • Do not use\bibitem. • The BibTeX entry keys must exactly match th...