LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Adrien Cavaillès; Baptiste Aubertin; Said Taghadouini

arxiv: 2601.14251 · v2 · pith:FHURX5KGnew · submitted 2026-01-20 · 💻 cs.CV




keywords: model, end-to-end, images, multilingual, pdfs, release, state-of-the-art, under



We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
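
The abstract mentions IoU-based rewards over normalized bounding boxes but does not spell out the reward. As a rough illustration only, the sketch below computes plain intersection-over-union for two boxes. The `(x_min, y_min, x_max, y_max)` layout, the `[0, 1]` coordinate range, and the function name `iou` are assumptions made for this example, not details taken from the paper.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max), each coordinate in [0, 1].
Box = Tuple[float, float, float, float]


def iou(pred: Box, target: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (hypothetical helper)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(pred[0], target[0])
    iy_min = max(pred[1], target[1])
    ix_max = min(pred[2], target[2])
    iy_max = min(pred[3], target[3])

    # Clamp at zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    # Clamp areas as well, so a malformed box (max < min) counts as empty.
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_target = max(0.0, target[2] - target[0]) * max(0.0, target[3] - target[1])

    union = area_pred + area_target - inter
    return inter / union if union > 0.0 else 0.0
```

A score like this lies in `[0, 1]` and can be checked automatically against a reference box, which is the kind of signal a verifiable-reward setup needs. How the paper turns IoU into a reward (thresholds, matching of multiple boxes, penalties) is not stated in the abstract.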



Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

End-to-End Text Line Detection and Ordering
cs.CV 2026-06 unverdicted novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.
METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition
cs.CV 2026-05 unverdicted novelty 7.0

METATR is a new benchmark dataset and evaluation framework for ATR covering 29 languages, multiple scripts and layouts, with standardized prompting and a dynamic extensible protocol.
TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction
cs.CL 2026-04 unverdicted novelty 7.0

A 2B-parameter model trained with RL on verifiable LaTeX unit tests produces more compilable page-to-LaTeX reconstructions than prior OCR systems across structural and compilation metrics.
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
cs.CL 2026-04 unverdicted novelty 7.0

GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.
StrucTab: A Structured Optimization Framework for Table Parsing
cs.CV 2026-06 unverdicted novelty 6.0

StrucTab achieves SOTA table parsing performance by unifying structural subtasks through sequential reasoning and using decomposed RL rewards in Uni-TabRL, plus a new TableVerse-5K benchmark.
Cross-Temporal Sinhala OCR: Page-Level Adaptation and Diachronic Analysis
cs.CL 2026-06 unverdicted novelty 6.0

New Sinhala OCR dataset from 1981-2019 legislative acts enables LightOnOCR-2-1B to reach 1.05% CER, beating Surya-OCR, Tesseract, and Google Document AI.
Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions
cs.CL 2026-05 unverdicted novelty 6.0

VLMs generate plausible but visually ungrounded OCR output for Ancient Greek editions, with model-specific prior reliance revealed by image perturbations and conditional decoding analysis.
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV 2026-05 unverdicted novelty 6.0

RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV 2026-05 unverdicted novelty 5.0

RTPrune introduces a reading-twice inspired two-stage pruning technique for DeepSeek-OCR that retains 84.25% tokens while delivering 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
cs.CV 2026-05 unverdicted novelty 4.0

RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.