LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
read the original abstract
We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision--language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9$\times$ smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
This paper has not been read by Pith yet.
Forward citations
Cited by 10 Pith papers
-
End-to-End Text Line Detection and Ordering
Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.
-
METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition
METATR is a new benchmark dataset and evaluation framework for ATR covering 29 languages, multiple scripts and layouts, with standardized prompting and a dynamic extensible protocol.
-
TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction
A 2B-parameter model trained with RL on verifiable LaTeX unit tests produces more compilable page-to-LaTeX reconstructions than prior OCR systems across structural and compilation metrics.
-
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.
-
StrucTab: A Structured Optimization Framework for Table Parsing
StrucTab achieves SOTA table parsing performance by unifying structural subtasks through sequential reasoning and using decomposed RL rewards in Uni-TabRL, plus a new TableVerse-5K benchmark.
-
Cross-Temporal Sinhala OCR: Page-Level Adaptation and Diachronic Analysis
New Sinhala OCR dataset from 1981-2019 legislative acts enables LightOnOCR-2-1B to reach 1.05% CER, beating Surya-OCR, Tesseract, and Google Document AI.
-
Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions
VLMs generate plausible but visually ungrounded OCR output for Ancient Greek editions, with model-specific prior reliance revealed by image perturbations and conditional decoding analysis.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune prunes visual tokens in DeepSeek-OCR via a reading-twice two-stage process, retaining 84.25% tokens for 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune introduces a reading-twice inspired two-stage pruning technique for DeepSeek-OCR that retains 84.25% tokens while delivering 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
-
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
RTPrune delivers 99.47% accuracy and 1.23x faster prefill on OmniDocBench for DeepSeek-OCR-Large by retaining only 84.25% of tokens through a reading-twice inspired two-stage pruning process.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.