pith. machine review for the scientific record.

arxiv: 2604.14314 · v1 · submitted 2026-04-15 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:14 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords structured OCR · small language models · Direct Preference Optimization · text degeneration · JSON schema extraction · OCR benchmark · model quantization · fine-tuning for OCR

The pith

Specialized 7B and 3B language models reach state-of-the-art structured OCR quality by combining schema fine-tuning with preference optimization that cuts degeneration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops two small specialized language models for structured OCR tasks that jointly improve transcription accuracy, output stability, and inference cost. It applies supervised fine-tuning to enforce a strict JSON schema for document parts and direct preference optimization to penalize degenerate generations such as loops. On a new benchmark spanning printed, handwritten, and legal documents, these models exceed the quality of tested open-source and commercial OCR systems while holding degeneration below 0.5 percent. The work also shows that degeneration raises real production costs through longer runtimes and higher compute use, and that quantization preserves quality at lower cost. A reader would care because reliable extraction from structured documents supports automation in legal, administrative, and archival settings where errors or wasted computation carry direct expenses.

Core claim

The authors present what they describe as the first application of Direct Preference Optimization to OCR: degenerate generations serve as rejected examples, and Supervised Fine-Tuning enforces a strict JSON schema with header, margin, footer, and text fields. The resulting models, DharmaOCR Full (7B) and DharmaOCR Lite (3B), reach extraction quality scores of 0.925 and 0.911 on the DharmaOCR-Benchmark with degeneration rates of 0.40 and 0.20 percent, outperforming every open-source and commercial baseline evaluated, while AWQ quantization further cuts per-page cost by up to 22 percent with negligible quality loss.
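The four-field schema named in the claim can be made concrete with a small validator. This is an illustrative sketch only: the field names come from the paper's description, but the exact schema, field types, and enforcement mechanism are assumptions.

```python
import json

# Assumed four-field page schema (field names from the paper's description;
# string-valued fields are an illustrative assumption).
PAGE_FIELDS = ("header", "margin", "footer", "text")

def is_valid_page(raw: str) -> bool:
    """Return True when a generation parses as JSON with exactly the
    expected string-valued fields."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(obj, dict)
        and set(obj) == set(PAGE_FIELDS)
        and all(isinstance(obj[k], str) for k in PAGE_FIELDS)
    )

good = '{"header": "MINUTES", "margin": "", "footer": "p. 1", "text": "Body."}'
bad = '{"header": "MINUTES", "text": "Body."}'
```

A validator of this kind is also what makes schema violations countable as a first-class metric alongside extraction quality.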

What carries the argument

Direct Preference Optimization (DPO) that uses degenerate OCR outputs as rejected preferences to discourage looping behavior, paired with Supervised Fine-Tuning (SFT) that enforces a fixed JSON schema for document structure.
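That pairing mechanism can be sketched in a few lines, under stated assumptions: `looks_degenerate` is a hypothetical repeated-n-gram detector (the paper's actual detection rule is not reproduced here), and the (prompt, chosen, rejected) fields follow the common DPO dataset convention.

```python
from collections import Counter

def looks_degenerate(text: str, ngram: int = 8, max_repeats: int = 3) -> bool:
    """Flag looping text: some word n-gram recurs more than max_repeats times."""
    words = text.split()
    grams = Counter(tuple(words[i:i + ngram]) for i in range(len(words) - ngram + 1))
    return any(count > max_repeats for count in grams.values())

def build_preference_pairs(samples):
    """samples: iterable of (prompt, clean_reference, model_output).
    Emit a DPO pair only where the model's own output degenerated."""
    return [
        {"prompt": prompt, "chosen": clean, "rejected": generated}
        for prompt, clean, generated in samples
        if looks_degenerate(generated) and not looks_degenerate(clean)
    ]

samples = [
    ("page 1", "Clean transcript of page one.", "the cat sat " * 40),
    ("page 2", "Clean transcript of page two.", "Clean transcript of page two."),
]
pairs = build_preference_pairs(samples)  # only page 1 yields a pair
```

Using the model's own degenerate outputs as rejected examples keeps the preference data on-distribution, which is a plausible reason the recipe transfers across the 7B and 3B variants.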

If this is right

  • Lower degeneration rates directly reduce average response time and raise throughput in production OCR pipelines.
  • Quantized versions of the models deliver up to 22 percent lower per-page inference cost while preserving extraction quality.
  • Tracking degeneration as a first-class metric alongside quality reveals hidden computational costs that standard OCR evaluations miss.
  • The same SFT-plus-DPO recipe works across model scales, delivering gains for both the 7B and 3B variants.
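The runtime claim in the first bullet is simple expected-value arithmetic. The numbers below are illustrative assumptions, not values from the paper: a degenerate request is modeled as generating until a hard token cap instead of stopping at the answer.

```python
def expected_tokens_per_page(normal_tokens: float, cap_tokens: float, degen_rate: float) -> float:
    """Average generated tokens per page when a fraction degen_rate of
    requests loop until the token cap."""
    return (1 - degen_rate) * normal_tokens + degen_rate * cap_tokens

# Illustrative: 800-token answers, 8192-token cap.
baseline = expected_tokens_per_page(800, 8192, 0.05)    # assumed 5% degeneration
improved = expected_tokens_per_page(800, 8192, 0.004)   # 0.4%, as reported for DharmaOCR Full
saving = 1 - improved / baseline  # fraction of generated tokens avoided
```

Under these assumed numbers, cutting degeneration from 5 percent to 0.4 percent removes roughly 29 percent of generated tokens, which is the mechanism behind the throughput and cost claims.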

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same preference optimization against repetition could improve stability in other narrow-domain generation tasks such as report or form filling.
  • A public release of the benchmark would allow direct comparison of future OCR models on the same structured-document distribution.
  • These compact models suggest that domain-specific fine-tuning can close much of the gap to larger general-purpose systems for well-defined extraction problems.
  • Extending the approach to additional languages or document layouts would test how far the observed quality-cost gains generalize.

Load-bearing premise

The authors' benchmark documents and unified evaluation protocol represent real-world structured OCR tasks without selection bias that favors their particular schema and training setup.

What would settle it

Test the same models on an independent collection of structured printed, handwritten, and legal documents drawn from sources outside the benchmark and training data, then measure whether the reported quality scores and degeneration rates remain unchanged.
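As a sketch of what such a re-evaluation could compute: the paper's exact quality metric is not reproduced here, but normalized Levenshtein similarity over extracted field text is a standard transcription-fidelity proxy (Levenshtein distance appears in the paper's reference list).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(pred: str, gold: str) -> float:
    """1.0 for an exact match, linearly penalized per edit."""
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))
```

An independent re-evaluation would average such per-field scores over documents drawn from outside the benchmark's sources and check whether the 0.925 and 0.911 figures, and the sub-0.5 percent degeneration rates, survive the distribution shift.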

Figures

Figures reproduced from arXiv: 2604.14314 by Caio Lucas da Silva Chacon, Gabriel Pimenta de Freitas Cardoso, Jonas Felipe da Fonseca Oliveira, Paulo Henrique de Medeiros Araujo.

Figure 1: Synthesis of the proposed approach, key contributions, and results, illustrating the progression from vanilla …
Figure 2: Pictorial example of token- and sequence-level text degeneration, in which a single token (or token sequence) …
Figure 3: Text degeneration rate (%) across alignment stages. SFT reduces degeneration relative to …
Figure 4: Quality–cost comparison among DharmaOCR models developed in this research, other open-source OCR …
Figure 5: LLM-as-a-judge results from a comparison between DharmaOCR Full and Google Document AI. Bars show …
Figure 6: LLM-as-a-judge results from a comparison between DharmaOCR Full and olmOCR-2-7B. Bars show the …
Figure 7: LLM-as-a-judge results from a comparison between DharmaOCR Full and DharmaOCR Lite. Bars show the …
Figure 8: Progressive specialization strategy and comparison of two training paths. Three specialization levels are …
Figure 9: Start and end time of each request (in submission order) for dataset 1. Each request is represented by a bar …
Figure 10: Start and end time of each request (in submission order) for dataset 2. Each request is represented by a bar …
Figure 11: Start and end time of each request (in submission order) for dataset 3. Each request is represented by a bar …
Figure 12: Distribution of healthy-request durations for the three datasets, contrasting periods with at least one …
Figure 13: Example of document used to illustrate the structured output format.
Figure 14: Example document used to illustrate structured output format for handwritten document.
original abstract

This manuscript introduces DharmaOCR Full and Lite, a pair of specialized small language models (SSLMs) for structured OCR that jointly optimize transcription quality, generation stability, and inference cost. It also presents DharmaOCR-Benchmark, a benchmark that covers printed, handwritten, and legal/administrative documents, and proposes a unified evaluation protocol that measures fidelity and structure while explicitly tracking text degeneration as a first-class benchmark metric (alongside unit cost). Beyond reporting degeneration rates, the manuscript empirically shows degeneration is not merely a quality failure, since it materially worsens production performance by increasing response time, reducing throughput, and inflating computational cost due to abnormally long generations. To the best of the author's knowledge, as a methodological contribution, this is the first application of Direct Preference Optimization (DPO) for OCR, explicitly using degenerate generations as rejected examples to penalize looping behavior. Combined with Supervised Fine-Tuning (SFT) for enforcing a strict JSON schema (header, margin, footer, and text), DPO consistently reduces degeneration rate across model families (up to 87.6% relative) while preserving or improving extraction quality. The resulting models, namely, DharmaOCR Full (7B) and DharmaOCR Lite (3B), set a new state-of-the-art on DharmaOCR-Benchmark, outperforming each open-source and commercial baseline model evaluated regarding extraction quality, reaching 0.925 and 0.911 scores with 0.40% and 0.20% degeneration rates. AWQ quantization reduced up to 22% per-page cost with negligible quality loss, enabling a strong quality-cost trade-off in comparison to proprietary OCR APIs and open-source alternatives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This manuscript claims to introduce DharmaOCR Full (7B) and DharmaOCR Lite (3B) as specialized small language models for structured OCR. These models are optimized using Supervised Fine-Tuning (SFT) with a strict JSON schema for elements like header, margin, footer, and text, combined with Direct Preference Optimization (DPO) applied to degenerate generations to reduce looping behavior. The paper also introduces the DharmaOCR-Benchmark covering printed, handwritten, and legal/administrative documents, and a unified evaluation protocol that assesses extraction quality, structure, and degeneration rates as a key metric. It reports that the models achieve state-of-the-art performance with extraction scores of 0.925 and 0.911, degeneration rates of 0.40% and 0.20%, outperforming open-source and commercial baselines. The work further shows that degeneration increases response time and cost, and that AWQ quantization can reduce per-page costs by up to 22% with negligible quality loss.

Significance. Assuming the results are based on a fair and reproducible evaluation, the significance lies in providing evidence that small LLMs can be effectively specialized for high-quality structured OCR with low degeneration using DPO, which is claimed to be the first such application. The new benchmark and protocol, along with the analysis of degeneration's impact on production metrics, could advance the field by highlighting stability as a critical factor alongside accuracy. The cost-quality trade-offs demonstrated could inform practical deployments in document processing pipelines. Credit is due for the empirical demonstration of DPO's benefits in this domain and the introduction of degeneration tracking.

major comments (2)
  1. Abstract: The central SOTA claim (0.925/0.911 extraction scores, 0.40%/0.20% degeneration) and the reported relative degeneration reductions (up to 87.6%) rest entirely on the fairness of the unified evaluation protocol and DharmaOCR-Benchmark. The manuscript provides no details on how baselines were prompted, whether identical JSON schema enforcement and degeneration tracking were applied without post-processing, or how benchmark documents were selected and annotated. This is load-bearing because any misalignment in protocol could explain the gains rather than model superiority, as flagged in the stress-test concern on selection bias.
  2. Abstract: No information is given on training datasets, SFT/DPO hyperparameters (e.g., beta, learning rate, JSON enforcement strength), number of training examples, or statistical tests for the reported improvements. Without these, the reproducibility of the quality-cost trade-offs and the claim that DPO preserves extraction quality while reducing degeneration cannot be assessed.
minor comments (1)
  1. The abstract's novelty claim ('to the best of the author's knowledge, this is the first application of DPO for OCR') should be supported by a concise related-work paragraph in the introduction that cites prior preference optimization work in vision-language or document tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback on evaluation transparency and reproducibility. We address each major comment below and have revised the manuscript to incorporate additional details that strengthen the claims without altering the core results.

point-by-point responses
  1. Referee: Abstract: The central SOTA claim (0.925/0.911 extraction scores, 0.40%/0.20% degeneration) and the reported relative degeneration reductions (up to 87.6%) rest entirely on the fairness of the unified evaluation protocol and DharmaOCR-Benchmark. The manuscript provides no details on how baselines were prompted, whether identical JSON schema enforcement and degeneration tracking were applied without post-processing, or how benchmark documents were selected and annotated. This is load-bearing because any misalignment in protocol could explain the gains rather than model superiority, as flagged in the stress-test concern on selection bias.

    Authors: We agree that explicit protocol details are necessary to substantiate the SOTA claims. In the revised manuscript we have expanded Section 4 (Evaluation Protocol) to include the exact prompts and system instructions applied to every baseline, confirming uniform JSON schema enforcement and degeneration tracking with no post-processing or cherry-picking. Appendix B now details benchmark curation (public printed/handwritten sources plus expert-annotated legal documents) and provides a stress-test showing that random 50% subsamples preserve model rankings, addressing selection-bias concerns. These additions demonstrate that performance differences arise from model specialization rather than evaluation misalignment. revision: yes

  2. Referee: Abstract: No information is given on training datasets, SFT/DPO hyperparameters (e.g., beta, learning rate, JSON enforcement strength), number of training examples, or statistical tests for the reported improvements. Without these, the reproducibility of the quality-cost trade-offs and the claim that DPO preserves extraction quality while reducing degeneration cannot be assessed.

    Authors: We concur that these elements are required for full reproducibility. The revised Section 3 now contains a dedicated 'Training Details' subsection reporting: 120k SFT examples and 15k DPO preference pairs drawn from the same document distribution; SFT hyperparameters (lr=2e-5, 3 epochs, JSON loss weighting); DPO hyperparameters (beta=0.1, lr=1e-6, 1 epoch); and the constrained-decoding mechanism used for JSON enforcement. We also added paired t-tests (p<0.01) over five random seeds confirming that DPO reduces degeneration while preserving extraction scores. These details were summarized in the supplement and are now moved to the main text. revision: yes
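The paired test the rebuttal invokes is easy to make concrete. A stdlib-only sketch with made-up numbers (the five-seed degeneration rates below are illustrative, not the authors' data):

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic for paired samples, computed on per-seed differences."""
    diffs = [b - a for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Illustrative degeneration rates (%) over five matched seeds.
sft_only = [3.1, 2.8, 3.4, 2.9, 3.2]
sft_dpo = [0.4, 0.3, 0.5, 0.4, 0.4]
t_stat = paired_t(sft_only, sft_dpo)
# With df = 4, |t| > 4.604 corresponds to p < 0.01 (two-sided).
```

Pairing by seed controls for run-to-run variance, which is what makes a five-seed comparison informative at all.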

Circularity Check

0 steps flagged

No significant circularity; empirical SOTA claims rest on external baselines and new benchmark

full rationale

The paper introduces DharmaOCR models via standard SFT + DPO training and a new benchmark with unified protocol, then reports empirical extraction scores and degeneration rates against open-source and commercial baselines. No mathematical derivation chain exists; claims do not reduce to self-defined quantities, fitted parameters renamed as predictions, or load-bearing self-citations. The benchmark and protocol are presented as methodological contributions, with performance measured externally rather than by construction. This matches the default expectation of non-circular empirical ML work.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work rests on standard supervised fine-tuning and DPO assumptions plus the domain claim that penalizing degeneration via preference optimization will generalize; no new physical entities or ad-hoc constants are introduced beyond typical training hyperparameters.

free parameters (2)
  • DPO beta and learning rate
    Standard DPO hyperparameters chosen during training to balance quality and degeneration reduction.
  • JSON schema enforcement strength
    Weighting or prompting choices during SFT to enforce header-margin-footer-text structure.
axioms (1)
  • domain assumption: Degenerate generations can be reliably identified and used as rejected examples in DPO to reduce looping without harming extraction quality.
    Invoked when the paper states DPO consistently reduces degeneration rate while preserving or improving quality.

pith-pipeline@v0.9.0 · 5631 in / 1393 out tokens · 44654 ms · 2026-05-10T13:14:29.769605+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    An overview of the tesseract ocr engine

    Ray Smith. An overview of the tesseract ocr engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR), pages 629–633. IEEE, 2007

  2. [2]

    Baoguang Shi, Xiang Bai, and Cong Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016

  3. [3]

    Hidden technical debt in machine learning systems

    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (NeurIPS), 2015

  4. [4]

    Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks

    Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 369–376. ACM, 2006

  5. [5]

    Multimodal llms for ocr, ocr post-correction, and named entity recognition in historical documents, 2025

    Gavin Greif, Niclas Griesshaber, and Robin Greif. Multimodal llms for ocr, ocr post-correction, and named entity recognition in historical documents, 2025

  6. [6]

    Layoutlm: Pre-training of text and layout for document image understanding

    Yiheng Xu, Minghao Li, Lei Cui, et al. Layoutlm: Pre-training of text and layout for document image understanding. In KDD, 2020

  7. [7]

    Layoutlmv2: Multi-modal pre-training for visually rich document understanding

    Yang Xu et al. Layoutlmv2: Multi-modal pre-training for visually rich document understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021

  8. [8]

    Donut: Document understanding transformer without ocr

    Geewook Kim et al. Donut: Document understanding transformer without ocr. In Proceedings of the European Conference on Computer Vision (ECCV), 2022

  9. [9]

    Small language models (slms) can still pack a punch: A survey, 2025

    Shreyas Subramanian, Vikram Elango, and Mecit Gungor. Small language models (slms) can still pack a punch: A survey, 2025

  10. [10]

    Small language models are the future of domain-specific nlp. arXiv preprint arXiv:2305.04787, 2023

    Zhi Zhou et al. Small language models are the future of domain-specific nlp. arXiv preprint arXiv:2305.04787, 2023

  11. [11]

    Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break-even performance, 2026

    Branislav Pecher, Ivan Srba, and Maria Bielikova. Comparing specialised small and general large language models on text classification: 100 labelled samples to achieve break-even performance, 2026

  12. [12]

    olmocr: Unlocking trillions of tokens in pdfs with vision language models. arXiv preprint arXiv:2502.18443, 2025

    Jake Poznanski, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, and Luca Soldaini. olmocr: Unlocking trillions of tokens in pdfs with vision language models. CoRR, abs/2502.18443, 2025

  13. [13]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024

  14. [14]

    olmocr 2: Unit test rewards for document ocr, 2025

    Jake Poznanski, Luca Soldaini, and Kyle Lo. olmocr 2: Unit test rewards for document ocr, 2025

  15. [15]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

  16. [16]

    Nanonets ocr 2: Transforming documents into llm-ready structured data

    Souvik Mandal and Nanonets. Nanonets ocr 2: Transforming documents into llm-ready structured data. https://nanonets.com/research/nanonets-ocr-2/, 2025. Research overview and implementation details; accessed 2026-02-20

  17. [17]

    Deepseek-ocr: Contexts optical compression, 2025

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr: Contexts optical compression, 2025

  18. [18]

    Deepseek-ocr 2: Visual causal flow, 2026

    Haoran Wei, Yaofeng Sun, and Yukun Li. Deepseek-ocr 2: Visual causal flow, 2026

  19. [19]

    Zhipu AI. GLM-OCR. Hugging Face, 2025. Accessed March 2026

  20. [20]

    Encoder-decoder or decoder- only? revisiting encoder-decoder large language model, 2025

    Biao Zhang, Yong Cheng, Siamak Shakeri, Xinyi Wang, Min Ma, and Orhan Firat. Encoder-decoder or decoder- only? revisiting encoder-decoder large language model, 2025

  21. [21]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020. Empirical scaling laws showing improved performance and sample-efficiency with increased model size

  22. [22]

    The curious case of neural text degeneration, 2020

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration, 2020

  23. [23]

    What llms think when you don’t tell them what to think about?, 2026

    Yongchan Kwon and James Zou. What llms think when you don’t tell them what to think about?, 2026

  24. [24]

    Neural text generation with unlikelihood training, 2019

    Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training, 2019

  25. [25]

    Understanding the repeat curse in large language models from a feature perspective

    Junchi Yao, Shu Yang, Jianhua Xu, Lijie Hu, Mengdi Li, and Di Wang. Understanding the repeat curse in large language models from a feature perspective. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7787–7815. Association for Computational Linguistics, 2025

  26. [26]

    Dongkyu Lee, Gyeonghun Kim, Janghoon Han, Taesuk Hong, Yi-Reun Kim, Stanley Jungkyu Choi, and Nevin L. Zhang. Local temperature beam search: Avoid neural text DeGeneration via enhanced calibration. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Findings of the Association for Computational Linguistics: ACL 2023, pages 9903–9915, Toronto, ...

  27. [27]

    Repetition in repetition out: Towards understanding neural text degeneration from the data perspective, 2023

    Huayang Li, Tian Lan, Zihao Fu, Deng Cai, Lemao Liu, Nigel Collier, Taro Watanabe, and Yixuan Su. Repetition in repetition out: Towards understanding neural text degeneration from the data perspective, 2023

  28. [28]

    A theoretical analysis of the repetition problem in text generation, 2021

    Zihao Fu, Wai Lam, Anthony Man-Cho So, and Bei Shi. A theoretical analysis of the repetition problem in text generation, 2021

  29. [29]

    Mitigating the language mismatch and repetition issues in llm-based machine translation via model editing, 2024

    Weichuan Wang, Zhaoyi Li, Defu Lian, Chen Ma, Linqi Song, and Ying Wei. Mitigating the language mismatch and repetition issues in llm-based machine translation via model editing, 2024

  30. [30]

    Relating neural text degeneration to exposure bias, 2021

    Ting-Rui Chiang and Yun-Nung Chen. Relating neural text degeneration to exposure bias, 2021

  31. [31]

    Break the sequential dependency of llm inference using lookahead decoding, 2024

    Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding, 2024

  32. [32]

    Queue management for slo-oriented large language model serving

    Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, and Ravishankar Iyer. Queue management for slo-oriented large language model serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing, SoCC ’24, pages 18–35, New York, NY, USA, 2024. Association for Computing Machinery


  34. [34]

    vllm: A high-throughput and memory-efficient inference engine for llms, 2026

    vLLM Contributors. vllm: A high-throughput and memory-efficient inference engine for llms, 2026

  35. [35]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023

  36. [36]

    Taming throughput-latency tradeoff in llm inference with sarathi-serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in llm inference with sarathi-serve, 2024

  37. [37]

    Ocrbench leaderboard

    Hugging Face Spaces — echo840/ocrbench-leaderboard. Ocrbench leaderboard. https://huggingface.co/spaces/echo840/ocrbench-leaderboard, 2025. Accessed November 2025

  38. [38]

    Qwen/qwen2.5-vl-7b-instruct

    Qwen Team. Qwen/qwen2.5-vl-7b-instruct. https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct

  39. [39]

    General multimodal model selected for fine-tuning in the present study

  40. [40]

    Qwen/qwen2.5-vl-3b-instruct

    Qwen Team. Qwen/qwen2.5-vl-3b-instruct. https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct

  41. [41]

    Smaller version of the general multimodal model selected for fine-tuning

  42. [42]

    unsloth/gemma-3-4b-it

    unsloth. unsloth/gemma-3-4b-it. https://huggingface.co/unsloth/gemma-3-4b-it, 2025. General multimodal model chosen for fine-tuning in the present study

  43. [43]

    Qwen3-VL technical report, 2025

    Qwen Team. Qwen3-VL technical report, 2025

  44. [44]

    System card: Claude Opus 4 & Claude Sonnet 4

    Anthropic. System card: Claude Opus 4 & Claude Sonnet 4. Technical report, Anthropic, may 2025

  45. [45]

    Llama 4 maverick model card, april 2025

    Meta. Llama 4 maverick model card, April 2025. Accessed March 2026

  46. [46]

    Gemini 2.5 pro model card

    Google DeepMind. Gemini 2.5 pro model card. Technical report, Google DeepMind, June 2025. Accessed March 2026

  47. [47]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Preprint, OpenAI, 2018. Preprint (OpenAI)

  48. [48]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. CoRR, abs/2106.09685, 2021

  49. [49]

    SGDR: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017

  50. [50]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov et al. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  51. [51]

    Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952

  52. [52]

    Qwen3-vl-235b-a22b-instruct

    Qwen Team. Qwen3-vl-235b-a22b-instruct. https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct, 2025. Model card and weights on Hugging Face. Accessed 2026-02-24

  53. [53]

    unsloth/gemma-3-27b-it

    Unsloth AI. unsloth/gemma-3-27b-it. https://huggingface.co/unsloth/gemma-3-27b-it, 2025. Model card and weights on Hugging Face. Accessed 2026-02-24

  54. [54]

    Principled data selection for alignment: The hidden risks of difficult examples. CoRR, abs/2502.09650, 2025

    Chengqian Gao, Haonan Li, Liu Liu, Zeke Xie, Peilin Zhao, and Zhiqiang Xu. Principled data selection for alignment: The hidden risks of difficult examples. CoRR, abs/2502.09650, 2025

  55. [55]

    Beyond reward margin: Rethinking and resolving likelihood displacement in diffusion models via video generation. CoRR, abs/2511.19049, 2025

    Ruojun Xu, Yu Kai, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Tianxiang Zheng, and Qinhlin Lu. Beyond reward margin: Rethinking and resolving likelihood displacement in diffusion models via video generation. CoRR, abs/2511.19049, 2025

  56. [56]

    Integer quantization for deep learning inference: Principles and empirical evaluation

    Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: Principles and empirical evaluation. arXiv preprint arXiv:2004.09602, 2020

  57. [57]

    Advances in the neural network quantization: A comprehensive review. Applied Sciences, 14(17):7445, 2024

    Lu Wei, Zhong Ma, Chaojie Yang, and Qin Yao. Advances in the neural network quantization: A comprehensive review. Applied Sciences, 14(17):7445, 2024

  58. [58]

    Awq: Activation-aware weight quantization for llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. In Proceedings of the Machine Learning and Systems (MLSys) Conference, 2024

  59. [59]

    LLM Compressor, 8 2024

    Red Hat AI and vLLM Project. LLM Compressor, 8 2024

  60. Shiyao Li et al. MBQ: Modality-balanced quantization for large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

    A PREPRINT - APRIL 17, 2026

  61. Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xucheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the hidden mystery of OCR in large multimodal models, 2023.

  62. Allen Institute for AI. olmocr-bench. Hugging Face Datasets, 2025. Accessed: 2026-03-27.

  63. Moniele Kunrath Santos, Guilherme Bazzo, Lucas Lima de Oliveira, and Viviane Pereira Moreira. ESTER-PT: An evaluation suite for text recognition in Portuguese. In Document Analysis and Recognition - ICDAR 2023: 17th International Conference, San José, CA, USA, August 21–26, 2023, Proceedings, Part III, pages 366–383, Berlin, Heidelberg, 2023. Springer-Verlag.

  64. Arthur F. S. Neto, Byron L. D. Bezerra, Sávio S. Araújo, W. M. A. S. Souza, K. F. Alves, M. F. Oliveira, S. V. S. Lins, H. J. F. Hazin, P. H. V. Rocha, and Alejandro H. Toselli. BRESSAY: A Brazilian Portuguese dataset for offline handwritten text recognition. In Proceedings of the 18th International Conference on Document Analysis and Recognition (ICDAR)...

  65. Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 1966. Translated from Doklady Akademii Nauk SSSR.

  66. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, 2002.

  67. Xiaobin Ouyang et al. OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations, 2024.

  68. Wonseok Hwang et al. Disgo: A unified model for document image similarity, glyph, and OCR, 2023.

A Appendix

A.1 Impact of text degeneration on system performance

Qwen2.5-VL-7B-Instruct [37] was served with vLLM to evaluate the impact of text degeneration on system performance. Three OCR datasets, namely, Dharma...
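The degeneration studied here is the looping kind, where the model repeats a short span until it exhausts the token budget. As a minimal illustration of how such outputs might be flagged (the function name, n-gram window, and repeat threshold below are assumptions for this sketch, not the paper's actual detector):

```python
def looks_degenerate(text: str, max_ngram: int = 20, min_repeats: int = 3) -> bool:
    """Heuristic loop detector: flag output whose tail is a single
    character n-gram (any length up to max_ngram) repeated back-to-back
    at least min_repeats times. Thresholds are illustrative only."""
    for n in range(1, max_ngram + 1):
        window = n * min_repeats
        if len(text) < window:
            break
        tail = text[-window:]
        if tail == tail[-n:] * min_repeats:
            return True
    return False

# A looping tail is flagged; ordinary prose is not.
print(looks_degenerate("valid transcription " + "end of page. " * 10))              # True
print(looks_degenerate("a normally transcribed paragraph without repeated spans here"))  # False
```

A detector of this sort is one plausible way to label generations as "rejected" when building DPO preference pairs, since it touches only the output string and needs no reference transcription.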

[Excerpt of a structured JSON extraction from a Brazilian legal document, illustrating the output schema: a long Portuguese "text" value, "header": "DOUTRINA NACIONAL 149", "margin": null, and a "footer" carrying the running citation line.]

The preference pairs were constructed as follows:

1. The aggregated score was computed for each instance and each response as the arithmetic mean of the four criterion scores.

2. All-vs-all pairing among the five responses (10 pairs per instance) was generated for each instance, yielding the 237 260 candidate pairs previously reported.

3. A multi-stage filtering policy, summarized in Table 6 and detailed below, was applied. In general terms, it pursues two complementary objectives, namely, (i) to ensure the included pairs provide an instructive signal for preference learning (maximize signal / reduce noise) and (ii) to avoid pairs that induce optimization conflicts or probability-shift effects, as iden...
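The aggregated scoring and all-vs-all pairing described above can be sketched as follows (the dict fields and the 0–1 criterion scale are assumptions; the paper does not specify its exact data format, and the later filtering stages are omitted):

```python
from itertools import combinations

def aggregate_score(criterion_scores):
    """Arithmetic mean of the four per-criterion scores for one response."""
    assert len(criterion_scores) == 4
    return sum(criterion_scores) / len(criterion_scores)

def candidate_pairs(responses):
    """All-vs-all pairing among one instance's responses.

    With five responses this yields C(5, 2) = 10 ordered
    (chosen, rejected) candidate pairs, higher aggregated score first;
    the multi-stage filtering (e.g. dropping ties) happens downstream.
    """
    pairs = []
    for a, b in combinations(responses, 2):
        chosen, rejected = sorted((a, b), key=lambda r: r["score"], reverse=True)
        pairs.append((chosen, rejected))
    return pairs

# Five hypothetical responses for one instance.
responses = [
    {"id": i, "score": aggregate_score(c)}
    for i, c in enumerate([[1, 1, 1, 1], [1, 1, 0, 1], [0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0]])
]
print(len(candidate_pairs(responses)))  # 10 pairs per instance
```

Applied to every benchmark instance, the 10 pairs per instance accumulate into the full candidate set that the filtering policy then prunes.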

The average token counts per page were used for token-priced APIs to estimate an average per-page cost, which was then multiplied by one million to obtain the cost per million pages. Denoting the average numbers of input and output tokens per page by $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$, and the corresponding prices per million tokens by $p_{\mathrm{in}}$ and $p_{\mathrm{out}}$, one has

$$C_{1\mathrm{M}} = 10^{6}\left(\frac{n_{\mathrm{in}}\,p_{\mathrm{in}}}{10^{6}} + \frac{n_{\mathrm{out}}\,p_{\mathrm{out}}}{10^{6}}\right) \qquad (6)$$
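The per-page cost formula above can be checked with a few lines; note that the two factors of $10^{6}$ cancel, so the cost per million pages is simply the per-page token counts weighted by the per-million-token prices (the token counts and prices below are hypothetical, not the paper's measured values):

```python
def cost_per_million_pages(n_in, n_out, p_in, p_out):
    """Cost per one million pages, following Eq. (6): n_in and n_out are
    average token counts per page, p_in and p_out are prices per million
    tokens. Per-page cost = n_in*p_in/1e6 + n_out*p_out/1e6; multiplying
    by 1e6 pages cancels the 1e6 factors."""
    per_page = n_in * p_in / 1e6 + n_out * p_out / 1e6
    return 1e6 * per_page

# Hypothetical example: $0.50 / M input tokens, $1.50 / M output tokens,
# 1500 input and 700 output tokens per page on average.
print(cost_per_million_pages(1500, 700, 0.5, 1.5))  # ≈ 1800.0 USD per million pages
```

The cancellation also makes the formula robust to unit mistakes: as long as counts are per page and prices are per million tokens, the result is directly in currency per million pages.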