PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
Mixed citations
In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5
Mixed citation behavior. Most common role is background (33%).
Years: 2026 (13)
Representative citing papers
AEGIS is a benchmark with 7 academic categories, 39 subtypes, 4 forgery strategies, and multi-dimensional tests showing that leading models like GPT-5.1 achieve only 48.80% overall forensic accuracy on AI-generated academic images.
GlotOCR Bench shows that OCR models perform well on fewer than 10 scripts and fail to generalize beyond about 30, with results tracking pretraining coverage and models hallucinating from known scripts on unfamiliar ones.
ParseBench is a new benchmark for document parsing in AI agents that reveals fragmented performance across five semantic dimensions, with LlamaParse Agentic scoring highest at 84.9%.
The Character Error Vector is a decomposable bag-of-characters evaluator for page-level OCR that remains defined under parsing errors and bridges parsing metrics with local CER.
A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
A parser-oriented refinement stage performs set-level reasoning on detector hypotheses to jointly decide instance retention, refine boxes, and set parser input order, cutting reading order errors to 0.024 on OmniDocBench.
A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
FastOCR dynamically selects a small subset of visual tokens per decoding step using focal-guided pruning and cross-step reuse, retaining 98% accuracy on Qwen2.5-VL while attending to only 5% of tokens and cutting attention latency by 3x.
RTPrune introduces a reading-twice-inspired two-stage pruning technique for DeepSeek-OCR that retains 84.25% of tokens while delivering 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
Frontier multimodal LLMs achieve ~85% accuracy and ~90% weighted F1 on digitizing complex handwritten medical forms, with Gemini 3.1 strongest overall and prompt optimization lifting macro metrics over 60%.