SEFD reconstructs SEC filings into MultiMarkdown to create a 152B-token financial pretraining corpus with low overlap to existing data and introduces EDGAR-Forecast and EDGAR-OCR benchmarks.
Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations, 2025
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
RealDocBench supplies 1,356 field-level QA questions over 581 real documents and 1,500 annotated pages, evaluating 18 systems on per-field accuracy, cost, and latency.
Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.
MedStruct-S benchmark shows encoder-only models outperform larger decoder-only ones on key-conditioned QA from noisy OCR clinical reports, with fine-tuned large models winning only when scale is ignored.
FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
Presents RT-DocLayout, a 33M-parameter end-to-end model extending RT-DETR that unifies layout classification, detection, segmentation, and reading-order prediction at 132.1 FPS with claimed SOTA results on public benchmarks.
ABot-OCR is a new end-to-end VLM for direct image-to-Markdown transcription using a custom data engine and structure-constrained RL optimization, reporting SOTA scores of 92.81/93.30 on OmniDocBench v1.5/v1.6.
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.
citing papers explorer
-
The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data
SEFD reconstructs SEC filings into MultiMarkdown to create a 152B-token financial pretraining corpus with low overlap to existing data and introduces EDGAR-Forecast and EDGAR-OCR benchmarks.
-
RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents
RealDocBench supplies 1,356 field-level QA questions over 581 real documents and 1,500 annotated pages, evaluating 18 systems on per-field accuracy, cost, and latency.
-
MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports
MedStruct-S benchmark shows encoder-only models outperform larger decoder-only ones on key-conditioned QA from noisy OCR clinical reports, with fine-tuned large models winning only when scale is ignored.
-
FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR
FinCriticalED benchmark reveals that OCR and MLLM systems frequently fail to preserve critical financial facts such as numbers and monetary units even when lexical accuracy is high.
-
RT-DocLayout: Real-Time End-to-End Document Layout Analysis with Reading Order in the Wild
Presents RT-DocLayout, a 33M-parameter end-to-end model extending RT-DETR that unifies layout classification, detection, segmentation, and reading-order prediction at 132.1 FPS with claimed SOTA results on public benchmarks.
-
ABot-OCR Technical Report
ABot-OCR is a new end-to-end VLM for direct image-to-Markdown transcription using a custom data engine and structure-constrained RL optimization, reporting SOTA scores of 92.81/93.30 on OmniDocBench v1.5/v1.6.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
Qwen2.5-VL Technical Report
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
-
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.