Kosmos-2.5: A multimodal literate model
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
representative citing papers
ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning for VRDU.
A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.
Hybrid semantic-LLM method for reading order reconstruction in Armenian historical newspapers outperforms baselines on a new 66-page dataset while releasing a specialized Tesseract OCR model.
New Sinhala OCR dataset from 1981-2019 legislative acts enables LightOnOCR-2-1B to reach 1.05% CER, beating Surya-OCR, Tesseract, and Google Document AI.
CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
VisShield with OPTIC dataset enables VLMs to localize and mask private text in vision data via instruction tuning for privacy preservation.
MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.
citing papers explorer
- Towards Characterizing Scientific Image Utility and Upgradability
  The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.
- ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
  ShredBench shows state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning for VRDU.
- From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models
  A model-agnostic Geometric Risk Controller reduces extreme errors in VLM-based OCR by requiring cross-view consensus before accepting outputs.
- Semantic-Guided Reading Order Reconstruction in Historical Armenian Newspapers with LLMs
  Hybrid semantic-LLM method for reading order reconstruction in Armenian historical newspapers outperforms baselines on a new 66-page dataset while releasing a specialized Tesseract OCR model.
- Cross-Temporal Sinhala OCR: Page-Level Adaptation and Diachronic Analysis
  New Sinhala OCR dataset from 1981-2019 legislative acts enables LightOnOCR-2-1B to reach 1.05% CER, beating Surya-OCR, Tesseract, and Google Document AI.
- CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
  CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
- Vision Language Model Helps Private Information De-Identification in Vision Data
  VisShield with OPTIC dataset enables VLMs to localize and mask private text in vision data via instruction tuning for privacy preservation.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
  A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.