hub Canonical reference

mplug-docowl 1.5: Unified structure learn- ing for ocr-free document understanding

Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, et al · 2024 · arXiv 2403.12895

Canonical reference. 75% of citing Pith papers cite this work as background.

15 Pith papers citing it

Background 75% of classified citations

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 baseline 1 dataset 1 method 1

citation-polarity summary

background 6 baseline 1 use dataset 1

representative citing papers

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

A document is worth a structured record: Principled inductive bias design for document recognition

cs.CV · 2025-07-11 · unverdicted · novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

cs.CV · 2026-03-31 · unverdicted · novelty 7.0

Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

cs.CL · 2026-02-02 · unverdicted · novelty 7.0

Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

cs.CV · 2024-12-31 · accept · novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

cs.CV · 2024-10-22 · accept · novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models

cs.LG · 2026-04-18 · unverdicted · novelty 6.0

SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

cs.CV · 2024-09-03 · unverdicted · novelty 5.0

GOT is a unified end-to-end model that treats all man-made optical signals as characters and handles multiple OCR tasks including formatted output and interactive region recognition via prompts.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

cs.CV · 2024-07-03 · conditional · novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

MinerU: An Open-Source Solution for Precise Document Content Extraction

cs.CV · 2024-09-27 · conditional · novelty 4.0

MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

cs.CV · 2024-04-25 · unverdicted · novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

cs.MM · 2024-10-28 · unverdicted · novelty 3.0

Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

citing papers explorer

Showing 15 of 15 citing papers.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence cs.CL · 2026-05-13 · accept · none · ref 10
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
A document is worth a structured record: Principled inductive bias design for document recognition cs.CV · 2025-07-11 · unverdicted · none · ref 39
Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? cs.CV · 2024-08-23 · conditional · none · ref 25
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models cs.CV · 2026-03-31 · unverdicted · none · ref 13
Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding cs.CL · 2026-02-02 · unverdicted · none · ref 42
Multimodal LLMs process code as images to achieve up to 8x token compression, with visual cues like syntax highlighting aiding tasks and clone detection remaining resilient or even improving under compression.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning cs.CV · 2024-12-31 · accept · none · ref 35
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction cs.CV · 2024-10-22 · accept · none · ref 20
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models cs.LG · 2026-04-18 · unverdicted · none · ref 14
SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 93
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model cs.CV · 2024-09-03 · unverdicted · none · ref 13
GOT is a unified end-to-end model that treats all man-made optical signals as characters and handles multiple OCR tasks including formatted output and interactive region recognition via prompts.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 51
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
MinerU: An Open-Source Solution for Precise Document Content Extraction cs.CV · 2024-09-27 · conditional · none · ref 9
MinerU delivers an open-source pipeline for high-precision document content extraction by integrating specialized models with tuned preprocessing and postprocessing rules.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 33
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction cs.MM · 2024-10-28 · unverdicted · none · ref 82
Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 158
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

mplug-docowl 1.5: Unified structure learn- ing for ocr-free document understanding

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer