Jacob Cohen
4 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: unclear (1)
citing papers explorer
- ShredBench: Evaluating the Semantic Reasoning Capabilities of Multimodal LLMs in Document Reconstruction
  ShredBench shows that state-of-the-art MLLMs perform well on intact documents but suffer sharp drops in restoration accuracy as fragmentation increases to 8-16 pieces, indicating insufficient cross-modal semantic reasoning for VRDU.
- Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
  Introduces the ProcedureVQA benchmark and the Chain-of-Procedure framework, which improves VLM next-step prediction in procedures by up to 13% over baselines.
- DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
  DocRetriever introduces a framework that uses layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for evaluation.
- Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
  Unveil proposes a visual-textual embedding model for multi-modal documents that is distilled into an efficient visual-only retriever.