Ocr-free document understanding transformer

Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park · 2022

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

Thinking with Drafting: Optical Decompression via Logical Reconstruction

cs.CL · 2026-02-12 · unverdicted · novelty 6.0

Thinking with Drafting reconceptualizes visual reasoning as optical decompression by forcing models to draft mental models into executable DSL code for deterministic self-verification on the VisAlg benchmark.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

cs.CV · 2025-09-26 · unverdicted · novelty 6.0

MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

cs.CV · 2025-01-22 · unverdicted · novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

citing papers explorer

Showing 3 of 3 citing papers.

Thinking with Drafting: Optical Decompression via Logical Reconstruction cs.CL · 2026-02-12 · unverdicted · none · ref 20
Thinking with Drafting reconceptualizes visual reasoning as optical decompression by forcing models to draft mental models into executable DSL code for deterministic self-verification on the VisAlg benchmark.
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing cs.CV · 2025-09-26 · unverdicted · none · ref 15
MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 71
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Ocr-free document understanding transformer

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer