COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

Andreas Veit; Jiri Matas; Lukas Neumann; Serge Belongie; Tomas Matera

arxiv: 1601.07140 · v2 · pith:F36ORSKRnew · submitted 2016-01-26 · 💻 cs.CV

keywords text · dataset · images · recognition · coco-text · detection · natural · analysis

This paper describes the COCO-Text dataset. In recent years, large-scale datasets such as SUN and ImageNet have driven advances in scene understanding and object recognition. The goal of COCO-Text is to advance the state of the art in text detection and recognition in natural images. The dataset is based on the MS COCO dataset, which contains images of complex everyday scenes. The images were not collected with text in mind and thus contain a broad variety of text instances. To reflect the diversity of text in natural scenes, we annotate text with (a) location in terms of a bounding box, (b) fine-grained classification into machine-printed and handwritten text, (c) classification into legible and illegible text, (d) script of the text, and (e) transcriptions of legible text. The dataset contains over 173k text annotations in over 63k images. We provide a statistical analysis of the accuracy of our annotations. In addition, we present an analysis of three leading state-of-the-art photo Optical Character Recognition (OCR) approaches on our dataset. While scene text detection and recognition have seen strong advances in recent years, we identify significant shortcomings that motivate future work.
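
The five annotation attributes listed in the abstract can be pictured as a single record per text instance. The following is a minimal sketch only: the class name, field names, the (x, y, width, height) box layout, and the string values for the printed/handwritten class are all assumptions made for illustration, not the dataset's actual file format or API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical labels for attribute (b); the real dataset may encode these differently.
TEXT_CLASSES = ("machine_printed", "handwritten")


@dataclass
class TextAnnotation:
    """Illustrative record mirroring attributes (a)-(e) from the abstract."""

    bbox: Tuple[float, float, float, float]  # (a) location; (x, y, width, height) is an assumed layout
    text_class: str                          # (b) machine printed vs. handwritten
    legible: bool                            # (c) legible vs. illegible
    script: str                              # (d) script of the text
    transcription: Optional[str] = None      # (e) only legible text is transcribed

    def __post_init__(self) -> None:
        if self.text_class not in TEXT_CLASSES:
            raise ValueError(f"unknown text class: {self.text_class!r}")
        # The abstract states that transcriptions are given for legible text,
        # so this sketch rejects a transcription on an illegible instance.
        if not self.legible and self.transcription is not None:
            raise ValueError("illegible text should not carry a transcription")


def legible_transcriptions(annotations: List[TextAnnotation]) -> List[str]:
    """Collect the transcriptions of all legible, transcribed instances."""
    return [
        a.transcription
        for a in annotations
        if a.legible and a.transcription is not None
    ]
```

A recognition benchmark would only score instances returned by a filter such as `legible_transcriptions`, while a detection benchmark could use every record's `bbox` regardless of legibility.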

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting
cs.CV 2026-05 unverdicted novelty 7.0

StyleText is a new large-scale dataset and benchmark for stylized scene text inpainting, constructed via an automated pipeline and paired with a FluxFill+LoRA baseline that improves OCR accuracy.
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
cs.CV 2025-06 unverdicted novelty 7.0

AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
cs.CV 2024-12 accept novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
cs.CV 2023-05 accept novelty 6.0

OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.
From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models
cs.CV 2026-04 unverdicted novelty 5.0

HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.
NVIDIA Nemotron 3: Efficient and Open Intelligence
cs.CL 2025-12 unverdicted novelty 5.0

NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
Common Inpainted Objects In-N-Out of Context
cs.CV 2025-05 unverdicted novelty 5.0

COinCO is a new dataset of inpainted COCO images with in- and out-of-context objects, enabling context reasoning, object prediction from scenes, and improved fake image detection.
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
cs.CV 2024-09 unverdicted novelty 5.0

GOT is a unified end-to-end model that treats all man-made optical signals as characters and handles multiple OCR tasks including formatted output and interactive region recognition via prompts.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA 3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
cs.MM 2024-10 unverdicted novelty 3.0

Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.
Understanding Deep Learning Techniques for Image Segmentation
cs.CV 2019-07 unverdicted novelty 1.0

A 2019 survey that categorizes and intuitively explains major deep learning techniques for image segmentation, progressing from classical methods to modern neural architectures.