Synthetic Data for Text Localisation in Natural Images

Andrea Vedaldi; Andrew Zisserman; Ankush Gupta

arXiv:1604.06646v1 · pith:XBS6RLO2 · submitted 2016-04-22 · cs.CV

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman

classification cs.CV

keywords images, text, detection, natural, synthetic, engine, fcrn, method

Abstract

In this paper we introduce a new method for text detection in natural images. The method comprises two contributions. First, a fast and scalable engine to generate synthetic images of text in clutter. This engine overlays synthetic text onto existing background images in a natural way, accounting for the local 3D scene geometry. Second, we use the synthetic images to train a Fully-Convolutional Regression Network (FCRN) which efficiently performs text detection and bounding-box regression at all locations and multiple scales in an image. We discuss the relation of FCRN to the recently introduced YOLO detector, as well as other end-to-end object detection systems based on deep learning. The resulting detection network significantly outperforms current methods for text detection in natural images, achieving an F-measure of 84.2% on the standard ICDAR 2013 benchmark. Furthermore, it can process 15 images per second on a GPU.
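To make "detection and bounding-box regression at all locations" concrete, the following is a minimal sketch of how a dense grid of per-cell predictions could be decoded into image-space boxes. It is not the authors' implementation: the abstract does not give the output parametrisation, so the five-value cell format `(confidence, dx, dy, width, height)`, the `stride` parameter, the `threshold` default, and the names `Box` and `decode_grid` are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Box(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    score: float


# One prediction per grid cell: (confidence, dx, dy, width, height).
# dx and dy are offsets of the box centre from the cell centre, in pixels.
# This parametrisation is an assumption made for the sketch.
CellPrediction = Tuple[float, float, float, float, float]


def decode_grid(
    grid: Sequence[Sequence[CellPrediction]],
    stride: int,
    threshold: float = 0.5,
) -> List[Box]:
    """Turn a dense grid of per-cell predictions into image-space boxes."""
    boxes: List[Box] = []
    for row, cells in enumerate(grid):
        for col, (conf, dx, dy, w, h) in enumerate(cells):
            if conf < threshold:
                continue  # this cell does not claim a text instance
            # Centre of the cell in image coordinates, shifted by the offset.
            cx = (col + 0.5) * stride + dx
            cy = (row + 0.5) * stride + dy
            boxes.append(
                Box(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, conf)
            )
    return boxes


if __name__ == "__main__":
    empty = (0.0, 0.0, 0.0, 0.0, 0.0)
    # Hypothetical 2x2 output grid with one confident cell.
    grid = [
        [empty, empty],
        [empty, (0.9, 2.0, -1.0, 20.0, 8.0)],
    ]
    for box in decode_grid(grid, stride=16):
        print(box)
```

Under these assumptions, handling multiple scales would amount to running the same decoding on predictions from several resized copies of the image and mapping the boxes back to the original resolution; overlapping boxes would then typically be merged with a step such as non-maximum suppression, which is omitted here.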

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models
cs.CV 2026-06 unverdicted novelty 6.0

FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting
cs.CV 2026-05 unverdicted novelty 6.0

SAME-Net adds a differentiable soft attention mask embedding module to achieve rectification-free end-to-end scene text spotting with 84.02% H-mean on Total-Text.

Improving Performance of End-to-End ASR on Numeric Sequences
eess.AS 2019-07 unverdicted novelty 4.0

TTS-generated numeric training data plus a compact neural denormalizer improve E2E ASR word error rates on numeric sequences by up to a factor of 8 for the longest cases.