2D-CTC for Scene Text Recognition

Cong Yao; Fengming Xie; Xiang Bai; Yibo Liu; Zhaoyi Wan

arxiv: 1907.09705 · v1 · pith:2O23WDZMnew · submitted 2019-07-23 · 💻 cs.CV · cs.CL · cs.LG

Zhaoyi Wan, Fengming Xie, Yibo Liu, Xiang Bai, Cong Yao

Pith reviewed 2026-05-24 17:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG

keywords scene text recognition · 2D-CTC · CTC · irregular text · sequence prediction · computer vision · deep learning

The pith

Extending CTC to two dimensions lets the model treat scene text as 2D image features rather than 1D sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that scene text recognition has been held back by methods that flatten images into 1D signals, even though the text itself occupies a 2D space. By adding a second dimension to the CTC loss and alignment process, the resulting 2D-CTC can focus computation on the actual character locations while ignoring background clutter. This single change is claimed to remove the need for separate attention modules or post-processing while improving accuracy on both straight and curved text. Experiments on IIIT-5K, ICDAR 2015, SVT-Perspective, and CUTE80 are presented as evidence that the approach beats prior methods in accuracy and runs faster at both training and inference time.

Core claim

2D-CTC extends the vanilla CTC model to a second dimension so that it can adaptively concentrate on the most relevant features while suppressing the influence of background clutter and noise. It also handles text of various forms (horizontal, oriented, and curved) within the same formulation and gives more interpretable intermediate predictions. On standard benchmarks, the authors report that the model outperforms state-of-the-art methods on both regular and irregular text and that it trains and runs faster.

What carries the argument

2D-CTC, the two-dimensional extension of Connectionist Temporal Classification that aligns and classifies directly on 2D feature maps.
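For reference, the 1D baseline that 2D-CTC generalizes can be written in a few lines. The sketch below computes the negative log-likelihood of vanilla CTC with the standard forward recursion, using NumPy. It is not the paper's 2D-CTC and is not taken from the authors' code; the function name `ctc_neg_log_likelihood`, the `(T, K)` input layout, and the use of index 0 for the blank are assumptions made for illustration.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under vanilla (1D) CTC.

    log_probs: array of shape (T, K) with per-frame log-probabilities.
    target:    list of class indices, without blanks.
    """
    T = log_probs.shape[0]

    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)

    # alpha[t, s]: log-probability of all partial paths that end in
    # extended position s at frame t
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            candidates = [alpha[t - 1, s]]            # stay on the same symbol
            if s >= 1:
                candidates.append(alpha[t - 1, s - 1])  # advance by one
            # Skipping the blank is allowed only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(candidates) + log_probs[t, ext[s]]

    # A valid path ends on the last label or on the trailing blank
    tail = [alpha[T - 1, S - 1]]
    if S > 1:
        tail.append(alpha[T - 1, S - 2])
    return -np.logaddexp.reduce(tail)
```

The recursion only moves along one axis (the frame index `t`). As the summary above describes it, 2D-CTC performs an analogous alignment over a 2D feature map, so that paths can also move across the height of the map rather than being confined to a single row of frames.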

If this is right

Higher accuracy on both regular and irregular text benchmarks without added components.
Faster training and inference than prior attention-based or 1D-CTC approaches.
More interpretable per-frame predictions that reflect the 2D layout of the text.
Natural handling of horizontal, oriented, and curved text instances in a single model.
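To make "per-frame predictions" concrete, here is a minimal sketch of greedy decoding for vanilla 1D CTC: take the most likely class in each frame, merge repeats, then drop blanks. The function name `ctc_greedy_decode`, the alphabet string, and the blank at index 0 are illustrative assumptions; the paper's 2D decoding procedure is not described in this summary and is not reproduced here.

```python
import numpy as np

def ctc_greedy_decode(log_probs, alphabet, blank=0):
    """Greedy (best-path) decoding for 1D CTC.

    log_probs: array of shape (T, K) with per-frame log-probabilities.
    alphabet:  string of length K; alphabet[blank] is a placeholder symbol.
    """
    best = np.argmax(log_probs, axis=1)  # most likely class in each frame
    chars = []
    prev = blank
    for k in best:
        # Emit a character only when it is not blank and not a repeat
        if k != blank and k != prev:
            chars.append(alphabet[k])
        prev = k
    return "".join(chars)
```

With the alphabet `"-EFR"` (blank first), the per-frame sequence `F F - R E - E` collapses to `FREE`, the transcription used in the paper's Figure 4 caption. The blank between the two `E` frames is what keeps the double letter from being merged.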

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same 2D alignment idea could be tested on other spatial sequence tasks such as handwriting or formula recognition.
If the speed gain holds, 2D-CTC might replace attention layers in lightweight mobile text recognizers.
The interpretability of the 2D paths might help diagnose failure cases on heavily cluttered signs.

Load-bearing premise

The 2D extension of CTC will automatically focus on relevant text regions and handle irregular shapes without any extra modules or post-processing.

What would settle it

Running the released 2D-CTC model on CUTE80 or ICDAR 2015 and finding that its accuracy or speed does not exceed the best published 1D-CTC or attention baselines.
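A check of this kind needs little more than a loop over a labeled benchmark. The sketch below measures word-level accuracy and average wall-clock time per image for any recognizer. The `evaluate` function, the `recognize` callable, and the `(image, ground_truth)` sample format are hypothetical names introduced here; case-insensitive matching is also an assumption, and the paper's exact evaluation protocol should be followed when comparing against published numbers.

```python
import time

def evaluate(recognize, samples):
    """Word accuracy and mean latency of a recognizer on a labeled set.

    recognize: callable mapping an image to a predicted string (hypothetical).
    samples:   list of (image, ground_truth) pairs.
    """
    if not samples:
        raise ValueError("samples must not be empty")

    correct = 0
    start = time.perf_counter()
    for image, truth in samples:
        # Assumption: a word counts as correct on a case-insensitive exact match
        if recognize(image).lower() == truth.lower():
            correct += 1
    elapsed = time.perf_counter() - start

    n = len(samples)
    return {"accuracy": correct / n, "seconds_per_image": elapsed / n}
```

Running the same function over a 2D-CTC model and over 1D-CTC or attention baselines, on the same hardware and the same images, would give directly comparable accuracy and speed figures.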

Figures

Figures reproduced from arXiv: 1907.09705 by Cong Yao, Fengming Xie, Xiang Bai, Yibo Liu, Zhaoyi Wan.

**Figure 2.** Alignment procedures of vanilla CTC and 2D-CTC.

**Figure 3.** Visualization of the intermediate predictions of 2D-CTC. The first rows are input images, the second and last rows …

**Figure 4.** Possible alignments of transcription ‘FREE’ in …

**Figure 5.** Illustration of the network architecture of 2D …

**Figure 6.** Visualization of the recognition result of vanilla …

**Figure 7.** Accuracy and speed of different methods. The …

Original abstract

Scene text recognition has been an important, active research topic in computer vision for years. Previous approaches mainly consider text as 1D signals and cast scene text recognition as a sequence prediction problem, by feat of CTC or attention based encoder-decoder framework, which is originally designed for speech recognition. However, different from speech voices, which are 1D signals, text instances are essentially distributed in 2D image spaces. To adhere to and make use of the 2D nature of text for higher recognition accuracy, we extend the vanilla CTC model to a second dimension, thus creating 2D-CTC. 2D-CTC can adaptively concentrate on most relevant features while excluding the impact from clutters and noises in the background; It can also naturally handle text instances with various forms (horizontal, oriented and curved) while giving more interpretable intermediate predictions. The experiments on standard benchmarks for scene text recognition, such as IIIT-5K, ICDAR 2015, SVP-Perspective, and CUTE80, demonstrate that the proposed 2D-CTC model outperforms state-of-the-art methods on the text of both regular and irregular shapes. Moreover, 2D-CTC exhibits its superiority over prior art on training and testing speed. Our implementation and models of 2D-CTC will be made publicly available soon later.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

2D-CTC extends CTC into two dimensions for scene text and claims better accuracy plus speed on standard benchmarks, but the contribution needs the full methods and ablations to judge.

The main takeaway is that this paper adapts CTC, normally a 1D sequence tool, to operate directly on 2D image features for scene text recognition. The authors argue this matches the actual layout of text in photos better than forcing everything through 1D pipelines or adding attention and rectification layers on top. They test on IIIT-5K, ICDAR 2015, SVT-Perspective, and CUTE80 and say the model wins on both regular and irregular text while cutting training and inference time.

Referee Report

2 major / 1 minor

Summary. The paper proposes extending the standard 1D CTC loss to 2D-CTC for scene text recognition, arguing that this better respects the 2D spatial distribution of text in images. The model is claimed to adaptively focus on relevant features while ignoring background clutter, naturally handle horizontal/oriented/curved text without extra components or post-processing, and yield more interpretable intermediate predictions. Experiments on IIIT-5K, ICDAR 2015, SVT-Perspective (noted as SVP in abstract), and CUTE80 are said to demonstrate outperformance over prior art on both regular and irregular text plus gains in training and inference speed.

Significance. If the empirical results are reproducible, the work offers a lightweight, parameter-free extension of an established sequence model that directly incorporates 2D structure; this could be useful for other 2D signal tasks in computer vision. The stated intent to release code and models would strengthen the contribution by enabling verification.

major comments (2)

[Abstract / Experiments] The central claim rests on outperformance on the listed benchmarks, yet the provided manuscript text supplies only a high-level description of the 2D extension and no quantitative results tables, ablation studies, or error analysis; this makes the support for superiority on regular and irregular text unverifiable from the given material.
[Method] The description of how 2D-CTC computes probabilities over 2D paths and excludes background noise is stated at a conceptual level only; without the explicit formulation or algorithm (e.g., dynamic programming extension or loss definition), it is difficult to assess whether the claimed adaptive concentration occurs by construction or requires additional mechanisms.

minor comments (1)

[Abstract] Benchmark name inconsistency: abstract lists 'SVP-Perspective' while the reader's summary and standard literature use 'SVT-Perspective'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the review and the opportunity to clarify the manuscript. We address each major comment below and will revise accordingly to strengthen the presentation.

Referee: [Abstract / Experiments] The central claim rests on outperformance on the listed benchmarks, yet the provided manuscript text supplies only a high-level description of the 2D extension and no quantitative results tables, ablation studies, or error analysis; this makes the support for superiority on regular and irregular text unverifiable from the given material.

Authors: The referee correctly notes that the provided text contains only the abstract-level claims without supporting tables or analysis. The full manuscript includes an Experiments section with quantitative comparisons on IIIT-5K, ICDAR 2015, SVT-Perspective, and CUTE80, plus ablations and speed measurements. We will ensure these tables, ablations, and any error analysis are explicitly included and referenced in the revised version to make the empirical support verifiable.

Revision: yes

Referee: [Method] The description of how 2D-CTC computes probabilities over 2D paths and excludes background noise is stated at a conceptual level only; without the explicit formulation or algorithm (e.g., dynamic programming extension or loss definition), it is difficult to assess whether the claimed adaptive concentration occurs by construction or requires additional mechanisms.

Authors: We agree the provided description is high-level. The full manuscript contains a dedicated Method section with the explicit 2D-CTC formulation, including the extension of the forward-backward algorithm to 2D paths, the loss definition, and the mechanism by which background features receive lower probability mass by construction of the 2D alignment. We will expand this section with the full equations and algorithm to allow direct assessment.

Revision: yes
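Since the explicit formulation is not reproduced on this page, the following is only a toy illustration of how a height axis could down-weight background rows. It is a deliberately simplified stand-in, not the paper's 2D-CTC: each column gets an independent softmax over rows, and the per-cell class distributions are averaged with those weights to give one distribution per column, which any 1D CTC loss could then consume. The names `collapse_height`, `class_logits`, and `height_logits` are hypothetical, and the real model's treatment of transitions between columns is not captured.

```python
import numpy as np

def log_softmax(x, axis):
    """Log-softmax along `axis`, computed with logaddexp for stability."""
    return x - np.logaddexp.reduce(x, axis=axis, keepdims=True)

def collapse_height(class_logits, height_logits):
    """Collapse a 2D prediction map into per-column class log-probabilities.

    class_logits:  (H, W, K) unnormalized class scores for every cell.
    height_logits: (H, W) unnormalized row scores for every column
                   (assumption: rows compete within a column).
    Returns an array of shape (W, K).
    """
    log_p = log_softmax(class_logits, axis=2)   # class distribution per cell
    log_a = log_softmax(height_logits, axis=0)  # row weights per column
    # log sum_h a[h, w] * p[h, w, k]; rows with low weight contribute little
    return np.logaddexp.reduce(log_p + log_a[:, :, None], axis=0)
```

In this simplified picture, a row that mostly covers background can be suppressed by giving it a low height score, which is one way to read the claim that background features receive lower probability mass. Whether the paper achieves this by construction is exactly what the referee asks the authors to show.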

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper proposes 2D-CTC as a direct extension of standard CTC to operate over 2D image features for scene text recognition. Performance claims rest entirely on empirical results from public benchmarks (IIIT-5K, ICDAR 2015, SVT-Perspective, CUTE80) rather than any internal derivation, fitted-parameter renaming, or self-citation chain. No equations, uniqueness theorems, or ansatzes are shown that reduce to the inputs by construction; the model description and speed/accuracy advantages are externally falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on specific free parameters, axioms, or invented entities used in the 2D-CTC model.
