arxiv: 1907.09705 · v1 · pith:2O23WDZMnew · submitted 2019-07-23 · 💻 cs.CV · cs.CL · cs.LG

2D-CTC for Scene Text Recognition

Pith reviewed 2026-05-24 17:56 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords scene text recognition · 2D-CTC · CTC · irregular text · sequence prediction · computer vision · deep learning

The pith

Extending CTC to two dimensions lets the model treat scene text as 2D image features rather than 1D sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that scene text recognition has been held back by methods that flatten images into 1D signals, even though the text itself occupies a 2D space. By adding a second dimension to the CTC loss and alignment process, the resulting 2D-CTC can focus computation on the actual character locations while ignoring background clutter. This single change is claimed to remove the need for separate attention modules or post-processing while improving accuracy on both straight and curved text. Experiments on IIIT-5K, ICDAR 2015, SVT-Perspective, and CUTE80 are presented as evidence that the approach beats prior methods in accuracy and runs faster at both training and inference time.

Core claim

2D-CTC extends the vanilla CTC model to a second dimension so that it can adaptively concentrate on the most relevant features while suppressing the influence of background clutter and noise. It also naturally handles text instances of various forms (horizontal, oriented, and curved) while giving more interpretable intermediate predictions. On standard benchmarks, the model outperforms state-of-the-art methods on both regular and irregular text and shows clear advantages in training and testing speed.

What carries the argument

2D-CTC, the two-dimensional extension of Connectionist Temporal Classification that aligns and classifies directly on 2D feature maps.
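For orientation, the 1D baseline that 2D-CTC extends can be sketched as follows. This is the standard CTC forward recursion (Graves et al., 2006), not code from the paper; the function name `ctc_forward` and the convention that class 0 is the blank are assumptions made for illustration.

```python
def ctc_forward(probs, label, blank=0):
    """Probability of `label` under vanilla (1D) CTC.

    probs[t][k] is the class-k probability at frame t (each row sums to 1).
    label is a list of class indices without blanks.
    """
    # Extended label: a blank before, between, and after every symbol.
    ext = [blank]
    for c in label:
        ext += [c, blank]
    n = len(ext)

    # alpha[s]: total probability of frame prefixes that end at ext[s].
    alpha = [0.0] * n
    alpha[0] = probs[0][blank]
    if n > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * n
        for s in range(n):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank between distinct symbols
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last symbol or on the trailing blank.
    return alpha[-1] + (alpha[-2] if n > 1 else 0.0)
```

The training loss is the negative logarithm of this quantity. Every frame contributes to the sum, which is the property the paper argues lets background columns interfere with recognition in the 1D setting.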

If this is right

  • Higher accuracy on both regular and irregular text benchmarks without added components.
  • Faster training and inference than prior attention-based or 1D-CTC approaches.
  • More interpretable per-frame predictions that reflect the 2D layout of the text.
  • Natural handling of horizontal, oriented, and curved text instances in a single model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same 2D alignment idea could be tested on other spatial sequence tasks such as handwriting or formula recognition.
  • If the speed gain holds, 2D-CTC might replace attention layers in lightweight mobile text recognizers.
  • The interpretability of the 2D paths might help diagnose failure cases on heavily cluttered signs.

Load-bearing premise

The 2D extension of CTC will automatically focus on relevant text regions and handle irregular shapes without any extra modules or post-processing.

What would settle it

Running the released 2D-CTC model on CUTE80 or ICDAR 2015 and finding that its accuracy or speed does not exceed the best published 1D-CTC or attention baselines.
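A minimal sketch of how such a check could be scored is given below. It assumes word-level accuracy with case-insensitive, alphanumeric-only matching, which is a common convention on these benchmarks but is not stated in the material reviewed here. The `recognize` callable and both function names are hypothetical stand-ins for whatever interface a released model exposes.

```python
import time


def normalize(text):
    """Lowercase and keep only letters and digits (assumed matching rule)."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def word_accuracy(predictions, ground_truths):
    """Fraction of words whose normalized prediction equals the ground truth."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths differ in length")
    if not ground_truths:
        raise ValueError("no samples to score")
    correct = sum(
        normalize(p) == normalize(g) for p, g in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)


def timed_accuracy(recognize, images, ground_truths):
    """Run `recognize` on every image; return (accuracy, seconds per image)."""
    start = time.perf_counter()
    predictions = [recognize(image) for image in images]
    elapsed = time.perf_counter() - start
    return word_accuracy(predictions, ground_truths), elapsed / len(images)
```

Running the same harness on a 1D-CTC or attention baseline, with identical hardware and batch size, would give the accuracy and speed comparison the claim depends on.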

Figures

Figures reproduced from arXiv: 1907.09705 by Cong Yao, Fengming Xie, Xiang Bai, Yibo Liu, Zhaoyi Wan.

Figure 1: Motivation of 2D-CTC. The prediction process …
Figure 2: Alignment procedures of vanilla CTC and 2D-CTC.
Figure 3: Visualization of the intermediate predictions of 2D-CTC. The first rows are input images, the second and last rows …
Figure 4: Possible alignments of transcription ‘FREE’ in …
Figure 5: Illustration of the network architecture of 2D …
Figure 6: Visualization of the recognition result of vanilla …
Figure 7: Accuracy and speed of different methods. The …
Original abstract

Scene text recognition has been an important, active research topic in computer vision for years. Previous approaches mainly consider text as 1D signals and cast scene text recognition as a sequence prediction problem, by feat of CTC or attention based encoder-decoder framework, which is originally designed for speech recognition. However, different from speech voices, which are 1D signals, text instances are essentially distributed in 2D image spaces. To adhere to and make use of the 2D nature of text for higher recognition accuracy, we extend the vanilla CTC model to a second dimension, thus creating 2D-CTC. 2D-CTC can adaptively concentrate on most relevant features while excluding the impact from clutters and noises in the background; It can also naturally handle text instances with various forms (horizontal, oriented and curved) while giving more interpretable intermediate predictions. The experiments on standard benchmarks for scene text recognition, such as IIIT-5K, ICDAR 2015, SVP-Perspective, and CUTE80, demonstrate that the proposed 2D-CTC model outperforms state-of-the-art methods on the text of both regular and irregular shapes. Moreover, 2D-CTC exhibits its superiority over prior art on training and testing speed. Our implementation and models of 2D-CTC will be made publicly available soon later.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes extending the standard 1D CTC loss to 2D-CTC for scene text recognition, arguing that this better respects the 2D spatial distribution of text in images. The model is claimed to adaptively focus on relevant features while ignoring background clutter, naturally handle horizontal/oriented/curved text without extra components or post-processing, and yield more interpretable intermediate predictions. Experiments on IIIT-5K, ICDAR 2015, SVT-Perspective (noted as SVP in abstract), and CUTE80 are said to demonstrate outperformance over prior art on both regular and irregular text plus gains in training and inference speed.

Significance. If the empirical results are reproducible, the work offers a lightweight, parameter-free extension of an established sequence model that directly incorporates 2D structure; this could be useful for other 2D signal tasks in computer vision. The stated intent to release code and models would strengthen the contribution by enabling verification.

major comments (2)
  1. [Abstract / Experiments] The central claim rests on outperformance on the listed benchmarks, yet the provided manuscript text supplies only a high-level description of the 2D extension and no quantitative results tables, ablation studies, or error analysis; this makes the support for superiority on regular and irregular text unverifiable from the given material.
  2. [Method] The description of how 2D-CTC computes probabilities over 2D paths and excludes background noise is stated at a conceptual level only; without the explicit formulation or algorithm (e.g., dynamic programming extension or loss definition), it is difficult to assess whether the claimed adaptive concentration occurs by construction or requires additional mechanisms.
minor comments (1)
  1. [Abstract] Benchmark name inconsistency: abstract lists 'SVP-Perspective' while the reader's summary and standard literature use 'SVT-Perspective'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the review and the opportunity to clarify the manuscript. We address each major comment below and will revise accordingly to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim rests on outperformance on the listed benchmarks, yet the provided manuscript text supplies only a high-level description of the 2D extension and no quantitative results tables, ablation studies, or error analysis; this makes the support for superiority on regular and irregular text unverifiable from the given material.

    Authors: The referee correctly notes that the provided text contains only the abstract-level claims without supporting tables or analysis. The full manuscript includes an Experiments section with quantitative comparisons on IIIT-5K, ICDAR 2015, SVT-Perspective, and CUTE80, plus ablations and speed measurements. We will ensure these tables, ablations, and any error analysis are explicitly included and referenced in the revised version to make the empirical support verifiable. revision: yes

  2. Referee: [Method] The description of how 2D-CTC computes probabilities over 2D paths and excludes background noise is stated at a conceptual level only; without the explicit formulation or algorithm (e.g., dynamic programming extension or loss definition), it is difficult to assess whether the claimed adaptive concentration occurs by construction or requires additional mechanisms.

    Authors: We agree the provided description is high-level. The full manuscript contains a dedicated Method section with the explicit 2D-CTC formulation, including the extension of the forward-backward algorithm to 2D paths, the loss definition, and the mechanism by which background features receive lower probability mass by construction of the 2D alignment. We will expand this section with the full equations and algorithm to allow direct assessment. revision: yes
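The paper's equations are not reproduced in the material reviewed here, so the following is only one plausible form that a 2D forward pass could take, not the authors' formulation. It assumes a per-pixel class distribution `probs[h][w][k]`, a start distribution over heights `start[h]`, and per-column height transitions `trans[w-1][h_prev][h]`; all of these names and the transition scheme itself are hypothetical.

```python
def ctc2d_forward(probs, start, trans, label, blank=0):
    """Probability of `label` summed over all 2D paths (illustrative sketch).

    probs[h][w][k]    : class-k probability at height h, column w
    start[h]          : probability that a path starts at height h
    trans[w-1][hp][h] : probability of moving from height hp in column w-1
                        to height h in column w (each row sums to 1)
    """
    height, width = len(probs), len(probs[0])
    ext = [blank]
    for c in label:
        ext += [c, blank]
    n = len(ext)

    # alpha[s][h]: mass of paths that end at ext[s] and sit at height h
    # in the current column.
    alpha = [[0.0] * height for _ in range(n)]
    for h in range(height):
        alpha[0][h] = start[h] * probs[h][0][blank]
        if n > 1:
            alpha[1][h] = start[h] * probs[h][0][ext[1]]

    for w in range(1, width):
        new = [[0.0] * height for _ in range(n)]
        for s in range(n):
            # Same three label transitions as 1D CTC, kept per source height.
            incoming = list(alpha[s])
            if s >= 1:
                incoming = [a + b for a, b in zip(incoming, alpha[s - 1])]
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                incoming = [a + b for a, b in zip(incoming, alpha[s - 2])]
            for h in range(height):
                # Marginalize over the height the path came from.
                moved = sum(
                    trans[w - 1][hp][h] * incoming[hp] for hp in range(height)
                )
                new[s][h] = moved * probs[h][w][ext[s]]
        alpha = new

    total = sum(alpha[n - 1])
    if n > 1:
        total += sum(alpha[n - 2])
    return total
```

With a single row the recursion collapses to ordinary CTC, which gives a simple sanity check. Under this assumed form, a path visits one height per column, so rows that the learned transitions rarely select contribute little probability mass; whether the paper achieves background suppression this way cannot be confirmed from the abstract alone.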

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper proposes 2D-CTC as a direct extension of standard CTC to operate over 2D image features for scene text recognition. Performance claims rest entirely on empirical results from public benchmarks (IIIT-5K, ICDAR 2015, SVT-Perspective, CUTE80) rather than any internal derivation, fitted-parameter renaming, or self-citation chain. No equations, uniqueness theorems, or ansatzes are shown that reduce to the inputs by construction; the model description and speed/accuracy advantages are externally falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on specific free parameters, axioms, or invented entities used in the 2D-CTC model.


