2D-CTC for Scene Text Recognition
Pith reviewed 2026-05-24 17:56 UTC · model grok-4.3
The pith
Extending CTC to two dimensions lets the model treat scene text as 2D image features rather than 1D sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
2D-CTC extends the vanilla CTC model to a second dimension so that it can adaptively concentrate on the most relevant features while suppressing background clutter and noise. It also handles text of various forms (horizontal, oriented, and curved) naturally and gives more interpretable intermediate predictions. On standard benchmarks, the paper reports that the model outperforms state-of-the-art methods on both regular and irregular text and shows clear advantages in training and testing speed.
What carries the argument
2D-CTC, the two-dimensional extension of Connectionist Temporal Classification that aligns and classifies directly on 2D feature maps.
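For orientation, the vanilla (1D) CTC that the paper starts from can be sketched as the standard forward recursion below. This is a minimal pure-Python illustration of ordinary CTC, not the authors' code; the function name `ctc_probability` and the list-of-lists input layout are assumptions made for this example.

```python
def ctc_probability(probs, label, blank=0):
    """Probability of `label` under vanilla (1D) CTC.

    probs: list of T frames, each a list of per-class probabilities.
    label: list of class indices, without blanks.
    """
    # Extended label: a blank before, between, and after the characters.
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S, T = len(ext), len(probs)

    # alpha[s]: total probability of frame prefixes that end at ext[s].
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different characters.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last character or the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)
```

With two frames that each assign 0.5 to blank and 0.5 to a single character, three of the four alignments collapse to that character, so the function returns 0.75.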
If this is right
- Higher accuracy on both regular and irregular text benchmarks without added components.
- Faster training and inference than prior attention-based or 1D-CTC approaches.
- More interpretable per-frame predictions that reflect the 2D layout of the text.
- Natural handling of horizontal, oriented, and curved text instances in a single model.
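The "per-frame predictions" mentioned above are easiest to picture through ordinary CTC greedy decoding: take the best class at each frame, merge repeats, and drop blanks. The sketch below shows that 1D procedure only; how the paper reads predictions off a 2D map is not described on this page, and the name `ctc_greedy_decode` is hypothetical.

```python
def ctc_greedy_decode(frame_probs, blank=0):
    """Collapse per-frame argmax predictions into a label sequence."""
    # Best class index for every frame.
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    decoded = []
    previous = None
    for k in best:
        # Keep a class only when it differs from the previous frame
        # and is not the blank symbol.
        if k != previous and k != blank:
            decoded.append(k)
        previous = k
    return decoded
```

A blank between two identical frame predictions is what allows a doubled character to survive the collapse.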
Where Pith is reading between the lines
- The same 2D alignment idea could be tested on other spatial sequence tasks such as handwriting or formula recognition.
- If the speed gain holds, 2D-CTC might replace attention layers in lightweight mobile text recognizers.
- The interpretability of the 2D paths might help diagnose failure cases on heavily cluttered signs.
Load-bearing premise
The 2D extension of CTC will automatically focus on relevant text regions and handle irregular shapes without any extra modules or post-processing.
What would settle it
Running the released 2D-CTC model on CUTE80 or ICDAR 2015 and finding that its accuracy or speed does not exceed the best published 1D-CTC or attention baselines.
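The accuracy half of such a check comes down to word-level accuracy over a benchmark's test words. The sketch below assumes a case-insensitive exact match, which is a common but not universal convention for these benchmarks; the function name and the `case_sensitive` flag are illustrative, and no benchmark data or results are implied.

```python
def word_accuracy(predictions, ground_truths, case_sensitive=False):
    """Fraction of words recognized exactly right."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must align one-to-one")
    if not ground_truths:
        return 0.0

    def normalize(word):
        # Assumption: case is ignored unless the caller asks otherwise.
        return word if case_sensitive else word.lower()

    correct = sum(
        normalize(p) == normalize(g)
        for p, g in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)
```

Comparing models fairly also requires the same test split, character set, and lexicon setting for every baseline.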
Original abstract
Scene text recognition has been an important, active research topic in computer vision for years. Previous approaches mainly consider text as 1D signals and cast scene text recognition as a sequence prediction problem, by feat of CTC or attention based encoder-decoder framework, which is originally designed for speech recognition. However, different from speech voices, which are 1D signals, text instances are essentially distributed in 2D image spaces. To adhere to and make use of the 2D nature of text for higher recognition accuracy, we extend the vanilla CTC model to a second dimension, thus creating 2D-CTC. 2D-CTC can adaptively concentrate on most relevant features while excluding the impact from clutters and noises in the background; It can also naturally handle text instances with various forms (horizontal, oriented and curved) while giving more interpretable intermediate predictions. The experiments on standard benchmarks for scene text recognition, such as IIIT-5K, ICDAR 2015, SVP-Perspective, and CUTE80, demonstrate that the proposed 2D-CTC model outperforms state-of-the-art methods on the text of both regular and irregular shapes. Moreover, 2D-CTC exhibits its superiority over prior art on training and testing speed. Our implementation and models of 2D-CTC will be made publicly available soon later.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes extending the standard 1D CTC loss to 2D-CTC for scene text recognition, arguing that this better respects the 2D spatial distribution of text in images. The model is claimed to adaptively focus on relevant features while ignoring background clutter, naturally handle horizontal/oriented/curved text without extra components or post-processing, and yield more interpretable intermediate predictions. Experiments on IIIT-5K, ICDAR 2015, SVT-Perspective (noted as SVP in abstract), and CUTE80 are said to demonstrate outperformance over prior art on both regular and irregular text plus gains in training and inference speed.
Significance. If the empirical results are reproducible, the work offers a lightweight, parameter-free extension of an established sequence model that directly incorporates 2D structure; this could be useful for other 2D signal tasks in computer vision. The stated intent to release code and models would strengthen the contribution by enabling verification.
major comments (2)
- [Abstract / Experiments] The central claim rests on outperformance on the listed benchmarks, yet the provided manuscript text supplies only a high-level description of the 2D extension and no quantitative results tables, ablation studies, or error analysis; this makes the support for superiority on regular and irregular text unverifiable from the given material.
- [Method] The description of how 2D-CTC computes probabilities over 2D paths and excludes background noise is stated at a conceptual level only; without the explicit formulation or algorithm (e.g., dynamic programming extension or loss definition), it is difficult to assess whether the claimed adaptive concentration occurs by construction or requires additional mechanisms.
minor comments (1)
- [Abstract] Benchmark name inconsistency: the abstract lists 'SVP-Perspective', while this report's summary and the standard literature use 'SVT-Perspective'.
Simulated Author's Rebuttal
We thank the referee for the review and the opportunity to clarify the manuscript. We address each major comment below and will revise accordingly to strengthen the presentation.
- Referee: [Abstract / Experiments] The central claim rests on outperformance on the listed benchmarks, yet the provided manuscript text supplies only a high-level description of the 2D extension and no quantitative results tables, ablation studies, or error analysis; this makes the support for superiority on regular and irregular text unverifiable from the given material.
  Authors: The referee correctly notes that the provided text contains only the abstract-level claims without supporting tables or analysis. The full manuscript includes an Experiments section with quantitative comparisons on IIIT-5K, ICDAR 2015, SVT-Perspective, and CUTE80, plus ablations and speed measurements. We will ensure these tables, ablations, and any error analysis are explicitly included and referenced in the revised version to make the empirical support verifiable.
  Revision: yes
- Referee: [Method] The description of how 2D-CTC computes probabilities over 2D paths and excludes background noise is stated at a conceptual level only; without the explicit formulation or algorithm (e.g., dynamic programming extension or loss definition), it is difficult to assess whether the claimed adaptive concentration occurs by construction or requires additional mechanisms.
  Authors: We agree the provided description is high-level. The full manuscript contains a dedicated Method section with the explicit 2D-CTC formulation, including the extension of the forward-backward algorithm to 2D paths, the loss definition, and the mechanism by which background features receive lower probability mass by construction of the 2D alignment. We will expand this section with the full equations and algorithm to allow direct assessment.
  Revision: yes
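Since the explicit formulation is not reproduced on this page, the sketch below shows only one plausible way a CTC forward pass could be extended to 2D paths; it should not be read as the authors' algorithm. It assumes the network emits class probabilities `probs[h][w][k]` at every position of an H x W map, a starting-height distribution `start[h]`, and per-column height transitions `trans[w][h_prev][h]`. All of these names and shapes are hypothetical.

```python
def ctc2d_probability(probs, trans, start, label, blank=0):
    """Label probability summed over alignments and height paths.

    probs[h][w][k]: class probabilities at row h, column w (assumed layout).
    trans[w][hp][h]: probability of moving from row hp in column w
                     to row h in column w + 1 (assumed, rows sum to 1).
    start[h]: probability of starting at row h in column 0 (assumed).
    """
    H, W = len(probs), len(probs[0])
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S = len(ext)

    # alpha[s][h]: mass of partial paths at row h that have reached ext[s].
    alpha = [[0.0] * H for _ in range(S)]
    for h in range(H):
        alpha[0][h] = start[h] * probs[h][0][ext[0]]
        if S > 1:
            alpha[1][h] = start[h] * probs[h][0][ext[1]]

    for w in range(1, W):
        new = [[0.0] * H for _ in range(S)]
        for s in range(S):
            # Same label transitions as 1D CTC: stay, advance, or skip a blank.
            can_skip = s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]
            for h in range(H):
                total = 0.0
                for hp in range(H):
                    reach = alpha[s][hp]
                    if s >= 1:
                        reach += alpha[s - 1][hp]
                    if can_skip:
                        reach += alpha[s - 2][hp]
                    total += reach * trans[w - 1][hp][h]
                new[s][h] = total * probs[h][w][ext[s]]
        alpha = new

    result = sum(alpha[-1])
    if S > 1:
        result += sum(alpha[-2])
    return result
```

With a single row (H = 1) and a trivial transition of 1.0, this reduces to the ordinary 1D forward recursion. Under this reading, rows whose emissions are dominated by blank contribute little mass to the target label, which is one way background could be down-weighted; whether the paper achieves this by construction is exactly what the referee asks the authors to show.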
Circularity Check
No significant circularity in the derivation chain
The paper proposes 2D-CTC as a direct extension of standard CTC to operate over 2D image features for scene text recognition. Performance claims rest entirely on empirical results from public benchmarks (IIIT-5K, ICDAR 2015, SVT-Perspective, CUTE80) rather than any internal derivation, fitted-parameter renaming, or self-citation chain. No equations, uniqueness theorems, or ansatzes are shown that reduce to the inputs by construction; the model description and speed/accuracy advantages are externally falsifiable via the reported experiments.