GA-DAN: Geometry-Aware Domain Adaptation Network for Scene Text Detection and Recognition

Chuhui Xue; Fangneng Zhan; Shijian Lu

arxiv: 1907.09653 · v1 · pith:ZEOP3HK6new · submitted 2019-07-23 · 💻 cs.CV

Pith reviewed 2026-05-24 18:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords: domain adaptation · scene text detection · scene text recognition · geometry-aware adaptation · cycle consistency · adversarial learning · image translation

The pith

GA-DAN models cross-domain shifts in both geometry and appearance to convert source images into realistic target-domain views for scene text tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GA-DAN, a network that handles domain shifts simultaneously in geometry space and appearance space rather than appearance alone. It introduces multi-modal spatial learning to produce multiple geometric views of a source image that match the target domain, plus a disentangled cycle-consistency loss that keeps both appearance and geometry consistent during translation. When the resulting adapted images are used to train detection and recognition networks, experiments show higher accuracy on target-domain scene text data than training on unadapted source images.

Core claim

GA-DAN converts images across domains with very different characteristics by jointly modeling geometric and appearance shifts, using multi-modal spatial learning to generate varied spatial views and a disentangled cycle-consistency loss to balance the two spaces, and these adapted images produce superior scene text detection and recognition performance when used for network training.

What carries the argument

Geometry-Aware Domain Adaptation Network (GA-DAN) that uses multi-modal spatial learning to generate multiple target-domain spatial views and a disentangled cycle-consistency loss to enforce consistency separately in appearance and geometry.
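
To make "consistency enforced separately in appearance and geometry" concrete, here is a minimal sketch of how two cycle terms could be computed and weighted. It is not the paper's loss, which this summary does not reproduce. The function names, the L1 distances, the 3x3 homogeneous transformation matrices, and the weights `lam_app` and `lam_geo` are all assumptions made for illustration.

```python
def appearance_cycle_loss(x, x_rec):
    """Mean absolute pixel difference between an image and its cycle reconstruction."""
    diffs = [abs(a - b) for row, row_rec in zip(x, x_rec) for a, b in zip(row, row_rec)]
    return sum(diffs) / len(diffs)


def compose(t_bwd, t_fwd):
    """3x3 matrix product t_bwd @ t_fwd (forward transform applied first)."""
    return [[sum(t_bwd[i][k] * t_fwd[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def geometry_cycle_loss(t_fwd, t_bwd):
    """Mean absolute deviation of the round-trip transform from the identity."""
    round_trip = compose(t_bwd, t_fwd)
    return sum(abs(round_trip[i][j] - (1.0 if i == j else 0.0))
               for i in range(3) for j in range(3)) / 9.0


def disentangled_cycle_loss(x, x_rec, t_fwd, t_bwd, lam_app=1.0, lam_geo=1.0):
    """Weighted sum of the two terms; the weights are hypothetical balancing factors."""
    return (lam_app * appearance_cycle_loss(x, x_rec)
            + lam_geo * geometry_cycle_loss(t_fwd, t_bwd))
```

Keeping the two terms apart means a translation that warps the image correctly but recolors it badly is penalized only through the appearance term, and vice versa, which is one plausible reading of why separate weights help balance training.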

If this is right

Domain-adapted images improve scene text detection accuracy on target domains.
Domain-adapted images improve scene text recognition accuracy on target domains.
The network can handle source and target domains that differ strongly in both layout and visual style.
Separate cycle-consistency terms for geometry and appearance stabilize the overall translation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometry-plus-appearance translation could supply synthetic training data for other tasks that suffer from viewpoint or layout shifts.
If the adapted images preserve text legibility, the method might reduce reliance on manual labeling in new environments.
Extending the spatial learning module to temporal sequences could support video domain adaptation.

Load-bearing premise

The multi-modal spatial learning technique and disentangled cycle-consistency loss can produce realistic image conversions across domains whose geometry and appearance differ substantially.

What would settle it

Training a scene text detector or recognizer on the GA-DAN-adapted images yields no accuracy gain over training on the original source images when both are tested on the same target-domain benchmark.
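
The comparison described above comes down to scoring two models on the same target-domain benchmark and checking the sign of the difference. The sketch below shows that arithmetic for a detection-style F-measure. The function names and the count-based interface are assumptions for illustration, and no figures from the paper are used.

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall from raw detection counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def adaptation_gain(source_counts, adapted_counts):
    """F-measure of the model trained on adapted images minus the source-only model.

    Each argument is a (true_pos, false_pos, false_neg) tuple measured on the
    same target-domain test set. A value at or below zero would count against
    the paper's claim.
    """
    return f_measure(*adapted_counts) - f_measure(*source_counts)
```

For recognition, the same comparison would use word accuracy in place of the F-measure; either way both models must be scored on identical test data for the difference to mean anything.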

Figures

Figures reproduced from arXiv: 1907.09653 by Chuhui Xue, Fangneng Zhan, Shijian Lu.

**Figure 2.** The structure of the proposed GA-DAN: SX (or SY) represents the spatial modules as enclosed in blue-color boxes which consist of Spatial Code, transformation module T and localization network LNX (or LNY) that predict transformation matrix and transform input images. GX (or GY) denote generators consisting of GXA (or GYA) and GXB (or GYB) as enclosed in green-color boxes that complete the background a…

**Figure 3.** Illustration of the disentangled cycle-consistency loss:

**Figure 4.** Comparing our GA-DAN with state-of-the-art adaptation methods: The first and last columns show source-domain (IC13) and

**Figure 5.** Comparing GA-DAN with state-of-the-art adaptation methods: Rows 1-2 show adaptation from COMB to CUTE, Rows 3-4 show
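
The Figure 2 caption describes a spatial module built from a spatial code, a localization network that predicts a transformation matrix, and a transformation step that warps the input. A minimal sketch of that data flow follows, under loud assumptions: the localization network is replaced by a hypothetical linear map `weights`, the transformation is restricted to a 2x3 affine matrix, sampling is nearest-neighbour, and every name is illustrative rather than taken from the paper.

```python
import random


def predict_affine(spatial_code, weights):
    """Stand-in for a localization network: linear map from a code to a 2x3 affine matrix.

    `weights` is a hypothetical 6 x len(spatial_code) matrix; zero weights give the identity.
    """
    identity = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    flat = [identity[i] + sum(w * c for w, c in zip(weights[i], spatial_code))
            for i in range(6)]
    return [flat[:3], flat[3:]]


def warp_affine(image, theta):
    """Warp a 2-D grid (at least 2x2) with a 2x3 affine matrix over normalized [-1, 1] coordinates."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Normalized coordinates of the output pixel.
            y = 2.0 * i / (h - 1) - 1.0
            x = 2.0 * j / (w - 1) - 1.0
            # Source location that this output pixel samples from.
            sx = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            sy = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            sj = round((sx + 1.0) * (w - 1) / 2.0)
            si = round((sy + 1.0) * (h - 1) / 2.0)
            if 0 <= si < h and 0 <= sj < w:  # out-of-range samples stay zero
                out[i][j] = image[si][sj]
    return out


def multi_modal_views(image, weights, num_views, code_dim, seed=0):
    """Sample several spatial codes and return one warped view per code."""
    rng = random.Random(seed)
    views = []
    for _ in range(num_views):
        code = [rng.gauss(0.0, 1.0) for _ in range(code_dim)]
        views.append(warp_affine(image, predict_affine(code, weights)))
    return views
```

Sampling a fresh code per view is what makes the output multi-modal in this sketch: the same source image yields a different geometric view for each code, while the image content is left to a separate generator stage.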

Original abstract

Recent adversarial learning research has achieved very impressive progress for modelling cross-domain data shifts in appearance space but its counterpart in modelling cross-domain shifts in geometry space lags far behind. This paper presents an innovative Geometry-Aware Domain Adaptation Network (GA-DAN) that is capable of modelling cross-domain shifts concurrently in both geometry space and appearance space and realistically converting images across domains with very different characteristics. In the proposed GA-DAN, a novel multi-modal spatial learning technique is designed which converts a source-domain image into multiple images of different spatial views as in the target domain. A new disentangled cycle-consistency loss is introduced which balances the cycle consistency in appearance and geometry spaces and improves the learning of the whole network greatly. The proposed GA-DAN has been evaluated for the classic scene text detection and recognition tasks, and experiments show that the domain-adapted images achieve superior scene text detection and recognition performance while applied to network training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GA-DAN adds explicit geometry handling to domain adaptation for scene text via multi-modal views and a disentangled loss, but the performance claims need the full experiments to judge.

Hi, the main takeaway is that this paper targets the gap where most domain adaptation stays in appearance space and ignores geometry shifts that matter for scene text. It does this by converting source images into multiple target-like spatial views through a new multi-modal spatial learning step, then uses a disentangled cycle-consistency loss to balance appearance and geometry terms separately during training. That combination is the actual new element compared to standard adversarial or cycle-based methods cited in the abstract. The approach makes conceptual sense for text detection and recognition, where perspective, scale, and layout differences across domains are common, and the paper states that the adapted images improve downstream network training.

What it does reasonably well is identify the limitation in prior work and propose components that try to address both spaces without one overwhelming the other. The central claim that this yields superior detection and recognition performance rests on the experiments mentioned in the abstract. The soft spots are that the abstract supplies no numbers, no baseline comparisons, no ablation results, and no discussion of artifacts or failure cases in the generated images. Without those, it is difficult to tell whether the gains are meaningful or whether the multi-modal and disentangled pieces deliver as intended on domains with large characteristic differences.

This is the sort of paper that would interest people working on domain adaptation or scene text in computer vision. A reader looking for extensions beyond appearance-only adaptation could extract the method description and try the ideas. It deserves a serious referee to examine the implementation details, the quantitative results, and whether the geometry modeling holds up under scrutiny.

Referee Report

1 major / 0 minor

Summary. The paper proposes GA-DAN, a geometry-aware domain adaptation network for scene text detection and recognition. It claims to model cross-domain shifts concurrently in geometry and appearance spaces via a novel multi-modal spatial learning technique (converting source images to multiple target-domain spatial views) and a disentangled cycle-consistency loss (balancing appearance and geometry consistency). Experiments are stated to show that the resulting domain-adapted images yield superior detection and recognition performance when used for network training.

Significance. If the performance claims hold under rigorous evaluation, the work would address a recognized gap in domain adaptation by extending adversarial methods to geometry space, which is relevant for scene text where domains often differ in both appearance and layout. The disentangled loss formulation could offer a reusable technique for multi-modal adaptation if shown to be effective beyond the specific tasks.

major comments (1)

[Abstract] The central claim that 'the domain-adapted images achieve superior scene text detection and recognition performance' is presented without any quantitative results, baselines, datasets, metrics, or error analysis. This prevents assessment of whether the multi-modal spatial learning or the disentangled loss actually drives the gains, and the claim is load-bearing for the paper's primary contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the comment point-by-point below.

Referee: [Abstract] The central claim that 'the domain-adapted images achieve superior scene text detection and recognition performance' is presented without any quantitative results, baselines, datasets, metrics, or error analysis. This prevents assessment of whether the multi-modal spatial learning or the disentangled loss actually drives the gains, and the claim is load-bearing for the paper's primary contribution.

Authors: We agree that the abstract would be strengthened by including key quantitative highlights to substantiate the claims and better allow readers to assess the contributions of the multi-modal spatial learning and disentangled cycle-consistency loss. In the revised version, we will update the abstract to briefly reference the performance gains (in detection and recognition metrics) achieved on the evaluated datasets relative to relevant baselines. The full quantitative results, baselines, metrics, and any error analysis remain detailed in the experimental sections; the abstract revision will serve as a high-level pointer without expanding its length substantially.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper introduces GA-DAN as a new architecture with multi-modal spatial learning and a disentangled cycle-consistency loss for concurrent geometry and appearance domain adaptation. No derivation chain, equations, or predictions are presented that reduce to fitted inputs, self-definitions, or self-citation load-bearing premises; the central claims rest on experimental evaluation of the proposed components rather than any closed mathematical loop or renamed prior result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents extraction of specific free parameters, axioms, or invented entities; ledger remains empty as no technical details on fitting, assumptions, or new entities are available.

[1] [1]

Almazan, A

J. Almazan, A. Gordo, A. Fornes, and E. Valveny. Word spotting and recognition with embedded attributes. TPAMI, 36(12):2552–2566, 2014

work page 2014

[2] [2]

F. L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. TPAMI, 11(6), 1989

work page 1989

[3] [3]

Bousmalis, N

K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Kr- ishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017

work page 2017

[4] [4]

Bousmalis, G

K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain separation networks. In NIPS, 2016

work page 2016

[5] [5]

Cheng, F

Z. Cheng, F. Bai, Y . Xu, G. Zheng, S. Pu, and S. Zhou. Fo- cusing attention: Towards accurate text recognition in natural images. In ICCV, pages 5076–5084, 2017

work page 2017

[6] [6]

C. K. Chng and C. S. Chan. Total-text: A comprehensive dataset for scene text detection and recognition. In ICDAR, pages 935–942, 2017

work page 2017

[7] [7]

D. Deng, H. Liu, X. Li, and D. Cai. Pixellink: Detecting scene text via instance segmentation. In AAAI, 2018

work page 2018

[8] [8]

Denton, S

E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep gen- erative image models using a laplacian pyramid of adversar- ial networks. In NIPS, 2015

work page 2015

[9] [9]

Adversarially Learned Inference

V . Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned in- ference. arXiv:1606.00704, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[10] [10]

Ganin and V

Y . Ganin and V . Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, pages 325–333, 2015

work page 2015

[11] [11]

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio. Gen- erative adversarial networks. In NIPS, pages 2672–2680, 2014

work page 2014

[12] [12]

A. Gordo. Supervised mid-level features for word image rep- resentation. In CVPR, 2015

work page 2015

[13] [13]

Gupta, A

A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016

work page 2016

[14] [14]

P. He, W. Huang, T. He, Q. Zhu, Y . Qiao, and X. Li. Single shot text detector with regional attention. arXiv:1709.00138, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[15] [15]

T. He, W. Huang, Y . Qiao, and J. Yao. Text-attentional convolutional neural network for scene text detection. TIP, 25(6):2529–2541, 2016

work page 2016

[16] [16]

Hoffman, E

J. Hoffman, E. Tzeng, T. Park, J.-Y . Zhu, P. Isola, A. A. Efros, and T. Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, 2018

work page 2018

[17] [17]

H. Hu, C. Zhang, Y . Luo, Y . Wang, J. Han, and E. Ding. Wordsup: Exploiting word annotations for character based text detection. In ICCV, Oct 2017

work page 2017

[18] [18]

Huang, Y

W. Huang, Y . Qiao, and X. Tang. Robust scene text detec- tion with convolution neural network induced mser trees. In ECCV, pages 497–511, 2014

work page 2014

[19] [19]

Isola, J.-Y

P. Isola, J.-Y . Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017

work page 2017

[20] [20]

Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artiﬁcial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[21] [21]

Jaderberg, K

M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recog- nition. In ICLR, 2015

work page 2015

[22] [22]

Jaderberg, K

M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. IJCV, 116(1):1–20, 2016

work page 2016

[23] [23]

Jaderberg, A

M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, pages 512–528, 2014

work page 2014

[24] [24]

Karatzas, L

D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V . R. Chandrasekhar, S. Lu, and F. Shafait. Icdar 2015 compe- tition on robust reading. In ICDAR, pages 1156–1160, 2015

work page 2015

[25] [25]

Karatzas, F

D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, L. P. de las Heras, and et al. Icdar 2013 robust reading competition. In ICDAR, pages 1484–1493, 2013

work page 2013

[26] [26]

T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In ICML, 2017

work page 2017

[27] [27]

Lee and S

C.-Y . Lee and S. Osindero. Recursive recurrent nets with attention modeling for ocr in the wild. InCVPR, pages 2231– 2239, 2016

work page 2016

[28] [28]

M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu. Textboxes: A fast text detector with a single deep neural network. InAAAI, pages 4161–4167, 2017

work page 2017

[29] [29]

M. Liao, Z. Zhu, B. Shi, G. Xia, and X. Bai. Rotation- sensitive regression for oriented scene text detection. In CVPR, pages 5909–5918, 2018

work page 2018

[30] [30]

C.-H. Lin, E. Yumer, O. Wang, E. Shechtman, and S. Lucey. St-gan: Spatial transformer generative adversarial networks for image compositing. In CVPR, 2018

work page 2018

[31] [31]

M.-Y . Liu, T. Breuel, and J. Kautz. Unsupervised image-to- image translation networks. In NIPS, 2017

work page 2017

[32] [32]

X. Liu, D. Liang, S. Yan, D. Chen, Y . Qiao, and J. Yan. Fots: Fast oriented text spotting with a uniﬁed network. In CVPR, pages 5676–5685, 2018

work page 2018

[33] [33]

Liu and L

Y . Liu and L. Jin. Deep matching prior network: Toward tighter multi-oriented text detection. In CVPR, 2017

work page 2017

[34] [34]

M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017

work page 2017

[35] [35]

S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao. Textsnake: A ﬂexible representation for detecting text of ar- bitrary shapes. In ECCV, pages 20–36, 2018

work page 2018

[36] [36]

S. Lu, T. Chen, S. Tian, J.-H. Lim, and C.-L. Tan. Scene text extraction based on edges and support vector regression. IJDAR, 18(2):125–135, 2015

work page 2015

[37] [37]

C. Luo, L. Jin, and Z. Sun. Moran: A multi-object recti- ﬁed attention network for scene text recognition. In Pattern Recognition, volume 90, pages 109–118, 2019

work page 2019

[38] [38]

P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai. Mask textspot- ter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In ECCV, 2018

work page 2018

[39] [39]

P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai. Multi-oriented scene text detection via corner localization and region seg- mentation. In CVPR, pages 7553–7563, 2018

work page 2018

[40] [40]

O. T. Ming-Yu Liu. Coupled generative adversarial net- works. In NIPS, 2016

work page 2016

[41] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012

[42] L. Neumann and J. Matas. Real-time scene text localization and recognition. In CVPR, pages 3538–3545, 2012

[43] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, 2013

[44] A. Polzounov, A. Ablavatski, S. Escalera, S. Lu, and J. Cai. Wordfence: Text detection in natural images with border awareness. In ICIP, pages 1222–1226. IEEE, 2017

[45] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016

[46] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl., 41(18):8027–8048, 2014

[47] J. A. Rodríguez-Serrano, A. Gordo, and F. Perronnin. Label embedding: A frugal baseline for text recognition. IJCV, 2015

[48] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, pages 325–333, 2010

[49] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. TPAMI, 39(11):2298–2304, 2017

[50] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai. Robust scene text recognition with automatic rectification. In CVPR, 2016

[51] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai. Icdar2017 competition on reading chinese text in the wild (rctw-17). In ICDAR, volume 01, pages 1429–1434, 2017

[52] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017

[53] B. Su and S. Lu. Accurate scene text recognition based on recurrent neural network. In ACCV, 2014

[54] B. Su and S. Lu. Accurate recognition of words in scenes without character segmentation using recurrent neural network. PR, 2017

[55] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016

[56] B. Sun and K. Saenko. Deep coral: correlation alignment for deep domain adaptation. In ICCV workshop, 2016

[57] Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017

[58] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. L. Tan. Text flow: A unified text detection system in natural scene images. In ICCV, pages 4651–4659, 2015

[59] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting text in natural image with connectionist text proposal network. In ECCV, pages 56–72, 2016

[60] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, 2011

[61] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017

[62] F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao. Geometry-aware scene text detection with instance transformation network. In CVPR, June 2018

[63] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011

[64] C. Xue, S. Lu, and F. Zhan. Accurate scene text detection through border semantics awareness and bootstrapping. In ECCV, pages 370–387, 2018

[65] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu. Detecting texts of arbitrary orientations in natural images. In CVPR, 2012

[66] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In CVPR, 2014

[67] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In ICCV, 2017

[68] X. C. Yin, W. Y. Pei, J. Zhang, and H. W. Hao. Multiorientation scene text detection with adaptive clustering. TPAMI, 37(9):1930–1937, 2015

[69] F. Zhan, J. Huang, and S. Lu. Adaptive composition gan towards realistic image synthesis. arXiv preprint arXiv:1905.04693, 2019

[70] F. Zhan and S. Lu. Esir: End-to-end scene text recognition via iterative image rectification. In CVPR, pages 2059–2068, 2019

[71] F. Zhan, S. Lu, and C. Xue. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In ECCV, pages 257–273, 2018

[72] F. Zhan, H. Zhu, and S. Lu. Scene text synthesis for efficient and effective deep network training. arXiv preprint arXiv:1901.09193, 2019

[73] F. Zhan, H. Zhu, and S. Lu. Spatial fusion gan for image synthesis. In CVPR, pages 3653–3662, 2019

[74] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017

[75] Z. Zhang, W. Shen, C. Yao, and X. Bai. Symmetry-based text line detection in natural scenes. In CVPR, pages 2558–2567, 2015

[76] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai. Multi-oriented text detection with fully convolutional networks. In CVPR, pages 4159–4167, 2016

[77] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang. East: An efficient and accurate scene text detector. In CVPR, 2017

[78] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017