arxiv: 1907.09653 · v1 · pith:ZEOP3HK6new · submitted 2019-07-23 · 💻 cs.CV

GA-DAN: Geometry-Aware Domain Adaptation Network for Scene Text Detection and Recognition

Pith reviewed 2026-05-24 18:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords domain adaptation · scene text detection · scene text recognition · geometry-aware adaptation · cycle consistency · adversarial learning · image translation

The pith

GA-DAN models cross-domain shifts in both geometry and appearance to convert source images into realistic target-domain views for scene text tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GA-DAN, a network that handles domain shifts simultaneously in geometry space and appearance space rather than appearance alone. It introduces multi-modal spatial learning to produce multiple geometric views of a source image that match the target domain, plus a disentangled cycle-consistency loss that keeps both appearance and geometry consistent during translation. When the resulting adapted images are used to train detection and recognition networks, experiments show higher accuracy on target-domain scene text data than training on unadapted source images.

Core claim

GA-DAN converts images across domains with very different characteristics by jointly modeling geometric and appearance shifts, using multi-modal spatial learning to generate varied spatial views and a disentangled cycle-consistency loss to balance the two spaces, and these adapted images produce superior scene text detection and recognition performance when used for network training.
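
The idea of generating several spatial views from one source image can be illustrated with a toy sketch. It is not the paper's implementation: the three-value spatial code, the `code_to_affine` mapping (a hand-written stand-in for a learned localization network), and the Gaussian sampling ranges are all assumptions made for illustration, and the "image" is reduced to the four corners of a text box.

```python
import math
import random


def code_to_affine(code):
    """Turn a 3-value spatial code into a 2x3 affine matrix.

    Hypothetical stand-in for a learned localization network: the code is
    read as (rotation angle, log scale, shear).
    """
    angle, log_scale, shear = code
    s = math.exp(log_scale)
    c, sn = math.cos(angle), math.sin(angle)
    # scale * rotation * shear, with zero translation
    return [
        [s * c, s * (c * shear - sn), 0.0],
        [s * sn, s * (sn * shear + c), 0.0],
    ]


def warp_points(matrix, points):
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    (a, b, tx), (c, d, ty) = matrix
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


def sample_views(points, n_views, seed=0):
    """Draw n_views random spatial codes and return one warped copy per code."""
    rng = random.Random(seed)
    views = []
    for _ in range(n_views):
        # Assumed sampling ranges; a real model would learn what is plausible.
        code = (rng.gauss(0.0, 0.2), rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.2))
        views.append(warp_points(code_to_affine(code), points))
    return views


# Corners of a unit text box; each view is the same box under a different warp.
box = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
views = sample_views(box, n_views=3)
```

Each sampled code yields a different warp of the same content, which is the sense in which one source image maps to multiple target-style views. In the paper the transformation is predicted by a network and applied to pixels rather than to corner points.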

What carries the argument

Geometry-Aware Domain Adaptation Network (GA-DAN) that uses multi-modal spatial learning to generate multiple target-domain spatial views and a disentangled cycle-consistency loss to enforce consistency separately in appearance and geometry.
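
A minimal sketch of what "separate consistency in appearance and geometry" could look like as a loss follows. The abstract does not give the paper's formulation, so the L1 distances, the weights `w_app` and `w_geo`, and the representation of geometry as a flat list of transformation parameters are assumptions for illustration only.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def disentangled_cycle_loss(src_pixels, rec_pixels, src_geom, rec_geom,
                            w_app=1.0, w_geo=1.0):
    """Weighted sum of two separately computed cycle-consistency terms.

    src_pixels / rec_pixels: pixel values before and after a full cycle.
    src_geom / rec_geom: transformation parameters before and after a cycle.
    Returns (total, appearance_term, geometry_term).
    """
    app = l1(src_pixels, rec_pixels)   # appearance consistency
    geo = l1(src_geom, rec_geom)       # geometry consistency
    return w_app * app + w_geo * geo, app, geo


# Toy inputs: two pixels and a flattened 2x3 affine matrix.
total, app, geo = disentangled_cycle_loss(
    src_pixels=[0.0, 1.0],
    rec_pixels=[0.5, 1.0],
    src_geom=[1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    rec_geom=[1.0, 0.0, 0.2, 0.0, 1.0, 0.0],
)
```

Keeping the two terms apart lets each be weighted independently, so a large geometric change does not have to be penalized as if it were a pixel-level reconstruction error.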

If this is right

  • Domain-adapted images improve scene text detection accuracy on target domains.
  • Domain-adapted images improve scene text recognition accuracy on target domains.
  • The network can handle source and target domains that differ strongly in both layout and visual style.
  • Separate cycle-consistency terms for geometry and appearance stabilize the overall translation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometry-plus-appearance translation could supply synthetic training data for other tasks that suffer from viewpoint or layout shifts.
  • If the adapted images preserve text legibility, the method might reduce reliance on manual labeling in new environments.
  • Extending the spatial learning module to temporal sequences could support video domain adaptation.

Load-bearing premise

The multi-modal spatial learning technique and disentangled cycle-consistency loss can produce realistic image conversions across domains whose geometry and appearance differ substantially.

What would settle it

Training a scene text detector or recognizer on the GA-DAN-adapted images yields no accuracy gain over training on the original source images when both are tested on the same target-domain benchmark.
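
The test above amounts to comparing two models on one target-domain benchmark. A small sketch of that comparison, assuming word-level exact-match accuracy as the metric and using made-up toy predictions (neither is taken from the paper):

```python
def accuracy(preds, labels):
    """Fraction of predictions that exactly match their label."""
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def adaptation_gain(adapted_preds, source_preds, labels):
    """Accuracy of the adapted-data model minus that of the source-only model."""
    return accuracy(adapted_preds, labels) - accuracy(source_preds, labels)


# Toy target-domain benchmark with four word images (illustrative data only).
labels = ["exit", "open", "cafe", "taxi"]
adapted_preds = ["exit", "open", "cafe", "taxl"]   # model trained on adapted images
source_preds = ["exit", "opon", "cafe", "taxl"]    # model trained on source images
gain = adaptation_gain(adapted_preds, source_preds, labels)
```

A gain at or below zero on a real benchmark, beyond run-to-run noise, would be the outcome that counts against the paper's claim.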

Figures

Figures reproduced from arXiv: 1907.09653 by Chuhui Xue, Fangneng Zhan, Shijian Lu.

Figure 1: Domain adaptation by the proposed GA-DAN: For scene …
Figure 2: The structure of the proposed GA-DAN: S_X (or S_Y) represents the spatial modules as enclosed in blue-color boxes which consist of Spatial Code, transformation module T and localization network LN_X (or LN_Y) that predict transformation matrix and transform input images. G_X (or G_Y) denote generators consisting of G_XA (or G_YA) and G_XB (or G_YB) as enclosed in green-color boxes that complete the background a…
Figure 3: Illustration of the disentangled cycle-consistency loss: …
Figure 4: Comparing our GA-DAN with state-of-the-art adaptation methods: The first and last columns show source-domain (IC13) and …
Figure 5: Comparing GA-DAN with state-of-the-art adaptation methods: Rows 1-2 show adaptation from COMB to CUTE, Rows 3-4 show …
Original abstract

Recent adversarial learning research has achieved very impressive progress for modelling cross-domain data shifts in appearance space but its counterpart in modelling cross-domain shifts in geometry space lags far behind. This paper presents an innovative Geometry-Aware Domain Adaptation Network (GA-DAN) that is capable of modelling cross-domain shifts concurrently in both geometry space and appearance space and realistically converting images across domains with very different characteristics. In the proposed GA-DAN, a novel multi-modal spatial learning technique is designed which converts a source-domain image into multiple images of different spatial views as in the target domain. A new disentangled cycle-consistency loss is introduced which balances the cycle consistency in appearance and geometry spaces and improves the learning of the whole network greatly. The proposed GA-DAN has been evaluated for the classic scene text detection and recognition tasks, and experiments show that the domain-adapted images achieve superior scene text detection and recognition performance while applied to network training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes GA-DAN, a geometry-aware domain adaptation network for scene text detection and recognition. It claims to model cross-domain shifts concurrently in geometry and appearance spaces via a novel multi-modal spatial learning technique (converting source images to multiple target-domain spatial views) and a disentangled cycle-consistency loss (balancing appearance and geometry consistency). Experiments are stated to show that the resulting domain-adapted images yield superior detection and recognition performance when used for network training.

Significance. If the performance claims hold under rigorous evaluation, the work would address a recognized gap in domain adaptation by extending adversarial methods to geometry space, which is relevant for scene text where domains often differ in both appearance and layout. The disentangled loss formulation could offer a reusable technique for multi-modal adaptation if shown to be effective beyond the specific tasks.

major comments (1)
  1. [Abstract] The central claim that 'the domain-adapted images achieve superior scene text detection and recognition performance' is presented without any quantitative results, baselines, datasets, metrics, or error analysis. This omission prevents assessment of whether the multi-modal spatial learning or the disentangled loss actually drives the gains, and the claim is load-bearing for the paper's primary contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the comment point-by-point below.

  1. Referee: [Abstract] The central claim that 'the domain-adapted images achieve superior scene text detection and recognition performance' is presented without any quantitative results, baselines, datasets, metrics, or error analysis. This omission prevents assessment of whether the multi-modal spatial learning or the disentangled loss actually drives the gains, and the claim is load-bearing for the paper's primary contribution.

    Authors: We agree that the abstract would be strengthened by including key quantitative highlights to substantiate the claims and better allow readers to assess the contributions of the multi-modal spatial learning and disentangled cycle-consistency loss. In the revised version, we will update the abstract to briefly reference the performance gains (in detection and recognition metrics) achieved on the evaluated datasets relative to relevant baselines. The full quantitative results, baselines, metrics, and any error analysis remain detailed in the experimental sections; the abstract revision will serve as a high-level pointer without expanding its length substantially.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper introduces GA-DAN as a new architecture with multi-modal spatial learning and a disentangled cycle-consistency loss for concurrent geometry and appearance domain adaptation. No derivation chain, equations, or predictions are presented that reduce to fitted inputs, self-definitions, or self-citation load-bearing premises; the central claims rest on experimental evaluation of the proposed components rather than any closed mathematical loop or renamed prior result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents extraction of specific free parameters, axioms, or invented entities; ledger remains empty as no technical details on fitting, assumptions, or new entities are available.

