TextDS: Parameter-Efficient Representation Alignment for Scene Text Detection under Distribution Shifts

Boyuan Chen; Chuang Yang; Lap-Pui Chau; Yi Wang; Zichen Dang

arxiv: 2606.28077 · v2 · pith:KXF6MDNPnew · submitted 2026-06-26 · 💻 cs.CV

Pith reviewed 2026-06-30 09:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords: scene text detection · distribution shifts · parameter-efficient learning · LoRA adaptation · feature fusion · dual-encoder · visual foundation models · adverse imaging conditions

The pith

TextDS aligns features from visual foundation models to detect scene text under distribution shifts using only 4.9 million trainable parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that scene text detection can remain effective when the input images come from different domains or suffer from degradations like blur or low light, without needing to pretrain on huge amounts of scene text data. This would matter because real deployments face such variations, and current methods often require expensive computation or data collection. TextDS uses a dual-encoder setup based on existing visual models, adapts them step by step with low-rank updates that can exit early, and fuses the outputs in a common subspace to keep useful information while aligning for shifts. The result is competitive accuracy with far fewer parameters to train.

Core claim

TextDS is a framework that employs a data-efficient dual-encoder design with visual foundation models, applies Step-wise LoRA adaptation (SWLoRA) for progressive refinement with dynamic early-exit, and uses Common Subspace Fusion (CSF) to align the branches while preserving complementary shift-robust information, achieving robustness across domains and adverse conditions.

What carries the argument

Dual-encoder architecture combining visual foundation models with SWLoRA for adaptation and CSF for fusion in a shared subspace.
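The idea of fusing two encoder branches in a shared subspace can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's CSF module: the function names, the linear projections, and the mean/half-difference split are all hypothetical choices made for illustration, since the summary above does not describe how CSF is built.

```python
# Hypothetical illustration only; this is NOT the CSF module from the paper.

def matvec(weight, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def fuse_in_shared_space(feat_a, feat_b, proj_a, proj_b):
    """Project two branch features into one shared space, then fuse them.

    The mean keeps what the two branches agree on; the half-difference
    keeps what distinguishes them. Concatenating both preserves all the
    information in the projected features.
    """
    z_a = matvec(proj_a, feat_a)
    z_b = matvec(proj_b, feat_b)
    common = [(a + b) / 2 for a, b in zip(z_a, z_b)]
    complementary = [(a - b) / 2 for a, b in zip(z_a, z_b)]
    return common + complementary  # list concatenation


# Toy inputs: branch A has 2 channels, branch B has 3 (assumed sizes).
feat_a = [1.0, 2.0]
feat_b = [3.0, 4.0, 5.0]
proj_a = [[1.0, 0.0], [0.0, 1.0]]            # maps 2 -> 2
proj_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # maps 3 -> 2
fused = fuse_in_shared_space(feat_a, feat_b, proj_a, proj_b)
```

In this toy version both projected features can be recovered from the fused vector (sum and difference of its two halves), which is one simple way to read "align while preserving complementary information".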

If this is right

Detectors can be deployed across varied real-world conditions without large-scale text-specific pretraining.
Training requires only 4.9 million parameters for effective adaptation.
Evaluation on newly constructed adverse-condition datasets shows maintained performance under imaging degradations.
Feature alignment in a common subspace retains information that single-branch approaches might lose.
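To see why low-rank adaptation keeps the trainable count small, the arithmetic for a single linear layer can be sketched as follows. A rank-r LoRA update replaces a full d_out × d_in weight update with two factors of sizes d_out × r and r × d_in. The rank of 8 is the default reported in the Figure 6 caption; the layer width of 1024 is an assumed value for illustration and is not taken from the paper, so the numbers below do not reconstruct the 4.9M total.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable weights added by one rank-r LoRA update: r * (d_in + d_out)."""
    return rank * (d_in + d_out)


def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


hidden = 1024  # assumed layer width, for illustration only
rank = 8       # default rank per the Figure 6 caption

per_layer_lora = lora_trainable_params(hidden, hidden, rank)
per_layer_full = full_finetune_params(hidden, hidden)
fraction = per_layer_lora / per_layer_full
```

Under these assumptions each adapted layer trains 16,384 weights instead of 1,048,576, which is 1/64 of the full update.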

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar alignment techniques could apply to other detection tasks facing domain shifts, such as object detection in medical imaging.
The dynamic early-exit in SWLoRA suggests potential for variable compute during inference depending on input difficulty.
By avoiding full pretraining, this approach might lower the barrier for adapting to new languages or scripts in text detection.
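The variable-compute reading of the early-exit mechanism can be made concrete with a small sketch. The exit rule used here (stop once the relative size of the latest low-rank update drops below a tolerance) is an assumption, as are all function names; the review material does not state the paper's actual exit criterion. The maximum of 5 steps follows the default reported in the Figure 6 caption, and the frozen base weights are omitted so that only the adapter path is shown.

```python
import math

# Hypothetical sketch; the exit rule below is assumed, not taken from the paper.


def matvec(weight, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def norm(vec):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in vec))


def lora_delta(down, up, vec):
    """Low-rank update up @ (down @ vec)."""
    return matvec(up, matvec(down, vec))


def stepwise_refine(feature, adapters, tol=1e-2):
    """Apply low-rank refinements in order; exit once an update is negligible."""
    h = list(feature)
    steps_used = 0
    for down, up in adapters:
        delta = lora_delta(down, up, h)
        h = [hv + dv for hv, dv in zip(h, delta)]
        steps_used += 1
        if norm(delta) / (norm(h) + 1e-12) < tol:
            break  # assumed exit rule: the latest update barely changed h
    return h, steps_used


# Rank-1 toy adapters on a 2-dimensional feature.
active = ([[1.0, 0.0]], [[0.5], [0.0]])  # produces a visible update
silent = ([[1.0, 0.0]], [[0.0], [0.0]])  # produces a zero update
adapters = [active, silent, active, active, active]  # maximum of 5 steps

refined, steps_used = stepwise_refine([1.0, 0.0], adapters)
```

Here the second step contributes nothing, so the loop stops after two of the five available steps; an input that kept producing large updates would use all five.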

Load-bearing premise

That the features from general visual foundation models can be sufficiently adapted and aligned using low-rank updates and subspace fusion to handle text detection shifts without dedicated large-scale pretraining.

What would settle it

The claim would be undermined if TextDS, with its 4.9M trainable parameters, significantly underperformed methods that use full pretraining on a held-out distribution-shift dataset not used in the paper.
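Such a comparison would be scored with the F-measure that the figure captions report, the harmonic mean of precision and recall. A minimal sketch of that computation follows; the precision and recall values are placeholders chosen for illustration and are not results from the paper.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Placeholder scores for two hypothetical detectors on a held-out shift set.
adapter_f = f_measure(0.80, 0.60)
pretrained_f = f_measure(0.90, 0.70)
gap = pretrained_f - adapter_f
```

What counts as a "significant" gap is a judgment call that the review does not quantify, so no threshold is hard-coded here.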

Figures

Figures reproduced from arXiv: 2606.28077 by Boyuan Chen, Chuang Yang, Lap-Pui Chau, Yi Wang, Zichen Dang.

**Figure 1.** Examples of distribution shifts in scene text detection, including domain changes and adverse-condition imaging degradation. The domain-change example is a change of text language from English to French, Arabic, and Chinese. Imaging degradation includes low resolution, rain, fog, underexposure, and overexposure.

**Figure 2.** Overall structure of TextDS. The input scene text image is processed by the SAM2-Encoder-Branch and the DINOv3-Encoder-Branch. The multi-scale features from the two branches are fused by the Common Subspace Fusion (CSF) module, and each SAM2 Encoder Block is fine-tuned with the Step-wise LoRA (SWLoRA) structure.

**Figure 3.** Domain generalization from the MLT dataset to CTW-1500 and Total-Text, comparing TextDS with the comparison methods. Blue circles show each model's F-measure on CTW-1500 and Total-Text themselves; red circles show the F-measure when generalizing from the MLT dataset to CTW-1500 and Total-Text.

**Figure 4.** Visualization of scene text image degradations and detection examples for TextDS. In the detection samples, red marks the ground truth and blue marks the text regions detected by TextDS, which maintains performance under adverse imaging conditions.

**Figure 5.** Performance comparison of TextDS and the comparison methods on the scene text datasets in adverse-condition scenarios, showing the F-measure of TextDS, S3INet, TextPMs, and DBNet under normal and degraded imaging conditions.

**Figure 6.** Variation of the F-measure with the choice of Rank and Maximum Step when fine-tuning with the SWLoRA structure on the CTW-1500, Total-Text, and MLT datasets. Rank = 8 and Maximum Step = 5 are adopted as the default setting.

**Figure 7.** Visual comparison with LLM-based and end-to-end OCR methods in scene text detection, including Qwen-OCR, DeepSeek-OCR, and PP-OCR. As a front-end detector with a strong structural prior, TextDS is highly complementary to existing LLM-VLM models and can provide higher-quality text regions for large language models. More comparison details and examples can be found in the supp…

Original abstract

In real-world deployments, scene text detectors inevitably face distribution shifts beyond the training distribution. Prior work often depends on large-scale scene-text pretraining, yet evaluation under cross-domain changes and real-world imaging degradations remains limited. We propose TextDS, an efficient framework for scene text detection under distribution shifts. First, we propose a data-efficient dual-encoder design with visual foundation models, eliminating the reliance on large-scale scene-text pretraining. Second, we introduce Step-wise LoRA adaptation (SWLoRA), which performs progressive low-rank refinement with a dynamic early-exit mechanism for effective feature adaptation. Third, we propose Common Subspace Fusion (CSF) to align and fuse the two branches in a shared subspace while retaining complementary, shift-robust information. Finally, we construct adverse-condition scene text detection datasets to address the gap in evaluating under imaging degradation. Experiments show that TextDS achieves competitive performance in scene text detection, demonstrating robustness across domains and adverse imaging conditions with only 4.9M trainable parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TextDS packages SWLoRA and CSF into a dual-encoder setup for efficient scene-text adaptation under shifts, but the performance edge needs the full tables to confirm.

TextDS combines a frozen visual foundation model dual-encoder with two new modules: SWLoRA, which does progressive low-rank updates and early exits, and CSF, which fuses the branches in a shared subspace. The paper also releases adverse-condition datasets. That combination is the main new piece, aimed at avoiding large scene-text pretraining while keeping trainable parameters at 4.9M.

The work does a clean job of targeting a practical gap—real imaging degradations and cross-domain shifts in text detection—without inflating the method count. The design choices around progressive adaptation and subspace retention are reasonable extensions of existing LoRA and fusion ideas, and the emphasis on parameter count is honest.

The soft spot is the evidence. The abstract states competitive performance and robustness, yet supplies no numbers, baselines, or variance. If the full experiments show clear gains over standard fine-tuning or other adapters on the new datasets, the claim holds; otherwise it stays an untested assertion. The assumption that the frozen encoders plus these adapters are enough for severe shifts is plausible but not automatic.

This is for people already working on scene text or efficient adaptation in CV. A reader who needs a concrete recipe for low-parameter domain handling could extract value from the components and datasets.

Send it to peer review so the tables and ablations can be examined directly.

Referee Report

0 major / 2 minor

Summary. The paper proposes TextDS, a parameter-efficient framework for scene text detection under distribution shifts. It introduces a dual-encoder architecture that leverages frozen visual foundation models to avoid large-scale scene-text pretraining, Step-wise LoRA (SWLoRA) for progressive low-rank adaptation with a dynamic early-exit mechanism, and Common Subspace Fusion (CSF) to align the encoder branches in a shared subspace while preserving complementary shift-robust features. The authors also construct new adverse-condition scene text detection datasets to evaluate robustness under imaging degradations. The central claim is that TextDS achieves competitive performance and robustness across domains and adverse conditions while training only 4.9M parameters.

Significance. If the experimental results hold, the work would be significant for practical deployment of scene text detectors, as it demonstrates how frozen foundation models combined with targeted low-rank adaptation and subspace fusion can deliver robustness without expensive pretraining or full fine-tuning. The construction of adverse-condition datasets is a concrete contribution that fills an evaluation gap. The emphasis on parameter efficiency (4.9M trainable parameters) and avoidance of large-scale scene-text pretraining aligns with broader needs in efficient computer vision.

minor comments (2)

[Abstract] Abstract: the claim of 'competitive performance' is stated without any numerical results, baselines, or error bars. Adding at least one key metric (e.g., F-measure on a cross-domain benchmark) would strengthen the abstract.
[Method (SWLoRA subsection)] The description of SWLoRA mentions a 'dynamic early-exit mechanism' but does not specify the exit criterion or how it interacts with the progressive refinement schedule; a short algorithmic outline or pseudocode would improve clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, recognition of its significance for practical deployment, and recommendation of minor revision. The referee's description accurately reflects the core contributions of TextDS.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical proposal

The provided abstract and description contain no equations, fitted parameters presented as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work. The central claims rest on the introduction of three components (dual-encoder with frozen VFM, SWLoRA, CSF) followed by experimental validation on constructed datasets; these are presented as independent engineering choices whose performance is measured externally rather than defined into existence. No reduction of any result to its own inputs by construction is visible.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no identifiable free parameters, invented entities, or non-standard axioms beyond the domain assumption that visual foundation models transfer usefully to scene text without dedicated pretraining.

axioms (1)

Domain assumption: visual foundation models supply transferable features for scene text detection without large-scale scene-text pretraining. The data-efficient dual-encoder design rests on this premise.

[1] [1]

IET Image Processing16(2), 289–310 (2022) 2

Arif, Z.H., Mahmoud, M.A., Abdulkareem, K.H., Mohammed, M.A., Al-Mhiqani, M.N., Mutlag, A.A., Damaševičius, R.: Comprehensive review of machine learning (ml) in image defogging: Taxonomy of concepts, scenes, feature extraction, and classification techniques. IET Image Processing16(2), 289–310 (2022) 2

2022

[2] [2]

In: 2024 IEEE International Symposium on Biomedical Imaging (ISBI)

Ayzenberg, L., Giryes, R., Greenspan, H.: Dinov2 based self supervised learning for few shot medical image segmentation. In: 2024 IEEE International Symposium on Biomedical Imaging (ISBI). pp. 1–5. IEEE (2024) 4 16 B. Chen et al

2024

[3] [3]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025) 4, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

IEEE Access11, 67040–67057 (2023) 2

Brophy, T., Mullins, D., Parsi, A., Horgan, J., Ward, E., Denny, P., Eising, C., Deegan, B., Glavin, M., Jones, E.: A review of the impact of rain on camera-based perception in automated driving systems. IEEE Access11, 67040–67057 (2023) 2

2023

[5] [5]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021) 4

2021

[6] [6]

Medical Image Analysis98, 103310 (2024) 4

Chen, C., Miao, J., Wu, D., Zhong, A., Yan, Z., Kim, S., Hu, J., Liu, Z., Sun, L., Li, X., et al.: Ma-sam: Modality-agnostic sam adaptation for 3d medical image segmentation. Medical Image Analysis98, 103310 (2024) 4

2024

[7] [7]

In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR)

Ch’Ng, C.K., Chan, C.S.: Total-text: A comprehensive dataset for scene text de- tection and recognition. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR). vol. 1, pp. 935–942. IEEE (2017) 2, 10

2017

[8] [8]

IEEE Transactions on Circuits and Systems for Video Technology32(9), 6073–6085 (2022) 1

Guan, T., Gu, C., Lu, C., Tu, J., Feng, Q., Wu, K., Guan, X.: Industrial scene text detection with refined feature-attentive network. IEEE Transactions on Circuits and Systems for Video Technology32(9), 6073–6085 (2022) 1

2022

[9] Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2315–2324 (2016) 2, 4, 11

[10] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022) 5

[11] Jiang, J., Zuo, Z., Wu, G., Jiang, K., Liu, X.: A survey on all-in-one image restoration: Taxonomy, evaluation and future trends. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025) 2

[12] Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al.: DINOv2 meets text: A unified framework for image-and pixel-level vision-language alignment. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24905–24916 (2025) 4

[13] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023) 4

[14] Li, C., Liu, W., Guo, R., Yin, X., Jiang, K., Du, Y., Du, Y., Zhu, L., Lai, B., Hu, X., et al.: PP-OCRv3: More attempts for the improvement of ultra lightweight OCR system. arXiv preprint arXiv:2206.03001 (2022) 4, 14

[15] Li, F., Zhang, H., Xu, H., Liu, S., Zhang, L., Ni, L.M., Shum, H.Y.: Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3041–3050 (2023) 4

[16] Liao, M., Wan, Z., Yao, C., Chen, K., Bai, X.: Real-time scene text detection with differentiable binarization. In: Proceedings of the AAAI conference on artificial intelligence. pp. 11474–11481 (2020) 2, 3, 10, 11, 12, 14

[17] Liu, S.Y., Wang, C.Y., Yin, H., Molchanov, P., Wang, Y.C.F., Cheng, K.T., Chen, M.H.: DoRA: Weight-decomposed low-rank adaptation. In: Forty-first International Conference on Machine Learning (2024) 5

[18] Long, S., He, X., Yao, C.: Scene text detection and recognition: The deep learning era. International Journal of Computer Vision 129(1), 161–184 (2021) 2

[19] Luo, S., Chen, W., Tian, W., Liu, R., Hou, L., Zhang, X., Shen, H., Wu, R., Geng, S., Zhou, Y., et al.: Delving into multi-modal multi-task foundation models for road scene understanding: From learning paradigm perspectives. IEEE Transactions on Intelligent Vehicles (2024) 2

[20] Luo, Y., Feinglass, J., Gokhale, T., Lee, K.C., Baral, C., Yang, Y.: Grounding stylistic domain generalization with quantitative domain shift measures and synthetic scene images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7303–7313 (2024) 2

[21] Ma, X., Wu, Q., Zhao, X., Zhang, X., Pun, M.O., Huang, B.: Sam-assisted remote sensing imagery semantic segmentation with object and boundary constraints. IEEE Transactions on Geoscience and Remote Sensing 62, 1–16 (2024) 4

[22] Mao, Y., Ge, Y., Fan, Y., Xu, W., Mi, Y., Hu, Z., Gao, Y.: A survey on LoRA of large language models. Frontiers of Computer Science 19(7), 197605 (2025) 5

[23] Miao, J., Chen, C., Yuan, Y., Li, Q., Heng, P.A.: Sam-driven cross prompting with adaptive sampling consistency for semi-supervised medical image segmentation. Medical Image Analysis p. 103973 (2026) 4

[24] Naiemi, F., Ghods, V., Khalesi, H.: Scene text detection and recognition: a survey. Multimedia Tools and Applications 81(14), 20255–20290 (2022) 1

[25] Nayef, N., Yin, F., Bizid, I., Choi, H., Feng, Y., Karatzas, D., Luo, Z., Pal, U., Rigaud, C., Chazalon, J., et al.: Icdar2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR). vol. 1, pp. 1454–1459. IEEE (2017) 10

[26] Neat, L., Peng, R., Qin, S., Manduchi, R.: Scene text access: A comparison of mobile ocr modalities for blind users. In: Proceedings of the 24th International Conference on Intelligent User Interfaces. pp. 197–207 (2019) 1

[27] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 4

[28] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024) 4

[29] Shao, Z., Su, Y., Zhou, Y., Meng, F., Zhu, H., Liu, B., Yao, R.: Ct-net: Arbitrary-shaped text detection via contour transformer. IEEE Transactions on Circuits and Systems for Video Technology 34(3), 1815–1826 (2023) 10, 11

[30] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025) 4

[31] Song, Y., Sun, P., Liu, H., Li, Z., Song, W., Xiao, Y., Zhou, X.: Scene-driven multimodal knowledge graph construction for embodied ai. IEEE Transactions on Knowledge and Data Engineering 36(11), 6962–6976 (2024) 1

[32] Su, H., Li, Y., Xu, Y., Fu, X., Liu, S.: A review of deep-learning-based super-resolution: From methods to applications. Pattern Recognition 157, 110935 (2025) 2

[33] Su, Y., Chen, Z., Du, Y., Ji, Z., Hu, K., Bai, J., Gao, X.: Explicit relational reasoning network for scene text detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 7069–7077 (2025) 2, 4, 10, 11

[34] Su, Y., Chen, Z., Shao, Z., Du, Y., Ji, Z., Bai, J., Zhou, Y., Jiang, Y.G.: Lranet: Towards accurate and efficient scene text detection with low-rank approximation network. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 4979–4987 (2024) 2, 4, 10, 11, 12

[35] Su, Y., Shao, Z., Zhou, Y., Meng, F., Zhu, H., Liu, B., Yao, R.: Textdct: Arbitrary-shaped text detection via discrete cosine transform mask. IEEE Transactions on Multimedia 25, 5030–5042 (2022) 10, 11

[36] Sun, H., Cai, C., Zhuang, H., Lee, K.A., Chau, L.P., Wang, Y.: Edvd-llama: Explainable deepfake video detection via multimodal large language model reasoning. arXiv preprint arXiv:2510.16442 (2025) 14

[37] Vaidya, S., Sharma, A.K., Gatti, P., Mishra, A.: Show me the world in my language: Establishing the first baseline for scene-text to scene-text translation. In: International Conference on Pattern Recognition. pp. 312–328. Springer (2024) 1

[38] Wang, R., Chen, H., Zhu, Y., Xu, J., Cao, X., Zhu, Z., Qian, S., Gao, C., Liu, L., Sang, N.: S3inet: Semantic-information space sharing interaction network for arbitrary shape text detection. IEEE Transactions on Neural Networks and Learning Systems (2025) 2, 3, 10, 11, 12, 14

[39] Wang, Y., Bian, Z.P., Hou, J., Chau, L.P.: Convolutional neural networks with dynamic regularization. IEEE Transactions on Neural Networks and Learning Systems 32(5), 2299–2304 (2020) 2

[40] Wang, Y., Bian, Z.P., Zhou, Y., Chau, L.P.: Rethinking and designing a high-performing automatic license plate recognition approach. IEEE transactions on intelligent transportation systems 23(7), 8868–8880 (2021) 2

[41] Wang, Y., Hou, J., Hou, X., Chau, L.P.: A self-training approach for point-supervised object detection and counting in crowds. IEEE Transactions on Image Processing 30, 2876–2887 (2021) 2

[42] Wei, H., Sun, Y., Li, Y.: DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234 (2025) 4, 14

[43] Xiong, Y., Zhou, C., Xiang, X., Wu, L., Zhu, C., Liu, Z., Suri, S., Varadarajan, B., Akula, R., Iandola, F., et al.: Efficient track anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11513–11524 (2025) 4

[44] Xue, C., Zhang, W., Hao, Y., Lu, S., Torr, P.H., Bai, S.: Language matters: A weakly supervised vision-language pre-training approach for scene text detection and spotting. In: European Conference on Computer Vision. pp. 284–302. Springer (2022) 2

[45] Yan, Z., Li, J., Li, X., Zhou, R., Zhang, W., Feng, Y., Diao, W., Fu, K., Sun, X.: Ringmo-sam: A foundation model for segment anything in multimodal remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing 61, 1–16 (2023) 4

[46] Yang, S., Xie, L., Ran, X., Lei, J., Qian, X.: Pragmatic degradation learning for scene text image super-resolution with data-training strategy. Knowledge-Based Systems 285, 111349 (2024) 2

[47] Yuliang, L., Lianwen, J., Shuaitao, Z., Sheng, Z.: Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170 (2017) 2, 9

[48] Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., Zhao, T.: AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512 (2023) 5

[49] Zhang, S.X., Zhu, X., Chen, L., Hou, J.B., Yin, X.C.: Arbitrary shape text detection via segmentation with probability maps. IEEE transactions on pattern analysis and machine intelligence 45(3), 2736–2750 (2022) 2, 3, 10, 11, 12, 14

[50] Zhang, S.X., Zhu, X., Yang, C., Wang, H., Yin, X.C.: Adaptive boundary proposal network for arbitrary shape text detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1305–1314 (2021) 2, 4, 10, 11, 12

[51] Zhao, Q., Li, G., He, B., Shen, R.: Deep learning for low-light vision: A comprehensive survey. IEEE Transactions on Neural Networks and Learning Systems (2025) 2

[52] Zhao, X., Feng, W., Zhang, Z., Lv, J., Zhu, X., Lin, Z., Hu, J., Shao, J.: Cbnet: A plug-and-play network for segmentation-based scene text detection. International Journal of Computer Vision 132(8), 3119–3138 (2024) 10, 11, 12