RFBTD: RFB Text Detector
Pith reviewed 2026-05-25 09:28 UTC · model grok-4.3
The pith
Receptive Field Blocks enable prediction of individual words and text lines at arbitrary orientations in scene images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes a detector that predicts words or text lines of arbitrary orientation and direction, with emphasis on individual words. It also investigates how Receptive Field Blocks affect the receptive fields for text segments, and reports an F-score of 47.09 on ICDAR2015 at 720p.
What carries the argument
Receptive Field Blocks (RFB) that modify receptive fields for text segments inside the detection network.
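To make the receptive-field idea concrete, the sketch below computes the receptive field of a few stacked stride-1 convolutions with dilation. The branch layout (kernel sizes 1, 3 and 5, each followed by a 3x3 convolution with dilation 1, 3 and 5) is an assumption loosely modelled on the general RFB design, not a description of RFBTD's actual configuration, and the function names are hypothetical.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions applied in sequence.

    layers is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        # Each stride-1 layer widens the field by (effective kernel - 1).
        rf += effective_kernel(kernel_size, dilation) - 1
    return rf


# Hypothetical branch layout (assumption, not taken from the paper).
branches = {
    "branch_1": [(1, 1), (3, 1)],
    "branch_3": [(3, 1), (3, 3)],
    "branch_5": [(5, 1), (3, 5)],
}

for name, layers in branches.items():
    print(name, stacked_receptive_field(layers))
# prints:
# branch_1 3
# branch_3 9
# branch_5 15
```

Under these assumptions the three branches see 3, 9 and 15 pixels per side, which is the kind of multi-scale coverage the review attributes to the RFB modules.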
If this is right
- Individual words become the primary detection target even when text is dense.
- The detector handles text lines and words at any orientation without additional constraints.
- Receptive Field Blocks directly shape the receptive fields applied to text segments, which in turn affects detection performance.
- The reported F-score of 47.09 is achieved on ICDAR2015 at 720p input resolution.
Where Pith is reading between the lines
- The emphasis on single words could reduce reliance on grouping steps that are common in line-level detectors.
- The same RFB modification might transfer to other oriented object detection tasks beyond text.
- Testing the method at higher resolutions or on additional scene-text datasets would clarify whether the F-score holds under varied conditions.
Load-bearing premise
That adding Receptive Field Blocks will improve detection of individual words in dense text without creating new failure modes or needing extra post-processing.
What would settle it
Running the detector with and without the RFB modules on ICDAR2015 and finding no change or a drop in F-score at 720p would falsify the benefit claim.
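The comparison described above can be sketched as follows. The F-score is the harmonic mean of precision and recall; the detection counts used here are made-up placeholders for illustration, not results from the paper, and `ablation_delta` is a hypothetical helper name.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall from detection counts."""
    detected = true_pos + false_pos
    ground_truth = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / ground_truth if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ablation_delta(with_rfb, without_rfb) -> float:
    """F-score change attributable to the module (positive means a gain)."""
    return f_score(*with_rfb) - f_score(*without_rfb)


# Placeholder (true_pos, false_pos, false_neg) counts, NOT measured values.
with_rfb = (50, 50, 50)
without_rfb = (40, 60, 60)

print(f"with RFB:    {f_score(*with_rfb):.3f}")
print(f"without RFB: {f_score(*without_rfb):.3f}")
print(f"delta:       {ablation_delta(with_rfb, without_rfb):.3f}")
```

A delta at or below zero on the real benchmark, under the same input resolution and matching protocol, is the outcome that would count against the benefit claim.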
Original abstract
Text detection plays a critical role in the whole procedure of textual information extraction and understanding. On a high note, recent years have seen a surge in the high recall text detectors in scene text images, however text boxes for individual words is still a challenging when dense text is present in the scene. In this work, we propose an elegant solution that promotes prediction of words or text lines of arbitrary orientations and directions, providing emphasis on individual words. We also investigate the effects of Receptive Field Blocks(RFB) and its impact in receptive fields for text segments. Experiments were done on the ICDAR2015 and achieves an F-score of 47.09 at 720p
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RFBTD, a scene text detector that incorporates Receptive Field Blocks (RFB) to promote accurate prediction of individual words (rather than lines) of arbitrary orientations and directions in dense text. It investigates the impact of RFB on receptive fields for text segments and reports an F-score of 47.09 on the ICDAR2015 dataset at 720p resolution.
Significance. If the central performance attribution to RFB were demonstrated with controlled experiments, the work could offer a targeted architectural modification for word-level detection in oriented and dense scenes. However, the reported F-score is substantially below current ICDAR2015 state-of-the-art (typically >80 F-score), and no evidence is supplied that RFB yields a measurable gain over a non-RFB counterpart or avoids new failure modes, limiting the potential impact.
major comments (2)
- [Abstract] Abstract: The central claim that RFB 'promotes prediction of words or text lines ... providing emphasis on individual words' and improves receptive fields for text segments rests on an F-score of 47.09, yet the manuscript supplies neither an ablation (with/without RFB), a baseline comparison, nor any component-wise breakdown. This absence prevents attribution of any performance change to the RFB module.
- [Abstract] Abstract: No error bars, multiple runs, or statistical significance tests accompany the single reported F-score; without these, it is impossible to determine whether the result is reproducible or meaningfully different from prior detectors.
minor comments (2)
- [Abstract] Abstract, sentence 2: 'is still a challenging when dense text is present' contains a grammatical error and should be rephrased for clarity.
- [Abstract] Abstract: The phrase 'at 720p' is undefined; the input resolution or test protocol should be stated explicitly.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We acknowledge the need for stronger evidence to support the claims about the RFB module and will revise the manuscript to address the points raised.
-
Referee: [Abstract] Abstract: The central claim that RFB 'promotes prediction of words or text lines ... providing emphasis on individual words' and improves receptive fields for text segments rests on an F-score of 47.09, yet the manuscript supplies neither an ablation (with/without RFB), a baseline comparison, nor any component-wise breakdown. This absence prevents attribution of any performance change to the RFB module.
Authors: We agree that the current manuscript lacks an ablation study, baseline comparison, or component-wise breakdown, which limits the ability to directly attribute performance to the RFB module. The reported F-score of 47.09 reflects the complete RFBTD model. In the revised version, we will add ablation experiments (with/without RFB) and a baseline comparison to demonstrate the module's impact on receptive fields for text segments. revision: yes
-
Referee: [Abstract] Abstract: No error bars, multiple runs, or statistical significance tests accompany the single reported F-score; without these, it is impossible to determine whether the result is reproducible or meaningfully different from prior detectors.
Authors: We acknowledge that the single reported F-score without error bars or multiple runs makes it difficult to assess reproducibility. The original experiments were conducted as a single run due to computational limitations. In the revision, we will perform additional runs where resources permit and include error bars or a discussion of observed variance to address this concern. revision: partial
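A minimal sketch of the kind of reporting the rebuttal promises, assuming several independent training runs each produce one F-score. The scores below are placeholders, not measurements, and `summarize_runs` is a hypothetical helper.

```python
import statistics


def summarize_runs(f_scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(f_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(f_scores)
    std = statistics.stdev(f_scores)  # sample (n - 1) standard deviation
    return mean, std


# Placeholder F-scores from three hypothetical runs, NOT reported results.
runs = [0.40, 0.50, 0.60]
mean, std = summarize_runs(runs)
print(f"F-score: {mean:.3f} +/- {std:.3f} over {len(runs)} runs")
```

With only a handful of runs the sample standard deviation is itself a rough estimate, so the number of runs should be reported alongside it.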
Circularity Check
No derivation chain present; empirical proposal without circular elements
The manuscript is an empirical computer-vision proposal that augments a detector with Receptive Field Blocks and reports an F-score on ICDAR2015. No equations, first-principles derivations, fitted-parameter predictions, or self-citation load-bearing steps appear in the abstract or described content. All performance claims rest on external benchmark results rather than on any quantity defined in terms of itself, so the work is self-contained against external evaluation.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear.
The Text detector uses a Resnet backbone, and outputs predictions in the form of rotated boxes... The RFB block module provides an eccentric receptive field which aid in fine granularity to clearly distinguish word boxes in text lines / segments.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear.
We also investigate the effects of Receptive Field Blocks(RFB) and its impact in receptive fields for text segments.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.