pith. sign in

arxiv: 1907.02228 · v1 · pith:WB4ZE5B3new · submitted 2019-07-04 · 💻 cs.CV

RFBTD: RFB Text Detector

Pith reviewed 2026-05-25 09:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords text detectionscene textreceptive field blocksRFBICDAR2015arbitrary orientationword level detection
0
0 comments X

The pith

Receptive Field Blocks enable prediction of individual words and text lines at arbitrary orientations in scene images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a text detector that prioritizes locating individual words instead of large dense blocks in natural scene images. It claims this is achieved by a method that supports words or text lines in any orientation or direction. The work examines how Receptive Field Blocks alter the receptive fields available to text segments within the network. Results are reported on the ICDAR2015 benchmark, reaching an F-score of 47.09 when run at 720p resolution. The approach is positioned as addressing the specific difficulty of word-level detection when text appears densely packed.

Core claim

An elegant solution promotes prediction of words or text lines of arbitrary orientations and directions while providing emphasis on individual words; Receptive Field Blocks are investigated for their impact on receptive fields for text segments, with experiments on ICDAR2015 yielding an F-score of 47.09 at 720p.

What carries the argument

Receptive Field Blocks (RFB) that modify receptive fields for text segments inside the detection network.

If this is right

  • Individual words become the primary detection target even when text is dense.
  • The detector handles text lines and words at any orientation without additional constraints.
  • Receptive Field Blocks directly affect performance on text segment receptive fields.
  • The reported F-score of 47.09 is achieved on ICDAR2015 at 720p input resolution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emphasis on single words could reduce reliance on grouping steps that are common in line-level detectors.
  • The same RFB modification might transfer to other oriented object detection tasks beyond text.
  • Testing the method at higher resolutions or on additional scene-text datasets would clarify whether the F-score holds under varied conditions.

Load-bearing premise

That adding Receptive Field Blocks will improve detection of individual words in dense text without creating new failure modes or needing extra post-processing.

What would settle it

Running the detector with and without the RFB modules on ICDAR2015 and finding no change or a drop in F-score at 720p would falsify the benefit claim.

read the original abstract

Text detection plays a critical role in the whole procedure of textual information extraction and understanding. On a high note, recent years have seen a surge in the high recall text detectors in scene text images, however text boxes for individual words is still a challenging when dense text is present in the scene. In this work, we propose an elegant solution that promotes prediction of words or text lines of arbitrary orientations and directions, providing emphasis on individual words. We also investigate the effects of Receptive Field Blocks(RFB) and its impact in receptive fields for text segments. Experiments were done on the ICDAR2015 and achieves an F-score of 47.09 at 720p

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RFBTD, a scene text detector that incorporates Receptive Field Blocks (RFB) to promote accurate prediction of individual words (rather than lines) of arbitrary orientations and directions in dense text. It investigates the impact of RFB on receptive fields for text segments and reports an F-score of 47.09 on the ICDAR2015 dataset at 720p resolution.

Significance. If the central performance attribution to RFB were demonstrated with controlled experiments, the work could offer a targeted architectural modification for word-level detection in oriented and dense scenes. However, the reported F-score is substantially below current ICDAR2015 state-of-the-art (typically >80 F-score), and no evidence is supplied that RFB yields a measurable gain over a non-RFB counterpart or avoids new failure modes, limiting the potential impact.

major comments (2)
  1. [Abstract] Abstract: The central claim that RFB 'promotes prediction of words or text lines ... providing emphasis on individual words' and improves receptive fields for text segments rests on an F-score of 47.09, yet the manuscript supplies neither an ablation (with/without RFB), a baseline comparison, nor any component-wise breakdown. This absence prevents attribution of any performance change to the RFB module.
  2. [Abstract] Abstract: No error bars, multiple runs, or statistical significance tests accompany the single reported F-score; without these, it is impossible to determine whether the result is reproducible or meaningfully different from prior detectors.
minor comments (2)
  1. [Abstract] Abstract, sentence 3: 'is still a challenging when dense text is present' contains a grammatical error and should be rephrased for clarity.
  2. [Abstract] Abstract: The phrase 'at 720p' is undefined; the input resolution or test protocol should be stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We acknowledge the need for stronger evidence to support the claims about the RFB module and will revise the manuscript to address the points raised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that RFB 'promotes prediction of words or text lines ... providing emphasis on individual words' and improves receptive fields for text segments rests on an F-score of 47.09, yet the manuscript supplies neither an ablation (with/without RFB), a baseline comparison, nor any component-wise breakdown. This absence prevents attribution of any performance change to the RFB module.

    Authors: We agree that the current manuscript lacks an ablation study, baseline comparison, or component-wise breakdown, which limits the ability to directly attribute performance to the RFB module. The reported F-score of 47.09 reflects the complete RFBTD model. In the revised version, we will add ablation experiments (with/without RFB) and a baseline comparison to demonstrate the module's impact on receptive fields for text segments. revision: yes

  2. Referee: [Abstract] Abstract: No error bars, multiple runs, or statistical significance tests accompany the single reported F-score; without these, it is impossible to determine whether the result is reproducible or meaningfully different from prior detectors.

    Authors: We acknowledge that the single reported F-score without error bars or multiple runs makes it difficult to assess reproducibility. The original experiments were conducted as a single run due to computational limitations. In the revision, we will perform additional runs where resources permit and include error bars or a discussion of observed variance to address this concern. revision: partial

Circularity Check

0 steps flagged

No derivation chain present; empirical proposal without circular elements

full rationale

The manuscript is an empirical computer-vision proposal that augments a detector with Receptive Field Blocks and reports an F-score on ICDAR2015. No equations, first-principles derivations, fitted-parameter predictions, or self-citation load-bearing steps appear in the abstract or described content. All performance claims rest on external benchmark results rather than on any quantity defined in terms of itself, so the work is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5630 in / 1073 out tokens · 55038 ms · 2026-05-25T09:28:24.583903+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 3 internal anchors

  1. [1]

    The network is trained until performance stops improving

    Exponential decay is induced from one-tenth every 27300 minibatches upto 1e-5. The network is trained until performance stops improving. IV. E XPERIMENTS The proposed method was benchmarked in ICDAR 2015 [25]. It includes a total of 1500 pictures, 1000 of which are used for training and the remaining are for testing. The text regions are annotated by 4 of...

  2. [2]

    Epshtein, E

    B. Epshtein, E. Ofek, and Y. Wexler. Detecting text in natural scenes with stroke width transform. In ​ Proc. of CVPR, ​ 2010

  3. [3]

    Zhang, W

    Z. Zhang, W. Shen, C. Yao, and X. Bai. Symmetry-based text line detection in natural scenes. In ​ Proc. of CVPR ​ , 2015

  4. [4]

    DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images

    Z. Zhong, L. Jin, S. Zhang, and Z. Feng. Deeptext: A unified framework for text proposal generation and text detection in natural images. ​ arXiv preprint arXiv:1605.07314 ​ , 2016

  5. [5]

    C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao. Scene text detection via holistic, multi-channel prediction. ​ arXiv preprint arXiv:1606.09002 ​ ,

  6. [6]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ​ arXiv preprint arXiv:1409.1556 ​ ,

  7. [7]

    EAST: An Efficient and Accurate Scene Text Detector in ​ 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. EAST: An Efficient and Accurate Scene Text Detector in ​ 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  8. [8]

    Fully convolutional networks for semantic segmentation in ​ 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Jonathan Long, Evan Shelhamer, Trevor Darrell. Fully convolutional networks for semantic segmentation in ​ 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  9. [9]

    Receptive Field Block Net for Accurate and Fast Object Detection in ​ ECCV 2018: Computer Vision – ECCV 2018 pp 404-419

    Songtao Liu, Di Huang, Yunhong Wang. Receptive Field Block Net for Accurate and Fast Object Detection in ​ ECCV 2018: Computer Vision – ECCV 2018 pp 404-419

  10. [10]

    Deep Residual Learning for Image Recognition in ​ 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition in ​ 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

  11. [11]

    J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. FeiFei. Imagenet: A large-scale hierarchical image database. In ​ Proc. of CVPR ​ , 2009

  12. [12]

    J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang. Unitbox: An advanced object detection network. In ​ Proceedings of the 2016 ACM on Multimedia Conference, pages 516–520. ACM ​ ,

  13. [13]

    Karatzas, L

    D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. ​ In Proc. of ICDAR, ​

  14. [14]

    Adaptive Subgradient Methods for Online Learning and Stochastic Optimization in Journal of ​ Machine Learning Research 12 (2011) 2121-2159

    John Duchi, Elad Hazan, Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization in Journal of ​ Machine Learning Research 12 (2011) 2121-2159