Semi-Bagging Based Deep Neural Architecture to Extract Text from High Entropy Images

Anirban Chatterjee; Pranay Dugar; Rajesh Shreedhar Bhat; Saswata Sahoo

arXiv: 1907.01284 · v1 · pith:G3LX4ZT6new · submitted 2019-07-02 · 💻 cs.CV · cs.LG · eess.IV

Pith reviewed 2026-05-25 11:12 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · eess.IV

keywords text detection · super pixel segmentation · ensemble methods · deep neural networks · high entropy images · e-commerce · text recognition · natural scene text

The pith

A semi-bagging architecture using super-pixel segments and multiple detectors extracts text more accurately from high-entropy images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a text detection approach that first divides an image into super-pixel regions and then applies an ensemble of different convolutional detectors to each region separately. The method targets images with high visual complexity where text appears alongside many other objects in limited space. It claims better coverage of text instances of different sizes and shapes than standard CNN approaches. When combined with a text recognizer, the full pipeline exceeds previous state-of-the-art results on e-commerce product images. The work focuses on practical extraction for applications like online shopping and augmented reality.
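The summary above describes the target images only as having high visual complexity. One common proxy for that idea is the Shannon entropy of the intensity histogram, $H = -\sum_i p_i \log_2 p_i$. The sketch below is an illustration under that assumption; the paper's own entropy definition is not given in the text available here, and the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(gray_pixels):
    """Shannon entropy (in bits) of a grayscale intensity histogram.

    gray_pixels: iterable of integer intensities, e.g. a flattened 8-bit image.
    Assumption: histogram entropy is used as the complexity proxy.
    """
    counts = Counter(gray_pixels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty image")
    # H = -sum(p * log2(p)) over the observed intensity values
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A flat patch uses a single intensity; a busy patch uses 64 distinct ones.
flat_patch = [128] * 64
busy_patch = list(range(64))

flat_entropy = shannon_entropy(flat_patch)
busy_entropy = shannon_entropy(busy_patch)
```

A flat patch scores zero bits, while a patch whose 64 pixels all differ scores six bits, so a cluttered product photo would be expected to score closer to the second case.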

Core claim

The central claim is that an end-to-end strategy combining super-pixel based segmentation with a semi-bagging ensemble of text detectors detects text in high entropy images more effectively than existing methods, and when paired with a recognizer, outperforms state-of-the-art approaches on product images.

What carries the argument

The semi-bagging ensemble of multiple text detectors applied independently to each super-pixel segment of the image.
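A minimal sketch of the per-segment ensemble idea follows. It assumes each segment is handed over as an offset plus a crop, that every detector is a callable returning scored boxes in crop coordinates, and that the pooled boxes are merged by greedy non-maximum suppression. The merge rule and all names (`detect_per_segment`, `wide_detector`, `tall_detector`) are illustrative assumptions; the source does not define the semi-bagging procedure, so this is not the authors' method.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect_per_segment(segments, detectors, iou_thresh=0.5):
    """Run every detector on every segment and merge the pooled boxes.

    segments:  list of (x_offset, y_offset, crop)
    detectors: callables mapping crop -> [((x1, y1, x2, y2), score), ...]
    """
    candidates = []
    for x_off, y_off, crop in segments:
        for detect in detectors:
            for (x1, y1, x2, y2), score in detect(crop):
                # Shift the box from crop coordinates to image coordinates.
                box = (x1 + x_off, y1 + y_off, x2 + x_off, y2 + y_off)
                candidates.append((box, score))

    # Greedy non-maximum suppression: keep the highest-scoring box first,
    # drop any later box that overlaps a kept one too strongly.
    candidates.sort(key=lambda c: c[1], reverse=True)
    kept = []
    for box, score in candidates:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score))
    return kept


# Toy stand-ins for real detectors (hypothetical, fixed outputs).
def wide_detector(crop):
    return [((0, 0, 10, 4), 0.9)]


def tall_detector(crop):
    return [((1, 0, 10, 4), 0.6), ((20, 0, 24, 10), 0.8)]


segments = [(0, 0, None), (100, 50, None)]
result = detect_per_segment(segments, [wide_detector, tall_detector])
result_boxes = {box for box, _ in result}
```

In this toy run the two detectors agree on one wide box per segment, so the lower-scoring duplicate is suppressed and each segment contributes two boxes.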

Load-bearing premise

Super-pixel segmentation creates regions that fully contain text instances without splitting them or removing essential context.

What would settle it

Measuring detection performance on high-entropy images where super-pixel boundaries cross through text strings would test if the segmentation step causes missed detections.
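One way to set up the check described above is to first count how many ground-truth text boxes are cut by a segment boundary. The sketch below assumes the segmentation is available as a 2D label map and that text boxes are axis-aligned with exclusive right and bottom edges; `split_rate` and `labels_in_box` are hypothetical helper names.

```python
def labels_in_box(label_map, box):
    """Set of segment labels covered by box (x1, y1, x2, y2); x2, y2 exclusive."""
    x1, y1, x2, y2 = box
    return {label_map[y][x] for y in range(y1, y2) for x in range(x1, x2)}


def split_rate(label_map, text_boxes):
    """Fraction of text boxes that span more than one segment."""
    if not text_boxes:
        return 0.0
    split = sum(1 for b in text_boxes if len(labels_in_box(label_map, b)) > 1)
    return split / len(text_boxes)


# Toy 4x6 label map: the left half is segment 0, the right half is segment 1.
label_map = [[0, 0, 0, 1, 1, 1] for _ in range(4)]
boxes = [
    (0, 0, 2, 2),  # entirely inside segment 0
    (2, 1, 5, 3),  # straddles the boundary between segments 0 and 1
    (4, 0, 6, 4),  # entirely inside segment 1
]
rate = split_rate(label_map, boxes)
```

Comparing detection recall on the split boxes against recall on the intact ones would then indicate whether the segmentation step is responsible for missed detections.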

Figures

Figures reproduced from arXiv: 1907.01284 by Anirban Chatterjee, Pranay Dugar, Rajesh Shreedhar Bhat, Saswata Sahoo.

**Figure 2.** The results of the proposed strategy of segmentation are shown for three different …

**Figure 3.** Sample of results for text localization on ICDAR2013 dataset (top row) and high …

Original abstract

Extracting texts of various size and shape from images containing multiple objects is an important problem in many contexts, especially, in connection to e-commerce, augmented reality assistance system in natural scene, etc. The existing works (based on only CNN) often perform sub-optimally when the image contains regions of high entropy having multiple objects. This paper presents an end-to-end text detection strategy combining a segmentation algorithm and an ensemble of multiple text detectors of different types to detect text in every individual image segments independently. The proposed strategy involves a super-pixel based image segmenter which splits an image into multiple regions. A convolutional deep neural architecture is developed which works on each of the segments and detects texts of multiple shapes, sizes, and structures. It outperforms the competing methods in terms of coverage in detecting texts in images especially the ones where the text of various types and sizes are compacted in a small region along with various other objects. Furthermore, the proposed text detection method along with a text recognizer outperforms the existing state-of-the-art approaches in extracting text from high entropy images. We validate the results on a dataset consisting of product images on an e-commerce website.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper describes a superpixel segmentation plus CNN ensemble pipeline for text detection in cluttered product images but reports no quantitative results or comparisons at all.

The main thing to know is that this paper outlines a pipeline that splits images with superpixels, runs an ensemble of CNN text detectors independently on each piece, and then pairs the detector with a recognizer, claiming better results on high-entropy e-commerce photos. But the abstract (and the description available) contains zero numbers, no baselines, no dataset details, and no evaluation metrics, so the performance claim cannot be checked or compared to anything else.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce an end-to-end text detection strategy combining super-pixel segmentation to split images into regions with an ensemble of multiple CNN-based text detectors run independently on each segment. It asserts superior coverage over competing methods for high-entropy images containing compacted text of varying sizes/shapes alongside other objects, and further claims that pairing the detector with a recognizer outperforms state-of-the-art approaches; results are said to be validated on a dataset of e-commerce product images.

Significance. If the empirical claims were supported by proper quantitative evaluation, the segmentation-plus-ensemble approach might offer a practical way to handle text detection in cluttered scenes. However, the complete absence of metrics, baselines, or validation of core assumptions prevents any assessment of whether the contribution advances the field.

major comments (2)

[Abstract] Abstract: the repeated claims that the method 'outperforms the competing methods in terms of coverage' and 'outperforms the existing state-of-the-art approaches in extracting text from high entropy images' are unsupported by any numerical results, tables, baselines, error bars, or statistical tests, so the central empirical assertion cannot be evaluated.
[Abstract / pipeline description] Pipeline description: the independent-per-segment detection premise requires that super-pixel segmentation produces regions containing complete text instances without splitting text across boundaries or discarding context; no validation, boundary analysis, or ablation on segmentation granularity is supplied, which directly undermines the coverage claim for high-entropy images.

minor comments (1)

[Title / Abstract] The title uses 'Semi-Bagging' while the abstract describes an 'ensemble' without defining the bagging procedure or the meaning of 'semi'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address the major comments point by point below.

Referee: [Abstract] Abstract: the repeated claims that the method 'outperforms the competing methods in terms of coverage' and 'outperforms the existing state-of-the-art approaches in extracting text from high entropy images' are unsupported by any numerical results, tables, baselines, error bars, or statistical tests, so the central empirical assertion cannot be evaluated.

Authors: We agree that the abstract would be strengthened by including specific numerical evidence. The manuscript validates the approach on an e-commerce product image dataset and includes comparisons with competing methods in the experimental section. To address this concern, we will revise the abstract to incorporate key quantitative results, such as coverage improvements and extraction accuracy metrics, along with references to the relevant tables and baselines. This revision will make the empirical claims directly evaluable from the abstract. Revision: yes.

Referee: [Abstract / pipeline description] Pipeline description: the independent-per-segment detection premise requires that super-pixel segmentation produces regions containing complete text instances without splitting text across boundaries or discarding context; no validation, boundary analysis, or ablation on segmentation granularity is supplied, which directly undermines the coverage claim for high-entropy images.

Authors: We acknowledge that an explicit validation of the super-pixel segmentation's impact on text instance integrity would strengthen the paper. The super-pixel approach is chosen to create regions that group similar pixels, aiming to isolate text areas in high-entropy scenes. However, we did not include an ablation study or boundary analysis in the current version. We will add such an analysis, including examples of segmentation boundaries and an ablation on the number of super-pixels, to the revised manuscript to support the premise and the coverage claims. Revision: yes.

Circularity Check

0 steps flagged

No circularity; purely empirical pipeline with external validation

The paper describes an end-to-end empirical pipeline (super-pixel segmentation followed by per-segment CNN ensemble detection) and reports performance on a held-out product-image dataset. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claim reduces to measured F-measure or similar metrics on external data, not to any internal redefinition or ansatz smuggled via prior work by the same authors.
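For readers unfamiliar with the metric named above, the sketch below shows one simple way an F-measure can be computed for box detections: greedy one-to-one matching at an IoU threshold, then the harmonic mean of precision and recall. This is an illustrative assumption, not the evaluation protocol of the paper, and benchmark protocols may define matching differently. The names `box_iou` and `detection_scores` are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_scores(predicted, ground_truth, iou_thresh=0.5):
    """Return (precision, recall, f_measure) under greedy one-to-one matching."""
    unmatched = list(ground_truth)
    true_pos = 0
    for p in predicted:
        # Pick the still-unmatched ground-truth box that overlaps p the most.
        best = max(unmatched, key=lambda g: box_iou(p, g), default=None)
        if best is not None and box_iou(p, best) >= iou_thresh:
            unmatched.remove(best)
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
predicted = [(0, 0, 10, 10), (100, 100, 110, 110)]
precision, recall, f_measure = detection_scores(predicted, ground_truth)
```

Here one of two predictions matches one of three ground-truth boxes, giving precision 1/2, recall 1/3, and an F-measure of 0.4.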

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1] Save money. live better., 2019. URL https://www.walmart.com/. Accessed: 2019-04-14

[2] Jiří Borovec, Jan Švihlík, Jan Kybic, and David Habart. Supervised and unsupervised segmentation using superpixels, model estimation, and graph cut. Journal of Electronic Imaging, 26(6):061610, 2017

[3] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1–20, 2016

[4] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition, pages 1484–1493. IEEE, 2013

[5] Thomas Leung and Jitendra Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. Int. J. Comput. Vision, 43(1):29–44, June

[6] ISSN 0920-5691. doi: 10.1023/A:1011126920638. URL https://doi.org/10.1023/A:1011126920638

[7] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. Textboxes: A fast text detector with a single deep neural network. In Thirty-First AAAI Conference on Artificial Intelligence, 2017

[8] Minghui Liao, Baoguang Shi, and Xiang Bai. Textboxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8):3676–3690, 2018

[9] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016

[10] Lukáš Neumann and Jiří Matas. Real-time scene text localization and recognition. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3538–

[11] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3578–3587, 2018

[12] Ch. Thum. Measurement of the entropy of an image with application to image focusing. Optica Acta: International Journal of Optics, 31(2):203–211, 1984. doi: 10.1080/713821475. URL https://doi.org/10.1080/713821475

[13] Shangxuan Tian, Yifeng Pan, Chang Huang, Shijian Lu, Kai Yu, and Chew Lim Tan. Text flow: A unified text detection system in natural scene images. In Proceedings of the IEEE International Conference on Computer Vision, pages 4651–4659, 2015

[14] Alessandro Zamberletti, Lucia Noce, and Ignazio Gallo. Text localization based on fast feature pyramids and multi-resolution maximally stable extremal regions. In Asian Conference on Computer Vision, pages 91–105. Springer, 2014

[15] Zheng Zhang, Wei Shen, Cong Yao, and Xiang Bai. Symmetry-based text line detection in natural scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2558–2567, 2015

[16] Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4159–4167, 2016
