A Multitask Network for Localization and Recognition of Text in Images

Keegan E. Hines; Mohammad Reza Sarshogh

arXiv: 1906.09266 · v1 · pith:RRQ4DZZOnew · submitted 2019-06-21 · cs.CL · cs.CV

Pith reviewed 2026-05-25 18:48 UTC · model grok-4.3

keywords: text localization, text recognition, multi-task network, end-to-end OCR, feature pyramid network, dynamic pooling, convolutional attention

The pith

One end-to-end network localizes and recognizes text in images without post-processing or word grouping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a multi-task model that performs text localization and recognition in a single pass for lexicon-free extraction from complex documents. A convolutional backbone combined with a feature pyramid network supplies a shared feature map to three heads that handle localization, classification, and recognition. Dynamic pooling preserves high-resolution details inside regions of interest, while a convolutional attention block replaces recurrent layers for the recognition task. The resulting system reports strong results on benchmark datasets in non-traditional OCR settings where separate pipelines typically require extra steps.

Core claim

The authors claim that a single trainable network simultaneously solves text localization and text recognition, with text segments identified directly and without post-processing, cropping, or word grouping, by routing a shared representation from a convolutional backbone and feature pyramid network into three model heads.

What carries the argument

Three model heads (localization, classification, text recognition) attached to a shared convolutional backbone plus feature pyramid network, plus dynamic pooling and convolutional attention for recognition.
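
As a rough illustration of the shared-representation idea, the sketch below computes one set of features and hands it to three heads. Everything in it is hypothetical (the `MultiTaskModel` class, the toy backbone, the head functions); the summary gives no layer-level detail, so this shows only the data flow, not the authors' network.

```python
from typing import Callable, Dict, List

Features = List[float]


class MultiTaskModel:
    """Hypothetical wrapper: one shared feature extractor, several task heads."""

    def __init__(self,
                 backbone: Callable[[List[float]], Features],
                 heads: Dict[str, Callable[[Features], object]]) -> None:
        self.backbone = backbone
        self.heads = heads

    def forward(self, image: List[float]) -> Dict[str, object]:
        shared = self.backbone(image)  # computed once per image
        # Every head reads the same shared features.
        return {name: head(shared) for name, head in self.heads.items()}


call_log: List[int] = []  # records each backbone invocation


def toy_backbone(image: List[float]) -> Features:
    """Stand-in for backbone + FPN: here just a running sum over the input."""
    call_log.append(len(image))
    total = 0.0
    features = []
    for pixel in image:
        total += pixel
        features.append(total)
    return features


model = MultiTaskModel(
    backbone=toy_backbone,
    heads={
        "localization": lambda f: (0, len(f) - 1),    # toy box: first/last index
        "classification": lambda f: f[-1] > 0.0,      # toy text / no-text flag
        "recognition": lambda f: [int(x) for x in f], # toy per-position output
    },
)

outputs = model.forward([0.5, 0.5, 1.0])
```

The point of the design is that the expensive feature extraction runs once, and the three tasks differ only in the small functions applied on top of it.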

If this is right

Text extraction pipelines can collapse from multiple stages into one forward pass.
High-resolution information inside detected regions improves recognition accuracy without separate cropping.
Convolutional attention can replace recurrent networks for sequence recognition while maintaining or improving accuracy.
The approach works in challenging non-traditional OCR regimes where lexicon-free methods are needed.
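
One way to picture the high-resolution claim is to compare a fixed pooling grid with one whose width follows the region's aspect ratio. The sketch below is an assumption made for illustration: the `dynamic_output_width` rule and the `adaptive_max_pool` helper are hypothetical, and the summary does not say how the paper's dynamic pooling chooses its output size.

```python
import math
from typing import List

Grid = List[List[float]]


def dynamic_output_width(roi_w: int, roi_h: int, out_h: int, max_w: int = 64) -> int:
    """Hypothetical rule: scale the pooled width with the RoI's aspect ratio."""
    return max(1, min(max_w, math.ceil(out_h * roi_w / roi_h)))


def adaptive_max_pool(feature: Grid, out_h: int, out_w: int) -> Grid:
    """Max-pool a 2-D feature crop into an out_h x out_w grid of bins."""
    in_h, in_w = len(feature), len(feature[0])
    pooled = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h
        r1 = math.ceil((i + 1) * in_h / out_h)
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = math.ceil((j + 1) * in_w / out_w)
            row.append(max(feature[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A wide "word" crop: 2 rows x 8 columns of toy feature values.
roi = [[1, 2, 3, 4, 5, 6, 7, 8],
       [8, 7, 6, 5, 4, 3, 2, 1]]

fixed = adaptive_max_pool(roi, out_h=2, out_w=2)         # fixed grid: 8 columns squeezed into 2
width = dynamic_output_width(roi_w=8, roi_h=2, out_h=2)  # aspect-aware width
dynamic = adaptive_max_pool(roi, out_h=2, out_w=width)   # every column survives
```

With the fixed 2 x 2 grid, four neighbouring columns collapse into each bin; with the aspect-aware width, the wide crop keeps one bin per column, which is the kind of detail a recognizer would need to separate characters.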

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The architecture may lower latency in document scanning applications by eliminating separate detection and recognition models.
Similar shared-representation designs could apply to other paired vision tasks such as detecting objects and describing them.
Performance on highly variable fonts or scripts would test whether the dynamic pooling and attention mechanisms generalize beyond the reported benchmarks.

Load-bearing premise

The shared features produced by the backbone and pyramid network are rich enough to support accurate performance from all three heads at once.

What would settle it

If running the model on standard OCR benchmarks produces text segments that still require cropping or grouping to form complete words, the claim of identification without post-processing would be refuted.

Figures

Figures reproduced from arXiv: 1906.09266 by Keegan E. Hines, Mohammad Reza Sarshogh.

**Figure 2.** Model Architecture. Image features are extracted through a shared convolutional backbone consisting of a shallow DenseNet and a Feature Pyramid

**Figure 3.** Examples of model output with detected bounding boxes shown as dashed lines and predicted text shown in red. From left to right: (1) an image

**Figure 4.** Example of model output where it failed to distinguish independent

**Figure 5.** Visualization of attention vectors at inference time for several RoIs.

Original abstract

We present an end-to-end trainable multi-task network that addresses the problem of lexicon-free text extraction from complex documents. This network simultaneously solves the problems of text localization and text recognition and text segments are identified with no post-processing, cropping, or word grouping. A convolutional backbone and Feature Pyramid Network are combined to provide a shared representation that benefits each of three model heads: text localization, classification, and text recognition. To improve recognition accuracy, we describe a dynamic pooling mechanism that retains high-resolution information across all RoIs. For text recognition, we propose a convolutional mechanism with attention which out-performs more common recurrent architectures. Our model is evaluated against benchmark datasets and comparable methods and achieves high performance in challenging regimes of non-traditional OCR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract outlines a multitask FPN network with dynamic pooling and convolutional attention for joint text localization and recognition without post-processing, but supplies no numbers or experiments to check if any of it works.

The main point is a single trainable network that does text localization, classification, and recognition from a shared convolutional backbone and Feature Pyramid Network, with text segments coming out directly and no cropping or grouping steps afterward. Dynamic pooling is added to hold onto high-resolution details across regions of interest, and the recognition head uses a convolutional attention setup instead of the usual recurrent layers.
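
To make "convolutional attention instead of recurrent layers" less abstract, here is a minimal sketch in which attention weights over feature columns come from a 1-D convolution rather than from a recurrent hidden state. The function names, the column-mean summary, and the hand-set kernel are all assumptions for illustration; the abstract gives no equations, and in a trained model the kernel would be learned.

```python
import math
from typing import List, Tuple


def conv1d_same(seq: List[float], kernel: List[float]) -> List[float]:
    """Zero-padded 1-D cross-correlation; assumes an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def conv_attention(columns: List[List[float]],
                   kernel: List[float]) -> Tuple[List[float], List[float]]:
    """Score each feature column with a convolution, then take the weighted sum."""
    summary = [sum(col) / len(col) for col in columns]  # one scalar per column
    weights = softmax(conv1d_same(summary, kernel))     # attention over positions
    dim = len(columns[0])
    context = [sum(w * col[d] for w, col in zip(weights, columns))
               for d in range(dim)]
    return weights, context


# Four feature columns of an RoI; only the third carries signal.
columns = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0], [0.0, 0.0]]
weights, context = conv_attention(columns, kernel=[0.0, 1.0, 0.0])
```

Because the scores depend only on a local window of columns, all positions can be scored in parallel, which is the usual practical argument for convolutional over recurrent sequence models.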

Referee Report

2 major / 0 minor

Summary. The paper presents an end-to-end trainable multi-task network for lexicon-free text extraction from complex documents. It simultaneously performs text localization and recognition with no post-processing, cropping, or word grouping. A convolutional backbone combined with a Feature Pyramid Network supplies a shared representation to three heads (localization, classification, recognition). Dynamic pooling retains high-resolution RoI information, and a convolutional attention mechanism is proposed for recognition that is claimed to outperform recurrent models. The model is asserted to achieve high performance on benchmark datasets in non-traditional OCR regimes.

Significance. If the empirical results and architectural details support the claims, the work would offer a meaningful contribution to scene text recognition by demonstrating a unified architecture that eliminates separate post-processing stages and leverages shared multi-scale features across tasks. The dynamic pooling and attention-based recognizer represent potentially useful technical ideas for maintaining resolution and improving sequence modeling in OCR.

major comments (2)

[Abstract] The assertion that the model 'achieves high performance in challenging regimes of non-traditional OCR' and is 'evaluated against benchmark datasets and comparable methods' supplies no quantitative results, error bars, dataset names/splits, or ablation studies, rendering the central empirical claim unverifiable from the provided text.
[Abstract] The description of how the shared FPN representation benefits the three heads, how dynamic pooling interacts with RoIs, and how the convolutional attention mechanism is implemented remains at a high level with no equations, pseudocode, or architectural diagrams, preventing assessment of whether the claimed simultaneous localization+recognition without post-processing is actually achieved.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the abstract to improve the verifiability and specificity of our claims.

Referee: [Abstract] The assertion that the model 'achieves high performance in challenging regimes of non-traditional OCR' and is 'evaluated against benchmark datasets and comparable methods' supplies no quantitative results, error bars, dataset names/splits, or ablation studies, rendering the central empirical claim unverifiable from the provided text.

Authors: We agree that the abstract would benefit from including specific quantitative results to make the empirical claims verifiable. In the revised manuscript, we will update the abstract to report key performance metrics from our evaluations on benchmark datasets (including dataset names and splits), comparisons with comparable methods, and references to ablation studies and any available error bars or variance measures from the experiments section.
Revision: yes

Referee: [Abstract] The description of how the shared FPN representation benefits the three heads, how dynamic pooling interacts with RoIs, and how the convolutional attention mechanism is implemented remains at a high level with no equations, pseudocode, or architectural diagrams, preventing assessment of whether the claimed simultaneous localization+recognition without post-processing is actually achieved.

Authors: We acknowledge the abstract provides only a high-level overview. We will revise it to include more specific details on the shared FPN representation benefiting the localization, classification, and recognition heads, the role of dynamic pooling in retaining high-resolution RoI information, and the convolutional attention mechanism for recognition. We will also explicitly clarify that the model achieves simultaneous localization and recognition in an end-to-end fashion with no post-processing, cropping, or word grouping required. Due to typical abstract length limits, we will reference the detailed equations, pseudocode, and diagrams in the methods section for full assessment.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical multitask neural architecture (convolutional backbone + FPN + three heads + dynamic pooling + convolutional attention) for simultaneous text localization and recognition. All load-bearing elements are design choices and benchmark evaluations rather than equations or parameters that reduce by construction to the inputs. No self-citations, fitted quantities renamed as predictions, or uniqueness theorems appear in the provided abstract and description; the central claim of end-to-end operation without post-processing is presented as an architectural outcome, not a definitional tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning training assumptions and on the empirical effectiveness of the proposed architectural modules; no new physical entities or ad-hoc constants are introduced.

free parameters (1)

network weights and hyperparameters
All model parameters are learned from training data; their specific values are not stated in the abstract.

axioms (1)

[standard math] Standard assumptions of gradient-based optimization and back-propagation through shared layers
End-to-end multitask training presupposes these properties hold for the combined loss.
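
To make the "combined loss" concrete, a common way to write a multitask objective is a weighted sum of per-head losses. The weights $\lambda_k$ and the parameter symbols below are assumptions for illustration; the abstract does not state how the three losses are combined.

```latex
\mathcal{L}(\theta_s, \theta_{loc}, \theta_{cls}, \theta_{rec})
  = \lambda_{loc}\,\mathcal{L}_{loc}(\theta_s, \theta_{loc})
  + \lambda_{cls}\,\mathcal{L}_{cls}(\theta_s, \theta_{cls})
  + \lambda_{rec}\,\mathcal{L}_{rec}(\theta_s, \theta_{rec}),
\qquad
\nabla_{\theta_s}\mathcal{L}
  = \sum_{k \in \{loc,\,cls,\,rec\}} \lambda_k\,\nabla_{\theta_s}\mathcal{L}_k .
```

Here $\theta_s$ stands for the shared backbone and feature pyramid parameters, so under this form each head contributes its own gradient term to the shared layers, which is what back-propagation through shared layers amounts to.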

pith-pipeline@v0.9.0 · 5649 in / 1191 out tokens · 28024 ms · 2026-05-25T18:48:21.836195+00:00 · methodology
