ICDAR 2019 Competition on Scene Text Visual Question Answering

Ali Furkan Biten; Andres Mafla; C.V. Jawahar; Dimosthenis Karatzas; Ernest Valveny; Lluis Gomez; Marçal Rusiñol; Minesh Mathew; Rubèn Tito

arxiv: 1907.00490 · v1 · pith:XRUL6XGWnew · submitted 2019-06-30 · 💻 cs.CV


Pith reviewed 2026-05-25 12:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords: scene text · visual question answering · dataset · text recognition · image understanding · VQA benchmark · ICDAR competition


The pith

A new dataset of 23k images with text-grounded questions pushes VQA models to combine scene reading and context understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the ST-VQA competition and its dataset to fill a gap in visual question answering by requiring models to use text visible in images to answer questions. It assembles 23,038 images from seven existing datasets along with 31,791 question-answer pairs where the correct answer depends on reading and interpreting that text in its visual setting. Three tasks increase in difficulty by demanding more integration between text recognition and scene understanding, scored by a metric that rewards both capabilities. Participant results illustrate current limits of systems that can read, and the authors position the dataset as a milestone for building models that achieve more complete image understanding through scene text.

Core claim

The paper establishes a benchmark dataset and competition for scene text visual question answering consisting of 23,038 images annotated with 31,791 question-answer pairs drawn from seven public computer vision datasets, where every answer is grounded on text instances present in the image; the benchmark defines three tasks of increasing difficulty that require reading text in scene context and introduces an evaluation metric that jointly measures text recognition and image understanding.

What carries the argument

The ST-VQA dataset together with its three tasks of increasing difficulty and a novel evaluation metric that jointly scores text recognition accuracy and contextual scene understanding.

Load-bearing premise

The questions genuinely require both accurate text recognition and scene understanding rather than being solvable from text strings alone or from visual cues alone.

What would settle it

A system that achieves high scores on most questions by processing only the transcribed text strings while ignoring image content, or by using only image content while ignoring the text.
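A minimal sketch of the text-only half of such a check, assuming each question comes with a list of OCR tokens and one or more accepted answers. The `Sample` structure, both heuristics, and the three toy samples are hypothetical illustrations, not data or baselines from the paper. The baselines never see pixels; if heuristics like these scored highly on the real dataset, the premise above would be in doubt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """Hypothetical record: a question, OCR tokens (no pixels), accepted answers."""
    question: str
    ocr_tokens: List[str]
    answers: List[str]


def longest_token(sample: Sample) -> str:
    """Blind heuristic: always answer with the longest OCR token."""
    return max(sample.ocr_tokens, key=len, default="")


def numeric_if_asked(sample: Sample) -> str:
    """Blind heuristic: return the first token containing a digit when the
    question appears to ask for a number, otherwise the longest token."""
    q = sample.question.lower()
    if "number" in q or "how many" in q or "price" in q:
        for tok in sample.ocr_tokens:
            if any(ch.isdigit() for ch in tok):
                return tok
    return longest_token(sample)


def accuracy(samples: List[Sample], predict: Callable[[Sample], str]) -> float:
    """Case-insensitive exact-match accuracy of a text-only predictor."""
    if not samples:
        return 0.0
    hits = 0
    for s in samples:
        guess = predict(s).strip().lower()
        hits += any(guess == a.strip().lower() for a in s.answers)
    return hits / len(samples)


# Invented toy samples, for illustration only.
TOY_SAMPLES = [
    Sample("What brand is on the bottle?", ["coca-cola", "330", "ml"], ["coca-cola"]),
    Sample("What is the number on the bus?", ["downtown", "42"], ["42"]),
    Sample("What does the sign on the left say?", ["stop", "parking"], ["stop"]),
]

if __name__ == "__main__":
    for name, fn in [("longest_token", longest_token),
                     ("numeric_if_asked", numeric_if_asked)]:
        print(f"{name}: {accuracy(TOY_SAMPLES, fn):.2f}")
```

The third toy sample is the kind of question that separates the two regimes: choosing between "stop" and "parking" requires knowing which sign is on the left, which no token-only heuristic can recover. The vision-only half of the check would need an actual image model and is not sketched here.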

Figures

Figures reproduced from arXiv: 1907.00490 by Ali Furkan Biten, Andres Mafla, C.V. Jawahar, Dimosthenis Karatzas, Ernest Valveny, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, Rubèn Tito.

**Figure 2.** A detailed breakdown of the performance of the submitted models by image source (top) and question categories

**Figure 3.** Accuracy scores per ANLS threshold for Task 1 (left) and Task 3 (right)

Original abstract

This paper presents final results of ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, that require reading the text in a scene and understanding it in the context of the scene, to correctly answer a given question. A novel evaluation metric is presented, which elegantly assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone to consider towards a path of more robust and general models that can exploit scene text to achieve holistic image understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This competition paper releases the ST-VQA dataset with three graded tasks and a joint text-plus-understanding metric, but supplies no evidence that the questions actually require both capabilities rather than text or vision alone.


The main takeaway is that this is a competition report releasing a new dataset for scene text VQA, but the claim that the questions test joint reading and scene understanding rests on assertion rather than checks. The paper puts out 23,038 images and 31,791 QA pairs drawn from seven existing public datasets, defines three tasks of increasing difficulty, and proposes a metric meant to score both text recognition and image understanding together. Participant results are analyzed to show current system performance on this setup. That structure and the scale are the concrete contributions, and pulling from multiple sources gives at least the potential for broader coverage than single-dataset efforts. The competition format also surfaces where existing VQA models fall short when text must be read in context.

The soft spot is the missing validation for the core premise. The abstract states that answers are always grounded on text instances and that the tasks require reading the text and understanding it in scene context, yet the provided text shows no example questions, no OCR-only or vision-only baselines, and no quantitative check that scene information is indispensable for a non-trivial portion of the questions. The diversity across the seven source datasets is likewise stated without comparison or breakdown showing meaningful differences in scenario coverage. These gaps do not invalidate the dataset release itself, but they leave the strongest claim about holistic understanding only partially supported.

This work is mainly for groups building or benchmarking VQA systems that must handle scene text, and for anyone who needs a standard testbed in this narrow but practical area. Dataset and benchmark papers of this type are worth referee time even when the validation is incomplete, because the data and task definitions can still be used and extended. I would send it to peer review so the data collection details, question distribution, and metric can be examined directly.

Referee Report

3 major / 1 minor

Summary. The manuscript reports the organization and results of the ICDAR 2019 Scene Text Visual Question Answering (ST-VQA) competition. It introduces a dataset of 23,038 images with 31,791 QA pairs sourced from seven public datasets, structured into three tasks of increasing difficulty that require reading scene text in context. A novel evaluation metric is introduced to jointly assess text recognition and image understanding, participant submissions are analyzed, and the dataset is positioned as a milestone toward holistic scene-text VQA models.

Significance. If the QA pairs genuinely require joint text recognition and scene context (rather than being solvable by OCR output or visual cues alone) and the seven sources supply meaningful scenario diversity, the dataset and competition could become a useful benchmark for advancing VQA systems that integrate reading with visual understanding. The structured tasks and proposed metric provide a concrete evaluation framework; credit is due for releasing a large, multi-source dataset with answers explicitly grounded in text instances.

major comments (3)

[Abstract] The claim that the dataset constitutes 'an important milestone' for 'holistic image understanding' that exploits scene text rests on the unverified premise that questions 'require reading the text in a scene and understanding it in the context.' No sample QA pairs, modality-ablation baselines, or statistics are supplied to demonstrate that a non-trivial fraction of questions cannot be answered from text alone or from visual cues alone.
[Dataset and Tasks] The assertion that the seven source datasets 'cover a wide range of scenarios' and that answers are 'always grounded on text instances' lacks any quantitative verification of scenario diversity or grounding (e.g., distribution of question types across sources, or checks that scene context is indispensable). This directly affects the central claim of the work.
[Evaluation Metric and Results] The novel metric is described as 'elegantly assess[ing] both key capabilities,' yet the manuscript supplies no validation, comparison against standard VQA metrics, or error analysis. The 'detailed analysis of results from different participants' is referenced but no quantitative performance numbers, rankings, or failure-mode breakdowns appear in the provided text, leaving claims about current model capabilities unsupported.

minor comments (1)

[Abstract] The abstract would be strengthened by including one or two concrete quantitative highlights from the competition (e.g., top participant accuracy or metric scores) rather than remaining purely descriptive.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to the manuscript.


Referee: [Abstract] The claim that the dataset constitutes 'an important milestone' for 'holistic image understanding' that exploits scene text rests on the unverified premise that questions 'require reading the text in a scene and understanding it in the context.' No sample QA pairs, modality-ablation baselines, or statistics are supplied to demonstrate that a non-trivial fraction of questions cannot be answered from text alone or from visual cues alone.

Authors: We agree the abstract claim would be strengthened by supporting material. The grounding of answers in text instances follows directly from the annotation protocol described in the Dataset section. To address the concern, we will revise the abstract to moderate the language and add illustrative QA pair examples (with source images) plus a summary table of question categories in the revised version. Comprehensive modality ablations fall outside the scope of a competition report focused on participant results.

Revision: partial

Referee: [Dataset and Tasks] The assertion that the seven source datasets 'cover a wide range of scenarios' and that answers are 'always grounded on text instances' lacks any quantitative verification of scenario diversity or grounding (e.g., distribution of question types across sources, or checks that scene context is indispensable). This directly affects the central claim of the work.

Authors: The seven source datasets were deliberately chosen from distinct domains to ensure scenario variety. We acknowledge that the submitted manuscript does not include explicit quantitative breakdowns. In revision we will insert a table reporting image and QA counts per source together with representative examples that illustrate the role of scene context. The 'always grounded' property is enforced by the annotation guidelines; we can add a note on spot-check verification performed during dataset curation.

Revision: yes

Referee: [Evaluation Metric and Results] The novel metric is described as 'elegantly assess[ing] both key capabilities,' yet the manuscript supplies no validation, comparison against standard VQA metrics, or error analysis. The 'detailed analysis of results from different participants' is referenced but no quantitative performance numbers, rankings, or failure-mode breakdowns appear in the provided text, leaving claims about current model capabilities unsupported.

Authors: The metric employed is Average Normalized Levenshtein Similarity (ANLS), chosen because it jointly penalizes recognition and reasoning errors. We will expand the Results section with a participant ranking table, a short comparison against exact-match accuracy, and a concise error-mode summary drawn from the submitted runs. These details were condensed in the original submission for length reasons and will be restored.

Revision: yes
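For readers unfamiliar with ANLS, the following is a minimal sketch of how such a score can be computed and how it differs from exact-match accuracy. It is not the competition's official evaluation script. The threshold `tau = 0.5`, the case-insensitive comparison, and taking the best score over multiple accepted answers are assumptions made here for illustration, and the example strings are invented.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(pred: str, gt: str, tau: float = 0.5) -> float:
    """1 - normalized edit distance, zeroed once the distance reaches tau.
    tau = 0.5 and the lowercasing are assumptions of this sketch."""
    pred, gt = pred.strip().lower(), gt.strip().lower()
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0
    nl = levenshtein(pred, gt) / longest
    return 1.0 - nl if nl < tau else 0.0


def anls(preds: Sequence[str], refs: Sequence[List[str]], tau: float = 0.5) -> float:
    """Mean over questions of the best similarity against any accepted answer.
    Each question is assumed to have at least one accepted answer."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not preds:
        return 0.0
    return sum(max(similarity(p, g, tau) for g in gts)
               for p, gts in zip(preds, refs)) / len(preds)


def exact_match(preds: Sequence[str], refs: Sequence[List[str]]) -> float:
    """Case-insensitive exact-match accuracy, for comparison with ANLS."""
    if not preds:
        return 0.0
    hits = sum(any(p.strip().lower() == g.strip().lower() for g in gts)
               for p, gts in zip(preds, refs))
    return hits / len(preds)


if __name__ == "__main__":
    preds = ["coca cola", "pepsi"]           # invented predictions
    refs = [["coca-cola"], ["coca-cola"]]    # invented accepted answers
    print(f"ANLS = {anls(preds, refs):.3f}, exact match = {exact_match(preds, refs):.3f}")
```

In this toy case the first prediction is one character off and still earns partial credit (8/9), while the second shares nothing with the accepted answer and scores zero, so ANLS is 4/9 where exact match is 0. That is the sense in which a thresholded edit-distance score can tolerate small recognition slips while still rejecting an answer that picks the wrong text.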

Circularity Check

0 steps flagged

No circularity: dataset/competition paper with no derivations or fitted predictions


The paper introduces a new dataset and reports competition results. It contains no equations, model derivations, parameter fitting, or predictions that could reduce to inputs by construction. Central claims (dataset as milestone, tasks requiring joint text+scene understanding) are presented as assertions supported by the collection process and task design, not as outputs of any self-referential chain. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear. This matches the default non-circular case for a competition/dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical competition report that introduces a dataset and tasks rather than a mathematical derivation; no free parameters, axioms, or invented entities are required or stated.



