ICDAR 2019 Competition on Scene Text Visual Question Answering
Pith reviewed 2026-05-25 12:14 UTC · model grok-4.3
The pith
A new dataset of 23k images with text-grounded questions pushes VQA models to combine scene reading and context understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a benchmark dataset and competition for scene text visual question answering. The dataset comprises 23,038 images, taken from seven public computer vision datasets and annotated with 31,791 question-answer pairs, where every answer is grounded on text instances present in the image. The benchmark defines three tasks of increasing difficulty that require reading text in scene context, and it introduces an evaluation metric that jointly measures text recognition and image understanding.
What carries the argument
The ST-VQA dataset together with its three tasks of increasing difficulty and a novel evaluation metric that jointly scores text recognition accuracy and contextual scene understanding.
Load-bearing premise
The questions genuinely require both accurate text recognition and scene understanding rather than being solvable from text strings alone or from visual cues alone.
What would settle it
A system that achieves high scores on most questions by processing only the transcribed text strings while ignoring image content, or by using only image content while ignoring the text.
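The test described above can be sketched as a per-question comparison between a full system and two single-modality baselines. This is a minimal illustration, not anything reported in the paper: the function name `modality_breakdown`, the three score lists, and the `solved_at` cutoff of 0.5 are all hypothetical, and the numbers in the usage line are made up purely to show the call.

```python
def modality_breakdown(full, text_only, image_only, solved_at=0.5):
    """Summarize how many questions a single modality can solve.

    Each argument is a list of per-question scores in [0, 1] for one
    hypothetical system, aligned so that index k is the same question.
    `solved_at` is an illustrative cutoff, not a value from the paper.
    """
    if not (len(full) == len(text_only) == len(image_only)):
        raise ValueError("score lists must cover the same questions")
    n = len(full)
    if n == 0:
        return {"text_only": 0.0, "image_only": 0.0, "needs_both": 0.0}

    # Questions a text-only system solves without looking at the image.
    text_solved = sum(s >= solved_at for s in text_only)
    # Questions an image-only system solves without reading any text.
    image_solved = sum(s >= solved_at for s in image_only)
    # Questions the full system solves that neither baseline solves.
    needs_both = sum(
        f >= solved_at and t < solved_at and i < solved_at
        for f, t, i in zip(full, text_only, image_only)
    )
    return {
        "text_only": text_solved / n,
        "image_only": image_solved / n,
        "needs_both": needs_both / n,
    }


# Toy scores for four questions, invented for illustration only.
breakdown = modality_breakdown([1, 1, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0])
```

A large `text_only` or `image_only` share would count against the load-bearing premise, while a large `needs_both` share would support it.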
Original abstract
This paper presents final results of ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, that require reading the text in a scene and understanding it in the context of the scene, to correctly answer a given question. A novel evaluation metric is presented, which elegantly assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone to consider towards a path of more robust and general models that can exploit scene text to achieve holistic image understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports the organization and results of the ICDAR 2019 Scene Text Visual Question Answering (ST-VQA) competition. It introduces a dataset of 23,038 images with 31,791 QA pairs sourced from seven public datasets, structured into three tasks of increasing difficulty that require reading scene text in context. A novel evaluation metric is introduced to jointly assess text recognition and image understanding, participant submissions are analyzed, and the dataset is positioned as a milestone toward holistic scene-text VQA models.
Significance. If the QA pairs genuinely require joint text recognition and scene context (rather than being solvable by OCR output or visual cues alone) and the seven sources supply meaningful scenario diversity, the dataset and competition could become a useful benchmark for advancing VQA systems that integrate reading with visual understanding. The structured tasks and proposed metric provide a concrete evaluation framework; credit is due for releasing a large, multi-source dataset with answers explicitly grounded in text instances.
major comments (3)
- [Abstract] The claim that the dataset constitutes 'an important milestone' for 'holistic image understanding' that exploits scene text is load-bearing on the unverified premise that questions 'require reading the text in a scene and understanding it in the context.' No sample QA pairs, modality-ablation baselines, or statistics are supplied to demonstrate that a non-trivial fraction of questions cannot be answered from text alone or from visual cues alone.
- [Dataset and Tasks] The assertion that the seven source datasets 'cover a wide range of scenarios' and that answers are 'always grounded on text instances' lacks any quantitative verification of scenario diversity or grounding (e.g., distribution of question types across sources, or checks that scene context is indispensable). This directly affects the central claim of the work.
- [Evaluation Metric and Results] The novel metric is described as 'elegantly assess[ing] both key capabilities,' yet the manuscript supplies no validation, comparison against standard VQA metrics, or error analysis. The 'detailed analysis of results from different participants' is referenced but no quantitative performance numbers, rankings, or failure-mode breakdowns appear in the provided text, leaving claims about current model capabilities unsupported.
minor comments (1)
- [Abstract] The abstract would be strengthened by including one or two concrete quantitative highlights from the competition (e.g., top participant accuracy or metric scores) rather than remaining purely descriptive.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to the manuscript.
- Referee: [Abstract] The claim that the dataset constitutes 'an important milestone' for 'holistic image understanding' that exploits scene text is load-bearing on the unverified premise that questions 'require reading the text in a scene and understanding it in the context.' No sample QA pairs, modality-ablation baselines, or statistics are supplied to demonstrate that a non-trivial fraction of questions cannot be answered from text alone or from visual cues alone.
  Authors: We agree the abstract claim would be strengthened by supporting material. The grounding of answers in text instances follows directly from the annotation protocol described in the Dataset section. To address the concern, we will revise the abstract to moderate the language and add illustrative QA pair examples (with source images) plus a summary table of question categories in the revised version. Comprehensive modality ablations fall outside the scope of a competition report focused on participant results.
  Revision: partial
- Referee: [Dataset and Tasks] The assertion that the seven source datasets 'cover a wide range of scenarios' and that answers are 'always grounded on text instances' lacks any quantitative verification of scenario diversity or grounding (e.g., distribution of question types across sources, or checks that scene context is indispensable). This directly affects the central claim of the work.
  Authors: The seven source datasets were deliberately chosen from distinct domains to ensure scenario variety. We acknowledge that the submitted manuscript does not include explicit quantitative breakdowns. In revision we will insert a table reporting image and QA counts per source together with representative examples that illustrate the role of scene context. The 'always grounded' property is enforced by the annotation guidelines; we can add a note on spot-check verification performed during dataset curation.
  Revision: yes
- Referee: [Evaluation Metric and Results] The novel metric is described as 'elegantly assess[ing] both key capabilities,' yet the manuscript supplies no validation, comparison against standard VQA metrics, or error analysis. The 'detailed analysis of results from different participants' is referenced but no quantitative performance numbers, rankings, or failure-mode breakdowns appear in the provided text, leaving claims about current model capabilities unsupported.
  Authors: The metric employed is Average Normalized Levenshtein Similarity (ANLS), chosen because it jointly penalizes recognition and reasoning errors. We will expand the Results section with a participant ranking table, a short comparison against exact-match accuracy, and a concise error-mode summary drawn from the submitted runs. These details were condensed in the original submission for length reasons and will be restored.
  Revision: yes
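The ANLS metric named in the authors' response can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the competition's official scorer: the 0.5 threshold, the lowercasing, and the whitespace trimming are not stated in the text above, and the function names `levenshtein` and `anls` are chosen here for illustration. Each prediction is scored against its best-matching ground-truth answer, and any match whose normalized edit distance reaches the threshold scores zero.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a set of questions.

    predictions:   one predicted answer string per question.
    ground_truths: one list of accepted answer strings per question.
    threshold:     assumed cutoff; distances at or above it score 0.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        p = pred.strip().lower()  # normalization is an assumption
        best = 0.0
        for answer in answers:
            g = answer.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


# One exact answer and one unrelated answer average to 0.5.
example = anls(["coca cola", "pepsi"], [["coca cola"], ["fanta"]])
```

The soft score gives partial credit for small recognition slips (one wrong character in a nine-character answer still scores 8/9), while the threshold keeps an unrelated answer from earning credit through chance character overlap.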
Circularity Check
No circularity: dataset/competition paper with no derivations or fitted predictions
The paper introduces a new dataset and reports competition results. It contains no equations, model derivations, parameter fitting, or predictions that could reduce to inputs by construction. Central claims (dataset as milestone, tasks requiring joint text+scene understanding) are presented as assertions supported by the collection process and task design, not as outputs of any self-referential chain. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear. This matches the default non-circular case for a competition/dataset paper.