arxiv: 1907.00490 · v1 · pith:XRUL6XGWnew · submitted 2019-06-30 · 💻 cs.CV

ICDAR 2019 Competition on Scene Text Visual Question Answering

Pith reviewed 2026-05-25 12:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords: scene text, visual question answering, dataset, text recognition, image understanding, VQA benchmark, ICDAR competition

The pith

A new dataset of 23k images with text-grounded questions pushes VQA models to combine scene reading and context understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the ST-VQA competition and its dataset to fill a gap in visual question answering by requiring models to use text visible in images to answer questions. It assembles 23,038 images from seven existing datasets along with 31,791 question-answer pairs where the correct answer depends on reading and interpreting that text in its visual setting. Three tasks increase in difficulty by demanding more integration between text recognition and scene understanding, scored by a metric that rewards both capabilities. Participant results illustrate current limits of systems that can read, and the authors position the dataset as a milestone for building models that achieve more complete image understanding through scene text.

Core claim

The paper establishes a benchmark dataset and competition for scene text visual question answering consisting of 23,038 images annotated with 31,791 question-answer pairs drawn from seven public computer vision datasets, where every answer is grounded on text instances present in the image; the benchmark defines three tasks of increasing difficulty that require reading text in scene context and introduces an evaluation metric that jointly measures text recognition and image understanding.

What carries the argument

The ST-VQA dataset together with its three tasks of increasing difficulty and a novel evaluation metric that jointly scores text recognition accuracy and contextual scene understanding.

Load-bearing premise

The questions genuinely require both accurate text recognition and scene understanding rather than being solvable from text strings alone or from visual cues alone.

What would settle it

A system that achieves high scores on most questions by processing only the transcribed text strings while ignoring image content, or by using only image content while ignoring the text.
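The check described above can be sketched as a small scoring harness. This is only an illustration under stated assumptions: `text_only` and `image_only` are hypothetical prediction lists from two ablated systems (one seeing only transcribed strings, one seeing only pixels), `shortcut_report` is an invented helper name, and correctness is judged by a simple normalized exact match rather than the competition's own metric.

```python
from typing import Dict, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def is_correct(prediction: str, answers: Sequence[str]) -> bool:
    """True if the prediction matches any accepted answer after normalization."""
    return normalize(prediction) in {normalize(a) for a in answers}


def shortcut_report(
    gold: Sequence[Sequence[str]],
    text_only: Sequence[str],
    image_only: Sequence[str],
) -> Dict[str, float]:
    """Fraction of questions each single-modality system (hypothetical) gets right.

    A high value for either system would suggest the questions do not
    require combining reading with scene understanding.
    """
    if not (len(gold) == len(text_only) == len(image_only)):
        raise ValueError("all inputs must have one entry per question")
    if not gold:
        raise ValueError("at least one question is required")
    n = len(gold)
    text_hits = sum(is_correct(p, g) for p, g in zip(text_only, gold))
    image_hits = sum(is_correct(p, g) for p, g in zip(image_only, gold))
    either_hits = sum(
        is_correct(t, g) or is_correct(i, g)
        for t, i, g in zip(text_only, image_only, gold)
    )
    return {
        "text_only": text_hits / n,
        "image_only": image_hits / n,
        "either": either_hits / n,
    }
```

The `either` entry is the share of questions answerable through at least one single-modality shortcut; the premise would look weak if that share approached the score of a full system.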

Figures

Figures reproduced from arXiv: 1907.00490 by Ali Furkan Biten, Andres Mafla, C.V. Jawahar, Dimosthenis Karatzas, Ernest Valveny, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, Rubèn Tito.

Figure 1: Examples of questions and ground-truth answers from the ST-VQA training set
Figure 2: A detailed breakdown of the performance of the submitted models by image source (top) and question categories
Figure 3: Accuracy scores per ANLS threshold for Task 1 (left) and Task 3 (right)
Original abstract

This paper presents final results of ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question/answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was structured in three tasks of increasing difficulty, that require reading the text in a scene and understanding it in the context of the scene, to correctly answer a given question. A novel evaluation metric is presented, which elegantly assesses both key capabilities expected from an optimal model: text recognition and image understanding. A detailed analysis of results from different participants is showcased, which provides insight into the current capabilities of VQA systems that can read. We firmly believe the dataset proposed in this challenge will be an important milestone to consider towards a path of more robust and general models that can exploit scene text to achieve holistic image understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript reports the organization and results of the ICDAR 2019 Scene Text Visual Question Answering (ST-VQA) competition. It introduces a dataset of 23,038 images with 31,791 QA pairs sourced from seven public datasets, structured into three tasks of increasing difficulty that require reading scene text in context. A novel evaluation metric is introduced to jointly assess text recognition and image understanding, participant submissions are analyzed, and the dataset is positioned as a milestone toward holistic scene-text VQA models.

Significance. If the QA pairs genuinely require joint text recognition and scene context (rather than being solvable by OCR output or visual cues alone) and the seven sources supply meaningful scenario diversity, the dataset and competition could become a useful benchmark for advancing VQA systems that integrate reading with visual understanding. The structured tasks and proposed metric provide a concrete evaluation framework; credit is due for releasing a large, multi-source dataset with answers explicitly grounded in text instances.

major comments (3)
  1. [Abstract] The claim that the dataset constitutes 'an important milestone' for 'holistic image understanding' that exploits scene text rests on the unverified premise that questions 'require reading the text in a scene and understanding it in the context.' No sample QA pairs, modality-ablation baselines, or statistics are supplied to demonstrate that a non-trivial fraction of questions cannot be answered from text alone or from visual cues alone.
  2. [Dataset and Tasks] The assertion that the seven source datasets 'cover a wide range of scenarios' and that answers are 'always grounded on text instances' lacks any quantitative verification of scenario diversity or grounding (e.g., distribution of question types across sources, or checks that scene context is indispensable). This directly affects the central claim of the work.
  3. [Evaluation Metric and Results] The novel metric is described as 'elegantly assess[ing] both key capabilities,' yet the manuscript supplies no validation, comparison against standard VQA metrics, or error analysis. The 'detailed analysis of results from different participants' is referenced, but no quantitative performance numbers, rankings, or failure-mode breakdowns appear in the provided text, leaving claims about current model capabilities unsupported.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including one or two concrete quantitative highlights from the competition (e.g., top participant accuracy or metric scores) rather than remaining purely descriptive.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where revisions will be made to the manuscript.

  1. Referee: [Abstract] The claim that the dataset constitutes 'an important milestone' for 'holistic image understanding' that exploits scene text rests on the unverified premise that questions 'require reading the text in a scene and understanding it in the context.' No sample QA pairs, modality-ablation baselines, or statistics are supplied to demonstrate that a non-trivial fraction of questions cannot be answered from text alone or from visual cues alone.

    Authors: We agree the abstract claim would be strengthened by supporting material. The grounding of answers in text instances follows directly from the annotation protocol described in the Dataset section. To address the concern, we will revise the abstract to moderate the language and add illustrative QA pair examples (with source images) plus a summary table of question categories in the revised version. Comprehensive modality ablations fall outside the scope of a competition report focused on participant results. revision: partial

  2. Referee: [Dataset and Tasks] The assertion that the seven source datasets 'cover a wide range of scenarios' and that answers are 'always grounded on text instances' lacks any quantitative verification of scenario diversity or grounding (e.g., distribution of question types across sources, or checks that scene context is indispensable). This directly affects the central claim of the work.

    Authors: The seven source datasets were deliberately chosen from distinct domains to ensure scenario variety. We acknowledge that the submitted manuscript does not include explicit quantitative breakdowns. In revision we will insert a table reporting image and QA counts per source together with representative examples that illustrate the role of scene context. The 'always grounded' property is enforced by the annotation guidelines; we can add a note on spot-check verification performed during dataset curation. revision: yes

  3. Referee: [Evaluation Metric and Results] The novel metric is described as 'elegantly assess[ing] both key capabilities,' yet the manuscript supplies no validation, comparison against standard VQA metrics, or error analysis. The 'detailed analysis of results from different participants' is referenced, but no quantitative performance numbers, rankings, or failure-mode breakdowns appear in the provided text, leaving claims about current model capabilities unsupported.

    Authors: The metric employed is Average Normalized Levenshtein Similarity (ANLS), chosen because it jointly penalizes recognition and reasoning errors. We will expand the Results section with a participant ranking table, a short comparison against exact-match accuracy, and a concise error-mode summary drawn from the submitted runs. These details were condensed in the original submission for length reasons and will be restored. revision: yes
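For readers unfamiliar with the metric named in the last response, the following is a minimal sketch of how an Average Normalized Levenshtein Similarity score can be computed. The text above does not spell out the formula, so several details here are assumptions: the edit distance is normalized by the longer string's length, a per-question score is set to zero once the normalized distance reaches a threshold `tau` (0.5 is used as an assumed default), comparison is case-insensitive, and the best match over all accepted answers is taken. The function names are illustrative and not taken from any official evaluation script.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def question_score(prediction: str, answers: Sequence[str], tau: float = 0.5) -> float:
    """Best thresholded similarity between a prediction and any accepted answer.

    Assumption: similarity is 1 - NL when the normalized distance NL is
    below tau, and 0 otherwise.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for answer in answers:
        gold = answer.strip().lower()
        longest = max(len(pred), len(gold))
        nl = levenshtein(pred, gold) / longest if longest else 0.0
        if nl < tau:
            best = max(best, 1.0 - nl)
    return best


def anls(predictions: Sequence[str],
         gold_answers: Sequence[Sequence[str]],
         tau: float = 0.5) -> float:
    """Mean per-question score over the whole evaluation set."""
    if len(predictions) != len(gold_answers):
        raise ValueError("one prediction is required per question")
    if not predictions:
        return 0.0
    total = sum(question_score(p, g, tau) for p, g in zip(predictions, gold_answers))
    return total / len(predictions)
```

Under these assumptions, a prediction that is off by one or two characters (a plausible recognition slip) still earns partial credit, while an answer that points at the wrong text instance usually falls past the threshold and scores zero. That is one way to read the claim that the metric separates reading errors from reasoning errors.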

Circularity Check

0 steps flagged

No circularity: dataset/competition paper with no derivations or fitted predictions

The paper introduces a new dataset and reports competition results. It contains no equations, model derivations, parameter fitting, or predictions that could reduce to inputs by construction. Central claims (dataset as milestone, tasks requiring joint text+scene understanding) are presented as assertions supported by the collection process and task design, not as outputs of any self-referential chain. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear. This matches the default non-circular case for a competition/dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical competition report that introduces a dataset and tasks rather than a mathematical derivation; no free parameters, axioms, or invented entities are required or stated.
