DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes
Pith reviewed 2026-05-21 15:35 UTC · model grok-4.3
The pith
DisasterVQA introduces a benchmark of real disaster images and expert-curated questions that exposes limitations in current vision-language models for crisis reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning floods, wildfires, earthquakes and other events. Grounded in FEMA ESF and OCHA MIRA frameworks, the questions test binary, multiple-choice and open-ended reasoning about situational awareness and operational decision-making. Benchmarking of seven state-of-the-art vision-language models shows high accuracy on binary questions yet consistent struggles with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, especially for underrepresented disaster scenarios and regions.
What carries the argument
The DisasterVQA dataset of real disaster photographs paired with questions derived from humanitarian operational frameworks to measure perception and reasoning performance.
If this is right
- Models must improve on fine-grained quantitative and context-aware tasks to become useful for disaster response.
- Performance varies by disaster type and region, indicating that training data should be more balanced across events.
- The benchmark supplies a concrete way to measure progress toward operationally relevant vision-language systems.
- Adoption of the dataset could steer research toward tools that better support damage assessment from social media during crises.
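The per-category comparisons described above (by question type, disaster type, or region) come down to grouping model predictions and computing accuracy within each group. The following is a minimal sketch of that bookkeeping, not the paper's evaluation code. The record fields (`question_type`, `disaster`, `answer`, `prediction`), the toy records, and the exact-match scoring rule are all assumptions made for illustration; open-ended answers would likely need a softer metric.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group value: exact-match accuracy} for a list of records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        # Exact match after trimming whitespace and lowercasing (assumed rule).
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}


# Toy records invented for illustration; they are NOT taken from DisasterVQA.
records = [
    {"question_type": "binary", "disaster": "flood",
     "answer": "yes", "prediction": "Yes"},
    {"question_type": "binary", "disaster": "wildfire",
     "answer": "no", "prediction": "no"},
    {"question_type": "multiple-choice", "disaster": "flood",
     "answer": "B", "prediction": "C"},
    {"question_type": "multiple-choice", "disaster": "earthquake",
     "answer": "A", "prediction": "a"},
]

by_type = accuracy_by_group(records, "question_type")
by_disaster = accuracy_by_group(records, "disaster")
```

Passing a different `key` (for example a hypothetical `region` field) gives the same breakdown along another axis, which is how gaps for underrepresented categories would show up.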
Where Pith is reading between the lines
- Teams building real-time disaster monitoring systems could incorporate similar question sets to validate model outputs before deployment.
- Extending the dataset to include sequences of images over time might reveal whether models can track how situations evolve during an event.
- Comparable benchmarks in other high-stakes visual domains could expose whether the same reasoning weaknesses appear outside disasters.
Load-bearing premise
Expert-curated questions and answers grounded in standard humanitarian frameworks accurately reflect the perception and decision-making needs that arise in actual disaster response operations.
What would settle it
A demonstration that state-of-the-art models achieve uniformly high accuracy on quantitative, counting, and context-sensitive questions across all disaster categories without any disaster-specific fine-tuning would undermine the reported performance gaps.
Original abstract
Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://doi.org/10.5281/zenodo.18267769.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DisasterVQA, a VQA benchmark dataset consisting of 1,395 real-world disaster images and 4,405 expert-curated QA pairs grounded in FEMA ESF and OCHA MIRA frameworks. The dataset spans binary, multiple-choice, and open-ended questions on situational awareness and operational decision-making across floods, wildfires, earthquakes, and other events. Benchmarking of seven state-of-the-art vision-language models reveals performance variability across question types, disaster categories, regions, and tasks, with high accuracy on binary questions but notable struggles on quantitative reasoning, object counting, and context-sensitive interpretation in underrepresented scenarios. The work positions DisasterVQA as a challenging and practical benchmark to guide more robust VLMs for disaster response and releases the dataset publicly.
Significance. If the curation process accurately reflects real operational needs, this provides a timely, publicly available benchmark that exposes concrete limitations of current VLMs in safety-critical domains. The reported performance gaps across categories and the grounding in established humanitarian frameworks offer actionable guidance for future model development. The public release is a clear strength that supports reproducibility and community follow-up work.
major comments (1)
- [§3] §3 (Dataset Construction): The manuscript provides no inter-annotator agreement metrics, details on the expert curation process (e.g., number of annotators, resolution of disagreements), or validation such as practitioner feedback or explicit mapping showing that correct answers to these questions would improve real disaster response outcomes. This is load-bearing for the abstract's claim that the dataset supports 'operationally meaningful' models and 'situational awareness and operational decision-making tasks,' as the practical utility rests on an untested translation from framework categories to VQA utility.
minor comments (2)
- [Abstract] Abstract: The seven benchmarked models are not named; listing them (or at least the model families) would improve immediate clarity.
- [Figures] Figure captions and tables: Some example QA pairs in figures could include explicit annotations of the error types (e.g., counting failures) discussed in the results section to better illustrate the reported challenges.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the manuscript. We address the major comment below and indicate the revisions made for the next version of the paper.
Point-by-point responses
Referee: [§3] §3 (Dataset Construction): The manuscript provides no inter-annotator agreement metrics, details on the expert curation process (e.g., number of annotators, resolution of disagreements), or validation such as practitioner feedback or explicit mapping showing that correct answers to these questions would improve real disaster response outcomes. This is load-bearing for the abstract's claim that the dataset supports 'operationally meaningful' models and 'situational awareness and operational decision-making tasks,' as the practical utility rests on an untested translation from framework categories to VQA utility.
Authors: We agree that the original manuscript omitted several details on the curation process. In the revised version we have expanded §3 to report inter-annotator agreement metrics, the number of annotators involved, their relevant expertise, and the procedure used to resolve disagreements. We have also inserted an explicit mapping that links each question category to specific elements of the FEMA ESF and OCHA MIRA frameworks, together with references to how the information supports situational awareness and operational tasks. We did not collect new practitioner feedback or conduct empirical studies measuring downstream improvements in disaster response; such validation lies outside the scope of the present benchmark paper. We have added a limitations paragraph acknowledging this gap and identifying it as an important direction for follow-up work.
Revision: partial
- Empirical validation via practitioner feedback or field studies demonstrating that correct answers to these VQA questions produce measurable improvements in real disaster response outcomes.
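The exchange above refers to inter-annotator agreement metrics without saying which metric is reported. One common choice for two annotators assigning categorical labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration under that assumption only; the function name and the toy labels are hypothetical, and nothing here is taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels invented for illustration; they are NOT from the dataset.
annotator_a = ["yes", "yes", "no", "no"]
annotator_b = ["yes", "no", "no", "no"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

With more than two annotators, or with open-ended answers, a different statistic would be needed; this sketch covers only the two-annotator categorical case.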
Circularity Check
No significant circularity; dataset creation and benchmarking are self-contained
Full rationale
The paper performs no derivations, equations, parameter fitting, or predictions that could reduce to inputs by construction. It consists of direct dataset curation (1,395 images, 4,405 QA pairs grounded in external FEMA/OCHA frameworks) followed by empirical benchmarking of existing models. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is self-contained against external benchmarks via public release and reported performance gaps, qualifying for the default non-circular outcome.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Expert curation based on FEMA ESF and OCHA MIRA frameworks produces questions that capture relevant situational awareness and operational tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  "We introduce DisasterVQA, a benchmark dataset... Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA... We benchmark seven state-of-the-art vision-language models"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  "the questions are grounded in established humanitarian response frameworks (e.g., FEMA ESF, OCHA MIRA) and cover tasks such as identifying built environment damage, determining hazard type and severity, evaluating accessibility"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification
  LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.