DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes

Aisha Al-Mohannadi; Ayisha Firoz; Ferda Ofli; Muhammad Imran; Yin Yang

arxiv: 2601.13839 · v2 · pith:43OSBEDQnew · submitted 2026-01-20 · 💻 cs.CV

Pith reviewed 2026-05-21 15:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual question answering · disaster response · benchmark dataset · vision-language models · crisis situations · situational awareness · humanitarian frameworks · image reasoning

The pith

DisasterVQA introduces a benchmark of real disaster images and expert-curated questions that exposes limitations in current vision-language models for crisis reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates DisasterVQA as a new resource to test visual question answering specifically in disaster contexts where rapid situational awareness matters. It assembles 1,395 actual images from floods, wildfires, earthquakes and similar events together with 4,405 questions and answers written by experts. These questions follow established humanitarian guidelines and cover simple yes-no checks as well as harder counting, multiple-choice, and open-ended tasks about damage and response needs. Testing seven leading models reveals strong results on basic binary questions but clear drops in accuracy for quantitative details and context that varies by disaster type or region. If the benchmark holds, it supplies a practical yardstick for building AI tools that can actually support field decisions during emergencies.

Core claim

DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning floods, wildfires, earthquakes and other events. Grounded in FEMA ESF and OCHA MIRA frameworks, the questions test binary, multiple-choice and open-ended reasoning about situational awareness and operational decision-making. Benchmarking of seven state-of-the-art vision-language models shows high accuracy on binary questions yet consistent struggles with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, especially for underrepresented disaster scenarios and regions.
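The breakdown by question type described above can be illustrated with a minimal sketch. It assumes predictions are scored by case-insensitive exact match, which may not be the scoring the paper uses (open-ended answers in particular often need a softer metric). The record fields (`question_type`, `answer`, `prediction`) and the sample values are hypothetical and are not taken from the released dataset.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} using case-insensitive exact match."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        # Normalise case and surrounding whitespace before comparing.
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records; field names and values are illustrative only.
records = [
    {"question_type": "binary", "answer": "yes", "prediction": "Yes"},
    {"question_type": "binary", "answer": "no", "prediction": "no"},
    {"question_type": "counting", "answer": "3", "prediction": "5"},
    {"question_type": "counting", "answer": "2", "prediction": "2"},
]

scores = accuracy_by_group(records, "question_type")
print(scores)
```

Passing a different key, such as a disaster-category or region field if the records carry one, gives the other breakdowns the paper reports without changing the function.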

What carries the argument

The DisasterVQA dataset of real disaster photographs paired with questions derived from humanitarian operational frameworks to measure perception and reasoning performance.

If this is right

Models must improve on fine-grained quantitative and context-aware tasks to become useful for disaster response.
Performance varies by disaster type and region, indicating that training data should be more balanced across events.
The benchmark supplies a concrete way to measure progress toward operationally relevant vision-language systems.
Adoption of the dataset could steer research toward tools that better support damage assessment from social media during crises.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Teams building real-time disaster monitoring systems could incorporate similar question sets to validate model outputs before deployment.
Extending the dataset to include sequences of images over time might reveal whether models can track how situations evolve during an event.
Comparable benchmarks in other high-stakes visual domains could expose whether the same reasoning weaknesses appear outside disasters.

Load-bearing premise

Expert-curated questions and answers grounded in standard humanitarian frameworks accurately reflect the perception and decision-making needs that arise in actual disaster response operations.

What would settle it

A demonstration that state-of-the-art models achieve uniformly high accuracy on quantitative, counting, and context-sensitive questions across all disaster categories without any disaster-specific fine-tuning would undermine the reported performance gaps.

Original abstract

Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://doi.org/10.5281/zenodo.18267769.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DisasterVQA adds a focused new benchmark for disaster-scene VQA with public data and clear model gaps, but the operational relevance claim lacks direct validation.

The paper's main contribution is a new dataset for visual question answering in disaster situations. It contains 1,395 real images from events such as floods, wildfires, and earthquakes, along with 4,405 question-answer pairs curated by experts using standard humanitarian response categories from FEMA and OCHA. The authors benchmark seven models and report that performance drops on quantitative reasoning, counting, and underrepresented scenarios while holding up better on binary questions. The public release on Zenodo is straightforward and lets others test their own systems on this material. That part is concrete and fills a gap left by general VQA sets. The curation draws on established frameworks, which gives the questions a logical structure around situational awareness and response tasks.

The soft spot is the missing link between those questions and actual field use. No inter-annotator agreement figures appear, and there is no practitioner review or outcome mapping showing that correct answers to these items would change real decisions. The claim that the benchmark is operationally meaningful therefore rests on the framework choice alone rather than on tested utility.

Readers working on applied vision-language models for emergency or humanitarian settings would find the data and baseline numbers worth examining. The experiments are simple enough to reproduce once the images and pairs are downloaded. The work shows clear thinking in identifying performance shortfalls without overclaiming fixes. It deserves a serious referee to check the curation details and ask for any additional validation steps that could be added. I would send it through peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces DisasterVQA, a VQA benchmark dataset consisting of 1,395 real-world disaster images and 4,405 expert-curated QA pairs grounded in FEMA ESF and OCHA MIRA frameworks. The dataset spans binary, multiple-choice, and open-ended questions on situational awareness and operational decision-making across floods, wildfires, earthquakes, and other events. Benchmarking of seven state-of-the-art vision-language models reveals performance variability across question types, disaster categories, regions, and tasks, with high accuracy on binary questions but notable struggles on quantitative reasoning, object counting, and context-sensitive interpretation in underrepresented scenarios. The work positions DisasterVQA as a challenging and practical benchmark to guide more robust VLMs for disaster response and releases the dataset publicly.

Significance. If the curation process accurately reflects real operational needs, this provides a timely, publicly available benchmark that exposes concrete limitations of current VLMs in safety-critical domains. The reported performance gaps across categories and the grounding in established humanitarian frameworks offer actionable guidance for future model development. The public release is a clear strength that supports reproducibility and community follow-up work.

major comments (1)

[§3] §3 (Dataset Construction): The manuscript provides no inter-annotator agreement metrics, details on the expert curation process (e.g., number of annotators, resolution of disagreements), or validation such as practitioner feedback or explicit mapping showing that correct answers to these questions would improve real disaster response outcomes. This is load-bearing for the abstract's claim that the dataset supports 'operationally meaningful' models and 'situational awareness and operational decision-making tasks,' as the practical utility rests on an untested translation from framework categories to VQA utility.
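As an illustration of the kind of agreement figure this comment asks for, the sketch below computes Cohen's kappa for two annotators labelling the same items. The choice of Cohen's kappa is an assumption: the manuscript does not say which metric, or how many annotators, would apply, and more than two annotators would call for a different statistic such as Fleiss' kappa. The label lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical binary answers from two annotators on four questions.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on three of four items (0.75 observed) against 0.5 expected by chance, so kappa is 0.5. Reporting such a figure per question type would show whether counting and open-ended items are also the ones annotators disagree on.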

minor comments (2)

[Abstract] Abstract: The seven benchmarked models are not named; listing them (or at least the model families) would improve immediate clarity.
[Figures] Figure captions and tables: Some example QA pairs in figures could include explicit annotations of the error types (e.g., counting failures) discussed in the results section to better illustrate the reported challenges.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive feedback on the manuscript. We address the major comment below and indicate the revisions made in the next version of the paper.

Referee: [§3] §3 (Dataset Construction): The manuscript provides no inter-annotator agreement metrics, details on the expert curation process (e.g., number of annotators, resolution of disagreements), or validation such as practitioner feedback or explicit mapping showing that correct answers to these questions would improve real disaster response outcomes. This is load-bearing for the abstract's claim that the dataset supports 'operationally meaningful' models and 'situational awareness and operational decision-making tasks,' as the practical utility rests on an untested translation from framework categories to VQA utility.

Authors: We agree that the original manuscript omitted several details on the curation process. In the revised version we have expanded §3 to report inter-annotator agreement metrics, the number of annotators involved, their relevant expertise, and the procedure used to resolve disagreements. We have also inserted an explicit mapping that links each question category to specific elements of the FEMA ESF and OCHA MIRA frameworks, together with references to how the information supports situational awareness and operational tasks. We did not collect new practitioner feedback or conduct empirical studies measuring downstream improvements in disaster response; such validation lies outside the scope of the present benchmark paper. We have added a limitations paragraph acknowledging this gap and identifying it as an important direction for follow-up work.

revision: partial

standing simulated objections not resolved

Empirical validation via practitioner feedback or field studies demonstrating that correct answers to these VQA questions produce measurable improvements in real disaster response outcomes.

Circularity Check

0 steps flagged

No significant circularity; dataset creation and benchmarking are self-contained

The paper performs no derivations, equations, parameter fitting, or predictions that could reduce to inputs by construction. It consists of direct dataset curation (1,395 images, 4,405 QA pairs grounded in external FEMA/OCHA frameworks) followed by empirical benchmarking of existing models. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is self-contained against external benchmarks via public release and reported performance gaps, qualifying for the default non-circular outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that expert curation aligned with standard humanitarian frameworks produces representative and high-quality questions for disaster VQA; no free parameters or new entities are introduced.

axioms (1)

domain assumption: Expert curation based on FEMA ESF and OCHA MIRA frameworks produces questions that capture relevant situational awareness and operational tasks.
Invoked in the dataset design and question coverage description.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We introduce DisasterVQA, a benchmark dataset... Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA... We benchmark seven state-of-the-art vision-language models"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the questions are grounded in established humanitarian response frameworks (e.g., FEMA ESF, OCHA MIRA) and cover tasks such as identifying built environment damage, determining hazard type and severity, evaluating accessibility"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification
cs.AI · 2026-05 · conditional · novelty 7.0

LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.