
arxiv: 2601.13839 · v2 · pith:43OSBEDQnew · submitted 2026-01-20 · 💻 cs.CV

DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes

Pith reviewed 2026-05-21 15:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual question answering · disaster response · benchmark dataset · vision-language models · crisis situations · situational awareness · humanitarian frameworks · image reasoning

The pith

DisasterVQA introduces a benchmark of real disaster images and expert-curated questions that exposes limitations in current vision-language models for crisis reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates DisasterVQA as a new resource to test visual question answering specifically in disaster contexts where rapid situational awareness matters. It assembles 1,395 actual images from floods, wildfires, earthquakes and similar events together with 4,405 questions and answers written by experts. These questions follow established humanitarian guidelines and cover simple yes-no checks as well as harder counting, multiple-choice, and open-ended tasks about damage and response needs. Testing seven leading models reveals strong results on basic binary questions but clear drops in accuracy for quantitative details and context that varies by disaster type or region. If the benchmark holds, it supplies a practical yardstick for building AI tools that can actually support field decisions during emergencies.

Core claim

DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning floods, wildfires, earthquakes and other events. Grounded in FEMA ESF and OCHA MIRA frameworks, the questions test binary, multiple-choice and open-ended reasoning about situational awareness and operational decision-making. Benchmarking of seven state-of-the-art vision-language models shows high accuracy on binary questions yet consistent struggles with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, especially for underrepresented disaster scenarios and regions.
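A per-question-type breakdown of the kind described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the record fields (`question_type`, `answer`, `prediction`), the normalized exact-match scoring rule, and the toy records are all assumptions made for the example. Open-ended answers would likely need a more lenient scorer than exact match.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase, trim whitespace, and drop a trailing period before comparing."""
    return answer.strip().lower().rstrip(".")


def accuracy_by_type(records):
    """Return {question_type: accuracy} using normalized exact match.

    Each record is a dict with hypothetical keys:
    'question_type', 'answer' (gold), and 'prediction' (model output).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        qtype = record["question_type"]
        total[qtype] += 1
        if normalize(record["prediction"]) == normalize(record["answer"]):
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Toy records for illustration only; they are not drawn from DisasterVQA.
records = [
    {"question_type": "binary", "answer": "yes", "prediction": "Yes."},
    {"question_type": "binary", "answer": "no", "prediction": "no"},
    {"question_type": "multiple-choice", "answer": "B", "prediction": "C"},
    {"question_type": "multiple-choice", "answer": "A", "prediction": "a"},
]

print(accuracy_by_type(records))
```

Reporting one number per question type, rather than a single pooled accuracy, is what makes a gap between binary and harder questions visible at all.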

What carries the argument

The DisasterVQA dataset of real disaster photographs paired with questions derived from humanitarian operational frameworks to measure perception and reasoning performance.

If this is right

  • Models must improve on fine-grained quantitative and context-aware tasks to become useful for disaster response.
  • Performance varies by disaster type and region, indicating that training data should be more balanced across events.
  • The benchmark supplies a concrete way to measure progress toward operationally relevant vision-language systems.
  • Adoption of the dataset could steer research toward tools that better support damage assessment from social media during crises.
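The point about imbalance across disaster types can be made concrete by comparing micro-averaged accuracy (pooled over all questions) with macro-averaged accuracy (mean of per-category accuracies). The sketch below uses invented counts purely for illustration; the category sizes and scores are assumptions and do not come from the paper.

```python
def micro_macro_accuracy(per_group):
    """Compute (micro, macro) accuracy.

    per_group maps a group name to a (num_correct, num_total) tuple.
    Micro pools all items; macro averages the per-group accuracies.
    """
    total_correct = sum(correct for correct, _ in per_group.values())
    total_items = sum(total for _, total in per_group.values())
    micro = total_correct / total_items
    macro = sum(correct / total for correct, total in per_group.values()) / len(per_group)
    return micro, macro


# Hypothetical counts: one large, well-handled category and two small ones.
per_group = {
    "flood": (90, 100),
    "wildfire": (8, 10),
    "earthquake": (2, 10),
}

micro, macro = micro_macro_accuracy(per_group)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

With these made-up numbers the pooled score is dominated by the largest category, so the macro average is noticeably lower than the micro average. That difference is the signal that a model is weak on underrepresented events.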

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams building real-time disaster monitoring systems could incorporate similar question sets to validate model outputs before deployment.
  • Extending the dataset to include sequences of images over time might reveal whether models can track how situations evolve during an event.
  • Comparable benchmarks in other high-stakes visual domains could expose whether the same reasoning weaknesses appear outside disasters.

Load-bearing premise

Expert-curated questions and answers grounded in standard humanitarian frameworks accurately reflect the perception and decision-making needs that arise in actual disaster response operations.

What would settle it

A demonstration that state-of-the-art models achieve uniformly high accuracy on quantitative, counting, and context-sensitive questions across all disaster categories without any disaster-specific fine-tuning would undermine the reported performance gaps.

read the original abstract

Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://doi.org/10.5281/zenodo.18267769.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DisasterVQA, a VQA benchmark dataset consisting of 1,395 real-world disaster images and 4,405 expert-curated QA pairs grounded in FEMA ESF and OCHA MIRA frameworks. The dataset spans binary, multiple-choice, and open-ended questions on situational awareness and operational decision-making across floods, wildfires, earthquakes, and other events. Benchmarking of seven state-of-the-art vision-language models reveals performance variability across question types, disaster categories, regions, and tasks, with high accuracy on binary questions but notable struggles on quantitative reasoning, object counting, and context-sensitive interpretation in underrepresented scenarios. The work positions DisasterVQA as a challenging and practical benchmark to guide more robust VLMs for disaster response and releases the dataset publicly.

Significance. If the curation process accurately reflects real operational needs, this provides a timely, publicly available benchmark that exposes concrete limitations of current VLMs in safety-critical domains. The reported performance gaps across categories and the grounding in established humanitarian frameworks offer actionable guidance for future model development. The public release is a clear strength that supports reproducibility and community follow-up work.

major comments (1)
  1. [§3] §3 (Dataset Construction): The manuscript provides no inter-annotator agreement metrics, details on the expert curation process (e.g., number of annotators, resolution of disagreements), or validation such as practitioner feedback or explicit mapping showing that correct answers to these questions would improve real disaster response outcomes. This is load-bearing for the abstract's claim that the dataset supports 'operationally meaningful' models and 'situational awareness and operational decision-making tasks,' as the practical utility rests on an untested translation from framework categories to VQA utility.
minor comments (2)
  1. [Abstract] Abstract: The seven benchmarked models are not named; listing them (or at least the model families) would improve immediate clarity.
  2. [Figures] Figure captions and tables: Some example QA pairs in figures could include explicit annotations of the error types (e.g., counting failures) discussed in the results section to better illustrate the reported challenges.
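One common way to report the inter-annotator agreement requested in the major comment is Cohen's kappa for two annotators. The sketch below is a plain-Python illustration under that assumption. The paper does not say which agreement metric, if any, was used, and the labels here are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented answers from two hypothetical annotators to six binary questions.
annotator_a = ["yes", "yes", "no", "no", "yes", "no"]
annotator_b = ["yes", "no", "no", "no", "yes", "yes"]

print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

Here the annotators agree on four of six items, but half of that agreement is expected by chance, so kappa comes out well below the raw agreement rate. With more than two annotators, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual choice.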

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive feedback on the manuscript. We address the major comment below and indicate where the next version of the paper has been revised.

read point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The manuscript provides no inter-annotator agreement metrics, details on the expert curation process (e.g., number of annotators, resolution of disagreements), or validation such as practitioner feedback or explicit mapping showing that correct answers to these questions would improve real disaster response outcomes. This is load-bearing for the abstract's claim that the dataset supports 'operationally meaningful' models and 'situational awareness and operational decision-making tasks,' as the practical utility rests on an untested translation from framework categories to VQA utility.

    Authors: We agree that the original manuscript omitted several details on the curation process. In the revised version we have expanded §3 to report inter-annotator agreement metrics, the number of annotators involved, their relevant expertise, and the procedure used to resolve disagreements. We have also inserted an explicit mapping that links each question category to specific elements of the FEMA ESF and OCHA MIRA frameworks, together with references to how the information supports situational awareness and operational tasks. We did not collect new practitioner feedback or conduct empirical studies measuring downstream improvements in disaster response; such validation lies outside the scope of the present benchmark paper. We have added a limitations paragraph acknowledging this gap and identifying it as an important direction for follow-up work.

    revision: partial

standing simulated objections not resolved
  • Empirical validation via practitioner feedback or field studies demonstrating that correct answers to these VQA questions produce measurable improvements in real disaster response outcomes.

Circularity Check

0 steps flagged

No significant circularity; dataset creation and benchmarking are self-contained

full rationale

The paper performs no derivations, equations, parameter fitting, or predictions that could reduce to inputs by construction. It consists of direct dataset curation (1,395 images, 4,405 QA pairs grounded in external FEMA/OCHA frameworks) followed by empirical benchmarking of existing models. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The work is self-contained against external benchmarks via public release and reported performance gaps, qualifying for the default non-circular outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that expert curation aligned with standard humanitarian frameworks produces representative and high-quality questions for disaster VQA; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Expert curation based on FEMA ESF and OCHA MIRA frameworks produces questions that capture relevant situational awareness and operational tasks.
    Invoked in the dataset design and question coverage description.

pith-pipeline@v0.9.0 · 5775 in / 1154 out tokens · 61060 ms · 2026-05-21T15:35:51.571176+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification

    cs.AI 2026-05 conditional novelty 7.0

    LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.