Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment

Armin Zarbaft; Ehsan Karimi; Maryam Rahnemoonfar; Nhut Le

arxiv: 2605.11439 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-13 01:32 UTC · model grok-4.3

keywords: post-disaster VQA · multimodal large language models · chain-of-thought prompting · in-context learning · FloodNet dataset · damage assessment · prompt engineering

The pith

Instruction-driven chain-of-thought guidance raises answer accuracy for pretrained multimodal models on post-disaster visual questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether structured prompting can make existing multimodal large language models more reliable for answering questions about damaged areas in disaster images. It has one model create task-specific instructions that act as step-by-step reasoning guidance for a second model, then adds varying amounts of example-based learning during answer generation. These combinations are compared to a plain zero-shot baseline on the FloodNet dataset for flood damage assessment. The approach matters because training new models for each disaster is too slow for real-time response, while prompt-sensitive models can give inconsistent results in high-stakes settings. If the gains hold, the method offers a faster way to adapt general-purpose models without retraining.

Core claim

Using one MLLM to generate instruction-driven CoT reasoning that guides a second MLLM, incorporated with varying degrees of in-context learning, consistently improves answer accuracy over zero-shot baselines on post-disaster VQA tasks, as shown on the FloodNet dataset.

What carries the argument

Instruction-driven Chain-of-Thought (CoT) reasoning generated by one MLLM to steer the answer generation of a second MLLM, combined with in-context learning examples of varying strength.
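
The two-model flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MLLM` callable type, the `Example` structure, the prompt wording, and the function names are all hypothetical. The source also does not say whether the instruction generator sees the image, so here it receives only the question.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical interface: any callable mapping (prompt, optional image path) to text.
MLLM = Callable[[str, Optional[str]], str]


@dataclass
class Example:
    """One in-context demonstration (hypothetical structure)."""
    question: str
    rationale: str
    answer: str


def generate_instructions(generator: MLLM, question: str) -> str:
    """Stage 1: ask the first model for step-by-step guidance for this question."""
    prompt = (
        "Write short step-by-step instructions for answering this question "
        f"about a post-disaster aerial image.\nQuestion: {question}"
    )
    return generator(prompt, None)


def build_answer_prompt(
    question: str, instructions: str, examples: Sequence[Example]
) -> str:
    """Stage 2 prompt: generated instructions followed by k in-context examples."""
    parts = [f"Instructions:\n{instructions}"]
    for ex in examples:
        parts.append(
            f"Question: {ex.question}\nReasoning: {ex.rationale}\nAnswer: {ex.answer}"
        )
    # The target question comes last, leaving the reasoning for the model to fill in.
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)


def instruct_icl(
    generator: MLLM,
    answerer: MLLM,
    image_path: str,
    question: str,
    examples: Sequence[Example] = (),
) -> str:
    """Run both stages; an empty `examples` gives the instruction-only variant."""
    instructions = generate_instructions(generator, question)
    prompt = build_answer_prompt(question, instructions, examples)
    return answerer(prompt, image_path)
```

Varying the length of `examples` is one way to express the "varying degrees of in-context learning" mentioned above; the zero-shot baseline would call `answerer` on the bare question without either stage.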

Load-bearing premise

The accuracy gains on FloodNet come from the instruction-CoT-ICL combination itself rather than dataset quirks or fine details of the prompt wording.

What would settle it

Running the same instruction-CoT-ICL setups on a different post-disaster VQA dataset, such as one covering earthquake or wildfire damage, and finding no accuracy lift over zero-shot would show the gains are not general.

Figures

Figures reproduced from arXiv: 2605.11439 by Armin Zarbaft, Ehsan Karimi, Maryam Rahnemoonfar, Nhut Le.

**Figure 2.** Comparison of instruction generation and final rationale between AIC, …

Original abstract

Rapid and accurate situational awareness is essential for effective response during natural disasters, where delays in analysis can significantly hinder decision-making. Training task-specific models for post-disaster assessment is often time-consuming and computationally expensive, making such approaches impractical in time-critical scenarios. Consequently, pretrained multimodal large language models (MLLMs) have emerged as a promising alternative for post-disaster visual question answering (VQA), a task that aims to answer structured questions about visual scenes by jointly reasoning over images and text. While these models demonstrate strong multimodal reasoning capabilities, their responses can be sensitive to prompt formulation, which can limit their reliability in real-world disaster assessment scenarios. In this paper, we investigate whether structured reasoning strategies can improve the reliability of pretrained MLLMs for post-disaster VQA. Specifically, we explore multiple prompting paradigms in which one MLLM is used to generate task-specific instructions that serve as Chain-of-Thought (CoT) guidance for a second MLLM. These instructions are incorporated during answer generation with varying degrees of in-context learning (ICL), enabling the model to leverage both explicit reasoning guidance and contextual examples. We conduct our evaluation on the FloodNet dataset and compare these approaches against a zero-shot baseline. Our results demonstrate that integrating instruction-driven CoT reasoning consistently improves answer accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gets accuracy gains on FloodNet by chaining two MLLMs with generated CoT instructions plus ICL, but the gains could easily come from prompt details rather than the claimed mechanism.

The paper reports that using one MLLM to generate CoT instructions for a second one, combined with ICL, boosts accuracy on FloodNet post-disaster VQA over zero-shot. That's the main finding. It applies established prompting methods to this specific setting, which is a reasonable next step for making MLLMs more reliable in urgent scenarios like disaster response. The focus on training-free approaches is a plus, since retraining isn't feasible in the field.

Where it falls short is the experimental design. There's no ablation to show that the generated instructions are key, or to compare against plain few-shot or standard CoT without the two-model setup. Only FloodNet is used, so we can't tell if the improvement is robust or tied to that dataset's characteristics. Details on the size of the gains or statistical significance aren't in the abstract, leaving the claim a bit thin. The approach might work for quick situational awareness in emergencies, but the results could stem from prompt engineering details rather than the proposed mechanism.

This kind of work would suit readers in computer vision applied to humanitarian aid or remote sensing. It could spark ideas for others, but probably needs more validation before becoming a go-to method. I'd say send it to peer review with requests for ablations and cross-dataset tests. The core idea is sound enough to warrant feedback from referees.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Instruct-ICL, a prompting method for post-disaster visual question answering (VQA) in which one MLLM generates task-specific instructions to serve as Chain-of-Thought (CoT) guidance for a second MLLM. These instructions are combined with varying degrees of in-context learning (ICL) and evaluated on the FloodNet dataset against a zero-shot baseline, with the central claim that the approach consistently improves answer accuracy.

Significance. If the accuracy gains prove robust and attributable to the instruction-CoT-ICL mechanism, the work offers a practical, training-free way to enhance pretrained MLLM reliability for time-critical disaster assessment using only public datasets and existing models. It addresses prompt sensitivity in multimodal reasoning without requiring task-specific fine-tuning.

major comments (2)

[Evaluation] Evaluation section: The paper reports results only on FloodNet and compares solely against zero-shot prompting. No ablation studies isolate the effect of the generated instructions from plain CoT, standard few-shot ICL, or prompt-length controls, leaving open whether observed gains are caused by the proposed combination or by incidental prompt engineering.
[Abstract] Abstract and experimental results: The claim of 'consistent improvement' in answer accuracy is presented without quantitative values, error bars, statistical tests, or multiple-run details in the provided text, which is load-bearing for assessing the reliability and magnitude of the reported gains.
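
As an illustration of the kind of uncertainty estimate the second comment asks for, a paired bootstrap over questions can put an interval around the accuracy difference between a prompting method and the zero-shot baseline. This is a generic sketch, not something the paper reports: the function names, the resample count, and the per-question correctness lists are assumptions, and no numbers from the paper are used.

```python
import random
from typing import Sequence, Tuple


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)


def paired_bootstrap_ci(
    baseline: Sequence[bool],
    method: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed difference, lower bound, upper bound) for
    accuracy(method) - accuracy(baseline), resampling the same questions
    for both systems so that the pairing is preserved."""
    if len(baseline) != len(method) or not baseline:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    diffs = []
    for _ in range(n_resamples):
        # Draw question indices with replacement; reuse them for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(method[i] for i in idx) / n - sum(baseline[i] for i in idx) / n
        diffs.append(diff)
    diffs.sort()
    # Percentile interval over the sorted resampled differences.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    observed = accuracy(method) - accuracy(baseline)
    return observed, lower, upper
```

An interval that excludes zero would give some support to a claim of improvement on the evaluated questions. It would not address run-to-run variation from sampling-based decoding, which needs repeated runs.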

minor comments (1)

[Method] The distinction between the two MLLMs (instruction generator vs. answer generator) is introduced without explicit notation or pseudocode, which could be clarified in the method description.
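
One possible notation for the two roles, offered as a sketch and not as the paper's own formalism. The symbols $G$, $A$, $E_k$, and $\mathcal{D}$ are assumptions introduced here. In particular, the text above does not state whether the instruction generator also sees the image, so $G$ is written as a function of the question only.

```latex
% Hypothetical notation; none of these symbols are taken from the paper.
% x: image, q: question, y: reference answer, D: evaluation set
% G: instruction-generating MLLM, A: answer-generating MLLM
% E_k: a set of k in-context examples (k = 0 means no examples)
\begin{align}
  I &= G(q) && \text{(generated instructions, used as CoT guidance)} \\
  \hat{y} &= A(x, q, I, E_k) && \text{(answer with instructions and $k$ examples)} \\
  \hat{y}_{0} &= A(x, q) && \text{(zero-shot baseline)} \\
  \mathrm{Acc} &= \frac{1}{|\mathcal{D}|} \sum_{(x, q, y) \in \mathcal{D}} \mathbf{1}\left[\hat{y} = y\right]
\end{align}
```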

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the evaluation and result presentation that we will address in revision to better substantiate the contributions of Instruct-ICL.

Referee: [Evaluation] Evaluation section: The paper reports results only on FloodNet and compares solely against zero-shot prompting. No ablation studies isolate the effect of the generated instructions from plain CoT, standard few-shot ICL, or prompt-length controls, leaving open whether observed gains are caused by the proposed combination or by incidental prompt engineering.

Authors: We acknowledge that the current experiments are limited to FloodNet with a zero-shot baseline comparison. FloodNet serves as a standard public benchmark for post-disaster VQA, aligning with the practical, training-free focus of the work. However, we agree that isolating the contributions of the generated instructions, CoT guidance, and ICL is necessary to rule out prompt engineering effects. In the revised manuscript, we will add ablation studies comparing the full Instruct-ICL method against plain CoT (without generated instructions), standard few-shot ICL (without instruction generation), and prompt-length controls. We will also expand the discussion of dataset choice and note limitations regarding generalization.
revision: yes

Referee: [Abstract] Abstract and experimental results: The claim of 'consistent improvement' in answer accuracy is presented without quantitative values, error bars, statistical tests, or multiple-run details in the provided text, which is load-bearing for assessing the reliability and magnitude of the reported gains.

Authors: We agree that the abstract should include specific quantitative support for the 'consistent improvement' claim to enable proper assessment of the gains. The body of the paper contains the detailed accuracy results, but the abstract does not. In revision, we will update the abstract to report the key accuracy improvements with numerical values from the FloodNet experiments. Where multiple runs were performed, we will include variability information; otherwise, we will note the single-run nature and avoid unsubstantiated claims about statistical significance.
revision: yes
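
The ablations promised in the first response could be organized as a small grid of prompting conditions. The sketch below is one hypothetical layout: the guidance labels, the `length_matched_filler` control, and the shot counts are illustrative choices, not settings taken from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class Condition:
    """One prompting configuration in the ablation (hypothetical labels)."""
    guidance: str  # what, if anything, precedes the question in the prompt
    shots: int     # number of in-context examples

    @property
    def name(self) -> str:
        return f"{self.guidance}/{self.shots}-shot"


# Assumed guidance modes, mirroring the comparisons named in the rebuttal.
GUIDANCE_MODES = (
    "none",                    # with 0 shots this is the zero-shot baseline
    "plain_cot",               # generic "think step by step" cue, no generator model
    "length_matched_filler",   # neutral text of similar length (prompt-length control)
    "generated_instructions",  # instructions produced by the first MLLM
)


def ablation_grid(shot_counts: Sequence[int] = (0, 2, 4)) -> List[Condition]:
    """Cross every guidance mode with every shot count."""
    return [Condition(g, k) for g, k in product(GUIDANCE_MODES, shot_counts)]
```

Running every condition on the same question set would let the effect of the generated instructions be read off against `plain_cot` and `length_matched_filler` at each fixed shot count, which is the isolation the referee asks for.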

Circularity Check

0 steps flagged

No significant circularity; empirical results on external dataset

The paper reports experimental accuracy gains from instruction-guided CoT + ICL prompting on the public FloodNet VQA dataset, compared only to a zero-shot baseline. No equations, parameter fits, or derivations are present. The central claim is an observed empirical outcome rather than a result that reduces by construction to the authors' own inputs or self-citations. No load-bearing self-referential steps exist.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions about the capabilities of pretrained multimodal models and the representativeness of the chosen benchmark, with no new free parameters, axioms beyond domain norms, or invented entities.

axioms (2)

domain assumption Pretrained MLLMs can follow and benefit from structured reasoning instructions generated by another MLLM.
Central premise of the instruction-generation and CoT guidance approach.
domain assumption The FloodNet dataset provides a valid testbed for post-disaster damage assessment VQA.
Used as the sole evaluation benchmark in the reported experiments.
