Instruct-ICL: Instruction-Guided In-Context Learning for Post-Disaster Damage Assessment
Pith reviewed 2026-05-13 01:32 UTC · model grok-4.3
The pith
Instruction-driven chain-of-thought guidance raises answer accuracy for pretrained multimodal models on post-disaster visual questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using one MLLM to generate instruction-driven CoT reasoning that guides a second MLLM, incorporated with varying degrees of in-context learning, consistently improves answer accuracy over zero-shot baselines on post-disaster VQA tasks, as shown on the FloodNet dataset.
What carries the argument
Instruction-driven Chain-of-Thought (CoT) reasoning generated by one MLLM to steer the answer generation of a second MLLM, combined with in-context learning examples of varying strength.
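The two-model arrangement described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `MLLM` callable type, the function names, the prompt wording, and the choice to show the image to the instruction generator are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a model is any callable taking (prompt, image_path)
# and returning text. Real MLLM client APIs will differ.
MLLM = Callable[[str, str], str]


def build_answer_prompt(
    question: str,
    instructions: str,
    examples: Sequence[Tuple[str, str]],
) -> str:
    """Assemble the prompt for the answering model.

    Empty instructions and no examples reduce this to a zero-shot prompt.
    """
    parts = []
    if instructions:
        parts.append("Reasoning steps to follow:\n" + instructions)
    for ex_question, ex_answer in examples:
        parts.append(f"Example question: {ex_question}\nExample answer: {ex_answer}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def instruct_icl_answer(
    instructor: MLLM,
    answerer: MLLM,
    image_path: str,
    question: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Stage 1: one model writes instructions. Stage 2: a second model answers."""
    # Assumed wording; the paper's actual instruction prompt is not given here.
    instruction_prompt = (
        "Write step-by-step instructions for answering this question "
        f"about a post-disaster image: {question}"
    )
    instructions = instructor(instruction_prompt, image_path)
    prompt = build_answer_prompt(question, instructions, examples)
    return answerer(prompt, image_path)
```

Varying the length of `examples` corresponds to the "varying strength" of in-context learning; passing an empty sequence keeps only the instruction-driven CoT guidance.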
Load-bearing premise
The accuracy gains on FloodNet come from the instruction-CoT-ICL combination itself rather than dataset quirks or fine details of the prompt wording.
What would settle it
Running the same instruction-CoT-ICL setups on a different post-disaster VQA dataset, such as one covering earthquake or wildfire damage, and finding no accuracy lift over zero-shot would show the gains are not general.
Original abstract
Rapid and accurate situational awareness is essential for effective response during natural disasters, where delays in analysis can significantly hinder decision-making. Training task-specific models for post-disaster assessment is often time-consuming and computationally expensive, making such approaches impractical in time-critical scenarios. Consequently, pretrained multimodal large language models (MLLMs) have emerged as a promising alternative for post-disaster visual question answering (VQA), a task that aims to answer structured questions about visual scenes by jointly reasoning over images and text. While these models demonstrate strong multimodal reasoning capabilities, their responses can be sensitive to prompt formulation, which can limit their reliability in real-world disaster assessment scenarios. In this paper, we investigate whether structured reasoning strategies can improve the reliability of pretrained MLLMs for post-disaster VQA. Specifically, we explore multiple prompting paradigms in which one MLLM is used to generate task-specific instructions that serve as Chain-of-Thought (CoT) guidance for a second MLLM. These instructions are incorporated during answer generation with varying degrees of in-context learning (ICL), enabling the model to leverage both explicit reasoning guidance and contextual examples. We conduct our evaluation on the FloodNet dataset and compare these approaches against a zero-shot baseline. Our results demonstrate that integrating instruction-driven CoT reasoning consistently improves answer accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Instruct-ICL, a prompting method for post-disaster visual question answering (VQA) in which one MLLM generates task-specific instructions to serve as Chain-of-Thought (CoT) guidance for a second MLLM. These instructions are combined with varying degrees of in-context learning (ICL) and evaluated on the FloodNet dataset against a zero-shot baseline, with the central claim that the approach consistently improves answer accuracy.
Significance. If the accuracy gains prove robust and attributable to the instruction-CoT-ICL mechanism, the work offers a practical, training-free way to enhance pretrained MLLM reliability for time-critical disaster assessment using only public datasets and existing models. It addresses prompt sensitivity in multimodal reasoning without requiring task-specific fine-tuning.
major comments (2)
- [Evaluation] Evaluation section: The paper reports results only on FloodNet and compares solely against zero-shot prompting. No ablation studies isolate the effect of the generated instructions from plain CoT, standard few-shot ICL, or prompt-length controls, leaving open whether observed gains are caused by the proposed combination or by incidental prompt engineering.
- [Abstract] Abstract and experimental results: The claim of 'consistent improvement' in answer accuracy is presented without quantitative values, error bars, statistical tests, or multiple-run details in the provided text, which is load-bearing for assessing the reliability and magnitude of the reported gains.
minor comments (1)
- [Method] The distinction between the two MLLMs (instruction generator vs. answer generator) is introduced without explicit notation or pseudocode, which could be clarified in the method description.
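The comparison arms requested in the first major comment could be enumerated as a small grid. The sketch below is only one way to organize such an ablation: the arm names, the `Arm` dataclass, and the shot counts are hypothetical and do not come from the manuscript.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence

# Hypothetical guidance types for the answering model's prompt.
GUIDANCE_TYPES = (
    "none",                    # no reasoning guidance
    "plain_cot",               # generic "think step by step" prompt
    "length_matched_filler",   # neutral text matching the instruction length
    "generated_instructions",  # instructions written by the first MLLM
)


@dataclass(frozen=True)
class Arm:
    guidance: str
    n_shots: int


def ablation_arms(shot_counts: Sequence[int] = (0, 2, 4)) -> List[Arm]:
    """Cross every guidance type with every number of in-context examples."""
    return [Arm(g, k) for g, k in product(GUIDANCE_TYPES, shot_counts)]


def label(arm: Arm) -> str:
    """Human-readable name for the arms the referee singles out."""
    if arm.guidance == "none":
        return "zero-shot baseline" if arm.n_shots == 0 else "standard few-shot ICL"
    if arm.guidance == "plain_cot":
        return "plain CoT"
    if arm.guidance == "length_matched_filler":
        return "prompt-length control"
    return "instruction-guided CoT with ICL"
```

Comparing the `generated_instructions` arms against `plain_cot` and `length_matched_filler` at the same shot count is what would separate the effect of the generated instructions from generic CoT prompting and from prompt length alone.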
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the evaluation and result presentation that we will address in revision to better substantiate the contributions of Instruct-ICL.
- Referee: [Evaluation] Evaluation section: The paper reports results only on FloodNet and compares solely against zero-shot prompting. No ablation studies isolate the effect of the generated instructions from plain CoT, standard few-shot ICL, or prompt-length controls, leaving open whether observed gains are caused by the proposed combination or by incidental prompt engineering.
Authors: We acknowledge that the current experiments are limited to FloodNet with a zero-shot baseline comparison. FloodNet serves as a standard public benchmark for post-disaster VQA, aligning with the practical, training-free focus of the work. However, we agree that isolating the contributions of the generated instructions, CoT guidance, and ICL is necessary to rule out prompt engineering effects. In the revised manuscript, we will add ablation studies comparing the full Instruct-ICL method against plain CoT (without generated instructions), standard few-shot ICL (without instruction generation), and prompt-length controls. We will also expand the discussion of dataset choice and note limitations regarding generalization. revision: yes
- Referee: [Abstract] Abstract and experimental results: The claim of 'consistent improvement' in answer accuracy is presented without quantitative values, error bars, statistical tests, or multiple-run details in the provided text, which is load-bearing for assessing the reliability and magnitude of the reported gains.
Authors: We agree that the abstract should include specific quantitative support for the 'consistent improvement' claim to enable proper assessment of the gains. The body of the paper contains the detailed accuracy results, but the abstract does not. In revision, we will update the abstract to report the key accuracy improvements with numerical values from the FloodNet experiments. Where multiple runs were performed, we will include variability information; otherwise, we will note the single-run nature and avoid unsubstantiated claims about statistical significance. revision: yes
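The variability information discussed in the second response could be reported in several ways. As one hedged illustration, the sketch below computes exact-match accuracy and a paired percentile bootstrap interval for the accuracy difference between a method and the zero-shot baseline on the same questions. The function names, the exact-match scoring rule, and the default resample count are assumptions for the example, not details from the manuscript.

```python
import random
from typing import Sequence, Tuple


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def paired_bootstrap_ci(
    base_correct: Sequence[int],
    method_correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy(method) - accuracy(baseline).

    Inputs are per-question 0/1 correctness flags in the same question order.
    Questions are resampled with replacement, keeping each pair together.
    """
    n = len(base_correct)
    if n == 0 or n != len(method_correct):
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(method_correct[i] - base_correct[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

An interval that excludes zero would support the 'consistent improvement' wording for that comparison; an interval that straddles zero would call for the more cautious phrasing the authors mention.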
Circularity Check
No significant circularity; empirical results on external dataset
The paper reports experimental accuracy gains from instruction-guided CoT + ICL prompting on the public FloodNet VQA dataset, compared only to a zero-shot baseline. No equations, parameter fits, or derivations are present. The central claim is an observed empirical outcome rather than a result that reduces by construction to the authors' own inputs or self-citations. No load-bearing self-referential steps exist.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Pretrained MLLMs can follow and benefit from structured reasoning instructions generated by another MLLM.
- domain assumption: The FloodNet dataset provides a valid testbed for post-disaster damage assessment VQA.