Dynamic Decision Learning: Test-Time Evolution for Abnormality Grounding in Rare Diseases

arxiv: 2604.24972 · v1 · submitted 2026-04-27 · 💻 cs.CL

Jun Li, Mingxuan Liu, Jiazhen Pan, Che Liu, Wenjia Bai, Cosmin I. Bercea, Julia A. Schnabel

Pith reviewed 2026-05-08 03:21 UTC · model grok-4.3

classification 💻 cs.CL

keywords dynamic decision learning, abnormality grounding, rare diseases, vision-language models, test-time adaptation, brain imaging, localization, reliability score

The pith

Frozen large vision-language models can substantially improve abnormality localization for rare diseases by iteratively refining decisions at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Rare diseases present a data scarcity problem that makes conventional supervised training of vision-language models impractical and leaves single-pass inferences unstable. The paper proposes Dynamic Decision Learning as a framework that lets these models evolve their outputs without any parameter updates. It does so by optimizing the language instructions fed to the model and then consolidating multiple predictions generated under controlled visual perturbations. This yields both higher-quality localizations of abnormalities and a reliability score based on consensus across those predictions. Experiments across brain imaging benchmarks with hundreds of pathology types and model sizes from 3B to 72B parameters show marked gains over adaptation baselines and even supervised fine-tuning.

Core claim

Dynamic Decision Learning enables a frozen large vision-language model to refine its decisions across language and visual spaces by performing iterative instruction optimization and consolidating predictions generated under visual perturbations, thereby improving localization quality for rare-disease abnormalities and producing a consensus-based reliability score that better tracks actual accuracy.

What carries the argument

Dynamic Decision Learning (DDL), the test-time process of iteratively optimizing instructions in language space while consolidating predictions obtained under visual perturbations to reach a consensus localization and reliability score.

Load-bearing premise

Iterative instruction optimization together with prediction consolidation under visual perturbations will reliably raise localization quality and produce a well-calibrated reliability score without introducing new instabilities or biases into the frozen model.

What would settle it

Running DDL on a new, held-out brain-imaging dataset containing previously unseen rare pathologies and finding that mAP@75 does not rise or that the reliability scores lose their correlation with localization accuracy would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.24972 by Che Liu, Cosmin I. Bercea, Jiazhen Pan, Julia A. Schnabel, Jun Li, Mingxuan Liu, Wenjia Bai.

**Figure 1.** Top: Static inference with frozen LVLMs exhibits prompt and perturbation sensitivity on rare pathologies, leading to unstable and hallucinated localizations. Bottom: DDL performs test-time prompt optimization and multi-view verification, yielding substantially more stable and reliable localizations.

**Figure 2.** Overview of the Dynamic Decision Learning framework. DDL instantiates adaptive inference through two components: instruction-space optimization (DAPE), which refines task-specific prompts on a development set, and visual-consensus verification (V-PUP and RHC), which evaluates cross-view consistency and aggregates localization hypotheses via bipartite matching.

**Figure 3.** Instructional optimization dynamics in DAPE. (a, b) Monotonic improvement in Top-3 candidate performance across model scales on the NOVA and BTD benchmarks. (c) Distributional shifts in the instruction pool: Kernel Density Estimation (KDE) reveals a rightward shift of the medium-score region and a sharpening of the high-performance tail, indicating convergence toward robust instructional priors.

**Figure 4.** Qualitative grounding results on rare pathologies from NOVA. Blue dashed boxes denote the vanilla baseline, orange dashed boxes correspond to DDL-DAPE, and yellow dashed boxes show the final DDL output. Green solid boxes indicate ground truth. DDL progressively suppresses unstable detections and improves spatial alignment with ground truth.

**Figure 5.** Spatial calibration dynamics on the BTD and NOVA datasets. (Left) The 3B model exhibits flat, decoupled confidence-performance curves (r ≈ 0.1). (Right) Model scaling (32B) unlocks a strong alignment with the diagonal, transforming consensus reliability into a predictive indicator of clinical grounding success.

**Figure 6.** Treemap visualizing the taxonomic distribution of clinical pathologies in the NOVA dataset.

**Figure 7.** Calibration analysis across model scales on the BTD dataset.

**Figure 8.** Calibration analysis across model scales on the NOVA dataset.

[Plot residue, titled "DAPE Evolution Visualization". Panel "Base vs DAPE Generated - Best Scores" (Score by Model, 3b / 7b / 32b / 72b): Base 0.120 / 0.120 / 0.190 / 0.210; DAPE Generated (Max) 0.150 / 0.160 / 0.210 / 0.240. Panel "DAPE Generated vs Base - Improvement" (Improvement (%) by Model): 25.0% / 33.3% / 10.5% / 14.3%.]

**Figure 9.** Visualization of grounding improvement driven by DAPE iterations on the development set.

**Figure 10.** Evolutionary trajectory of DAPE instruction samples for the Qwen2.5-VL-3B model.

**Figure 11.** Evolutionary trajectory of DAPE instruction samples for the Qwen2.5-VL-7B model.

**Figure 12.** Evolutionary trajectory of DAPE instruction samples for the Qwen2.5-VL-32B model.

**Figure 13.** Evolutionary trajectory of DAPE instruction samples for the Qwen2.5-VL-72B model.

**Figure 14.** Qualitative visualization of the final DDL grounding decisions on the 3B model.

**Figure 15.** Qualitative visualization of the final DDL grounding decisions on the 7B model.

**Figure 16.** Qualitative visualization of the final DDL grounding decisions on the 32B model.

**Figure 17.** Qualitative visualization of the final DDL grounding decisions on the 72B model.

Original abstract

Clinical abnormality grounding for rare diseases is often hindered by data scarcity, making supervised fine-tuning impractical and single-pass inference highly unstable. We propose Dynamic Decision Learning (DDL), a framework that enables frozen large vision-language models (LVLMs) to refine their decisions across both language and visual spaces by optimizing instructions and consolidating predictions under visual perturbations. This process improves localization quality and produces a consensus-based reliability score that quantifies model confidence. Results on brain imaging benchmarks, including a rare-disease dataset with 281 pathology types across models ranging from 3B to 72B parameters, show that DDL improves mAP@75 by up to 105% on rare-disease cases and outperforms adaptation baselines and supervised fine-tuning. Furthermore, DDL demonstrates stronger calibration between reliability scores and localization accuracy under severe distribution shifts and increasing task difficulty. Code is available at: https://lijunrio.github.io/DDL/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper proposes Dynamic Decision Learning (DDL), a test-time framework for frozen large vision-language models (LVLMs) that refines abnormality grounding decisions for rare diseases by iteratively optimizing instructions and consolidating predictions under visual perturbations. This produces improved localization and a consensus-based reliability score. On brain imaging benchmarks including a rare-disease dataset spanning 281 pathology types and models from 3B to 72B parameters, DDL achieves up to 105% relative improvement in mAP@75 on rare cases, outperforming adaptation baselines and supervised fine-tuning, with stronger calibration under distribution shifts.

Significance. If the reported gains hold under scrutiny, DDL offers a practical test-time alternative to fine-tuning for data-scarce medical imaging tasks, which is valuable given the impracticality of supervised learning on rare diseases. The work is strengthened by explicit code release, evaluation across a wide range of model scales, and construction of a challenging benchmark emphasizing scarcity (281 pathologies).

minor comments (1)

[Abstract] The abstract reports quantitative gains (e.g., up to 105% mAP@75 improvement) but gives no details on the optimization procedure, perturbation strategy, statistical testing, or failure cases. A concise sentence covering these elements would improve verifiability without adding much length.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The description accurately reflects DDL's test-time instruction optimization and prediction consolidation under visual perturbations for improving abnormality grounding in rare diseases, along with the reported gains, broad model-scale evaluation, and the challenging 281-pathology benchmark. No major comments appear in the provided report, so we have no specific points requiring rebuttal or clarification at this stage.

Circularity Check

0 steps flagged

No significant circularity; empirical test-time method validated on external benchmarks

full rationale

The paper presents DDL as an algorithmic test-time procedure (iterative instruction optimization plus perturbation-based prediction consolidation) applied to frozen LVLMs. All reported results are empirical measurements of mAP@75 and calibration on held-out benchmarks, including a constructed rare-disease dataset of 281 pathologies, with explicit comparisons to adaptation baselines and supervised fine-tuning. No equations, uniqueness theorems, or first-principles derivations are claimed; performance gains are not obtained by fitting parameters to the evaluation set and then relabeling them as predictions. No self-citations are invoked to justify core assumptions, and the reliability score is defined directly from the consolidation step rather than presupposing the target accuracy. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based solely on abstract; detailed parameter and assumption inventory not possible. The central claim rests on the unstated effectiveness of instruction optimization and visual perturbation consolidation.

axioms (1)

domain assumption Frozen LVLMs can be improved via test-time instruction optimization and multi-view prediction consolidation
Core premise of DDL invoked to justify the framework.

Reference graph

Works this paper leans on

21 extracted references

[1]

Check for symmetry and identify asymmetries
[2]

Look for abnormal signal intensities
[3]

"bbox_2d": [x1, y1, x2, y2],

Identify any mass effects or distortions. Return bounding boxes in JSON format: [ {"bbox_2d": [x1, y1, x2, y2], "label": "abnormality"} ] Variant 2: Act as an expert neuroradiologist. Carefully examine this MRI for any clinically significant abnormalities. Flag regions with high confidence of being a lesion, tumor, or infarct. Return results strictly as J...
[4]

Analyze key differences between high and low performing prompts
[5]

Identify specific successful elements to incorporate
[6]

Understand what aspects in low-performance prompts hurt results
[7]

5. The output prompt should also like the success prompt, avoid too long

Generate ONE improved prompt avoiding identified pitfalls. 5. The output prompt should also like the success prompt, avoid too long. Output Format: <IMPROVED PROMPT> [Your Prompt] </IMPROVED PROMPT> Meta-Optimizer Refinement (Exploitative Scenario) You are analyzing prompts for medical image analysis tasks. All candidate prompts perform better than the bas...
[8]

Analyze what makes the best prompts work so well
[9]

Identify the key success factors in both prompt sets
[10]

Understand what distinguishes the best from the weaker ones
[11]

textual gradient

Generate ONE improved prompt combining the best elements. Output strictly using <ANALYSIS> and <IMPROVED PROMPT> tags. Convergence Criteria via Top-k Stability. We monitor the convergence of the DAPE process by calculating the standard deviation (std) of the Top-3 prompts. We define the search as converged when $\mathrm{std}(\text{Top-3}) < 10^{-4}$, signaling that the Meta-LLM...
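The Top-3 stability criterion can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes the standard deviation is taken over the development-set scores of the three best prompts and that a population standard deviation is meant (the fragment only says "std"). The function name `top_k_converged` and the example scores are hypothetical.

```python
from statistics import pstdev


def top_k_converged(scores, k=3, tol=1e-4):
    """Return True when the k best scores have a standard deviation below tol.

    Assumption: population std over prompt scores; the source only says "std".
    """
    if len(scores) < k:
        return False  # not enough candidates to judge stability
    top_k = sorted(scores, reverse=True)[:k]
    return pstdev(top_k) < tol


# Hypothetical score pools: the first is still spread out, the second has settled.
print(top_k_converged([0.150, 0.160, 0.120, 0.177]))        # False
print(top_k_converged([0.17700, 0.17701, 0.17702, 0.120]))  # True
```

A check like this would typically run once per optimization round, after the instruction pool has been re-scored.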
[12]

Consensus (Cns): Calculated as $\frac{1 + N_{\text{matched}}}{M + 1}$, where $N_{\text{matched}}$ is the number of augmented views (out of $M = 7$) whose detections successfully matched anchor box $b_j$ (IoU ≥ 0.1)
[13]

Good” pool (Top 50%) and the “Bad

Consistency (Cst): The average IoU of successful matches: $\frac{1}{N_{\text{matched}}} \sum_{m} \mathrm{IoU}(b_j^{\text{ref}}, \hat{b}_{j,m})$, where the sum is over matched detections. Final reliability is $\sigma_j = \omega_1 \mathrm{Cns} + \omega_2 \mathrm{Cst}$, with $\omega_1 = 0.6$, $\omega_2 = 0.4$ as defined in Sec. C.4. C.4. Hyperparameter Table 9. DDL Hyperparameters. Component Parameter Value Inference Hardware Infrastructure NVIDIA H800 GPUs Computing ...
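The consensus and consistency terms combine into a per-box reliability score, which the following sketch illustrates. Several things here are assumptions rather than details confirmed by the source: boxes from augmented views are taken to be already mapped back into the anchor's coordinate frame, each view contributes at most one pre-selected candidate box (the paper aggregates via bipartite matching, which is not reproduced here), and Cst is set to 0 when no view matches. The names `iou` and `reliability` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def reliability(anchor, view_boxes, m_views=7, iou_thr=0.1, w_cns=0.6, w_cst=0.4):
    """sigma_j = w_cns * Cns + w_cst * Cst for one anchor box.

    view_boxes holds one candidate box per augmented view, or None when the
    view produced no detection (assumed input format).
    """
    overlaps = [iou(anchor, b) for b in view_boxes if b is not None]
    matched = [v for v in overlaps if v >= iou_thr]    # successful matches only
    cns = (1 + len(matched)) / (m_views + 1)           # consensus term
    cst = sum(matched) / len(matched) if matched else 0.0  # assumption: 0 if no match
    return w_cns * cns + w_cst * cst


anchor = (10, 10, 50, 50)
views = [anchor] * 5 + [None, (200, 200, 240, 240)]  # 5 agree, 1 empty, 1 far away
print(round(reliability(anchor, views), 3))
```

With five of seven views agreeing exactly, Cns is 6/8 and Cst is 1, so the score is 0.6 · 0.75 + 0.4 · 1.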
[14]

To ensure robustness, we also compute Spearman's ρ and Kendall's τ to capture non-linear monotonic relationships

Correlation Coefficients: We primarily utilize the Pearson correlation coefficient (r) to measure the linear relationship between σ and the Ground Truth IoU (GT-IoU). To ensure robustness, we also compute Spearman's ρ and Kendall's τ to capture non-linear monotonic relationships
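To make the difference between the linear and rank-based measures concrete, here is a dependency-free sketch of Pearson's r and Spearman's ρ between reliability scores and GT-IoU. It is an illustration with made-up numbers, not the paper's evaluation code; Kendall's τ and the p-values are omitted for brevity. In practice a library such as SciPy (`scipy.stats.pearsonr`, `spearmanr`, `kendalltau`) would normally be used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores: monotone but not linear in each other.
sigma = [0.2, 0.4, 0.6, 0.8]
gt_iou = [0.1, 0.2, 0.4, 0.9]
print(pearson(sigma, gt_iou), spearman(sigma, gt_iou))
```

Because the example relationship is monotone but curved, ρ is 1 while r stays below 1, which is the situation the rank-based measures are meant to capture.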
[15]

Statistical Significance: For all correlation measures, we report p-values with the following markers: *** (p < 0.001), ** (p < 0.01), * (p < 0.05), and ns (p ≥ 0.05)
[16]

Mean Absolute Error (MAE): Calculated as $\frac{1}{N} \sum_{i=1}^{N} |\sigma_i - \mathrm{IoU}_i|$, representing the global calibration gap between perceived reliability and actual precision
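As a small worked instance of this calibration gap, the sketch below averages the absolute difference between each reliability score and the IoU of the corresponding prediction. The function name `calibration_mae` and the three example pairs are hypothetical.

```python
def calibration_mae(sigmas, ious):
    """Mean absolute gap between reliability scores and ground-truth IoU."""
    if len(sigmas) != len(ious) or not sigmas:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(s - v) for s, v in zip(sigmas, ious)) / len(sigmas)


# Gaps of 0.05, 0.15 and 0.10 average to about 0.10.
print(round(calibration_mae([0.85, 0.40, 0.10], [0.80, 0.55, 0.00]), 3))
```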
[17]

Confidence Dispersion ($\sigma_{\text{conf}}$): The standard deviation of the reliability scores, used to identify if a model's confidence distribution is collapsing (blindly confident) or appropriately sparse (honest uncertainty). Reliability Diagram Construction (Figure 5): The visualization of calibration dynamics follows a systematic procedure: 1. Binning: The predicted ...
[18]

no target

95% Confidence Intervals (CI): We estimate bin uncertainty using $1.96 \cdot s_j / \sqrt{n_j}$, where $s_j$ is the standard deviation and $n_j$ is the sample count in the bin. D.6. Baseline implementation details In Table 5 and Table 6, we provide a detailed description of the configurations for baseline methods to ensure reproducibility. Manual Prompting Baselines: These methods...
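The per-bin interval can be sketched as follows. The binning details are truncated in the source, so this example assumes equal-width bins over reliability scores in [0, 1], a hypothetical bin count of 5, and the sample standard deviation for $s_j$; bins with a single sample get a zero half-width here purely as a convention of this sketch.

```python
from math import sqrt
from statistics import mean, stdev


def reliability_bins(sigmas, ious, n_bins=5):
    """Group GT-IoU values by reliability score and attach a 95% CI half-width.

    Returns (bin_index, count, mean_iou, half_width) for each non-empty bin.
    Assumptions: equal-width bins on [0, 1]; sample std within each bin.
    """
    bins = [[] for _ in range(n_bins)]
    for s, v in zip(sigmas, ious):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append(v)
    rows = []
    for j, vals in enumerate(bins):
        if not vals:
            continue
        half = 1.96 * stdev(vals) / sqrt(len(vals)) if len(vals) > 1 else 0.0
        rows.append((j, len(vals), mean(vals), half))
    return rows


# Hypothetical predictions: two low-reliability boxes and three high-reliability ones.
for row in reliability_bins([0.1, 0.15, 0.85, 0.9, 1.0], [0.0, 0.2, 0.7, 0.9, 0.8]):
    print(row)
```

Plotting each bin's mean IoU against its reliability range, with the half-width as an error bar, gives a reliability diagram of the kind described for Figure 5.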

2022
[19]

Space-occupying lesions (tumors, cysts)
[20]

Areas of inflammation or edema
[21]

no target

Structural anomalies or vascular abnormalities. Return the result strictly in the following JSON format. If no pathology is detected, return: "no target". ```json [ {"bbox_2d": [x1, y1, x2, y2], "label": "pathology_type"} ] ``` DAPE_gen_1 DAPE Generated 0.177 1 Examine the MRI scan for pathological abnormalities. Pay close attention to Focus on detecting change...
