DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

arxiv: 2605.15532 · v2 · pith:34ARD4PWnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI· cs.CL

DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

Jaehun Jung , Hyunwoo Kim , Brandon Cui , Ximing Lu , David Acuna , Prithviraj Ammanabrolu , Yejin Choi This is my paper

Pith reviewed 2026-05-19 14:31 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords DeltaPromptsmultimodal distillationvision-language modelsanswer divergenceprompt synthesiszero-deltareasoning distillationsynthetic data

0 comments p. Extension

pith:34ARD4PW Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{34ARD4PW}

Prints a linked pith:34ARD4PW badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

High answer divergence between teacher and student makes prompts far more effective for distilling reasoning into smaller vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many prompts in common datasets for training vision-language models turn out to be zero-delta, meaning the large teacher model and the smaller student model already give the same answers and thus teach nothing new. The paper argues that this explains why adding more data often stops helping after a point. Instead, the authors create prompts deliberately designed so that the models disagree in their answers. This leads to a new dataset that improves results on multiple types of reasoning tasks even when added to strong existing models.

Core claim

The key insight is that effective distillation requires prompts that induce different answer distributions from the teacher and the student. Up to 69 percent of prompts in typical chart and document datasets fail this test. By building a staged synthesis pipeline that starts from existing data and iteratively targets areas where the student fails, the method produces 200k prompts with high divergence. These prompts drive measurable improvements in student performance across on-policy distillation, transfer to new models, and off-policy fine-tuning.

What carries the argument

Answer divergence Δ, which measures the extent to which a prompt causes the teacher and student to produce different answers, used to identify and generate useful distillation examples.

If this is right

High-divergence prompts sustain learning gains instead of causing early saturation.
The dataset works without needing to be regenerated for new teacher-student pairs.
Improvements occur on a wide range of chart, document, and perception reasoning benchmarks.
The benefits apply to both highly optimized reasoning models and basic non-reasoning models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The divergence idea might help in curating data for other forms of model training where disagreement signals value.
It raises the question of whether similar traps exist in single-model training or pretraining stages.
Automated pipelines like this could reduce reliance on human-curated datasets for distillation.

Load-bearing premise

The premise that maximizing answer divergence through targeted synthesis will reliably improve the student's capabilities without introducing new biases that hurt performance on real tasks.

What would settle it

Finding that training with DeltaPrompts produces no better or worse results than training with the original datasets on the ten benchmarks would show the approach does not deliver the promised gains.

Figures

Figures reproduced from arXiv: 2605.15532 by Brandon Cui, David Acuna, Hyunwoo Kim, Jaehun Jung, Prithviraj Ammanabrolu, Ximing Lu, Yejin Choi.

**Figure 1.** Figure 1: Overview of DELTAPROMPTS. We first identify the zero-delta trap: for majority of the prompts in chart & document reasoning datasets, the teacher and student already produce an identical answer distribution. We show that distillation on these zero-delta prompts reaches early saturation in student performance (§2.3). We then synthesize DELTAPROMPTS, leveraging teacher model itself to find the divergence-indu… view at source ↗

**Figure 2.** Figure 2: (Left) Divergence is critical for effective OPD. When controlling for diversity, delta prompts (∆ > 0) improves with data size while random or zero-delta subsets meet early saturation. (Right) Existing datasets trade off divergence against diversity—filtering existing mixtures for delta prompts (∆ > 0) reduces diversity by up to 29.4% as measured by E-Vendi. divergence. While controlling for the data scale… view at source ↗

**Figure 3.** Figure 3: (Left) Divergence distribution of an off-the-shelf chart / document reasoning data mixture. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: (Left) Ablation on answer divergence. DELTAPROMPTS leads to the best results even under a data size-controlled setup, while adding additional zero-delta data does not help. (Right) Ablation on consistency-based filtering. Further optimizing for teacher consistency (EasyPrompts) yields no gain, while training on prompts the teacher cannot reliably solve (HardPrompts) degrades performance. 4.2.3 DELTAPROMPTS… view at source ↗

**Figure 5.** Figure 5: Distribution of answer divergence of the four datasets used in § [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

read the original abstract

Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($\Delta$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking) -- averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags a real data inefficiency in VLM distillation and shows practical gains from targeting disagreement, but the evidence tying those gains specifically to divergence rather than synthesis quality is still thin.

read the letter

The core observation here is that a large fraction of prompts in standard chart and document datasets produce identical answer distributions from teacher and student, so they add almost no training signal. The authors quantify this at 69% zero-delta and then build a staged synthesis pipeline that starts from existing seeds and steers toward student failure modes to create 200k higher-divergence examples. That framing and the resulting dataset are the main new pieces. They report up to 15% relative gains averaged over 10 benchmarks, and the improvements hold in on-policy distillation, transfer to a different model family, and off-policy fine-tuning of a non-reasoning base. Those cross-setting results are the strongest part of the work and give the method some credibility for practical use. The stress-test concern is worth taking seriously: the pipeline targets failure modes and produces synthetic data, so it is possible the gains come from better-structured or easier prompts rather than from the measured divergence itself. The abstract does not describe an ablation that holds the synthesis procedure fixed while varying only the divergence selection criterion, which leaves the central causal claim only partially supported. The math and data handling look straightforward with no obvious circularity or invented entities beyond the zero-delta label. This is the kind of paper that would interest groups working on efficient multimodal reasoning or distillation under tight data budgets. A reader who needs concrete improvements on chart, document, or perception tasks would find usable ideas and numbers here. It is coherent enough and grounded enough in results to deserve a serious referee, even though the mechanism question would likely come up in review.

Referee Report

2 major / 2 minor

Summary. The paper identifies that up to 69% of prompts in standard chart/document reasoning datasets are zero-delta (teacher and student induce identical answer distributions), providing minimal learning signal during distillation of reasoning into compact VLMs. It introduces DeltaPrompts, a 200k synthetic dataset generated via a staged synthesis pipeline that repurposes seeds to target student failure modes and produce high answer-divergence (Δ) prompts. The authors report up to 15% relative gains averaged over 10 benchmarks (chart, document, perception reasoning) in on-policy distillation, transfer to new model families, and off-policy fine-tuning, even atop optimized models such as Qwen3-VL-8B-Thinking.

Significance. If the gains are shown to stem specifically from high-Δ prompts rather than ancillary effects of the synthesis procedure, the work would usefully advance efficient multimodal distillation by providing a principled, divergence-based criterion for prompt selection and generation. The zero-delta statistic and multi-setting evaluation (including transfer and off-policy) are potentially impactful for scaling reasoning capabilities in smaller VLMs.

major comments (2)

[Experiments / Pipeline description] The central claim that non-zero divergence (Δ) is the operative mechanism for improved distillation signal is load-bearing, yet the manuscript provides no ablation that isolates Δ from other factors introduced by the staged synthesis pipeline (e.g., failure-mode targeting, prompt structure, or distributional shifts). Without such a control—comparing high-Δ prompts against pipeline outputs filtered only for failure modes or random synthesis—the 15% relative gains cannot be confidently attributed to the zero-delta diagnosis rather than synthetic data quality. This appears in the experimental evaluation and pipeline description sections.
[Methods / §3 (quantification of Δ)] Exact computation of answer divergence Δ is not fully specified (e.g., the divergence measure, number of samples per prompt, or handling of multimodal inputs), which is required to verify the 69% zero-delta statistic and to reproduce the prompt selection criterion. This detail is essential for confirming that Δ directly measures functional capability gaps.

minor comments (2)

[Introduction / §2] Notation for Δ could be formalized with an explicit equation or pseudocode in the main text for clarity.
[Results] Per-benchmark breakdowns of the 15% average improvement would help assess consistency across chart, document, and perception tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments help clarify how to better substantiate the role of answer divergence in our distillation pipeline. Below we respond point-by-point to the major comments and indicate the changes we will make in the revised manuscript.

read point-by-point responses

Referee: [Experiments / Pipeline description] The central claim that non-zero divergence (Δ) is the operative mechanism for improved distillation signal is load-bearing, yet the manuscript provides no ablation that isolates Δ from other factors introduced by the staged synthesis pipeline (e.g., failure-mode targeting, prompt structure, or distributional shifts). Without such a control—comparing high-Δ prompts against pipeline outputs filtered only for failure modes or random synthesis—the 15% relative gains cannot be confidently attributed to the zero-delta diagnosis rather than synthetic data quality. This appears in the experimental evaluation and pipeline description sections.

Authors: We agree that an explicit ablation isolating the contribution of high Δ would strengthen attribution of the observed gains. Our current multi-setting results (on-policy distillation, transfer to unseen model families without data regeneration, and off-policy fine-tuning) already provide indirect evidence that the benefits are not reducible to generic synthetic-data quality, because the same DeltaPrompts corpus improves performance even when the teacher-student pair differs from the one used during synthesis. Nevertheless, to directly address the concern we will add a controlled comparison in the revised experimental section: we will generate an auxiliary set of prompts using the identical staged pipeline but then filter or sample to produce low-Δ variants (or purely random synthesis outputs) and report distillation performance on the same 10 benchmarks. This control will be presented alongside the existing results. revision: yes
Referee: [Methods / §3 (quantification of Δ)] Exact computation of answer divergence Δ is not fully specified (e.g., the divergence measure, number of samples per prompt, or handling of multimodal inputs), which is required to verify the 69% zero-delta statistic and to reproduce the prompt selection criterion. This detail is essential for confirming that Δ directly measures functional capability gaps.

Authors: We appreciate the request for reproducibility details. In the original manuscript §3 described Δ at the level of “answer divergence between teacher and student distributions,” but omitted the precise measure and sampling procedure. In the revision we will expand this section to state that Δ is computed as the Jensen-Shannon divergence between the empirical answer distributions obtained from 8 independent generations per prompt (temperature 0.7), that multimodal inputs are passed identically to both models, and that a prompt is labeled zero-delta when Δ falls below a threshold of 0.05. We will also include a short algorithm box and the exact code snippet used to compute the 69 % statistic so that readers can reproduce the prompt-selection criterion. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement and synthesis pipeline remain self-contained

full rationale

The paper's derivation begins from the standard distillation objective of minimizing distributional divergence between teacher and student, then empirically measures answer divergence Δ on existing datasets to identify zero-delta prompts (up to 69%). It proposes a staged synthesis pipeline that targets student failure modes to generate high-Δ prompts, producing the DeltaPrompts dataset. All load-bearing steps—quantifying the capability gap via Δ, demonstrating scaling benefits, and evaluating gains on 10 benchmarks—are grounded in direct measurement and controlled experiments rather than any equation, fitted parameter, or self-citation that reduces the output to the input by construction. The central claim that non-zero Δ provides the operative learning signal is tested through ablation-style evaluations and transfer settings, keeping the chain independent of the reported results themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claim rests primarily on the domain assumption that distributional divergence between teacher and student directly measures useful learning signal, plus the ad-hoc synthesis pipeline; no explicit free parameters or invented physical entities are introduced.

axioms (1)

domain assumption Distillation fundamentally minimizes distributional divergence between teacher and student
Invoked in the abstract as the first-principles return to justify why non-zero delta matters.

invented entities (1)

zero-delta prompt no independent evidence
purpose: Prompts where teacher and student induce identical answer distributions, providing no learning signal
New diagnostic concept introduced to explain saturation in standard distillation data.

pith-pipeline@v0.9.0 · 5848 in / 1358 out tokens · 44582 ms · 2026-05-19T14:31:11.687335+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean Jcost_pos_of_ne_one echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate
IndisputableMonolith/Cost Jcost_unit0 echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence (Δ)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.