DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation
Pith reviewed 2026-05-19 14:31 UTC · model grok-4.3
pith:34ARD4PW Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{34ARD4PW}
Prints a linked pith:34ARD4PW badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
High answer divergence between teacher and student makes prompts far more effective for distilling reasoning into smaller vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The key insight is that effective distillation requires prompts that induce different answer distributions from the teacher and the student. Up to 69 percent of prompts in typical chart and document datasets fail this test. By building a staged synthesis pipeline that starts from existing data and iteratively targets areas where the student fails, the method produces 200k prompts with high divergence. These prompts drive measurable improvements in student performance across on-policy distillation, transfer to new models, and off-policy fine-tuning.
What carries the argument
Answer divergence Δ, which measures the extent to which a prompt causes the teacher and student to produce different answers, used to identify and generate useful distillation examples.
If this is right
- High-divergence prompts sustain learning gains instead of causing early saturation.
- The dataset works without needing to be regenerated for new teacher-student pairs.
- Improvements occur on a wide range of chart, document, and perception reasoning benchmarks.
- The benefits apply to both highly optimized reasoning models and basic non-reasoning models.
Where Pith is reading between the lines
- The divergence idea might help in curating data for other forms of model training where disagreement signals value.
- It raises the question of whether similar traps exist in single-model training or pretraining stages.
- Automated pipelines like this could reduce reliance on human-curated datasets for distillation.
Load-bearing premise
The premise that maximizing answer divergence through targeted synthesis will reliably improve the student's capabilities without introducing new biases that hurt performance on real tasks.
What would settle it
Finding that training with DeltaPrompts produces no better or worse results than training with the original datasets on the ten benchmarks would show the approach does not deliver the promised gains.
Figures
read the original abstract
Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($\Delta$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking) -- averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies that up to 69% of prompts in standard chart/document reasoning datasets are zero-delta (teacher and student induce identical answer distributions), providing minimal learning signal during distillation of reasoning into compact VLMs. It introduces DeltaPrompts, a 200k synthetic dataset generated via a staged synthesis pipeline that repurposes seeds to target student failure modes and produce high answer-divergence (Δ) prompts. The authors report up to 15% relative gains averaged over 10 benchmarks (chart, document, perception reasoning) in on-policy distillation, transfer to new model families, and off-policy fine-tuning, even atop optimized models such as Qwen3-VL-8B-Thinking.
Significance. If the gains are shown to stem specifically from high-Δ prompts rather than ancillary effects of the synthesis procedure, the work would usefully advance efficient multimodal distillation by providing a principled, divergence-based criterion for prompt selection and generation. The zero-delta statistic and multi-setting evaluation (including transfer and off-policy) are potentially impactful for scaling reasoning capabilities in smaller VLMs.
major comments (2)
- [Experiments / Pipeline description] The central claim that non-zero divergence (Δ) is the operative mechanism for improved distillation signal is load-bearing, yet the manuscript provides no ablation that isolates Δ from other factors introduced by the staged synthesis pipeline (e.g., failure-mode targeting, prompt structure, or distributional shifts). Without such a control—comparing high-Δ prompts against pipeline outputs filtered only for failure modes or random synthesis—the 15% relative gains cannot be confidently attributed to the zero-delta diagnosis rather than synthetic data quality. This appears in the experimental evaluation and pipeline description sections.
- [Methods / §3 (quantification of Δ)] Exact computation of answer divergence Δ is not fully specified (e.g., the divergence measure, number of samples per prompt, or handling of multimodal inputs), which is required to verify the 69% zero-delta statistic and to reproduce the prompt selection criterion. This detail is essential for confirming that Δ directly measures functional capability gaps.
minor comments (2)
- [Introduction / §2] Notation for Δ could be formalized with an explicit equation or pseudocode in the main text for clarity.
- [Results] Per-benchmark breakdowns of the 15% average improvement would help assess consistency across chart, document, and perception tasks.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments help clarify how to better substantiate the role of answer divergence in our distillation pipeline. Below we respond point-by-point to the major comments and indicate the changes we will make in the revised manuscript.
read point-by-point responses
-
Referee: [Experiments / Pipeline description] The central claim that non-zero divergence (Δ) is the operative mechanism for improved distillation signal is load-bearing, yet the manuscript provides no ablation that isolates Δ from other factors introduced by the staged synthesis pipeline (e.g., failure-mode targeting, prompt structure, or distributional shifts). Without such a control—comparing high-Δ prompts against pipeline outputs filtered only for failure modes or random synthesis—the 15% relative gains cannot be confidently attributed to the zero-delta diagnosis rather than synthetic data quality. This appears in the experimental evaluation and pipeline description sections.
Authors: We agree that an explicit ablation isolating the contribution of high Δ would strengthen attribution of the observed gains. Our current multi-setting results (on-policy distillation, transfer to unseen model families without data regeneration, and off-policy fine-tuning) already provide indirect evidence that the benefits are not reducible to generic synthetic-data quality, because the same DeltaPrompts corpus improves performance even when the teacher-student pair differs from the one used during synthesis. Nevertheless, to directly address the concern we will add a controlled comparison in the revised experimental section: we will generate an auxiliary set of prompts using the identical staged pipeline but then filter or sample to produce low-Δ variants (or purely random synthesis outputs) and report distillation performance on the same 10 benchmarks. This control will be presented alongside the existing results. revision: yes
-
Referee: [Methods / §3 (quantification of Δ)] Exact computation of answer divergence Δ is not fully specified (e.g., the divergence measure, number of samples per prompt, or handling of multimodal inputs), which is required to verify the 69% zero-delta statistic and to reproduce the prompt selection criterion. This detail is essential for confirming that Δ directly measures functional capability gaps.
Authors: We appreciate the request for reproducibility details. In the original manuscript §3 described Δ at the level of “answer divergence between teacher and student distributions,” but omitted the precise measure and sampling procedure. In the revision we will expand this section to state that Δ is computed as the Jensen-Shannon divergence between the empirical answer distributions obtained from 8 independent generations per prompt (temperature 0.7), that multimodal inputs are passed identically to both models, and that a prompt is labeled zero-delta when Δ falls below a threshold of 0.05. We will also include a short algorithm box and the exact code snippet used to compute the 69 % statistic so that readers can reproduce the prompt-selection criterion. revision: yes
Circularity Check
No circularity: empirical measurement and synthesis pipeline remain self-contained
full rationale
The paper's derivation begins from the standard distillation objective of minimizing distributional divergence between teacher and student, then empirically measures answer divergence Δ on existing datasets to identify zero-delta prompts (up to 69%). It proposes a staged synthesis pipeline that targets student failure modes to generate high-Δ prompts, producing the DeltaPrompts dataset. All load-bearing steps—quantifying the capability gap via Δ, demonstrating scaling benefits, and evaluating gains on 10 benchmarks—are grounded in direct measurement and controlled experiments rather than any equation, fitted parameter, or self-citation that reduces the output to the input by construction. The central claim that non-zero Δ provides the operative learning signal is tested through ablation-style evaluations and transfer settings, keeping the chain independent of the reported results themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Distillation fundamentally minimizes distributional divergence between teacher and student
invented entities (1)
-
zero-delta prompt
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanJcost_pos_of_ne_one echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate
-
IndisputableMonolith/CostJcost_unit0 echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence (Δ)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.