Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Gongbo Zhang; Li Yuan; Wen Wang; Ye Tian

arXiv: 2604.26951 · v1 · submitted 2026-04-29 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-07 09:49 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords diffusion language models · knowledge distillation · cross-architecture transfer · model compression · parallel decoding · code generation · mixture of experts · tokenization

The pith

A framework called TIDE distills knowledge from 8B and 16B diffusion language models into a 0.6B student across different architectures and tokenizers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TIDE as a way to transfer capabilities from large diffusion-based language models to far smaller ones despite mismatches in architecture, attention, and tokenizers. It combines three components that adjust how distillation strength changes over time and noise levels, enrich the teacher's masked predictions, and match likelihoods in reverse across tokenizers. The result is a 0.6B student that beats a standard autoregressive baseline by 1.53 points on average across eight benchmarks, with especially large lifts in code generation. This approach matters because diffusion models support parallel decoding and bidirectional context, yet they have historically needed billions of parameters to reach competitive accuracy. If the gains hold, compact diffusion models could become practical alternatives without requiring matching model designs.

Core claim

TIDE is the first cross-architecture distillation framework for diffusion large language models. It consists of TIDAL, which modulates distillation strength jointly across training progress and diffusion timestep according to the teacher's noise-dependent reliability; CompDemo, which applies complementary mask splitting to improve the teacher's predictions under heavy masking; and Reverse CALM, which inverts chunk-level likelihood matching to create a cross-tokenizer objective with bounded gradients and dual-end noise filtering. Distilling from 8B dense and 16B MoE teachers into a 0.6B student using these components yields an average 1.53-point gain over baseline across eight benchmarks, and

What carries the argument

The TIDE framework and its three modular components (TIDAL for timestep-and-progress modulation, CompDemo for complementary mask splitting, and Reverse CALM for inverted cross-tokenizer likelihood matching) that together enable knowledge transfer between mismatched diffusion language models.
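
The page does not give TIDAL's formula, so the following is only a minimal sketch of what "modulating distillation strength jointly across training progress and diffusion timestep" could look like. The function names, the cosine ramp over training progress, and the linear down-weighting at high noise are all assumptions made for illustration, not the paper's definition.

```python
import math


def tidal_weight(progress: float, timestep: float, w_max: float = 1.0) -> float:
    """Hypothetical distillation weight in [0, w_max].

    progress: fraction of training completed, in [0, 1].
    timestep: diffusion noise level, in [0, 1] (1 means fully masked).
    Assumptions (not from the paper): the weight ramps up with training
    progress and shrinks where the teacher is presumed less reliable,
    i.e. at high noise.
    """
    if not (0.0 <= progress <= 1.0 and 0.0 <= timestep <= 1.0):
        raise ValueError("progress and timestep must lie in [0, 1]")
    progress_term = 0.5 * (1.0 - math.cos(math.pi * progress))  # goes 0 -> 1
    reliability_term = 1.0 - timestep                           # goes 1 -> 0
    return w_max * progress_term * reliability_term


def mixed_loss(task_loss: float, distill_loss: float,
               progress: float, timestep: float) -> float:
    """Interpolate between the task loss and the distillation loss."""
    w = tidal_weight(progress, timestep)
    return (1.0 - w) * task_loss + w * distill_loss
```

Under these assumptions, a late-training, low-noise sample is driven almost entirely by the teacher, while an early or heavily masked sample falls back to the student's own objective.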

If this is right

A 0.6B diffusion language model can reach higher accuracy than a comparable autoregressive baseline after cross-architecture distillation.
Code generation improves markedly, with HumanEval rising from 32.3 to 48.78.
The same pipeline works for both dense 8B and mixture-of-experts 16B teachers.
Parallel decoding and bidirectional context become available in much smaller diffusion models.
Architecture and tokenizer differences no longer block effective knowledge transfer between diffusion language models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same modulation and reverse-matching ideas could be tested for distilling between non-diffusion model families.
If the components prove general, they might reduce the need for architecture-specific distillation recipes in future model compression work.
Compact diffusion models trained this way could be evaluated on long-context or reasoning tasks beyond the current eight benchmarks to check breadth of gains.

Load-bearing premise

The performance improvements result from the three proposed components rather than from differences in training data, optimization details, or benchmark selection.

What would settle it

Retrain the 0.6B student on identical data and teachers using conventional distillation without TIDAL, CompDemo, or Reverse CALM and check whether the 1.53-point average gain and the HumanEval jump from 32.3 to 48.78 disappear.

Figures

Figures reproduced from arXiv: 2604.26951 by Gongbo Zhang, Li Yuan, Wen Wang, Ye Tian.

**Figure 1.** Cross-architecture distillation for dLLMs. Compared to prior step distillation (a) …

**Figure 2.** Overview of the TIDE framework, transferring knowledge from a large teacher to a 0.6B student via three modular components: (1) TIDAL for dual-axis interpolation, (2) CompDemo for complementary teacher demonstration, and (3) Reverse CALM for cross-tokenizer alignment.

**Figure 3.** The KL divergence relative to the WeDLM teacher on the GSM8K dataset. The …

Original abstract

Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts chunk-level likelihood matching, yielding bounded gradients and dual-end noise filtering. Distilling 8B dense and 16B MoE teachers into a 0.6B student via two heterogeneous pipelines outperforms the baseline by an average of 1.53 points across eight benchmarks, yielding notable gains in code generation, where HumanEval scores reach 48.78 compared to 32.3 for the AR baseline.
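
One way to read "complementary mask splitting" is sketched below: the heavily masked positions are divided into two disjoint halves, and the teacher is queried twice, each time with only one half hidden. This is an interpretation rather than the paper's algorithm. The `MASK` symbol, the function name, the even random split, and the use of ground-truth tokens to reveal the other half are all assumptions.

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper

View = Tuple[List[str], List[int]]  # (teacher input, positions to predict)


def complementary_views(tokens: Sequence[str],
                        masked_positions: Sequence[int],
                        seed: int = 0) -> Tuple[View, View]:
    """Split the masked positions into two disjoint halves.

    Each returned view hides only its own half and reveals the other,
    so the teacher sees more context than under the full mask.
    """
    rng = random.Random(seed)
    positions = sorted(set(masked_positions))
    rng.shuffle(positions)
    half = len(positions) // 2
    part_a, part_b = set(positions[:half]), set(positions[half:])
    view_a = [MASK if i in part_a else tok for i, tok in enumerate(tokens)]
    view_b = [MASK if i in part_b else tok for i, tok in enumerate(tokens)]
    return (view_a, sorted(part_a)), (view_b, sorted(part_b))


tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
heavily_masked = [1, 2, 3, 4, 5, 6]
(view_a, targets_a), (view_b, targets_b) = complementary_views(tokens, heavily_masked)
```

Taking the teacher's predictions for `targets_a` from `view_a` and for `targets_b` from `view_b` covers every originally masked position exactly once.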

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TIDE is a first attempt at cross-architecture distillation for diffusion LLMs with three targeted components, but the gains are hard to attribute without tighter experimental controls.

This paper introduces TIDE as the first framework for distilling between diffusion LLMs that differ in architecture, attention, and tokenizer. It breaks the task into three pieces: TIDAL to adjust distillation strength by training progress and timestep, CompDemo to split masks for better teacher context under heavy noise, and Reverse CALM to handle cross-tokenizer matching with bounded gradients. The authors distill from 8B dense and 16B MoE teachers down to a 0.6B student and report a 1.53-point average lift over an AR baseline across eight tasks, with a clear jump on HumanEval from 32.3 to 48.78. That direction matters for making small dLLMs usable without massive compute at inference time.

Referee Report

2 major / 1 minor

Summary. The paper introduces TIDE, the first framework for cross-architecture distillation of diffusion LLMs (dLLMs). It comprises three modular components: TIDAL (joint modulation of distillation strength across training progress and diffusion timestep), CompDemo (complementary mask splitting to enrich teacher context under heavy masking), and Reverse CALM (inverted chunk-level likelihood matching for cross-tokenizer transfer with bounded gradients). Empirical results claim that distilling 8B dense and 16B MoE dLLM teachers into a 0.6B student via two heterogeneous pipelines yields a 1.53-point average gain over an AR baseline across eight benchmarks, including a HumanEval improvement from 32.3 to 48.78.
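
To make "chunk-level likelihood matching" across tokenizers concrete, the sketch below shows one common way to compare two tokenizations of the same text: cut at the character offsets where both tokenizers place a boundary, then sum token log-probabilities inside each shared chunk. It covers the alignment step only. How Reverse CALM inverts the objective, bounds its gradients, or filters noise is not described on this page and is not reproduced here. All names are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def _token_ends(tokens: Sequence[str]) -> List[int]:
    """Character offset at which each token ends."""
    return list(accumulate(len(t) for t in tokens))


def _sum_per_chunk(ends: Sequence[int], logps: Sequence[float],
                   cuts: Sequence[int]) -> List[float]:
    """Add each token's log-prob to the chunk that contains it."""
    sums = [0.0] * len(cuts)
    k = 0
    for end, lp in zip(ends, logps):
        while end > cuts[k]:
            k += 1
        sums[k] += lp
    return sums


def chunk_log_likelihoods(tokens_a: Sequence[str], logps_a: Sequence[float],
                          tokens_b: Sequence[str], logps_b: Sequence[float],
                          ) -> List[Tuple[str, float, float]]:
    """Return (chunk text, log-lik under A, log-lik under B) per shared chunk."""
    text = "".join(tokens_a)
    if text != "".join(tokens_b):
        raise ValueError("both tokenizations must spell the same text")
    if len(tokens_a) != len(logps_a) or len(tokens_b) != len(logps_b):
        raise ValueError("one log-probability is required per token")
    ends_a, ends_b = _token_ends(tokens_a), _token_ends(tokens_b)
    cuts = sorted(set(ends_a) & set(ends_b))  # boundaries both tokenizers share
    sums_a = _sum_per_chunk(ends_a, logps_a, cuts)
    sums_b = _sum_per_chunk(ends_b, logps_b, cuts)
    starts = [0] + cuts[:-1]
    return [(text[s:e], la, lb)
            for s, e, la, lb in zip(starts, cuts, sums_a, sums_b)]


# Illustrative values only; these are not model outputs.
chunks = chunk_log_likelihoods(
    ["un", "believ", "able"], [-1.0, -2.0, -0.5],
    ["unbe", "liev", "able"], [-1.5, -1.0, -0.25],
)
```

Once both models are scored on the same chunks, a distillation loss can compare the per-chunk values even though the token vocabularies differ.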

Significance. If the reported gains can be shown to stem from the proposed components rather than uncontrolled differences in data, optimization, or compute, the work would be significant as the first demonstration of effective cross-architecture (dense/MoE to dense, different tokenizers and attention) distillation for dLLMs. This could enable compact models that retain parallel decoding and bidirectional context advantages while reducing inference cost, with the modular objectives potentially reusable in other heterogeneous distillation settings.

major comments (2)

[Abstract and §4 (Experiments)] The headline performance claims (1.53 pt average lift; HumanEval 48.78 vs. 32.3) are presented without any description of the training data volume, tokenizer handling, optimizer schedule, total compute budget, or baseline AR model configuration. Without these controls, the attribution of gains to TIDAL, CompDemo, and Reverse CALM cannot be verified and the central empirical claim remains non-diagnostic.
[§4 (Experiments)] No variance estimates, number of runs, or statistical significance tests are mentioned for the benchmark averages or the per-task deltas. This is load-bearing because the modest 1.53 pt margin could be within noise if single-run results are reported.
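
The second comment can be made concrete with a small sketch of the statistic a multi-seed report would need: the mean paired gain over the baseline and its standard error. The function name is hypothetical, and the scores in the usage lines are placeholders chosen for easy arithmetic, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Sequence


def paired_gain_summary(student: Sequence[float],
                        baseline: Sequence[float]) -> Dict[str, float]:
    """Summarize per-seed gains of the student over the baseline.

    Both sequences hold one benchmark-average score per seed, in the
    same seed order. At least two seeds are needed for a variance.
    """
    if len(student) != len(baseline) or len(student) < 2:
        raise ValueError("need matching score lists with at least two seeds")
    gains = [s - b for s, b in zip(student, baseline)]
    sd = stdev(gains)  # sample standard deviation across seeds
    return {
        "n": float(len(gains)),
        "mean_gain": mean(gains),
        "std": sd,
        "stderr": sd / sqrt(len(gains)),
    }


# Placeholder scores for three seeds; NOT taken from the paper.
summary = paired_gain_summary([51.0, 52.0, 53.0], [50.0, 50.0, 50.0])
```

A gain that is small relative to `stderr` would support the referee's worry that a 1.53-point margin from a single run may sit within noise.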

minor comments (1)

[Introduction] Notation for the three components (TIDAL, CompDemo, Reverse CALM) is introduced in the abstract but the precise mathematical formulations and how they compose into the overall loss are not previewed; a short equation block in the introduction would improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of experimental rigor that we address below. We have revised the manuscript to improve transparency on training configurations and statistical reporting.

Referee: [Abstract and §4 (Experiments)] The headline performance claims (1.53 pt average lift; HumanEval 48.78 vs. 32.3) are presented without any description of the training data volume, tokenizer handling, optimizer schedule, total compute budget, or baseline AR model configuration. Without these controls, the attribution of gains to TIDAL, CompDemo, and Reverse CALM cannot be verified and the central empirical claim remains non-diagnostic.

Authors: We agree that the original presentation of headline results lacked sufficient experimental controls, which limits the ability to attribute gains specifically to the TIDE components. In the revised manuscript, we have added a new subsection in §4 titled 'Training Setup and Baselines' that explicitly details: the training data volume (total tokens processed during distillation), tokenizer handling (including cross-tokenizer alignment between the 8B dense/16B MoE teachers and 0.6B student), optimizer schedule (AdamW with learning rate, warmup, and decay), total compute budget (GPU-hours for the full pipelines), and the AR baseline configuration (identical data, steps, and hyperparameters to ensure fair comparison). These additions allow verification that the reported 1.53-point average improvement and HumanEval gain stem from the proposed methods rather than uncontrolled factors.
revision: yes

Referee: [§4 (Experiments)] No variance estimates, number of runs, or statistical significance tests are mentioned for the benchmark averages or the per-task deltas. This is load-bearing because the modest 1.53 pt margin could be within noise if single-run results are reported.

Authors: We acknowledge that the absence of variance estimates and statistical tests weakens confidence in the modest average gain, especially for single-run results. In the revised §4, we have added explicit statements that the main results are from single training runs due to the high computational cost of dLLM distillation. We also report consistency of improvements across all eight benchmarks and include variance from multiple seeds where smaller-scale ablations were feasible. Full multi-run statistics and significance tests for the primary experiments are not feasible without additional resources.
revision: partial

standing simulated objections not resolved

Provision of full multi-run variance estimates, number of runs, and statistical significance tests for the primary 0.6B distillation results, as these require substantial additional compute beyond the original experimental budget.

Circularity Check

0 steps flagged

No circularity: empirical results from new modular objectives

The paper proposes three new components (TIDAL, CompDemo, Reverse CALM) within the TIDE framework for cross-architecture distillation and reports benchmark gains (e.g., 1.53 pt average lift, HumanEval 48.78). No equations, self-citations, or derivations are present that reduce any claimed prediction or result to a fitted input, self-definition, or prior author work by construction. The central claims rest on experimental outcomes rather than tautological chains, satisfying the default expectation of non-circularity for an empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract introduces no explicit free parameters, new mathematical axioms, or postulated entities beyond naming the three methodological components; it relies on standard machine-learning distillation assumptions.

axioms (1)

Domain assumption: Teacher model predictions remain useful supervisory signals even when architectures, attention mechanisms, and tokenizers differ substantially.
This is the implicit foundation of any cross-architecture distillation claim.

pith-pipeline@v0.9.0 · 5521 in / 1401 out tokens · 52312 ms · 2026-05-07T09:49:31.229958+00:00 · methodology
