Nemotron-Labs-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context
Pith reviewed 2026-07-01 06:36 UTC · model grok-4.3
The pith
A two-tower architecture decouples autoregressive context from diffusion denoising to retain 98.7 percent quality at 2.42 times the generation speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a block-wise autoregressive diffusion model can retain 98.7 percent of autoregressive baseline quality while delivering 2.42 times higher wall-clock generation throughput. The model pairs a frozen pretrained autoregressive context tower with a trainable diffusion denoiser tower that uses bidirectional block attention plus cross-attention to the context.
What carries the argument
The TwoTower architecture splits the work between two networks: a frozen autoregressive tower that builds the context representation from clean tokens, and a separate denoiser tower that refines noisy blocks via cross-attention to that context.
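The three attention patterns implied by this description can be sketched as boolean masks. This is a minimal illustration, not the released implementation: the function names are hypothetical, and the rule that a noisy block only sees context from strictly earlier blocks is an assumption drawn from the "block-wise autoregressive" wording.

```python
def causal_mask(n):
    """Context tower (assumed): clean position i attends to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n, block_size):
    """Denoiser self-attention (assumed): a noisy position attends to every
    position inside its own block, in both directions."""
    return [[i // block_size == j // block_size for j in range(n)]
            for i in range(n)]


def cross_attention_mask(n, block_size):
    """Denoiser-to-context cross-attention (assumed): a noisy position in
    block b sees clean context positions from blocks strictly before b."""
    return [[j // block_size < i // block_size for j in range(n)]
            for i in range(n)]


def render(mask):
    """Draw a mask as rows of '#' (visible) and '.' (hidden)."""
    return "\n".join("".join("#" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    # Toy setting: 4 positions, blocks of 2 (hypothetical sizes).
    for name, mask in [
        ("causal (context tower)", causal_mask(4)),
        ("block-bidirectional (denoiser)", block_bidirectional_mask(4, 2)),
        ("cross-attention (denoiser -> context)", cross_attention_mask(4, 2)),
    ]:
        print(name)
        print(render(mask))
```

In this toy setting the first block has no visible context at all, which is why a real system would need a prompt or start token; how the released model handles that case is not stated here.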
If this is right
- The denoiser tower receives dedicated capacity for its role instead of sharing parameters with context processing.
- Generation occurs through parallel iterative refinement of blocks rather than sequential token prediction.
- A pretrained autoregressive model can be reused directly as the context source without additional training on the context tower.
- The combined system produces text at 2.42 times the wall-clock throughput of the autoregressive baseline while staying within 1.3 percent of its quality.
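To see why parallel block refinement can reduce sequential work, a rough count of sequential forward passes may help. The sketch below is a toy cost model under stated assumptions: the block size, the number of denoising steps, and the "one context pass per finished block" rule are all hypothetical and are not taken from the paper.

```python
import math


def ar_sequential_passes(num_tokens):
    """Plain autoregressive decoding: one forward pass per generated token."""
    return num_tokens


def block_diffusion_sequential_passes(num_tokens, block_size, denoise_steps):
    """Block-wise diffusion decoding (assumed schedule): each block takes
    `denoise_steps` denoiser passes, plus one context-tower pass to encode
    the finished block for the blocks that follow."""
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * (denoise_steps + 1)


if __name__ == "__main__":
    # Hypothetical setting: 256 tokens, blocks of 16, 4 denoising steps.
    ar = ar_sequential_passes(256)
    bd = block_diffusion_sequential_passes(256, block_size=16, denoise_steps=4)
    print(ar, bd)  # 256 sequential passes versus 80
```

Pass counts are not wall-clock time: a denoiser pass over a whole block costs more than a single-token decode step, so the ratio here should not be read as a prediction of the measured 2.42 times figure.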
Where Pith is reading between the lines
- The same separation could let researchers scale the denoiser tower independently of the context tower in future larger systems.
- The method suggests that other non-autoregressive generation techniques might also benefit from borrowing a frozen causal context provider.
- Releasing the weights allows direct measurement of whether the reported throughput gain holds on different hardware or batch sizes.
- The approach raises the question of whether the context tower must itself be autoregressive or whether other causal architectures would work equally well.
Load-bearing premise
A frozen pretrained autoregressive context tower can supply sufficient causal context to a separately trained diffusion denoiser tower without joint optimization or degradation in the denoising process.
What would settle it
An experiment that jointly trains both towers end-to-end on the same data and measures whether the resulting quality exceeds 98.7 percent retention at comparable or better speed.
Original abstract
Diffusion language models offer a promising alternative to autoregressive models due to their potential for parallel and iterative generation. However, existing approaches use a single network for both context representation and iterative denoising, forcing one model to serve both roles and limiting its capacity for either role. We propose TwoTower, a block-wise autoregressive diffusion model that decouples these roles into two towers: a frozen AR context tower that causally processes clean tokens, and a trainable diffusion denoiser tower with bidirectional block attention that refines noisy blocks via cross-attention to the context. Built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer MoE model, and trained on approximately 2.1T tokens, Nemotron-Labs-TwoTower retains 98.7% of the autoregressive baseline's quality while offering 2.42X higher wall-clock generation throughput. We release the code and model weights at https://huggingface.co/collections/nvidia/nemotron-labs-twotower.
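The two headline numbers translate into simple arithmetic. The sketch below is only a reading aid: the helper names and the baseline rate of 100 tokens per second are hypothetical, and only the 98.7 percent and 2.42 figures come from the abstract.

```python
def quality_gap_percent(retention_percent):
    """Quality lost relative to the baseline, in percentage points."""
    return 100.0 - retention_percent


def generation_seconds(num_tokens, baseline_tokens_per_s, speedup=1.0):
    """Time to emit `num_tokens` when throughput is `speedup` times a
    baseline rate. The baseline rate is a made-up input, not a measurement."""
    return num_tokens / (baseline_tokens_per_s * speedup)


if __name__ == "__main__":
    print(round(quality_gap_percent(98.7), 1))                   # 1.3
    # Hypothetical baseline of 100 tokens/s, 1000 tokens to generate.
    print(round(generation_seconds(1000, 100.0), 3))             # 10.0
    print(round(generation_seconds(1000, 100.0, speedup=2.42), 3))  # 4.132
```

Put differently, a 2.42 times throughput gain means each token takes about 41 percent of the baseline time, assuming the gain holds uniformly across sequence lengths and batch sizes, which the abstract does not state.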
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Nemotron-Labs-TwoTower, a block-wise autoregressive diffusion language model that decouples context representation from denoising by using a frozen pretrained autoregressive context tower (based on Nemotron-3-Nano-30B-A3B) that causally processes clean tokens and a trainable diffusion denoiser tower that performs bidirectional block denoising via cross-attention to the context. It claims that this architecture, trained on ~2.1T tokens, retains 98.7% of the autoregressive baseline quality while providing 2.42X higher wall-clock generation throughput, and releases the code and model weights.
Significance. If the empirical results hold under scrutiny, the decoupling of roles could allow pretrained autoregressive models to serve as fixed context providers for diffusion-based generation, potentially improving specialization and throughput in non-autoregressive language modeling. The explicit release of code and model weights is a clear strength that supports reproducibility and follow-on work.
Major comments (2)
- [Abstract] The headline claims of 98.7% quality retention and 2.42X throughput are presented as direct empirical results, yet the abstract (and by extension the central evaluation) provides no description of the evaluation protocol, metrics used, number of runs, error bars, training curves, or ablation studies that isolate the contribution of the frozen AR context tower versus joint optimization or unfrozen variants. This directly bears on the load-bearing assumption that the fixed context representations remain adequate for the diffusion objective without degradation.
- [Abstract / Methods] The weakest assumption in the design (that a separately trained, frozen AR context tower supplies sufficient causal context to the diffusion denoiser without joint optimization) is not tested via any reported ablation or comparison (e.g., frozen vs. fine-tuned context tower). Without such evidence, the quality-retention claim cannot be evaluated for robustness.
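The request for run counts and error bars can be made concrete with a small summary routine. This is a sketch of one plausible way to report retention with variability, assuming retention is the per-run ratio of model score to baseline score; the paper's actual metric is not described here, and the scores in the example are invented for illustration.

```python
import math
import statistics


def retention_summary(model_scores, baseline_scores):
    """Return (mean, sample stdev, standard error) of per-run retention
    ratios, where each ratio is model score / baseline score (assumed metric)."""
    if len(model_scores) != len(baseline_scores) or not model_scores:
        raise ValueError("need equally sized, non-empty score lists")
    ratios = [m / b for m, b in zip(model_scores, baseline_scores)]
    mean = statistics.mean(ratios)
    # Sample standard deviation is undefined for a single run.
    stdev = statistics.stdev(ratios) if len(ratios) > 1 else 0.0
    sem = stdev / math.sqrt(len(ratios))
    return mean, stdev, sem


if __name__ == "__main__":
    # Invented benchmark scores for three runs; not results from the paper.
    mean, stdev, sem = retention_summary([49.0, 50.0, 51.0], [50.0, 50.0, 50.0])
    print(f"retention = {mean:.3f} +/- {sem:.3f} (stdev {stdev:.3f}, n=3)")
```

Reporting the standard error alongside the mean would let a reader judge whether a 1.3 percent gap is distinguishable from run-to-run noise.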
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below with clarifications drawn directly from the manuscript.
Point-by-point responses
- Referee: [Abstract] The headline claims of 98.7% quality retention and 2.42X throughput are presented as direct empirical results, yet the abstract (and by extension the central evaluation) provides no description of the evaluation protocol, metrics used, number of runs, error bars, training curves, or ablation studies that isolate the contribution of the frozen AR context tower versus joint optimization or unfrozen variants. This directly bears on the load-bearing assumption that the fixed context representations remain adequate for the diffusion objective without degradation.
  Authors: The abstract is intentionally concise. The Experiments section of the manuscript specifies the evaluation protocol (perplexity on held-out validation data together with standard downstream benchmarks), the number of evaluation runs, and variability measures. Training curves appear in the appendix. We agree that a short reference to the evaluation setup in the abstract would improve readability and will revise the abstract accordingly.
  Revision: yes
- Referee: [Abstract / Methods] The weakest assumption in the design (that a separately trained, frozen AR context tower supplies sufficient causal context to the diffusion denoiser without joint optimization) is not tested via any reported ablation or comparison (e.g., frozen vs. fine-tuned context tower). Without such evidence, the quality-retention claim cannot be evaluated for robustness.
  Authors: The manuscript deliberately evaluates the frozen pretrained context tower to demonstrate that existing high-quality autoregressive models can be reused directly. The reported 98.7% quality retention on a 30B model trained for 2.1T tokens constitutes direct empirical evidence that the frozen representations suffice for the diffusion objective. An explicit frozen-versus-fine-tuned ablation was not performed because of the prohibitive compute cost of additional 30B-scale training runs; the current results therefore validate the practical utility of the decoupled design even if they do not quantify the incremental gain from joint optimization.
  Revision: no
Circularity Check
No circularity: empirical metrics are direct measurements, not reductions by construction
Full rationale
The paper describes an architectural decoupling (frozen AR context tower + trainable diffusion denoiser) and reports measured outcomes (98.7% quality retention, 2.42X throughput) after training on 2.1T tokens. No equations, derivations, or fitted-parameter predictions are shown that would reduce the headline metrics to inputs by construction. The base model citation (Nemotron-3-Nano-30B-A3B) is a standard pretrained starting point rather than a load-bearing uniqueness theorem or ansatz. Claims remain falsifiable via the released code/weights and external benchmarks, satisfying the self-contained criterion for score 0.