
arxiv: 2506.01928 · v4 · pith:7QYCYHRUnew · submitted 2025-06-02 · 💻 cs.CL · cs.LG

Esoteric Language Models: A Family of Any-Order Diffusion LLMs

classification 💻 cs.CL cs.LG
keywords models · mdms · eso-lms · family · generation · any-order · attention · autoregressive

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Within this family, Masked Diffusion Models (MDMs) currently perform best but still underperform AR models in perplexity and lack key inference-time efficiency features, most notably KV caching. We introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, smoothly interpolating between their perplexities while overcoming their respective limitations. Unlike prior work, which uses transformers with bidirectional attention as MDM denoisers, we exploit the connection between MDMs and Any-Order autoregressive models and adopt causal attention. This design lets us compute the exact likelihood of MDMs for the first time and, crucially, enables us to introduce KV caching for MDMs while preserving parallel generation for the first time, significantly improving inference efficiency. Combined with an optimized sampling schedule, Eso-LMs establish a new state of the art on the speed-quality Pareto frontier for unconditional generation. We provide the code, model checkpoints, and the video tutorial on the project page: https://s-sahoo.com/Eso-LMs.
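The abstract's link between causal attention and KV caching can be illustrated with a toy example. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a single attention head, no positional encodings, and a hypothetical random generation order (`order`), and all function names are illustrative. It checks that, under a causal mask, processing tokens one at a time while reusing cached keys and values gives the same outputs as recomputing attention over the whole sequence.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    """Full recomputation: row i attends only to rows 0..i."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # strictly upper triangle
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def cached_attention(x, Wq, Wk, Wv):
    """Incremental pass: one token per step, reusing cached keys and values."""
    k_cache, v_cache, outs = [], [], []
    for x_t in x:
        k_cache.append(x_t @ Wk)  # computed once per token, never recomputed
        v_cache.append(x_t @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)
        weights = softmax((x_t @ Wq) @ K.T / np.sqrt(K.shape[-1]))
        outs.append(weights @ V)
    return np.stack(outs)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                 # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

order = rng.permutation(5)                  # hypothetical any-order generation order
x_perm = x[order]                           # tokens arranged in the order they are generated

full = causal_attention(x_perm, Wq, Wk, Wv)
inc = cached_attention(x_perm, Wq, Wk, Wv)
print(np.allclose(full, inc))
```

With bidirectional attention the same check would fail, because every newly revealed token changes the outputs of earlier positions and the cached entries go stale. A real any-order model would also need to tell each token its original position, which this sketch omits.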



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

    cs.LG · 2026-03 · unverdicted · novelty 8.0

    Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

  2. Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models

    cs.LG · 2026-02 · unverdicted · novelty 7.0

    Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.

  3. Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

    cs.CL · 2026-05 · conditional · novelty 6.0

    RePlaid achieves a 20x compute gap to autoregressive models, new SOTA PPL of 22.1 among continuous DLMs on OpenWebText, and competitive scaling laws by aligning architecture with modern discrete DLMs.

  4. DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.

  5. Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

    cs.CL · 2025-12 · unverdicted · novelty 6.0

    Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.