Scaling Categorical Flow Maps
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-12 04:01 UTC · model grok-4.3
The pith
Categorical flow maps scale to 1.7 billion parameters, enabling high-quality text generation in four inference steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training a 1.7B-parameter base flow model on 2.1T tokens and self-distilling it into a CFM, the authors achieve generation of diverse, high-quality text in as few as 4 inference steps while maintaining near-data-level token entropy. They further provide a likelihood bound for CFMs in the semi-discrete setting and demonstrate that these models can score competitively on standard LM benchmarks, on par with discrete diffusion methods.
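As a purely illustrative picture of the base-model stage, a single flow-matching training step between Gaussian noise and one-hot tokens could look like the sketch below. The linear interpolant, the velocity parameterization, and every name in it are assumptions made for exposition, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, tokens, vocab_size):
    """Hypothetical flow-matching loss: Gaussian noise -> one-hot tokens.

    Assumes a linear interpolant x_t = (1 - t) * z + t * x_1 and a network
    that regresses the velocity (x_1 - z); the paper's parameterization,
    time distribution, and loss weighting may all differ.
    """
    x1 = F.one_hot(tokens, vocab_size).float()   # targets: one-hot encoded data
    z = torch.randn_like(x1)                     # source: Gaussian noise
    t = torch.rand(tokens.shape[0], 1, 1)        # one time per sequence
    xt = (1 - t) * z + t * x1                    # point on the interpolating path
    v_target = x1 - z                            # velocity of the linear path
    v_pred = model(xt, t)                        # network predicts the velocity
    return F.mse_loss(v_pred, v_target)
```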
What carries the argument
Self-distillation of a large flow model into a Categorical Flow Map (CFM) that matches the flow from Gaussian noise to one-hot encoded data, enabling fast discrete sampling, together with a derived likelihood bound for evaluation.
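A minimal sketch of what that self-distillation could look like, assuming the CFM is a two-time flow map f(x, t, s) trained so that one direct jump reproduces two composed jumps taken without gradients by the same model. The midpoint split and squared-error objective are placeholder choices; the paper's exact formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def self_distillation_step(flow_map, x_t, t, s):
    """Hypothetical flow-map self-distillation loss.

    `flow_map(x, t, s)` is assumed to transport the state at time t directly
    to time s. The student's single jump t -> s is regressed onto the
    composition of two smaller jumps t -> m -> s made without gradients.
    """
    m = (t + s) / 2                     # split the interval at its midpoint
    with torch.no_grad():               # teacher path: two composed jumps
        x_m = flow_map(x_t, t, m)
        target = flow_map(x_m, m, s)
    pred = flow_map(x_t, t, s)          # student path: one direct jump
    return F.mse_loss(pred, target)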
If this is right
- CFMs support few-step sampling for language models while retaining sample diversity close to the data (a minimal sampling sketch follows this list).
- The likelihood bound allows direct scoring of CFMs on perplexity and other LM benchmarks.
- Insights on loss weighting and scheduling stabilize training of these models at billion-parameter scale.
- Performance stays in the same range as discrete diffusion methods for both generation and evaluation.
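Under the same two-time flow-map assumption, the few-step sampling in the first point above reduces to a short loop like the one below; the uniform time grid, Gaussian start, and argmax readout are placeholders rather than the paper's schedule.

```python
import torch

@torch.no_grad()
def sample_few_steps(flow_map, batch, seq_len, vocab_size, steps=4):
    """Hypothetical few-step CFM sampler: noise -> tokens in `steps` jumps."""
    x = torch.randn(batch, seq_len, vocab_size)          # start from Gaussian noise
    times = torch.linspace(0.0, 1.0, steps + 1)
    for t, s in zip(times[:-1], times[1:]):
        t_b = torch.full((batch, 1, 1), float(t))
        s_b = torch.full((batch, 1, 1), float(s))
        x = flow_map(x, t_b, s_b)                        # jump from time t to time s
    return x.argmax(dim=-1)                              # nearest one-hot -> token ids
```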
Where Pith is reading between the lines
- If the scaling holds, flow-based methods could enable faster inference than autoregressive decoding in some language applications.
- The semi-discrete likelihood bound might extend to evaluation of other continuous-discrete hybrid models.
- Self-distillation techniques shown here could be tested on scaling flow models for other discrete sequences such as code or biological data.
Load-bearing premise
The self-distillation process combined with the selected loss weighting and time schedule preserves the base model's performance and diversity without introducing new biases at this scale.
What would settle it
Measuring the 4-step CFM on held-out data and finding either token entropy well below training data levels or benchmark scores outside the range achieved by comparable discrete diffusion models would falsify the central claim.
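One way to operationalize that test, assuming "token entropy" means empirical unigram entropy over sampled tokens (the paper may define it per-position or otherwise):

```python
import math
from collections import Counter

def unigram_entropy(token_ids):
    """Empirical unigram entropy, in nats, of a flat list of token ids."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Sketch of the check: sample from the 4-step CFM, then compare
# abs(unigram_entropy(generated_ids) - unigram_entropy(heldout_ids))
# against a pre-registered tolerance.
```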
Original abstract
Continuous diffusion and flow matching models could represent a powerful alternative to autoregressive approaches for language modelling (LM), as they unlock a host of advantages currently reserved for continuous modalities, including accelerated sampling and tilting. Recently, several works have demonstrated the possibility of generating discrete data continuously by a simple flow matching process between a Gaussian and the one-hot encoded data distribution. They have further shown the feasibility of accelerated sampling via Categorical Flow Maps (CFMs), resulting in competitive sample quality in the few-step regime. However, this method had only been evaluated at relatively modest scales ($<1$B), leaving the question of its scalability completely open. In this article, we train a $1.7$B-parameter base flow model on $2.1$T tokens and self-distill it into a CFM that generates diverse, high-quality text in as few as $4$ inference steps while maintaining near-data-level token entropy. Furthermore, we introduce a likelihood bound for CFMs in the semi-discrete setting, and show that they can be used to score the model on standard LM benchmarks, achieving results in the same range as discrete diffusion methods. Finally, we uncover some of the challenges that arise from training these models at scale, and we provide prescriptive insights on loss weighting and time scheduling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to scale Categorical Flow Maps (CFMs) for language modeling by training a 1.7B-parameter base flow model on 2.1T tokens, then self-distilling it into a CFM that produces diverse, high-quality text in as few as 4 inference steps while retaining near-data-level token entropy. It introduces a likelihood bound for CFMs in the semi-discrete setting that enables scoring on standard LM benchmarks at levels comparable to discrete diffusion methods, and derives prescriptive guidance on loss weighting and time scheduling after identifying scale-related training challenges.
Significance. If the central scaling and distillation results hold with supporting evidence, the work would establish that CFMs can reach large LM scales and deliver competitive few-step sampling, providing a non-autoregressive alternative with potential advantages in speed and flexibility. The semi-discrete likelihood bound would be a useful addition for model evaluation, and the scaling insights could inform training of other continuous generative models on discrete data.
major comments (3)
- [Abstract and §5 (Self-Distillation Experiments)] The abstract and experimental claims assert that the distilled CFM maintains 'near-data-level token entropy' at 4 steps, yet no quantitative comparison (e.g., entropy values or histograms) of the base 1.7B flow model versus the distilled CFM versus the data distribution is supplied; without this, the retention of diversity after self-distillation cannot be verified.
- [§4 (Training at Scale) and §5.1 (Loss Weighting)] The paper states that challenges at scale were uncovered and that specific loss weighting and time scheduling were derived to address them, but provides no ablation tables comparing the chosen scheme against alternatives (or against uniform weighting) on metrics such as entropy collapse, mode coverage, or benchmark scores; this leaves the prescriptive guidance unsupported.
- [§3.3 (Likelihood Bound Derivation) and Table 2 (Benchmark Results)] The introduced likelihood bound for the semi-discrete CFM setting is used to report LM benchmark results 'in the same range as discrete diffusion,' but no validation of bound tightness (e.g., comparison to Monte Carlo estimates or exact likelihoods on a held-out set) or sensitivity analysis to the number of steps is given, weakening the reliability of the reported scores.
minor comments (2)
- [§2 (Preliminaries)] Define the semi-discrete setting and the precise form of the flow map more explicitly in the introduction or preliminaries to make the transition from continuous flow matching to categorical data clearer for readers.
- [Figure 4 and Figure 5] Include the base flow model and data entropy as explicit reference lines in all entropy and diversity plots so that the 'near-data-level' claim can be visually assessed.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract and §5 (Self-Distillation Experiments)] The abstract and experimental claims assert that the distilled CFM maintains 'near-data-level token entropy' at 4 steps, yet no quantitative comparison (e.g., entropy values or histograms) of the base 1.7B flow model versus the distilled CFM versus the data distribution is supplied; without this, the retention of diversity after self-distillation cannot be verified.
Authors: We agree that explicit quantitative entropy comparisons would allow direct verification of diversity retention. Although the manuscript supports the claim through benchmark performance and qualitative observations, we will add to Section 5 a table and accompanying histograms reporting token entropy for the base 1.7B model, the distilled CFM at 4 steps (and other step counts), and the empirical data distribution. revision: yes
Referee: [§4 (Training at Scale) and §5.1 (Loss Weighting)] The paper states that challenges at scale were uncovered and that specific loss weighting and time scheduling were derived to address them, but provides no ablation tables comparing the chosen scheme against alternatives (or against uniform weighting) on metrics such as entropy collapse, mode coverage, or benchmark scores; this leaves the prescriptive guidance unsupported.
Authors: We acknowledge that the prescriptive guidance would be more robust with explicit ablations. The weighting and scheduling choices were informed by observed instabilities during our 1.7B-scale training runs. We will add an appendix containing ablation tables that compare our scheme against uniform weighting and selected alternatives, reporting effects on entropy, mode coverage, and benchmark scores. revision: yes
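To make the promised ablation concrete, a grid over weighting and scheduling choices could be organized as in the sketch below; the specific functions are illustrative placeholders, not the weightings or schedules the paper prescribes.

```python
import torch

# Hypothetical candidates for a loss-weighting / time-scheduling ablation.
def uniform_weight(t):
    return torch.ones_like(t)

def midpoint_heavy_weight(t):
    return 1.0 + 4.0 * t * (1.0 - t)       # up-weights intermediate times

def uniform_schedule(u):
    return u                                # sample t uniformly

def late_skewed_schedule(u):
    return u ** 2                           # spend more training time near t = 1

ablation_grid = [
    ("uniform-weight / uniform-time", uniform_weight, uniform_schedule),
    ("midpoint-weight / uniform-time", midpoint_heavy_weight, uniform_schedule),
    ("uniform-weight / late-time", uniform_weight, late_skewed_schedule),
]
# Each configuration would be trained identically and then scored on entropy,
# mode coverage, and benchmark metrics, as the referee requests.
```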
Referee: [§3.3 (Likelihood Bound Derivation) and Table 2 (Benchmark Results)] The introduced likelihood bound for the semi-discrete CFM setting is used to report LM benchmark results 'in the same range as discrete diffusion,' but no validation of bound tightness (e.g., comparison to Monte Carlo estimates or exact likelihoods on a held-out set) or sensitivity analysis to the number of steps is given, weakening the reliability of the reported scores.
Authors: We recognize the value of validating bound tightness. We will expand Section 3.3 and the appendix with Monte Carlo likelihood estimates on a held-out set together with a sensitivity analysis across discretization step counts. These additions will directly support the reliability of the scores in Table 2. revision: yes
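A tightness check of the kind promised could be wired up as below; since the excerpt does not state the bound itself, `log_likelihood_bound` and `mc_log_likelihood` are placeholder callables standing in for the paper's semi-discrete bound and a Monte Carlo estimator over the continuous latents.

```python
import torch

def bound_diagnostics(log_likelihood_bound, mc_log_likelihood, batch, n_tokens, n_samples=64):
    """Hypothetical tightness and sensitivity check for a likelihood lower bound."""
    bound = log_likelihood_bound(batch)              # total log-likelihood bound (nats)
    estimate = mc_log_likelihood(batch, n_samples)   # tighter Monte Carlo estimate
    gap_per_token = (estimate - bound) / n_tokens    # slack of the bound, per token
    ppl_from_bound = torch.exp(-bound / n_tokens)    # perplexity implied by the bound
    return gap_per_token, ppl_from_bound
```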
Circularity Check
No significant circularity; claims rest on empirical training and a new bound
Full rationale
The paper's core claims derive from direct large-scale training of a 1.7B base flow model on 2.1T tokens, followed by self-distillation into a CFM, plus introduction of a semi-discrete likelihood bound used for LM benchmark scoring. These steps are presented as experimental outcomes and a novel theoretical contribution rather than reductions to self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The abstract and skeptic analysis reference external comparisons to discrete diffusion methods and note challenges uncovered at scale, with prescriptive guidance on weighting and scheduling emerging from the experiments themselves. No equations or derivations in the provided text collapse by construction to prior inputs or author-specific uniqueness theorems.
Axiom & Free-Parameter Ledger
free parameters (2)
- loss weighting scheme
- time scheduling
axioms (2)
- domain assumption: Categorical flow maps can be obtained by flow matching between Gaussian and one-hot encoded data
- domain assumption: Self-distillation preserves the quality of the base flow model
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We train a 1.7B-parameter base flow model on 2.1T tokens and self-distill it into a CFM... introduce a likelihood bound for CFMs in the semi-discrete setting"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "PSD... ESD... JSD... adaptive loss weight... error-decoding schedule"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.