pith. machine review for the scientific record.

arxiv: 2605.05629 · v2 · submitted 2026-05-07 · 📊 stat.ML · cs.CL · cs.LG

Recognition: 2 Lean theorem links

Spherical Flows for Sampling Categorical Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:33 UTC · model grok-4.3

classification 📊 stat.ML · cs.CL · cs.LG
keywords spherical flows · von Mises-Fisher · categorical sampling · generative models · continuity equation · predictor-corrector · discrete sequences · sampling on sphere

The pith

Spherical flows using the von Mises-Fisher distribution reduce categorical sequence sampling to solving a scalar ODE in cosine similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors map categorical sequences into a continuous spherical embedding space where the von Mises-Fisher distribution provides a natural noise model with closed-form conditional scores. Exploiting the radial symmetry of this density, they reduce the continuity equation on the sphere to a scalar ordinary differential equation in the cosine similarity, whose unique bounded solution determines the velocity field. On the product space for sequences of length L, both the marginal velocity and marginal score decompose into sums of tangent vectors weighted by the learned posterior over tokens. The posterior itself is trained only with a cross-entropy loss, after which either ODE or predictor-corrector sampling can be performed. Experiments show that the vMF path paired with predictor-corrector sampling improves performance over Euclidean and geodesic alternatives on Sudoku and language modeling tasks.
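For orientation, a minimal sketch of the von Mises-Fisher density and the tangential score it induces, assuming the standard parameterization (mean direction $\mu$, concentration $\kappa$); the paper's exact conditional-score expression is not reconstructed here, only the textbook form:

$$p(x \mid \mu, \kappa) = C_d(\kappa)\, e^{\kappa \mu^\top x}, \qquad x, \mu \in \mathbb S^{d-1},\ \kappa > 0, \qquad C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)},$$

$$\nabla_{\mathbb S^{d-1}} \log p(x \mid \mu, \kappa) = \kappa \big(\mu - (\mu^\top x)\, x\big),$$

i.e. the Euclidean gradient $\kappa\mu$ projected onto the tangent space at $x$, which is what makes the score closed-form.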

Core claim

The continuity equation on the sphere for the von Mises-Fisher density reduces, by radial symmetry, to a scalar ODE whose solution in the cosine similarity gives the velocity. The marginal velocity and marginal score on the product sphere then decompose into posterior-weighted tangent sums that differ only by per-token scalar weights, allowing both ODE and predictor-corrector sampling with a single cross-entropy-trained posterior.
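In symbols, the decomposition the claim describes has the following shape; the notation here ($p_\theta$ for the learned posterior, $e_k$ for the sphere embedding of token $k$, scalar weights $a_t$ and $b_t$) is illustrative rather than the paper's:

$$u_t^{\ell}(x) = \sum_{k=1}^{K} p_\theta\big(y^{\ell}=k \mid x, t\big)\, a_t\big(\langle e_k, x^{\ell}\rangle\big)\, \Pi_{x^{\ell}} e_k, \qquad s_t^{\ell}(x) = \sum_{k=1}^{K} p_\theta\big(y^{\ell}=k \mid x, t\big)\, b_t\big(\langle e_k, x^{\ell}\rangle\big)\, \Pi_{x^{\ell}} e_k,$$

where $\Pi_{x} e = e - \langle e, x\rangle x$ is the tangent-space projection at $x$. Both fields share the same posterior-weighted tangent directions and differ only in the per-token scalars $a_t$ versus $b_t$, which is why one trained posterior serves both the ODE and the predictor-corrector sampler.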

What carries the argument

The von Mises-Fisher distribution on the sphere, which permits reduction of the vector continuity equation to a scalar ODE in cosine similarity due to its radial symmetry.
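Schematically, the reduction runs as follows; the exact drift is the paper's, and only the symmetry mechanism is sketched here. Because the vMF path density depends on $x$ only through $s = \mu^\top x$, one can posit a velocity along the projected mean direction and substitute into the continuity equation:

$$\partial_t p_t + \operatorname{div}_{\mathbb S^{d-1}}(p_t v_t) = 0, \qquad p_t(x) = f_t(s), \qquad v_t(x) = a_t(s)\,(\mu - s\,x).$$

Under this ansatz every term depends on $x$ only through $s$, so the vector PDE collapses to a first-order scalar ODE on $s \in [-1,1]$ for the coefficient $a_t$, and the "unique bounded solution" referenced throughout is the solution of that scalar equation.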

Load-bearing premise

The posterior learned by cross-entropy loss is sufficiently accurate to produce stable posterior-weighted sums for the velocity and score during sampling.
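For concreteness, a minimal sketch of the training step this premise rests on, per the abstract's "the posterior is the only learned object, trained by a cross-entropy loss". The network posterior_net, the unit-norm embedding table, and the interpolation-style noising are hypothetical placeholders standing in for the paper's vMF path, not its code:

```python
import torch
import torch.nn.functional as F

def train_step(posterior_net, embed, tokens, optimizer):
    """One cross-entropy step: noise token embeddings on the sphere,
    then train the posterior to recover the clean tokens.

    tokens: LongTensor (batch, L); embed: (vocab, d) with unit-norm rows.
    """
    t = torch.rand(tokens.shape[0], device=tokens.device)       # per-example times
    x1 = embed[tokens]                                          # clean points on S^{d-1}
    # Stand-in noising (an assumption, NOT the paper's vMF path):
    # blend clean points with Gaussian noise and retract to the sphere.
    noise = torch.randn_like(x1)
    xt = F.normalize(t[:, None, None] * x1 + (1 - t)[:, None, None] * noise, dim=-1)
    logits = posterior_net(xt, t)                               # (batch, L, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), tokens)      # CE over all positions
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Everything downstream (velocity, score, both samplers) is then assembled from this posterior's outputs without further training, which is exactly why its accuracy is load-bearing.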

What would settle it

If the generated samples from the spherical flow are low-quality or the sampling process becomes unstable despite high posterior accuracy on test data, the claim that the reduced ODE yields a valid sampling path would be falsified.

Figures

Figures reproduced from arXiv: 2605.05629 by Gabriele Steidl, Gregor Kornhardt, Jannis Chemseddine.

Figure 1. LM1B: generation perplexity vs. entropy at NFE=128, varying the predictor-to-corrector ratio. Predictor–corrector sampling (stars) outperforms ODE sampling (circles), with a tradeoff between entropy and generation perplexity when using more corrector steps. view at source ↗
Figure 2. Illustration of the von Mises–Fisher density on the sphere. view at source ↗
read the original abstract

We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes spherical flows for generative modeling of discrete sequences by embedding tokens on the sphere S^{d-1} and using von Mises-Fisher distributions to induce a noise process. It claims a closed-form conditional score and derives the conditional velocity by reducing the continuity equation on the sphere to a scalar ODE in cosine similarity, whose unique bounded solution is asserted to determine the velocity. The marginal velocity and marginal score on the product space (S^{d-1})^L are shown to decompose exactly into posterior-weighted sums of tangent vectors (differing only by per-token scalar weights). Only the posterior is learned, via cross-entropy loss, enabling both ODE and predictor-corrector sampling. Experiments compare the vMF path to geodesic and Euclidean baselines and report improvements on Sudoku and language modeling tasks.

Significance. If the derivations hold, the work offers a principled reduction that exploits vMF radial symmetry to obtain an analytically determined velocity field, leaving only the posterior as the learned component. This is a clear strength relative to methods that must learn full velocity or score networks. The exact decomposition into posterior-weighted tangent sums and the availability of both ODE and PC samplers are technically attractive. The approach could meaningfully advance continuous generative modeling of categorical data, particularly if the bounded-solution property and stability under approximate posteriors are confirmed.

major comments (2)
  1. [Derivation of conditional velocity (abstract and §3)] The abstract and the section deriving the conditional velocity claim that exploiting radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity. No self-contained existence/uniqueness argument or verification that the bounded solution remains valid under the vMF concentration and dimension assumptions is supplied; this step is load-bearing for the entire velocity construction.
  2. [Marginal decomposition (§4)] The marginal velocity and score decomposition on (S^{d-1})^L into posterior-weighted tangent sums is presented as exact. Because the posterior is obtained only by cross-entropy training and is necessarily approximate on structured data (Sudoku constraints, token dependencies), the manuscript contains no analysis of how posterior calibration error propagates into velocity bias or ODE stability. This is load-bearing for the sampling claims on non-factorized discrete tasks.
minor comments (2)
  1. The notation for the product manifold (S^{d-1})^L and the per-token indexing of the posterior-weighted sums would benefit from an explicit definition early in the methods section to improve readability.
  2. The experimental section should report the vMF concentration parameter(s) used and include a brief sensitivity study, as this hyper-parameter directly affects the noise process and the reduced ODE (a probe of the kind sketched below would suffice).
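A probe of the kind minor comment 2 asks for is cheap to run. A minimal sketch using scipy.stats.vonmises_fisher (assuming SciPy ≥ 1.11; the dimension and κ values are illustrative), measuring how the concentration parameter controls the spread of cosine similarities around the mean direction:

```python
import numpy as np
from scipy.stats import vonmises_fisher

d = 16                              # embedding dimension (illustrative)
mu = np.zeros(d); mu[0] = 1.0       # mean direction on S^{d-1}
rng = np.random.default_rng(0)

for kappa in (1.0, 10.0, 100.0):
    samples = vonmises_fisher(mu, kappa).rvs(10_000, random_state=rng)
    cos_sim = samples @ mu          # cosine similarity to the mean direction
    print(f"kappa={kappa:6.1f}  mean cos={cos_sim.mean():.3f}  std={cos_sim.std():.3f}")
```

Larger κ concentrates mass near $s = 1$, which in turn changes the shape of the reduced scalar ODE; that is the sensitivity the report asks the authors to quantify.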

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We have carefully considered the major comments regarding the derivation of the conditional velocity and the implications of the marginal decomposition under approximate posteriors. Our responses are provided below, along with indications of how we will revise the manuscript.

read point-by-point responses
  1. Referee: [Derivation of conditional velocity (abstract and §3)] The abstract and the section deriving the conditional velocity claim that exploiting radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity. No self-contained existence/uniqueness argument or verification that the bounded solution remains valid under the vMF concentration and dimension assumptions is supplied; this step is load-bearing for the entire velocity construction.

    Authors: We agree that the manuscript would benefit from a more explicit justification of the existence and uniqueness of the bounded solution to the scalar ODE. The reduction exploits the fact that the vMF density depends only on the cosine similarity, allowing the continuity equation to be projected onto this one-dimensional quantity. In the revised version, we will include a dedicated subsection or appendix that solves the ODE explicitly or invokes the Picard–Lindelöf theorem for local existence, and shows global boundedness using the compact domain $[-1,1]$ of the cosine similarity and the form of the drift term. We will also verify that these assumptions hold for the vMF parameters used in the experiments. revision: yes

  2. Referee: [Marginal decomposition (§4)] The marginal velocity and score decomposition on (S^{d-1})^L into posterior-weighted tangent sums is presented as exact. Because the posterior is obtained only by cross-entropy training and is necessarily approximate on structured data (Sudoku constraints, token dependencies), the manuscript contains no analysis of how posterior calibration error propagates into velocity bias or ODE stability. This is load-bearing for the sampling claims on non-factorized discrete tasks.

    Authors: The marginal decomposition is mathematically exact and holds for the true posterior as well as for any approximation of it, since it derives from the product structure and the definition of the marginal velocity as an expectation. We concur that analyzing the effect of posterior approximation error on the resulting velocity bias and ODE stability is important, particularly for tasks with dependencies such as Sudoku. The current manuscript relies on empirical validation, where the cross-entropy-trained posterior yields effective sampling. In the revision, we will expand the discussion to include a qualitative analysis of error propagation and note that the predictor-corrector sampler provides robustness (sketched generically below). A rigorous quantitative bound is left for future investigation, as it would require additional assumptions on the posterior error. revision: partial
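To make the robustness point concrete, a generic predictor-corrector loop on the product sphere; velocity_fn and score_fn stand for the posterior-weighted sums described above, and the step sizes, schedule, and Langevin scaling are illustrative, not the paper's scheme:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pc_sample(velocity_fn, score_fn, x, n_steps=128, corrector_steps=1, eps=1e-3):
    """Generic predictor-corrector sampling on (S^{d-1})^L.

    x: (batch, L, d) initial points, unit-norm in the last axis.
    velocity_fn(x, t), score_fn(x, t): tangent fields with the same shape as x.
    """
    ts = torch.linspace(0.0, 1.0, n_steps + 1)
    for i in range(n_steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        # Predictor: Euler step along the marginal velocity, retract to the sphere.
        x = F.normalize(x + dt * velocity_fn(x, t), dim=-1)
        # Corrector: Langevin-style moves along the marginal score.
        for _ in range(corrector_steps):
            noise = torch.randn_like(x)
            noise = noise - (noise * x).sum(-1, keepdim=True) * x  # tangent projection
            x = F.normalize(x + eps * score_fn(x, t) + (2 * eps) ** 0.5 * noise, dim=-1)
    return x
```

The corrector is what buys robustness: even if the posterior-weighted velocity is slightly biased, the score-driven moves nudge the state back toward the current marginal before the next predictor step.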

Circularity Check

0 steps flagged

Derivation of velocity from continuity equation and vMF symmetry is mathematically independent of the learned posterior

full rationale

The paper derives the conditional velocity by reducing the continuity equation on S^{d-1} to a scalar ODE in cosine similarity via radial symmetry of the vMF density, then invoking the unique bounded solution of that ODE. This reduction and solution are presented as direct consequences of the PDE and the vMF functional form; they contain no fitted parameters, no data-dependent terms, and no self-referential definitions. The subsequent decomposition of marginal velocity and score on (S^{d-1})^L into posterior-weighted tangent sums follows algebraically once the conditional forms are known. The sole learned object—the posterior—is obtained by a separate cross-entropy loss on discrete tokens and is therefore an external input to the flow equations rather than an output that the derivation presupposes. No self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the chain. The derivation is therefore self-contained against external mathematical benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard differential geometry of the sphere, properties of the von Mises-Fisher distribution, and the existence of a unique bounded solution to the reduced scalar ODE; no new entities are postulated and the only fitted component is the posterior network.

axioms (2)
  • domain assumption The von Mises-Fisher distribution on the sphere admits a closed-form conditional score.
    Invoked to obtain the noise process and score; stated in the abstract as given.
  • domain assumption Radial symmetry of the vMF density allows reduction of the continuity equation to a scalar ODE in cosine similarity.
    Core technical step; uniqueness and boundedness of the solution are asserted without further proof in the abstract.

pith-pipeline@v0.9.0 · 5484 in / 1468 out tokens · 35787 ms · 2026-05-12T01:33:40.029370+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages
