Recognition: 2 theorem links
Spherical Flows for Sampling Categorical Data
Pith reviewed 2026-05-12 01:33 UTC · model grok-4.3
The pith
Spherical flows using the von Mises-Fisher distribution reduce categorical sequence sampling to solving a scalar ODE in cosine similarity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The continuity equation on the sphere for the von Mises-Fisher density reduces, by radial symmetry, to a scalar ODE whose solution in the cosine similarity gives the velocity. The marginal velocity and marginal score on the product sphere then decompose into posterior-weighted tangent sums that differ only by per-token scalar weights, allowing both ODE and predictor-corrector sampling with a single cross-entropy-trained posterior.
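The predictor-corrector sampling the claim refers to can be sketched as a minimal loop on the sphere: an Euler predictor along the velocity field followed by a tangential Langevin corrector driven by the score. This is an illustrative generic PC step, not the paper's exact scheme; `velocity` and `score` are assumed callables returning tangent vectors.

```python
import numpy as np

def normalize(x):
    """Retract a point back onto the unit sphere."""
    return x / np.linalg.norm(x)

def pc_step(x, velocity, score, dt, eps, rng):
    """One generic predictor-corrector step on S^{d-1}.

    Illustrative only: Euler predictor along the probability-flow
    velocity, then one Langevin corrector using the score. The
    paper's actual integrator may differ.
    """
    # predictor: Euler step along the velocity, retracted to the sphere
    x = normalize(x + dt * velocity(x))
    # corrector: tangential Langevin step driven by the score
    noise = rng.normal(size=x.shape)
    noise -= (x @ noise) * x              # project noise onto the tangent space at x
    x = normalize(x + eps * score(x) + np.sqrt(2.0 * eps) * noise)
    return x
```

Because both steps end with a retraction, the iterate stays on the sphere regardless of step size, which is the property the corrector relies on.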
What carries the argument
The von Mises-Fisher distribution on the sphere, whose radial symmetry permits reduction of the vector continuity equation to a scalar ODE in the cosine similarity.
Load-bearing premise
The posterior learned by cross-entropy loss is sufficiently accurate to produce stable posterior-weighted sums for the velocity and score during sampling.
What would settle it
If the generated samples from the spherical flow are low-quality or the sampling process becomes unstable despite high posterior accuracy on test data, the claim that the reduced ODE yields a valid sampling path would be falsified.
Original abstract
We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.
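The closed-form conditional score mentioned in the abstract follows from the vMF density $p(x)\propto\exp(\kappa\,\mu^\top x)$ on $\mathbb S^{d-1}$: the Euclidean gradient of the log-density is $\kappa\mu$, and projecting it onto the tangent space at $x$ gives the Riemannian score $\kappa(\mu-(\mu^\top x)x)$. A minimal sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def vmf_log_density_unnorm(x, mu, kappa):
    """Unnormalized vMF log-density on S^{d-1}: kappa * <mu, x>."""
    return kappa * (x @ mu)

def vmf_tangent_score(x, mu, kappa):
    """Riemannian score of the vMF density at x: the Euclidean
    gradient kappa*mu projected onto the tangent space at x."""
    return kappa * (mu - (x @ mu) * x)

# toy check in d = 3
rng = np.random.default_rng(0)
mu = rng.normal(size=3); mu /= np.linalg.norm(mu)
x = rng.normal(size=3); x /= np.linalg.norm(x)
s = vmf_tangent_score(x, mu, kappa=5.0)
assert abs(s @ x) < 1e-10  # the score is tangent to the sphere at x
```

Note the score vanishes exactly at the mode $x=\mu$, consistent with the density being maximal there.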
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes spherical flows for generative modeling of discrete sequences by embedding tokens on the sphere S^{d-1} and using von Mises-Fisher distributions to induce a noise process. It claims a closed-form conditional score and derives the conditional velocity by reducing the continuity equation on the sphere to a scalar ODE in cosine similarity, whose unique bounded solution is asserted to determine the velocity. The marginal velocity and marginal score on the product space (S^{d-1})^L are shown to decompose exactly into posterior-weighted sums of tangent vectors (differing only by per-token scalar weights). Only the posterior is learned, via cross-entropy loss, enabling both ODE and predictor-corrector sampling. Experiments compare the vMF path to geodesic and Euclidean baselines and report improvements on Sudoku and language modeling tasks.
Significance. If the derivations hold, the work offers a principled reduction that exploits vMF radial symmetry to obtain an analytically determined velocity field, leaving only the posterior as the learned component. This is a clear strength relative to methods that must learn full velocity or score networks. The exact decomposition into posterior-weighted tangent sums and the availability of both ODE and PC samplers are technically attractive. The approach could meaningfully advance continuous generative modeling of categorical data, particularly if the bounded-solution property and stability under approximate posteriors are confirmed.
major comments (2)
- [Derivation of conditional velocity (abstract and §3)] The abstract and the section deriving the conditional velocity claim that exploiting radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity. No self-contained existence/uniqueness argument or verification that the bounded solution remains valid under the vMF concentration and dimension assumptions is supplied; this step is load-bearing for the entire velocity construction.
- [Marginal decomposition (§4)] The marginal velocity and score decomposition on (S^{d-1})^L into posterior-weighted tangent sums is presented as exact. Because the posterior is obtained only by cross-entropy training and is necessarily approximate on structured data (Sudoku constraints, token dependencies), the manuscript contains no analysis of how posterior calibration error propagates into velocity bias or ODE stability. This is load-bearing for the sampling claims on non-factorized discrete tasks.
minor comments (2)
- The notation for the product manifold (S^{d-1})^L and the per-token indexing of the posterior-weighted sums would benefit from an explicit definition early in the methods section to improve readability.
- The experimental section should report the vMF concentration parameter(s) used and include a brief sensitivity study, as this hyper-parameter directly affects the noise process and the reduced ODE.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We have carefully considered the major comments regarding the derivation of the conditional velocity and the implications of the marginal decomposition under approximate posteriors. Our responses are provided below, along with indications of how we will revise the manuscript.
Point-by-point responses
-
Referee: [Derivation of conditional velocity (abstract and §3)] The abstract and the section deriving the conditional velocity claim that exploiting radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity. No self-contained existence/uniqueness argument or verification that the bounded solution remains valid under the vMF concentration and dimension assumptions is supplied; this step is load-bearing for the entire velocity construction.
Authors: We agree that the manuscript would benefit from a more explicit justification of the existence and uniqueness of the bounded solution to the scalar ODE. The reduction exploits the fact that the vMF density depends only on the cosine similarity, allowing the continuity equation to be projected onto this one-dimensional quantity. In the revised version, we will include a dedicated subsection or appendix that either solves the ODE explicitly or invokes the Picard-Lindelöf theorem for local existence, and shows global boundedness using the compactness of the cosine-similarity domain [-1, 1] and the form of the drift term. We will also verify that these assumptions hold for the vMF parameters used in the experiments. revision: yes
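The existence argument the authors sketch can be summarized as follows; the drift $f$ is a stand-in symbol, since the paper's exact reduced equation is not reproduced on this page.

```latex
% Generic form of the reduced equation, with s(t) the cosine
% similarity and f the drift induced by the vMF density
% (stand-in notation, not the paper's exact expression):
\dot{s}(t) = f\bigl(s(t), t\bigr), \qquad s(t) \in [-1, 1].
% If f is continuous in t and Lipschitz in s, Picard-Lindelof
% gives a unique local solution; since s is a cosine similarity
% it is confined to the compact interval [-1, 1], so the solution
% is bounded, and boundedness of f on [-1, 1] x [0, T] allows
% extension to the whole time horizon.
```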
-
Referee: [Marginal decomposition (§4)] The marginal velocity and score decomposition on (S^{d-1})^L into posterior-weighted tangent sums is presented as exact. Because the posterior is obtained only by cross-entropy training and is necessarily approximate on structured data (Sudoku constraints, token dependencies), the manuscript contains no analysis of how posterior calibration error propagates into velocity bias or ODE stability. This is load-bearing for the sampling claims on non-factorized discrete tasks.
Authors: The marginal decomposition is mathematically exact and holds for the true posterior as well as for any approximation of it, since it derives from the product structure and the definition of the marginal velocity as an expectation. We concur that analyzing the effect of posterior approximation error on the resulting velocity bias and ODE stability is important, particularly for tasks with dependencies such as Sudoku. The current manuscript relies on empirical validation, where the cross-entropy-trained posterior yields effective sampling. In the revision, we will expand the discussion to include a qualitative analysis of error propagation and note that the predictor-corrector sampler provides some robustness. A rigorous quantitative bound is left for future work, as it would require additional assumptions on the posterior error. revision: partial
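The exactness claim in this response rests on the shared algebraic form of the two marginal quantities: both are posterior-weighted sums of tangent projections of the token embeddings, differing only in the scalar weights. A minimal sketch of that shared form, with placeholder weights (the paper's actual per-token weights depend on time and concentration and are not reproduced here):

```python
import numpy as np

def tangent_project(x, v):
    """Project v onto the tangent space of the unit sphere at x."""
    return v - (x @ v) * x

def posterior_weighted_sum(x, posterior, embeddings, weights):
    """Sum_k posterior[k] * weights[k] * P_x(embeddings[k]).

    Per the paper's decomposition, the marginal velocity and the
    marginal score share this form and differ only in the scalar
    weights; here the weights are placeholders.
    """
    tangents = np.array([tangent_project(x, e) for e in embeddings])
    return (posterior * weights) @ tangents

# toy example: vocabulary of 4 tokens embedded on S^2
rng = np.random.default_rng(1)
E = rng.normal(size=(4, 3))
E /= np.linalg.norm(E, axis=1, keepdims=True)
x = rng.normal(size=3); x /= np.linalg.norm(x)
post = np.array([0.1, 0.2, 0.3, 0.4])   # stand-in for the learned posterior p(k | x, t)
a = np.ones(4)                          # placeholder per-token scalar weights
v = posterior_weighted_sum(x, post, E, a)
assert abs(v @ x) < 1e-10  # the weighted sum stays tangent at x
```

Because every summand is already tangent at `x`, any reweighting (exact or approximate posterior) still yields a tangent vector; approximation error perturbs the direction and magnitude, not the tangency, which is why the decomposition itself survives an imperfect posterior.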
Circularity Check
Derivation of velocity from continuity equation and vMF symmetry is mathematically independent of the learned posterior
full rationale
The paper derives the conditional velocity by reducing the continuity equation on S^{d-1} to a scalar ODE in cosine similarity via radial symmetry of the vMF density, then invoking the unique bounded solution of that ODE. This reduction and solution are presented as direct consequences of the PDE and the vMF functional form; they contain no fitted parameters, no data-dependent terms, and no self-referential definitions. The subsequent decomposition of marginal velocity and score on (S^{d-1})^L into posterior-weighted tangent sums follows algebraically once the conditional forms are known. The sole learned object—the posterior—is obtained by a separate cross-entropy loss on discrete tokens and is therefore an external input to the flow equations rather than an output that the derivation presupposes. No self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the chain. The derivation is therefore self-contained against external mathematical benchmarks and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The von Mises-Fisher distribution on the sphere admits a closed-form conditional score.
- domain assumption Radial symmetry of the vMF density allows reduction of the continuity equation to a scalar ODE in cosine similarity.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Exploiting the radial symmetry of the vMF density we reduce the continuity equation on S^{d-1} to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity.
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
The marginal velocity and marginal score on (S^{d-1})^L both decompose into posterior-weighted tangent sums.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders (2026)
- [2] Generative Modeling of Discrete Joint Distributions by E-Geodesic Flow Matching on Assignment Manifolds (2024)
- [3] Fisher Flow Matching for Generative Modeling over Discrete Data (2024)
- [4] Structured Denoising Diffusion Models in Discrete State-Spaces (2023)
- [5] Hannes Stark, Bowen Jing, Chenyu Wang, Gabriele Corso, Bonnie Berger, Regina Barzilay, Tommi Jaakkola. Dirichlet Flow Matching with Applications to… (arXiv:2402.05841)
- [6] Large Language Models: A Mathematical Formulation (2026)
- [7] Luigi Ambrosio and Dario Trevisan. Well-posedness of… (Analysis and PDE)
- [8] Cedric Villani
- [9] Generator Matching: Generative modeling with arbitrary Markov processes (2025)
- [10]
- [11] Flow Matching:… (in Variational and Information Flows in Machine Learning and Optimal Transport)
- [12] Stochastic Interpolants: A Unifying Framework for Flows and Diffusions (2025)
- [13] A Continuous Time Framework for Discrete Denoising Models (2022)
- [14] Telegrapher's Generative Model via Kac Flows (2026)
- [15] Language Models are Unsupervised Multitask Learners
- [16]
- [17] Neural Sampling from Boltzmann Densities: Fisher-Rao Curves in the Wasserstein Geometry (2024)
- [18] Adapting Noise to Data: Generative Flows from 1D Processes (2026)
- [19]
- [20]
- [21]
- [22]
- [23] Flow Map Language Models: One-step Language Modeling via Continuous Denoising (2026)
- [24] Diffusion-LM Improves Controllable Text Generation (2022)
- [25]
- [26] nGPT: Normalized Transformer with Representation Learning on the Hypersphere (2025)
- [27] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (2024)
- [28]
- [29]
- [30] Learning Transferable Visual Models From Natural Language Supervision (2021)
- [31]
- [32]
- [33] Simple and Effective Masked Diffusion Language Models (2024)
- [34]
- [35] Score-Based Generative Modeling through Stochastic Differential Equations (2021)
- [36] Nicolas Boumal
- [37]
- [38]
- [39] Using the Output Embedding to Improve Language Models (2017)
- [40] Trajectory Generator Matching for Time Series (2025)
- [41] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (2022)
- [42] Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design (2024)
- [43] Simplified and Generalized Masked Diffusion for Discrete Data (2025)
- [44] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data (2026)
- [45]
- [46] Sampling via Stochastic Interpolants by Langevin-based Velocity and Initialization Estimation in Flow ODEs (2026)
- [47] Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning (2025)
- [48] Edit Flows: Flow Matching with Edit Operations (2025)
- [49] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2023)
- [50] Remasking Discrete Diffusion Models with Inference-Time Scaling (2026)
- [51] Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions (2025)
- [52] Teaching Transformers to Solve Combinatorial Problems through Efficient Trial & Error (2026)
- [53] A Reparameterized Discrete Diffusion Model for Text Generation (2024)
- [54]
- [55]
- [56]
- [57] Generalized Interpolating Discrete Diffusion (2025)
- [58]
- [59] Stream of Search (SoS): Learning to Search in Language (2024)
- [60]
- [61] Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models (2024)
- [62]
- [63] Gregor Kornhardt, Jannis Chemseddine, Christian Wald, Gabriele Steidl. SELF-AWARE (2026)
- [64] Continuous Diffusion Model for Language Modeling (2025)
- [65] Categorical Flow Matching on Statistical Manifolds (2025)
- [66] CANDI: Hybrid Discrete-Continuous Diffusion Models (2025)
- [67]
- [68] Generative Assignment Flows for Representing and Learning Joint Distributions of Discrete Data (2025)
- [69] Less is More: Recursive Reasoning with Tiny Networks (2025)
- [70]
- [71] Sun, Q., Cui, Y., Zhang, X., Zhang, F., Yu, Q., Wang, Y., Rao, Y., Liu, J., Huang, T., Wang, X. Generative multimodal models are in-context learners
- [72] Shi, J., Han, K., Wang, Z., Doucet, A., Titsias, M. Simplified and generalized masked diffusion for discrete data
- [73] LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling (2026)
- [74] Scaling Behavior of Discrete Diffusion Language Models (2026)
- [75] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)
- [76] Simple Guidance Mechanisms for Discrete Diffusion Models (2025)
- [77] Generalized Discrete Diffusion from Snapshots (2026)
- [78] Ciprian Chelba et al. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling (2013, arXiv:1312.3005)
- [79] Elucidating the Design Space of Diffusion-Based Generative Models (2022)