pith. machine review for the scientific record.

arxiv: 2605.05629 · v2 · submitted 2026-05-07 · 📊 stat.ML · cs.CL · cs.LG

Recognition: 2 Lean theorem links

Spherical Flows for Sampling Categorical Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:33 UTC · model grok-4.3

classification 📊 stat.ML · cs.CL · cs.LG
keywords spherical flows · von Mises-Fisher · categorical sampling · generative models · continuity equation · predictor-corrector · discrete sequences · sampling on sphere

The pith

Spherical flows using the von Mises-Fisher distribution reduce categorical sequence sampling to solving a scalar ODE in cosine similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors map categorical sequences into a continuous spherical embedding space where the von Mises-Fisher distribution provides a natural noise model with closed-form conditional scores. Exploiting the radial symmetry of this density, they reduce the continuity equation on the sphere to a scalar ordinary differential equation in the cosine similarity, whose unique bounded solution determines the velocity field. On the product space for sequences of length L, both the marginal velocity and marginal score decompose into sums of tangent vectors weighted by the learned posterior over tokens. The posterior itself is trained only with a cross-entropy loss, after which either ODE or predictor-corrector sampling can be performed. Experiments show that the vMF path paired with predictor-corrector sampling improves performance over Euclidean and geodesic alternatives on Sudoku and language modeling tasks.
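For orientation, a minimal sketch of the von Mises-Fisher density and the tangential score it induces, assuming the standard parameterization (mean direction $\mu$, concentration $\kappa$); the paper's exact conditional-score expression is not reconstructed here, only the textbook form:

$$p(x \mid \mu, \kappa) = C_d(\kappa)\, e^{\kappa \mu^\top x}, \qquad x, \mu \in \mathbb S^{d-1},\ \kappa > 0, \qquad C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)},$$

$$\nabla_{\mathbb S^{d-1}} \log p(x \mid \mu, \kappa) = \kappa \big(\mu - (\mu^\top x)\, x\big),$$

i.e. the Euclidean gradient $\kappa\mu$ projected onto the tangent space at $x$, which is what makes the score closed-form.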

Core claim

The continuity equation on the sphere for the von Mises-Fisher density reduces, by radial symmetry, to a scalar ODE whose solution in the cosine similarity gives the velocity. The marginal velocity and marginal score on the product sphere then decompose into posterior-weighted tangent sums that differ only by per-token scalar weights, allowing both ODE and predictor-corrector sampling with a single cross-entropy-trained posterior.
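In symbols, the decomposition the claim describes has the following shape; the notation here ($p_\theta$ for the learned posterior, $e_k$ for the sphere embedding of token $k$, scalar weights $a_t$ and $b_t$) is illustrative rather than the paper's:

$$u_t^{\ell}(x) = \sum_{k=1}^{K} p_\theta\big(y^{\ell}=k \mid x, t\big)\, a_t\big(\langle e_k, x^{\ell}\rangle\big)\, \Pi_{x^{\ell}} e_k, \qquad s_t^{\ell}(x) = \sum_{k=1}^{K} p_\theta\big(y^{\ell}=k \mid x, t\big)\, b_t\big(\langle e_k, x^{\ell}\rangle\big)\, \Pi_{x^{\ell}} e_k,$$

where $\Pi_{x} e = e - \langle e, x\rangle x$ is the tangent-space projection at $x$. Both fields share the same posterior-weighted tangent directions and differ only in the per-token scalars $a_t$ versus $b_t$, which is why one trained posterior serves both the ODE and the predictor-corrector sampler.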

What carries the argument

The von Mises-Fisher distribution on the sphere, which permits reduction of the vector continuity equation to a scalar ODE in cosine similarity due to its radial symmetry.
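Schematically, the reduction runs as follows; the exact drift is the paper's, and only the symmetry mechanism is sketched here. Because the vMF path density depends on $x$ only through $s = \mu^\top x$, one can posit a velocity along the projected mean direction and substitute into the continuity equation:

$$\partial_t p_t + \operatorname{div}_{\mathbb S^{d-1}}(p_t v_t) = 0, \qquad p_t(x) = f_t(s), \qquad v_t(x) = a_t(s)\,(\mu - s\,x).$$

Under this ansatz every term depends on $x$ only through $s$, so the vector PDE collapses to a first-order scalar ODE on $s \in [-1,1]$ for the coefficient $a_t$, and the "unique bounded solution" referenced throughout is the solution of that scalar equation.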

Load-bearing premise

The posterior learned by cross-entropy loss is sufficiently accurate to produce stable posterior-weighted sums for the velocity and score during sampling.
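For concreteness, a minimal sketch of the training step this premise rests on, per the abstract's "the posterior is the only learned object, trained by a cross-entropy loss". The network posterior_net, the unit-norm embedding table, and the interpolation-style noising are hypothetical placeholders standing in for the paper's vMF path, not its code:

```python
import torch
import torch.nn.functional as F

def train_step(posterior_net, embed, tokens, optimizer):
    """One cross-entropy step: noise token embeddings on the sphere,
    then train the posterior to recover the clean tokens.

    tokens: LongTensor (batch, L); embed: (vocab, d) with unit-norm rows.
    """
    t = torch.rand(tokens.shape[0], device=tokens.device)       # per-example times
    x1 = embed[tokens]                                          # clean points on S^{d-1}
    # Stand-in noising (an assumption, NOT the paper's vMF path):
    # blend clean points with Gaussian noise and retract to the sphere.
    noise = torch.randn_like(x1)
    xt = F.normalize(t[:, None, None] * x1 + (1 - t)[:, None, None] * noise, dim=-1)
    logits = posterior_net(xt, t)                               # (batch, L, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), tokens)      # CE over all positions
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

Everything downstream (velocity, score, both samplers) is then assembled from this posterior's outputs without further training, which is exactly why its accuracy is load-bearing.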

What would settle it

If the generated samples from the spherical flow are low-quality or the sampling process becomes unstable despite high posterior accuracy on test data, the claim that the reduced ODE yields a valid sampling path would be falsified.

Figures

Figures reproduced from arXiv: 2605.05629 by Gabriele Steidl, Gregor Kornhardt, Jannis Chemseddine.

Figure 1. LM1B: generation perplexity vs. entropy at NFE=128, varying the predictor-to-corrector ratio. Predictor–corrector sampling (stars) outperforms ODE sampling (circles), with a tradeoff between entropy and generation perplexity when using more corrector steps. view at source ↗
Figure 2. Illustration of the von Mises–Fisher density on the sphere. view at source ↗
read the original abstract

We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes spherical flows for generative modeling of discrete sequences by embedding tokens on the sphere S^{d-1} and using von Mises-Fisher distributions to induce a noise process. It claims a closed-form conditional score and derives the conditional velocity by reducing the continuity equation on the sphere to a scalar ODE in cosine similarity, whose unique bounded solution is asserted to determine the velocity. The marginal velocity and marginal score on the product space (S^{d-1})^L are shown to decompose exactly into posterior-weighted sums of tangent vectors (differing only by per-token scalar weights). Only the posterior is learned, via cross-entropy loss, enabling both ODE and predictor-corrector sampling. Experiments compare the vMF path to geodesic and Euclidean baselines and report improvements on Sudoku and language modeling tasks.

Significance. If the derivations hold, the work offers a principled reduction that exploits vMF radial symmetry to obtain an analytically determined velocity field, leaving only the posterior as the learned component. This is a clear strength relative to methods that must learn full velocity or score networks. The exact decomposition into posterior-weighted tangent sums and the availability of both ODE and PC samplers are technically attractive. The approach could meaningfully advance continuous generative modeling of categorical data, particularly if the bounded-solution property and stability under approximate posteriors are confirmed.

major comments (2)
  1. [Derivation of conditional velocity (abstract and §3)] The abstract and the section deriving the conditional velocity claim that exploiting radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity. No self-contained existence/uniqueness argument or verification that the bounded solution remains valid under the vMF concentration and dimension assumptions is supplied; this step is load-bearing for the entire velocity construction.
  2. [Marginal decomposition (§4)] The marginal velocity and score decomposition on (S^{d-1})^L into posterior-weighted tangent sums is presented as exact. Because the posterior is obtained only by cross-entropy training and is necessarily approximate on structured data (Sudoku constraints, token dependencies), the manuscript contains no analysis of how posterior calibration error propagates into velocity bias or ODE stability. This is load-bearing for the sampling claims on non-factorized discrete tasks.
minor comments (2)
  1. The notation for the product manifold (S^{d-1})^L and the per-token indexing of the posterior-weighted sums would benefit from an explicit definition early in the methods section to improve readability.
  2. The experimental section should report the vMF concentration parameter(s) used and include a brief sensitivity study, as this hyper-parameter directly affects the noise process and the reduced ODE (a probe of the kind sketched below would suffice).
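A probe of the kind minor comment 2 asks for is cheap to run. A minimal sketch using scipy.stats.vonmises_fisher (assuming SciPy ≥ 1.11; the dimension and κ values are illustrative), measuring how the concentration parameter controls the spread of cosine similarities around the mean direction:

```python
import numpy as np
from scipy.stats import vonmises_fisher

d = 16                              # embedding dimension (illustrative)
mu = np.zeros(d); mu[0] = 1.0       # mean direction on S^{d-1}
rng = np.random.default_rng(0)

for kappa in (1.0, 10.0, 100.0):
    samples = vonmises_fisher(mu, kappa).rvs(10_000, random_state=rng)
    cos_sim = samples @ mu          # cosine similarity to the mean direction
    print(f"kappa={kappa:6.1f}  mean cos={cos_sim.mean():.3f}  std={cos_sim.std():.3f}")
```

Larger κ concentrates mass near $s = 1$, which in turn changes the shape of the reduced scalar ODE; that is the sensitivity the report asks the authors to quantify.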

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We have carefully considered the major comments regarding the derivation of the conditional velocity and the implications of the marginal decomposition under approximate posteriors. Our responses are provided below, along with indications of how we will revise the manuscript.

read point-by-point responses
  1. Referee: [Derivation of conditional velocity (abstract and §3)] The abstract and the section deriving the conditional velocity claim that exploiting radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity. No self-contained existence/uniqueness argument or verification that the bounded solution remains valid under the vMF concentration and dimension assumptions is supplied; this step is load-bearing for the entire velocity construction.

    Authors: We agree that the manuscript would benefit from a more explicit justification of the existence and uniqueness of the bounded solution to the scalar ODE. The reduction exploits the fact that the vMF density depends only on the cosine similarity, allowing the continuity equation to be projected onto this one-dimensional quantity. In the revised version, we will include a dedicated subsection or appendix that solves the ODE explicitly or invokes the Picard–Lindelöf theorem for local existence, and shows global boundedness using the compact domain $[-1,1]$ of the cosine similarity and the form of the drift term. We will also verify that these assumptions hold for the vMF parameters used in the experiments. revision: yes

  2. Referee: [Marginal decomposition (§4)] The marginal velocity and score decomposition on (S^{d-1})^L into posterior-weighted tangent sums is presented as exact. Because the posterior is obtained only by cross-entropy training and is necessarily approximate on structured data (Sudoku constraints, token dependencies), the manuscript contains no analysis of how posterior calibration error propagates into velocity bias or ODE stability. This is load-bearing for the sampling claims on non-factorized discrete tasks.

    Authors: The marginal decomposition is mathematically exact and holds for the true posterior as well as for any approximation of it, since it derives from the product structure and the definition of the marginal velocity as an expectation. We concur that analyzing the effect of posterior approximation error on the resulting velocity bias and ODE stability is important, particularly for tasks with dependencies such as Sudoku. The current manuscript relies on empirical validation, where the cross-entropy-trained posterior yields effective sampling. In the revision, we will expand the discussion to include a qualitative analysis of error propagation and note that the predictor-corrector sampler provides robustness (sketched generically below). A rigorous quantitative bound is left for future investigation, as it would require additional assumptions on the posterior error. revision: partial
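To make the robustness point concrete, a generic predictor-corrector loop on the product sphere; velocity_fn and score_fn stand for the posterior-weighted sums described above, and the step sizes, schedule, and Langevin scaling are illustrative, not the paper's scheme:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pc_sample(velocity_fn, score_fn, x, n_steps=128, corrector_steps=1, eps=1e-3):
    """Generic predictor-corrector sampling on (S^{d-1})^L.

    x: (batch, L, d) initial points, unit-norm in the last axis.
    velocity_fn(x, t), score_fn(x, t): tangent fields with the same shape as x.
    """
    ts = torch.linspace(0.0, 1.0, n_steps + 1)
    for i in range(n_steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        # Predictor: Euler step along the marginal velocity, retract to the sphere.
        x = F.normalize(x + dt * velocity_fn(x, t), dim=-1)
        # Corrector: Langevin-style moves along the marginal score.
        for _ in range(corrector_steps):
            noise = torch.randn_like(x)
            noise = noise - (noise * x).sum(-1, keepdim=True) * x  # tangent projection
            x = F.normalize(x + eps * score_fn(x, t) + (2 * eps) ** 0.5 * noise, dim=-1)
    return x
```

The corrector is what buys robustness: even if the posterior-weighted velocity is slightly biased, the score-driven moves nudge the state back toward the current marginal before the next predictor step.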

Circularity Check

0 steps flagged

Derivation of velocity from continuity equation and vMF symmetry is mathematically independent of the learned posterior

full rationale

The paper derives the conditional velocity by reducing the continuity equation on S^{d-1} to a scalar ODE in cosine similarity via radial symmetry of the vMF density, then invoking the unique bounded solution of that ODE. This reduction and solution are presented as direct consequences of the PDE and the vMF functional form; they contain no fitted parameters, no data-dependent terms, and no self-referential definitions. The subsequent decomposition of marginal velocity and score on (S^{d-1})^L into posterior-weighted tangent sums follows algebraically once the conditional forms are known. The sole learned object—the posterior—is obtained by a separate cross-entropy loss on discrete tokens and is therefore an external input to the flow equations rather than an output that the derivation presupposes. No self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the chain. The derivation is therefore self-contained against external mathematical benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard differential geometry of the sphere, properties of the von Mises-Fisher distribution, and the existence of a unique bounded solution to the reduced scalar ODE; no new entities are postulated and the only fitted component is the posterior network.

axioms (2)
  • domain assumption The von Mises-Fisher distribution on the sphere admits a closed-form conditional score.
    Invoked to obtain the noise process and score; stated in the abstract as given.
  • domain assumption Radial symmetry of the vMF density allows reduction of the continuity equation to a scalar ODE in cosine similarity.
    Core technical step; uniqueness and boundedness of the solution are asserted without further proof in the abstract.

pith-pipeline@v0.9.0 · 5484 in / 1468 out tokens · 35787 ms · 2026-05-12T01:33:40.029370+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages
