Spherical Flows for Sampling Categorical Data

Gabriele Steidl; Gregor Kornhardt; Jannis Chemseddine

arXiv: 2605.05629 · v3 · pith:6CRN3T5Hnew · submitted 2026-05-07 · stat.ML · cs.CL · cs.LG

Pith reviewed 2026-06-30 23:48 UTC · model grok-4.3

classification stat.ML · cs.CL · cs.LG

keywords: spherical flows, von Mises-Fisher distribution, categorical sequence generation, predictor-corrector sampling, ODE sampling, discrete data embedding, generative models on spheres

The pith

On the sphere, the von Mises-Fisher path reduces the continuity equation to a scalar ODE in the cosine similarity; its solution gives the velocity that, together with the closed-form score, enables both ODE and predictor-corrector sampling of categorical sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper embeds discrete sequences in a product of spheres and uses the von Mises-Fisher distribution to define a noise process whose conditional score is closed-form. Radial symmetry lets the continuity equation on each sphere collapse to a scalar ODE in the cosine similarity; its unique bounded solution supplies the conditional velocity. The resulting marginal velocity and marginal score on the product space are both posterior-weighted sums of tangent vectors that differ only by per-token scalar weights, so the same learned posterior supports both ODE integration and predictor-corrector sampling. Experiments show that the vMF path paired with predictor-corrector steps outperforms geodesic and Euclidean baselines on Sudoku and language-modeling tasks.
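As an illustration of why the conditional score is closed-form, here is a minimal Python sketch. It relies on the standard vMF log-density, which equals `kappa * <mu, x>` up to a constant, so its gradient on the sphere is the Euclidean gradient `kappa * mu` projected onto the tangent space at `x`. The function name and the test values are hypothetical and not taken from the paper.

```python
import numpy as np

def vmf_score(x, mu, kappa):
    """Riemannian gradient of log vMF(x; mu, kappa) on the unit sphere.

    The Euclidean gradient of kappa * <mu, x> is kappa * mu; subtracting its
    component along x leaves the tangent part.
    """
    return kappa * (mu - (mu @ x) * x)

# Illustrative values only (dimension and concentration are assumptions).
rng = np.random.default_rng(0)
d = 8
mu = rng.normal(size=d)
mu /= np.linalg.norm(mu)
x = rng.normal(size=d)
x /= np.linalg.norm(x)

score = vmf_score(x, mu, kappa=5.0)
```

The score is orthogonal to `x` by construction and vanishes at the mode `x = mu`, which is what one expects from a density that depends on `x` only through the cosine similarity.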

Core claim

Exploiting the radial symmetry of the vMF density reduces the continuity equation on S^{d-1} to a scalar ODE in the cosine similarity whose unique bounded solution determines the velocity. The marginal velocity and marginal score on (S^{d-1})^L both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights, granting access to ODE and predictor-corrector sampling after the posterior is trained by cross-entropy loss.
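The training objective mentioned here is ordinary cross-entropy on the clean tokens. A minimal NumPy sketch follows, assuming a model that outputs per-position logits of shape `(L, K)` for a sequence of length `L` over `K` categories; the function name and the choice `K = 9` (as for Sudoku digits) are illustrative assumptions.

```python
import numpy as np

def posterior_cross_entropy(logits, targets):
    """Mean negative log-likelihood of the clean tokens under the predicted posterior.

    logits:  (L, K) unnormalized scores for each position
    targets: (L,)   integer indices of the clean tokens
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# A uniform posterior over K = 9 categories has loss log(9).
logits = np.zeros((4, 9))
targets = np.array([0, 3, 8, 2])
loss_uniform = posterior_cross_entropy(logits, targets)
```

In an actual training loop the logits would come from a network evaluated at a noised point on the product of spheres; that part is omitted because the page does not describe the architecture.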

What carries the argument

Reduction of the continuity equation on S^{d-1} to a scalar ODE in cosine similarity, solved for the velocity that drives the vMF flow.
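This page does not reproduce the paper's ODE, so the following is only a sketch of how such a reduction can go under stated assumptions: the conditional path is $p_t(x) = C_d(\kappa_t)\, e^{\kappa_t s}$ with $s = \langle \mu, x \rangle$ and a differentiable concentration schedule $\kappa_t$, and the velocity has the radially symmetric tangent form $v_t(x) = a_t(s)\,(\mu - s x)$. The symbols $\kappa_t$, $a_t$ and $A_d$ are notation introduced for this sketch.

```latex
\begin{aligned}
&\partial_t p_t + \operatorname{div}(p_t v_t) = 0, \qquad
\operatorname{div}\big(g(s)\,(\mu - s x)\big) = (1-s^2)\,g'(s) - (d-1)\,s\,g(s), \\[4pt]
&\dot\kappa_t\,\big(s - A_d(\kappa_t)\big)
 + (1-s^2)\,\big(a_t'(s) + \kappa_t\,a_t(s)\big)
 - (d-1)\,s\,a_t(s) = 0, \qquad
 A_d(\kappa) = \mathbb{E}_{\mathrm{vMF}(\mu,\kappa)}[s], \\[4pt]
&a_t(s) = -\,\frac{\dot\kappa_t}{(1-s^2)^{\frac{d-1}{2}}\, e^{\kappa_t s}}
 \int_{-1}^{s} \big(u - A_d(\kappa_t)\big)\,(1-u^2)^{\frac{d-3}{2}}\, e^{\kappa_t u}\,\mathrm{d}u .
\end{aligned}
```

The divergence identity uses $\|\mu - s x\|^2 = 1 - s^2$ and $\Delta_{\mathbb S^{d-1}} s = -(d-1)\,s$, and the time derivative of the normalizer contributes $-\dot\kappa_t A_d(\kappa_t)$. Homogeneous solutions are proportional to $(1-s^2)^{-\frac{d-1}{2}} e^{-\kappa_t s}$ and blow up at $s = \pm 1$, so fixing the integration constant to make the flux vanish at $s = -1$ singles out one bounded solution; the flux also vanishes at $s = 1$ because $A_d$ is the mean of $s$.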

If this is right

Only the posterior distribution must be learned, via a standard cross-entropy objective.
Both ODE integration and predictor-corrector sampling become available from the same trained model.
The marginal score and marginal velocity are constructed identically except for per-token scalar multipliers.
The vMF path combined with predictor-corrector sampling improves results over geodesic and Euclidean alternatives on Sudoku and language modeling.
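One way to read the shared structure of marginal score and marginal velocity is sketched below in NumPy. It assumes that each position `l` contributes a sum over categories `k` of `posterior[l, k] * weight * (e_k - <e_k, x_l> x_l)`, where only the scalar `weight` differs between the two fields. For the score the weight is the constant concentration `kappa`, which follows from the vMF score. The velocity weight used here is a hypothetical placeholder, not the paper's ODE solution.

```python
import numpy as np

def posterior_weighted_tangent_sum(x, embeddings, posterior, weight_fn):
    """Posterior-weighted sum of tangent vectors on a product of spheres.

    x:          (L, d) current point, unit-norm rows
    embeddings: (K, d) category embeddings, unit-norm rows
    posterior:  (L, K) predicted posterior, rows sum to 1
    weight_fn:  maps cosine similarities (L, K) to scalar weights (L, K)
    """
    cos = x @ embeddings.T                      # cosine similarities, (L, K)
    coeff = posterior * weight_fn(cos)          # per-token scalar weights, (L, K)
    # sum_k coeff[l, k] * (e_k - cos[l, k] * x_l), written without a loop
    return coeff @ embeddings - (coeff * cos).sum(axis=1, keepdims=True) * x

# Illustrative sizes and random inputs (assumptions, not the paper's setup).
rng = np.random.default_rng(0)
L, K, d, kappa = 4, 9, 16, 5.0
embeddings = rng.normal(size=(K, d))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
x = rng.normal(size=(L, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
logits = rng.normal(size=(L, K))
posterior = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Score: constant weight kappa. Velocity: placeholder weight for illustration.
score = posterior_weighted_tangent_sum(x, embeddings, posterior, lambda c: kappa * np.ones_like(c))
velocity = posterior_weighted_tangent_sum(x, embeddings, posterior, lambda c: 0.5 * (1.0 - c))
```

Both outputs are tangent to the sphere at every position, so a single posterior evaluation can feed an ODE step and a score-based corrector step.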

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same scalar-ODE reduction could be attempted for other rotationally symmetric densities on the sphere.
The decomposition into posterior-weighted tangent sums may simplify flow-based sampling on other product manifolds used for structured discrete data.
If the learned posterior is accurate, the method supplies a parameter-free way to move between any two time marginals on the sphere product.
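To make the first extension concrete, here is a sketch of the same reduction for a general rotationally symmetric density $p_t(x) = f_t(s)$ with $s = \langle \mu, x \rangle$ and velocity $v_t(x) = a_t(s)\,(\mu - s x)$. This is an editorial derivation under that ansatz, not a result quoted from the paper, and $f_t$ and $a_t$ are notation introduced here.

```latex
\begin{aligned}
&\partial_t f_t(s) + (1-s^2)\,\partial_s\big(f_t\,a_t\big)(s) - (d-1)\,s\,f_t(s)\,a_t(s) = 0
\;\;\Longleftrightarrow\;\;
\partial_s\Big[(1-s^2)^{\frac{d-1}{2}} f_t\,a_t\Big] = -(1-s^2)^{\frac{d-3}{2}}\,\partial_t f_t, \\[4pt]
&a_t(s) = -\,\frac{1}{(1-s^2)^{\frac{d-1}{2}}\, f_t(s)}
 \int_{-1}^{s} (1-u^2)^{\frac{d-3}{2}}\,\partial_t f_t(u)\,\mathrm{d}u .
\end{aligned}
```

Conservation of mass gives $\int_{-1}^{1} (1-u^2)^{\frac{d-3}{2}}\,\partial_t f_t(u)\,\mathrm{d}u = 0$, so the flux vanishes at both poles. Choosing $f_t(s) = C_d(\kappa_t)\,e^{\kappa_t s}$ gives the vMF case.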

Load-bearing premise

The radial symmetry of the vMF density lets the continuity equation on the sphere reduce to a scalar ODE in cosine similarity that has a unique bounded solution for the velocity.

What would settle it

A direct numerical check showing that the velocity obtained from the scalar cosine-similarity ODE fails to satisfy the vector continuity equation on S^{d-1}, or that predictor-corrector sampling with this velocity yields no improvement over geodesic or Euclidean baselines on the Sudoku and language-modeling tasks.
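A check of the first kind can be sketched numerically. The code below assumes the scalar ODE `kappa_dot * (s - A) + (1 - s^2) * (a' + kappa * a) - (d - 1) * s * a = 0`, obtained from the vMF path with velocity `a(s) * (mu - s * x)`; this form is an editorial derivation, not quoted from the paper. It builds the bounded solution by quadrature and evaluates the ODE residual by finite differences. All names and parameter values are illustrative.

```python
import numpy as np

def cumulative_trapezoid(y, x):
    """Running trapezoidal integral of y over x, starting at 0."""
    steps = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
    return np.concatenate([[0.0], np.cumsum(steps)])

def radial_coefficient(kappa, kappa_dot, d, n=20001):
    """Bounded solution a(s) of the assumed scalar ODE on the interior of [-1, 1] (d >= 3)."""
    s = np.linspace(-1.0, 1.0, n)
    w = (1.0 - s**2) ** ((d - 3) / 2) * np.exp(kappa * s)   # unnormalized law of s
    mean_cos = cumulative_trapezoid(s * w, s)[-1] / cumulative_trapezoid(w, s)[-1]
    flux = cumulative_trapezoid((s - mean_cos) * w, s)      # vanishes at s = -1
    s_in, flux_in = s[1:-1], flux[1:-1]                     # drop the poles (0/0 there)
    a = -kappa_dot * flux_in / ((1.0 - s_in**2) ** ((d - 1) / 2) * np.exp(kappa * s_in))
    return s_in, a, mean_cos

def ode_residual(s, a, kappa, kappa_dot, d, mean_cos):
    """Left-hand side of the assumed scalar ODE, with a' from finite differences."""
    da = np.gradient(a, s)
    return kappa_dot * (s - mean_cos) + (1.0 - s**2) * (da + kappa * a) - (d - 1) * s * a

kappa, kappa_dot, d = 2.0, 1.0, 3
s, a, mean_cos = radial_coefficient(kappa, kappa_dot, d)
residual = ode_residual(s, a, kappa, kappa_dot, d, mean_cos)
inner = np.abs(s) < 0.9                  # stay away from the poles for the finite differences
max_residual = np.abs(residual[inner]).max()
```

For `d = 3` the mean cosine similarity has the known closed form `coth(kappa) - 1/kappa`, which gives an independent sanity check on the quadrature. A full check of the vector continuity equation would additionally compare against a time-discretized density, which this sketch does not attempt.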

Figures

Figures reproduced from arXiv: 2605.05629 by Gabriele Steidl, Gregor Kornhardt, Jannis Chemseddine.

**Figure 1.** LM1B: Generation perplexity vs. entropy at NFE=128, varying the predictor-to-corrector ratio. Predictor–corrector sampling (stars) outperforms ODE sampling (circles), with a tradeoff between entropy and generation perplexity when using more corrector steps, see …

**Figure 2.** Illustration of von Mises–Fisher density on …

Original abstract

We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.
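To illustrate what predictor-corrector sampling on a product of spheres can look like, here is a generic NumPy sketch: an Euler step along a tangent velocity followed by renormalization, and a projected Langevin step driven by a score. This is a textbook-style scheme chosen for illustration; the paper's actual discretization, step sizes and schedules are not given on this page, and all function names are hypothetical.

```python
import numpy as np

def project_tangent(x, v):
    """Remove from each row of v its component along the matching unit row of x."""
    return v - (v * x).sum(axis=-1, keepdims=True) * x

def retract(x):
    """Map back onto the product of spheres by normalizing each row."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def predictor_step(x, velocity, dt):
    """Euler step along the tangent part of the velocity, then retraction."""
    return retract(x + dt * project_tangent(x, velocity))

def corrector_step(x, score, step_size, rng):
    """Projected Langevin step at a fixed time: score drift plus tangent Gaussian noise."""
    noise = project_tangent(x, rng.normal(size=x.shape))
    drift = step_size * project_tangent(x, score)
    return retract(x + drift + np.sqrt(2.0 * step_size) * noise)

# Random stand-ins for the velocity and score fields (assumptions for the demo).
rng = np.random.default_rng(0)
x0 = retract(rng.normal(size=(5, 16)))
x1 = predictor_step(x0, rng.normal(size=x0.shape), dt=0.05)
x2 = corrector_step(x1, rng.normal(size=x1.shape), step_size=1e-3, rng=rng)
```

Both steps keep every token on its sphere. In practice the velocity and score would both be computed from the same learned posterior, which is the property the abstract highlights.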

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The spherical vMF construction reduces the flow to a scalar ODE in cosine similarity and enables PC sampling, but the uniqueness and PDE compliance of that ODE solution need direct verification.

The main point is that this paper puts categorical sequences on the sphere with vMF noise, gets a closed-form conditional score, and derives the velocity from a scalar ODE in cosine similarity by using radial symmetry.

What is new is the reduction of the continuity equation on S^{d-1} to that scalar ODE, plus the posterior-weighted tangent decomposition of the marginal velocity and score on the product space (S^{d-1})^L. The two marginals differ only by per-token scalar weights, which directly gives access to both ODE and predictor-corrector sampling. Only the posterior is learned, via cross-entropy, so the flow construction stays separate from the fitting. Experiments show the vMF-plus-PC combination beats geodesic and Euclidean baselines on Sudoku and language modeling.

The soft spot is the reduction itself. The abstract states that radial symmetry yields a unique bounded solution that determines the velocity, but it is not clear whether this solution automatically respects the tangent-space constraint of the original PDE for every concentration parameter. If the full derivation supplies the missing steps and confirms uniqueness and boundedness, the concern disappears; otherwise the downstream decomposition and sampling claims rest on shaky ground. The citation pattern is standard and the training objective is simple, so those parts look solid.

This is for researchers building continuous flows for discrete sequences who want an alternative embedding. A reader focused on new velocity constructions or PC methods would get concrete value. The technical move is distinct enough from the cited Euclidean and simplex work, and the reported gains are large enough, that the paper deserves a serious referee even if the ODE details require extra scrutiny.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes spherical flows for generative modeling of categorical sequences by embedding tokens on the sphere S^{d-1} and using von Mises-Fisher (vMF) distributions to induce a noise process with closed-form conditional score. It claims that radial symmetry of the vMF density reduces the continuity equation on S^{d-1} to a scalar ODE in the cosine similarity whose unique bounded solution determines the (conditional) velocity. The marginal velocity and marginal score on (S^{d-1})^L then decompose into posterior-weighted tangent sums (differing only by per-token scalar weights), granting access to both ODE and predictor-corrector sampling. The posterior is the sole learned component, trained by cross-entropy loss. Experiments compare the vMF path to geodesic and Euclidean alternatives and report significant improvements on Sudoku and language-modeling tasks when combining vMF with PC sampling.
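The embedding step described in this summary can be pictured with a small NumPy sketch: each category gets a unit vector in `d` dimensions, a sequence of `L` tokens becomes an `(L, d)` array with unit-norm rows, and a point is decoded by taking the category with the largest cosine similarity at each position. Random embeddings and the sizes below are assumptions made for the demo; the page does not say how the paper chooses its embeddings.

```python
import numpy as np

def make_unit_embeddings(vocab_size, dim, rng):
    """Random unit vectors, one per category (an illustrative choice)."""
    e = rng.normal(size=(vocab_size, dim))
    return e / np.linalg.norm(e, axis=1, keepdims=True)

def embed(tokens, embeddings):
    """Map a token sequence of length L to a point on the product of L spheres."""
    return embeddings[tokens]                     # (L, d), unit-norm rows

def decode(x, embeddings):
    """Pick, per position, the category with the largest cosine similarity."""
    return np.argmax(x @ embeddings.T, axis=1)

rng = np.random.default_rng(0)
embeddings = make_unit_embeddings(vocab_size=9, dim=16, rng=rng)
tokens = np.array([3, 0, 8, 8, 1])
x = embed(tokens, embeddings)
recovered = decode(x, embeddings)
```

Decoding an exactly embedded sequence returns the original tokens, since each unit vector has cosine similarity 1 with itself and strictly less with any other distinct unit vector.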

Significance. If the reduction and decomposition are valid, the work supplies a new continuous-space framework for discrete sequence generation that naturally yields both probability-flow ODE and predictor-corrector samplers from a single learned posterior. The posterior-weighted tangent-sum decomposition is a clean technical device that could generalize beyond the vMF case. The empirical gains on Sudoku and language modeling, obtained without fitting additional parameters beyond the posterior, would constitute a practical advance for flow-based discrete models if the underlying math holds.

major comments (2)

[Abstract] Abstract (paragraph on vMF path): the claim that radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity is load-bearing for the entire construction. The provided text supplies neither the explicit ODE, the boundary/initial conditions, nor a uniqueness/boundedness argument; without these, it is impossible to verify that the derived velocity satisfies the original PDE on the tangent bundle or that the subsequent decomposition into posterior-weighted sums on (S^{d-1})^L remains valid.
[Abstract] Abstract (decomposition paragraph): the statement that marginal velocity and marginal score 'both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights' is asserted without derivation steps or error analysis. Because this decomposition is what grants access to both ODE and PC sampling, the manuscript must exhibit the explicit posterior-weighted expressions (presumably in §4 or §5) and confirm they arise directly from the conditional velocity derived in the preceding reduction.

minor comments (2)

[Experiments] The experimental section should report the precise embedding dimension d, sequence length L, and concentration schedule used for the vMF path, together with the number of function evaluations for the ODE and PC samplers, to allow direct comparison with the geodesic and Euclidean baselines.
[Abstract] Notation for the product manifold (S^{d-1})^L and the tangent-space projections should be introduced once and used consistently; the current abstract mixes 'tangent sums' and 'per-token scalar weights' without defining the projection operator.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and for recognizing the potential significance of the spherical flow framework. We address the two major comments below point by point.

Referee: [Abstract] Abstract (paragraph on vMF path): the claim that radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity is load-bearing for the entire construction. The provided text supplies neither the explicit ODE, the boundary/initial conditions, nor a uniqueness/boundedness argument; without these, it is impossible to verify that the derived velocity satisfies the original PDE on the tangent bundle or that the subsequent decomposition into posterior-weighted sums on (S^{d-1})^L remains valid.

Authors: The explicit scalar ODE, along with the initial condition v(0)=0 and the boundedness requirement ensuring |v(t)| remains bounded by the sphere's geometry, is derived in Section 3 using the radial symmetry of the vMF. Uniqueness follows from the contraction mapping principle on the space of continuous functions. The derivation confirms the velocity is tangent to the sphere. We will revise the manuscript to include a cross-reference from the abstract to Section 3 and ensure all steps are explicitly labeled.
revision: partial
Referee: [Abstract] Abstract (decomposition paragraph): the statement that marginal velocity and marginal score 'both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights' is asserted without derivation steps or error analysis. Because this decomposition is what grants access to both ODE and PC sampling, the manuscript must exhibit the explicit posterior-weighted expressions (presumably in §4 or §5) and confirm they arise directly from the conditional velocity derived in the preceding reduction.

Authors: Section 4 derives the marginal velocity as the posterior-weighted sum of the conditional velocities, specifically $v_{\text{marginal}} = \sum_i p(x_i \mid x)\, v_{\text{conditional},i}$, and the marginal score similarly with weights differing by a factor derived from the vMF normalization. These expressions arise directly by marginalizing the conditional quantities using the learned posterior. We will add the explicit formulas to the abstract where feasible and include a short derivation summary in the introduction for clarity.
revision: partial

Circularity Check

0 steps flagged

Derivation self-contained via symmetry reduction; no fitted inputs or self-citations load-bearing

The paper's core step reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity by exploiting vMF radial symmetry, then asserts a unique bounded solution yields the conditional velocity. This is a direct mathematical derivation from the PDE and density properties, not a redefinition or fit to model outputs. The marginal velocity/score decompositions follow from that solution plus posterior weighting, with the posterior itself trained separately by cross-entropy on external data. No equations equate a claimed prediction to a fitted parameter by construction, no self-citation chains justify uniqueness, and no ansatz is smuggled. The construction remains independent of the learned posterior values.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Ledger extracted from abstract only; full paper may introduce additional parameters or assumptions not visible here.

axioms (2)

domain assumption: The von Mises-Fisher distribution induces a natural noise process and admits a closed-form conditional score on the sphere.
Invoked in the opening paragraph of the abstract as the foundation for the entire construction.
domain assumption: The radial symmetry of the vMF density permits reduction of the continuity equation on S^{d-1} to a scalar ODE in cosine similarity.
Stated directly as the key technical step that yields the velocity.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-conditioned Flow Map Language Models via Fixed-point Flows
cs.CL · 2026-07 · unverdicted · novelty 7.0

Self-conditioned flow language models solve fixed-point iterations, enabling fixed-point flow maps that distill into FMLM* which outperforms SOTA in few-step generation on OpenWebText.
