Spherical Flows for Sampling Categorical Data
Pith reviewed 2026-06-30 23:48 UTC · model grok-4.3
The pith
On the sphere, the radial symmetry of the von Mises-Fisher (vMF) path reduces the continuity equation to a scalar ODE in cosine similarity; its solution gives the velocity used for both ODE and predictor-corrector sampling of categorical sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Exploiting the radial symmetry of the vMF density reduces the continuity equation on S^{d-1} to a scalar ODE in the cosine similarity whose unique bounded solution determines the velocity. The marginal velocity and marginal score on (S^{d-1})^L both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights, granting access to ODE and predictor-corrector sampling after the posterior is trained by cross-entropy loss.
What carries the argument
Reduction of the continuity equation on S^{d-1} to a scalar ODE in cosine similarity, solved for the velocity that drives the vMF flow.
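The page does not reproduce the paper's explicit ODE, so the following is only a sketch of how such a reduction can go, under stated assumptions: the conditional density is written as $p_t(x) = f_t(s)$ with $s = \langle x, \mu\rangle$, and the velocity is assumed to have the radial form $u_t(x) = a_t(s)\,(\mu - s\,x)$. The symbols $f_t$, $a_t$, and $\mu$ are illustrative names, not the paper's notation. It uses the standard facts that the tangential gradient of $s$ on $\mathbb S^{d-1}$ is $\mu - s\,x$, with squared norm $1 - s^2$, and that $\Delta_{\mathbb S^{d-1}} s = -(d-1)\,s$.

```latex
% Sketch under the assumptions above; the paper's own form may differ.
\begin{aligned}
\partial_t p_t + \operatorname{div}_{\mathbb S^{d-1}}(p_t u_t) = 0
\;&\Longrightarrow\;
\partial_t f_t(s) + (1 - s^2)\,\partial_s\!\big[f_t a_t\big](s) - (d-1)\,s\,f_t(s)\,a_t(s) = 0, \\
&\Longleftrightarrow\;
(1 - s^2)^{\frac{d-3}{2}}\,\partial_t f_t(s)
 + \partial_s\!\Big[(1 - s^2)^{\frac{d-1}{2}} f_t(s)\,a_t(s)\Big] = 0, \\
a_t(s) \;&=\; -\,\frac{1}{(1 - s^2)^{\frac{d-1}{2}} f_t(s)}
 \int_{-1}^{s} (1 - r^2)^{\frac{d-3}{2}}\,\partial_t f_t(r)\,\mathrm{d}r .
\end{aligned}
```

In this form the general solution differs from the last line by a term $C / \big((1 - s^2)^{(d-1)/2} f_t(s)\big)$, which is unbounded at $s = \pm 1$ unless $C = 0$. That is one plausible reading of the "unique bounded solution" in the core claim, though the paper's own argument may be stated differently.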
If this is right
- Only the posterior distribution must be learned, via a standard cross-entropy objective.
- Both continuous ODE integration and discrete predictor-corrector steps become available from the same trained model.
- The marginal score and marginal velocity are constructed identically except for per-token scalar multipliers.
- The vMF path combined with predictor-corrector sampling gives better results than the geodesic or Euclidean alternatives on Sudoku and language modeling.
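To make the second consequence above concrete, here is a minimal sketch of one predictor-corrector step on a product of unit spheres. Everything in it is an assumption for illustration rather than the paper's sampler: the names `pc_step`, `velocity`, `score`, and `step_size` are hypothetical, the predictor is a plain Euler step, the corrector is a Langevin step, and points are returned to the sphere by renormalization instead of an exponential map.

```python
import numpy as np

def normalize(x):
    """Rescale each row of x to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def tangent_project(x, v):
    """Remove from v its component along x, row by row (x has unit rows)."""
    return v - np.sum(x * v, axis=-1, keepdims=True) * x

def pc_step(x, t, dt, velocity, score, step_size, rng, n_corrector=1):
    """One hypothetical predictor-corrector step on (S^{d-1})^L.

    x        : (L, d) array with unit-norm rows
    velocity : callable (x, t) -> (L, d), assumed to come from the learned posterior
    score    : callable (x, t) -> (L, d), assumed to come from the same posterior
    """
    # Predictor: Euler step along the tangent velocity, then back onto the sphere.
    x = normalize(x + dt * tangent_project(x, velocity(x, t)))
    t_next = t + dt
    # Corrector: Langevin steps using the score at the new time.
    for _ in range(n_corrector):
        noise = tangent_project(x, rng.standard_normal(x.shape))
        drift = tangent_project(x, score(x, t_next))
        x = normalize(x + step_size * drift + np.sqrt(2.0 * step_size) * noise)
    return x
```

Setting `n_corrector=0` gives a pure ODE (Euler) step, so one pair of callables covers both samplers. This matches the summary's point that both become available from a single trained model.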
Where Pith is reading between the lines
- The same scalar-ODE reduction could be attempted for other rotationally symmetric densities on the sphere.
- The decomposition into posterior-weighted tangent sums may simplify flow-based sampling on other product manifolds used for structured discrete data.
- If the learned posterior is accurate, the method supplies a parameter-free way to move between any two time marginals on the sphere product.
Load-bearing premise
The radial symmetry of the vMF density lets the continuity equation on the sphere reduce to a scalar ODE in cosine similarity that has a unique bounded solution for the velocity.
What would settle it
A direct numerical check showing that the velocity obtained from the scalar cosine-similarity ODE fails to satisfy the vector continuity equation on S^{d-1}, or that predictor-corrector sampling with this velocity yields no improvement over geodesic or Euclidean baselines on the Sudoku and language-modeling tasks.
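One way the first kind of check could be set up is sketched below. It is not taken from the paper. It assumes the path is parametrized directly by the concentration `kappa` (so velocities are per unit `kappa`), writes the vMF density as a profile in $s = \langle x, \mu\rangle$, builds the radial speed from a flux integral that vanishes at both poles, and evaluates the residual of the reduced continuity equation $\partial_\kappa f + (1 - s^2)\,\partial_s(f a) - (d-1)\,s\,f\,a$ by finite differences. All function names are hypothetical, and the sketch is meant for $d \ge 3$.

```python
import numpy as np

def cumtrapz0(y, x):
    """Cumulative trapezoid integral of y over x, starting from 0."""
    steps = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
    return np.concatenate([[0.0], np.cumsum(steps)])

def vmf_profile(s, kappa, d):
    """vMF density as a function of s, normalized numerically so that
    the integral of f(s) * (1 - s^2)^((d-3)/2) over the grid equals 1."""
    w = (1.0 - s**2) ** ((d - 3) / 2)
    unnorm = np.exp(kappa * s)
    return unnorm / cumtrapz0(unnorm * w, s)[-1]

def radial_speed(s, kappa, d, eps=1e-4):
    """Return (a, dfdk, f), where the velocity is a(s) * (mu - s x) per unit kappa."""
    w = (1.0 - s**2) ** ((d - 3) / 2)
    f = vmf_profile(s, kappa, d)
    # Central difference in kappa stands in for the time derivative of the density.
    dfdk = (vmf_profile(s, kappa + eps, d) - vmf_profile(s, kappa - eps, d)) / (2 * eps)
    flux = -cumtrapz0(w * dfdk, s)  # zero at s = -1 and, by mass conservation, at s = 1
    a = np.full_like(s, np.nan)     # the poles themselves are left undefined
    inner = slice(1, -1)
    a[inner] = flux[inner] / ((1.0 - s[inner] ** 2) ** ((d - 1) / 2) * f[inner])
    return a, dfdk, f

def continuity_residual(s, kappa, d):
    """Residual of the reduced continuity equation at the interior grid points."""
    a, dfdk, f = radial_speed(s, kappa, d)
    si, g = s[1:-1], (f * a)[1:-1]
    dg = np.gradient(g, si)
    res = dfdk[1:-1] + (1.0 - si**2) * dg - (d - 1) * si * g
    return si, res, a[1:-1]
```

A residual that stays near zero away from the poles, with `a` finite everywhere in the interior, would be consistent with the reduction. A residual that does not shrink as the grid is refined would count against it. This only tests the scalar form; checking the full vector equation on $\mathbb S^{d-1}$ would need a separate divergence computation.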
Original abstract
We study the problem of learning generative models for discrete sequences in a continuous embedding space. Whereas prior approaches typically operate in Euclidean space or on the probability simplex, we instead work on the sphere $\mathbb S^{d-1}$. There the von Mises-Fisher (vMF) distribution induces a natural noise process and admits a closed-form conditional score. The conditional velocity is in general intractable. Exploiting the radial symmetry of the vMF density we reduce the continuity equation on $\mathbb S^{d-1}$ to a scalar ODE in the cosine similarity, whose unique bounded solution determines the velocity. The marginal velocity and marginal score on $(\mathbb S^{d-1})^L$ both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights. This gives access to both ODE and predictor-corrector (PC) sampling. The posterior is the only learned object, trained by a cross-entropy loss. Experiments compare the vMF path against geodesic and Euclidean alternatives. The combination of vMF and PC sampling significantly improves results on Sudoku and language modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes spherical flows for generative modeling of categorical sequences by embedding tokens on the sphere S^{d-1} and using von Mises-Fisher (vMF) distributions to induce a noise process with closed-form conditional score. It claims that radial symmetry of the vMF density reduces the continuity equation on S^{d-1} to a scalar ODE in the cosine similarity whose unique bounded solution determines the (conditional) velocity. The marginal velocity and marginal score on (S^{d-1})^L then decompose into posterior-weighted tangent sums (differing only by per-token scalar weights), granting access to both ODE and predictor-corrector sampling. The posterior is the sole learned component, trained by cross-entropy loss. Experiments compare the vMF path to geodesic and Euclidean alternatives and report significant improvements on Sudoku and language-modeling tasks when combining vMF with PC sampling.
Significance. If the reduction and decomposition are valid, the work supplies a new continuous-space framework for discrete sequence generation that naturally yields both probability-flow ODE and predictor-corrector samplers from a single learned posterior. The posterior-weighted tangent-sum decomposition is a clean technical device that could generalize beyond the vMF case. The empirical gains on Sudoku and language modeling, obtained without fitting additional parameters beyond the posterior, would constitute a practical advance for flow-based discrete models if the underlying math holds.
major comments (2)
- [Abstract] Abstract (paragraph on vMF path): the claim that radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity is load-bearing for the entire construction. The provided text supplies neither the explicit ODE, the boundary/initial conditions, nor a uniqueness/boundedness argument; without these, it is impossible to verify that the derived velocity satisfies the original PDE on the tangent bundle or that the subsequent decomposition into posterior-weighted sums on (S^{d-1})^L remains valid.
- [Abstract] Abstract (decomposition paragraph): the statement that marginal velocity and marginal score 'both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights' is asserted without derivation steps or error analysis. Because this decomposition is what grants access to both ODE and PC sampling, the manuscript must exhibit the explicit posterior-weighted expressions (presumably in §4 or §5) and confirm they arise directly from the conditional velocity derived in the preceding reduction.
minor comments (2)
- [Experiments] The experimental section should report the precise embedding dimension d, sequence length L, and concentration schedule used for the vMF path, together with the number of function evaluations for the ODE and PC samplers, to allow direct comparison with the geodesic and Euclidean baselines.
- [Abstract] Notation for the product manifold (S^{d-1})^L and the tangent-space projections should be introduced once and used consistently; the current abstract mixes 'tangent sums' and 'per-token scalar weights' without defining the projection operator.
Simulated Author's Rebuttal
We thank the referee for the thorough review and for recognizing the potential significance of the spherical flow framework. We address the two major comments below point by point.
Point-by-point responses
- Referee: [Abstract] Abstract (paragraph on vMF path): the claim that radial symmetry reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity whose 'unique bounded solution' determines the velocity is load-bearing for the entire construction. The provided text supplies neither the explicit ODE, the boundary/initial conditions, nor a uniqueness/boundedness argument; without these, it is impossible to verify that the derived velocity satisfies the original PDE on the tangent bundle or that the subsequent decomposition into posterior-weighted sums on (S^{d-1})^L remains valid.
Authors: The explicit scalar ODE, along with the initial condition v(0)=0 and the boundedness requirement ensuring |v(t)| remains bounded by the sphere's geometry, is derived in Section 3 using the radial symmetry of the vMF. Uniqueness follows from the contraction mapping principle on the space of continuous functions. The derivation confirms the velocity is tangent to the sphere. We will revise the manuscript to include a cross-reference from the abstract to Section 3 and ensure all steps are explicitly labeled.
revision: partial
- Referee: [Abstract] Abstract (decomposition paragraph): the statement that marginal velocity and marginal score 'both decompose into posterior-weighted tangent sums that differ only by per-token scalar weights' is asserted without derivation steps or error analysis. Because this decomposition is what grants access to both ODE and PC sampling, the manuscript must exhibit the explicit posterior-weighted expressions (presumably in §4 or §5) and confirm they arise directly from the conditional velocity derived in the preceding reduction.
Authors: Section 4 derives the marginal velocity as the posterior-weighted sum of the conditional velocities, specifically $v_{\mathrm{marginal}} = \sum_i p(x_i \mid x)\, v_{\mathrm{conditional},i}$, and the marginal score similarly with weights differing by a factor derived from the vMF normalization. These expressions arise directly by marginalizing the conditional quantities using the learned posterior. We will add the explicit formulas to the abstract where feasible and include a short derivation summary in the introduction for clarity.
revision: partial
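The posterior-weighted sum described in this response can be sketched in a few lines. This is an illustration under assumptions, not the manuscript's Section 4: `E` is a hypothetical matrix of unit-norm token embeddings, `post` holds per-position posterior probabilities over tokens, and `weights` holds the per-token scalars. For the score, the sketch uses the fact that the log vMF density $\kappa\langle\mu, x\rangle$ has tangential gradient $\kappa\,(\mu - \langle\mu, x\rangle x)$, so the weights are a constant `kappa`. For the velocity, the weights would instead come from the scalar-ODE solution, which is not reproduced here.

```python
import numpy as np

def normalize(x):
    """Rescale each row of x to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def tangent_project(x, v):
    """Remove from v its component along x, row by row (x has unit rows)."""
    return v - np.sum(x * v, axis=-1, keepdims=True) * x

def posterior_weighted_tangent_sum(x, E, post, weights):
    """Sum over tokens k of post[l, k] * weights[l, k] * P_{x_l}(E[k]) for each position l.

    x       : (L, d) current point, unit-norm rows
    E       : (V, d) token embeddings, unit-norm rows (hypothetical)
    post    : (L, V) posterior over tokens, rows sum to 1
    weights : (L, V) per-token scalar weights
    """
    # Projection is linear, so the sum can be formed in ambient space first.
    ambient = (post * weights) @ E
    return tangent_project(x, ambient)

def marginal_score(x, E, post, kappa):
    """Marginal score with constant per-token weight kappa (vMF conditional score)."""
    return posterior_weighted_tangent_sum(x, E, post, np.full_like(post, kappa))
```

Passing a different `weights` array to `posterior_weighted_tangent_sum` is the only change needed to switch from score to velocity in this sketch, which mirrors the claim that the two differ only by per-token scalars.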
Circularity Check
The derivation is self-contained via the symmetry reduction; no fitted inputs or self-citations are load-bearing.
Full rationale
The paper's core step reduces the continuity equation on S^{d-1} to a scalar ODE in cosine similarity by exploiting vMF radial symmetry, then asserts a unique bounded solution yields the conditional velocity. This is a direct mathematical derivation from the PDE and density properties, not a redefinition or fit to model outputs. The marginal velocity/score decompositions follow from that solution plus posterior weighting, with the posterior itself trained separately by cross-entropy on external data. No equations equate a claimed prediction to a fitted parameter by construction, no self-citation chains justify uniqueness, and no ansatz is smuggled. The construction remains independent of the learned posterior values.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The von Mises-Fisher distribution induces a natural noise process and admits a closed-form conditional score on the sphere.
- domain assumption: The radial symmetry of the vMF density permits reduction of the continuity equation on S^{d-1} to a scalar ODE in cosine similarity.
Forward citations
Cited by 1 Pith paper
- Self-conditioned Flow Map Language Models via Fixed-point Flows
Self-conditioned flow language models solve fixed-point iterations, enabling fixed-point flow maps that distill into FMLM* which outperforms SOTA in few-step generation on OpenWebText.