Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

Junhui Li; Rui Ma; Tieru Wu; Weiguang Zhao; Wenjian Zhang; Wuyang Luan

arxiv: 2604.05673 · v3 · pith:OUTCSWCOnew · submitted 2026-04-07 · 💻 cs.RO · cs.AI

Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma

Pith reviewed 2026-05-10 19:20 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords Schrödinger Bridge · visual navigation · diffusion models · embodied AI · optimal transport · few-step integration · velocity field · generative policies

The pith

A single velocity network works across all regularization strengths in Schrödinger Bridge policies, enabling 3-step visual navigation at 92% success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the functional form of the conditional velocity field stays the same for any value of the entropic regularization parameter ε in Schrödinger Bridges. This invariance lets one trained network handle every regularization strength from maximum-entropy stochastic transport down to near-deterministic optimal transport. Lowering ε also reduces velocity variance linearly, which stabilizes integration even when large time steps are taken. Anchoring the process to a learned conditional prior that shortens transport paths lets the method sit at an intermediate ε that keeps both multimodal coverage and path straightness. Readers should care because standard diffusion and bridge policies need dozens of steps and therefore cannot run in real time on robots.
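A short worked derivation may make the invariance claim more concrete. It assumes the standard Brownian-bridge conditional path with linear time t and generic endpoints x_0, x_1. These symbols and this parametrization are illustrative assumptions; the paper's own statement uses a schedule s_t that is not defined on this page, so the constants below need not match the paper's.

```latex
\begin{aligned}
x_t \mid x_0, x_1 &\sim \mathcal{N}\!\big(\mu_t,\ \sigma_{\varepsilon,t}^2 I\big),
  \qquad \mu_t = (1-t)\,x_0 + t\,x_1,
  \qquad \sigma_{\varepsilon,t}^2 = \varepsilon\, t(1-t), \\
v_t(x \mid x_0, x_1) &= \frac{d \log \sigma_{\varepsilon,t}}{dt}\,(x - \mu_t) + \frac{d\mu_t}{dt}, \\
\frac{d \log \sigma_{\varepsilon,t}}{dt}
  &= \frac{1}{2}\,\frac{d}{dt}\big[\log\varepsilon + \log t + \log(1-t)\big]
   = \frac{1-2t}{2\,t(1-t)}, \\
\operatorname{Var}\!\big[v_t \mid x_0, x_1\big]
  &= \Big(\frac{1-2t}{2\,t(1-t)}\Big)^{2}\, \varepsilon\, t(1-t)\, I
   = \varepsilon\,\frac{(1-2t)^2}{4\,t(1-t)}\, I .
\end{aligned}
```

Under these assumptions ε drops out of the coefficient multiplying (x − μ_t), so the velocity has the same functional form for every ε, while the conditional variance of that velocity is proportional to ε.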

Core claim

We prove that the conditional velocity field's functional form is invariant across the entire ε-spectrum, enabling a single network to serve all regularization strengths, and that reducing ε linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate ε that balances multimodal coverage and path straightness, achieving over 94% cosine similarity and 92% success rate in merely 3 integration steps without distillation or multi-stage training.
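The two claimed properties can be checked numerically on a toy case. The sketch below assumes the standard one-dimensional Brownian-bridge path with variance ε·t·(1−t); the function names and the endpoints `x0`, `x1` are hypothetical, and this is not the paper's implementation or its s_t schedule.

```python
import math
import random


def log_std_derivative(t: float) -> float:
    """d/dt log(sigma_t) for sigma_t^2 = eps * t * (1 - t); eps cancels out."""
    return (1.0 - 2.0 * t) / (2.0 * t * (1.0 - t))


def conditional_velocity(x: float, x0: float, x1: float, t: float) -> float:
    """Conditional velocity of the Gaussian bridge path (no eps argument needed)."""
    mu_t = (1.0 - t) * x0 + t * x1
    return log_std_derivative(t) * (x - mu_t) + (x1 - x0)


def velocity_variance_mc(eps: float, t: float, x0: float = 0.0, x1: float = 1.0,
                         n: int = 50_000, seed: int = 0) -> float:
    """Monte Carlo estimate of Var[v_t | x0, x1] when x_t is drawn from the bridge."""
    rng = random.Random(seed)
    mu_t = (1.0 - t) * x0 + t * x1
    sigma_t = math.sqrt(eps * t * (1.0 - t))
    vs = [conditional_velocity(rng.gauss(mu_t, sigma_t), x0, x1, t) for _ in range(n)]
    mean = sum(vs) / n
    return sum((v - mean) ** 2 for v in vs) / n


def velocity_variance_exact(eps: float, t: float) -> float:
    """Closed form under the same assumption: eps * (1 - 2t)^2 / (4 t (1 - t))."""
    return eps * (1.0 - 2.0 * t) ** 2 / (4.0 * t * (1.0 - t))


if __name__ == "__main__":
    for eps in (1.0, 0.5, 0.1):
        print(eps, velocity_variance_mc(eps, t=0.2), velocity_variance_exact(eps, t=0.2))
```

Halving ε halves the estimated variance, while `conditional_velocity` itself never takes ε as an input, which is the toy analogue of a single network serving all regularization strengths.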

What carries the argument

Rectified Schrödinger Bridge Matching (RSBM) framework controlled by the entropic regularization parameter ε, which exploits velocity structure invariance between standard Schrödinger Bridges and deterministic optimal transport.

If this is right

One network trained at any single ε can be reused for every other regularization strength.
Coarse-step ODE integration becomes more stable because the conditional velocity variance decreases linearly as ε is reduced.
Generative policies reach real-time latency while retaining multimodal action distributions.
No distillation or multi-stage training is required to reach few-step performance.
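To illustrate why path straightness matters for coarse steps, here is a minimal Euler-integration sketch. The two velocity fields are toy stand-ins (hypothetical, not the learned v_θ): one is constant in time, the other varies in time as a proxy for a high-curvature path. Both transport 0 to 1 exactly under continuous-time integration.

```python
import math
from typing import Callable

Velocity = Callable[[float, float], float]

X_START, X_END = 0.0, 1.0


def euler_integrate(velocity: Velocity, x_start: float, num_steps: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = x_start
    h = 1.0 / num_steps
    for i in range(num_steps):
        t = i * h
        x = x + h * velocity(x, t)
    return x


def straight_field(x: float, t: float) -> float:
    """Constant velocity: the path is a straight line, so Euler is exact for any step count."""
    return X_END - X_START


def curved_field(x: float, t: float) -> float:
    """Time-varying velocity whose exact integral over [0, 1] is also X_END - X_START."""
    return (X_END - X_START) * (math.pi / 2.0) * math.cos(math.pi * t / 2.0)


if __name__ == "__main__":
    for k in (3, 10, 30):
        err_straight = abs(euler_integrate(straight_field, X_START, k) - X_END)
        err_curved = abs(euler_integrate(curved_field, X_START, k) - X_END)
        print(f"k={k:2d}  straight error={err_straight:.4f}  curved error={err_curved:.4f}")
```

With three steps the straight field lands on the endpoint, while the time-varying field overshoots noticeably and needs many more steps to get close. This is only the generic discretization argument; it does not reproduce the paper's reported numbers.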

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same invariance could let practitioners switch ε on the fly during deployment to trade off exploration and efficiency.
Similar rectification might shorten sampling in other bridge-based or flow-matching models used for robotic control.
The approach may extend to non-visual high-dimensional control tasks where long-horizon multimodal actions are needed.

Load-bearing premise

A learned conditional prior reliably shortens transport distance and the velocity structure invariance holds in practice for high-dimensional visual observations without extra training or adjustments.
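The premise about transport distance can be pictured with a toy comparison. The sketch below is a one-dimensional illustration under loud assumptions: targets are drawn uniformly, the "noise" source is a standard Gaussian, and the "prior" source is the target plus Gaussian error of standard deviation 0.5. All of these choices are hypothetical and unrelated to the paper's datasets or its prior network.

```python
import random


def mean_sq_transport(sources, targets):
    """Mean squared distance each source sample must travel to reach its paired target."""
    return sum((s - t) ** 2 for s, t in zip(sources, targets)) / len(targets)


rng = random.Random(0)
targets = [rng.uniform(-3.0, 3.0) for _ in range(10_000)]

# Source 1: uninformed Gaussian noise, independent of the target.
noise_sources = [rng.gauss(0.0, 1.0) for _ in targets]

# Source 2: hypothetical coarse prior that already lands near the target (error std 0.5).
prior_sources = [t + rng.gauss(0.0, 0.5) for t in targets]

if __name__ == "__main__":
    print("noise -> target:", mean_sq_transport(noise_sources, targets))
    print("prior -> target:", mean_sq_transport(prior_sources, targets))
```

If the prior is poor, for example biased or with error comparable to the spread of the targets, the gap closes, which is why the premise is load-bearing.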

What would settle it

Measuring whether cosine similarity between predicted and ground-truth actions falls below 90% or success rate falls below 80% when the trained network is evaluated with only three integration steps on new visual navigation environments.
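A minimal sketch of that check follows. It assumes actions are compared as flat vectors; how the paper aggregates cosine similarity over waypoints or episodes is not stated here, so the helper names and the flattening are assumptions. The 90% and 80% floors are the ones named above.

```python
import math
from typing import Sequence


def cosine_similarity(pred: Sequence[float], target: Sequence[float]) -> float:
    """Cosine of the angle between a predicted and a ground-truth action vector."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    dot = sum(p * q for p, q in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(q * q for q in target))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def passes_thresholds(cos_sim: float, success_rate: float,
                      cos_floor: float = 0.90, success_floor: float = 0.80) -> bool:
    """True if both metrics stay at or above the floors used in the falsification test."""
    return cos_sim >= cos_floor and success_rate >= success_floor


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [1.0, 0.1]))
    print(passes_thresholds(cos_sim=0.94, success_rate=0.92))
```

The claim would be in trouble if `passes_thresholds` returned `False` for three-step rollouts in environments the network has not seen.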

Figures

Figures reproduced from arXiv: 2604.05673 by Junhui Li, Rui Ma, Tieru Wu, Weiguang Zhao, Wenjian Zhang, Wuyang Luan.

**Figure 2.** Overview of the RSBM framework. Left: A dual-stream EfficientNet-B0 vision encoder f_ϕ (§III-A) extracts observation and goal features, which are fused via positional encoding and self-attention into a context vector c ∈ ℝ^256. Center: A learned variational prior network g_ψ (§III-A) produces a coarse action prior a_T. Right: A conditional U-Net 1D velocity network v_θ (§III-C) with FiLM conditioning iterativ…

**Figure 3.** Figure 3 dissects ε: ε = 1.0 recovers standard SB with high-curvature paths; very small ε over-regularizes; ε = 0.5 balances multimodal coverage with few-step fidelity. Disentangling prior and bridge contributions: Table VII reports five configurations isolating the effect of the learned prior and ε-rectification. The learned prior reduces transport distance, lowering MSE from 12.0 to 5.8 (2.1×), while ε-rectificat…

**Figure 4.** Quality–cost Pareto frontier. Each marker represents a method at a given sampling budget (k). (a) CosSim vs. NFE; (b) Success Rate vs. NFE. RSBM at k = 3 (NFE = 5) lies on the favorable frontier region, providing strong quality at substantially lower evaluations. Table II (per-dataset generalization): action MSE↓ and CosSim↑ across five diverse real-world datasets. RSBM (k = 3) consistently matches or exceeds …

**Figure 5.** Qualitative trajectory comparison across eight challenging scenarios (2×4 grid, k = 3, NFE = 5). Top row: four indoor/structured environments. Bottom row: four large-scale environments. Baselines collide early (×); faint dotted lines show invalid ghost continuations. RSBM (green) remains collision-free and closely tracks the ground truth (dashed gray). … d = 256). The prior network g_ψ is a 3-layer MLP conditio…

Original abstract

Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into continuous, long-horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high-variance stochastic transport, posing a critical barrier for real-time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity-field structure between standard Schrödinger Bridges ($\varepsilon=1$, maximum-entropy transport) and deterministic Optimal Transport ($\varepsilon\to 0$, as in Conditional Flow Matching), controlled by a single entropic regularization parameter $\varepsilon$. We prove two key results: (1) the conditional velocity field's functional form is invariant across the entire $\varepsilon$-spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing $\varepsilon$ linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate $\varepsilon$ that balances multimodal coverage and path straightness. Empirically, while standard bridges require $\geq 10$ steps to converge, RSBM achieves over 94% cosine similarity and 92% success rate in merely 3 integration steps -- without distillation or multi-stage training -- substantially narrowing the gap between high-fidelity generative policies and the low-latency demands of Embodied AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RSBM gives a clean invariance result that could let one network handle different regularization strengths for faster bridge sampling in navigation, but the whole thing rests on a learned prior whose reliability in high-dim vision isn't shown.

The main thing to know is that this paper claims a velocity-field invariance across the full range of the entropic parameter ε, plus a linear drop in variance as ε shrinks. That combination is supposed to let a single model run stable ODE integration in just three steps for visual navigation policies, hitting 94% cosine similarity and 92% success without distillation. If the proofs hold, it directly attacks the step-count barrier that keeps generative policies off real robots right now.

Referee Report

2 major / 2 minor

Summary. The paper proposes Rectified Schrödinger Bridge Matching (RSBM) for few-step visual navigation. It claims to prove that the conditional velocity field's functional form is invariant across the ε-spectrum of Schrödinger Bridges (Velocity Structure Invariance) and that reducing ε linearly decreases conditional velocity variance, enabling stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at intermediate ε and reports over 94% cosine similarity and 92% success rate in 3 integration steps without distillation or multi-stage training.

Significance. If the invariance and variance-reduction results hold and generalize beyond the reported setting, the work could meaningfully advance real-time deployment of generative policies in Embodied AI by closing the gap between high-fidelity multimodal action modeling and low-latency control requirements.

major comments (2)

§3 (Method/Theoretical Analysis): The proof of Velocity Structure Invariance is asserted to hold independently across the ε-spectrum, but the derivation details are not fully expanded; it is unclear whether the invariance is shown to be independent of the specific form of the learned conditional prior or reduces to a property of the chosen reference measure.
§4 (Experiments): The reported 94% cosine similarity and 92% success rate in 3 steps are presented without ablations that isolate the learned conditional prior's contribution to transport-distance shortening versus the ε-variance reduction alone, nor direct comparisons to standard SB at the same step count; this leaves the central empirical claim dependent on an unverified precondition.

minor comments (2)

Notation for the conditional velocity field v_ε and the prior could be introduced with an explicit equation early in the text for clarity.
Figure captions and axis labels in the navigation results should explicitly state the number of integration steps and ε values used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential impact of RSBM on real-time generative policies in Embodied AI. We address each major comment below and have revised the manuscript accordingly to strengthen both the theoretical exposition and the empirical validation.

Referee: §3 (Method/Theoretical Analysis): The proof of Velocity Structure Invariance is asserted to hold independently across the ε-spectrum, but the derivation details are not fully expanded; it is unclear whether the invariance is shown to be independent of the specific form of the learned conditional prior or reduces to a property of the chosen reference measure.

Authors: We appreciate this observation. The proof of Velocity Structure Invariance (Theorem 1 in §3.2) establishes that the functional form of the conditional velocity field remains identical across the ε-spectrum because it follows directly from the Girsanov change of measure between the reference Wiener process and the Schrödinger Bridge marginals; the derivation is independent of the particular learned conditional prior π(x0,x1) and holds for any reference measure whose drift satisfies the required martingale property. To improve clarity, we have expanded the proof in the revised §3.2 with all intermediate steps (including the explicit computation of the Radon-Nikodym derivative and the resulting velocity expression) and added a remark explicitly stating its independence from the form of the conditional prior.
Revision: yes

Referee: §4 (Experiments): The reported 94% cosine similarity and 92% success rate in 3 steps are presented without ablations that isolate the learned conditional prior's contribution to transport-distance shortening versus the ε-variance reduction alone, nor direct comparisons to standard SB at the same step count; this leaves the central empirical claim dependent on an unverified precondition.

Authors: We agree that isolating the two mechanisms strengthens the central claim. While the original experiments already include overall comparisons of RSBM against standard SB (showing the latter requires ≥10 steps), we did not provide explicit ablations that turn the learned prior on/off or fix ε=1 while varying step count. In the revised manuscript we have added (i) a new ablation table in §4.3 that reports 3-step performance with and without the learned conditional prior at the same intermediate ε, and (ii) direct head-to-head results for standard SB at exactly 3 integration steps. These additions confirm that both the prior-induced distance shortening and the ε-variance reduction are necessary for the reported performance.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain.

The abstract presents two explicit mathematical proofs (Velocity Structure Invariance of the conditional velocity field across the full ε-spectrum, and linear decrease in conditional velocity variance with ε) as independent derivations that justify using a single network and coarser ODE steps. These are not shown to reduce by construction to fitted parameters or self-citations. The anchoring to a learned conditional prior is stated as a design premise that shortens transport distance, but the performance claims (94% cosine similarity, 92% success in 3 steps) are reported as empirical outcomes rather than predictions forced from the prior by definition. No load-bearing step in the provided text equates a result to its own inputs via renaming, ansatz smuggling, or uniqueness imported from prior self-work. The framework remains self-contained with external experimental validation.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified invariance of the conditional velocity field form and the linear effect of ε on variance, plus reliance on a learned conditional prior whose training is not detailed; ε serves as the main tunable element.

free parameters (1)

ε
Entropic regularization parameter that controls the spectrum from maximum-entropy SB to deterministic OT and is adjusted to balance coverage and straightness.

axioms (1)

Domain assumption: the conditional velocity field's functional form remains invariant across all ε values.
Invoked as the basis for using a single network and for the rectification benefit.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Theorem 1 (Velocity Structure Invariance). ... the logarithmic derivative of the standard deviation satisfies d log σ_ε,t / dt = (1−2s_t)/[t(1−s_t)], which is independent of ε.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Proposition 1 (Velocity Variance Reduction). Var[v*_t | a_0, a_T] = ε · (1−2s_t)^2 / (1−s_t) · I_D
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Anchored to a learned conditional prior that shortens transport distance


What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.