pith. sign in

arxiv: 2606.01847 · v1 · pith:PHCQ6K63new · submitted 2026-06-01 · 💻 cs.RO · cs.LG

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

Pith reviewed 2026-06-28 14:29 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords SE(3)diffusion policiesvision language actionLie groupsmanifold learningequivariancescore matchingrobot manipulation
0
0 comments X

The pith

Representing SE(3) poses as flat vectors in diffusion VLA policies causes manifold drift, broken equivariance, and non-geodesic paths, which a tangent-space diffusion process on SE(3) corrects by construction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current diffusion-based vision-language-action policies treat 6D robot poses as ordinary vectors in 12-dimensional space. This approximation produces rotations that drift off the valid manifold, actions that change under coordinate shifts, and robot motions longer than the shortest paths on the space. The paper introduces Lie Diffuser Actor, which adds noise through left-invariant stochastic differential equations on SE(3), predicts scores in the tangent space, and retracts samples using the exponential map. If this construction works, policies generate only valid poses that respect the geometry without extra projection steps. Readers care because these geometric mismatches likely explain why many learned policies underperform when transferred from simulation to physical robots.

Core claim

The paper claims that the Euclidean Fallacy of representing SE(3) poses as flat R^12 vectors in diffusion VLA policies induces manifold drift violating SO(3) constraints, broken equivariance under coordinate transformations, and non-geodesic trajectories with excessive kinematic cost. The Lie Diffuser Actor framework corrects this by injecting noise via left-invariant SDEs on SE(3), predicting scores in the tangent space, and retracting samples via the exponential map, thereby eliminating manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality, demonstrated by raising average task length from 3.27 to 3.51 on CALVIN ABC to D and outperforming ba

What carries the argument

Lie Diffuser Actor (LDA), a diffusion framework that operates intrinsically on SE(3) by using left-invariant SDEs for noise injection, tangent-space score prediction, and exponential-map retraction.

If this is right

  • Generated poses remain on the SE(3) manifold without violating rotation constraints.
  • Actions produced by the policy stay equivariant under changes to the coordinate frame.
  • Motion trajectories follow geodesics and incur lower kinematic cost than Euclidean approximations.
  • Task success rates rise on manipulation benchmarks, as measured by the 7.3 percent gain in average task length on CALVIN.
  • Real-robot deployments require fewer manual fixes for geometric violations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tangent-space approach could be applied to other Lie-group state spaces common in robotics, such as full rigid-body configurations.
  • Training data collection pipelines might benefit from explicit manifold-aware sampling to amplify the method's advantages.
  • This formulation reduces the engineering effort spent on post-hoc projection layers in other diffusion-based controllers.
  • Controlled experiments that deliberately rotate the world coordinate frame during evaluation would isolate the contribution of the built-in equivariance.

Load-bearing premise

The left-invariant SDE noise process together with tangent-space score prediction and exponential-map retraction produces samples whose distribution matches the data manifold on SE(3) without additional projection or constraint steps.

What would settle it

Generating many samples from the trained model and finding that a non-negligible fraction of the resulting rotation matrices fail the orthogonality test R transpose R equals the identity matrix would show that manifold drift is not eliminated by construction.

Figures

Figures reproduced from arXiv: 2606.01847 by Bing-Cheng Chuang, Bor-Jiun Lin, Chun-Yi Lee, I-Hsuan Chu, Min Sun, YuanFu Yang.

Figure 1
Figure 1. Figure 1: Euclidean Diffusion vs. Lie Diffu￾sion. (Top) Extrinsic dif￾fusion treats the curved SE(3) manifold as flat R n , which causes drift and produces invalid ge￾ometries. (Bottom) The proposed Lie Diffuser Actor injects noise as tan￾gent twists ξ ∈ se(3), ensuring that trajectories follow valid geodesics. but discretization introduces quantization errors and loses geometric structure essential for precise mani… view at source ↗
Figure 3
Figure 3. Figure 3: The overall architectural details for the proposed Lie Diffuser Actor. the Lie group. For pose gt ∈ SE(3) evolving over t ∈ [0, T], the forward diffusion takes the Stratonovich form as follows: dgt = gt · [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Real robot evaluation environments [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Manifold constraint analysis during reverse diffusion. (a) Quaternion trajectories projected onto S 3 for Euclidean (red) and Lie (teal) diffusion. Circle markers denote the starting point (t = T, pure noise), and star markers denote the final denoised output (t = 0). The Euclidean trajectory departs from the unit sphere, while the Lie trajectory remains on the manifold. (b) Orthogonality violation ∥R⊤R − … view at source ↗
Figure 6
Figure 6. Figure 6: Look-ahead Stability and Geodesic Jitter. Geodesic jit￾ter (angular distance in degrees) between consecutive look-ahead predictions xˆ (t) 0 and xˆ (t−1) 0 during reverse diffusion. The Eu￾clidean baseline (red) exhibits up to two orders of magnitude higher jitter than LDA (teal), particularly in early diffusion steps. cess. This instability is a direct consequence of manifold drift inherent to Euclidean m… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison of real robot execution sequences. Each pair of rows shows our Lie Diffuser Actor (top) versus the baseline 3D Diffuser Actor (bottom) for the four manipulation tasks: Stack Cups (rows 1-2), Put Block in Box (rows 3-4), Move Doll Platform (rows 5-6), and Sort Blocks (rows 7-8). Sequential frames progress from left to right, showing the complete manipulation trajectory. Our method exh… view at source ↗
read the original abstract

Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce $\textbf{Lie Diffuser Actor (LDA)}$, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.27$ to $3.51$ ($+7.3\%$). We further validate our method on real robot and the results show that our methodology outperforms the baseline on majority tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper argues that diffusion-based vision-language-action policies commit the 'Euclidean Fallacy' by representing SE(3) poses as vectors in R^12, inducing manifold drift, broken equivariance, and non-geodesic trajectories. It introduces Lie Diffuser Actor (LDA), which injects noise via left-invariant SDEs on SE(3), predicts scores in the tangent space, and retracts via the exponential map. This is claimed to eliminate drift by construction, guarantee coordinate-frame equivariance, and ensure geodesic optimality. On the CALVIN ABC→D benchmark, LDA raises average task length from 3.27 to 3.51 (+7.3%), with additional outperformance reported on real-robot tasks.

Significance. If the geometric guarantees are rigorously established and the performance gains are shown to arise specifically from the Lie-group formulation, the work would be significant for robotics. It directly targets a common representational error in pose diffusion models and supplies a standard Lie-group construction that preserves manifold structure, equivariance, and geodesics. The modest benchmark improvement on CALVIN is consistent with the claimed benefits, though the absence of isolating ablations limits attribution.

major comments (2)
  1. Abstract: the central claim that left-invariant SDEs, tangent-space score prediction, and exponential-map retraction 'eliminate manifold drift by construction' and 'guarantee coordinate-frame equivariance and geodesic optimality' is load-bearing, yet the abstract supplies neither the explicit SDE form nor a proof of these properties; without these the attribution of any empirical gain to the geometric components cannot be verified.
  2. Abstract (results paragraph): the reported improvement from 3.27 to 3.51 on CALVIN is presented without ablations that isolate the contribution of the left-invariant SDE, tangent-space scoring, or exponential retraction versus other implementation choices; this omission prevents confirmation that the geometric machinery, rather than incidental factors, drives the +7.3% gain.
minor comments (1)
  1. Abstract: the real-robot validation states only that the method 'outperforms the baseline on majority tasks' without quantitative metrics or task breakdown, reducing the strength of the empirical support.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We address each major comment below, providing clarifications from the main text and indicating where revisions will be made to strengthen the presentation.

read point-by-point responses
  1. Referee: [—] Abstract: the central claim that left-invariant SDEs, tangent-space score prediction, and exponential-map retraction 'eliminate manifold drift by construction' and 'guarantee coordinate-frame equivariance and geodesic optimality' is load-bearing, yet the abstract supplies neither the explicit SDE form nor a proof of these properties; without these the attribution of any empirical gain to the geometric components cannot be verified.

    Authors: The abstract provides a concise overview, while the technical details are elaborated in the body of the paper. Specifically, the left-invariant SDE is formulated in Equation (3) of Section 3, the tangent-space score prediction is described in Section 3.2, and the exponential map retraction in Section 3.3. The geometric properties—no manifold drift by construction, coordinate-frame equivariance, and geodesic optimality—are rigorously established in Propositions 1, 2, and 3 of Section 4, respectively. These sections supply the explicit forms and proofs needed to verify the claims. To improve accessibility, we will revise the abstract to include a brief reference to these results. revision: partial

  2. Referee: [—] Abstract (results paragraph): the reported improvement from 3.27 to 3.51 on CALVIN is presented without ablations that isolate the contribution of the left-invariant SDE, tangent-space scoring, or exponential retraction versus other implementation choices; this omission prevents confirmation that the geometric machinery, rather than incidental factors, drives the +7.3% gain.

    Authors: We agree that additional ablations would better isolate the contributions of each geometric component. The current manuscript includes comparisons to the Euclidean baseline and some sensitivity analyses, but lacks a complete set of ablations for the individual elements (e.g., left-invariant vs. Euclidean SDE while keeping other components fixed). We will include these isolating ablations in the revised manuscript to more clearly attribute the performance gains to the Lie-group formulation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims follow from standard Lie-group constructions

full rationale

The paper's derivation chain relies on injecting noise via left-invariant SDEs on SE(3), tangent-space score prediction, and exponential-map retraction. These steps are presented as directly yielding the claimed properties (no manifold drift, equivariance, geodesic optimality) by the geometry of the Lie group, which is a standard external fact rather than an internal fit or self-definition. No equations reduce a prediction to a parameter fitted on the same data, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The reported +7.3% gain is an empirical result on the external CALVIN benchmark, not a quantity defined by the method itself. The derivation is therefore self-contained against external geometric benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard differential geometry of Lie groups; no new free parameters, ad-hoc constants, or postulated entities are introduced in the abstract.

axioms (2)
  • standard math Left-invariant stochastic differential equations on SE(3) preserve the manifold structure
    Invoked when describing noise injection in the LDA framework.
  • standard math The exponential map is a valid retraction from the tangent space to SE(3)
    Used for the final retraction step that returns samples to the manifold.

pith-pipeline@v0.9.1-grok · 5740 in / 1468 out tokens · 34973 ms · 2026-06-28T14:29:09.967457+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion , author =. Proc. Robotics: Science and Systems (RSS) , year =

  2. [2]

    Denoising Diffusion Probabilistic Models , author =. Proc. Conf. on Neural Information Processing Systems (NeurIPS) , year =

  3. [3]

    Score-Based Generative Modeling through Stochastic Differential Equations , author =. Proc. Int. Conf. on Learning Representations (ICLR) , year =

  4. [4]

    3D Diffuser Actor: Policy Diffusion with 3D Scene Representations , author =. Proc. Conf. on Robot Learning (CoRL) , year =

  5. [5]

    3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations , author =. Proc. Robotics Science and Systems (RSS) , year =

  6. [6]

    arXiv , year =

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation , author =. arXiv , year =

  7. [7]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots , author =. Proc. Robotics: Science and Systems (RSS) , year =

  8. [8]

    SE(3)-Equivariant Diffusion Policy in Spherical Fourier Space , author =. Proc. Int. Conf. on Machine Learning (ICML) , year =

  9. [9]

    EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning , author =. Proc. Conf. on Robot Learning (CoRL) , year =

  10. [10]

    ET-SEED: Efficient Trajectory-Level SE(3) Equivariant Diffusion Policy , author =. Proc. Int. Conf. on Learning Representations (ICLR) , year =

  11. [11]

    arXiv , year =

    RT-1: Robotics Transformer for Real-World Control at Scale , author =. arXiv , year =

  12. [12]

    arXiv , year =

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control , author =. arXiv , year =

  13. [13]

    OpenVLA: An Open-Source Vision-Language-Action Model , author =. Proc. Conf. on Robot Learning (CoRL) , year =

  14. [14]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success , author=. arXiv preprint arXiv:2502.19645 , year=

  15. [15]

    arXiv , year =

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model , author =. arXiv , year =

  16. [16]

    arXiv , year =

    HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model , author =. arXiv , year =

  17. [17]

    arXiv , year =

    VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation , author =. arXiv , year =

  18. [18]

    arXiv , year =

    WorldVLA: Towards Autoregressive Action World Model , author =. arXiv , year =

  19. [19]

    arXiv , year =

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots , author =. arXiv , year =

  20. [20]

    arXiv , year =

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation , author =. arXiv , year =

  21. [21]

    arXiv , year =

    OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation , author =. arXiv , year =

  22. [22]

    Diffusion-VLA: Generalizable and Interpretable Robot Foundation Model via Self-Generated Reasoning , author =. Proc. Int. Conf. on Machine Learning (ICML) , year =

  23. [23]

    ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning , author =. Proc. Conf. on Neural Information Processing Systems (NeurIPS) , year =

  24. [24]

    arXiv , year =

    _0 : A Vision-Language-Action Flow Model for General Robot Control , author =. arXiv , year =

  25. [25]

    , journal =

    Black, Kevin and Brown, Noah and Driess, Danny and Esmail, Adnan and Equi, Michael and et al. , journal =. _. 2025 , volume =

  26. [26]

    arXiv , year =

    GR-3 Technical Report , author =. arXiv , year =

  27. [27]

    arXiv , year =

    villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models , author =. arXiv , year =

  28. [28]

    Riemannian Score-Based Generative Modelling , author =. Proc. Conf. on Neural Information Processing Systems (NeurIPS) , year =

  29. [29]

    Flow Matching on General Geometries , author =. Proc. Int. Conf. on Learning Representations (ICLR) , year =

  30. [30]

    Riemannian Flow Matching Policy for Robot Motion Learning , author =. Proc. Int. Conf. on Intelligent Robots and Systems (IROS) , year =

  31. [31]

    A micro lie theory for state estimation in robotics.arXiv preprint arXiv:1812.01537, 2018

    A Micro Lie Theory for State Estimation in Robotics , author =. arXiv preprint arXiv:1812.01537 , year =

  32. [32]

    Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3) , author =. Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) , year =

  33. [33]

    SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion , author =. Proc. IEEE Int. Conf. on Robotics and Automation (ICRA) , year =

  34. [34]

    Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation , author =. Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) , year =

  35. [35]

    Scaling Riemannian Diffusion Models , author =. Proc. Conf. on Neural Information Processing Systems (NeurIPS) , year =

  36. [36]

    arXiv , year =

    Diffusion Generative Modeling on Lie Group Representations , author =. arXiv , year =

  37. [37]

    RLBench: The Robot Learning Benchmark & Learning Environment , author =. Proc. Int. Conf. on Robotics and Automation (ICRA) , year =

  38. [38]

    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks , author =. Proc. Int. Conf. on Robotics and Automation (ICRA) , year =

  39. [39]

    Fast Graph Representation Learning with

    Fey, Matthias and Lenssen, Jan Eric , booktitle =. Fast Graph Representation Learning with

  40. [40]

    Luis Pineda and Taosha Fan and Maurizio Monge and Shobha Venkataraman and Paloma Sodhi and Ricky TQ Chen and Joseph Ortiz and Daniel DeTone and Austin Wang and Stuart Anderson and Jing Dong and Brandon Amos and Mustafa Mukadam , booktitle =

  41. [41]

    Parallel Sampling of Diffusion Models on SO(3) , author =. Proc. Int. Conf. on Machine Vision Applications (MVA) , year =

  42. [42]

    Stochastic Analysis on Manifolds , author =

  43. [43]

    Graph Attention Networks , author =. Proc. Int. Conf. on Learning Representations (ICLR) , year =

  44. [44]

    Learning Transferable Visual Models From Natural Language Supervision , author =. Proc. Int. Conf. on Machine Learning (ICML) , year =

  45. [45]

    Attention is All You Need , author =. Proc. Int. Conf. on Neural Information Processing Systems (NeurIPS) , year =