
arxiv: 2606.13394 · v1 · pith:K3TFKJ53new · submitted 2026-06-11 · 💻 cs.RO

GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

Pith reviewed 2026-06-27 06:35 UTC · model grok-4.3

classification 💻 cs.RO
keywords mobile manipulation · action transformer · geometry adaptive fusion · diffusion policy · whole body control · vision foundation model · hybrid action decoder

The pith

GeoHAT improves whole-body mobile manipulation by injecting reliable 3D geometry only where depth is valid and decoding arm and base actions in separate subspaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GeoHAT as an end-to-end diffusion-based policy for coordinating a mobile base and a manipulator arm under changing viewpoints. It converts dense per-pixel 3D coordinates into geometric tokens with a lightweight Fourier encoder and fuses them into vision foundation model features through per-token gated fusion that depends on depth validity. This selective step adds spatial structure while aiming to leave the pretrained semantic priors intact. For action generation, the hybrid decoder splits arm and base into distinct subspaces so that each can attend sparsely to its own relevant visual context, while causal temporal modeling handles coordination within and across timesteps. On the ManiSkill-HAB simulation benchmark the approach reaches a 79.3 percent mean success rate, 23.7 percent above the strongest baseline, and the abstract reports consistent improvements over all baselines in real-world experiments on varied tasks.

Core claim

GeoHAT establishes that whole-body mobile manipulation policies improve when geometry is injected only where reliable and attended to only where needed, using a Fourier spatial encoder to produce geometric tokens, per-token gated fusion modulated by depth validity to preserve semantic representations, and a Hybrid Whole-Body Action Decoder that decomposes arm and base into distinct subspaces linked by sparse cross-attention.
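To make the "Fourier spatial encoder" concrete, here is a minimal sketch of sinusoidal Fourier features applied to per-pixel 3D coordinates, in the style of Tancik et al. [31]. The function name `fourier_encode`, the number of bands, and the power-of-two frequency schedule are assumptions for illustration; the page does not specify the paper's exact encoder, and the MLP that follows it in Figure 2 is omitted.

```python
import numpy as np

def fourier_encode(xyz: np.ndarray, num_bands: int = 4) -> np.ndarray:
    """Map (..., 3) coordinates to (..., 3 * 2 * num_bands) sin/cos features.

    Assumed frequency schedule: pi * 2^k for k = 0..num_bands-1.
    """
    freqs = (2.0 ** np.arange(num_bands)) * np.pi            # (B,)
    angles = xyz[..., None] * freqs                           # (..., 3, B)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., 3, 2B)
    return feats.reshape(*xyz.shape[:-1], -1)                 # flatten per-point features

# Hypothetical 2x3 grid of patch-aligned 3D points, all at the origin.
points = np.zeros((2, 3, 3))
tokens = fourier_encode(points)   # shape (2, 3, 24)
```

In a full model these features would presumably pass through a small MLP to match the vision backbone's token width; that projection is left out here.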

What carries the argument

The per-token gated fusion modulated by depth validity, which selectively enriches vision features with geometric tokens without corrupting pretrained semantics, together with the Hybrid Whole-Body Action Decoder that separates arm and base subspaces for sparse cross-attention and causal temporal modeling.
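One plausible form of depth-validity-gated fusion is sketched below. It assumes a sigmoid gate computed from the concatenated visual and geometric tokens, scaled by a per-token validity score, with a residual add onto the visual features. The names `gated_fusion`, `w_gate`, and `b_gate` are hypothetical, and the paper's actual gating equation may differ.

```python
import numpy as np

def gated_fusion(vis, geo, valid, w_gate, b_gate):
    """Fuse geometric tokens into visual tokens, token by token.

    vis, geo : (N, D) visual and geometric tokens
    valid    : (N,) depth validity in [0, 1] (0 = no reliable depth)
    w_gate   : (2D, D) assumed gate weights; b_gate : (D,) assumed gate bias
    """
    gate_in = np.concatenate([vis, geo], axis=-1)                 # (N, 2D)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate + b_gate)))     # sigmoid, (N, D)
    gate = gate * valid[:, None]                                  # close the gate where depth is invalid
    return vis + gate * geo                                       # residual add keeps vis as the base signal

vis = np.ones((2, 3))
geo = np.full((2, 3), 2.0)
valid = np.array([0.0, 1.0])      # first token has invalid depth
fused = gated_fusion(vis, geo, valid, np.zeros((6, 3)), np.zeros(3))
```

With this form, a token whose depth is invalid passes through unchanged, which is the property the summary relies on when it says pretrained semantics are not corrupted.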

If this is right

  • Policies maintain performance under noisy depth by avoiding corruption of semantic priors from vision foundation models.
  • Separate subspaces for arm and base allow each to focus on its own task-relevant visual context without a single shared action vector.
  • Lightweight Fourier encoding supplies dense geometry without requiring a second full 3D vision backbone.
  • The same gains appear across both simulated benchmarks and real-world trials on diverse tasks.
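The Figure 2 caption describes the sparse cross-attention as query-level Top-K. A minimal single-head sketch of that idea follows, assuming each action query keeps only its `top_k` highest-scoring visual tokens before the softmax. The function name and the omission of learned projections are simplifications, not the paper's implementation.

```python
import numpy as np

def topk_cross_attention(q, k, v, top_k):
    """q: (Q, D) action queries; k: (N, D), v: (N, Dv) visual tokens.

    Each query attends only to its top_k keys (ties may keep more).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])                           # (Q, N)
    kth = np.partition(scores, -top_k, axis=-1)[:, -top_k][:, None]   # k-th largest score per query
    masked = np.where(scores >= kth, scores, -np.inf)                 # drop everything below it
    masked = masked - masked.max(axis=-1, keepdims=True)              # numerical stability
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

q = np.array([[1.0, 0.0]])                               # one hypothetical arm query
k = np.array([[10.0, 0.0], [5.0, 0.0], [-5.0, 0.0]])     # three visual tokens
v = np.eye(3)
out, w = topk_cross_attention(q, k, v, top_k=1)
```

Running separate arm and base queries through such a layer would let each subspace select a different subset of visual tokens, which is the behavior the second bullet describes.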

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selective gating idea may extend to other sensor-fusion settings where only part of the input is reliable at any moment.
  • The subspace separation suggests modular extensions could handle additional robot components such as grippers or sensors.
  • Further tests on longer-horizon tasks would show whether the causal temporal modeling continues to coordinate arm and base effectively.

Load-bearing premise

The per-token gated fusion based on depth validity enriches spatial understanding without harming pretrained semantic representations from the vision model, and separating arm and base into distinct subspaces with sparse cross-attention is enough to handle their different control demands.

What would settle it

An ablation study on the same benchmark tasks in which the depth-validity gating is removed and the resulting success rate falls to or below the level of the strongest baseline.

Figures

Figures reproduced from arXiv: 2606.13394 by Jinyan Liu, Luzhou Ge, Renjun Wu, Xiangyu Zhu, Xuesong Li.

Figure 1: Overview of GeoHAT. Left: GeoHAT fuses multi-view RGB features with PointMap features through reliability-aware gated fusion, and then predicts coordinated arm-base actions with a hybrid whole-body decoder. This design injects geometry only where reliable and routes visual context to the action subspaces that need it. Middle: Simulation and real-world mobile manipulation tasks used for evaluation. Right: S…
Figure 2: Architecture of GeoHAT. Patch-aligned PointMap coordinates are encoded via a Fourier MLP and selectively fused into DINOv2 features through per-token gating modulated by depth confidence (left). The hybrid action decoder interleaves arm and base tokens, applies query-level Top-K cross-attention for subspace-specific visual grounding, and uses causal self-attention for temporal coordination (right). 3.1 Ge…
Figure 3: Causal mask structure. Tokens within the same timestep attend bidirectionally, while cross-timestep attention is strictly causal. Causal Masked Self-Attention. The action tokens are processed by causal masked self-attention to model temporal dependencies. For the interleaved sequence $Z$ of length $2h$, the attention mask $M \in \{0, 1\}^{2h \times 2h}$ is defined as $M_{ij} = \mathbb{1}\left[\lceil i/2 \rceil \geq \lceil j/2 \rceil\right]$ (10), where $\lceil \cdot / 2 \rceil$ maps each …
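The mask in Equation (10) can be built directly. The sketch below reads it as M_ij = 1 when ⌈i/2⌉ ≥ ⌈j/2⌉ with 1-based token positions, so that the two interleaved tokens of a timestep (arm and base) see each other and all earlier timesteps. The function name is an assumption for illustration.

```python
import numpy as np

def interleaved_causal_mask(h: int) -> np.ndarray:
    """Return the (2h, 2h) 0/1 mask for an interleaved arm/base token sequence."""
    idx = np.arange(1, 2 * h + 1)          # 1-based token positions, as in Eq. (10)
    step = (idx + 1) // 2                  # integer form of ceil(idx / 2): timestep of each token
    return (step[:, None] >= step[None, :]).astype(np.int64)

mask = interleaved_causal_mask(2)
# Rows are queries, columns are keys:
# [[1 1 0 0]
#  [1 1 0 0]
#  [1 1 1 1]
#  [1 1 1 1]]
```

The 2×2 blocks on the diagonal are the bidirectional intra-timestep attention; everything above them is zero, which enforces strict causality across timesteps.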
Figure 4: Success rate of GeoHAT and baselines in real-world tasks. To further validate real-world applicability, we conduct experiments on four mobile manipulation tasks requiring coordinated base-arm control: pick and place, drop rubbish, open drawer, and close drawer. Our platform consists of an AIRBOT Play robot arm, a Woosh mobile base, and two Intel RealSense D455 cameras placed on the head and wrist. Addit…
Figure 5: Query-level Sparse Attention Results of Pick Bowl Task.
Figure 6: Real-world robot hardware configuration. The mobile manipulation platform is equipped…
Figure 7: Real-world observations used by GeoHAT. For each task, we show synchronized wrist…
Figure 8: Execution progress of real-world mobile manipulation tasks. Each row shows represen…
Figure 9: Execution progress of MSHAB SetTable tasks.
Figure 10: Execution progress of MSHAB TidyHouse tasks.
Figure 11: Execution progress of MSHAB PrepareGroceries tasks.
Original abstract

Whole-body mobile manipulation requires coordinating mobile base and manipulator under shifting viewpoints, posing challenges in geometric perception and action generation. Current policies either rely on 2D features or sparse 3D representations that lack dense spatial structure, and typically encode arm and base within one action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations under noisy depth while incurring heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Furthermore, real-world experiments on diverse tasks also confirm consistent improvements over all baselines.
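The headline "surpassing the strongest baseline by 23.7%" admits two readings, and the page does not say which one is meant. The short calculation below derives the baseline success rate implied by each; both outputs are inferences from the two quoted numbers, not values reported here.

```python
def implied_baseline(geohat: float, gain: float) -> dict:
    """Baseline success rate (in percent) implied by each reading of the gain."""
    return {
        # Reading 1: gain is in absolute percentage points.
        "absolute_points": geohat - gain,
        # Reading 2: gain is relative to the baseline's own success rate.
        "relative_percent": geohat / (1.0 + gain / 100.0),
    }

readings = implied_baseline(79.3, 23.7)
# absolute_points  is about 55.6
# relative_percent is about 64.1
```

The gap between the two readings is roughly 8.5 points, which is large enough that the paper's results table is needed to settle which one applies.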

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents GeoHAT, an end-to-end diffusion-based framework for whole-body mobile manipulation. It introduces a lightweight Fourier spatial encoder to map dense per-pixel 3D coordinates into geometric tokens, selectively injects them into vision foundation model features via per-token gated fusion modulated by depth validity, and uses a Hybrid Whole-Body Action Decoder that decomposes arm and base into distinct subspaces with sparse cross-attention plus causal temporal modeling. The authors report a 79.3% mean success rate on the ManiSkill-HAB simulation benchmark (23.7% above the strongest baseline), with consistent real-world improvements on diverse tasks.

Significance. If the empirical results are robust, the work offers a practical approach to injecting reliable geometry into pretrained vision representations without corruption or excessive compute, while addressing distinct control demands of base and arm through modality-specific attention. The core design principle of selective geometry injection is a clear strength for handling noisy depth in mobile manipulation.

major comments (1)
  1. [Abstract] The abstract states empirical results but supplies no experimental protocol, baseline descriptions, error bars, ablation studies, or statistical tests; without these details it is impossible to assess whether the 79.3% figure supports the architectural claims about gated fusion and hybrid decoding.
minor comments (1)
  1. The description of the per-token gated fusion and sparse cross-attention would benefit from an accompanying diagram showing token flow and attention masks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The single major comment concerns the level of detail in the abstract. We address it point-by-point below and agree that a modest expansion of the abstract will improve clarity without altering the paper's technical contributions.

Point-by-point responses
  1. Referee: [Abstract] The abstract states empirical results but supplies no experimental protocol, baseline descriptions, error bars, ablation studies, or statistical tests; without these details it is impossible to assess whether the 79.3% figure supports the architectural claims about gated fusion and hybrid decoding.

    Authors: We agree that the abstract, constrained by length, does not enumerate the full experimental protocol, baseline list, error bars, or ablation results. The body of the manuscript supplies these details: Section 4 describes the ManiSkill-HAB benchmark and evaluation protocol; Section 4.1 lists all baselines with implementation references; Section 4.3 presents ablations on the gated fusion and hybrid decoder; and all reported success rates include standard deviations computed over five random seeds together with pairwise statistical comparisons. To address the concern directly, we will revise the abstract to include a concise statement of the benchmark, the primary baseline, and that results are means with standard deviations across multiple runs. This change will make the empirical claims more self-contained while preserving the abstract's brevity. revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents an empirical architecture (GeoHAT) whose central claims are performance numbers on ManiSkill-HAB and real-world tasks. These are reported as measured outcomes of training and evaluation, not quantities derived from or forced by the model's own equations. No self-definitional steps, fitted-input predictions, or load-bearing self-citation chains appear in the abstract or described method; the design choices (Fourier encoder, gated fusion, hybrid decoder) are presented as engineering decisions whose value is validated externally by benchmark results rather than by internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no explicit free parameters, mathematical axioms, or new postulated entities beyond standard components of diffusion policies and transformers.

pith-pipeline@v0.9.1-grok · 5793 in / 1292 out tokens · 24092 ms · 2026-06-27T06:35:34.283892+00:00 · methodology
