arxiv: 2605.15836 · v1 · pith:TDDWQPVVnew · submitted 2026-05-15 · 💻 cs.RO · cs.AI

GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks

Pith reviewed 2026-05-20 18:27 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords Geometric Anchor Pre-training · visuomotor policies · data-efficient imitation learning · robotic manipulation · spatial adapters · vision foundation models · few-shot policy learning · geometric keypoints

The pith

Pre-training a spatial adapter on a simulated geometric proxy task creates stable keypoints that raise few-shot visuomotor success rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that a short action-free warm-up on a cheap simulated task can regularize the spatial pooling module so that it outputs object-centered, repeatable keypoints rather than task-irrelevant shortcuts. This would matter because high-dimensional visual features from frozen foundation models often lose geometric grounding when only a small adapter is updated with scarce expert demonstrations. By exposing the adapter to object masks in simulation, the method encourages keypoints that stay on the object, cover its extent, and remain consistent across frames. These anchors then serve as a reliable coordinate frame for downstream imitation learning while the main vision model stays frozen. Experiments under severe data scarcity and domain shift show the regularized adapter outperforming both attention-based poolers and full end-to-end fine-tuning.

Core claim

GAP pre-trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few-shot policy learning, while keeping the VFM frozen. A simple adapter regularized with GAP consistently outperforms stronger attention-based poolers and end-to-end fine-tuning, achieving 62% success on RoboMimic Can with 15 demonstrations, 63% on the long-horizon Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations.
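To make the idea of a keypoint-producing pooling layer concrete, here is a minimal NumPy sketch of spatial softmax, the pooling mechanism that the Figure 4 caption names for the unregularized adapter. It is an illustration under assumptions, not the authors' implementation: the function name, the `temperature` parameter, and the [-1, 1] coordinate convention are all hypothetical.

```python
import numpy as np

def spatial_softmax_keypoints(heatmaps, temperature=1.0):
    """Convert K heatmaps of shape (K, H, W) into K expected (x, y) points.

    Coordinates are normalised to [-1, 1] (an assumed convention).
    Returns (keypoints of shape (K, 2), attention maps of shape (K, H, W)).
    """
    k, h, w = heatmaps.shape
    logits = heatmaps.reshape(k, h * w) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(k, h, w)

    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    # Expected coordinate = sum over pixels of probability * coordinate.
    kp_x = (probs.sum(axis=1) * xs).sum(axis=1)  # marginal over rows
    kp_y = (probs.sum(axis=2) * ys).sum(axis=1)  # marginal over columns
    return np.stack([kp_x, kp_y], axis=1), probs
```

Because each keypoint is an expectation over a softmax map, it is differentiable with respect to the heatmap, which is what lets a mask-based objective shape where the keypoints land.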

What carries the argument

Geometric Anchor Pre-training (GAP), an action-free warm-up stage on simulated proxy tasks with free object masks that regularizes the spatial adapter to output stable, object-covering keypoints for downstream policy learning.
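The paper's loss is not reproduced on this page, so the following is only a sketch of how the three single-frame properties (on-object, covering, sharp) could be written as penalties. Every name and formula here (`anchor_losses`, the entropy normalisation, the inverse-distance collapse term) is an assumption made for illustration; temporal repeatability is left out because it needs pairs of frames.

```python
import numpy as np

def anchor_losses(probs, mask, eps=1e-8):
    """Hypothetical single-frame penalties (lower is better).

    probs: (K, H, W) per-keypoint attention maps, each summing to 1, K >= 2.
    mask:  (H, W) binary object mask (1 on the object, 0 elsewhere).
    """
    k, h, w = probs.shape
    # On-object: probability mass that falls outside the mask.
    off_object = 1.0 - (probs * mask[None]).sum(axis=(1, 2)).mean()
    # Sharpness: mean entropy, normalised by its maximum log(H * W).
    entropy = -(probs * np.log(probs + eps)).sum(axis=(1, 2)).mean() / np.log(h * w)
    # Coverage: penalise keypoints whose expected positions bunch together.
    ys, xs = np.mgrid[0:h, 0:w]
    centers = np.stack(
        [(probs * xs).sum(axis=(1, 2)), (probs * ys).sum(axis=(1, 2))], axis=1
    )
    diffs = centers[:, None, :] - centers[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1) + eps)
    mean_pairwise = dists[np.triu_indices(k, 1)].mean()
    collapse = 1.0 / (1.0 + mean_pairwise)
    return {"off_object": off_object, "entropy": entropy, "collapse": collapse}
```

A weighted sum of these terms would be one plausible training objective for the adapter; the weights, like the terms themselves, would be free design choices rather than anything stated in the source.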

If this is right

  • A GAP-regularized adapter reaches 62% success on RoboMimic Can with 15 demonstrations, 16 points above attention-based alternatives.
  • It attains 63% success on the long-horizon high-precision Tool Hang task with 50 demonstrations.
  • On ManiSkill StackCube it reaches 61% success with 30 demonstrations, 11 points above full fine-tuning.
  • The proxy stage is lightweight and fully decoupled from the target task, allowing reuse across environments and skills without retraining the vision model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy regularization could be applied to other frozen vision backbones or to depth and point-cloud inputs where geometric consistency is also needed.
  • If the learned anchors prove robust to larger visual changes, the approach might reduce the number of real-world trials required when transferring policies between robot platforms.
  • Combining GAP with other forms of geometric supervision, such as optical flow or 3-D reconstruction losses, offers a testable route to further data efficiency.

Load-bearing premise

The keypoints learned from the simulated proxy task remain reliable and keep their geometric meaning after the adapter is fine-tuned on downstream demonstrations that differ from the proxy task in appearance and lighting.

What would settle it

Measure whether the adapter's keypoints stay aligned with object surfaces and retain frame-to-frame repeatability after fine-tuning on the downstream demonstrations; large drift or loss of consistency would undermine the geometric-anchor explanation for the reported gains over baselines.
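One way such a measurement might look, assuming keypoints are available as integer pixel coordinates and a ground-truth mask exists for every frame. The two functions and their names are hypothetical, and measuring repeatability relative to the mask centroid is a simplifying choice that ignores object rotation.

```python
import numpy as np

def on_object_rate(keypoints, masks):
    """Fraction of keypoints that land inside the object mask.

    keypoints: (T, K, 2) integer pixel coordinates as (x, y).
    masks:     (T, H, W) binary masks, one per frame.
    """
    t_idx = np.arange(keypoints.shape[0])[:, None]
    x = keypoints[..., 0]
    y = keypoints[..., 1]
    return float(masks[t_idx, y, x].mean())

def repeatability_drift(keypoints, masks):
    """Mean frame-to-frame change (in pixels) of each keypoint's offset
    from the mask centroid. Zero means keypoints move rigidly with the object.
    Assumes every mask is non-empty.
    """
    # np.argwhere yields (row, col); reverse to (x, y) before averaging.
    centroids = np.stack([np.argwhere(m)[:, ::-1].mean(axis=0) for m in masks])
    rel = keypoints - centroids[:, None, :]
    return float(np.linalg.norm(np.diff(rel, axis=0), axis=-1).mean())
```

Running both functions before and after downstream fine-tuning would show whether adaptation erodes the anchors: a falling on-object rate or a rising drift would be the warning sign.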

Figures

Figures reproduced from arXiv: 2605.15836 by Andrea Protopapa, Davide Buoso, Francesca Pistilli, Giuseppe Averta, Stefano Di Carlo.

Figure 2: Method Overview. 1. The spatial pooling layer extracts keypoints from the semantic pretrained backbone (frozen). GAP supervises this layer with the proposed loss, providing geometric grounding for policy learning. 2. Backbone and warmed-up pooling layer are then used to generate the input for the Diffusion Policy. During downstream training, the pooling layer is fine-tuned per task, to adapt object keypoin…
Figure 3: Qualitative Keypoint Transfer. When pre-trained on a proxy task and then transferred to a new task and simulator, GAP allows for keeping a favorable geometric grounding even in zero-shot. (a) shows the keypoints placement on the Robomimic LiftCube task after pretraining. (b) and (c) show keypoints positioning when using the pre-trained visual encoder in zero-shot on a different task (of the same simulator)…
Figure 4: GAP impact on different backbones (Square, 30 Demos). We evaluate various pretrained backbones (R3M [10], VC1 [2] and DinoV2 [3]) with different poolers: a Global Avg. Pooling (blue), unregularized geometric adapter with Spatial Softmax (yellow), AFA [6] (purple) and a GAP-pretrained spatial adapter (green). GAP consistently outperforms other methods by a large margin, demonstrating that all backbones bene…
Figure 5: VC-1 backbone with GAP pretrained spatial pooler in the wild. We use one video from [23] of a real-world robot performing a pick and place task and apply the GAP pretrained model (Lift task of Robomimic) to the video. This qualitatively shows a very good initialization of the keypoints even in the sim-to-real scenario, which vouches for good transfer of results in the real world.
Original abstract

Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high-dimensional RGB representations into control-relevant geometry without overfitting. While using frozen pre-trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task-irrelevant shortcuts and lose geometric grounding when finetuned with few data samples. More broadly, pre-trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness-oriented inductive biases. We propose Geometric Anchor Pre-training (GAP), a simple, action-free warm-up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre-trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few-shot policy learning, while keeping the VFM frozen. We evaluate GAP on RoboMimic and ManiSkill under severe data scarcity (15-50 demonstrations) and domain shift. A simple adapter regularized with GAP consistently outperforms stronger attention-based poolers and end-to-end fine-tuning, achieving 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on the long-horizon high-precision Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning). The proxy stage is lightweight and fully decoupled from downstream tasks, making it practical to reuse across environments and manipulation skills.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Geometric Anchor Pre-training (GAP), a lightweight, action-free, and fully decoupled pre-training stage for the spatial pooling adapter in frozen Vision Foundation Model (VFM) based visuomotor policies. Using simulated proxy data with readily available object masks, GAP regularizes the adapter to output keypoints that lie on objects, cover their spatial extent, and are sharp and temporally repeatable. These geometric anchors are then used in downstream few-shot imitation learning on manipulation tasks. The manuscript reports concrete gains under severe data scarcity and domain shift: 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on Tool Hang with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning), while outperforming stronger attention-based poolers.

Significance. If the central assumption holds, GAP offers a practical, reusable inductive bias for geometric grounding that improves data efficiency without fine-tuning large VFMs or requiring action labels in the proxy stage. The decoupling from downstream tasks is a clear engineering strength. The reported numbers suggest meaningful gains on standard benchmarks, but the significance is tempered by the absence of supporting quantitative diagnostics on keypoint stability.

major comments (2)
  1. Experimental results (abstract and §4): the reported success rates and relative improvements (e.g., +16% on RoboMimic Can) are presented without ablations that isolate GAP from other design choices, without statistical details such as standard deviations or number of evaluation seeds, and without a complete experimental protocol. This makes it difficult to attribute the gains specifically to the geometric regularization rather than implementation factors.
  2. Methods and results sections on keypoint evaluation: the manuscript supplies only qualitative visualizations of keypoints. No quantitative post-fine-tuning metrics are reported that measure whether the learned keypoints remain on-object, spatially covering, sharp, and temporally repeatable after the adapter is adapted on 15–50 real demonstrations under domain shift. Because the central explanatory claim rests on the stability of these geometric anchors, the absence of such diagnostics leaves the performance story under-supported.
minor comments (2)
  1. Abstract: the description of the 'simple adapter' could be expanded with a brief architectural diagram or equation showing how it interfaces with VFM features and produces the keypoint output.
  2. Notation: ensure consistent use of terms such as 'spatial adapter' versus 'pooling layer' across the methods and experiments sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the changes we will incorporate in the revised manuscript.

Point-by-point responses
  1. Referee: Experimental results (abstract and §4): the reported success rates and relative improvements (e.g., +16% on RoboMimic Can) are presented without ablations that isolate GAP from other design choices, without statistical details such as standard deviations or number of evaluation seeds, and without a complete experimental protocol. This makes it difficult to attribute the gains specifically to the geometric regularization rather than implementation factors.

    Authors: We agree that additional ablations, statistical reporting, and a complete protocol are needed for rigorous attribution. In the revision we will add ablations that isolate the geometric regularization component of GAP from other factors such as VFM backbone and adapter architecture. We will report success rates as means with standard deviations over at least five evaluation seeds per task and include a detailed experimental protocol in the appendix. revision: yes

  2. Referee: Methods and results sections on keypoint evaluation: the manuscript supplies only qualitative visualizations of keypoints. No quantitative post-fine-tuning metrics are reported that measure whether the learned keypoints remain on-object, spatially covering, sharp, and temporally repeatable after the adapter is adapted on 15–50 real demonstrations under domain shift. Because the central explanatory claim rests on the stability of these geometric anchors, the absence of such diagnostics leaves the performance story under-supported.

    Authors: We acknowledge that quantitative post-adaptation metrics would strengthen support for keypoint stability. While downstream task success is our primary metric, we will add quantitative diagnostics in the revision, including on-object coverage and temporal repeatability scores computed on simulated proxy data with available masks, both pre- and post-adaptation. For domain-shift settings we will provide proxy quantitative analysis where ground-truth masks can be obtained or simulated. revision: partial
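The statistical reporting promised in the first response (means with standard deviations over at least five seeds) could be computed along these lines. This is a sketch using only the Python standard library; the function name is hypothetical and the numbers are placeholders chosen for illustration, not results from the paper.

```python
import statistics

def summarize_success(per_seed_rates):
    """Return (mean, sample standard deviation) of per-seed success rates."""
    if len(per_seed_rates) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return statistics.mean(per_seed_rates), statistics.stdev(per_seed_rates)

# Placeholder values for illustration only; NOT results from the paper.
placeholder_rates = [0.50, 0.54, 0.48, 0.56, 0.52]
mean, std = summarize_success(placeholder_rates)
print(f"{mean:.2f} +/- {std:.2f} over {len(placeholder_rates)} seeds")
```

`statistics.stdev` uses the sample (n - 1) denominator, which is the usual choice when a handful of seeds stands in for the full distribution of training runs.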

Circularity Check

0 steps flagged

No significant circularity; pre-training is decoupled

The paper presents GAP as an independent, action-free pre-training stage on a simulated proxy task that supplies object masks at no cost to regularize the spatial adapter toward on-object, spatially covering, sharp, and temporally repeatable keypoints. This proxy objective operates on separate data and produces the adapter weights that are then fine-tuned per task in the downstream imitation phase on scarce demonstrations without masks. The reported success rates (e.g., 62% on RoboMimic Can with 15 demos) are empirical outcomes of that transfer, not quantities that are statistically forced by or identical to the proxy loss terms. No equations, fitted parameters, or self-citations are shown to reduce the central claim to its own inputs by construction; the two stages remain distinct and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that supervision from simulated object masks produces transferable geometric anchors; this is a standard sim-to-real domain assumption rather than a fitted parameter or new invented entity.

axioms (1)
  • domain assumption: Object masks available in simulation provide useful geometric supervision that transfers to real manipulation scenes.
    Invoked when the proxy task is defined to encourage keypoints on objects.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  [1] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025.
  [2] A. Majumdar, K. Yadav, S. Arnaud, J. Ma, V. Chen, S. Silwal, A. Jain, V.-P. Berges, T. Wu, J. Vakil, et al., “Where are we in the search for an effective robot motor control foundation model?” in Advances in Neural Information Processing Systems, vol. 36, 2023.
  [3] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “DINOv2: Learning robust visual features without supervision,” Transactions on Machine Learning Research, 2024.
  [4] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, “Deep spatial autoencoders for visuomotor learning,” in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 512–519.
  [5] M. Ryoo, A. Piergiovanni, A. Arnab, M. Dehghani, and A. Angelova, “Tokenlearner: Adaptive space-time tokenization for videos,” Advances in Neural Information Processing Systems, vol. 34, pp. 12786–12797, 2021.
  [6] N. Tsagkas, A. Sochopoulos, D. Danier, S. Vijayakumar, A. Kouris, O. Mac Aodha, and C. X. Lu, “Attentive feature aggregation or: How policies learn to stop worrying about robustness and attend to task-relevant visual cues,” arXiv preprint arXiv:2511.10762, 2025.
  [7] N. Tsagkas, A. Sochopoulos, D. Danier, C. X. Lu, and O. M. Aodha, “The temporal trap: Entanglement in pre-trained visual representations for visuomotor policy learning,” 2025. [Online]. Available: https://arxiv.org/abs/2502.03270
  [8] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, F.-F. Li, S. Savarese, Y. Zhu, and R. Martín-Martín, “What matters in learning from offline human demonstrations for robot manipulation,” in Conference on Robot Learning. PMLR, 2021, pp. 1678–1690.
  [9] S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T.-k. Chan, et al., “Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,” arXiv preprint arXiv:2410.00425, 2024.
  [10] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3M: A universal visual representation for robot manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 892–909.
  [11] T. Xiao, I. Radosavovic, T. Darrell, and J. Malik, “Masked visual pre-training for motor control,” arXiv preprint arXiv:2203.06173, 2022.
  [12] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, et al., “Ego4D: Around the world in 3,000 hours of egocentric video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18995–19012.
  [13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
  [14] Z. Zhuang, R. Wang, N. Ingelhag, V. Kyrki, and D. Kragic, “Enhancing visual domain robustness in behaviour cloning via saliency-guided augmentation,” in Conference on Robot Learning. PMLR, 2025, pp. 4314–4331.
  [15] A. Zeng, P. Florence, J. Tompson, S. Welker, J. Chien, M. Attarian, T. Armstrong, I. Krasin, D. Duong, V. Sindhwani, et al., “Transporter networks: Rearranging the visual world for robotic manipulation,” in Conference on Robot Learning. PMLR, 2021, pp. 726–747.
  [16] S. Suwajanakorn, N. Snavely, J. J. Tompson, and M. Norouzi, “Discovery of latent 3d keypoints via end-to-end geometric reasoning,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  [17] L. Tang, M. Jia, Q. Wang, C. P. Phoo, and B. Hariharan, “Emergent correspondence from image diffusion,” Advances in Neural Information Processing Systems, vol. 36, pp. 1363–1389, 2023.
  [18] S. Wang, J. You, Y. Hu, J. Li, and Y. Gao, “SKIL: Semantic keypoint imitation learning for generalizable data-efficient manipulation,” arXiv preprint arXiv:2501.14400, 2025.
  [19] L. Manuelli, W. Gao, P. Florence, and R. Tedrake, “kPAM: Keypoint affordances for category-level robotic manipulation,” in The International Symposium of Robotics Research. Springer, 2019, pp. 132–157.
  [20] W. Gao and R. Tedrake, “kPAM-SC: Generalizable manipulation planning using keypoint affordance and shape completion,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 6527–6533.
  [21] X. Jia, Q. Wang, A. Wang, H. A. Wang, B. Gyenes, E. Gospodinov, X. Jiang, G. Li, H. Zhou, W. Liao, et al., “PointMapPolicy: Structured point cloud processing for multi-modal imitation learning,” in Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.
  [22] A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, et al., “Perceiver io: A general architecture for structured inputs & outputs,” in International Conference on Learning Representations, 2022.
  [23] Y. Fang, Y. Yang, X. Zhu, K. Zheng, G. Bertasius, D. Szafir, and M. Ding, “Rebot: Scaling robot learning with real-to-sim-to-real robotic video synthesis,” arXiv preprint arXiv:2503.14526, 2025.
  [24] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.