
arxiv: 2606.19586 · v1 · pith:BRSXQTTSnew · submitted 2026-06-17 · 💻 cs.RO

One Demo is Worth a Thousand Trajectories: Action-View Augmentation for Visuomotor Policies

Pith reviewed 2026-06-26 20:28 UTC · model grok-4.3

classification 💻 cs.RO
keywords visuomotor policies · data augmentation · Gaussian Splatting · fisheye camera · robotic manipulation · trajectory optimization · eye-in-hand · collision avoidance

The pith

Augmenting one fisheye demonstration via scene reconstruction and trajectory optimization improves visuomotor policy success rates on manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a data augmentation method that takes one real-world eye-in-hand demonstration and generates many additional training examples consisting of image sequences and action trajectories. It uses a Gaussian Splatting technique modified for fisheye cameras to rebuild the 3D scene, insert new obstacles, and create new paths that avoid collisions while allowing good views for rendering. This is useful because visuomotor policies trained on limited data often fail when the robot starts in a slightly different position or encounters unseen objects. If successful, the method lets policies learn from far less original data collection.

Core claim

The central claim is that the augmentation framework turns real-world eye-in-hand demonstrations into visually realistic fisheye image sequences paired with physically feasible action trajectories. It does so with a novel Gaussian Splatting formulation, adapted to wide field-of-view fisheye cameras, that reconstructs and edits the 3D scene, and with trajectory optimization that produces smooth, collision-free paths. Policies trained on the resulting data achieve higher success rates in both the original and augmented scenes.

What carries the argument

A Gaussian Splatting formulation adapted to fisheye cameras for 3D scene reconstruction and editing, paired with trajectory optimization to create new executable action sequences.
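The trajectory-optimization half of this machinery can be illustrated with a toy problem. The sketch below is not the paper's optimizer: it is a minimal 2D gradient descent over waypoints, assuming a squared-segment-length smoothness cost and a quadratic penalty for entering a circular obstacle. The names `optimize_path` and `min_clearance`, the weights, and the step size are all hypothetical, and the paper's view-rendering and contact-dynamics terms are not modeled.

```python
import math

def optimize_path(start, goal, center, radius, n=21, iters=3000,
                  lr=0.05, w_smooth=1.0, w_obs=10.0):
    """Toy 2D waypoint optimizer (illustrative only, not the paper's method).

    Minimizes  w_smooth * sum ||p[i+1] - p[i]||^2
             + w_obs    * sum max(0, radius - dist(p[i], center))^2
    by plain gradient descent, keeping both endpoints fixed.
    """
    # Straight-line initial guess between start and goal.
    path = [[start[0] + (goal[0] - start[0]) * i / (n - 1),
             start[1] + (goal[1] - start[1]) * i / (n - 1)] for i in range(n)]
    for _ in range(iters):
        new_path = [p[:] for p in path]
        for i in range(1, n - 1):  # endpoints are never updated
            p, prev, nxt = path[i], path[i - 1], path[i + 1]
            # Gradient of the smoothness term with respect to waypoint i.
            gx = 2.0 * w_smooth * (2.0 * p[0] - prev[0] - nxt[0])
            gy = 2.0 * w_smooth * (2.0 * p[1] - prev[1] - nxt[1])
            # Gradient of the obstacle penalty: pushes radially outward
            # while the waypoint is inside the inflated obstacle.
            dx, dy = p[0] - center[0], p[1] - center[1]
            d = math.hypot(dx, dy)
            if 1e-9 < d < radius:
                push = 2.0 * w_obs * (radius - d) / d
                gx -= push * dx
                gy -= push * dy
            new_path[i][0] = p[0] - lr * gx
            new_path[i][1] = p[1] - lr * gy
        path = new_path
    return path

def min_clearance(path, center):
    """Smallest waypoint-to-obstacle-center distance along the path."""
    return min(math.hypot(p[0] - center[0], p[1] - center[1]) for p in path)
```

Because the obstacle term is a soft penalty, waypoints can settle slightly inside the radius; a real planner would typically inflate the obstacle or check the segments between waypoints as well.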

If this is right

  • The framework improves success rates for various manipulation tasks in the same scene.
  • Success rates also rise in scenes with added obstacles that require collision avoidance.
  • The improvements appear in both simulation experiments and real-world tests.
  • The generated trajectories remain physically feasible and executable on the robot.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could extend to other types of cameras or robot configurations if the underlying reconstruction holds up.
  • Adding dynamic elements to the edited scenes might enable training for tasks with moving obstacles.
  • Measuring the reduction in required original demonstrations for a target performance level would test its efficiency gains.

Load-bearing premise

The Gaussian Splatting must reconstruct the fisheye scene accurately enough to support realistic rendering of novel views and edited scenes with obstacles.
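To see why a pinhole-based splatting pipeline cannot be reused unchanged, it helps to look at how a fisheye lens maps 3D points to pixels. The sketch below assumes an equidistant fisheye model with Kannala-Brandt style polynomial distortion [37]; the text available here does not state which camera model the paper's formulation uses, so the function name `project_fisheye`, the intrinsics, and the coefficient tuple `k` are illustrative assumptions.

```python
import math

def project_fisheye(point, fx, fy, cx, cy, k=(0.0, 0.0, 0.0, 0.0)):
    """Project a camera-frame 3D point to pixel coordinates.

    Assumed model: radial image distance grows with the angle theta from
    the optical axis (not with tan(theta) as in a pinhole camera), with an
    odd polynomial in theta for lens distortion.
    """
    x, y, z = point
    r = math.hypot(x, y)
    if r < 1e-12:
        return (cx, cy)  # point on the optical axis maps to the principal point
    theta = math.atan2(r, z)  # stays finite at and beyond 90 degrees off-axis
    t2 = theta * theta
    theta_d = theta * (1.0 + k[0] * t2 + k[1] * t2 ** 2
                       + k[2] * t2 ** 3 + k[3] * t2 ** 4)
    scale = theta_d / r
    return (fx * scale * x + cx, fy * scale * y + cy)
```

A point at 90 degrees off-axis, such as `(1.0, 0.0, 0.0)`, still lands at a finite pixel under this model, whereas the pinhole projection `fx * x / z + cx` divides by zero there.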

What would settle it

If a policy trained on the augmented data achieves the same or lower success rate than one trained only on the single original demonstration when tested with new obstacles, the claim would be false.
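One way to make "the same or lower success rate" operational is a one-sided test on the two success counts. The sketch below is an assumption about how such a comparison could be run, not a procedure from the paper: it uses a pooled two-proportion z-test, and the function name `one_sided_z_test` and its arguments are hypothetical. No counts from the paper are used.

```python
import math

def one_sided_z_test(successes_aug, trials_aug, successes_base, trials_base):
    """Pooled two-proportion z-test for H1: augmented rate > baseline rate.

    Returns (z, p_value). Relies on the normal approximation, which is
    rough for the small episode counts typical of robot evaluations.
    """
    p_aug = successes_aug / trials_aug
    p_base = successes_base / trials_base
    pooled = (successes_aug + successes_base) / (trials_aug + trials_base)
    se = math.sqrt(pooled * (1.0 - pooled)
                   * (1.0 / trials_aug + 1.0 / trials_base))
    if se == 0.0:
        # Both policies always succeed or always fail: no evidence either way.
        return 0.0, 0.5
    z = (p_aug - p_base) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_value
```

A large p-value on the obstacle test cases would be consistent with the falsifying outcome described above; with few episodes per condition, an exact test such as Fisher's would be the more careful choice.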

Figures

Figures reproduced from arXiv: 2606.19586 by Benjamin Burchfiel, Chuer Pan, Dominik Bauer, Eric Cousineau, Litian Liang, Shuran Song, Siyuan Feng.

Figure 1: 1001 DEMOS. From a single human demonstration (e.g., picking up the blue mug), our approach generates valid training trajectories with large spatial variance and augmented obstacles, while respecting action-view consistency, 3D collision and contact dynamics constraints. Visuomotor policies trained through imitation learning [1, 2, 3] enable complex robot behaviors but often remain brittle: minor changes…
Figure 2: 1001 DEMOS Overview. From an initial mapping run, we reconstruct the 3D scene point cloud for easy trajectory planning and a fisheye 3DGS scene for fast novel view rendering (§3.1). Given a single demonstration video (green), we optimize additional physically feasible action trajectories (§3.2) and render the corresponding visually consistent fisheye-image observations (§3.3), thereby generating thousands…
Figure 3: Fisheye 3DGS. We propose Fisheye-3DGS, using a ray sampler that accounts for fisheye distortion. Sampling density adapts to pixel location, allocating more rays to the image center than the periphery for better rasterization quality. Fisheye 3D Gaussians. A critical design choice enabling fast rasterization, 3DGS [36] tiles pinhole image into 16×16 pixel patches and uses 256-thread CUDA blocks per tile…
Figure 4: Augmentation with Obstacle Avoidance. Top: original demo; Middle: augmented trajectories without obstacle avoidance; Bottom: augmented trajectories with obstacle avoidance. We preserve the original contact dynamics using a delta funnel loss L_funnel to produce trajectories that converge consistently to the same pre-contact pose of the original demonstration. Let R and t represent the rotation and transla…
Figure 5: Simulation Evaluation. (a) Initial state distribution for training data highlighted in blue overlay over custom test data. (b) Task success rate with action-view augmentation, compared to no augmentation, oracle action-view augmentation and other augmentation baselines.
Figure 6: Real-world Evaluation. We report task performance for two versions of our augmented policies – trained with free-space augmentation (FreeSpace Aug), and free-space & obstacle-distractor augmentation (Obstacle Aug) – against a vanilla policy trained with no augmentation (No Aug). Initial states for a subset of all evaluation episodes for (a) OOD camera view test case, (b) OOD obstacle distractor test case …
Original abstract

Visuomotor policies for manipulation have demonstrated remarkable potential in modeling complex robotic behaviors, yet minor alterations in the robot's initial configuration and unseen obstacles easily lead to out-of-distribution observations. Without extensive data collection effort, these result in catastrophic execution failures. In this work, we introduce an effective data augmentation framework that generates visually realistic fisheye image sequences and corresponding physically feasible action trajectories from real-world eye-in-hand demonstrations, captured with a portable parallel gripper with a single fisheye camera. We introduce a novel Gaussian Splatting formulation, adapted to wide FoV fisheye cameras, to reconstruct and edit the 3D scene with unseen objects. We utilize trajectory optimization to generate smooth, collision-free, view-rendering-friendly action trajectories and render visual observations from corresponding novel views. Comprehensive experiments in simulation and the real world show that our augmentation framework improves the success rate for various manipulation tasks in both the same scene and the augmented scene with obstacles requiring collision avoidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents an action-view augmentation framework for visuomotor policies that reconstructs 3D scenes from a single real-world eye-in-hand fisheye demonstration using a novel wide-FoV Gaussian Splatting adaptation, optimizes smooth collision-free trajectories, renders novel views, and augments training data to improve policy success rates on manipulation tasks in both original and obstacle-augmented scenes.

Significance. If the reported gains are substantiated with quantitative validation, the method offers a practical route to data-efficient visuomotor learning by synthesizing diverse, physically feasible training trajectories and views from minimal demonstrations, directly addressing out-of-distribution failures in real-world manipulation.

major comments (2)
  1. [Abstract] Abstract: the central claim that the framework 'improves the success rate for various manipulation tasks' is stated without any numerical results, baseline comparisons, ablation studies, or statistical details, rendering the empirical contribution unverifiable from the provided text.
  2. [Method] Gaussian Splatting adaptation (method section): the claim that rendered novel fisheye views are sufficiently realistic to train generalizable policies rests on the adapted 3DGS reconstruction, yet no quantitative metrics (PSNR, SSIM, or held-out view-synthesis error on real fisheye frames) are supplied to validate reconstruction fidelity or artifact levels.
minor comments (1)
  1. [Experiments] Clarify in the experiments section how trajectory optimization constraints ensure direct executability on the physical robot without additional compliance adjustments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements.

  1. Referee: [Abstract] Abstract: the central claim that the framework 'improves the success rate for various manipulation tasks' is stated without any numerical results, baseline comparisons, ablation studies, or statistical details, rendering the empirical contribution unverifiable from the provided text.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will include key quantitative results from the simulation and real-world experiments, such as success-rate improvements relative to baselines, to make the empirical claims verifiable directly from the abstract.

    revision: yes

  2. Referee: [Method] Gaussian Splatting adaptation (method section): the claim that rendered novel fisheye views are sufficiently realistic to train generalizable policies rests on the adapted 3DGS reconstruction, yet no quantitative metrics (PSNR, SSIM, or held-out view-synthesis error on real fisheye frames) are supplied to validate reconstruction fidelity or artifact levels.

    Authors: We acknowledge that the current manuscript does not report quantitative reconstruction metrics. While downstream policy performance serves as the primary validation, we agree that PSNR, SSIM, and held-out view-synthesis error would strengthen the claim of reconstruction fidelity. We will add these metrics, computed on held-out real fisheye frames, in the revised manuscript.

    revision: yes
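For readers unfamiliar with the metrics the referee requests, PSNR is the simplest of them. The sketch below shows the standard definition on a single-channel image stored as nested lists; the function name `psnr` and the 8-bit default `max_value` are illustrative assumptions, and nothing here reflects how the authors would compute it. SSIM is more involved and is omitted.

```python
import math

def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images.

    Each image is a list of rows of pixel intensities. Higher is better;
    identical images give infinity.
    """
    ref = [v for row in reference for v in row]
    ren = [v for row in rendered for v in row]
    if len(ref) != len(ren) or not ref:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, ren)) / len(ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

In a held-out evaluation, `reference` would be a real fisheye frame excluded from reconstruction and `rendered` the view synthesized at the same camera pose.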

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper presents an engineering framework for data augmentation via adapted Gaussian Splatting reconstruction of fisheye scenes followed by trajectory optimization to generate novel views and actions. No equations, fitted parameters, or self-citations are described that reduce any reported success-rate improvement to an input by construction. The central claims rest on experimental validation in simulation and real-world settings rather than on any self-referential definition or renaming of known results. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient technical detail to enumerate concrete free parameters, axioms, or invented entities; the central claim rests on unstated assumptions about reconstruction fidelity and trajectory executability.

pith-pipeline@v0.9.1-grok · 5718 in / 1088 out tokens · 24399 ms · 2026-06-26T20:28:55.005232+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 20 canonical work pages · 9 internal anchors

  1. [1] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023

  2. [2] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  3. [3] N. M. Shafiullah, Z. Cui, A. A. Altanzaya, and L. Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems, 35:22955–22968, 2022

  4. [4] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  5. [5] O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  6. [6] S. Mirchandani, S. Belkhale, J. Hejna, E. Choi, M. S. Islam, and D. Sadigh. So you think you can scale up autonomous robot data collection? arXiv preprint arXiv:2411.01813, 2024

  7. [7] C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of Robotics: Science and Systems (RSS), 2024

  8. [8] C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019

  9. [9] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021

  10. [10] N. Hansen and X. Wang. Generalization in reinforcement learning by soft data augmentation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13611–13617. IEEE, 2021

  11. [11] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas. Reinforcement learning with augmented data. Advances in neural information processing systems, 33:19884–19895, 2020

  12. [12] D. Yarats, I. Kostrikov, and R. Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International conference on learning representations, 2021

  13. [13] Z. Chen, Z. Mandi, H. Bharadhwaj, M. Sharma, S. Song, A. Gupta, and V. Kumar. Semantically controllable augmentations for generalizable robot learning. The International Journal of Robotics Research, page 02783649241273686, 2024

  14. [14] T. Yu, T. Xiao, A. Stone, J. Tompson, A. Brohan, S. Wang, J. Singh, C. Tan, J. Peralta, B. Ichter, et al. Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023

  15. [15] K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

  16. [16] L. Y. Chen, C. Xu, K. Dharmarajan, M. Z. Irshad, R. Cheng, K. Keutzer, M. Tomizuka, Q. Vuong, and K. Goldberg. Rovi-aug: Robot and viewpoint augmentation for cross-embodiment robot learning. arXiv preprint arXiv:2409.03403, 2024

  17. [17] H. Chen, C. Zhu, Y. Li, and K. Driggs-Campbell. Tool-as-interface: Learning robot policies from human tool usage through imitation learning. arXiv preprint arXiv:2504.04612, 2025

  18. [18] P. Florence, L. Manuelli, and R. Tedrake. Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters, 5(2):492–499, 2019

  19. [19] L. Ke, Y. Zhang, A. Deshpande, S. Srinivasa, and A. Gupta. Ccil: Continuity-based data augmentation for corrective imitation learning. arXiv preprint arXiv:2310.12972, 2023

  20. [20] A. Deshpande, L. Ke, Q. Pfeifer, A. Gupta, and S. S. Srinivasa. Data efficient behavior cloning for fine manipulation via continuity-based corrective labels. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8531–8538. IEEE, 2024

  21. [21] P. Mitrano and D. Berenson. Data augmentation for manipulation. arXiv preprint arXiv:2205.02886, 2022

  22. [22] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. arXiv preprint arXiv:2310.17596, 2023

  23. [23] Z. Jiang, Y. Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. Fan, and Y. Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. arXiv preprint arXiv:2410.24185, 2024

  24. [24] S. Yang, W. Yu, J. Zeng, J. Lv, K. Ren, C. Lu, D. Lin, and J. Pang. Novel demonstration generation with gaussian splatting enables robust one-shot manipulation. arXiv preprint arXiv:2504.13175, 2025

  25. [25] S. Tian, B. Wulfe, K. Sargent, K. Liu, S. Zakharov, V. Guizilini, and J. Wu. View-invariant policy learning via zero-shot novel view synthesis. arXiv preprint arXiv:2409.03685, 2024

  26. [26] C. Pan, B. Okorn, H. Zhang, B. Eisner, and D. Held. Tax-pose: Task-specific cross-pose estimation for robot manipulation. In Conference on Robot Learning, pages 1783–1792. PMLR, 2023

  27. [27] A. Tagliabue and J. P. How. Tube-nerf: Efficient imitation learning of visuomotor policies from mpc via tube-guided data augmentation and nerfs. IEEE Robotics and Automation Letters, 2024

  28. [28] J. Low, M. Adang, J. Yu, K. Nagami, and M. Schwager. Sous vide: Cooking visual drone navigation policies in a gaussian splatting vacuum. IEEE Robotics and Automation Letters, 2025

  29. [29] A. Zhou, M. J. Kim, L. Wang, P. Florence, and C. Finn. Nerf in the palm of your hand: Corrective augmentation for robotics via novel-view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17907–17917, 2023

  30. [30] X. Zhang, M. Chang, P. Kumar, and S. Gupta. Diffusion meets dagger: Supercharging eye-in-hand imitation learning. arXiv preprint arXiv:2402.17768, 2024

  31. [31] R. Hoque, A. Mandlekar, C. Garrett, K. Goldberg, and D. Fox. Intervengen: Interventional data generation for robust and data-efficient robot imitation learning. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2840–2846. IEEE, 2024

  32. [32] C. Garrett, A. Mandlekar, B. Wen, and D. Fox. Skillmimicgen: Automated demonstration generation for efficient skill learning and deployment. arXiv preprint arXiv:2410.18907, 2024

  33. [33] S. Yang, W. Yu, J. Zeng, J. Lv, K. Ren, C. Lu, D. Lin, and J. Pang. Novel demonstration generation with gaussian splatting enables robust one-shot manipulation. RSS, 2025

  34. [34] Z. Xue, S. Deng, Z. Chen, Y. Wang, Z. Yuan, and H. Xu. Demogen: Synthetic demonstration generation for data-efficient visuomotor policy learning. arXiv preprint arXiv:2502.16932, 2025

  35. [35] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  36. [36] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023

  37. [37] J. Kannala and S. S. Brandt. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE transactions on pattern analysis and machine intelligence, 28(8):1335–1340, 2006

  38. [38] M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023

  39. [39] S. Karaman and E. Frazzoli. Sampling-based algorithms for optimal motion planning. The international journal of robotics research, 30(7):846–894, 2011

  40. [40] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  41. [41] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  42. [42] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024

  43. [43] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

  44. [44] Z. Zhu, Z. Fan, Y. Jiang, and Z. Wang. Fsgs: Real-time few-shot view synthesis using gaussian splatting. In European conference on computer vision, pages 145–163. Springer, 2024

  45. [45] J. Chung, J. Oh, and K. M. Lee. Depth-regularized optimization for 3d gaussian splatting in few-shot images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 811–820, 2024

  46. [46] B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 conference papers, pages 1–11, 2024

A.1 How Much to Augment? Fig. A1: How much to augment? While larger augmentation range could increase the data diversity, it also reduces the image rendering quality due to limited d...