ConCent: Contact-Centric Real-to-Sim-to-Real Learning from One Demonstration

Heecheol Kim; Katsushi Ikeuchi; Namiko Saito; Yasuyuki Matsushita

arxiv: 2606.30268 · v1 · pith:X7PZKL67new · submitted 2026-06-29 · 💻 cs.RO

ConCent: Contact-Centric Real-to-Sim-to-Real Learning from One Demonstration

Heecheol Kim , Namiko Saito , Katsushi Ikeuchi , Yasuyuki Matsushita This is my paper

Pith reviewed 2026-06-30 05:24 UTC · model grok-4.3

classification 💻 cs.RO

keywords contact-centric fidelitysim-to-real transferrobot manipulationreinforcement learningcontact dynamicsone demonstrationreal-to-sim

0 comments

The pith

Reproducing task-relevant contact sequences from one real demonstration enables stable sim-to-real transfer in contact-rich manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that successful sim-to-real transfer for robot manipulation requires matching the precise sequence and local dynamics of contacts that occur in the real demonstration, not merely overall task outcomes. It shows that extracting these contact events automatically and using them as the learning objective in simulation prevents policies from exploiting unrealistic contact behaviors available only in the simulator. Objects are approximated as groups of primitives whose geometries are tuned so that simulated contacts explain the observed motions, yielding a reward signal derived directly from the demonstration. This process requires no manual per-task reward design and produces policies that transfer more reliably than unconstrained reinforcement learning baselines. The central claim is that contact-centric fidelity is a necessary condition for task success in such settings.

Core claim

Contact-centric fidelity—reproducing both the contact event sequence (when, where, and how contacts occur) and the local contact dynamics (how forces and motions evolve at each contact)—is a necessary condition for task success. The proposed framework extracts task-relevant contact event sequences from real demonstrations, approximates objects as groups of primitives, optimizes their contact geometry in simulation to explain observed state transitions, and uses the resulting sequence as a structured reward signal to guide policy learning toward physically plausible regimes.

What carries the argument

The contact event sequence extracted by replaying the demonstration, which serves as an automatic structured reward signal.

If this is right

Policies avoid exploiting simulator contacts that have no real-world counterpart.
Stable sim-to-real transfer occurs in contact-rich tasks without manual reward engineering.
Learning is constrained to contact regimes already validated by the real demonstration.
The approach operates from a single demonstration and produces more robust transfer than standard RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Focusing reward on contact matching may allow effective use of simulators whose global dynamics are only approximate.
Increasing the number of primitives could extend the method to objects whose shapes deviate from simple groups.
Contact-matched policies might tolerate moderate changes in object mass or friction as long as the sequence remains feasible.

Load-bearing premise

Objects can be approximated as groups of primitives whose contact geometry can be optimized in simulation to explain the observed state transitions from the demonstration.

What would settle it

A demonstration in which no set of primitive contact geometries reproduces the observed state transitions, or a policy that matches the contact sequence yet fails to succeed after real-world deployment.

Figures

Figures reproduced from arXiv: 2606.30268 by Heecheol Kim, Katsushi Ikeuchi, Namiko Saito, Yasuyuki Matsushita.

**Figure 1.** Figure 1: ConCent grounds simulation in real interaction physics from a single demonstration and learns a contact-rich policy that transfers back to the real world. In Real→ Sim Contact Optimization, one real RGB-D demo provides point tracks that we use to optimize contact geometry, so the simulator reproduces the observed motion; replaying it extracts a contact event sequence used as a structured reward. A contact… view at source ↗

**Figure 2.** Figure 2: Principle of ConCent. The local contact dynamics can be revealed by a real-world demonstration (left); the contact-centric RL then constrains the policy to it, minimizing the sim-toreal gap (right). From this perspective, we propose a new realto-sim-to-real RL framework (grounding the simulator in a real demonstration, training an RL policy in that simulator, and deploying it back to the real world) tha… view at source ↗

**Figure 3.** Figure 3: Real-world rollout keyframes (ConCent). A successful rollout. The block is grasped, aligned, and inserted into the tight 2 mm-clearance hole; in the last frame the block has vanished into the hole. Q2. How do the two contact-centric components produce the gains observed in Q1? (Sec. 4.3) Q3. Does VCP reduce the computational cost of training and scale to massively parallel simulation, compared to full phy… view at source ↗

**Figure 4.** Figure 4: In-hand view during insertion (wrist camera). Top (ConCent): the block stays aligned and slides in. Bottom (w/o contact optimization): a Wrong Contact knocks the in-hand pose off and the block jams against the rim. the block to collide with and jam against the hole edge, lose its pose, and fail the insertion. That is, success on this task is governed by the reproduction of task-relevant local contact (Q1).… view at source ↗

**Figure 5.** Figure 5: Contact geometry optimization aligns the simulated insertion with the demonstration. (a) Block position error (simulation vs. real observation) over the insertion phase, before vs. after optimization. (b) The last replayed frame under the initial contact geometry (left, red) and the optimized geometry C ∗ (right, green). frame 0 frame 28 frame 72 [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Contact schedule from replaying the optimized contact geometry C ∗ . Simulation frames at the three key timesteps of the rollout (frame index continuous across stages): frame 0 the gripper reaches toward the block, frame 28 the grasped block is transported toward the hole, and frame 72 it is aligned and inserted. The paired cyan and green spheres connected by white lines denote the active contact pairs of … view at source ↗

**Figure 7.** Figure 7: Throughput scaling of VCP vs. FULL physical contact. VCP scales steadily with the number of parallel environments, whereas FULL is stable only at 256 and diverges at ≥ 512 environments when the simultaneous contacts exceed the contact-buffer capacity (red ×). data-efficient pathway that does not rely on large-scale imitation learning data. Moreover, because such contact-centric constraints are inherently m… view at source ↗

**Figure 8.** Figure 8: Decomposition of the hybrid rendering pipeline. Rows: two base-camera and two wrist-camera samples; columns: raw sim render | final hybrid composite | flow-matching robotand-background | 3DGS objects (block+box) | object-shadow render. The last three columns are composited into the second, closing the sim-to-real appearance gap used for VLA distillation. F Reward Ablation: Engineered Insertion Reward Base… view at source ↗

read the original abstract

Sim-to-real policy transfer -- deploying policies trained in simulation in the real world -- is a promising paradigm for scaling robot manipulation without large-scale real-world data. However, transferring simulation-trained policies remains challenging due to discrepancies in contact dynamics -- particularly in contact-rich tasks where subtle differences can alter task outcomes entirely. Because interaction between the manipulated object and the environment is mediated through contact, task success depends on accurately reproducing task-relevant contacts. Accordingly, in manipulation, contact-centric fidelity -- reproducing both the contact event sequence (when, where, and how contacts occur) and the local contact dynamics (how forces and motions evolve at each contact) -- is a necessary condition for task success. Based on this insight, we propose a contact-centric real-to-sim-to-real RL framework that uses task-relevant contact event sequences extracted from real demonstrations as the learning objective. We approximate objects as groups of primitives and optimize their contact geometry in simulation so that the resulting local contact dynamics explain the observed state transitions. The contact event sequence is automatically extracted by replaying the demonstration. This sequence serves as a structured reward signal, guiding the policy toward physically plausible contact regimes validated in reality and preventing exploitation of unrealistic simulator contacts. The signal is obtained automatically, requiring no per-task reward design. Experiments on contact-rich manipulation tasks demonstrate more stable and robust sim-to-real policy transfer compared to unconstrained RL baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The contact event sequence from one demo as automatic RL reward is a reasonable targeted move for contact-rich sim-to-real, but the primitive geometry optimization step looks underconstrained.

read the letter

The paper's core idea is to pull task-relevant contact event sequences automatically from a single real demonstration, then use that sequence as the structured reward when training policies in simulation. They approximate objects as groups of primitives, optimize the contact geometry so the simulated local dynamics match the observed state transitions from the demo, and replay the demonstration to generate the event sequence. This is positioned as a way to enforce physically plausible contacts without hand-designed rewards per task.

The approach does target a real pain point: in contact-rich manipulation, small mismatches in when and how contacts occur can break transfer. Framing the reward around the contact sequence rather than just final task success is a clear step beyond standard unconstrained RL baselines. Making the signal automatic from one demo is also practical.

The main soft spot is the optimization of primitive contact geometry. If friction, stiffness, or other parameters remain free, multiple geometry-plus-parameter combinations can likely reproduce the same coarse state transitions. In that case the reward would not reliably select the true contact regime, and the claim of "physically plausible contact regimes validated in reality" would not hold. The abstract presents the step as automatic and sufficient, but the underconstrained risk is real and load-bearing. Experimental details are also absent, so it is impossible to judge whether the reported improvement over baselines is robust or just a lucky fit.

This is for people working on sim-to-real transfer in robot manipulation. A reader already thinking about contact modeling would get value from the framing and the automatic extraction mechanism. The paper shows clear thinking on the problem even if the evidence is thin, so it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript presents ConCent, a contact-centric real-to-sim-to-real RL framework for robot manipulation. From a single real demonstration, it automatically extracts task-relevant contact event sequences, approximates objects as groups of primitives, optimizes their contact geometry in simulation to reproduce observed state transitions via local contact dynamics, and uses the resulting sequence as a structured reward signal. This guides policy learning toward physically plausible contact regimes and yields more stable sim-to-real transfer than unconstrained RL baselines on contact-rich tasks.

Significance. If the optimization step reliably produces unique, reality-validated contact rewards, the approach would offer a data-efficient alternative to manual reward design for sim-to-real transfer, directly targeting the contact discrepancies that dominate failure in manipulation. The automatic extraction of contact sequences from one demonstration and the focus on both event ordering and local dynamics are genuine strengths that could generalize across tasks. The significance is nevertheless conditional on whether the primitive-group fitting is sufficiently constrained to avoid simulator artifacts.

major comments (2)

[Abstract] Abstract (and Method section on contact geometry optimization): the central claim that optimizing primitive contact geometry produces a reward signal enforcing 'physically plausible contact regimes validated in reality' is load-bearing, yet the description does not specify whether friction, stiffness, or restitution coefficients are held fixed during the optimization. If these remain free, multiple geometry-plus-parameter combinations can reproduce the same coarse state transitions, rendering the extracted reward non-unique and potentially rewarding simulator artifacts rather than true contact dynamics.
[Abstract] Abstract (and Experiments): the assertion of 'more stable and robust sim-to-real policy transfer' compared to unconstrained RL baselines is presented without reported metrics, error bars, or ablations isolating the contribution of the contact-centric reward versus other implementation choices (e.g., primitive count, optimization objective). This makes it impossible to verify that improved transfer stems from contact fidelity rather than incidental factors.

minor comments (1)

[Abstract] Abstract: the phrase 'the signal is obtained automatically, requiring no per-task reward design' is repeated; a single concise statement would suffice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract (and Method section on contact geometry optimization): the central claim that optimizing primitive contact geometry produces a reward signal enforcing 'physically plausible contact regimes validated in reality' is load-bearing, yet the description does not specify whether friction, stiffness, or restitution coefficients are held fixed during the optimization. If these remain free, multiple geometry-plus-parameter combinations can reproduce the same coarse state transitions, rendering the extracted reward non-unique and potentially rewarding simulator artifacts rather than true contact dynamics.

Authors: We agree this clarification is necessary. The optimization in ConCent is performed exclusively over primitive contact geometry (positions, orientations, and sizes), while friction, stiffness, and restitution coefficients are held fixed at simulator defaults. This constraint ensures the extracted contact sequence is tied to geometry that reproduces observed transitions under fixed material properties, avoiding non-uniqueness. We will revise the abstract and method section to state this explicitly. revision: yes
Referee: [Abstract] Abstract (and Experiments): the assertion of 'more stable and robust sim-to-real policy transfer' compared to unconstrained RL baselines is presented without reported metrics, error bars, or ablations isolating the contribution of the contact-centric reward versus other implementation choices (e.g., primitive count, optimization objective). This makes it impossible to verify that improved transfer stems from contact fidelity rather than incidental factors.

Authors: The experiments section reports quantitative success rates, transfer metrics, and baseline comparisons, including analysis varying primitive count. The abstract summarizes these without specific numbers or error bars. We will revise the abstract to include key quantitative results and error bars. Additional ablations isolating the reward signal will be added if space allows; the current primitive-count analysis already provides partial isolation of implementation choices. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation chain extracts contact event sequences directly from real demonstration data by replaying trajectories and optimizes primitive contact geometry to reproduce observed state transitions as an explicit fitting step to generate a reward signal for RL policy training. This is a data-driven procedure grounded in external real-world measurements rather than any self-definitional loop, fitted-input-renamed-as-prediction, or self-citation load-bearing premise. No equations or claims in the abstract or described method reduce the central result (improved transfer via contact-centric reward) to its own inputs by construction; the optimization serves as an intermediate mechanism whose output is then used for downstream policy learning, with validation against unconstrained RL baselines. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to the explicit domain assumption stated as the foundation of the work.

axioms (1)

domain assumption Contact-centric fidelity is a necessary condition for task success in contact-rich manipulation because interaction is mediated through contact.
This premise is directly stated in the abstract and used to justify the entire framework.

pith-pipeline@v0.9.1-grok · 5786 in / 1444 out tokens · 38648 ms · 2026-06-30T05:24:15.741748+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 4 canonical work pages · 1 internal anchor

[1]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. T. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. San- keti, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Jul...

2023
[2]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. InConference on Robot Learning (CoRL), volume 270, pages 2679–2713, 2024

2024
[3]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control. InRobotics: Science and Systems (RSS), 2025

2025
[4]

W. Yu, J. Tan, C. K. Liu, and G. Turk. Preparing for the unknown: Learning a universal policy with online system identification. InRobotics: Science and Systems (RSS), 2017

2017
[5]

T. Lin, K. Sachdev, L. Fan, J. Malik, and Y . Zhu. Sim-to-real reinforcement learning for vision- based dexterous manipulation on humanoids. InConference on Robot Learning (CoRL), pages 4926–4940, 2025

2025
[6]

Tobin, R

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017

2017
[7]

X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. InIEEE International Conference on Robotics and Automation (ICRA), pages 1–8, 2018

2018
[8]

Takamatsu, K

J. Takamatsu, K. Ogawara, H. Kimura, and K. Ikeuchi. Recognizing assembly tasks through human demonstration.Int. J. Robotics Res., 26(7):641–659, 2007

2007
[9]

Ikeuchi, N

K. Ikeuchi, N. Wake, J. Takamatsu, and K. Sasabuchi.Learning-from-Observation 2.0. 2025

2025
[10]

X. Li, M. Baum, and O. Brock. Augmentation enables one-shot generalization in learning from demonstration for contact-rich manipulation. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3656–3663, 2023

2023
[11]

Subramani, M

G. Subramani, M. R. Zinn, and M. Gleicher. Inferring geometric constraints in human demon- strations. InConference on Robot Learning (CoRL), volume 87, pages 223–236, 2018

2018
[12]

X. Zhu, J. Ke, Z. Xu, Z. Sun, B. Bai, J. Lv, Q. Liu, Y . Zeng, Q. Ye, C. Lu, M. Tomizuka, and L. Shao. Diff-lfd: Contact-aware model-based learning from visual demonstration for robotic manipulation via differentiable physics-based simulation and rendering. InConference on Robot Learning (CoRL), volume 229, pages 499–512, 2023

2023
[13]

M. Wang, W. Jin, K. Cao, L. Xie, and Y . Hong. Contactgaussian-wm: Learning physics- grounded world model from videos.arXiv preprint arXiv:2602.11021, 2026

work page arXiv 2026
[14]

Mandlekar, S

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. InConference on Robot Learning (CoRL), volume 229, pages 1820–1864, 2023. 11

2023
[15]

X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne. Deepmimic: example-guided deep reinforcement learning of physics-based character skills.ACM Trans. Graph., 37(4):143, 2018

2018
[16]

A. Y . Ng and S. Russell. Algorithms for inverse reinforcement learning. InInternational Conference on Machine Learning (ICML), pages 663–670, 2000

2000
[17]

Ho and S

J. Ho and S. Ermon. Generative adversarial imitation learning. InAdvances in Neural Infor- mation Processing Systems (NeurIPS), pages 4565–4573, 2016

2016
[18]

Y . J. Ma, W. Liang, G. Wang, D. Huang, O. Bastani, D. Jayaraman, Y . Zhu, L. Fan, and A. Anandkumar. Eureka: Human-level reward design via coding large language models. In International Conference on Learning Representations (ICLR), 2024

2024
[19]

P. Dan, K. Kedia, A. Chao, E. W. Duan, M. A. Pace, W.-C. Ma, and S. Choudhury. X-sim: Cross-embodiment learning via real-to-sim-to-real. InConference on Robot Learning (CoRL), 2025

2025
[20]

T. G. W. Lum, O. Y . Lee, C. K. Liu, and J. Bohg. Crossing the human-robot embodiment gap with sim-to-real RL using one human demonstration. InConference on Robot Learning (CoRL), 2025

2025
[21]

Mildenhall, P

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: representing scenes as neural radiance fields for view synthesis.Commun. ACM, 65(1):99–106, 2022

2022
[22]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139:1–139:14, 2023

2023
[23]

Moran, M

B. Moran, M. Comi, A. Byravan, S. Bohez, T. Erez, Z. Li, and L. Hasenclever. Splat- ting physical scenes: End-to-end real-to-sim from imperfect robot data.arXiv preprint arXiv:2506.04120, 2025

work page arXiv 2025
[24]

M. T. Villasevil, A. Simeonov, Z. Li, A. Chan, T. Chen, A. Gupta, and P. Agrawal. Recon- ciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. In Robotics: Science and Systems (RSS), 2024

2024
[25]

X. Han, M. Liu, Y . Chen, J. Yu, X. Lyu, Y . Tian, B. Wang, W. Zhang, and J. Pang. Re 3sim: Generating high-fidelity simulation data via 3d-photorealistic real-to-sim for robotic manipu- lation. InIEEE International Conference on Robotics and Automation (ICRA), 2026

2026
[26]

T. Dai, J. Wong, Y . Jiang, C. Wang, C. Gokmen, R. Zhang, J. Wu, and L. Fei-Fei. Automated creation of digital cousins for robust policy learning. InConference on Robot Learning (CoRL), volume 270, pages 4912–4943, 2024

2024
[27]

B. Tang, M. A. Lin, I. Akinola, A. Handa, G. S. Sukhatme, F. Ramos, D. Fox, and Y . Narang. Industreal: Transferring contact-rich assembly tasks from simulation to reality. InRobotics: Science and Systems (RSS), 2023

2023
[28]

N. Ravi, V . Gabeur, Y . Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V . Alwala, N. Carion, C. Wu, R. B. Girshick, P. Doll ´ar, and C. Feichtenhofer. SAM 2: Segment anything in images and videos. InInternational Conference on Learning Representations (ICLR), 2025

2025
[29]

Doersch, Y

C. Doersch, Y . Yang, M. Vecer´ık, D. Gokay, A. Gupta, Y . Aytar, J. Carreira, and A. Zisserman. TAPIR: tracking any point with per-frame initialization and temporal refinement. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 10027–10038, 2023

2023
[30]

Todorov, T

E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026– 5033, 2012. 12

2012
[31]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem. Brax - A differen- tiable physics engine for large scale rigid body simulation. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

2021
[33]

P. Yin, T. Westenbroek, Z. Zhang, J. Tran, I. Dagnino, E. Shilamkar, N. Mbiziwo-Tiapo, S. Bagaria, X. Liu, G. Mullins, A. Kolobov, and A. Gupta. Emergent dexterity via diverse resets and large-scale reinforcement learning.arXiv preprint arXiv:2603.15789, 2026

work page arXiv 2026
[34]

Reuss, H

M. Reuss, H. Zhou, M. R ¨uhle, ¨O. E. Ya˘gmurlu, F. Otto, and R. Lioutikov. Flower: Democratiz- ing generalist robot policies with efficient vision-language-action flow policies. InConference on Robot Learning (CoRL), 2025

2025
[35]

Beyer and H

H. Beyer and H. Schwefel. Evolution strategies - A comprehensive introduction.Nat. Comput., 1(1):3–52, 2002

2002
[36]

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5745–5753, 2019

2019
[37]

J. L. Sch ¨onberger and J. Frahm. Structure-from-motion revisited. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016

2016
[38]

Ronneberger, P

O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351, pages 234–241, 2015

2015
[39]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

2023
[40]

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

1992
[41]

Perez, F

E. Perez, F. Strub, H. de Vries, V . Dumoulin, and A. C. Courville. Film: Visual reasoning with a general conditioning layer. InAAAI Conference on Artificial Intelligence (AAAI), pages 3942–3951, 2018

2018
[42]

B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1132–1140, 2017

2017
[43]

Ioffe and C

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational Conference on Machine Learning (ICML), volume 37, pages 448–456, 2015

2015
[44]

W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1874–1883, 2016. 13 Appendix A Contact Geometry Optimization: Implementation Det...

2016

[1] [1]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, Q. Vuong, V . Vanhoucke, H. T. Tran, R. Soricut, A. Singh, J. Singh, P. Sermanet, P. R. San- keti, G. Salazar, M. S. Ryoo, K. Reymann, K. Rao, K. Pertsch, I. Mordatch, H. Michalewski, Y . Lu, S. Levine, L. Lee, T. E. Lee, I. Leal, Y . Kuang, D. Kalashnikov, R. Jul...

2023

[2] [2]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. InConference on Robot Learning (CoRL), volume 270, pages 2679–2713, 2024

2024

[3] [3]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al.π 0: A vision-language-action flow model for general robot control. InRobotics: Science and Systems (RSS), 2025

2025

[4] [4]

W. Yu, J. Tan, C. K. Liu, and G. Turk. Preparing for the unknown: Learning a universal policy with online system identification. InRobotics: Science and Systems (RSS), 2017

2017

[5] [5]

T. Lin, K. Sachdev, L. Fan, J. Malik, and Y . Zhu. Sim-to-real reinforcement learning for vision- based dexterous manipulation on humanoids. InConference on Robot Learning (CoRL), pages 4926–4940, 2025

2025

[6] [6]

Tobin, R

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017

2017

[7] [7]

X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. InIEEE International Conference on Robotics and Automation (ICRA), pages 1–8, 2018

2018

[8] [8]

Takamatsu, K

J. Takamatsu, K. Ogawara, H. Kimura, and K. Ikeuchi. Recognizing assembly tasks through human demonstration.Int. J. Robotics Res., 26(7):641–659, 2007

2007

[9] [9]

Ikeuchi, N

K. Ikeuchi, N. Wake, J. Takamatsu, and K. Sasabuchi.Learning-from-Observation 2.0. 2025

2025

[10] [10]

X. Li, M. Baum, and O. Brock. Augmentation enables one-shot generalization in learning from demonstration for contact-rich manipulation. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3656–3663, 2023

2023

[11] [11]

Subramani, M

G. Subramani, M. R. Zinn, and M. Gleicher. Inferring geometric constraints in human demon- strations. InConference on Robot Learning (CoRL), volume 87, pages 223–236, 2018

2018

[12] [12]

X. Zhu, J. Ke, Z. Xu, Z. Sun, B. Bai, J. Lv, Q. Liu, Y . Zeng, Q. Ye, C. Lu, M. Tomizuka, and L. Shao. Diff-lfd: Contact-aware model-based learning from visual demonstration for robotic manipulation via differentiable physics-based simulation and rendering. InConference on Robot Learning (CoRL), volume 229, pages 499–512, 2023

2023

[13] [13]

M. Wang, W. Jin, K. Cao, L. Xie, and Y . Hong. Contactgaussian-wm: Learning physics- grounded world model from videos.arXiv preprint arXiv:2602.11021, 2026

work page arXiv 2026

[14] [14]

Mandlekar, S

A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. InConference on Robot Learning (CoRL), volume 229, pages 1820–1864, 2023. 11

2023

[15] [15]

X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne. Deepmimic: example-guided deep reinforcement learning of physics-based character skills.ACM Trans. Graph., 37(4):143, 2018

2018

[16] [16]

A. Y . Ng and S. Russell. Algorithms for inverse reinforcement learning. InInternational Conference on Machine Learning (ICML), pages 663–670, 2000

2000

[17] [17]

Ho and S

J. Ho and S. Ermon. Generative adversarial imitation learning. InAdvances in Neural Infor- mation Processing Systems (NeurIPS), pages 4565–4573, 2016

2016

[18] [18]

Y . J. Ma, W. Liang, G. Wang, D. Huang, O. Bastani, D. Jayaraman, Y . Zhu, L. Fan, and A. Anandkumar. Eureka: Human-level reward design via coding large language models. In International Conference on Learning Representations (ICLR), 2024

2024

[19] [19]

P. Dan, K. Kedia, A. Chao, E. W. Duan, M. A. Pace, W.-C. Ma, and S. Choudhury. X-sim: Cross-embodiment learning via real-to-sim-to-real. InConference on Robot Learning (CoRL), 2025

2025

[20] [20]

T. G. W. Lum, O. Y . Lee, C. K. Liu, and J. Bohg. Crossing the human-robot embodiment gap with sim-to-real RL using one human demonstration. InConference on Robot Learning (CoRL), 2025

2025

[21] [21]

Mildenhall, P

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: representing scenes as neural radiance fields for view synthesis.Commun. ACM, 65(1):99–106, 2022

2022

[22] [22]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139:1–139:14, 2023

2023

[23] [23]

Moran, M

B. Moran, M. Comi, A. Byravan, S. Bohez, T. Erez, Z. Li, and L. Hasenclever. Splat- ting physical scenes: End-to-end real-to-sim from imperfect robot data.arXiv preprint arXiv:2506.04120, 2025

work page arXiv 2025

[24] [24]

M. T. Villasevil, A. Simeonov, Z. Li, A. Chan, T. Chen, A. Gupta, and P. Agrawal. Recon- ciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. In Robotics: Science and Systems (RSS), 2024

2024

[25] [25]

X. Han, M. Liu, Y . Chen, J. Yu, X. Lyu, Y . Tian, B. Wang, W. Zhang, and J. Pang. Re 3sim: Generating high-fidelity simulation data via 3d-photorealistic real-to-sim for robotic manipu- lation. InIEEE International Conference on Robotics and Automation (ICRA), 2026

2026

[26] [26]

T. Dai, J. Wong, Y . Jiang, C. Wang, C. Gokmen, R. Zhang, J. Wu, and L. Fei-Fei. Automated creation of digital cousins for robust policy learning. InConference on Robot Learning (CoRL), volume 270, pages 4912–4943, 2024

2024

[27] [27]

B. Tang, M. A. Lin, I. Akinola, A. Handa, G. S. Sukhatme, F. Ramos, D. Fox, and Y . Narang. Industreal: Transferring contact-rich assembly tasks from simulation to reality. InRobotics: Science and Systems (RSS), 2023

2023

[28] [28]

N. Ravi, V . Gabeur, Y . Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V . Alwala, N. Carion, C. Wu, R. B. Girshick, P. Doll ´ar, and C. Feichtenhofer. SAM 2: Segment anything in images and videos. InInternational Conference on Learning Representations (ICLR), 2025

2025

[29] [29]

Doersch, Y

C. Doersch, Y . Yang, M. Vecer´ık, D. Gokay, A. Gupta, Y . Aytar, J. Carreira, and A. Zisserman. TAPIR: tracking any point with per-frame initialization and temporal refinement. InIEEE/CVF International Conference on Computer Vision (ICCV), pages 10027–10038, 2023

2023

[30] [30]

Todorov, T

E. Todorov, T. Erez, and Y . Tassa. Mujoco: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026– 5033, 2012. 12

2012

[31] [31]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[32] [32]

C. D. Freeman, E. Frey, A. Raichuk, S. Girgin, I. Mordatch, and O. Bachem. Brax - A differen- tiable physics engine for large scale rigid body simulation. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

2021

[33] [33]

P. Yin, T. Westenbroek, Z. Zhang, J. Tran, I. Dagnino, E. Shilamkar, N. Mbiziwo-Tiapo, S. Bagaria, X. Liu, G. Mullins, A. Kolobov, and A. Gupta. Emergent dexterity via diverse resets and large-scale reinforcement learning.arXiv preprint arXiv:2603.15789, 2026

work page arXiv 2026

[34] [34]

Reuss, H

M. Reuss, H. Zhou, M. R ¨uhle, ¨O. E. Ya˘gmurlu, F. Otto, and R. Lioutikov. Flower: Democratiz- ing generalist robot policies with efficient vision-language-action flow policies. InConference on Robot Learning (CoRL), 2025

2025

[35] [35]

Beyer and H

H. Beyer and H. Schwefel. Evolution strategies - A comprehensive introduction.Nat. Comput., 1(1):3–52, 2002

2002

[36] [36]

Y . Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5745–5753, 2019

2019

[37] [37]

J. L. Sch ¨onberger and J. Frahm. Structure-from-motion revisited. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016

2016

[38] [38]

Ronneberger, P

O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351, pages 234–241, 2015

2015

[39] [39]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

2023

[40] [40]

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.SIAM Journal on Control and Optimization, 30(4):838–855, 1992

1992

[41] [41]

Perez, F

E. Perez, F. Strub, H. de Vries, V . Dumoulin, and A. C. Courville. Film: Visual reasoning with a general conditioning layer. InAAAI Conference on Artificial Intelligence (AAAI), pages 3942–3951, 2018

2018

[42] [42]

B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1132–1140, 2017

2017

[43] [43]

Ioffe and C

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. InInternational Conference on Machine Learning (ICML), volume 37, pages 448–456, 2015

2015

[44] [44]

W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1874–1883, 2016. 13 Appendix A Contact Geometry Optimization: Implementation Det...

2016