Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

Chen yizhou; Guanqi Chen; Hua Chen; Jia Pan; Tengye Xu; Yangting Sun; Zhen Fu; Ziju Shen

arxiv: 2605.22123 · v1 · pith:O6OIB7BKnew · submitted 2026-05-21 · 💻 cs.RO


Tengye Xu, Yangting Sun, Ziju Shen, Guanqi Chen, Zhen Fu, Chen yizhou, Hua Chen, Jia Pan

Pith reviewed 2026-05-22 05:27 UTC · model grok-4.3

classification 💻 cs.RO

keywords invariant rewards · robot manipulation · reinforcement learning · few-shot learning · symbolic rewards · generalization · reward learning · behavioral invariants


The pith

Learning invariant symbolic rewards from few demonstrations enables zero-shot generalization across visual changes in robot manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to build reward functions for robotics that remain useful when the same task appears with different objects, positions, or camera angles. Rather than memorizing pixel patterns from demonstrations, the approach identifies higher-level behavioral properties that stay fixed despite those visual differences. It does so by pairing a structural reward formulation, which encodes strategies and constraints without altering the optimal policies, with a procedure that extracts those properties from as few as five demonstrations and without further robot interaction. If the claim holds, one learned reward can support many task variants in real settings and speed up policy training relative to methods that overfit to specific visuals.

Core claim

The paper claims that invariant symbolic reward functions can be learned from as few as five demonstrations by shifting focus to task-level properties that remain constant across visual instantiations. This is realized through two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations. Experiments show stronger process alignment and policy rollout ranking on eight Meta-World tasks and three Franka tasks, faster downstream learning, and zero-shot transfer in three real-world out-of-distribution tests.

What carries the argument

The structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, coupled with a hybrid symbolic-numerical procedure to distill invariants from demonstrations.
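For context, the optimal-policy invariance invoked here is the classical guarantee of potential-based reward shaping (Ng, Harada, and Russell, 1999). The block below restates that standard result in generic textbook notation; the symbols $r$, $\Phi$, and $\gamma$ are assumptions of this sketch rather than the paper's own equations, and the paper's PBRS-MS variant may differ in detail.

```latex
% Shaped reward built from a base reward r, a potential \Phi, and a discount \gamma
r'(s_t, a_t, s_{t+1}) = r(s_t, a_t, s_{t+1}) + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)

% Discounted return of any trajectory from s_0: the shaping terms telescope,
% assuming 0 \le \gamma < 1 and a bounded potential \Phi
\sum_{t=0}^{\infty} \gamma^{t}\, r'(s_t, a_t, s_{t+1})
  = \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t, s_{t+1}) - \Phi(s_0)
```

The offset $\Phi(s_0)$ depends only on the start state, not on the actions taken, so every policy's return shifts by the same amount and the ordering of policies, including the optimal one, is unchanged.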

If this is right

The method produces stronger process alignment and better policy rollout ranking than baselines on eight Meta-World tasks and three Franka manipulation tasks.
Downstream policy learning is accelerated when using the learned reward.
A single reward transfers zero-shot to new positions, viewpoints, and objects in real-world experiments without retraining or online interaction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same invariants could let a robot reuse one reward model for a family of related manipulation problems that differ only in surface appearance.
Extending the structural constraints to include additional physical rules might handle tasks with deformable objects or partial observability.
If the distillation procedure scales, it could reduce reliance on hand-crafted rewards when deploying robots in unstructured environments.

Load-bearing premise

Task-level properties and the ranking of optimal policies stay the same even when object instances, positions, and viewpoints change substantially.

What would settle it

A demonstration that the learned reward ranks unsuccessful policies higher than successful ones or fails to produce working rollouts under new object, position, or viewpoint conditions beyond the three tested real-world variations.
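The falsification test described above can be phrased as a pairwise check. The following is a minimal sketch, assuming each rollout comes with a scalar return under the learned reward and a ground-truth success flag; the function names and the example values are hypothetical and are not taken from the paper.

```python
from itertools import product


def ranking_violations(returns, succeeded):
    """List (successful_idx, failed_idx) pairs where the failed rollout
    scores at least as high as the successful one."""
    wins = [i for i, ok in enumerate(succeeded) if ok]
    fails = [i for i, ok in enumerate(succeeded) if not ok]
    return [(w, f) for w, f in product(wins, fails) if returns[f] >= returns[w]]


def violation_rate(returns, succeeded):
    """Fraction of (successful, failed) pairs that are ranked the wrong way."""
    n_wins = sum(succeeded)
    n_pairs = n_wins * (len(succeeded) - n_wins)
    if n_pairs == 0:
        raise ValueError("need at least one successful and one failed rollout")
    return len(ranking_violations(returns, succeeded)) / n_pairs


# Placeholder values for illustration only, not results from the paper.
example_returns = [4.1, 3.7, 1.2, 3.9]
example_success = [True, True, False, False]
```

A violation rate of zero on held-out object, position, and viewpoint conditions would be consistent with the claim; a clearly non-zero rate on such conditions would count against it.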

Figures

Figures reproduced from arXiv: 2605.22123 by Chen yizhou, Guanqi Chen, Hua Chen, Jia Pan, Tengye Xu, Yangting Sun, Zhen Fu, Ziju Shen.

**Figure 2.** Framework Overview: (a) Structural Reward Formulation: A pipeline that maps raw visual input to robust reward signals. It comprises a FlowGenerator for flow generation, a symbolic potential function for progress estimation, and a Potential-Based Reward Shaping-MileStone (PBRS-MS) module to ensure optimal policy invariance and signal stability. (b) Hybrid Symbolic-Numerical Learning: A bi-level optimizatio…

**Figure 3.** The flow generation procedure. (Adjacent text from Section IV, Structural Reward Formulation: "Rather than learning a reward function end-to-end from images, FLORA encodes the Behavioral and Optimality Invariance constraints directly into the reward architecture, converting a constrained optimization problem into a tractable unconstrained one. The formulation has three components, as illustrated in …")

**Figure 4.** Potential collapse under standard PBRS and its resolution by our …

**Figure 5.** The Hybrid Symbolic-Numerical Learning Pipeline.

**Figure 6.** Representative rollouts and reward curves on the Lever-Pull task under (a) base and (b) viewpoint-OOD settings. Origin denotes the default dense reward …

**Figure 7.** Meta-World Performance: We report interquartile means of success …

**Figure 8.** Generalization Performance: We reuse the trained reward models in …

**Figure 9.** Real-world manipulation tasks and OOD variants. (a)–(c) Base tasks …

**Figure 10.** Ablation study on the PBRS-MS module. (Adjacent text, item f, Hybrid Optimization Method: "The hybrid optimizer is compared against three reduced variants: LLM reflection alone, Bayesian Optimization alone, and direct selection of the best LLM-generated candidate without further optimization. Each variant is run five times; performance is measured by …")
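The Figure 7 caption mentions interquartile means. As a reminder of what that statistic computes, here is a minimal sketch: sort the scores, drop the lowest and highest quarter, and average the rest. The trimming rule (dropping `n // 4` values from each end) and the function name are assumptions of this sketch; how the paper aggregates across seeds and tasks is not stated in the truncated caption.

```python
def interquartile_mean(scores):
    """Mean of the middle half of the scores after sorting."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("scores must not be empty")
    k = n // 4  # number of values trimmed from each end
    middle = ordered[k:n - k]
    return sum(middle) / len(middle)
```

Compared with a plain mean, this statistic is less sensitive to a few runs that fail completely or succeed trivially.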

Original abstract

Designing reward functions that generalize beyond controlled laboratory settings remains a fundamental challenge in reinforcement learning for robotics. In open-world manipulation problems, a single task can appear in numerous variants through different object instances, positions, and camera viewpoints. Recent vision-based reward models tend to memorize specific pixel distributions and fail to generalize beyond their training conditions. To address this, we propose a framework that learns invariant symbolic reward functions from as few as five demonstrations. The insight is to shift from visual feature-fitting to the discovery of behavioral invariants: task-level properties that remain constant across diverse visual instantiations. The framework has two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations without online interaction. Experiments on eight Meta-World tasks and three Franka manipulation tasks demonstrate that our method achieves stronger process alignment and policy rollout ranking abilities compared to baselines, accelerating downstream policy learning. Three real-world out-of-distribution experiments further show that the same learned reward generalizes zero-shot to position, viewpoint, and object variations, enabling a single reward representation to be reused across diverse task variants in practice.
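To make the idea of a hybrid symbolic-numerical procedure concrete, the sketch below runs an outer loop over candidate symbolic forms and an inner numeric fit of one parameter, using only offline demonstrations. Everything in it is an assumption for illustration: the candidate forms are hard-coded (the Figure 10 text suggests the paper obtains candidates from an LLM), a plain grid search stands in for Bayesian Optimization, and the loss (gap between the potential and normalized time along each demonstration) is a stand-in objective, not the paper's.

```python
import math

# Hypothetical symbolic candidates: a potential as a function of a scalar
# distance-to-goal feature d and one numeric parameter k.
CANDIDATES = {
    "linear": lambda d, k: 1.0 - k * d,
    "exp_decay": lambda d, k: math.exp(-k * d),
    "inverse": lambda d, k: 1.0 / (1.0 + k * d),
}


def alignment_loss(potential, k, demos):
    """Assumed objective: mean squared gap between the potential and the
    normalized time index t / (T - 1). Each demo needs at least two steps."""
    total, count = 0.0, 0
    for distances in demos:
        last = len(distances) - 1
        for t, d in enumerate(distances):
            total += (potential(d, k) - t / last) ** 2
            count += 1
    return total / count


def fit(demos, grid=(0.5, 1.0, 2.0, 4.0)):
    """Outer loop over symbolic forms, inner grid search over k.
    Returns (form_name, k, loss) for the best combination found."""
    best = None
    for name, potential in CANDIDATES.items():
        for k in grid:
            loss = alignment_loss(potential, k, demos)
            if best is None or loss < best[2]:
                best = (name, k, loss)
    return best


# Toy demonstrations: distance-to-goal per step (placeholder values).
toy_demos = [[1.0, 0.75, 0.5, 0.25, 0.0], [1.0, 0.5, 0.0]]
```

On the toy data, `fit(toy_demos)` selects the `linear` form with `k = 1.0`, because the placeholder distances were constructed to shrink linearly with time. No environment interaction is needed at any point, which is the property the abstract emphasizes.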

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Symbolic invariant rewards from five demos look practical for visual generalization in robotics, but the invariance property lacks formal verification or counterexamples.


Hey colleague, the main thing here is that the paper shifts from pixel-fitting rewards to symbolic ones distilled from just five demonstrations, with the goal of zero-shot reuse across visual changes like position, viewpoint, and objects in real robots. That combination is the clearest new element relative to the vision-based models they reference. They do a solid job running the method on eight Meta-World tasks plus three Franka manipulations, reporting better process alignment and rollout ranking than baselines, which then speeds up policy learning. The three real-world out-of-distribution tests are the most concrete part if the details hold.

The soft spots are exactly where the stress-test note points: the structural formulation is described as preserving optimal policy invariance across visual instantiations, yet there is no derivation, proof, or counterexample analysis showing why the chosen form survives camera or geometry shifts. Without that, the zero-shot claims rest more on the experimental outcomes than on the construction itself. The hybrid symbolic-numerical procedure from minimal demos is a reasonable practical step, but the absence of error analysis or checks against post-hoc fitting leaves the soundness harder to judge. The assumption that task-level properties stay constant is plausible but untested in the formal sense.

This is aimed at robotics researchers working on reward design and reducing data needs for real-world deployment. A reader focused on practical sim-to-real transfer would find the experiments and the invariant-reward idea useful even if the theory side needs more work. It deserves a serious referee to examine the implementation details and see how robust the generalization actually is.

Referee Report

2 major / 2 minor

Summary. The paper proposes a framework to learn invariant symbolic reward functions from as few as five demonstrations for robotic manipulation. It introduces two coupled components: a structural reward formulation encoding task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical distillation procedure that extracts these invariants without online interaction. Experiments on eight Meta-World tasks and three Franka tasks show improved process alignment and rollout ranking over baselines, with three real-world OOD tests demonstrating zero-shot generalization to position, viewpoint, and object variations.

Significance. If the invariance properties and generalization results hold under scrutiny, the work could meaningfully advance reward learning in robotics by moving beyond pixel-memorization approaches, enabling reusable rewards across visual variants and reducing demonstration requirements for policy learning in open-world settings.

major comments (2)

[§3.2, Structural Reward Formulation] The claim that the structural reward encodes task-level properties while preserving optimal policy invariance across visual instantiations (camera parameters, object geometry) is central to the zero-shot OOD transfer results. No derivation, invariance proof, or counterexample analysis is provided to establish that the chosen form remains invariant or optimality-preserving under these changes; if invariance fails for even one variant, the reported real-world generalization does not follow from the construction.
[§5.3, Real-World OOD Experiments] The three real-world experiments report zero-shot transfer, but without access to full error analysis, variance across trials, or explicit checks that the structural form was not post-hoc adjusted to the test variants, it is difficult to confirm that the results support the invariance claim rather than task-specific fitting.

minor comments (2)

[Abstract and §5] The metrics 'process alignment' and 'policy rollout ranking' are referenced in the abstract and results but lack a concise definition or pointer to their exact computation in the main text; adding this would improve readability.
[Figures 4-6] Figure captions for the real-world setups could more explicitly label the variations (position, viewpoint, object) tested in each OOD case to make the generalization evidence easier to parse.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and outline the revisions we will make to strengthen the presentation of the invariance properties and experimental details.


Referee: [§3.2, Structural Reward Formulation] The claim that the structural reward encodes task-level properties while preserving optimal policy invariance across visual instantiations (camera parameters, object geometry) is central to the zero-shot OOD transfer results. No derivation, invariance proof, or counterexample analysis is provided to establish that the chosen form remains invariant or optimality-preserving under these changes; if invariance fails for even one variant, the reported real-world generalization does not follow from the construction.

Authors: We agree that an explicit derivation would clarify the central claim. The structural reward is defined over a symbolic state space consisting of task predicates (e.g., contact relations, relative goal distances) that are invariant to camera intrinsics, extrinsics, and object geometry by construction; the numerical component is used only for grounding and does not alter the symbolic structure. Because the reward depends solely on these invariant predicates, any policy that is optimal with respect to the original task remains optimal under visual transformations that preserve the symbolic state. We will add a concise derivation of this invariance property together with a short counterexample analysis in the revised Section 3.2.
Revision: yes

Referee: [§5.3, Real-World OOD Experiments] The three real-world experiments report zero-shot transfer, but without access to full error analysis, variance across trials, or explicit checks that the structural form was not post-hoc adjusted to the test variants, it is difficult to confirm that the results support the invariance claim rather than task-specific fitting.

Authors: We acknowledge that additional statistical reporting and clarification on experimental procedure would increase confidence in the results. The real-world trials were performed with a fixed structural reward form determined exclusively from the five training demonstrations; no post-hoc modification occurred. We will expand Section 5.3 to include the complete per-variant success rates, standard deviations across repeated trials, and an explicit statement confirming that the symbolic structure was not adjusted after seeing the OOD test outcomes.
Revision: yes
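The first response above rests on the idea that predicates such as relative goal distance do not change with the camera. A small sketch can illustrate that property under one explicit assumption: the object and goal are available as 3D points, so a change of camera pose amounts to a rigid transform of coordinates. The function names, the tolerance, and the coordinates are hypothetical; distances measured in raw pixel space would not enjoy the same invariance.

```python
import math


def rigid_transform(point, yaw, translation):
    """Rotate a 3D point about the z-axis by `yaw` radians, then translate."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return (c * x - s * y + tx, s * x + c * y + ty, z + tz)


def goal_distance(object_pos, goal_pos):
    """Hypothetical symbolic feature: Euclidean distance from object to goal."""
    return math.dist(object_pos, goal_pos)


def near_goal(object_pos, goal_pos, tol=0.05):
    """Hypothetical predicate built on the feature above."""
    return goal_distance(object_pos, goal_pos) <= tol


# Placeholder coordinates in one camera frame (metres).
obj, goal = (0.40, 0.10, 0.02), (0.55, -0.05, 0.02)
# The same scene expressed in a second, rotated and shifted frame.
view = dict(yaw=0.7, translation=(0.3, -0.2, 0.1))
obj2, goal2 = rigid_transform(obj, **view), rigid_transform(goal, **view)
```

Both the distance and the predicate evaluate identically in the two frames, up to floating-point error. Whether the paper's full predicate set, and the perception pipeline that grounds it, share this property is exactly what the referee asks the authors to show.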

Circularity Check

0 steps flagged

No load-bearing circularity; invariance stated as property of formulation without reduction to inputs

full rationale

The visible abstract and context describe a structural reward formulation that 'encodes task-level strategies and physical constraints while preserving optimal policy invariance' and a hybrid procedure that distills invariants from five demonstrations. No equations, fitted parameters, or self-citations are shown that would reduce a claimed prediction or zero-shot generalization to a fitted input or self-definition by construction. Experimental results on Meta-World, Franka, and real-world OOD cases function as independent benchmarks rather than tautological outputs. This matches the default expectation of no significant circularity for a high-level framework description.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full details on parameters and assumptions unavailable. The core domain assumption is that behavioral invariants exist and can be distilled symbolically.

axioms (1)

Domain assumption: Task-level properties remain constant across diverse visual instantiations.
Invoked to justify shifting from pixel distributions to symbolic rewards.

pith-pipeline@v0.9.0 · 5756 in / 1177 out tokens · 41362 ms · 2026-05-22T05:27:50.759929+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We adopt a Potential-Based Reward Shaping (PBRS) formulation... the final reward signal is derived via a PBRS-style post-processing module, which mathematically guarantees optimal policy invariance."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
