STEAM: Self-Supervised Temporal Ensemble Advantage Modeling for Real-World Robot Learning

Chao Yu; Dongming Qiao; Feng Gao; Guoliang Fan; Jincheng Yu; Kang Chen; Liangzhi Shi; Qiuyi Gu; Quanlu Zhang; Shuaihang Chen

arxiv: 2606.29834 · v1 · pith:7DL6QNVWnew · submitted 2026-06-29 · 💻 cs.RO

Zhihao Liu, Qiuyi Gu, Yitao Wang, Dongming Qiao, Yixian Zhang, Shuaihang Chen, Liangzhi Shi, Tianxing Zhou, Zefang Huang, Kang Chen, Zhen Guo, Quanlu Zhang, Jincheng Yu, Xiaodan Liang, Guoliang Fan, Yu Wang, Feng Gao, Xinlei Chen, Chao Yu

Pith reviewed 2026-06-30 06:15 UTC · model grok-4.3

classification 💻 cs.RO

keywords robot learning · self-supervised learning · advantage estimation · temporal modeling · real-world robotics · policy improvement · bimanual manipulation · ensemble methods

The pith

STEAM learns frame-level advantages for robot policies by predicting normalized temporal offsets between frames in expert demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that expert trajectories contain a built-in signal for advantage: the normalized time gap between any two frames. An ensemble of predictors is trained to map pairs of frames to a distribution over possible offsets, and the resulting scalar advantage is taken as the minimum across the ensemble. This conservative score is then applied to mixed-quality rollout data to down-weight stalls, failures, and regressions while up-weighting reliable progress. When the scores are used inside CFGRL, the method raises success rates on four real-world manipulation tasks. The approach requires no external labels or human preference data.

Core claim

STEAM trains an ensemble of temporal-offset predictors on frame pairs drawn from expert trajectories, treating the normalized temporal offset as a self-supervised target. Each predictor outputs a distribution over offsets; this distribution is converted to a scalar advantage value. The minimum advantage across the ensemble is used to score individual frames in mixed-quality rollout data, providing a conservative estimate that distinguishes local progress from stalls and regressions without any additional supervision.
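
A minimal sketch of how such self-supervised targets might be built from a single expert trajectory. The source says only that the offset is normalized and, in the Figure 2 caption, that predictors output categorical distributions over temporal bins and that reversed pairs are used. Normalizing by trajectory length, the equal-width bins on [-1, 1], and the names `offset_to_bin`, `sample_offset_pairs`, and `num_bins` are illustrative assumptions rather than the paper's confirmed choices.

```python
import random


def offset_to_bin(offset, num_bins):
    """Map an offset in [-1, 1] to one of num_bins equal-width bins (assumed scheme)."""
    index = int((offset + 1.0) / 2.0 * num_bins)
    return min(index, num_bins - 1)  # offset == 1.0 falls into the last bin


def sample_offset_pairs(num_frames, num_pairs, num_bins=11, seed=0):
    """Sample (i, j, offset, bin_index) tuples from one expert trajectory.

    offset = (j - i) / (num_frames - 1), so forward pairs (j > i) are positive
    and reversed pairs (j < i) are negative. This normalization is an assumption.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        i = rng.randrange(num_frames)
        j = rng.randrange(num_frames)
        offset = (j - i) / (num_frames - 1)
        pairs.append((i, j, offset, offset_to_bin(offset, num_bins)))
    return pairs


if __name__ == "__main__":
    for i, j, offset, bin_index in sample_offset_pairs(num_frames=50, num_pairs=5):
        print(f"frames ({i:2d}, {j:2d})  offset={offset:+.3f}  bin={bin_index}")
```

Each tuple would supply one training example: the two frames at indices `i` and `j` as input and `bin_index` as the classification target.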

What carries the argument

Ensemble of temporal-offset predictors that map frame pairs to offset distributions, converted to scalar advantages whose minimum supplies conservative scoring of rollout data.
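
The aggregation step can be sketched as follows. The supplied text does not specify how a predicted distribution becomes a scalar, so the expectation over bin centers used here is an assumption, as are the function names `expected_offset` and `conservative_advantage`. Only the minimum across ensemble members is taken from the source.

```python
def expected_offset(probs):
    """Expected offset of a categorical distribution over equal-width bins on [-1, 1].

    Using the expectation as the scalar advantage is an illustrative assumption.
    """
    num_bins = len(probs)
    width = 2.0 / num_bins
    centers = [-1.0 + (k + 0.5) * width for k in range(num_bins)]
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(p * c for p, c in zip(probs, centers)) / total


def conservative_advantage(ensemble_probs):
    """Minimum scalar advantage across ensemble members (the step named in the source)."""
    return min(expected_offset(probs) for probs in ensemble_probs)


if __name__ == "__main__":
    # Three hypothetical ensemble members scoring the same frame pair.
    members = [
        [0.0, 0.0, 0.0, 1.0],  # optimistic member
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],  # pessimistic member
    ]
    print(conservative_advantage(members))
```

Taking the minimum means one optimistic member cannot raise the score on its own, which matches the behavior described for Figure 11, where an overestimating member is suppressed.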

If this is right

STEAM identifies stalls, failures, and recoveries directly from unlabeled rollout data.
When combined with CFGRL, STEAM raises policy success rate by 59% on bimanual towel folding, 54.3% on chip checkout, 23% on cola restocking, and 16.2% on single-arm pick-and-place.
The same self-supervised signal can be extracted from any set of expert trajectories that contain consistent temporal ordering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be applied to any sequential decision domain where expert traces exhibit reliable ordering, such as video game play or surgical tool trajectories.
Because the advantage is computed per frame pair, it may support fine-grained credit assignment inside long-horizon tasks without requiring full episode returns.
An online version could periodically retrain the ensemble on newly collected expert data to keep the advantage model current as task distributions shift.

Load-bearing premise

Normalized temporal offset between frames in expert trajectories acts as a reliable proxy for advantage that transfers to scoring non-expert rollout data.

What would settle it

Collect a held-out set of mixed-quality robot trajectories, have humans label frame-level progress, and check whether STEAM advantage scores fail to rank frames in the same order as the human labels.
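
One way to run this check is a rank correlation between STEAM scores and human progress labels on the same frames. The sketch below implements Spearman's coefficient in plain Python, assuming both inputs are equal-length lists of per-frame numbers. The choice of Spearman correlation and the names `average_ranks` and `spearman` are assumptions for illustration; the paper reports no such analysis.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    steam_scores = [0.1, 0.4, 0.2, 0.9]  # hypothetical per-frame advantages
    human_labels = [1, 3, 2, 4]          # hypothetical human progress ranks
    print(spearman(steam_scores, human_labels))
```

A coefficient near zero or below on held-out mixed-quality data would count against the transfer premise; a value near one would support it.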

Figures

Figures reproduced from arXiv: 2606.29834 by Chao Yu, Dongming Qiao, Feng Gao, Guoliang Fan, Jincheng Yu, Kang Chen, Liangzhi Shi, Qiuyi Gu, Quanlu Zhang, Shuaihang Chen, Tianxing Zhou, Xiaodan Liang, Xinlei Chen, Yitao Wang, Yixian Zhang, Yu Wang, Zefang Huang, Zhen Guo, Zhihao Liu.

**Figure 1.** STEAM. STEAM is a self-supervised advantage modeling framework for real-world robot learning. It learns advantage prediction offline from expert demonstrations without manual annotations or hand-crafted rewards, and can be applied to expert data, human corrections, and policy rollouts to provide robust advantage estimates. When combined with CFGRL [1], STEAM substantially improves policy performance on var…

**Figure 2.** STEAM framework. (a) Expert demonstrations provide frame pairs for normalized temporal offset calculation. Both forward and reversed pairs are used as self-supervised targets. (b) An ensemble of M predictors is trained on expert data to map frame pairs and language instructions to categorical distributions over temporal bins, converting these into scalar advantage scores. (c) The trained ensemble scores mi…

**Figure 3.** Robot setup and tasks. We evaluate STEAM on four real-world manipulation tasks with varying horizons: towel folding (5 stages), chip checkout (8 stages), and cola restocking (4 stages) using an ARX dual-arm robot, and pick-and-place (2 stages) using a single Franka arm. To train and evaluate STEAM, we collect datasets containing varying mixtures of expert demonstrations, autonomous rollouts, and human corr…

**Figure 4.** Visualization of STEAM advantage curves on the towel folding task. Frame-level $A_{\text{STEAM}}$ is visualized on four representative episode types, including expert demonstrations, successful rollouts, failed rollouts, and human correction episodes. Images on top show corresponding frames from each episode, while shaded regions highlight segments with retry, slow progress, failure, or human takeover.

**Figure 5.** Probability density of frame-level $A_{\text{STEAM}}$ across data types.
Adjacent plot panels: (a) Towel Folding, (b) Chip Checkout, (c) Cola Restocking; y-axis: sum of $A_{\text{STEAM}}$; groups: Expert, Rollout (succ.), Rollout (fail.), Human corr. …

**Figure 6.** Figure 6 reports the episode-wise sum of frame-level advantages as an aggregate measure of overall task progress. While successful episodes yield comparable cumulative advantages, failed rollouts are clearly separated by significantly lower sums. Notably, cumulative advantages are higher for towel folding and chip checkout than for cola restocking and pick-and-place, particularly among failures. This discrepancy st…

**Figure 7.** STEAM performance across different training data combinations. We compare the success rate of Behavior Cloning against STEAM trained with varying sources of data: expert demonstrations only (STEAM (Exp)), expert data supplemented with human correction episodes (STEAM (Exp+Dagg)), and the full dataset containing expert, correction, and autonomous rollout data (STEAM (Full)). To understand the impact of diff…

**Figure 8.** Chip checkout advantage curves. Expert, successful-rollout, failed-rollout, and human-correction episodes.
… shelf due to initial misalignment between frames 400 and 600, causing a temporary advantage drop. After a successful realignment and retry after frame 600, the advantage rises back to a high level. In the successful rollout, the right arm makes multiple consecutive attempts to grasp the cola during th…

**Figure 9.** Cola restocking advantage curves. Expert, successful-rollout, and failed-rollout episodes.
E.4 Effectiveness of Conservative Ensemble Aggregation
To demonstrate why a conservative ensemble strategy is essential for robust advantage estimation, we visualize the individual estimates from each ensemble member alongside their aggregated minimum in …

**Figure 10.** Pick-and-place advantage curves. Expert, successful-rollout, and failed-rollout episodes.

**Figure 11.** Visualization of individual ensemble predictions vs. aggregated STEAM advantage. During the retry of the final towel-folding step (between frames 1200 and 1400), the green curve (Ensemble 3) severely overestimates the advantage. By taking the minimum across the ensemble, the final aggregated STEAM advantage (black curve) successfully suppresses this false positive.

Original abstract

Real-world robot learning increasingly relies on heterogeneous data, but demonstrations and rollouts often mix useful progress with stalls, corrections, and suboptimal behavior. Effective policy learning therefore requires frame-level advantages that distinguish reliable local progress from failures and regressions. We propose Self-supervised Temporal Ensemble Advantage Modeling (STEAM), a label-free method that learns such advantages from expert demonstrations. STEAM trains an ensemble of temporal-offset predictors on frame pairs within expert trajectories, using the normalized temporal offset between two frames as a self-supervised signal. Each predictor maps a frame pair to a distribution over temporal offsets, which is converted into a scalar advantage. STEAM then takes the minimum advantage across the ensemble to score mixed-quality rollout data conservatively. Across real-world bimanual towel folding, chip checkout, cola restocking, and single-arm pick-and-place tasks, STEAM identifies stalls, failures, and recoveries. When combined with CFGRL, STEAM further improves policy success rate by 59%, 54.3%, 23% and 16.2% over baselines, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STEAM trains an ensemble on expert temporal offsets to score advantages and applies the min to mixed rollouts, but the transfer step from expert to non-expert data has no reported validation.

The core idea here is straightforward: train predictors on pairs of frames from expert trajectories to predict the normalized time offset between them, turn the output distribution into a scalar advantage, then take the minimum across the ensemble to conservatively label stalls and recoveries in noisier rollouts. When this is fed into CFGRL, the paper reports success-rate lifts of 59%, 54.3%, 23%, and 16.2% on four real-robot tasks. That is the main new piece: an explicit self-supervised temporal signal turned into a conservative advantage estimator for heterogeneous robot data.

The practical framing is useful. Real-world collections really do mix progress with stalls and corrections, and a label-free way to down-weight the bad frames without extra human annotation would be handy for anyone doing offline or imitation-style learning on physical robots.

The soft spot is exactly where the stress-test note flags it. The predictors are fit only on expert frame pairs; the claim that they will assign low advantage to stalls or recoveries in non-expert rollouts is an untested distributional transfer. The abstract gives no correlation numbers against human progress labels, no held-out return correlation on mixed trajectories, and no ablation that isolates the ensemble or the min operation. Without those checks the reported gains could be driven by CFGRL itself or by other unstated factors. Implementation details on how the distribution becomes a scalar and how the ensemble is trained are also thin in the supplied text.

This is for robotics groups already running real-robot data pipelines and looking for lightweight ways to clean advantage signals. It is not yet ready for broad citation because the key generalization step is asserted rather than measured. A serious editor should send it to review so the authors can supply the missing validation experiments and ablations; the underlying construction is simple enough that referees can check it quickly.

Referee Report

3 major / 2 minor

Summary. The paper proposes STEAM, a self-supervised method for learning frame-level advantages from expert demonstrations in robot learning. It trains an ensemble of predictors on frame pairs from expert trajectories using normalized temporal offset as the regression target, converts predictions to scalar advantages, and uses the minimum across the ensemble to conservatively score mixed-quality rollout data. When combined with CFGRL, it reports improvements in policy success rates of 59%, 54.3%, 23%, and 16.2% on four real-world tasks: bimanual towel folding, chip checkout, cola restocking, and single-arm pick-and-place.

Significance. If the self-supervised advantage modeling generalizes reliably to non-expert data, this approach could enable more effective use of heterogeneous robot datasets without requiring additional labels, addressing a key challenge in real-world robot learning.

major comments (3)

[Method] The central claim that the learned temporal-offset predictors transfer to score stalls, failures, and recoveries in mixed-quality non-expert rollouts is load-bearing for attributing the reported gains to STEAM, yet the manuscript provides no explicit validation such as correlation with human-labeled progress or ground-truth returns on held-out mixed trajectories.
[Experiments] The abstract and experiments report specific success-rate improvements (59%, 54.3%, 23%, 16.2%) when STEAM is combined with CFGRL, but without ablation studies that isolate the contribution of the min-ensemble advantage scores versus CFGRL alone or other factors, the source of the gains cannot be verified.
[Abstract] No implementation details, error analysis on the self-supervised regression target, or sensitivity to the choice of ensemble size and min operation are provided, leaving the soundness of the advantage modeling unverifiable from the supplied text.

minor comments (2)

Clarify the precise mapping from the predicted distribution over temporal offsets to the scalar advantage value.
Specify the number of predictors in the ensemble and any hyperparameters used for training the temporal-offset models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing the strongest honest defense based on the manuscript content while agreeing to revisions where the points identify verifiable gaps.

Referee: [Method] The central claim that the learned temporal-offset predictors transfer to score stalls, failures, and recoveries in mixed-quality non-expert rollouts is load-bearing for attributing the reported gains to STEAM, yet the manuscript provides no explicit validation such as correlation with human-labeled progress or ground-truth returns on held-out mixed trajectories.

Authors: The manuscript demonstrates the transfer through qualitative visualizations in the experiments, where STEAM-assigned advantages correctly highlight stalls, failures, and recoveries in mixed-quality rollouts, directly supporting the downstream policy gains. While explicit quantitative correlations with human labels or ground-truth returns on held-out mixed trajectories are not reported, the self-supervised training on expert data and conservative min-ensemble design provide a principled basis for generalization. We will add such correlation analysis in the revision to make the validation explicit. revision: yes
Referee: [Experiments] The abstract and experiments report specific success-rate improvements (59%, 54.3%, 23%, 16.2%) when STEAM is combined with CFGRL, but without ablation studies that isolate the contribution of the min-ensemble advantage scores versus CFGRL alone or other factors, the source of the gains cannot be verified.

Authors: The reported gains are measured against baselines that lack STEAM, establishing that the combined system outperforms alternatives. However, we acknowledge that dedicated ablations isolating the min-ensemble advantage scores (e.g., CFGRL with vs. without STEAM or with random scores) would strengthen attribution. We will incorporate these ablations in the revised experiments section. revision: yes
Referee: [Abstract] No implementation details, error analysis on the self-supervised regression target, or sensitivity to the choice of ensemble size and min operation are provided, leaving the soundness of the advantage modeling unverifiable from the supplied text.

Authors: The abstract is a concise summary by design. Full implementation details, error analysis on the regression target (including prediction distributions and normalization), and sensitivity studies to ensemble size and the min operation are presented in Sections 3 (method) and 4 (experiments) with supporting figures and tables. We will add explicit forward references in the abstract and ensure these elements are highlighted more prominently in the revision. revision: partial
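
The random-score control mentioned in the second response could be built by permuting the advantage scores across frames, which keeps their overall distribution but breaks the link between each score and its frame. This is a hedged sketch of one possible control, not a procedure described in the paper; the name `shuffled_scores` and the fixed seed are illustrative.

```python
import random


def shuffled_scores(scores, seed=0):
    """Return a permuted copy of per-frame scores for a random-score control.

    The multiset of values is unchanged; only the frame-to-score assignment is.
    """
    rng = random.Random(seed)
    permuted = list(scores)  # copy so the original scores stay intact
    rng.shuffle(permuted)
    return permuted


if __name__ == "__main__":
    steam_scores = [0.1, 0.5, -0.2, 0.9]  # hypothetical per-frame advantages
    print(shuffled_scores(steam_scores, seed=1))
```

If a policy trained with shuffled scores matched one trained with the real scores, the gains could not be attributed to the advantage signal itself.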

Circularity Check

0 steps flagged

No circularity: advantage proxy constructed independently of target rollouts

The described method trains temporal-offset predictors exclusively on expert frame pairs with normalized offset as the explicit regression target, then converts the output distribution to a scalar advantage and applies the min-ensemble only to separate mixed-quality rollouts. This separation means the advantage labels on non-expert data are not forced by construction to reproduce the training inputs; any success on stalls or recoveries is an empirical generalization claim rather than a definitional equivalence. No equations, self-citations, or ansatzes are supplied that would collapse the final scores back to the expert offsets by algebraic identity. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be identified from provided text.

pith-pipeline@v0.9.1-grok · 5780 in / 1072 out tokens · 29252 ms · 2026-06-30T06:15:43.486693+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 13 canonical work pages · 8 internal anchors

[1] K. Frans, S. Park, P. Abbeel, and S. Levine. Diffusion guidance is a controllable policy improvement operator. arXiv preprint arXiv:2505.23458, 2025. URL https://arxiv.org/abs/2505.23458

[2] S. Belkhale, Y. Cui, and D. Sadigh. Data quality in imitation learning. Advances in Neural Information Processing Systems, 36:80375–80395, 2023.

[3] Q. Chen, J. Yu, M. Schwager, P. Abbeel, Y. Shentu, and P. Wu. SARM: Stage-aware reward modeling for long horizon robot manipulation. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=aemqAxScl9

[4] Y. Mao, Z. Yu, W. Mao, Y. Li, Q. Hu, Z. Lan, M. Zhu, and H. Chen. ARM: Advantage reward modeling for long-horizon manipulation. arXiv preprint arXiv:2604.03037, 2026.

[5] A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, D. Driess, et al. $\pi^{*}_{0.6}$: a VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025.

[6] J. Yang, K. Lin, J. Li, W. Zhang, T. Lin, L. Wu, Z. Su, H. Zhao, Y.-Q. Zhang, L. Chen, P. Luo, X. Yue, and H. Li. RISE: Self-improving robot policy with compositional world model. Robotics: Science and Systems, 2026. URL https://arxiv.org/abs/2602.11075

[7] T. Lee, A. Wagenmaker, K. Pertsch, P. Liang, S. Levine, and C. Finn. RoboReward: General-purpose vision-language reward models for robotics. arXiv preprint arXiv:2601.00675, 2026. URL https://arxiv.org/abs/2601.00675

[8] Y. Liu, C. Wen, Y. Hu, D. Jayaraman, and Y. Gao. Timerewarder: Learning dense reward from passive videos via frame-wise temporal distance. arXiv preprint arXiv:2509.26627, 2025.

[9] S. Zhai, Q. Zhang, T. Zhang, F. Huang, H. Zhang, M. Zhou, S. Zhang, L. Liu, S. Lin, and J. Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025. URL https://arxiv.org/abs/2509.15937

[10] H. Xu, X. Zhan, H. Yin, and H. Qin. Discriminator-weighted offline imitation learning from suboptimal demonstrations. In International Conference on Machine Learning, pages 24725–24742. PMLR, 2022.

[11] I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations, 2022.

[12] X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019. URL https://arxiv.org/abs/1910.00177

[13] H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, S. Xie, G. Yao, P. Wang, Z. Wang, and S. Zhang. Robo-Dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025. URL https://arxiv.org/abs/2512.23703

[14] Y. J. Ma, J. Hejna, A. Wahid, C. Fu, D. Shah, J. Liang, Z. Xu, S. Kirmani, P. Xu, D. Driess, T. Xiao, J. Tompson, O. Bastani, D. Jayaraman, W. Yu, T. Zhang, D. Sadigh, and F. Xia. Vision language models are in-context value learners. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=friHAl5ofG

[15] A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, Y. Xiang, A. Li, A. Bobu, A. Gupta, S. Tu, E. Biyik, and J. Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. Robotics: Science and Systems, 2026. URL https://arxiv.org/abs/2603.02115

[16] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman. Temporal cycle-consistency learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1801–1810, 2019. URL https://openaccess.thecvf.com/content_CVPR_2019/html/Dwibedi_Temporal_Cycle-Consistency_Learning_CVPR_2019_paper.html

[17] K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi. XIRL: Cross-embodiment inverse reinforcement learning. In 5th Conference on Robot Learning, 2022. URL https://openreview.net/forum?id=RO4DM85Z4P7

[18] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3M: A universal visual representation for robot manipulation. In 6th Conference on Robot Learning, 2022. URL https://arxiv.org/abs/2203.12601

[19] Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang. VIP: Towards universal visual reward and representation via value-implicit pre-training. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=YJ7o2wetJ2

[20] Y. J. Ma, W. Liang, V. Som, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman. LIV: Language-image representations and rewards for robotic control. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 2023.

[21] J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Bıyık, and J. Zhang. ReWiND: Language-guided rewards teach robot policies without new demonstrations. In 9th Conference on Robot Learning, 2025.

[22] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, 2017.

[23] T. Coste, U. Anwar, R. Kirk, and D. Krueger. Reward model ensembles help mitigate overoptimization. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.02743

[24] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

[25] M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer. HG-DAgger: Interactive imitation learning with human experts. In 2019 International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019.

[26] R. He, Y. Wei, L. Yu, and X. Zeng. Spars: Structure-informed progress-aware reward shaping for fabric manipulation learning from demonstration. Robotics and Autonomous Systems, page 105499, 2026.

A Classifier-Free Guidance RL Details
We integrate the frame-level advantages learned by STEAM into the classifier-free guidance reinforcement learning fram...
