Temporal Self-Imitation Learning

Boyuan Chen; Yinsen Jia

arxiv: 2606.19752 · v2 · pith:G4TLPCOInew · submitted 2026-06-18 · 💻 cs.RO · cs.AI

Temporal Self-Imitation Learning

Yinsen Jia , Boyuan Chen This is my paper

Pith reviewed 2026-06-26 17:35 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords reinforcement learningrobot manipulationself-imitationtemporal efficiencylong-horizon taskspolicy improvementtrajectory mining

0 comments

The pith

Mining temporally efficient successful trajectories during training turns them into adaptive self-supervision that improves long-horizon robot manipulation policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that temporal efficiency in successful behaviors supplies an underused form of self-supervision for reinforcement learning. It converts fast trajectories found during training into configuration-conditioned adaptive targets and applies efficiency-weighted imitation to preserve and replay those behaviors. A reader would care because reward shaping alone often permits high return through inefficient paths while discarding rare efficient ones. The method is tested on fifteen distinct long-horizon manipulation tasks where it raises learning speed, task-completion speed, and training stability.

Core claim

TSIL mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. The framework progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across fifteen distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions.

What carries the argument

Temporal Self-Imitation Learning that mines efficient trajectories and applies configuration-conditioned adaptive temporal targets together with efficiency-weighted self-imitation.

If this is right

Policies reach task goals with fewer total steps because they repeatedly revisit fast successful paths.
Learning curves become steeper and more stable under the same reward function.
Rare efficient behaviors discovered early are retained rather than overwritten by later inefficient exploration.
The approach works across fifteen different manipulation tasks without task-specific reward redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mining step could be applied in non-robotics domains where episode length or energy cost matters.
If the adaptive targets prove stable, they might replace some hand-tuned shaping terms in existing reward functions.
A direct test would measure whether removing the efficiency-weighting term alone drops performance back to baseline levels.

Load-bearing premise

Temporally efficient successful trajectories mined during training form a reliable non-biased signal that can be turned into targets and imitation weights without introducing instability or overfitting to early lucky rollouts.

What would settle it

Running the method on a fresh collection of long-horizon tasks and finding that success rates or efficiency metrics fall below a standard reinforcement-learning baseline without the self-imitation component.

Figures

Figures reproduced from arXiv: 2606.19752 by Boyuan Chen, Yinsen Jia.

**Figure 1.** Figure 1: Temporal Self-Imitation Learning. Previously discovered fast successful trajectories set adaptive temporal targets and are replayed to guide further policy improvement. Motor learning in humans and animals progressively refines behavior toward more efficient and reliable movement. Through repeated interaction and reinforcement, biological systems learn to suppress unnecessary motion, reduce behavioral va… view at source ↗

**Figure 2.** Figure 2: Simple success-reward tuning is brittle. Increasing the successreward scale can weaken dense guidance, whereas TSIL remains robust by converting fast successes into temporal training signal. Existing approaches encourage efficient behavior mostly through manually designed temporal preferences. Increasing sparse success rewards can weaken dense guidance and increase advantage-estimation variance, wherea… view at source ↗

**Figure 3.** Figure 3: Representative evaluation tasks. Our tests cover long-horizon manipulation skills including assembly, insertion, transport, tool use, articulated-object interaction, and contact-rich manipulation. We evaluate TSIL along three questions: (1) Effectiveness: does temporal self-supervision improve long-horizon manipulation performance? (2) Efficiency: does TSIL improve both learning efficiency and behavioral … view at source ↗

**Figure 4.** Figure 4: Main training curves. TSIL reaches high sample efficiency and success while completing more successful episodes under the same interaction budget. Shaded regions denote the standard error of the mean. As shown in Tab. 1 and [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Adaptive temporal targets redirect PPO optimization. Adaptive temporal targets rapidly concentrate positive PPO update signal on fast successful trajectories while suppressing reward-distracted and slow-failure behavior. Left: one representative Peg Insertion Side learningsignal map. Right: aggregated results over all 15 tasks. Error bars denote the standard error of the mean. Details on metric computa… view at source ↗

**Figure 6.** Figure 6: Fast-success revisitation landscapes. Left: TSIL with efficiency-weighted fast-success replay. Right: generic high-return replay. The background shows fast-success replay-buffer log probability around the current policy. Replay rotated PPO updates toward stored successful behavior, but TSIL aligned updates more strongly with regions containing high fast-success memory likelihood. guidance. These results s… view at source ↗

**Figure 7.** Figure 7: MT15 task suite. The appendix contact sheet shows all 15 Meta-World manipulation tasks used in our MTBench evaluation. B.2 Disturbed training settings All disturbance experiments used the same training setting and compared ATTL, ATTL + SIL, and TSIL over the same MT15 tasks. Each experiment perturbed one training factor while leaving the 14 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Disturbance-sweep results. Each column sweeps one training disturbance from [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Additional learning-signal maps: Assembly and Drawer Open. Positive-advantage mass is summarized over completion time and task-reward return for T0 and T18. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Additional learning-signal maps: Hammer and Peg Unplug Side. Positive-advantage mass is summarized over completion time and task-reward return for T21 and T29. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Additional learning-signal map for T44: Stick Pull. Positive-advantage mass is summarized over completion time and task-reward return for the representative Stick Pull task. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: Additional fast-success revisitation landscapes. Landscapes for T0, T18, T21, T29, and T44 compare TSIL and generic SIL update directions relative to stored fast-success behavior. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

read the original abstract

Long-horizon robot manipulation policies trained with reward shaping can still achieve high return through inefficient interactions, while rare efficient behaviors discovered during training may be forgotten. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TSIL mines fast successful trajectories for adaptive temporal targets and weighted imitation, which looks like a workable addition to RL for long-horizon manipulation if the experiments hold up.

read the letter

The paper's main contribution is a concrete way to turn the timing of self-generated successes into supervision signals that standard reward shaping overlooks. TSIL extracts temporally efficient trajectories during training, builds configuration-conditioned targets from them, and replays the efficient ones with weighted imitation. This directly tackles forgetting of rare fast behaviors in long-horizon robot tasks.

What stands out is the combination of adaptive temporal targets and efficiency weighting. It is not just another imitation add-on; the method tries to make the timing itself the supervisory cue. The claim of consistent gains across 15 manipulation tasks in learning speed, completion efficiency, and training stability is the part that matters most for practitioners.

The experiments test the core assumption by running on a range of tasks, which gives the work some grounding. If the full paper includes ablations that isolate the temporal component and shows the trajectory mining does not introduce bias from early lucky rollouts, the results would be usable.

The soft spot is the lack of visible detail on baseline choices and filtering rules in the high-level description. Without those, it is difficult to judge whether the reported improvements are robust or sensitive to implementation choices. The method also assumes enough successful trajectories appear early enough to mine, which may not hold in harder domains.

This paper is for people already working on sample-efficient RL for manipulation who need better ways to retain efficient behaviors. A reader focused on self-imitation or temporal aspects of RL would find the mechanisms worth examining.

It deserves peer review because the idea is straightforward to implement and the experimental scope is reasonable, even though the write-up will likely need more implementation specifics.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during training and converts them into configuration-conditioned adaptive temporal targets and efficiency-weighted self-imitation learning. It claims that this yields consistent improvements in learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions across 15 long-horizon manipulation tasks.

Significance. If the empirical results hold under rigorous controls, the work could be significant for RL in robotics by demonstrating that the temporal structure of successful trajectories can serve as a scalable self-supervisory signal, reducing reliance on manually engineered reward shaping for long-horizon tasks. The multi-task evaluation on 15 tasks is a strength if properly documented.

major comments (2)

[Abstract and Experimental Setup] Abstract and Experimental Setup: The claim of consistent improvements on 15 tasks supplies no information on baselines, statistical tests, ablation studies, or how trajectories are selected and filtered; without these details the support for the central empirical claim cannot be evaluated from the given text.
[Method] Method: The conversion of mined trajectories into reusable supervision via configuration-conditioned adaptive temporal targets and efficiency-weighted imitation is described at a high level without specific algorithmic details, equations, or pseudocode, which is load-bearing for assessing whether the self-supervisory signal is reliable and non-biased.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with references to the full manuscript and note planned revisions.

read point-by-point responses

Referee: [Abstract and Experimental Setup] Abstract and Experimental Setup: The claim of consistent improvements on 15 tasks supplies no information on baselines, statistical tests, ablation studies, or how trajectories are selected and filtered; without these details the support for the central empirical claim cannot be evaluated from the given text.

Authors: The abstract is concise by design, but the full manuscript provides these details in Section 4 (Experimental Setup), which specifies the 15 tasks, baselines (PPO, SAC, and imitation variants), statistical tests (paired t-tests over 5 seeds with p-values), ablation studies (Section 5.3), and trajectory filtering (successful episodes with completion time below median threshold, detailed in Section 3.2). We will revise the abstract to briefly reference the evaluation protocol and controls. revision: partial
Referee: [Method] Method: The conversion of mined trajectories into reusable supervision via configuration-conditioned adaptive temporal targets and efficiency-weighted imitation is described at a high level without specific algorithmic details, equations, or pseudocode, which is load-bearing for assessing whether the self-supervisory signal is reliable and non-biased.

Authors: The full paper supplies the requested details in Section 3, including equations for the configuration-conditioned adaptive temporal target (Eqs. 1-4), the efficiency weight computation, and the weighted self-imitation objective (Eq. 5), along with Algorithm 1 in the appendix. These ensure the signal prioritizes faster trajectories without introducing bias toward suboptimal paths. We will expand the main method section with explicit pseudocode and additional derivation for clarity. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces an empirical RL method (TSIL) for mining and replaying temporally efficient trajectories in long-horizon manipulation tasks. No equations, derivations, or first-principles results are presented that reduce any claimed prediction or improvement to quantities defined by the method's own fitted parameters or self-referential loops. The central claims rest on experimental validation across 15 tasks rather than any self-definitional or fitted-input structure, rendering the approach self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated or derivable from the given text.

pith-pipeline@v0.9.1-grok · 5680 in / 1172 out tokens · 22475 ms · 2026-06-26T17:35:02.362777+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 9 canonical work pages · 1 internal anchor

[1]

J. W. Krakauer, A. M. Hadjiosif, J. Xu, A. L. Wong, and A. M. Haith. Motor learning.Com- prehensive Physiology, 9(2):613–663, 2019. doi:10.1002/cphy.c170043

work page doi:10.1002/cphy.c170043 2019
[2]

Vassiliadis, G

P. Vassiliadis, G. Derosiere, C. Dubuc, A. Lete, F. Crevecoeur, F. C. Hummel, and J. Duque. Reward boosts reinforcement-based motor learning.iScience, 24(7):102821, 2021. doi:10. 1016/j.isci.2021.102821

work page arXiv 2021
[3]

S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipula- tion with asynchronous off-policy updates. In2017 IEEE international conference on robotics and automation (ICRA), pages 3389–3396. IEEE, 2017

2017
[4]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[5]

T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. InConference on robot learning, pages 1094–1100. PMLR, 2020

2020
[6]

A. Y . Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. InIcml, volume 99, pages 278–287, 1999

1999
[7]

Y . Luo, K. Dong, L. Zhao, Z. Sun, C. Zhou, and B. Song. Balance between efficient and effective learning: Dense2Sparse reward shaping for robot manipulation with environment uncertainty.arXiv preprint arXiv:2003.02740, 2020

work page arXiv 2003
[8]

Pardo, A

F. Pardo, A. Tavakoli, V . Levdik, and P. Kormushev. Time limits in reinforcement learning. In International Conference on Machine Learning, pages 4045–4054. PMLR, 2018

2018
[9]

K. Doya. Reinforcement learning in continuous time and space.Neural computation, 12(1): 219–245, 2000

2000
[10]

Tallec, L

C. Tallec, L. Blier, and Y . Ollivier. Making deep q-learning methods robust to time discretiza- tion. InInternational Conference on Machine Learning, pages 6096–6104. PMLR, 2019

2019
[11]

Yildiz, M

C. Yildiz, M. Heinonen, and H. L ¨ahdesm¨aki. Continuous-time model-based reinforcement learning. InInternational Conference on Machine Learning, pages 12009–12018. PMLR, 2021

2021
[12]

Ramstedt and C

S. Ramstedt and C. Pal. Real-time reinforcement learning. InAdvances in Neural Information Processing Systems, volume 32, pages 3067–3076, 2019

2019
[13]

R. S. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A framework for tem- poral abstraction in reinforcement learning.Artificial intelligence, 112(1–2):181–211, 1999

1999
[14]

A. S. Lakshminarayanan, S. Sharma, and B. Ravindran. Dynamic action repetition for deep reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

2017
[15]

Sharma, A

S. Sharma, A. Srinivas, and B. Ravindran. Learning to repeat: Fine grained action repetition for deep reinforcement learning. InInternational Conference on Learning Representations, 2017. 9

2017
[16]

Biedenkapp, R

A. Biedenkapp, R. Rajan, F. Hutter, and M. Lindauer. Temporl: Learning when to act. In International Conference on Machine Learning, pages 914–924. PMLR, 2021

2021
[17]

Y . J. Kim and M. Chi. Time-aware q-networks: Resolving temporal irregularity for deep reinforcement learning.arXiv preprint arXiv:2105.02580, 2021

work page arXiv 2021
[18]

Y . J. Kim and M. Chi. Time-aware deep reinforcement learning with multi-temporal abstrac- tion.Applied Intelligence, 53(17):20007–20033, 2023

2023
[19]

A. N. Nhu, S. Son, and M. Lin. Time-aware world model for adaptive prediction and control. InForty-second International Conference on Machine Learning, 2025

2025
[20]

G. Li, J. Wu, and Y . He. Act better by timing: A timing-aware reinforcement learning for autonomous driving.arXiv preprint arXiv:2406.13223, 2024

work page arXiv 2024
[21]

Y . Chen, R. Ye, Z. Tao, H. Liu, G. Chen, J. Peng, J. Ma, Y . Zhang, J. Ji, and Y . Zhang. Reinforcement learning for robot navigation with adaptive forward simulation time (afst) in a semi-markov model, 2023. URLhttps://arxiv.org/abs/2108.06161

work page arXiv 2023
[22]

Q. Lin, B. Tang, Z. Wu, C. Yu, S. Mao, Q. Xie, X. Wang, and D. Wang. Safe offline rein- forcement learning with real-time budget constraints. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 21127–21152. PMLR, 2023. URLhttps://proceedings.mlr.press/v202/ lin23h.html

2023
[23]

Jia and B

Y . Jia and B. Chen. Time as a control dimension in robot learning, 2026. URLhttps: //arxiv.org/abs/2511.07654

work page arXiv 2026
[24]

J. Oh, Y . Guo, S. Singh, and H. Lee. Self-imitation learning. InInternational Conference on Machine Learning, pages 3878–3887. PMLR, 2018

2018
[25]

T. Dai, H. Liu, and A. A. Bharath. Episodic self-imitation learning with hindsight.Electronics, 9(10):1742, 2020

2020
[26]

Chen and M

Z. Chen and M. Lin. Self-imitation learning for robot tasks with sparse and delayed rewards. In2021 IEEE International Conference on Mechatronics and Automation (ICMA), pages 477–
[27]

S. Luo, H. Kasaei, and L. Schomaker. Self-imitation learning by planning. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4823–4829. IEEE, 2021

2021
[28]

Luo and L

S. Luo and L. Schomaker. Reinforcement learning in robotic motion planning by combined experience-based planning and self-imitation learning.Robotics and Autonomous Systems, 170:104545, 2023

2023
[29]

Bujalance and F

J. Bujalance and F. Moutarde. Reward relabelling for combined reinforcement and imitation learning on sparse-reward tasks. InProceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pages 2565–2567, 2023

2023
[30]

Y . Li, Y . Wu, H. Xu, X. Wang, and Y . Wu. Solving compositional reinforcement learning problems via task reduction. InInternational Conference on Learning Representations, 2021

2021
[31]

Y . Li, T. Gao, J. Yang, H. Xu, and Y . Wu. Phasic self-imitative reduction for sparse-reward goal-conditioned reinforcement learning. InInternational Conference on Machine Learning, pages 12765–12781. PMLR, 2022

2022
[32]

Y . Li, Y . Wang, and X. Tan. Self-imitation guided goal-conditioned reinforcement learning. Pattern Recognition, 144:109845, 2023. 10

2023
[33]

Sharma, A

A. Sharma, A. M. Ahmed, R. Ahmad, and C. Finn. Self-improving robots: End-to-end autonomous visuomotor reinforcement learning. InProceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 3292–3308. PMLR, 2023. URLhttps://proceedings.mlr.press/v229/sharma23b.html

2023
[34]

Bousmalis, G

K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y . Zhou, A. Gupta, A. Raju, et al. Robocat: A self-improving generalist agent for robotic manipulation. Transactions on Machine Learning Research, 2023

2023
[35]

S. K. Seyed Ghasemipour, A. Wahid, J. Tompson, P. Sanketi, and I. Mordatch. Self-improving embodied foundation models. InAdvances in Neural Information Processing Systems, 2025. URLhttps://arxiv.org/abs/2509.15155

work page arXiv 2025
[36]

Joshi, Z

V . Joshi, Z. Xu, B. Liu, P. Stone, and A. Zhang. Benchmarking massively parallelized multi- task reinforcement learning for robotics tasks. InReinforcement Learning Conference, 2025. URLhttps://openreview.net/forum?id=z0MM0y20I2. 11 A Algorithm and Implementation Details A.1 Algorithm details Optimization objective.TSIL uses the same PPO objective as all...

2025

[1] [1]

J. W. Krakauer, A. M. Hadjiosif, J. Xu, A. L. Wong, and A. M. Haith. Motor learning.Com- prehensive Physiology, 9(2):613–663, 2019. doi:10.1002/cphy.c170043

work page doi:10.1002/cphy.c170043 2019

[2] [2]

Vassiliadis, G

P. Vassiliadis, G. Derosiere, C. Dubuc, A. Lete, F. Crevecoeur, F. C. Hummel, and J. Duque. Reward boosts reinforcement-based motor learning.iScience, 24(7):102821, 2021. doi:10. 1016/j.isci.2021.102821

work page arXiv 2021

[3] [3]

S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipula- tion with asynchronous off-policy updates. In2017 IEEE international conference on robotics and automation (ICRA), pages 3389–3396. IEEE, 2017

2017

[4] [4]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[5] [5]

T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. InConference on robot learning, pages 1094–1100. PMLR, 2020

2020

[6] [6]

A. Y . Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. InIcml, volume 99, pages 278–287, 1999

1999

[7] [7]

Y . Luo, K. Dong, L. Zhao, Z. Sun, C. Zhou, and B. Song. Balance between efficient and effective learning: Dense2Sparse reward shaping for robot manipulation with environment uncertainty.arXiv preprint arXiv:2003.02740, 2020

work page arXiv 2003

[8] [8]

Pardo, A

F. Pardo, A. Tavakoli, V . Levdik, and P. Kormushev. Time limits in reinforcement learning. In International Conference on Machine Learning, pages 4045–4054. PMLR, 2018

2018

[9] [9]

K. Doya. Reinforcement learning in continuous time and space.Neural computation, 12(1): 219–245, 2000

2000

[10] [10]

Tallec, L

C. Tallec, L. Blier, and Y . Ollivier. Making deep q-learning methods robust to time discretiza- tion. InInternational Conference on Machine Learning, pages 6096–6104. PMLR, 2019

2019

[11] [11]

Yildiz, M

C. Yildiz, M. Heinonen, and H. L ¨ahdesm¨aki. Continuous-time model-based reinforcement learning. InInternational Conference on Machine Learning, pages 12009–12018. PMLR, 2021

2021

[12] [12]

Ramstedt and C

S. Ramstedt and C. Pal. Real-time reinforcement learning. InAdvances in Neural Information Processing Systems, volume 32, pages 3067–3076, 2019

2019

[13] [13]

R. S. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A framework for tem- poral abstraction in reinforcement learning.Artificial intelligence, 112(1–2):181–211, 1999

1999

[14] [14]

A. S. Lakshminarayanan, S. Sharma, and B. Ravindran. Dynamic action repetition for deep reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017

2017

[15] [15]

Sharma, A

S. Sharma, A. Srinivas, and B. Ravindran. Learning to repeat: Fine grained action repetition for deep reinforcement learning. InInternational Conference on Learning Representations, 2017. 9

2017

[16] [16]

Biedenkapp, R

A. Biedenkapp, R. Rajan, F. Hutter, and M. Lindauer. Temporl: Learning when to act. In International Conference on Machine Learning, pages 914–924. PMLR, 2021

2021

[17] [17]

Y . J. Kim and M. Chi. Time-aware q-networks: Resolving temporal irregularity for deep reinforcement learning.arXiv preprint arXiv:2105.02580, 2021

work page arXiv 2021

[18] [18]

Y . J. Kim and M. Chi. Time-aware deep reinforcement learning with multi-temporal abstrac- tion.Applied Intelligence, 53(17):20007–20033, 2023

2023

[19] [19]

A. N. Nhu, S. Son, and M. Lin. Time-aware world model for adaptive prediction and control. InForty-second International Conference on Machine Learning, 2025

2025

[20] [20]

G. Li, J. Wu, and Y . He. Act better by timing: A timing-aware reinforcement learning for autonomous driving.arXiv preprint arXiv:2406.13223, 2024

work page arXiv 2024

[21] [21]

Y . Chen, R. Ye, Z. Tao, H. Liu, G. Chen, J. Peng, J. Ma, Y . Zhang, J. Ji, and Y . Zhang. Reinforcement learning for robot navigation with adaptive forward simulation time (afst) in a semi-markov model, 2023. URLhttps://arxiv.org/abs/2108.06161

work page arXiv 2023

[22] [22]

Q. Lin, B. Tang, Z. Wu, C. Yu, S. Mao, Q. Xie, X. Wang, and D. Wang. Safe offline rein- forcement learning with real-time budget constraints. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 21127–21152. PMLR, 2023. URLhttps://proceedings.mlr.press/v202/ lin23h.html

2023

[23] [23]

Jia and B

Y . Jia and B. Chen. Time as a control dimension in robot learning, 2026. URLhttps: //arxiv.org/abs/2511.07654

work page arXiv 2026

[24] [24]

J. Oh, Y . Guo, S. Singh, and H. Lee. Self-imitation learning. InInternational Conference on Machine Learning, pages 3878–3887. PMLR, 2018

2018

[25] [25]

T. Dai, H. Liu, and A. A. Bharath. Episodic self-imitation learning with hindsight.Electronics, 9(10):1742, 2020

2020

[26] [26]

Chen and M

Z. Chen and M. Lin. Self-imitation learning for robot tasks with sparse and delayed rewards. In2021 IEEE International Conference on Mechatronics and Automation (ICMA), pages 477–

[27] [27]

S. Luo, H. Kasaei, and L. Schomaker. Self-imitation learning by planning. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4823–4829. IEEE, 2021

2021

[28] [28]

Luo and L

S. Luo and L. Schomaker. Reinforcement learning in robotic motion planning by combined experience-based planning and self-imitation learning.Robotics and Autonomous Systems, 170:104545, 2023

2023

[29] [29]

Bujalance and F

J. Bujalance and F. Moutarde. Reward relabelling for combined reinforcement and imitation learning on sparse-reward tasks. InProceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pages 2565–2567, 2023

2023

[30] [30]

Y . Li, Y . Wu, H. Xu, X. Wang, and Y . Wu. Solving compositional reinforcement learning problems via task reduction. InInternational Conference on Learning Representations, 2021

2021

[31] [31]

Y . Li, T. Gao, J. Yang, H. Xu, and Y . Wu. Phasic self-imitative reduction for sparse-reward goal-conditioned reinforcement learning. InInternational Conference on Machine Learning, pages 12765–12781. PMLR, 2022

2022

[32] [32]

Y . Li, Y . Wang, and X. Tan. Self-imitation guided goal-conditioned reinforcement learning. Pattern Recognition, 144:109845, 2023. 10

2023

[33] [33]

Sharma, A

A. Sharma, A. M. Ahmed, R. Ahmad, and C. Finn. Self-improving robots: End-to-end autonomous visuomotor reinforcement learning. InProceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 3292–3308. PMLR, 2023. URLhttps://proceedings.mlr.press/v229/sharma23b.html

2023

[34] [34]

Bousmalis, G

K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y . Zhou, A. Gupta, A. Raju, et al. Robocat: A self-improving generalist agent for robotic manipulation. Transactions on Machine Learning Research, 2023

2023

[35] [35]

S. K. Seyed Ghasemipour, A. Wahid, J. Tompson, P. Sanketi, and I. Mordatch. Self-improving embodied foundation models. InAdvances in Neural Information Processing Systems, 2025. URLhttps://arxiv.org/abs/2509.15155

work page arXiv 2025

[36] [36]

Joshi, Z

V . Joshi, Z. Xu, B. Liu, P. Stone, and A. Zhang. Benchmarking massively parallelized multi- task reinforcement learning for robotics tasks. InReinforcement Learning Conference, 2025. URLhttps://openreview.net/forum?id=z0MM0y20I2. 11 A Algorithm and Implementation Details A.1 Algorithm details Optimization objective.TSIL uses the same PPO objective as all...

2025