Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3
The pith
SRCP decouples saliency-guided representation learning from successor training and adds consistency policies to fix attention and multi-modal modeling failures in visual unsupervised RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability.
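The decoupling idea can be illustrated with a toy sketch. Everything below (the per-dimension mask, the weighting scheme, and the loss shape) is our assumption for illustration; the paper's actual saliency estimator and architecture may differ.

```python
# Toy sketch of a saliency-weighted dynamics loss (illustrative only).
# States are plain lists of floats; the saliency mask down-weights
# dimensions deemed dynamics-irrelevant, so the representation is trained
# to predict only the parts of the next state that actually move.

def saliency_weighted_dynamics_loss(pred_next, true_next, saliency):
    """Squared error, re-weighted per dimension by a saliency mask in
    [0, 1]. A mask value near 0 marks a dynamics-irrelevant region."""
    assert len(pred_next) == len(true_next) == len(saliency)
    total = sum(s * (p - t) ** 2
                for p, t, s in zip(pred_next, true_next, saliency))
    norm = sum(saliency) or 1.0  # avoid division by zero on an all-zero mask
    return total / norm

# A static-background dimension (saliency 0) contributes nothing, even
# when the prediction there is badly wrong:
loss = saliency_weighted_dynamics_loss([1.0, 5.0], [1.0, 0.0], [1.0, 0.0])
# -> 0.0
```

The point of the sketch is only that errors in masked-out regions cannot dominate the representation objective, which is the failure mode the paper attributes to plain SR training.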
What carries the argument
Saliency-guided dynamics task that isolates dynamics-relevant regions combined with a fast-sampling consistency policy using classifier-free guidance.
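For readers unfamiliar with classifier-free guidance (CFG), the standard combination rule is a linear blend of conditioned and unconditioned predictions. The paper's URL-specific variant is not reproduced here; this is only the common form, with names of our choosing.

```python
# Minimal classifier-free guidance combination, as commonly used with
# diffusion/consistency policies. `a_cond` is the skill-conditioned action
# prediction, `a_uncond` the unconditioned one; guidance weight w = 1
# recovers the conditional prediction, and w > 1 pushes the action further
# toward the skill, trading diversity for controllability.

def cfg_combine(a_uncond, a_cond, w):
    return [u + w * (c - u) for u, c in zip(a_uncond, a_cond)]

assert cfg_combine([0.0, 0.0], [1.0, -1.0], 1.0) == [1.0, -1.0]  # w=1: conditional
assert cfg_combine([0.0, 0.0], [1.0, -1.0], 2.0) == [2.0, -2.0]  # w>1: amplified
```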
If this is right
- Successor measures become more accurate because representations now emphasize state transitions that matter for future rewards.
- Policies achieve higher controllability across multiple modes of behavior for each skill.
- The framework remains compatible with multiple existing successor-representation algorithms.
- Zero-shot generalization improves across 16 tasks spanning four visual datasets.
Where Pith is reading between the lines
- The same saliency-plus-consistency pattern could be tested on other representation-learning objectives beyond successor representations.
- Downstream fine-tuning on a small number of labeled tasks may require fewer samples once the base representation already respects dynamics.
- If the saliency map itself is learned jointly rather than in a separate stage, further gains or instabilities may appear.
Load-bearing premise
That the two diagnosed limitations of existing successor representations (attention to dynamics-irrelevant regions, and inability to model multi-modal policies) are the main obstacles to scaling, and that adding saliency guidance plus consistency training will correct them without introducing new instabilities or biases.
What would settle it
A controlled ablation on the same ExORL tasks that removes either the saliency map or the consistency objective and measures whether zero-shot success rates fall back to the level of prior SR baselines.
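The settling experiment amounts to a 2x2 ablation grid. A sketch of that grid follows; the variant names and dictionary layout are ours, not the paper's, and the actual runs would of course hold tasks and seeds fixed across variants.

```python
# Sketch of the 2x2 ablation the report calls for: toggle the
# saliency-guided representation and the consistency policy independently.
# Comparing "saliency only" against "base SR" (same policy head) isolates
# the representation's contribution; "consistency only" isolates the policy's.
from itertools import product

def ablation_grid():
    names = {
        (False, False): "base SR",
        (True, False): "saliency only",
        (False, True): "consistency only",
        (True, True): "full SRCP",
    }
    return [
        {"name": names[(sal, con)], "saliency": sal, "consistency": con}
        for sal, con in product([False, True], repeat=2)
    ]

# Four variants per (task, seed) pair:
assert len(ablation_grid()) == 4
```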
Original abstract
Zero-shot unsupervised reinforcement learning (URL) offers a promising direction for building generalist agents capable of generalizing to unseen tasks without additional supervision. Among existing approaches, successor representations (SR) have emerged as a prominent paradigm due to their effectiveness in structured, low-dimensional settings. However, SR methods struggle to scale to high-dimensional visual environments. Through empirical analysis, we identify two key limitations of SR in visual URL: (1) SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures and degraded task generalization; and (2) these flawed representations hinder SR policies from modeling multi-modal skill-conditioned action distributions and ensuring skill controllability. To address these limitations, we propose Saliency-Guided Representation with Consistency Policy Learning (SRCP), a novel framework that improves zero-shot generalization of SR methods in visual URL. SRCP decouples representation learning from successor training by introducing a saliency-guided dynamics task to capture dynamics-relevant representations, thereby improving successor measure and task generalization. Moreover, it integrates a fast-sampling consistency policy with URL-specific classifier-free guidance and tailored training objectives to improve skill-conditioned policy modeling and controllability. Extensive experiments on 16 tasks across 4 datasets from the ExORL benchmark demonstrate that SRCP achieves state-of-the-art zero-shot generalization in visual URL and is compatible with various SR methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Saliency-Guided Representation with Consistency Policy Learning (SRCP) for visual unsupervised reinforcement learning. It empirically identifies two limitations of successor representations (SR) in high-dimensional settings—suboptimal attention to dynamics-irrelevant regions yielding inaccurate successor measures, and impaired multi-modal skill-conditioned policy modeling—and addresses them via a decoupled saliency-guided dynamics task for improved representations plus a fast-sampling consistency policy with classifier-free guidance and tailored objectives. Experiments across 16 tasks on 4 ExORL datasets report state-of-the-art zero-shot generalization, with compatibility to multiple base SR methods.
Significance. If the performance gains prove robust, the work would meaningfully advance visual URL by rendering SR methods practical in pixel-based domains and supplying a modular enhancement usable with existing SR pipelines. The explicit empirical diagnosis of SR limitations and the breadth of the evaluation (16 tasks, 4 datasets) constitute clear strengths.
Major comments (2)
- [Section 5.3] Ablation studies (Section 5.3): the claim that the saliency-guided dynamics task specifically improves successor-measure accuracy by focusing on dynamics-relevant regions is not isolated. No direct metrics—successor prediction error, attention-map overlap with dynamics-relevant pixels, or zero-shot performance when the saliency-augmented representation is paired with the identical policy head—are reported against the base SR representation. Consequently it remains unclear whether observed gains derive from the representation change or from the consistency-policy component.
- [Section 4.1] Method (Section 4.1): the saliency estimation procedure used to guide the decoupled dynamics task is described at a high level but lacks quantitative validation that it avoids introducing new biases (e.g., over-attention to static background elements) or training instabilities when combined with successor training.
Minor comments (3)
- [Table 1] Table 1 and Figure 2: error bars or standard deviations across seeds are not shown; adding them would strengthen the SOTA claims.
- [Equation (3)] Notation in Equation (3): the definition of the consistency loss mixes policy and representation terms without an explicit separation symbol, making the objective harder to parse.
- [Related Work] Related Work: several recent papers on saliency-driven representation learning in visual RL (post-2022) are omitted; a brief discussion of how SRCP differs would improve context.
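On the Equation (3) comment: the objective fragment quoted later on this page (Lπ = LπQ + λ1 Lπbc1 + λ2 Lπbc2) suggests the separation the referee asks for. A plausible LaTeX rendering, with the superscript placement inferred by us rather than taken from the paper's source, is:

```latex
% Hypothetical reading of the composite policy objective: a Q-maximization
% term plus two behavior-cloning regularizers weighted by lambda_1, lambda_2.
\mathcal{L}_{\pi} \;=\; \mathcal{L}_{\pi}^{Q}
  \;+\; \lambda_{1}\,\mathcal{L}_{\pi}^{\mathrm{bc}_{1}}
  \;+\; \lambda_{2}\,\mathcal{L}_{\pi}^{\mathrm{bc}_{2}}
```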
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below and indicate revisions that will be incorporated in the next version of the manuscript.
Point-by-point responses
Referee: [Section 5.3] Ablation studies (Section 5.3): the claim that the saliency-guided dynamics task specifically improves successor-measure accuracy by focusing on dynamics-relevant regions is not isolated. No direct metrics—successor prediction error, attention-map overlap with dynamics-relevant pixels, or zero-shot performance when the saliency-augmented representation is paired with the identical policy head—are reported against the base SR representation. Consequently it remains unclear whether observed gains derive from the representation change or from the consistency-policy component.
Authors: We agree that the current ablations do not fully isolate the saliency-guided representation's effect on successor-measure accuracy. While the manuscript demonstrates overall SOTA zero-shot performance and compatibility with multiple base SR methods, direct isolation via successor prediction error, attention-map overlap, and zero-shot results with the saliency representation paired to the original policy head is absent. In the revision we will add these metrics to Section 5.3 (and an appendix if needed), using environment proxies for dynamics-relevant pixels where available. This will clarify the independent contribution of the representation component. revision: yes
Referee: [Section 4.1] Method (Section 4.1): the saliency estimation procedure used to guide the decoupled dynamics task is described at a high level but lacks quantitative validation that it avoids introducing new biases (e.g., over-attention to static background elements) or training instabilities when combined with successor training.
Authors: We acknowledge that the saliency estimation is presented at a high level and that quantitative checks for bias or instability are not provided. The procedure adapts established saliency techniques to the decoupled dynamics task, but additional validation would strengthen the claims. In the revised manuscript we will include quantitative analysis (e.g., attention distribution on static vs. dynamic regions via optical-flow proxies, and training-loss/variance curves when combined with successor objectives) to confirm absence of new biases or instabilities. revision: yes
Circularity Check
No significant circularity; new components are independent additions
Full rationale
The paper empirically identifies two limitations of existing SR methods in visual URL, then introduces SRCP as a framework with a decoupled saliency-guided dynamics task for representation learning and a fast-sampling consistency policy with classifier-free guidance. These are presented as novel additions on top of base SR methods, with claims evaluated via experiments on 16 tasks across 4 ExORL datasets. No load-bearing equations, predictions, or self-citations reduce the central results to inputs by construction; the performance gains are asserted through benchmark comparisons rather than tautological re-derivations or fitted quantities renamed as predictions. The derivation chain remains self-contained.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "SR objectives often lead to suboptimal representations that attend to dynamics-irrelevant regions, resulting in inaccurate successor measures... saliency-guided dynamics task to capture dynamics-relevant representations"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat induction and embed_strictMono_of_one_lt (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "consistency policy with URL-specific classifier-free guidance... Lπ = LπQ + λ1 Lπbc1 + λ2 Lπbc2"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Chenjia Bai, Rushuai Yang, Qiaosheng Zhang, Kang Xu, Yi Chen, Ting Xiao, and Xuelong Li. Constrained ensemble exploration for unsupervised skill discovery. In International Conference on Machine Learning, pages 2418–2442, 2024.
[2] Randall Balestriero and Yann LeCun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. In Advances in Neural Information Processing Systems, pages 26671–26685, 2022.
[3] André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pages 4058–4068, 2017.
[4] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Re... 2025.
[5] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In International Conference on Learning Representations, 2019.
[6] Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling. In International Conference on Learning Representations, 2023.
[7] Yuhui Chen, Haoran Li, and Dongbin Zhao. Boosting continuous control with consistency policy. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems, pages 335–344, 2024.
[8] Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, pages 613–624, 1993.
[9] Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. Don't change the algorithm, change the data: Exploratory data for offline reinforcement learning. arXiv preprint arXiv:2201.13425, 2022.
[10] Zihan Ding and Chi Jin. Consistency models as a rich and efficient policy class for reinforcement learning. In International Conference on Learning Representations, 2024.
[11] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019.
[12] Benjamin Eysenbach, Ruslan Salakhutdinov, and Sergey Levine. The information geometry of unsupervised reinforcement learning. In International Conference on Learning Representations, 2022.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851, 2020.
[14] Scott Jeen, Tom Bewley, and Jonathan M. Cullen. Zero-shot reinforcement learning from low quality data. In Advances in Neural Information Processing Systems, pages 16894–16942, 2024.
[15] Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.
[16] Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, and Pieter Abbeel. URLB: Unsupervised reinforcement learning benchmark. Advances in Neural Information Processing Systems.
[17] Michael Laskin, Hao Liu, Xue Bin Peng, Denis Yarats, Aravind Rajeswaran, and Pieter Abbeel. Unsupervised reinforcement learning with contrastive intrinsic control. In Advances in Neural Information Processing Systems, pages 34478–34491, 2022.
[18] Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, Eric Xing, Sergey Levine, and Ruslan Salakhutdinov. Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274, 2019.
[19] Haoran Li, Zhennan Jiang, Yuhui Chen, and Dongbin Zhao. Generalizing consistency policy to visual RL with prioritized proximal experience regularization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
[20] Hao Liu and Pieter Abbeel. APS: Active pretraining with successor features. In International Conference on Machine Learning, pages 6736–6747, 2021.
[21] Hao Liu and Pieter Abbeel. Behavior from the void: Unsupervised active pre-training. In Advances in Neural Information Processing Systems, pages 18459–18473, 2021.
[22] Xin Liu, Yaran Chen, and Dongbin Zhao. Balancing state exploration and skill diversity in unsupervised skill discovery. IEEE Transactions on Cybernetics, 2025.
[23] Xin Liu, Haoran Li, and Dongbin Zhao. Videos are sample-efficient supervisions: Behavior cloning from videos via latent representations. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
[24] Xueyi Liu, Zuodong Zhong, Qichao Zhang, Yuxin Guo, Yupeng Zheng, Junli Wang, Dongbin Zhao, Yun-Fu Liu, Zhiguo Su, Yinfeng Gao, Qiao Lin, and Chen Huiyong. ReasonPlan: Unified scene prediction and decision reasoning for closed-loop autonomous driving. In Proceedings of The 9th Conference on Robot Learning, pages 3051–3068, 2025.
[25] Runyu Lu, Peng Zhang, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao, Yang Liu, Dong Wang, and Cesare Alippi. Equilibrium policy generalization: A reinforcement learning framework for cross-graph zero-shot generalization in pursuit-evasion games. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
[26] Seohong Park, Jongwook Choi, Jaekyeom Kim, Honglak Lee, and Gunhee Kim. Lipschitz-constrained unsupervised skill discovery. In International Conference on Learning Representations, 2022.
[27] Seohong Park, Tobias Kreiman, and Sergey Levine. Foundation policies with Hilbert representations. In Proceedings of the 41st International Conference on Machine Learning, pages 39737–39761. PMLR, 2024.
[28] Seohong Park, Oleh Rybkin, and Sergey Levine. METRA: Scalable unsupervised RL with metric-aware abstraction. In International Conference on Learning Representations.
[29] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. In International Conference on Machine Learning, pages 5062–5071, 2019.
[30] Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models. In International Conference on Learning Representations, 2023.
[31] Lucas Pinheiro Cinelli, Matheus Araújo Marins, Eduardo Antúnio Barros da Silva, and Sérgio Lima Netto. Variational autoencoder. In Variational Methods for Machine Learning with Applications to Deep Networks, pages 111–149. Springer, 2021.
[32] Tongzheng Ren, Tianjun Zhang, Lisa Lee, Joseph E Gonzalez, Dale Schuurmans, and Bo Dai. Spectral decomposition representation for reinforcement learning. In International Conference on Learning Representations, 2023.
[33] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pages 1312–1320.
[34] Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations, 2020.
[35] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
[36] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, pages 32211–32252. PMLR, 2023.
[37] Jingbo Sun, Songjun Tu, Haoran Li, Xin Liu, Yaran Chen, Ke Chen, Dongbin Zhao, et al. Unsupervised zero-shot reinforcement learning via dual-value forward-backward representation. In The Thirteenth International Conference on Learning Representations, 2025.
[38] Jingbo Sun, Songjun Tu, Qichao Zhang, Ke Chen, and Dongbin Zhao. Salience-invariant consistent policy learning for generalization in visual reinforcement learning. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 1987–1995, 2025.
[39] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. IEEE Transactions on Neural Networks, 9(5):1054–1054, 1998.
[40] Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards. In Advances in Neural Information Processing Systems, pages 13–23, 2021.
[41] Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In International Conference on Learning Representations, 2023.
[42] Songjun Tu, Qichao Zhang, Jingbo Sun, Yuqian Fu, Linjing Li, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, and Dongbin Zhao. Perception-consistency multimodal large language models reasoning via caption-regularized policy optimization. arXiv preprint arXiv:2509.21854, 2025.
[43] Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022.
[44] J. Wang, Q. Zhang, and D. Zhao. Dynamic-horizon model-based value estimation with latent imagination. IEEE Transactions on Neural Networks and Learning Systems, 35(7):8812–8825, 2024.
[45] Yifan Wu, George Tucker, and Ofir Nachum. The Laplacian in RL: Learning representations with efficient approximations. In International Conference on Learning Representations.
[46] Yucheng Yang, Tianyi Zhou, Qiang He, Lei Han, Mykola Pechenizkiy, and Meng Fang. Task adaptation from skills: Information geometry, disentanglement, and new objectives for unsupervised reinforcement learning. In International Conference on Learning Representations, 2024.
[47] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Reinforcement learning with prototypical representations. In International Conference on Machine Learning, pages 11920–11931. PMLR, 2021.
[48] Qichao Zhang, Yinfeng Gao, Yikang Zhang, Youtian Guo, Dawei Ding, Yunpeng Wang, Peng Sun, and Dongbin Zhao. TrajGen: Generating realistic and diverse trajectories with reactive and feasible agent behaviors for autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 23(12):24474–24487, 2022.
[49] Zijie Zhao, Zhongyue Zhao, Kaixuan Xu, Yuqian Fu, Jiajun Chai, Yuanheng Zhu, and Dongbin Zhao. Learning and planning multi-agent tasks via a MoE-based world model. In Advances in Neural Information Processing Systems (NeurIPS).
[50] Ruijie Zheng, Xiyao Wang, Yanchao Sun, Shuang Ma, Jieyu Zhao, Huazhe Xu, Hal Daumé, and Furong Huang. TACO: Temporal latent action-driven contrastive loss for visual reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023.
[51] Ruijie Zheng, Yongyuan Liang, Xiyao Wang, Shuang Ma, Hal Daumé III, Huazhe Xu, John Langford, Praveen Palanisamy, Kalyan Shankar Basu, and Furong Huang. Premier-TACO is a few-shot policy learner: Pretraining multitask representation via temporal action-driven contrastive loss. In Proceedings of the 41st International Conference on Machine Learning. JML...
[52]
Specifically, the procedure illustrates how SRCP incorporates the HILP method to learn skill-conditioned representations during the pretraining phase. Algorithm 1SRCP Algorithm 1:Inputs:pre-collected datasetD, randomly initialized representation networkf θ, basic feature networksφ ν, successor feature networkψ κ, actor networkπ ζ , learning rateη, mini-ba...