TRAM: Test-Time Risk Adaptation with Mixture of Agents

Amrit Singh Bedi; Amy Zhang; Hao Zhu; Mohamad Fares El Hajj Chehade

arxiv: 2408.08812 · v2 · pith:MS46TU7Anew · submitted 2024-08-16 · 💻 cs.LG

Pith reviewed 2026-05-23 22:13 UTC · model grok-4.3

keywords reinforcement learning · test time adaptation · risk adaptation · mixture of agents · occupancy measure · deployment safety · zero shot adaptation

The pith

TRAM adapts fixed RL policies to new risk requirements at test time without any parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TRAM as a method for zero-update adaptation of reinforcement learning agents to newly specified reward-risk tradeoffs at deployment. It reuses a library of risk-neutral source policies by scoring them under the target reward and an occupancy-based risk measure to select actions. This enables handling of various test-time risks like spatial barriers or behavioral constraints that training-time methods cannot easily accommodate. A reader would care if they need to deploy agents in environments where safety specifications evolve after initial training without the cost of retraining. The work characterizes TRAM as a surrogate approach with an explicit mismatch term between scored and realized risk.

Core claim

TRAM evaluates each source policy under the target reward and the occupancy-based deployment risk, then uses risk-adjusted source scores to select actions from the mixture. Unlike methods tied to fixed surrogates during training, it supports arbitrary risk types specified only at test time. The approach is explicitly characterized as a surrogate for the occupancy-control problem of the stitched policy, admitting a measurable source-hull mismatch term that connects the source-scored risk to the realized risk under the mixture.

What carries the argument

Source-scored composition rule that scores policies by target reward and occupancy risk to form the action mixture.
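A minimal sketch of how a source-scored rule of this kind could look, assuming per-source action-value estimates are available for both the target reward and the deployment risk. The function name, the array layout, the linear penalty `risk_weight`, and the greedy max-over-sources selection are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def risk_adjusted_action(q_reward, q_risk, risk_weight):
    """Pick an action from risk-adjusted source scores (hypothetical rule).

    q_reward[i, a]: estimated target-reward value of taking action a in the
                    current state and then following source policy i.
    q_risk[i, a]:   estimated deployment risk of the same choice.
    risk_weight:    assumed trade-off coefficient; larger means more risk-averse.
    """
    q_reward = np.asarray(q_reward, dtype=float)
    q_risk = np.asarray(q_risk, dtype=float)
    scores = q_reward - risk_weight * q_risk   # risk-adjusted source scores
    best_per_action = scores.max(axis=0)       # best source for each action
    action = int(best_per_action.argmax())     # greedy action over those maxima
    source = int(scores[:, action].argmax())   # source that backs the chosen action
    return action, source
```

With `risk_weight = 0` this reduces to picking the action with the highest reward estimate across sources; raising the weight shifts the choice toward actions that some source scores as low-risk. The paper may use a different scoring or mixing form.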

If this is right

Supports spatial barrier, divergence, and volatility risks at test time.
Reduces deployment risk while preserving reward without parameter updates.
Applies to gridworlds, MuJoCo, Safety-Gymnasium, and LLM alignment tasks.
Characterized with an explicit source-hull mismatch term for the surrogate.
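Why a mismatch between source-scored and realized risk can arise at all is easy to see in a tabular toy problem. The sketch below computes discounted occupancy measures exactly and compares the risk of two source policies with the risk of a policy stitched from them state by state. The two-state MDP, the cost table, and all names are hypothetical, and the comparison is only an illustration; it is not the paper's definition of the source-hull mismatch term.

```python
import numpy as np

def occupancy(P, pi, mu0, gamma):
    """Normalized discounted state-action occupancy d(s, a) of policy pi.

    P[s, a, t] is the probability of moving from state s to state t under
    action a, pi[s, a] the probability of action a in state s, and mu0[s]
    the start-state distribution.
    """
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel induced by pi
    # Solve d = (1 - gamma) * mu0 + gamma * P_pi^T d for the state occupancy.
    d_state = (1.0 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu0)
    return d_state[:, None] * pi

def linear_risk(P, pi, mu0, cost, gamma):
    """Occupancy-weighted cost <d_pi, cost>."""
    return float((occupancy(P, pi, mu0, gamma) * cost).sum())

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
mu0 = np.array([1.0, 0.0])
cost = np.array([[0.0, 1.0], [1.0, 0.0]])  # cost 1 whenever the agent switches state
gamma = 0.9

source_a = np.array([[1.0, 0.0], [1.0, 0.0]])  # always action 0
source_b = np.array([[0.0, 1.0], [0.0, 1.0]])  # always action 1
stitched = np.array([[0.0, 1.0], [1.0, 0.0]])  # B's action in state 0, A's in state 1

risks = {name: linear_risk(P, pi, mu0, cost, gamma)
         for name, pi in [("source_a", source_a),
                          ("source_b", source_b),
                          ("stitched", stitched)]}
print(risks)
```

In this toy case the two sources have risk of about 0 and 0.1, while the stitched policy has risk of about 1.0. Any convex combination of the source occupancies would stay between 0 and 0.1 for a linear risk, so the stitched policy's occupancy lies outside the hull of the source occupancies, which is the kind of gap a surrogate method has to account for.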

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the source policy library lacks diversity, the mismatch term may grow and adaptation quality could degrade.
The method could be combined with online policy addition to the library for better coverage over time.
Occupancy measure computation might limit scalability in very large state spaces.

Load-bearing premise

The library of source policies is diverse enough for their convex combinations to closely approximate the desired policy under the new risk constraints.

What would settle it

Measuring the realized risk and reward in an environment where the target risk specification cannot be well-approximated by mixtures of the given source policies, and checking if TRAM still outperforms non-adaptive baselines.

Figures

Figures reproduced from arXiv: 2408.08812 by Amrit Singh Bedi, Amy Zhang, Hao Zhu, Mohamad Fares El Hajj Chehade.

**Figure 2.** (a), (b) show the risk-aware optimal policies for each of the training tasks. (c) shows the results of risk-aware and …

**Figure 3.** The set of source tasks are transferred to different …

**Figure 4.** The optimal risk-neutral policies for the three …

**Figure 5.** The performance of CAT and the baseline on 10 different test tasks. For example, in (a) and (b), since the structure of …

**Figure 6.** The Reacher domain (adapted from (Barreto et al. …

**Figure 7.** Deterministic environments. (a) Training policies. (b) The performance of entropic utility-based transfer is the same as …

Original abstract

Deployed reinforcement learning agents often face safety requirements that are specified only after training, such as new hazard maps, revised risk thresholds, or behavioral alignment constraints. We study zero-update deployment-time adaptation, where a fixed library of risk-neutral source policies is reused under a newly specified reward-risk tradeoff. We propose TRAM (Test-Time Risk Adaptation via Mixture of Agents), a source-scored composition rule that evaluates each source policy under the target reward and an occupancy-based deployment risk, then selects actions using risk-adjusted source scores. Unlike training-time risk-sensitive methods tied to a fixed surrogate such as return variance, TRAM supports spatial barrier exposure, divergence from a reference behavior, and local volatility risks specified at test time. We explicitly characterize TRAM as a surrogate method: it does not solve the full occupancy-control problem of the stitched policy, but admits a measurable source-hull mismatch term connecting source-scored risk to realized risk. Experiments in gridworlds, MuJoCo Reacher, Safety-Gymnasium, and an LLM alignment setting show that TRAM reduces deployment risk while preserving reward, without requiring any parameter updates at test time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

TRAM gives a test-time mixture rule for new risk specs in RL with an explicit surrogate mismatch term, but the abstract leaves the conditions for small mismatch unaddressed.

TRAM gives a concrete way to handle new risk specs at test time by scoring and mixing existing risk-neutral policies. The key move is the source-scored rule that uses both the target reward and an occupancy risk, plus the explicit characterization of the method as a surrogate with a measurable mismatch term between scored and realized risk. This is new compared to training-time risk-sensitive approaches that fix the surrogate, such as variance, during learning. TRAM keeps the sources fixed and lets the risk definition change at deployment, which matches real scenarios where hazard maps or alignment constraints arrive later.

The work does well at naming the practical problem and sketching how the mixture can reduce deployment risk without parameter changes. The experiments across gridworlds, MuJoCo Reacher, Safety-Gymnasium, and an LLM setting are presented as evidence that reward is preserved while risk drops.

The main soft spot is around the source-hull mismatch. The abstract ties the surrogate property to this term but gives no conditions under which it stays small. If the fixed library of sources cannot span the occupancy needed for the new risk constraint, the approximation can break and risk reduction may not hold. The stress-test note flags this correctly based on the abstract. Without bounds or richness assumptions, the method's reliability depends heavily on the chosen domains.

The paper is for researchers focused on safe reinforcement learning and test-time adaptation. Anyone dealing with deployed agents where retraining is costly would find the idea useful. It deserves serious referee time because the distinction from prior work is clear and the problem is relevant, though the full manuscript will need to address the mismatch conditions and supply quantitative details to strengthen the case. Recommendation: send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes TRAM, a test-time adaptation method for RL agents facing post-training risk specifications (e.g., new hazard maps or divergence constraints). It reuses a fixed library of risk-neutral source policies via a source-scored composition rule that evaluates each source under the target reward and an occupancy-based deployment risk, then selects actions accordingly. TRAM is explicitly framed as a surrogate (not solving the full occupancy-control problem) that admits a measurable source-hull mismatch term linking source-scored risk to realized risk. Experiments in gridworlds, MuJoCo Reacher, Safety-Gymnasium, and an LLM alignment task are reported to show risk reduction while preserving reward, with no parameter updates at test time.

Significance. If the central claims hold, TRAM would provide a practical, training-free mechanism for adapting deployed RL policies to evolving safety requirements, which is valuable for real-world settings where risk specifications are not known at training time. The multi-domain experimental validation (gridworlds through LLM alignment) and the explicit surrogate characterization with a measurable mismatch term are strengths that support practical utility and distinguish the work from training-time risk-sensitive methods.

major comments (2)

[Abstract and surrogate characterization] The source-hull mismatch term is introduced as connecting source-scored risk to realized risk and is described as measurable, yet no derivation, Lipschitz-style bound on occupancy deviation, or richness condition on the source library is supplied to guarantee when the term remains small. This is load-bearing for the claim that risk reduction holds under arbitrary test-time risk specs, as an insufficiently expressive source library can cause the convex combination to fail to approximate the target occupancy-constrained policy.
[Experiments] While results are reported across domains, there is no ablation or sensitivity analysis quantifying how source-library expressiveness affects the realized mismatch term or risk-reduction margin (e.g., by varying library size or diversity relative to the test-time hazard map). Without this, it is difficult to assess whether the observed risk reductions generalize beyond the chosen source libraries.

minor comments (2)

[Method] Notation for the occupancy-based deployment risk and source scores should be introduced with a single consolidated definition early in the method section to improve readability.
[Method] The abstract states that TRAM supports spatial barrier exposure, divergence, and local volatility risks, but the main text should include a brief table mapping each risk type to the corresponding occupancy measure used in the scoring rule.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of TRAM as a surrogate method. We address each major comment below and indicate planned revisions.

Referee: [Abstract and surrogate characterization] The source-hull mismatch term is introduced as connecting source-scored risk to realized risk and is described as measurable, yet no derivation, Lipschitz-style bound on occupancy deviation, or richness condition on the source library is supplied to guarantee when the term remains small. This is load-bearing for the claim that risk reduction holds under arbitrary test-time risk specs, as an insufficiently expressive source library can cause the convex combination to fail to approximate the target occupancy-constrained policy.

Authors: The manuscript explicitly frames TRAM as a surrogate that admits a measurable source-hull mismatch term without claiming a universal bound or guarantee that the term remains small for arbitrary source libraries or test-time specifications. The primary support for risk reduction is empirical, across the reported domains. We agree that adding an explicit derivation of the mismatch term, along with a discussion of conditions on source-library richness, would strengthen the surrogate characterization. We will incorporate this derivation and discussion in the revised manuscript.

Revision: yes

Referee: [Experiments] While results are reported across domains, there is no ablation or sensitivity analysis quantifying how source-library expressiveness affects the realized mismatch term or risk-reduction margin (e.g., by varying library size or diversity relative to the test-time hazard map). Without this, it is difficult to assess whether the observed risk reductions generalize beyond the chosen source libraries.

Authors: The current experiments use domain-appropriate source libraries that enable the observed risk reductions, but we acknowledge that explicit sensitivity analysis on library size and diversity would better quantify the mismatch term's dependence on expressiveness. We will add such ablations in the revised manuscript, including measurements of the mismatch term where feasible.

Revision: yes

Circularity Check

0 steps flagged

No circularity; surrogate characterization includes independent mismatch term

The provided abstract and description contain no self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. TRAM is explicitly framed as a surrogate admitting a measurable source-hull mismatch term that connects source-scored risk to realized risk without claiming exact equivalence to the occupancy-control problem. The source-library richness assumption is stated as a premise rather than derived by construction from the method itself. No equations or steps reduce the central claim to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that a fixed library of risk-neutral policies can be scored and mixed at test time to control new risk measures; no free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: A fixed library of risk-neutral source policies is available and sufficient for the target risk-reward tradeoff.
Stated directly in the abstract as the starting point for zero-update adaptation.
domain assumption: Occupancy-based deployment risk can be evaluated from source policies without additional training.
Used to define the risk-adjusted source scores.
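One way the second assumption could hold in practice is a plain Monte Carlo estimate from rollouts of a frozen source policy, which needs no gradient steps. The sketch below assumes a risk that is linear in the discounted occupancy, so that it equals the normalized expected discounted sum of a per-step cost. The trajectory format, `cost_fn`, and the function name are hypothetical and not taken from the paper.

```python
import numpy as np

def discounted_risk(trajectories, cost_fn, gamma):
    """Estimate (1 - gamma) * E[sum_t gamma^t * cost(s_t, a_t)] from rollouts.

    trajectories: list of rollouts, each a list of (state, action) pairs
                  collected by running one fixed source policy.
    cost_fn:      per-step risk cost, e.g. 1.0 inside a hazard region.
    """
    totals = []
    for traj in trajectories:
        total = sum(gamma ** t * cost_fn(s, a) for t, (s, a) in enumerate(traj))
        totals.append((1.0 - gamma) * total)
    return float(np.mean(totals))
```

Finite rollouts truncate the discounted sum, so the estimate is biased low for non-negative costs unless the horizon is long relative to 1 / (1 - gamma). Risks that are not linear in the occupancy, such as a divergence from a reference behavior, would need a different estimator.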

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 3 internal anchors

[3] Barreto, A.; Borsa, D.; Quan, J.; Schaul, T.; Silver, D.; Hessel, M.; Mankowitz, D.; Zidek, A.; and Munos, R. 2018. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, 501–510. PMLR.
[4] Barreto, A.; Dabney, W.; Munos, R.; Hunt, J. J.; Schaul, T.; van Hasselt, H. P.; and Silver, D. 2017. Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems, 30.
[5] Barreto, A.; Hou, S.; Borsa, D.; Silver, D.; and Precup, D. 2020. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117(48): 30079–30087.
[6] Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, 449–458. PMLR.
[7] Bellman, R. 1957. Dynamic Programming. Dover Publications. ISBN 9780486428093.
[8] Bisi, L.; Sabbioni, L.; Vittori, E.; Papini, M.; and Restelli, M. 2019. Risk-averse trust region optimization for reward-volatility reduction. arXiv preprint arXiv:1912.03193.
[9] Chakraborty, S.; Ghosal, S. S.; Yin, M.; Manocha, D.; Wang, M.; Bedi, A. S.; and Huang, F. 2024. Transfer Q Star: Principled Decoding for LLM Alignment. arXiv preprint arXiv:2405.20495.
[10] Devroye, L.; Mehrabian, A.; and Reddad, T. 2018. The total variation distance between high-dimensional Gaussians with the same mean. arXiv preprint arXiv:1810.08693.
[11] Fei, Y.; Yang, Z.; Chen, Y.; Wang, Z.; and Xie, Q. 2020. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. Advances in Neural Information Processing Systems, 33: 22384–22395.
[12] García, J.; and Fernández, F. 2019. Probabilistic policy reuse for safe reinforcement learning. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 13(3): 1–24.
[13] Gimelfarb, M.; Barreto, A.; Sanner, S.; and Lee, C.-G. 2021. Risk-aware transfer in reinforcement learning using successor features. Advances in Neural Information Processing Systems, 34: 17298–17310.
[14] Held, D.; McCarthy, Z.; Zhang, M.; Shentu, F.; and Abbeel, P. 2017. Probabilistically safe policy transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 5798–5805. IEEE.
[15] Higgins, I.; Pal, A.; Rusu, A.; Matthey, L.; Burgess, C.; Pritzel, A.; Botvinick, M.; Blundell, C.; and Lerchner, A. 2017. Darla: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning, 1480–1490. PMLR.
[16] Jain, A.; Khetarpal, K.; and Precup, D. 2021. Safe option-critic: learning safety in the option-critic architecture. The Knowledge Engineering Review, 36: e4.
[17] Jain, A.; Patil, G.; Jain, A.; Khetarpal, K.; and Precup, D. 2021a. Variance penalized on-policy and off-policy actor-critic. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7899–7907.
[18] Jain, A.; Patil, G.; Jain, A.; Khetarpal, K.; and Precup, D. 2021b. Variance penalized on-policy and off-policy actor-critic. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7899–7907.
[19] Kamthe, S.; and Deisenroth, M. 2018. Data-efficient reinforcement learning with probabilistic model predictive control. In International Conference on Artificial Intelligence and Statistics, 1701–1710. PMLR.
[20] Kwon, K.-b.; Ye, L.; Gupta, V.; and Zhu, H. 2022. Model-free learning for risk-constrained linear quadratic regulator with structured feedback in networked systems. In 2022 IEEE 61st Conference on Decision and Control (CDC), 7260–7265. IEEE.
[21] Mankowitz, D.; Mann, T.; Bacon, P.-L.; Precup, D.; and Mannor, S. 2018. Learning robust options. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
[22] Mankowitz, D. J.; Tamar, A.; and Mannor, S. 2016. Situational awareness by risk-conscious skills. arXiv preprint arXiv:1610.02847.
[23] Mannor, S.; and Tsitsiklis, J. N. 2013a. Algorithmic aspects of mean–variance optimization in Markov decision processes. European Journal of Operational Research, 231(3): 645–653.
[24] Mannor, S.; and Tsitsiklis, J. N. 2013b. Algorithmic aspects of mean–variance optimization in Markov decision processes. European Journal of Operational Research, 231(3): 645–653.
[25] Mao, H.; Venkatakrishnan, S. B.; Schwarzkopf, M.; and Alizadeh, M. 2018. Variance reduction for reinforcement learning in input-driven environments. arXiv preprint arXiv:1807.02264.
[26] Marom, O.; and Rosman, B. 2018. Zero-shot transfer with deictic object-oriented representation in reinforcement learning. Advances in Neural Information Processing Systems, 31.
[27] Mudgal, S.; Lee, J.; Ganapathy, H.; Li, Y.; Wang, T.; Huang, Y.; Chen, Z.; Cheng, H.-T.; Collins, M.; Strohman, T.; et al. 2023. Controlled decoding from language models. arXiv preprint arXiv:2310.17022.
[28] Nachum, O.; and Dai, B. 2020. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866.
[29] Nachum, O.; Dai, B.; Kostrikov, I.; Chow, Y.; Li, L.; and Schuurmans, D. 2019. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074.
[30] Nagarajan, P.; Warnell, G.; and Stone, P. 2018. Deterministic implementations for reproducibility in deep reinforcement learning. arXiv preprint arXiv:1809.05676.
[31] Nass, D.; Belousov, B.; and Peters, J. 2019. Entropic risk measure in policy search. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1101–1106. IEEE.
[32] Oh, J.; Singh, S.; Lee, H.; and Kohli, P. 2017. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, 2661–2670. PMLR.
[33] Puterman, M. L. 1990. Markov decision processes. Handbooks in Operations Research and Management Science, 2: 331–434.
[34] Rezaei-Shoshtari, S.; Morissette, C.; Hogan, F. R.; Dudek, G.; and Meger, D. 2023. Hypernetworks for zero-shot transfer in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 9579–9587.
[35] Shen, Y.; Tobia, M. J.; Sommer, T.; and Obermayer, K. 2014. Risk-sensitive reinforcement learning. Neural Computation, 26(7): 1298–1328.
[36] Sikchi, H.; Zheng, Q.; Zhang, A.; and Niekum, S. 2023. Dual RL: Unification and new methods for reinforcement and imitation learning. arXiv preprint arXiv:2302.08560.
[37] Srinivasan, K.; Eysenbach, B.; Ha, S.; Tan, J.; and Finn, C. 2020. Learning to be safe: Deep RL with a safety critic. arXiv preprint arXiv:2010.14603.
[38] Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press.
[39] Tamar, A.; Di Castro, D.; and Mannor, S. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research, 17(13): 1–36.
[40] Taylor, M. E.; and Stone, P. 2009. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7).
[41] Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026–5033. IEEE.
[42] Touati, A.; Rapin, J.; and Ollivier, Y. 2022. Does Zero-Shot Reinforcement Learning Exist? In The Eleventh International Conference on Learning Representations.
[43] Turchetta, M.; Kolobov, A.; Shah, S.; Krause, A.; and Agarwal, A. 2020. Safe reinforcement learning via curriculum induction. Advances in Neural Information Processing Systems, 33: 12151–12162.
[44] Wen, Z.; and Van Roy, B. 2017. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3): 762–782.
[45] Whiteson, S. 2021. Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning.
[46] Zhang, H.; Seal, S.; Wu, D.; Bouffard, F.; and Boulet, B. 2022. Building energy management with reinforcement learning and model predictive control: A survey. IEEE Access, 10: 27853–27862.
[47] Zhang, J.; Bedi, A. S.; Wang, M.; and Koppel, A. 2021. Cautious reinforcement learning via distributional risk in the dual domain. IEEE Journal on Selected Areas in Information Theory, 2(2): 611–626.
[48] Zhang, S.; Fernando, H. D.; Liu, M.; Murugesan, K.; Lu, S.; Chen, P.-Y.; Chen, T.; and Wang, M. 2024. SF-DQN: Provable Knowledge Transfer using Successor Feature for Deep Reinforcement Learning. arXiv preprint arXiv:2405.15920.

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Barreto, A.; Borsa, D.; Quan, J.; Schaul, T.; Silver, D.; Hessel, M.; Mankowitz, D.; Zidek, A.; and Munos, R. 2018. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, 501--510. PMLR

work page 2018

[4] [4]

J.; Schaul, T.; van Hasselt, H

Barreto, A.; Dabney, W.; Munos, R.; Hunt, J. J.; Schaul, T.; van Hasselt, H. P.; and Silver, D. 2017. Successor features for transfer in reinforcement learning. Advances in neural information processing systems, 30

work page 2017

[5] [5]

Barreto, A.; Hou, S.; Borsa, D.; Silver, D.; and Precup, D. 2020. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117(48): 30079--30087

work page 2020

[6] [6]

G.; Dabney, W.; and Munos, R

Bellemare, M. G.; Dabney, W.; and Munos, R. 2017. A distributional perspective on reinforcement learning. In International conference on machine learning, 449--458. PMLR

work page 2017

[7] [7]

Bellman, R. 1957. Dynamic Programming . Dover Publications. ISBN 9780486428093

work page 1957

[8] [8]

Bisi, L.; Sabbioni, L.; Vittori, E.; Papini, M.; and Restelli, M. 2019. Risk-averse trust region optimization for reward-volatility reduction. arXiv preprint arXiv:1912.03193

work page arXiv 2019

[9] [9]

S.; Yin, M.; Manocha, D.; Wang, M.; Bedi, A

Chakraborty, S.; Ghosal, S. S.; Yin, M.; Manocha, D.; Wang, M.; Bedi, A. S.; and Huang, F. 2024. Transfer Q Star: Principled Decoding for LLM Alignment. arXiv preprint arXiv:2405.20495

work page arXiv 2024

[10] [10]

Devroye, L.; Mehrabian, A.; and Reddad, T. 2018. The total variation distance between high-dimensional Gaussians with the same mean. arXiv preprint arXiv:1810.08693

work page arXiv 2018

[11] [11]

Fei, Y.; Yang, Z.; Chen, Y.; Wang, Z.; and Xie, Q. 2020. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. Advances in Neural Information Processing Systems, 33: 22384--22395

work page 2020

[12] [12]

Garc \' a, J.; and Fern \'a ndez, F. 2019. Probabilistic policy reuse for safe reinforcement learning. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 13(3): 1--24

work page 2019

[13] [13]

Gimelfarb, M.; Barreto, A.; Sanner, S.; and Lee, C.-G. 2021. Risk-aware transfer in reinforcement learning using successor features. Advances in Neural Information Processing Systems, 34: 17298--17310

work page 2021

[14] [14]

Held, D.; McCarthy, Z.; Zhang, M.; Shentu, F.; and Abbeel, P. 2017. Probabilistically safe policy transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 5798--5805. IEEE

work page 2017

[15] [15]

Higgins, I.; Pal, A.; Rusu, A.; Matthey, L.; Burgess, C.; Pritzel, A.; Botvinick, M.; Blundell, C.; and Lerchner, A. 2017. Darla: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning, 1480--1490. PMLR

work page 2017

[16] [16]

Jain, A.; Khetarpal, K.; and Precup, D. 2021. Safe option-critic: learning safety in the option-critic architecture. The Knowledge Engineering Review, 36: e4

work page 2021

[17] [17]

Jain, A.; Patil, G.; Jain, A.; Khetarpal, K.; and Precup, D. 2021 a . Variance penalized on-policy and off-policy actor-critic. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7899--7907

work page 2021

[18] [18]

Jain, A.; Patil, G.; Jain, A.; Khetarpal, K.; and Precup, D. 2021 b . Variance penalized on-policy and off-policy actor-critic. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 7899--7907

work page 2021

[19] [19]

Kamthe, S.; and Deisenroth, M. 2018. Data-efficient reinforcement learning with probabilistic model predictive control. In International conference on artificial intelligence and statistics, 1701--1710. PMLR

work page 2018

[20] [20]

Kwon, K.-b.; Ye, L.; Gupta, V.; and Zhu, H. 2022. Model-free learning for risk-constrained linear quadratic regulator with structured feedback in networked systems. In 2022 IEEE 61st Conference on Decision and Control (CDC), 7260--7265. IEEE

work page 2022

[21] [21]

Mankowitz, D.; Mann, T.; Bacon, P.-L.; Precup, D.; and Mannor, S. 2018. Learning robust options. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32

work page 2018

[22] [22]

Situational Awareness by Risk-Conscious Skills

Mankowitz, D. J.; Tamar, A.; and Mannor, S. 2016. Situational awareness by risk-conscious skills. arXiv preprint arXiv:1610.02847

work page internal anchor Pith review Pith/arXiv arXiv 2016

[23] [23]

Mannor, S.; and Tsitsiklis, J. N. 2013 a . Algorithmic aspects of mean--variance optimization in Markov decision processes. European Journal of Operational Research, 231(3): 645--653

work page 2013

[24] [24]

Mannor, S.; and Tsitsiklis, J. N. 2013 b . Algorithmic aspects of mean--variance optimization in Markov decision processes. European Journal of Operational Research, 231(3): 645--653

work page 2013

[25] [25]

Variance Reduction for Reinforcement Learning in Input-Driven Environments

Mao, H.; Venkatakrishnan, S. B.; Schwarzkopf, M.; and Alizadeh, M. 2018. Variance reduction for reinforcement learning in input-driven environments. arXiv preprint arXiv:1807.02264

work page internal anchor Pith review Pith/arXiv arXiv 2018

[26] [26]

Marom, O.; and Rosman, B. 2018. Zero-shot transfer with deictic object-oriented representation in reinforcement learning. Advances in Neural Information Processing Systems, 31

[27]

Mudgal, S.; Lee, J.; Ganapathy, H.; Li, Y.; Wang, T.; Huang, Y.; Chen, Z.; Cheng, H.-T.; Collins, M.; Strohman, T.; et al. 2023. Controlled decoding from language models. arXiv preprint arXiv:2310.17022

[28]

Nachum, O.; and Dai, B. 2020. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866

[29]

Nachum, O.; Dai, B.; Kostrikov, I.; Chow, Y.; Li, L.; and Schuurmans, D. 2019. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074

[30]

Nagarajan, P.; Warnell, G.; and Stone, P. 2018. Deterministic implementations for reproducibility in deep reinforcement learning. arXiv preprint arXiv:1809.05676

[31]

Nass, D.; Belousov, B.; and Peters, J. 2019. Entropic risk measure in policy search. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1101--1106. IEEE

[32]

Oh, J.; Singh, S.; Lee, H.; and Kohli, P. 2017. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, 2661--2670. PMLR

[33]

Puterman, M. L. 1990. Markov decision processes. Handbooks in operations research and management science, 2: 331--434

[34]

Rezaei-Shoshtari, S.; Morissette, C.; Hogan, F. R.; Dudek, G.; and Meger, D. 2023. Hypernetworks for zero-shot transfer in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 9579--9587

[35]

Shen, Y.; Tobia, M. J.; Sommer, T.; and Obermayer, K. 2014. Risk-sensitive reinforcement learning. Neural computation, 26(7): 1298--1328

[36]

Sikchi, H.; Zheng, Q.; Zhang, A.; and Niekum, S. 2023. Dual RL: Unification and new methods for reinforcement and imitation learning. arXiv preprint arXiv:2302.08560

[37]

Srinivasan, K.; Eysenbach, B.; Ha, S.; Tan, J.; and Finn, C. 2020. Learning to be safe: Deep RL with a safety critic. arXiv preprint arXiv:2010.14603

[38]

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press

[39]

Tamar, A.; Di Castro, D.; and Mannor, S. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research, 17(13): 1--36

[40]

Taylor, M. E.; and Stone, P. 2009. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7)

[41]

Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026--5033. IEEE

[42]

Touati, A.; Rapin, J.; and Ollivier, Y. 2022. Does Zero-Shot Reinforcement Learning Exist? In The Eleventh International Conference on Learning Representations

[43]

Turchetta, M.; Kolobov, A.; Shah, S.; Krause, A.; and Agarwal, A. 2020. Safe reinforcement learning via curriculum induction. Advances in Neural Information Processing Systems, 33: 12151--12162

[44]

Wen, Z.; and Van Roy, B. 2017. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3): 762--782

[45]

Whiteson, S. 2021. Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning

[46]

Zhang, H.; Seal, S.; Wu, D.; Bouffard, F.; and Boulet, B. 2022. Building energy management with reinforcement learning and model predictive control: A survey. IEEE Access, 10: 27853--27862

[47]

Zhang, J.; Bedi, A. S.; Wang, M.; and Koppel, A. 2021. Cautious reinforcement learning via distributional risk in the dual domain. IEEE Journal on Selected Areas in Information Theory, 2(2): 611--626

[48]

Zhang, S.; Fernando, H. D.; Liu, M.; Murugesan, K.; Lu, S.; Chen, P.-Y.; Chen, T.; and Wang, M. 2024. SF-DQN: Provable Knowledge Transfer using Successor Feature for Deep Reinforcement Learning. arXiv preprint arXiv:2405.15920
