TRAM: Test-Time Risk Adaptation with Mixture of Agents
Pith reviewed 2026-05-23 22:13 UTC · model grok-4.3
The pith
TRAM adapts fixed RL policies to new risk requirements at test time without parameter updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRAM evaluates each source policy under the target reward and an occupancy-based deployment risk, then uses risk-adjusted source scores to select actions from the mixture. Unlike training-time risk-sensitive methods tied to a fixed surrogate such as return variance, it supports risk types specified only at test time: spatial barrier exposure, divergence from a reference behavior, and local volatility. The approach is explicitly characterized as a surrogate: it does not solve the full occupancy-control problem of the stitched policy, but admits a measurable source-hull mismatch term that connects the source-scored risk to the realized risk under the mixture.
What carries the argument
A source-scored composition rule that scores each source policy by target reward and occupancy-based risk, then selects actions from the mixture using those risk-adjusted scores.
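The summary does not spell out the exact scoring rule, so the following is only a minimal sketch of one way a source-scored composition could look. It assumes a linear score `reward - risk_weight * risk`, softmax weights with a `temperature`, and per-state mixing of the sources' action distributions. All function names, parameters, and numbers are hypothetical illustrations, not TRAM's actual rule.

```python
import math
import random


def risk_adjusted_weights(rewards, risks, risk_weight, temperature=1.0):
    """Softmax over assumed per-source scores reward_i - risk_weight * risk_i."""
    scores = [r - risk_weight * c for r, c in zip(rewards, risks)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_action_probs(weights, source_action_probs):
    """Convex combination of the sources' action distributions at one state."""
    n_actions = len(source_action_probs[0])
    return [
        sum(w * probs[a] for w, probs in zip(weights, source_action_probs))
        for a in range(n_actions)
    ]


def sample_action(probs, rng=random):
    """Draw one action index from the mixed distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Two hypothetical sources: high reward / high risk, and low reward / low risk.
rewards = [1.0, 0.6]
risks = [0.8, 0.1]
weights = risk_adjusted_weights(rewards, risks, risk_weight=2.0)

# Hypothetical action distributions of the two sources at the current state.
probs = mixture_action_probs(weights, [[0.9, 0.1], [0.2, 0.8]])
action = sample_action(probs, random.Random(0))
```

With a risk weight of 2.0, the low-risk source receives the larger weight, so the mixed action distribution leans toward its preferred action. Nothing here updates any source policy; only the weights change when the risk specification changes.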
If this is right
- Supports spatial barrier, divergence, and volatility risks at test time.
- Reduces deployment risk while preserving reward without parameter updates.
- Evaluated on gridworlds, MuJoCo Reacher, Safety-Gymnasium, and an LLM alignment setting.
- Characterized with an explicit source-hull mismatch term for the surrogate.
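To make the three risk families named above more concrete, here is a hedged sketch that writes each one as a function of a discrete state-occupancy vector. The exact definitions used in the paper are not given in this summary, so the forms below (a hazard-weighted sum, a KL divergence to a reference occupancy, and an occupancy-weighted local variance) are assumptions chosen for illustration, and every name is hypothetical.

```python
import math


def barrier_exposure(occupancy, hazard):
    """Assumed spatial barrier risk: occupancy mass weighted by a hazard map."""
    return sum(d * h for d, h in zip(occupancy, hazard))


def kl_to_reference(occupancy, reference, eps=1e-12):
    """Assumed divergence risk: KL(occupancy || reference occupancy)."""
    return sum(
        d * math.log(d / max(q, eps))
        for d, q in zip(occupancy, reference)
        if d > 0.0
    )


def local_volatility(occupancy, local_variance):
    """Assumed volatility risk: occupancy-weighted per-state variance."""
    return sum(d * v for d, v in zip(occupancy, local_variance))


# Hypothetical three-state example; state 2 is the hazardous one.
occupancy = [0.5, 0.3, 0.2]
hazard = [0.0, 0.0, 1.0]
reference = [1.0 / 3.0] * 3
variance = [0.1, 0.4, 0.0]

exposure = barrier_exposure(occupancy, hazard)
divergence = kl_to_reference(occupancy, reference)
volatility = local_volatility(occupancy, variance)
```

The point of the sketch is only that all three risks can be evaluated from an occupancy estimate after training, which is what allows the risk type to be swapped at test time.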
Where Pith is reading between the lines
- If the source policy library lacks diversity, the mismatch term may grow and adaptation quality could degrade.
- The method could be combined with online policy addition to the library for better coverage over time.
- Occupancy measure computation might limit scalability in very large state spaces.
Load-bearing premise
The library of source policies is diverse enough for their convex combinations to closely approximate the desired policy under the new risk constraints.
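A small tabular example can show why this premise matters. The sketch below compares a source-scored risk (the weighted average of each source's own risk) with the realized risk of the per-state mixture policy in a hypothetical three-state deterministic MDP. The gap it computes is an illustrative stand-in, not the paper's source-hull mismatch term, and the MDP, weights, and function names are all assumptions.

```python
def discounted_occupancy(policy, next_state, start, gamma, horizon=2000):
    """Discounted state occupancy of `policy` in a deterministic tabular MDP."""
    n_states = len(next_state)
    dist = [0.0] * n_states
    dist[start] = 1.0
    occ = [0.0] * n_states
    discount = 1.0 - gamma
    for _ in range(horizon):
        for s in range(n_states):
            occ[s] += discount * dist[s]
        new_dist = [0.0] * n_states
        for s in range(n_states):
            for a, p in enumerate(policy[s]):
                new_dist[next_state[s][a]] += dist[s] * p
        dist = new_dist
        discount *= gamma
    return occ


# Hypothetical MDP: state 0 is the start, state 1 is safe, state 2 is a hazard.
# next_state[s][a] gives the successor of state s under action a.
next_state = [[1, 2], [1, 2], [2, 2]]
hazard = [0.0, 0.0, 1.0]
gamma = 0.9

source_a = [[1.0, 0.0]] * 3  # always action 0: stays in the safe state
source_b = [[0.0, 1.0]] * 3  # always action 1: goes straight to the hazard
weights = [0.5, 0.5]


def risk(policy):
    """Hazard exposure under the policy's discounted occupancy."""
    occ = discounted_occupancy(policy, next_state, 0, gamma)
    return sum(d * h for d, h in zip(occ, hazard))


# Per-state mixture of the two sources' action distributions.
mixture = [
    [weights[0] * pa + weights[1] * pb for pa, pb in zip(row_a, row_b)]
    for row_a, row_b in zip(source_a, source_b)
]

scored = weights[0] * risk(source_a) + weights[1] * risk(source_b)
realized = risk(mixture)
gap = realized - scored
```

In this toy case the source-scored risk is about 0.45 while the realized risk of the mixture is about 0.82, because mixing at every state keeps re-exposing the agent to the hazardous action. Occupancies of stitched policies are not convex combinations of source occupancies in general, which is why a gap of this kind needs to be measured rather than assumed away.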
What would settle it
Measuring the realized risk and reward in an environment where the target risk specification cannot be well-approximated by mixtures of the given source policies, and checking if TRAM still outperforms non-adaptive baselines.
Original abstract
Deployed reinforcement learning agents often face safety requirements that are specified only after training, such as new hazard maps, revised risk thresholds, or behavioral alignment constraints. We study zero-update deployment-time adaptation, where a fixed library of risk-neutral source policies is reused under a newly specified reward-risk tradeoff. We propose TRAM (Test-Time Risk Adaptation via Mixture of Agents), a source-scored composition rule that evaluates each source policy under the target reward and an occupancy-based deployment risk, then selects actions using risk-adjusted source scores. Unlike training-time risk-sensitive methods tied to a fixed surrogate such as return variance, TRAM supports spatial barrier exposure, divergence from a reference behavior, and local volatility risks specified at test time. We explicitly characterize TRAM as a surrogate method: it does not solve the full occupancy-control problem of the stitched policy, but admits a measurable source-hull mismatch term connecting source-scored risk to realized risk. Experiments in gridworlds, MuJoCo Reacher, Safety-Gymnasium, and an LLM alignment setting show that TRAM reduces deployment risk while preserving reward, without requiring any parameter updates at test time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TRAM, a test-time adaptation method for RL agents facing post-training risk specifications (e.g., new hazard maps or divergence constraints). It reuses a fixed library of risk-neutral source policies via a source-scored composition rule that evaluates each source under the target reward and an occupancy-based deployment risk, then selects actions accordingly. TRAM is explicitly framed as a surrogate (not solving the full occupancy-control problem) that admits a measurable source-hull mismatch term linking source-scored risk to realized risk. Experiments in gridworlds, MuJoCo Reacher, Safety-Gymnasium, and an LLM alignment task are reported to show risk reduction while preserving reward, with no parameter updates at test time.
Significance. If the central claims hold, TRAM would provide a practical, training-free mechanism for adapting deployed RL policies to evolving safety requirements, which is valuable for real-world settings where risk specifications are not known at training time. The multi-domain experimental validation (gridworlds through LLM alignment) and the explicit surrogate characterization with a measurable mismatch term are strengths that support practical utility and distinguish the work from training-time risk-sensitive methods.
major comments (2)
- [Abstract and surrogate characterization] The source-hull mismatch term is introduced as connecting source-scored risk to realized risk and is described as measurable, yet no derivation, Lipschitz-style bound on occupancy deviation, or richness condition on the source library is supplied to guarantee when the term remains small. This is load-bearing for the claim that risk reduction holds under arbitrary test-time risk specifications, as an insufficiently expressive source library can cause the convex combination to fail to approximate the target occupancy-constrained policy.
- [Experiments] While results are reported across domains, there is no ablation or sensitivity analysis quantifying how source-library expressiveness affects the realized mismatch term or risk-reduction margin (e.g., by varying library size or diversity relative to the test-time hazard map). Without this, it is difficult to assess whether the observed risk reductions generalize beyond the chosen source libraries.
minor comments (2)
- [Method] Notation for the occupancy-based deployment risk and source scores should be introduced with a single consolidated definition early in the method section to improve readability.
- [Method] The abstract states that TRAM supports spatial barrier exposure, divergence, and local volatility risks, but the main text should include a brief table mapping each risk type to the corresponding occupancy measure used in the scoring rule.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of TRAM as a surrogate method. We address each major comment below and indicate planned revisions.
Point-by-point responses
- Referee: [Abstract and surrogate characterization] The source-hull mismatch term is introduced as connecting source-scored risk to realized risk and is described as measurable, yet no derivation, Lipschitz-style bound on occupancy deviation, or richness condition on the source library is supplied to guarantee when the term remains small. This is load-bearing for the claim that risk reduction holds under arbitrary test-time risk specifications, as an insufficiently expressive source library can cause the convex combination to fail to approximate the target occupancy-constrained policy.
  Authors: The manuscript explicitly frames TRAM as a surrogate that admits a measurable source-hull mismatch term without claiming a universal bound or guarantee that the term remains small for arbitrary source libraries or test-time specifications. The primary support for risk reduction is empirical, across the reported domains. We agree that adding an explicit derivation of the mismatch term, along with a discussion of conditions on source-library richness, would strengthen the surrogate characterization. We will incorporate this derivation and discussion in the revised manuscript.
  Revision: yes
- Referee: [Experiments] While results are reported across domains, there is no ablation or sensitivity analysis quantifying how source-library expressiveness affects the realized mismatch term or risk-reduction margin (e.g., by varying library size or diversity relative to the test-time hazard map). Without this, it is difficult to assess whether the observed risk reductions generalize beyond the chosen source libraries.
  Authors: The current experiments use domain-appropriate source libraries that enable the observed risk reductions, but we acknowledge that explicit sensitivity analysis on library size and diversity would better quantify the mismatch term's dependence on expressiveness. We will add such ablations in the revised manuscript, including measurements of the mismatch term where feasible.
  Revision: yes
Circularity Check
No circularity; surrogate characterization includes independent mismatch term
The provided abstract and description contain no self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. TRAM is explicitly framed as a surrogate admitting a measurable source-hull mismatch term that connects source-scored risk to realized risk without claiming exact equivalence to the occupancy-control problem. The source-library richness assumption is stated as a premise rather than derived by construction from the method itself. No equations or steps reduce the central claim to its inputs by definition.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: A fixed library of risk-neutral source policies is available and sufficient for the target risk-reward tradeoff.
- Domain assumption: Occupancy-based deployment risk can be evaluated from source policies without additional training.
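The second assumption can be illustrated with a rollout-based estimate: if a simulator is available, a source policy's discounted hazard exposure can be approximated by simply running it, with no gradient step involved. The sketch below assumes a hypothetical five-cell line world with a hazard in the last cell; the environment, the callback signatures, and all names are made up for illustration and do not come from the paper.

```python
import random


def rollout_risk(policy, step, hazard, start, gamma, horizon, episodes, seed=0):
    """Monte Carlo estimate of discounted hazard exposure for a fixed policy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state = start
        discount = 1.0 - gamma  # normalizes the discounted occupancy
        for _ in range(horizon):
            total += discount * hazard(state)
            action = policy(state, rng)
            state = step(state, action, rng)
            discount *= gamma
    return total / episodes


# Hypothetical line world with cells 0..4; cell 4 is the hazard.
def step(state, action, rng):
    return min(max(state + action, 0), 4)


def hazard(state):
    return 1.0 if state == 4 else 0.0


def always_left(state, rng):
    return -1


def always_right(state, rng):
    return 1


def mostly_right(state, rng):
    return 1 if rng.random() < 0.7 else -1


gamma = 0.9
safe_risk = rollout_risk(always_left, step, hazard, 0, gamma, 200, 1)
direct_risk = rollout_risk(always_right, step, hazard, 0, gamma, 200, 1)
noisy_risk = rollout_risk(mostly_right, step, hazard, 0, gamma, 200, 500)
```

The deterministic policies give exact reference points: the left-moving policy never reaches the hazard, and the right-moving policy reaches it after four steps, so its exposure is close to `gamma ** 4`. The stochastic policy's estimate falls between the two and carries sampling error that shrinks with more episodes.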