Safe Deep Reinforcement Learning for Building Heating Control and Demand-side Flexibility

Colin Jüni; Federica Bellizio; Giovanni Sansavini; Mina Montazeri; Philipp Heer; Yi Guo

arxiv: 2604.16033 · v1 · submitted 2026-04-17 · eess.SY · cs.AI · cs.SY

Pith reviewed 2026-05-10 08:16 UTC · model grok-4.3

keywords flexibility · energy · deep · reinforcement · demand-side · heating · building · controller

The pith

A DRL-based heating controller with a real-time adaptive safety filter guarantees flexibility compliance, achieves up to 50% energy savings over rule-based methods, and outperforms plain DRL with only minor comfort violations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Buildings consume huge amounts of energy for heating. The authors train an AI agent using deep deterministic policy gradient to learn heating strategies by interacting with a computer model of the building. The agent tries to cut energy costs and respond to grid requests for flexibility, such as temporarily reducing heating. To prevent unsafe actions that violate those requests, they add a safety filter that overrides the AI's output in real time if needed. Simulation tests show the combined system uses less energy and costs less than simple rule-based controllers, beats a version without the safety filter on efficiency, and keeps temperature violations small.
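
The learning step this summary alludes to can be sketched in a few lines. The two functions below show the standard DDPG critic target and the Polyak (soft) target-network update in plain Python. The function names, the list-of-floats parameter representation, and the default values of `gamma` and `tau` are illustrative assumptions, not settings taken from the paper.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Critic target y = r + gamma * Q'(s', mu'(s')); no bootstrap at episode end.

    next_q is the target critic's value for the target actor's action in the
    next state. gamma=0.99 is an assumed default, not the paper's value.
    """
    return reward + (0.0 if done else gamma * next_q)


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.

    Parameters are plain lists of floats here; a real implementation would
    apply the same rule to every network tensor.
    """
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]
```

The slow-moving target networks are what keep the critic's regression target from chasing itself; a small `tau` trades learning speed for stability.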

Core claim

The proposed real-time adaptive safety filter guarantees full compliance with flexibility requests from system operators and improves energy and cost efficiency -- achieving up to 50% savings compared to a rule-based controller -- while outperforming a standalone deep reinforcement learning-based controller in energy and cost metrics, with only a slight increase in comfort temperature violations.

Load-bearing premise

The building thermal model used for training and evaluation accurately captures real-world dynamics, occupant behavior, and disturbances, allowing the learned policy and safety filter to transfer without significant performance degradation or safety violations in physical buildings.

Figures

Figures reproduced from arXiv: 2604.16033 by Colin Jüni, Federica Bellizio, Giovanni Sansavini, Mina Montazeri, Philipp Heer, Yi Guo.

**Figure 2.** Timeline representation of a flexibility provision

**Figure 3.** (a) The NEST building with the UMAR unit

**Figure 4.** Room temperature evolution, control action and ambient variables under DRL-based control with safety filter

**Figure 5.** Mean reward of DRL-based controller during the

**Figure 6.** Comparison of Key Performance Indicators over

Original abstract

Buildings account for approximately 40% of global energy consumption, and with the growing share of intermittent renewable energy sources, enabling demand-side flexibility, particularly in heating, ventilation and air conditioning systems, is essential for grid stability and energy efficiency. This paper presents a safe deep reinforcement learning-based control framework to optimize building space heating while enabling demand-side flexibility provision for power system operators. A deep deterministic policy gradient algorithm is used as the core deep reinforcement learning method, enabling the controller to learn an optimal heating strategy through interaction with the building thermal model while maintaining occupant comfort, minimizing energy cost, and providing flexibility. To address safety concerns with reinforcement learning, particularly regarding compliance with flexibility requests, we propose a real-time adaptive safety-filter to ensure that the system operates within predefined constraints during demand-side flexibility provision. The proposed real-time adaptive safety filter guarantees full compliance with flexibility requests from system operators and improves energy and cost efficiency -- achieving up to 50% savings compared to a rule-based controller -- while outperforming a standalone deep reinforcement learning-based controller in energy and cost metrics, with only a slight increase in comfort temperature violations.
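
The abstract names three objectives for the agent: occupant comfort, energy cost, and flexibility provision. A minimal sketch of how a per-step reward could combine them follows. Everything specific in it is an assumption made for illustration: the function name, the weights, the comfort band, the 15-minute step, and the modelling of a flexibility request as a cap on heating power. It is not the paper's reward function.

```python
def step_reward(power_kw, price_per_kwh, room_temp_c, dt_h=0.25,
                comfort_band_c=(21.0, 24.0), flex_cap_kw=None,
                w_comfort=1.0, w_flex=10.0):
    """Hypothetical reward: negative energy cost minus weighted penalties."""
    # Cost of the energy used during this control step.
    energy_cost = power_kw * dt_h * price_per_kwh

    # Degrees outside the assumed comfort band (zero when inside it).
    low, high = comfort_band_c
    comfort_violation = max(low - room_temp_c, 0.0) + max(room_temp_c - high, 0.0)

    # Power above the requested cap, only while a request is active.
    flex_violation = 0.0
    if flex_cap_kw is not None:
        flex_violation = max(power_kw - flex_cap_kw, 0.0)

    return -(energy_cost + w_comfort * comfort_violation + w_flex * flex_violation)
```

A penalty term like `w_flex` only discourages violations during training; it cannot rule them out, which is the gap a separate safety filter is meant to close.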

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a real-time adaptive safety filter to DDPG for building heating to guarantee flexibility compliance, but the 50% savings and full-guarantee claims rest on an unvalidated thermal model with no robustness checks shown.

The main contribution is a DDPG policy for space heating control paired with a new real-time adaptive safety filter that enforces operator flexibility requests while trying to keep comfort and cost in check. The paper reports up to 50% energy and cost savings over a rule-based baseline and better results than plain DRL, with only a small rise in temperature violations. The filter is the clearest addition; it appears to adjust actions on the fly to meet external constraints without retraining the whole policy. That is a reasonable engineering step for making RL usable in grid-interactive buildings. The problem framing is also solid: buildings account for roughly 40% of global energy consumption, and flexibility matters for integrating renewables.

The soft spots are in the evidence. The abstract gives headline numbers but no information on simulation length, number of random seeds, statistical tests, or how the rule-based and DRL baselines were implemented. More critically, everything runs inside one thermal model. The safety guarantee and the reported performance only hold if that model matches the real building at every step. Any gap in occupancy, sensor noise, forecast error, or parameter drift could break compliance or increase comfort violations beyond what is shown. No sensitivity runs or hardware tests are mentioned, so the numbers stay conditional.

This is for people already working on safe RL or model-predictive control for HVAC and demand response. A reader could extract the filter design and try it in their own simulator. I would send it to peer review. The application area is important, and the filter idea is concrete enough that referees can push for the missing experiments and robustness analysis rather than desk-rejecting outright.

Referee Report

3 major / 2 minor

Summary. The paper proposes a safe deep reinforcement learning framework for optimizing building space heating control. It employs a deep deterministic policy gradient (DDPG) algorithm to learn policies that minimize energy costs and maintain comfort while providing demand-side flexibility to grid operators. A real-time adaptive safety filter is introduced to enforce compliance with flexibility requests, with claims of full compliance, up to 50% energy savings versus rule-based controllers, and outperformance of standalone DRL with only minor comfort violations.

Significance. If the safety guarantees hold under realistic conditions and the performance improvements are robustly validated, the work could contribute to practical RL deployment for grid-interactive buildings. The combination of DRL with an adaptive safety layer addresses a relevant gap in safe control for demand response, but the significance is currently limited by the absence of evidence that results transfer beyond the training model.

major comments (3)

[Abstract / Safety Filter] Abstract and safety filter section: The claim that the real-time adaptive safety filter 'guarantees full compliance with flexibility requests' is presented without a formal derivation, proof, or set of assumptions under which the guarantee holds. No conditions on model accuracy, prediction horizons, or disturbance bounds are stated, making the central safety claim unsupported.
[Results / Evaluation] Evaluation and results: Quantitative claims (50% savings, outperformance over rule-based and standalone DRL controllers) are reported without details on simulation setup, number of independent runs, statistical tests, variance across seeds or scenarios, or exact baseline implementations. This prevents assessment of whether the reported gains are statistically meaningful or reproducible.
[Model and Evaluation] Building thermal model and robustness: All training, safety-filter operation, and evaluation occur inside a single thermal model. No sensitivity analysis to model mismatch (e.g., parameter drift, unmodeled occupancy, weather forecast error, or sensor noise) is provided, so the headline metrics and 'full compliance' are conditional on perfect model fidelity and do not address the transferability concern.

minor comments (2)

[Abstract / Introduction] The abstract and introduction could more clearly distinguish the proposed safety filter from standard projection or shielding methods in safe RL literature.
[Preliminaries] Notation for state, action, and constraint sets in the DDPG formulation and safety filter should be introduced consistently with a single table or list of symbols.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating the revisions we will make to strengthen the presentation of the safety claims, experimental details, and limitations.

Referee: [Abstract / Safety Filter] Abstract and safety filter section: The claim that the real-time adaptive safety filter 'guarantees full compliance with flexibility requests' is presented without a formal derivation, proof, or set of assumptions under which the guarantee holds. No conditions on model accuracy, prediction horizons, or disturbance bounds are stated, making the central safety claim unsupported.

Authors: We acknowledge that the current manuscript does not include an explicit formal derivation or stated assumptions for the safety guarantee. The real-time adaptive safety filter projects the DRL-proposed action onto the feasible set defined by the flexibility constraints at each timestep, using the building thermal model for forward simulation of constraint satisfaction. We will revise both the abstract and the safety-filter section to explicitly list the operating assumptions (accurate model, perfect state observation, and disturbances within the modeled bounds) and provide a concise mathematical argument showing why the projection step enforces compliance under those conditions. The claim will be qualified accordingly rather than stated unconditionally.

revision: yes

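
The projection step described in this response can be illustrated for the simplest case. The sketch below assumes a scalar heating-power action and a flexibility request expressed as a power cap over a window of timesteps, so the feasible set is an interval and the Euclidean projection reduces to clipping. The function name, the `(start_step, end_step, cap_kw)` tuple, and the power-cap form of the request are hypothetical, and the model-based forward simulation mentioned in the response is deliberately left out.

```python
def safety_filter(proposed_kw, step, max_power_kw, request=None):
    """Project a proposed heating power onto the currently feasible interval.

    request is an optional (start_step, end_step, cap_kw) tuple, a hypothetical
    encoding of a flexibility request active for start_step <= step < end_step.
    """
    upper = max_power_kw
    if request is not None:
        start_step, end_step, cap_kw = request
        if start_step <= step < end_step:
            # During the request the cap tightens the upper bound.
            upper = min(upper, cap_kw)
    # Clipping is the exact Euclidean projection onto [0, upper].
    return min(max(proposed_kw, 0.0), upper)
```

For example, `safety_filter(3.0, step=12, max_power_kw=5.0, request=(10, 20, 1.0))` returns `1.0`, while the same proposal outside the window passes through unchanged. Compliance holds by construction for this interval constraint; any guarantee that depends on predicted temperatures would additionally rest on the assumptions the authors list.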
Referee: [Results / Evaluation] Evaluation and results: Quantitative claims (50% savings, outperformance over rule-based and standalone DRL controllers) are reported without details on simulation setup, number of independent runs, statistical tests, variance across seeds or scenarios, or exact baseline implementations. This prevents assessment of whether the reported gains are statistically meaningful or reproducible.

Authors: We agree that the reported quantitative results require additional methodological detail for reproducibility. We will expand the simulation-setup and results sections to describe the exact building thermal model parameters, the training protocol for the DDPG agent, the number of independent runs with different random seeds, the precise definition of the rule-based baseline (hysteresis thermostat) and the standalone DRL baseline, and the observed variance in energy and cost metrics. Standard deviations or error bars will be added to the headline figures, allowing readers to evaluate statistical meaningfulness.

revision: yes

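
Two items in this response are easy to pin down in code: a hysteresis thermostat and a mean and standard deviation across seeds. The sketch below shows one common form of each. The setpoint, the deadband width, and both function names are assumptions for illustration; the paper's actual baseline definition is not given in the material reviewed here.

```python
from statistics import mean, stdev


def hysteresis_thermostat(room_temp_c, heater_on, setpoint_c=22.0, deadband_c=1.0):
    """Return the next on/off state of a simple deadband thermostat."""
    if room_temp_c < setpoint_c - deadband_c / 2:
        return True       # too cold: switch heating on
    if room_temp_c > setpoint_c + deadband_c / 2:
        return False      # too warm: switch heating off
    return heater_on      # inside the deadband: keep the previous state


def summarize_runs(values):
    """Mean and sample standard deviation of a metric over independent runs.

    Needs at least two runs, since stdev uses the n - 1 denominator.
    """
    return mean(values), stdev(values)
```

Reporting `summarize_runs` output per controller, with one value per random seed, is the minimum needed to judge whether a savings figure exceeds run-to-run noise.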
Referee: [Model and Evaluation] Building thermal model and robustness: All training, safety-filter operation, and evaluation occur inside a single thermal model. No sensitivity analysis to model mismatch (e.g., parameter drift, unmodeled occupancy, weather forecast error, or sensor noise) is provided, so the headline metrics and 'full compliance' are conditional on perfect model fidelity and do not address the transferability concern.

Authors: This observation is correct: the present evaluation assumes an exact match between the controller's internal model and the simulation environment. Because the safety filter is model-based, any mismatch directly affects both compliance and performance. In the revised manuscript we will add an explicit limitations paragraph in the discussion section stating this assumption and its implications for real-world transfer. We will also include a limited sensitivity study that perturbs key model parameters and forecast noise within realistic ranges and reports the resulting changes in compliance rate and energy savings, thereby partially addressing the transferability concern while identifying robustification as future work.

revision: partial
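
To show what a parameter-perturbation study of this kind looks like in its smallest form, the sketch below uses a hypothetical first-order (1R1C) room model as a stand-in for the paper's building thermal model. The controller picks a constant heating power from its nominal thermal resistance, the simulated plant uses a perturbed resistance, and the script records how far the room temperature sags. All parameter values, names, and the open-loop controller are assumptions chosen for illustration.

```python
def min_room_temp(r_true, c_true, r_nominal, steps=96, dt_h=0.25,
                  t_out_c=0.0, setpoint_c=22.0, max_power_kw=6.0):
    """Lowest room temperature over the horizon for a 1R1C plant.

    Resistances in K/kW, capacitance in kWh/K, power in kW, dt_h in hours.
    """
    # The controller trusts its nominal resistance to hold the setpoint.
    power = min(max((setpoint_c - t_out_c) / r_nominal, 0.0), max_power_kw)
    temp = setpoint_c
    lowest = temp
    for _ in range(steps):
        # Explicit Euler step of C * dT/dt = power - (T - T_out) / R.
        temp += dt_h / c_true * (power - (temp - t_out_c) / r_true)
        lowest = min(lowest, temp)
    return lowest


def resistance_sensitivity(r_nominal=5.0, c_nominal=2.0,
                           rel_errors=(-0.2, 0.0, 0.2)):
    """Map each relative error in R to the resulting minimum temperature."""
    return {err: min_room_temp(r_nominal * (1.0 + err), c_nominal, r_nominal)
            for err in rel_errors}
```

With zero error the temperature stays at the setpoint. A building that is leakier than the controller believes (negative error) drifts below it, which is the mechanism by which model mismatch turns into comfort violations. A fuller study would repeat this with the learned policy, the safety filter, and forecast noise in the loop.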

Circularity Check

0 steps flagged

No circularity; performance claims arise from simulation, not definitional reduction

The paper applies the standard DDPG algorithm augmented by a proposed real-time adaptive safety filter whose constraints are stated as external inputs (flexibility requests and comfort bounds). Reported metrics such as 50% energy savings and full compliance are presented as empirical outcomes of closed-loop simulation against the building thermal model, not as quantities that equal the filter definition or training data by algebraic construction. No equations, self-citations, or uniqueness theorems are invoked that would collapse the central claims into the inputs; the derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities; the central claims rest on unstated assumptions about model fidelity and filter correctness.

axioms (1)

domain assumption: The building thermal model is sufficiently accurate to train and validate a controller that transfers to real buildings. This is implicit in using the model for RL training and performance claims.

pith-pipeline@v0.9.0 · 5511 in / 1218 out tokens · 38077 ms · 2026-05-10T08:16:27.869596+00:00 · methodology
