PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

Khashayar Yavari; Shadmehr Zaregarizi

arxiv: 2605.28232 · v1 · pith:AOM3T56Nnew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 12:15 UTC · model grok-4.3

keywords: physics-informed reward shaping, building energy management, soft actor-critic, ISO 7730 PMV, thermal comfort, reinforcement learning, CityLearn

The pith

PIRS replaces temperature-deviation comfort proxies in SAC rewards with the ISO 7730 PMV equation for building energy control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PIRS as a way to ground reward functions for Soft Actor-Critic agents in building energy management using established thermal comfort physics instead of ad-hoc terms. It inserts the ISO 7730 Predicted Mean Vote formulation as the comfort signal inside an otherwise unchanged weighted multi-objective reward. The authors evaluate the approach on CityLearn v2.1.2 against rule-based, manually engineered, energy-only, and naive temperature-deviation baselines. Results indicate PIRS reaches cost, carbon, and electricity performance on par with the manual reward while improving load ramping and daily peak demand relative to the non-physics variants.
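To make the "otherwise unchanged weighted multi-objective reward" concrete, here is a minimal sketch of how such a reward could be composed. The summary does not give the paper's actual terms or weights, so the names `RewardWeights`, `temperature_deviation`, and `shaped_reward`, the three-term structure, and the default weights of 1.0 are all illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; the paper's actual values are not stated here."""
    cost: float = 1.0
    carbon: float = 1.0
    comfort: float = 1.0


def temperature_deviation(indoor_temp_c: float, setpoint_c: float = 22.0) -> float:
    """Naive comfort proxy: absolute distance from an assumed setpoint."""
    return abs(indoor_temp_c - setpoint_c)


def shaped_reward(cost: float, carbon: float, discomfort: float,
                  weights: RewardWeights = RewardWeights()) -> float:
    """Negative weighted sum of penalties (higher reward is better).

    Only `discomfort` differs between a temperature-deviation reward and a
    PMV-based one; the cost and carbon terms and the weights stay the same.
    """
    return -(weights.cost * cost
             + weights.carbon * carbon
             + weights.comfort * discomfort)


# Temperature-deviation variant: discomfort is |T_indoor - setpoint|.
r_naive = shaped_reward(cost=0.12, carbon=0.05,
                        discomfort=temperature_deviation(24.5))

# PMV-style variant: discomfort is |PMV|, computed elsewhere (value assumed).
r_pmv = shaped_reward(cost=0.12, carbon=0.05, discomfort=abs(0.4))
```

Under this framing the two variants differ only in the quantity passed as `discomfort`, which is why the question of "physics grounding versus rescaling" raised later on this page matters.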

Core claim

Anchoring the comfort term in the ISO 7730 PMV formulation inside the SAC reward produces district-level KPIs comparable to a manually engineered baseline and superior to temperature-deviation and energy-only designs on ramping (1.78x vs. ~2.4x RBC) and peak demand, all without modifying any other element of the learning pipeline.

What carries the argument

The ISO 7730 Predicted Mean Vote (PMV) equation inserted as the comfort component of a weighted multi-objective reward for SAC.
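For readers who have not worked with PMV, the following is a minimal sketch of the ISO 7730 calculation in plain Python. It is not the authors' code: the function name `pmv_iso7730` is hypothetical, and the clothing surface temperature is solved by bisection rather than the iterative scheme printed in the standard. The paper's reference list includes the pythermalcomfort package, but this summary does not say whether or how the authors use it.

```python
import math


def pmv_iso7730(ta: float, tr: float, vel: float, rh: float,
                met: float, clo: float, wme: float = 0.0) -> float:
    """Predicted Mean Vote following the ISO 7730 formulation.

    ta, tr : air and mean radiant temperature [deg C]
    vel    : relative air velocity [m/s]
    rh     : relative humidity [%]
    met    : metabolic rate [met], clo: clothing insulation [clo]
    wme    : external work [met], usually 0
    """
    pa = rh * 10.0 * math.exp(16.6536 - 4030.183 / (ta + 235.0))  # vapour pressure [Pa]
    icl = 0.155 * clo                 # clothing insulation [m2 K/W]
    m = met * 58.15                   # metabolic rate [W/m2]
    mw = m - wme * 58.15              # internal heat production [W/m2]
    fcl = 1.0 + 1.29 * icl if icl <= 0.078 else 1.05 + 0.645 * icl
    hc_forced = 12.1 * math.sqrt(vel)

    def surface_loss(tcl: float) -> float:
        """Radiative plus convective loss from the clothing surface [W/m2]."""
        hc = max(2.38 * abs(tcl - ta) ** 0.25, hc_forced)
        radiation = 3.96e-8 * fcl * ((tcl + 273.0) ** 4 - (tr + 273.0) ** 4)
        return radiation + fcl * hc * (tcl - ta)

    # Solve tcl = 35.7 - 0.028*mw - icl*surface_loss(tcl) by bisection;
    # the residual decreases monotonically in tcl, so one root is bracketed.
    lo, hi = -50.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if 35.7 - 0.028 * mw - icl * surface_loss(mid) - mid > 0.0:
            lo = mid
        else:
            hi = mid
    tcl = 0.5 * (lo + hi)

    load = (mw
            - 3.05e-3 * (5733.0 - 6.99 * mw - pa)   # skin diffusion
            - 0.42 * (mw - 58.15)                   # sweating
            - 1.7e-5 * m * (5867.0 - pa)            # latent respiration
            - 0.0014 * m * (34.0 - ta)              # dry respiration
            - surface_loss(tcl))                    # radiation + convection
    return (0.303 * math.exp(-0.036 * m) + 0.028) * load
```

As a sanity check, `pmv_iso7730(22.0, 22.0, 0.10, 60.0, 1.2, 0.5)` comes out at roughly -0.75 (slightly cool). Unlike a temperature-deviation term, the result depends on humidity, air speed, clothing, and activity, so the inputs a CityLearn agent would need to supply or assume for those quantities are a design choice the summary does not specify.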

If this is right

Cost, carbon, and electricity KPIs stay comparable to a manually tuned reward at 50k training steps.
Load ramping reaches 1.78 times the RBC baseline versus roughly 2.4 times for non-PIRS DRL variants.
Daily peak demand reduction improves over the naive temperature-deviation reward.
All tested DRL agents still post KPI ratios above 1 relative to RBC at the reported training budget, meaning none of them beats the rule-based controller yet.
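The ratios in the list above can be read with a small worked sketch. The definitions below (ramping as the sum of absolute step-to-step load changes, daily peak as the mean of per-day maxima) are assumptions in the spirit of CityLearn's district KPIs, not a copy of its implementation, and the load series are toy numbers rather than results from the paper.

```python
from typing import Sequence


def ramping(load: Sequence[float]) -> float:
    """Sum of absolute changes between consecutive time steps."""
    return sum(abs(b - a) for a, b in zip(load, load[1:]))


def mean_daily_peak(load: Sequence[float], steps_per_day: int = 24) -> float:
    """Average of the maximum load observed in each day."""
    days = [load[i:i + steps_per_day] for i in range(0, len(load), steps_per_day)]
    return sum(max(day) for day in days) / len(days)


def ratio_vs_rbc(agent_value: float, rbc_value: float) -> float:
    """KPI ratio; values above 1.0 mean the agent is worse than the RBC."""
    return agent_value / rbc_value


# Toy two-day series with two steps per day (illustrative only).
rbc_load = [1.0, 2.0, 1.0, 2.0]
agent_load = [1.0, 3.0, 1.0, 3.0]

ramp_ratio = ratio_vs_rbc(ramping(agent_load), ramping(rbc_load))
peak_ratio = ratio_vs_rbc(mean_daily_peak(agent_load, steps_per_day=2),
                          mean_daily_peak(rbc_load, steps_per_day=2))
```

Here `ramp_ratio` is 2.0 and `peak_ratio` is 1.5: the toy agent swings the load twice as hard as the baseline. A reported ramping ratio of 1.78 is read the same way, as smoother than 2.4 but still rougher than the rule-based controller.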

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same PMV substitution could be tested in other building control tasks where ISO standards already define comfort.
If PMV changes policy behavior beyond simple rescaling, longer training runs may reveal larger gaps from temperature-based rewards.
Reward design that cites an explicit standard may simplify transfer of controllers across buildings with different envelope properties.

Load-bearing premise

Replacing temperature-deviation proxies with the ISO 7730 PMV equation will create meaningfully different learning dynamics in the SAC agent rather than merely rescaling an existing comfort penalty already handled by the weighted reward.

What would settle it

A controlled run in which the SAC policy and final KPIs under the PMV reward prove statistically identical to those under the temperature-deviation reward at the same training budget.
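One way such a check could be run with only five seeds per reward is an exact two-sample permutation test on a single KPI. This is a sketch under assumptions: the function name `exact_permutation_pvalue` is hypothetical, the choice of test is ours rather than the paper's, and the per-seed values below are made-up placeholders, not reported results.

```python
from itertools import combinations
from statistics import mean
from typing import Sequence


def exact_permutation_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact permutation test on the difference of means.

    Enumerates every way to split the pooled values into groups of the
    original sizes (252 splits for 5 vs. 5 seeds) and returns the share of
    splits whose mean gap is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    observed = abs(mean(a) - mean(b))
    total = sum(pooled)
    hits = count = 0
    for idx in combinations(range(len(pooled)), n_a):
        group_sum = sum(pooled[i] for i in idx)
        gap = abs(group_sum / n_a - (total - group_sum) / n_b)
        count += 1
        if gap >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / count


# Placeholder per-seed ramping ratios (NOT values from the paper).
pmv_reward = [1.70, 1.75, 1.80, 1.78, 1.82]
temp_reward = [2.30, 2.40, 2.45, 2.35, 2.50]

p_value = exact_permutation_pvalue(pmv_reward, temp_reward)
```

With five seeds per arm the smallest achievable two-sided p-value is 2/252, about 0.008, which is what fully separated samples like the placeholders produce. A p-value near 1 on the real per-seed KPIs would support the "statistically identical" outcome described above.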

Figures

Figures reproduced from arXiv: 2605.28232 by Khashayar Yavari, Shadmehr Zaregarizi.

**Figure 2.** Daily peak demand ratio vs. RBC for E2–E5. Man ...

**Figure 3.** Cost vs. load-ramping trade-off (E2–E5). PIRS (E5) ...

Original abstract

Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functions are specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies without explicit grounding in thermal-comfort physics. We present PIRS (Physics-Informed Reward Shaping), which replaces these ad-hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. We evaluate PIRS in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compare against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming non-physics-grounded designs -- particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. All DRL policies remain above RBC at this training budget; we interpret this gap honestly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PIRS is a straightforward substitution of the ISO PMV formula into an SAC reward for CityLearn that improves interpretability without altering the rest of the pipeline or claiming large gains.

The paper's main move is replacing ad-hoc comfort terms with the ISO 7730 PMV equation inside a weighted SAC reward for building energy management. They run it on CityLearn v2.1.2 against a rule-based controller and a few other reward variants, training for 50k steps over five seeds.

What stands out is the honesty in the write-up. The abstract notes that all the DRL policies still lag the RBC at this budget and frames PIRS as a standards-aligned foundation rather than a performance breakthrough. That keeps the contribution proportionate: the substitution supplies a physics-grounded proxy that is easier to explain than temperature-deviation heuristics, and the KPI ratios show it matching the manual E2 baseline while beating the naive E4 version on some metrics like load ramping.

The soft spots are mostly execution details rather than conceptual holes. The reported improvements come as simple ratios with no error bars, no statistical tests, and no ablation on the PMV weight itself. Five seeds is thin for claiming robustness, and 50k steps is a modest training horizon, so it is unclear how much the PMV term actually changes learning dynamics versus just rescaling an existing penalty. The paper does not overclaim on these points, but the evidence remains provisional.

This is useful for people already working on DRL controllers for buildings who need a defensible comfort term drawn from an external standard. It is not reshaping RL theory or building science more broadly. The work shows clear thinking and stays within its stated scope, so it deserves a serious referee even if the gains are incremental.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PIRS, which replaces ad-hoc comfort terms in a weighted multi-objective reward for SAC with the ISO 7730 PMV formulation. The central claim is that this substitution improves reward interpretability and supplies a standards-grounded comfort proxy without altering any other component of the SAC pipeline. Evaluation in CityLearn v2.1.2 (50k steps, five seeds) reports district-level KPI ratios versus a rule-based controller, showing PIRS on par with a manual baseline (E2) and better than energy-only (E3) and naive temperature-deviation (E4) rewards, particularly on load ramping and peak demand; all DRL agents remain above RBC at this budget.

Significance. If the central claim holds, the work supplies a reproducible, standards-aligned template for comfort terms in building DRL that can be adopted without pipeline changes. The explicit use of the published ISO 7730 PMV equation and the five-seed training protocol are strengths that support reproducibility.

major comments (2)

[Abstract] KPI ratios versus RBC are reported after 50k steps without error bars, standard deviations across the five seeds, or statistical tests; this is load-bearing for the claim of 'substantially outperforming' E4 on load ramping (1.78x vs. ~2.4x) and daily peak demand.
[Evaluation] No ablation is presented on the PMV weighting coefficient within the multi-objective reward, so it is unclear whether observed KPI differences arise from the physics grounding or from the particular scalar chosen for the comfort term.
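The variability reporting asked for in the first major comment could look like the sketch below: mean, sample standard deviation, and a t-based 95% interval half-width over the five seeds. The helper name `summarize_seeds` is hypothetical, the default critical value 2.776 is the two-sided 95% Student t quantile for 4 degrees of freedom (five seeds), and the inputs are placeholders rather than the paper's numbers.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_seeds(values: Sequence[float],
                    t_crit: float = 2.776) -> Tuple[float, float, float]:
    """Return (mean, sample std, 95% CI half-width) across seeds.

    t_crit defaults to the two-sided 95% t quantile for df = 4, which only
    applies when exactly five seeds are supplied; pass another value otherwise.
    """
    m = mean(values)
    s = stdev(values)                      # sample std (n - 1 denominator)
    half_width = t_crit * s / sqrt(len(values))
    return m, s, half_width


# Placeholder per-seed KPI ratios (NOT values from the paper).
m, s, hw = summarize_seeds([1.70, 1.75, 1.80, 1.78, 1.82])
report = f"{m:.2f} ± {s:.2f} (95% CI half-width {hw:.2f})"
```

Reporting each KPI ratio in this form, for E2 through E5, would let a reader judge whether a gap such as 1.78x versus roughly 2.4x exceeds seed-to-seed noise.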

minor comments (2)

[Abstract] The abstract states that PIRS 'improves reward interpretability' but does not quantify or illustrate how an operator would interpret the PMV term differently from a temperature-deviation proxy in practice.
[Abstract] The positioning that 'all DRL policies remain above RBC' is honest but would benefit from a brief discussion of whether the 50k-step budget is representative of typical training horizons in the CityLearn literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of results.

Referee: [Abstract] KPI ratios versus RBC are reported after 50k steps without error bars, standard deviations across the five seeds, or statistical tests; this is load-bearing for the claim of 'substantially outperforming' E4 on load ramping (1.78x vs. ~2.4x) and daily peak demand.

Authors: We agree that the abstract should convey variability to support the performance claims. Although the evaluation protocol uses five random seeds, the abstract reports only point estimates. In the revised manuscript we will update the abstract to include mean KPI ratios accompanied by standard deviations (and, space permitting, note the absence of statistical significance testing at this training budget). The full evaluation section will be expanded to present these statistics explicitly.

revision: yes

Referee: [Evaluation] No ablation is presented on the PMV weighting coefficient within the multi-objective reward, so it is unclear whether observed KPI differences arise from the physics grounding or from the particular scalar chosen for the comfort term.

Authors: The weighting coefficient is held constant across the compared reward formulations (E2–E4) to isolate the effect of replacing the comfort proxy. The observed gains relative to E4 therefore reflect the substitution of the ISO 7730 PMV model for a naïve temperature-deviation term rather than a change in scalar weight. Nevertheless, we acknowledge that an explicit sensitivity study on the comfort weight would further strengthen the claim. In the revised manuscript we will add a short paragraph in the evaluation section explaining the rationale for the chosen weight (balancing the competing objectives while remaining within the same multi-objective structure) and noting that a full ablation lies outside the scope of the current study.

revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

The paper's core claim is that inserting the external ISO 7730 PMV equation into an otherwise unchanged SAC reward function supplies a standards-grounded comfort term and improves interpretability. This substitution is performed by direct reference to a public standard and does not invoke any self-citation chain, fitted parameter renamed as prediction, or self-definitional loop. The evaluation section compares PIRS against an explicit naive temperature-deviation baseline (E4) and reports KPI ratios; those comparisons remain independent of the substitution itself. No equation or derivation step in the manuscript reduces the stated benefit to a tautology or to a quantity already fixed by the authors' prior choices.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the ISO 7730 PMV equation is a suitable and sufficient proxy for occupant comfort in the reward signal; no free parameters or invented entities are visible in the abstract.

axioms (1)

Domain assumption: The ISO 7730 Predicted Mean Vote formulation is an appropriate and physics-grounded model for occupant thermal comfort in building control rewards.
Invoked as the replacement for ad-hoc temperature-deviation proxies.

Reference graph

Works this paper leans on

16 extracted references · 11 canonical work pages

[1]

ASHRAE. 2017. ANSI/ASHRAE Standard 55-2017: Thermal Environmental Conditions for Human Occupancy. ASHRAE, Atlanta, GA, USA.

[2]

Povl Ole Fanger. 1970. Thermal Comfort: Analysis and Applications in Environmental Engineering. Danish Technical Press, Copenhagen, Denmark.

[3]

Judah A. Goldfeder and John A. Sipple. 2023. A Lightweight Calibrated Simulation Enabling Efficient Offline Learning for Optimal Control of Real Buildings. In Proceedings of the 10th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys ’23). Association for Computing Machinery, New York, NY, USA, 35... doi:10.1145/3600100.3625682

[4]

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80). 1861–1870.

[5]

ISO. 2005.ISO 7730:2005 — Ergonomics of the Thermal Environment — Analytical Determination and Interpretation of Thermal Comfort Using Calculation of the PMV and PPD Indices and Local Thermal Comfort Criteria. Technical Report. International Organization for Standardization

[6]

George Em Karniadakis, Ioannis G. Kevrekidis, Lu Lu, Paris Perdikaris, Sifan Wang, and Liu Yang. 2021. Physics-Informed Machine Learning. Nature Reviews Physics 3 (2021), 422–440. doi:10.1038/s42254-021-00314-5

[7]

Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. 2021. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research 22, 268 (2021), 1–8.

[8]

Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. 2019. Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations. J. Comput. Phys. 378 (2019), 686–707. doi:10.1016/j.jcp.2018.10.045

[9]

Zekun Shi, Ruifan Zheng, Jun Zhao, Rendong Shen, Lei Gu, Yuanchao Liu, Jiahui Wu, and Guangliang Wang. 2024. Towards Various Occupants with Different Thermal Comfort Requirements: A Deep Reinforcement Learning Approach Combined with a Dynamic PMV Model for HVAC Control in Buildings. Energy Conversion and Management 320 (2024), 118995. doi:10.1016/j.enconman...

[10]

Federico Tartarini and Stefano Schiavon. 2020. pythermalcomfort: A Python Package for Thermal Comfort Research. SoftwareX 12 (2020), 100578. doi:10.1016/j.softx.2020.100578

[11]

Farzad Tashtarian, Mohammad A. Salahuddin, Hossein Hassanpour, and Ana Tizpaz-Niari. 2023. A Comprehensive Survey of Deep Reinforcement Learning in Smart Buildings. Comput. Surveys 56, 3 (2023), 1–38. doi:10.1145/3624016

[12]

Eisuke Togashi. 2025. Reward Function Design in Reinforcement Learning for HVAC Control: A Review of Thermal Comfort and Energy Efficiency Trade-offs. Energy and Buildings 348 (2025), 116439. doi:10.1016/j.enbuild.2025.116439

[13]

José R. Vázquez-Canteli, Sourav Dey, Gregor Henze, and Zoltan Nagy. 2020. CityLearn: Standardizing Research in Multi-Agent Reinforcement Learning for Demand Response and Urban Energy Management. arXiv:2012.10504 [cs.LG] doi:10.48550/arXiv.2012.10504

[14]

José R. Vázquez-Canteli and Zoltan Nagy. 2019. Reinforcement Learning for Demand Response: A Review of Algorithms and Modeling Techniques. Applied Energy 235 (2019), 1072–1089. doi:10.1016/j.apenergy.2018.11.002

[15]

Tianzhen Wei, Yanzhi Wang, and Qi Zhu. 2017. Deep Reinforcement Learning for Building HVAC Control. In Proceedings of the 54th Annual Design Automation Conference. 1–6. doi:10.1145/3061639.3062224

[16]

Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, Brian Ichter, Ted Xiao, Peng Xu, Andy Zeng, Tingnan Zhang, Nicolas Heess, Dorsa Sadigh, Jie Tan, Yuval Tassa, and Fei Xia. 2023. Language to Rewards for Robotic Skill Synthesis. arXiv:2306.08647 ... doi:10.48550/arxiv.2306.08647
