Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management

Khashayar Yavari; Shadmehr Zaregarizi

arxiv: 2605.29733 · v1 · pith:YN5M5XGGnew · submitted 2026-05-28 · 💻 cs.AI

Pith reviewed 2026-06-29 07:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords transfer learning · energy forecasting · temporal fusion transformer · uncertainty quantification · cross-building · fine-tuning · district energy

The pith

Probe-only fine-tuning of a Temporal Fusion Transformer updates just 455 parameters yet yields the highest transfer robustness across buildings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to enable reuse of energy forecasting models across different buildings by combining transfer learning with uncertainty quantification, so that district-scale deployment does not require collecting large new datasets for every structure. It tests this on real sub-meter data from an educational building in Denmark as the source domain and a multi-typology building in Switzerland as the target. Four fine-tuning strategies are compared using a new Transfer Robustness Index that measures how well performance holds across the domain gap. The strategy that freezes the entire encoder and updates only the output layer produces the best TRI score while also allowing Monte Carlo dropout to deliver prediction intervals with 93.2 percent coverage. Additional experiments show that forecast quality rises steadily as more target-domain samples become available.

Core claim

When a Temporal Fusion Transformer pre-trained on one building's energy data is transferred to a second building, updating only the 455 parameters in the final output layer produces a Transfer Robustness Index of 3,097, higher than full fine-tuning or other partial updates, while Monte Carlo dropout simultaneously achieves 93.2 percent prediction-interval coverage probability against a nominal 95 percent target.
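The coverage figure in this claim is a prediction interval coverage probability (PICP): the fraction of held-out targets that fall inside their predicted interval. The sketch below shows one way such a number can be produced with Monte Carlo dropout. It is illustrative only: the toy linear model, the dropout rate, and the Gaussian interval (mean ± 1.96 standard deviations) are assumptions made here, since the page does not say how the paper builds its intervals. Only the 50 stochastic passes come from the Figure 2 caption.

```python
import random
import statistics


def mc_dropout_forward(x, weights, drop_p, rng):
    """One stochastic pass of a toy linear model with dropout left active."""
    total = 0.0
    for xi, wi in zip(x, weights):
        keep = rng.random() >= drop_p
        # Inverted dropout: rescale kept units so the expected output is unchanged.
        total += (xi * wi / (1.0 - drop_p)) if keep else 0.0
    return total


def mc_dropout_interval(x, weights, n_passes=50, drop_p=0.1, z=1.96, seed=0):
    """Mean, standard deviation, and an approximate 95% interval from repeated passes.

    Assumption: the interval is Gaussian (mean +/- z * std); empirical
    percentiles of the samples would be an equally valid choice.
    """
    rng = random.Random(seed)
    samples = [mc_dropout_forward(x, weights, drop_p, rng) for _ in range(n_passes)]
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)
    return mean, std, (mean - z * std, mean + z * std)


def picp(y_true, lower, upper):
    """Fraction of targets that fall inside their prediction interval."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)
```

A PICP below the nominal level, such as 93.2% against 95%, means the intervals are slightly too narrow on the test set.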

What carries the argument

The Transfer Robustness Index (TRI), an architecture-agnostic score that ranks transfer quality by how well a model maintains accuracy after crossing a building domain gap.

If this is right

TFT encoders learn temporal representations that remain useful when the model is moved to a new building.
Only the output layer needs to be adapted, so each new building requires far less computation and data than training from scratch.
Monte Carlo dropout supplies usable uncertainty bands without extra architectural changes.
Forecast accuracy improves in a predictable way as more target-building measurements are added.
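The probe-only idea behind these points can be sketched in a few lines: run the target building's data through a frozen encoder and fit only a small output head on top. At the reported sizes, 455 of roughly 806K parameters is about 0.06% of the model. The code below is a minimal pure-Python stand-in, not the paper's TFT pipeline: `FrozenEncoder`, `LinearHead`, and `fit_probe_only` are hypothetical names, the tanh encoder is a placeholder for a pre-trained network, and plain gradient descent on squared error is an assumed training rule.

```python
import math


class FrozenEncoder:
    """Stand-in for a pre-trained encoder; its weights are never updated."""

    def __init__(self, weights):
        self.weights = [row[:] for row in weights]  # hypothetical pre-trained values

    def __call__(self, window):
        # One tanh unit per weight row.
        return [math.tanh(sum(w * x for w, x in zip(row, window))) for row in self.weights]


class LinearHead:
    """The only trainable part: one weight per feature plus a bias."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def __call__(self, features):
        return sum(w * f for w, f in zip(self.w, features)) + self.b

    def n_params(self):
        return len(self.w) + 1


def fit_probe_only(encoder, head, windows, targets, lr=0.05, epochs=200):
    """Gradient descent on squared error; only the head's parameters move."""
    feats = [encoder(w) for w in windows]  # encoder output is fixed, so cache it
    for _ in range(epochs):
        for f, y in zip(feats, targets):
            err = head(f) - y
            head.w = [w - lr * err * fi for w, fi in zip(head.w, f)]
            head.b -= lr * err
    return head


def mae(encoder, head, windows, targets):
    """Mean absolute error of the encoder-plus-head forecast."""
    return sum(abs(head(encoder(w)) - y) for w, y in zip(windows, targets)) / len(targets)
```

Because the encoder output never changes, its features can be computed once per target building and cached, which is where the compute saving over full fine-tuning comes from.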

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

District operators could maintain a single pre-trained encoder and attach cheap per-building output heads, lowering the cost of scaling forecasts.
The same layer-freezing pattern may apply to other transformer-based time-series models outside energy forecasting.
If the TRI advantage holds, data-scarce buildings could still receive reliable forecasts after collecting only a few weeks of measurements.

Load-bearing premise

The domain shift between the two specific buildings studied is typical of the shifts that will appear when the same model is applied to any other buildings in a district.

What would settle it

A direct test in which the same probe-only procedure is applied to a third building whose energy-use patterns differ more sharply from the source than the current target does, checking whether the TRI advantage disappears.

Figures

Figures reproduced from arXiv: 2605.29733 by Khashayar Yavari, Shadmehr Zaregarizi.

**Figure 2.** MC Dropout (𝑁 = 50) prediction intervals on the NEST test set using the Probe-Only model. Top: ground truth, mean prediction, and 95% interval in normalized units. Bottom: per-step prediction standard deviation. The interval widens near the hour-50 surge, indicating increased uncertainty in unfamiliar high-variability regimes. PICP = 93.2%.

**Figure 3.** Data-scarcity analysis. Left: test-set MAE versus the …

Original abstract

Scaling data-driven energy forecasting to district level requires models that can be re-used across buildings with minimal target-domain data and honest uncertainty estimates. We present an uncertainty-aware transfer learning (TL) framework for cross-building energy forecasting based on the Temporal Fusion Transformer (TFT), evaluated on a newly released high-resolution real sub-meter dataset: an educational building at Aalborg University, Denmark (source) and the multi-typology NEST building at EMPA, Switzerland (target). We introduce the Transfer Robustness Index (TRI), an architecture-agnostic metric for quantifying generalization quality across domain gaps. A four-strategy layer-freezing ablation shows that Probe-Only fine-tuning, updating only 455 output-layer parameters out of 806K, achieves the best transfer quality (TRI = 3,097), outperforming full fine-tuning and suggesting that TFT encoders learn transferable temporal representations. Monte Carlo Dropout yields a prediction interval coverage probability of 93.2%, close to the nominal 95% target. A data-scarcity analysis further shows monotonic improvement with increasing target-domain data, providing practical guidance for district energy deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Probe-only TFT fine-tuning beats full fine-tuning on these two buildings with a new TRI metric, but the single-pair evaluation leaves the district-scale robustness claim thin.

The main thing to know is that probe-only fine-tuning, touching just the output layer, comes out on top in their ablation and they back it with numbers from two real high-resolution datasets. They also introduce TRI as a way to score transfer quality and show Monte Carlo dropout getting close to the nominal coverage.

What stands out is the four-strategy comparison on the Temporal Fusion Transformer. Updating only 455 parameters out of 806K gives the highest TRI, 3,097, and beats full fine-tuning, which they read as evidence that the encoder picks up transferable temporal structure. The data-scarcity curve is monotonic and useful, the uncertainty interval reaches 93.2% coverage, and the datasets from Aalborg and EMPA are concrete and newly released. TRI itself is a straightforward empirical metric that does not collapse into fitted parameters.

The limitation is the narrow test bed. Everything is measured on one source-target pair, so the claim that this supports scalable district-level forecasting rests on whether the gap between an educational building and a multi-typology one is representative of real stock variation. The stress-test point holds: without additional typologies, climates, or sensor setups, it is hard to know if the probe-only ranking would survive elsewhere. The abstract also gives point estimates without error bars or statistical tests, which keeps the comparisons moderate in strength.

This is for people doing applied energy forecasting who need to move models with little target data and want uncertainty estimates. A reader already working on TFT transfer or district energy systems would pick up practical pointers from the ablation and the scarcity analysis.

It deserves peer review because the experiments use real data, the question is relevant, and the ablation is cleanly reported, even if the generalization argument would need more buildings to land solidly.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an uncertainty-aware transfer learning framework using the Temporal Fusion Transformer (TFT) for cross-building energy load forecasting. It introduces the Transfer Robustness Index (TRI) and evaluates four fine-tuning strategies on a source educational building (Aalborg University) and target multi-typology building (EMPA NEST), finding that probe-only fine-tuning (updating 455 parameters) yields the highest TRI of 3,097. Monte Carlo Dropout achieves 93.2% coverage probability, and performance improves monotonically with more target data.

Significance. If validated more broadly, the work offers practical value for district-scale energy management by demonstrating effective transfer with minimal parameter updates and reliable uncertainty quantification. The probe-only result suggests TFT's encoder learns general temporal features, and TRI provides an architecture-agnostic metric. The data-scarcity analysis gives actionable guidance. However, the single-pair evaluation limits the strength of the scalability claims.

major comments (3)

[Abstract and Experimental Setup] The central claim of robustness and scalability to district-level building stocks rests on the Aalborg-to-NEST domain gap being representative of variations across arbitrary building stocks (typology, usage, climate). Only this single pair is evaluated, with no additional buildings or cross-typology validation reported to support the TRI ranking or 93.2% coverage. This is load-bearing for the headline conclusion in the abstract.
[Results (TRI and coverage reporting)] Concrete outcomes such as TRI = 3,097 for probe-only and 93.2% coverage are reported without error bars, standard deviations across multiple runs, or statistical significance tests comparing the four fine-tuning strategies. This weakens the claim that probe-only outperforms full fine-tuning.
[§3 (TRI definition)] The Transfer Robustness Index is introduced without comparison to existing transfer-learning metrics or analysis of its sensitivity to the specific domain gap; its definition and aggregation (e.g., across forecast horizons) are not shown to be robust outside the tested pair.

minor comments (2)

[Notation and §3] Clarify the exact mathematical definition of TRI at first use, including any normalization or weighting terms.
[Experimental details] Add a table or explicit list of all baselines and hyper-parameters used in the ablation to enable direct reproduction of the reported TRI ordering.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are warranted.

Referee: [Abstract and Experimental Setup] The central claim of robustness and scalability to district-level building stocks rests on the Aalborg-to-NEST domain gap being representative of variations across arbitrary building stocks (typology, usage, climate). Only this single pair is evaluated, with no additional buildings or cross-typology validation reported to support the TRI ranking or 93.2% coverage. This is load-bearing for the headline conclusion in the abstract.

Authors: We agree that the single source-target pair limits the strength of broad scalability claims to arbitrary district-level stocks. The chosen pair does feature meaningful differences in building typology, climate zone, and usage patterns, which we selected to test transfer under realistic conditions. TRI is defined to be architecture-agnostic. In revision we will moderate the abstract language to emphasize the evaluated setting, add explicit discussion of the pair's representativeness, and outline future multi-building validation needs. No new data collection is required for these textual changes. revision: partial

Referee: [Results (TRI and coverage reporting)] Concrete outcomes such as TRI = 3,097 for probe-only and 93.2% coverage are reported without error bars, standard deviations across multiple runs, or statistical significance tests comparing the four fine-tuning strategies. This weakens the claim that probe-only outperforms full fine-tuning.

Authors: The reported point estimates derive from the primary experimental runs. To strengthen statistical support we will re-execute the four fine-tuning strategies across multiple random seeds, report means with standard deviations and error bars, and include pairwise significance tests (e.g., paired t-tests) between strategies in the revised results section. revision: yes

Referee: [§3 (TRI definition)] The Transfer Robustness Index is introduced without comparison to existing transfer-learning metrics or analysis of its sensitivity to the specific domain gap; its definition and aggregation (e.g., across forecast horizons) are not shown to be robust outside the tested pair.

Authors: TRI is presented as a new, architecture-agnostic index tailored to forecasting transfer robustness. In the revision we will expand §3 with explicit comparisons to established metrics (e.g., relative transfer gain, domain discrepancy scores) and add a sensitivity study of TRI values under controlled perturbations of the domain gap and alternative horizon-aggregation schemes. revision: yes
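The multi-seed comparison promised in the second response could look roughly like the sketch below: run two strategies on the same seeds, take the per-seed score differences, and compute a paired t statistic. The function name is hypothetical and the page reports no per-seed scores, so any inputs are placeholders supplied by the reader. The p-value is left out because the Python standard library has no t-distribution; the statistic and its degrees of freedom can be looked up in a table or passed to a statistics package.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for two matched score lists.

    scores_a[i] and scores_b[i] must come from the same random seed.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd_diff == 0.0:
        raise ValueError("all differences are identical; t statistic is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1
```

With only a handful of seeds the test has little power, so reporting the per-seed differences alongside the statistic would be more informative than the statistic alone.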

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements

full rationale

The paper reports empirical ablation results (Probe-Only vs full fine-tuning) and uncertainty metrics (93.2% coverage) computed on held-out target-domain data from the NEST building. TRI is introduced as a new metric and applied to quantify observed transfer performance; no equations or definitions make TRI or the rankings reduce to fitted parameters by construction. No self-citations are load-bearing for the central claims, no uniqueness theorems are invoked, and no ansatzes are smuggled via prior author work. The derivation chain consists of standard transfer-learning experiments and Monte Carlo Dropout evaluation, all externally falsifiable on the reported datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central empirical claims rest on the validity of the newly introduced TRI metric and the representativeness of the two-building domain shift; no free parameters or additional axioms are specified in the abstract.

invented entities (1)

Transfer Robustness Index (TRI) · no independent evidence
purpose: Architecture-agnostic quantification of generalization quality across building domain gaps
New metric introduced by the authors; no independent external validation or derivation from prior literature is mentioned.

Reference graph

Works this paper leans on

6 extracted references · 3 canonical work pages

[1] Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. 2024. A Decoder-Only Foundation Model for Time-Series Forecasting. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research). PMLR.

[2] Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 48). PMLR, 1050–1059.

[3] Bryan Lim, Sercan Ö. Arik, Nicolas Loeff, and Tomas Pfister. 2021. Temporal Fusion Transformers for Interpretable Multi-Horizon Time Series Forecasting. International Journal of Forecasting 37, 4 (2021), 1748–1764. doi:10.1016/j.ijforecast.2021.03.012

[4] Simon Pommerencke Melgaard et al. 2026. High-Resolution Sub-Meter Building Energy Dataset (AAU and NEST Pilots). doi:10.5281/zenodo.19019863

[5] Robert Spencer, Surangika Ranathunga, Mikael Boulic, Andries Hennie van Heerden, and Teo Susnjak. 2025. Transfer Learning on Transformers for Building Energy Consumption Forecasting—A Comparative Study. Energy and Buildings 336 (2025), 115632. doi:10.1016/j.enbuild.2025.115632

[6] Gerald Woo, Chenghao Liu, Akshat Kumar, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. 2024. Unified Training of Universal Time Series Forecasting Transformers. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235). PMLR.
