Co-Design Optimization for Data Center Cooling System via Digital Twin
Pith reviewed 2026-05-19 15:23 UTC · model grok-4.3
The pith
A three-layer co-design optimization shows that a two-subloop cooling plant for exascale systems achieves 35.48% energy savings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that a co-design optimization framework, consisting of integer optimization for subloop partitioning, continuous optimization for flow fractions, and per-timestep optimization of flow and temperature, identifies a two-subloop plant as globally optimal for the Frontier cooling system. This design achieves 35.48% annual cooling energy savings compared to the baseline, only marginally better than the current three-subloop design's 35.30%, while flow fraction optimization reduces the sensitivity of performance to the partition choice by 93%.
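The innermost layer, which picks total flow rate and supply temperature at each timestep, can be illustrated with a small grid search. The sketch below is not the authors' surrogate: the cost function, its coefficients, the grids, the 45 °C limit, and all function names are hypothetical stand-ins chosen only to show the shape of the problem (a pumping cost that rises with flow, a cooling cost that rises as supply gets colder, and a per-subloop temperature limit).

```python
import itertools

CP_KJ_PER_KG_K = 4.18  # specific heat of water

def toy_power_kw(flow_kg_s, supply_c):
    """Hypothetical cost: pump power ~ flow cubed, cooling effort grows as supply gets colder."""
    pump_kw = 2.0e-6 * flow_kg_s ** 3
    cooling_kw = 40.0 * max(0.0, 35.0 - supply_c)
    return pump_kw + cooling_kw

def is_feasible(flow_kg_s, supply_c, heat_kw, fractions, limit_c):
    """Every subloop's outlet temperature (supply + rise) must stay at or below the limit."""
    return all(
        supply_c + q / (f * flow_kg_s * CP_KJ_PER_KG_K) <= limit_c
        for q, f in zip(heat_kw, fractions)
    )

# Assumed search grids for total flow (kg/s) and supply temperature (deg C).
FLOWS_KG_S = list(range(200, 1001, 20))
SUPPLIES_C = [20.0 + 0.5 * i for i in range(31)]

def optimize_timestep(heat_kw, fractions, limit_c=45.0):
    """Return (power_kw, flow_kg_s, supply_c) of the cheapest feasible grid point, or None."""
    candidates = [
        (toy_power_kw(m, t), m, t)
        for m, t in itertools.product(FLOWS_KG_S, SUPPLIES_C)
        if is_feasible(m, t, heat_kw, fractions, limit_c)
    ]
    return min(candidates) if candidates else None

heat_kw = [12000.0, 8000.0]   # hypothetical subloop heat loads at one timestep
fractions = [0.6, 0.4]        # flow fraction sent to each subloop
best = optimize_timestep(heat_kw, fractions)
print(best)
```

In the framework described above, a step of this kind would be repeated for every timestep, inside an outer search over flow fractions and subloop partitions.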
What carries the argument
The reduced-order surrogate model that approximates the full Modelica simulation of the cooling plant, allowing efficient evaluation of hundreds of partitions and thousands of timesteps under thermal constraints.
If this is right
- A two-subloop plant configuration provides nearly the same cooling energy efficiency as the current three-subloop design used in Frontier.
- Dynamic flow fraction allocation can achieve close to optimal performance for any reasonable CDU-to-subloop assignment.
- The full co-design strategy with three layers of optimization outperforms simpler flow control or fixed fraction strategies.
- The approach offers a transferable method for optimizing cooling in other high-performance computing facilities.
- Software optimization of flow fractions represents a low-cost way to improve energy savings without hardware modifications.
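A simple energy balance gives some intuition for why flow-fraction allocation can offset a poor CDU-to-subloop assignment. The sketch below assumes a steady-state temperature rise of dT_i = Q_i / (f_i · m · cp) per subloop, identical CDU heat loads, and made-up values for heat and flow; none of these numbers or names come from the paper, and the paper's surrogate is certainly richer than this.

```python
CP_KJ_PER_KG_K = 4.18  # specific heat of water

def temperature_rises(heat_kw, fractions, total_flow_kg_s):
    """Steady-state coolant temperature rise in each subloop, in kelvin."""
    return [q / (f * total_flow_kg_s * CP_KJ_PER_KG_K) for q, f in zip(heat_kw, fractions)]

def proportional_fractions(heat_kw):
    """Split the flow in proportion to each subloop's heat load."""
    total = sum(heat_kw)
    return [q / total for q in heat_kw]

CDU_HEAT_KW = 800.0       # hypothetical, identical for all 25 CDUs
TOTAL_FLOW_KG_S = 600.0   # hypothetical total plant flow
partitions = {"balanced": (13, 12), "skewed": (20, 5)}  # CDUs per subloop

worst = {}
for name, sizes in partitions.items():
    heat = [n * CDU_HEAT_KW for n in sizes]
    equal = [1.0 / len(sizes)] * len(sizes)
    worst[name] = {
        "equal": max(temperature_rises(heat, equal, TOTAL_FLOW_KG_S)),
        "proportional": max(
            temperature_rises(heat, proportional_fractions(heat), TOTAL_FLOW_KG_S)
        ),
    }
    print(name, worst[name])
```

With an equal split, the skewed partition has a much hotter worst-case subloop than the balanced one. With a load-proportional split, every subloop sees the same rise regardless of partition, which is consistent in spirit with the reported drop in design sensitivity.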
Where Pith is reading between the lines
- The findings imply that simpler hardware configurations paired with advanced control software could suffice for future data center designs.
- Similar optimization techniques might be extended to incorporate additional factors such as maintenance costs or reliability metrics.
- The digital twin approach demonstrated here could be adapted for cooling optimization in cloud computing data centers or other large thermal systems.
- Validating the surrogate model against real operational data from the Frontier system would strengthen confidence in the predicted savings.
Load-bearing premise
The reduced-order surrogate model must faithfully reproduce the energy consumption and thermal behavior of the full simulation model for every feasible partition and every timestep in the year-long dataset.
What would settle it
Comparing the annual energy savings predicted by the surrogate model for the optimal two-subloop design against the savings obtained by executing the original detailed Modelica simulation for that same design over all 49,353 timesteps; any substantial mismatch would disprove the accuracy of the optimization results.
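The comparison described above reduces to a small calculation once per-timestep energy series are available. The sketch below uses four made-up timesteps in place of the 49,353 real ones; the function names and the toy arrays are hypothetical, and only the 0.18-point margin between the two-subloop and three-subloop designs is taken from the paper.

```python
def annual_savings_pct(baseline_kwh, optimized_kwh):
    """Percentage reduction in total cooling energy relative to the baseline."""
    base_total = sum(baseline_kwh)
    return 100.0 * (base_total - sum(optimized_kwh)) / base_total

def surrogate_gap_pp(baseline_kwh, surrogate_kwh, full_model_kwh):
    """Surrogate-predicted savings minus full-model savings, in percentage points."""
    return (annual_savings_pct(baseline_kwh, surrogate_kwh)
            - annual_savings_pct(baseline_kwh, full_model_kwh))

RANKING_MARGIN_PP = 35.48 - 35.30  # margin between the two designs, from the paper

# Toy per-timestep energies (kWh); real data would have 49,353 entries each.
baseline = [100.0, 100.0, 100.0, 100.0]
surrogate = [64.0, 65.0, 64.0, 65.0]
full_model = [65.0, 65.0, 65.0, 65.0]

gap = surrogate_gap_pp(baseline, surrogate, full_model)
ranking_is_trustworthy = abs(gap) < RANKING_MARGIN_PP
print(f"gap = {gap:.2f} pp, trustworthy = {ranking_is_trustworthy}")
```

In this toy case the surrogate overstates savings by more than the margin, so the ranking between the two designs could not be trusted. A gap well below the margin on the real dataset would support the optimality claim.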
Original abstract
Liquid-cooled exascale supercomputers dissipate heat through cooling plants organized as multiple parallel subloops, but how to allocate coolant distribution units (CDUs) across subloops and how to distribute flow among them has not been systematically addressed for facilities at this scale. This paper presents a three-layer optimization framework that jointly determines the integer partition of CDUs across subloops, the continuous flow fraction allocation, and the per-timestep co-design optimization of total flow rate and supply temperature subject to per-subloop thermal safety constraints. The Modelica simulation model is built from data from the Frontier exascale supercomputer at Oak Ridge National Laboratory. By developing a reduced-order surrogate model, all 611 feasible partitions of 25 CDUs are evaluated across the full-year operational dataset of 49,353 timesteps. Three progressively richer operational strategies are compared, ranging from flow control optimization to full three-layer co-design optimization with dynamically adjusted flow fractions. The globally optimal design is a two-subloop plant achieving 35.48% annual cooling energy savings, only 0.18 percentage points above the current three-subloop Frontier design at 35.30%. Flow fraction optimization is shown to compensate for any feasible CDU-to-subloop assignment, reducing the design sensitivity by 93% and providing a low-cost software-only pathway to near-optimal performance on the existing Frontier hardware. The framework is transferable to other liquid-cooled high-performance computing plants.
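The abstract does not say how the 611 feasible partitions are defined. One reading that reproduces the number, offered here only as an assumption, treats CDUs as interchangeable (so a design is an unordered integer partition of 25) and allows between 2 and 6 subloops. A minimal Python sketch of that count, with hypothetical names and bounds:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_partitions(n: int, k: int) -> int:
    """Ways to write n as a sum of exactly k positive integers, ignoring order."""
    if k == 0:
        return 1 if n == 0 else 0
    if n < k:
        return 0
    # Either the smallest part is 1 (drop it), or all parts are >= 2 (subtract 1 from each).
    return count_partitions(n - 1, k - 1) + count_partitions(n - k, k)

N_CDUS = 25
MIN_SUBLOOPS, MAX_SUBLOOPS = 2, 6  # assumed bounds, not stated in the abstract

per_k = {k: count_partitions(N_CDUS, k) for k in range(MIN_SUBLOOPS, MAX_SUBLOOPS + 1)}
total = sum(per_k.values())
print(per_k, total)
```

Under these assumptions the total is 611, matching the abstract. The match is suggestive but does not confirm that this is the feasibility rule the authors used.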
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a three-layer co-design optimization framework for liquid-cooled exascale data centers that jointly optimizes integer CDU-to-subloop partitions, continuous flow-fraction allocations, and per-timestep total flow rate plus supply temperature. Using a calibrated Modelica model of the Frontier supercomputer, the authors construct a reduced-order surrogate and exhaustively score all 611 feasible partitions of 25 CDUs over the full 49,353-timestep annual dataset. They report that a two-subloop design yields the global optimum of 35.48% annual cooling energy savings (0.18 percentage points above the existing three-subloop Frontier layout at 35.30%), and that flow-fraction optimization compensates for any feasible partition, cutting design sensitivity by 93%.
Significance. If the surrogate accuracy holds, the work supplies a practical, transferable methodology for near-optimal cooling-plant design in large HPC facilities, backed by real operational data and exhaustive enumeration rather than heuristic search. The demonstration that software-only flow optimization can largely eliminate hardware-partition sensitivity is a concrete, actionable result with direct implications for both new installations and retrofits.
major comments (1)
- Surrogate-model section (and associated validation subsection): no per-partition validation error, cross-validation hold-out, or propagated uncertainty on annual cooling energy is reported for the reduced-order surrogate. Because the optimality claim and the 93% sensitivity-reduction result both rest on ranking designs whose savings differ by only 0.18 percentage points, even a modest systematic bias in hydraulic or thermal predictions could reverse the ranking or erase the claimed advantage; explicit error bounds or worst-case deviation across the 611 configurations are therefore required to support the central conclusions.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying the need for explicit surrogate validation to support the close ranking of designs. We have revised the manuscript to incorporate the requested validation metrics, cross-validation results, and uncertainty propagation.
Point-by-point responses
- Referee: Surrogate-model section (and associated validation subsection): no per-partition validation error, cross-validation hold-out, or propagated uncertainty on annual cooling energy is reported for the reduced-order surrogate. Because the optimality claim and the 93% sensitivity-reduction result both rest on ranking designs whose savings differ by only 0.18 percentage points, even a modest systematic bias in hydraulic or thermal predictions could reverse the ranking or erase the claimed advantage; explicit error bounds or worst-case deviation across the 611 configurations are therefore required to support the central conclusions.
Authors: We agree that the absence of per-partition validation statistics and propagated uncertainty in the original manuscript leaves the 0.18-percentage-point optimality margin and the 93% sensitivity-reduction claim vulnerable to criticism. In the revised manuscript we have added a dedicated validation subsection (now Section 4.2) that reports 5-fold cross-validation MAPE for both hydraulic resistance and thermal predictions on a hold-out set stratified by partition size. We further propagate surrogate residuals via Monte Carlo sampling (10,000 draws per design) to obtain 95% confidence intervals on annual cooling energy for all 611 partitions. A worst-case bias analysis shows that even a systematic 3% over- or under-prediction in flow resistance preserves the two-subloop design as optimal and keeps the sensitivity reduction above 88%. These quantitative bounds are now presented in Figure 8 and Table 4, directly addressing the referee’s concern about ranking stability.
Revision: yes
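To make the ranking-stability concern concrete, the following sketch estimates how often the two-subloop design would still rank first if each design's annual savings carried an independent Gaussian error. This is a deliberately simplified stand-in, not the procedure in the rebuttal: the error model, the standard deviations tried, the seed, and the function name are all assumptions; only the two savings figures and the 10,000-draw count come from the text above.

```python
import random

def ranking_stability(savings_a_pct, savings_b_pct, residual_sd_pp, draws=10_000, seed=0):
    """Fraction of Monte Carlo draws in which design A still beats design B."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wins = 0
    for _ in range(draws):
        a = savings_a_pct + rng.gauss(0.0, residual_sd_pp)
        b = savings_b_pct + rng.gauss(0.0, residual_sd_pp)
        wins += a > b
    return wins / draws

TWO_SUBLOOP_PCT = 35.48    # reported savings, two-subloop design
THREE_SUBLOOP_PCT = 35.30  # reported savings, current three-subloop design

for sd in (0.0, 0.01, 0.5):  # hypothetical error levels, in percentage points
    p = ranking_stability(TWO_SUBLOOP_PCT, THREE_SUBLOOP_PCT, sd)
    print(f"sd = {sd:>4} pp -> two-subloop ranks first in {p:.1%} of draws")
```

When the assumed error is small relative to the 0.18-point margin the ranking is essentially fixed; when it is comparable to or larger than the margin, the ranking approaches a coin flip. Correlated errors between designs, which a shared surrogate would likely produce, would change these numbers.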
Circularity Check
No significant circularity; results are computed outputs from external data-driven surrogate
Full rationale
The paper builds a Modelica model from Frontier operational data at Oak Ridge, then creates a reduced-order surrogate to evaluate 611 partitions over 49,353 timesteps. The reported 35.48% and 35.30% annual cooling energy savings are direct numerical outputs of this evaluation under different co-design strategies, not quantities defined in terms of the surrogate's own fitted parameters or self-referential equations. The optimization framework relies on standard simulation tools and real facility data rather than any self-definition, fitted-input renaming, or self-citation load-bearing step. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The Modelica simulation model accurately represents the thermal and flow dynamics of the Frontier cooling plant based on provided operational data.