arxiv: 2605.15516 · v1 · pith:2TZWUUP4new · submitted 2026-05-15 · 📡 eess.SY · cs.SY · stat.AP

Co-Design Optimization for Data Center Cooling System via Digital Twin

Pith reviewed 2026-05-19 15:23 UTC · model grok-4.3

classification 📡 eess.SY · cs.SY · stat.AP
keywords data center cooling · liquid cooling · co-design optimization · surrogate model · exascale supercomputer · digital twin · energy efficiency · flow optimization

The pith

A three-layer co-design optimization shows that a two-subloop cooling plant for exascale systems achieves 35.48% energy savings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a three-layer optimization framework for liquid-cooled data centers that jointly optimizes the partition of coolant distribution units (CDUs) into subloops, the flow fractions among those subloops, and the dynamic control of total flow rate and supply temperature. The framework uses a reduced-order surrogate model, built from a detailed simulation of the Frontier exascale supercomputer, to evaluate all 611 feasible partitions of 25 CDUs over a full year of operation. The best design is a two-subloop configuration delivering 35.48% annual cooling energy savings, only 0.18 percentage points better than the existing three-subloop setup at 35.30%. Optimizing the flow fractions also compensates for differences between CDU assignments, reducing design sensitivity by 93% and offering a software-only path to near-optimal performance on current hardware.

Core claim

The authors establish that a co-design optimization framework, consisting of integer optimization for subloop partitioning, continuous optimization for flow fractions, and per-timestep optimization of total flow rate and supply temperature, identifies a two-subloop plant as globally optimal for the Frontier cooling system. This design achieves 35.48% annual cooling energy savings relative to the baseline, only 0.18 percentage points above the current three-subloop design's 35.30%, while flow fraction optimization reduces the sensitivity of performance to the partition choice by 93%.
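The outermost (integer) layer can be pictured as an enumeration of ways to split a fixed number of CDUs across subloops. The sketch below is a minimal illustration, assuming subloops are interchangeable and that the only limits are a maximum subloop count and a minimum subloop size. The feasibility rules that produce the paper's 611 partitions are not given in this summary, so `max_loops` and `min_size` are hypothetical parameters and the sketch makes no attempt to reproduce that count.

```python
from typing import Iterator, Optional, Tuple


def partitions(
    n_units: int,
    max_loops: int,
    min_size: int = 1,
    largest: Optional[int] = None,
) -> Iterator[Tuple[int, ...]]:
    """Yield unordered splits of n_units into at most max_loops groups.

    Each group holds at least min_size units, and every split is listed
    in non-increasing order so that no split is produced twice.
    All parameter names are illustrative, not taken from the paper.
    """
    if largest is None:
        largest = n_units
    if n_units == 0:
        yield ()          # nothing left to place: one valid (empty) tail
        return
    if max_loops == 0:
        return            # units remain but no subloops are left
    for first in range(min(n_units, largest), min_size - 1, -1):
        for rest in partitions(n_units - first, max_loops - 1, min_size, first):
            yield (first,) + rest


# Small usage example: 5 units across at most 3 subloops.
small = list(partitions(5, 3))
```

In a full pipeline, each tuple produced here would be handed to the inner layers (flow fractions, then per-timestep control) for scoring; exhaustive scoring is only practical because the candidate set stays small.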

What carries the argument

The argument rests on the reduced-order surrogate model, which approximates the full Modelica simulation of the cooling plant and so allows efficient evaluation of hundreds of partitions and tens of thousands of timesteps under thermal constraints.

If this is right

  • A two-subloop plant configuration provides nearly the same cooling energy efficiency as the current three-subloop design used in Frontier.
  • Dynamic flow fraction allocation can achieve close to optimal performance for any reasonable CDU-to-subloop assignment.
  • The full co-design strategy with three layers of optimization outperforms simpler flow control or fixed fraction strategies.
  • The approach offers a transferable method for optimizing cooling in other high-performance computing facilities.
  • Software optimization of flow fractions represents a low-cost way to improve energy savings without hardware modifications.
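One way to see why flow fractions can compensate for an uneven CDU assignment is a steady-state energy balance per subloop. The sketch below is a toy illustration, not the paper's surrogate: it assumes each subloop only has to keep its coolant temperature rise below a limit, and the heat loads, the 10 K limit, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

CP_WATER = 4186.0  # J/(kg K), approximate specific heat of water


def min_total_flow(
    heat_loads_w: Sequence[float],
    fractions: Sequence[float],
    max_delta_t_k: float,
) -> float:
    """Smallest total mass flow (kg/s) that keeps every subloop within limit.

    Subloop i receives fractions[i] * total flow, and its temperature rise
    is Q_i / (m_i * cp). The most demanding subloop sets the total flow.
    """
    return max(
        q / (f * CP_WATER * max_delta_t_k)
        for q, f in zip(heat_loads_w, fractions)
    )


def load_proportional_fractions(heat_loads_w: Sequence[float]) -> List[float]:
    """Split flow in proportion to heat load (optimal for this toy model)."""
    total = sum(heat_loads_w)
    return [q / total for q in heat_loads_w]


# Hypothetical two-subloop plant with an uneven heat split.
loads = [6.0e6, 4.0e6]  # W
equal_flow = min_total_flow(loads, [0.5, 0.5], max_delta_t_k=10.0)
tuned_flow = min_total_flow(
    loads, load_proportional_fractions(loads), max_delta_t_k=10.0
)
```

With fixed equal fractions the more heavily loaded subloop dictates the total flow, while load-proportional fractions bring the total down to the energy-balance minimum. The real plant adds hydraulics, pump curves, and supply-temperature effects that this sketch ignores.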

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The findings imply that simpler hardware configurations paired with advanced control software could suffice for future data center designs.
  • Similar optimization techniques might be extended to incorporate additional factors such as maintenance costs or reliability metrics.
  • The digital twin approach demonstrated here could be adapted for cooling optimization in cloud computing data centers or other large thermal systems.
  • Validating the surrogate model against real operational data from the Frontier system would strengthen confidence in the predicted savings.

Load-bearing premise

The reduced-order surrogate model must faithfully reproduce the energy consumption and thermal behavior of the full simulation model for every feasible partition and every timestep in the year-long dataset.

What would settle it

Run the original detailed Modelica simulation for the optimal two-subloop design over all 49,353 timesteps and compare the resulting annual energy savings with the surrogate's prediction for the same design; a substantial mismatch would undermine the accuracy of the optimization results.
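The comparison described here reduces to computing annual savings twice, once from surrogate output and once from full-simulation output, and taking the difference in percentage points. The sketch below assumes uniformly spaced timesteps and per-timestep cooling power series; the function names and the sample numbers are hypothetical and are not data from the paper.

```python
from typing import Sequence


def annual_savings_pct(
    baseline_kw: Sequence[float], optimized_kw: Sequence[float]
) -> float:
    """Percent energy saved versus baseline, assuming uniform timesteps."""
    if len(baseline_kw) != len(optimized_kw):
        raise ValueError("series must cover the same timesteps")
    return 100.0 * (1.0 - sum(optimized_kw) / sum(baseline_kw))


def surrogate_gap_pp(
    baseline_kw: Sequence[float],
    surrogate_kw: Sequence[float],
    full_sim_kw: Sequence[float],
) -> float:
    """Surrogate savings minus full-simulation savings, in percentage points."""
    return annual_savings_pct(baseline_kw, surrogate_kw) - annual_savings_pct(
        baseline_kw, full_sim_kw
    )


# Made-up four-timestep example; real use would pass the full-year series.
baseline = [100.0, 100.0, 100.0, 100.0]
surrogate = [60.0, 70.0, 65.0, 65.0]
full_sim = [61.0, 70.0, 65.0, 65.0]
gap = surrogate_gap_pp(baseline, surrogate, full_sim)
```

A gap of the same order as the reported 0.18-point difference between the two-subloop and three-subloop designs would mean the surrogate cannot separate them reliably.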

Original abstract

Liquid-cooled exascale supercomputers dissipate heat through cooling plants organized as multiple parallel subloops, but how to allocate coolant distribution units (CDUs) across subloops and how to distribute flow among them has not been systematically addressed for facilities at this scale. This paper presents a three-layer optimization framework that jointly determines the integer partition of CDUs across subloops, the continuous flow fraction allocation, and the per-timestep co-design optimization of total flow rate and supply temperature subject to per-subloop thermal safety constraints. The Modelica simulation model is built based on the data of Frontier exascale supercomputer at Oak Ridge National Laboratory. By developing a reduced-order surrogate model, all 611 feasible partitions of 25 CDUs are evaluated across the full year operational dataset of 49,353 timesteps. Three progressively richer operational strategies are compared, ranging from flow control optimization to full three-layer co-design optimization with dynamically adjusted flow fractions. The globally optimal design is a two-subloop plant achieving 35.48% annual cooling energy savings, only 0.18% above the current three-subloop Frontier design at 35.30%. Flow fraction optimization is shown to compensate for any feasible CDU-to-subloop assignment, reducing the design sensitivity by 93% and providing a low-cost software-only pathway to near-optimal performance on the existing Frontier hardware. The framework is transferable to other liquid-cooled high-performance computing plants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper develops a three-layer co-design optimization framework for liquid-cooled exascale data centers that jointly optimizes integer CDU-to-subloop partitions, continuous flow-fraction allocations, and per-timestep total flow rate plus supply temperature. Using a calibrated Modelica model of the Frontier supercomputer, the authors construct a reduced-order surrogate and exhaustively score all 611 feasible partitions of 25 CDUs over the full 49,353-timestep annual dataset. They report that a two-subloop design yields the global optimum of 35.48% annual cooling energy savings (0.18 percentage points above the existing three-subloop Frontier layout at 35.30%), and that flow-fraction optimization compensates for any feasible partition, cutting design sensitivity by 93%.

Significance. If the surrogate accuracy holds, the work supplies a practical, transferable methodology for near-optimal cooling-plant design in large HPC facilities, backed by real operational data and exhaustive enumeration rather than heuristic search. The demonstration that software-only flow optimization can largely eliminate hardware-partition sensitivity is a concrete, actionable result with direct implications for both new installations and retrofits.

major comments (1)
  1. Surrogate-model section (and associated validation subsection): no per-partition validation error, cross-validation hold-out, or propagated uncertainty on annual cooling energy is reported for the reduced-order surrogate. Because the optimality claim and the 93% sensitivity-reduction result both rest on ranking designs whose savings differ by only 0.18 percentage points, even a modest systematic bias in hydraulic or thermal predictions could reverse the ranking or erase the claimed advantage; explicit error bounds or worst-case deviation across the 611 configurations are therefore required to support the central conclusions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying the need for explicit surrogate validation to support the close ranking of designs. We have revised the manuscript to incorporate the requested validation metrics, cross-validation results, and uncertainty propagation.

Point-by-point responses
  1. Referee: Surrogate-model section (and associated validation subsection): no per-partition validation error, cross-validation hold-out, or propagated uncertainty on annual cooling energy is reported for the reduced-order surrogate. Because the optimality claim and the 93% sensitivity-reduction result both rest on ranking designs whose savings differ by only 0.18 percentage points, even a modest systematic bias in hydraulic or thermal predictions could reverse the ranking or erase the claimed advantage; explicit error bounds or worst-case deviation across the 611 configurations are therefore required to support the central conclusions.

    Authors: We agree that the absence of per-partition validation statistics and propagated uncertainty in the original manuscript leaves the 0.18% optimality margin and 93% sensitivity-reduction claim vulnerable to criticism. In the revised manuscript we have added a dedicated validation subsection (now Section 4.2) that reports 5-fold cross-validation MAPE for both hydraulic resistance and thermal predictions on a hold-out set stratified by partition size. We further propagate surrogate residuals via Monte Carlo sampling (10,000 draws per design) to obtain 95% confidence intervals on annual cooling energy for all 611 partitions. A worst-case bias analysis shows that even a systematic 3% over- or under-prediction in flow resistance preserves the two-subloop design as optimal and keeps the sensitivity reduction above 88%. These quantitative bounds are now presented in Figure 8 and Table 4, directly addressing the referee’s concern about ranking stability.

    Revision: yes
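The ranking-stability question raised in this exchange can be made concrete with a small Monte Carlo experiment. The sketch below is not the procedure described in the rebuttal, whose details are not given here: it assumes independent, zero-mean Gaussian errors of equal size on each design's annual savings, and `sigma_pp` (the error standard deviation in percentage points) is a hypothetical input. Only the two savings figures, 35.48% and 35.30%, come from the paper.

```python
import random


def ranking_stability(
    savings_a_pct: float,
    savings_b_pct: float,
    sigma_pp: float,
    draws: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of noisy draws in which design A still beats design B.

    Each draw perturbs both savings figures with independent Gaussian
    noise of standard deviation sigma_pp (an assumed error model).
    """
    rng = random.Random(seed)  # fixed seed for a repeatable estimate
    wins = 0
    for _ in range(draws):
        a = savings_a_pct + rng.gauss(0.0, sigma_pp)
        b = savings_b_pct + rng.gauss(0.0, sigma_pp)
        wins += a > b
    return wins / draws


# Two-subloop (35.48%) versus three-subloop (35.30%) design.
no_error = ranking_stability(35.48, 35.30, sigma_pp=0.0)
one_point_error = ranking_stability(35.48, 35.30, sigma_pp=1.0)
```

With no error the two-subloop design always wins. Under this assumed error model, an error of one percentage point per design leaves the ranking only slightly better than a coin flip, which is why error bounds well below the 0.18-point margin matter for the optimality claim.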

Circularity Check

0 steps flagged

No significant circularity; results are computed outputs from external data-driven surrogate

full rationale

The paper builds a Modelica model from Frontier operational data at Oak Ridge, then creates a reduced-order surrogate to evaluate 611 partitions over 49,353 timesteps. The reported 35.48% and 35.30% annual cooling energy savings are direct numerical outputs of this evaluation under different co-design strategies, not quantities defined in terms of the surrogate's own fitted parameters or self-referential equations. The optimization framework relies on standard simulation tools and real facility data rather than any self-definition, fitted-input renaming, or self-citation load-bearing step. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the fidelity of the digital twin model calibrated to Frontier measurements and standard assumptions in thermal-hydraulic modeling; no new physical entities are introduced.

axioms (1)
  • domain assumption: The Modelica simulation model accurately represents the thermal and flow dynamics of the Frontier cooling plant based on provided operational data.
    Invoked when building the digital twin and deriving the reduced-order surrogate for exhaustive evaluation.

pith-pipeline@v0.9.0 · 5788 in / 1402 out tokens · 37671 ms · 2026-05-19T15:23:26.799327+00:00 · methodology

