CUTh-Solver: GPU-Accelerated Sparse Matrix Solver for High-Resolution Thermal Simulation of 3D ICs

Chenghan Wang, Zhen Zhuang, Shui Jiang, Siyuan Liang, Xiaoman Yang, Kai Zhu, Darong Huang, Luis Costero, Rongmei Chen, Tsung-Wei Huang, David Atienza, Tsung-Yi Ho

arxiv: 2606.17850 · v1 · pith:ZOCP6W4Rnew · submitted 2026-06-16 · 💻 cs.AR

Pith reviewed 2026-06-26 22:19 UTC · model grok-4.3

classification 💻 cs.AR

keywords GPU acceleration · sparse matrix solver · thermal simulation · 3D IC · Preconditioned Conjugate Gradient · mixed precision · symmetric positive definite · diagonal storage


The pith

A co-designed GPU solver for sparse SPD matrices from 3D IC thermal simulation delivers up to 25.8x speedup over GPU-accelerated COMSOL Multiphysics 6.4 and more than 3x over NVIDIA's general-purpose libraries (AmgX, cuSPARSE, cuDSS).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-resolution grids are needed to capture localized hotspots in 3D integrated circuit thermal analysis, but they produce very large sparse linear systems. General-purpose GPU solvers leave performance on the table because they ignore the regular sparsity patterns that arise from these structured grids. CUTh-Solver is a Preconditioned Conjugate Gradient (PCG) framework that condenses the Diagonal (DIA) storage format, performs sparse matrix-vector multiplication (SpMV) diagonal by diagonal for coalesced memory access, uses a high-parallelism preconditioner, and applies adaptive mixed-precision arithmetic to raise hardware utilization while preserving stability. The solver handles both steady-state and transient problems, and the paper reports measured speedups on representative 3D IC workloads.

Core claim

CUTh-Solver is a GPU-accelerated Preconditioned Conjugate Gradient solver for symmetric positive definite systems that arise in high-resolution steady-state and transient 3D IC thermal simulation. It condenses the DIA storage format to remove redundancy, employs diagonal-wise SpMV for coalesced memory access, adopts a high-parallelism preconditioning strategy to resolve the parallelism-quality conflict, and uses an adaptive fine-grained mixed-precision scheme that maps work to different floating-point units without compromising numerical stability.
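The abstract names diagonal-wise SpMV over a condensed DIA layout but does not spell out the layout itself. As a minimal sketch, assuming the condensing amounts to storing only the main diagonal and the upper diagonals of the symmetric matrix (an assumption of this example, not something the paper confirms), a diagonal-by-diagonal product could look like the following. The names `dia_spmv_symmetric` and `laplacian_1d_dia` are hypothetical, and the pure-Python loops stand in for what would be a GPU kernel.

```python
def dia_spmv_symmetric(n, offsets, diags, x):
    """Return y = A @ x for a symmetric A stored by its non-negative diagonals.

    offsets[d] >= 0 is the distance of diagonal d above the main diagonal;
    diags[d][i] holds A[i][i + offsets[d]] for 0 <= i < n - offsets[d].
    Storing only upper diagonals is an assumption made for this sketch.
    """
    y = [0.0] * n
    for k, band in zip(offsets, diags):   # outer loop: one diagonal at a time
        for i in range(n - k):            # inner loop: contiguous, unit stride
            a = band[i]
            y[i] += a * x[i + k]          # upper-triangle entry A[i][i+k]
            if k:                         # mirrored lower-triangle entry A[i+k][i]
                y[i + k] += a * x[i]
    return y


def laplacian_1d_dia(n):
    """Toy SPD matrix tridiag(-1, 2, -1), a 1D conduction stencil."""
    return [0, 1], [[2.0] * n, [-1.0] * (n - 1)]


offsets, diags = laplacian_1d_dia(4)
y = dia_spmv_symmetric(4, offsets, diags, [1.0, 2.0, 3.0, 4.0])
```

Here `y` evaluates to `[0.0, 0.0, 0.0, 5.0]`. Each inner loop walks one diagonal with unit stride, which is the access pattern that coalesces well on a GPU. A real kernel would likely assign one thread per row and gather both mirrored entries rather than scatter into `y[i + k]`, since concurrent scatters would race.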

What carries the argument

PCG solver equipped with condensed DIA storage, diagonal-wise SpMV, high-parallelism preconditioning, and adaptive mixed-precision arithmetic.

If this is right

Up to 25.8x speedup versus GPU-accelerated COMSOL Multiphysics 6.4 on the same thermal problems.
More than 3x speedup versus NVIDIA AmgX, cuSPARSE, and cuDSS on representative workloads.
Ablation experiments confirm that each of the four optimizations contributes measurably to the overall gain.
Both steady-state and transient thermal simulations are supported at the improved speed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Domain-specific co-design of storage and kernels can outperform mature general-purpose libraries even when both run on the same GPU hardware.
The same regular-grid sparsity structure appears in other finite-difference or finite-element engineering problems, suggesting the optimizations may transfer beyond thermal analysis.
Higher throughput at fixed accuracy makes it practical to increase grid resolution further, which could improve detection of fine-scale thermal features.

Load-bearing premise

The coefficient matrices from high-resolution 3D IC thermal simulations possess regular sparsity patterns that specialized storage, access, and precision choices can exploit without loss of accuracy or stability.

What would settle it

A test matrix taken from a 3D IC thermal model on which CUTh-Solver either fails to converge or produces a solution whose residual or temperature field differs from a verified general-purpose solver beyond floating-point tolerance.
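The check described above can be sketched in a few lines. This is an illustrative example only: the helper names `relative_residual` and `fields_agree`, the dense 2x2 matrix, and the tolerance `rel_tol=1e-6` are assumptions of this sketch, not values taken from the paper.

```python
import math


def relative_residual(A, x, b):
    """Return ||b - A x||_2 / ||b||_2 for a dense row-major matrix A."""
    r = [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(A, b)]
    return math.sqrt(sum(ri * ri for ri in r)) / math.sqrt(sum(bi * bi for bi in b))


def fields_agree(t_test, t_ref, rel_tol=1e-6):
    """True if every temperature matches the reference within rel_tol."""
    return all(math.isclose(a, b, rel_tol=rel_tol) for a, b in zip(t_test, t_ref))


# Small SPD system with a known solution, standing in for a thermal matrix.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_exact = [1.0 / 11.0, 7.0 / 11.0]

res_good = relative_residual(A, x_exact, b)     # close to machine precision
res_bad = relative_residual(A, [0.0, 0.0], b)   # zero guess leaves the full residual
```

A solver under test would pass if its residual is small and `fields_agree` holds against the reference field; either failing on a genuine 3D IC matrix would be the counterexample the paragraph asks for.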

Original abstract

Coarse-grained thermal simulation tends to underestimate localized thermal issues, potentially missing critical hotspots. Accurate analysis, therefore, demands fine-grained information, which dramatically increases grid resolution and thus computational workload. Fortunately, the coefficient matrices are often sparse with regular sparsity patterns, offering optimization opportunities. However, existing general-purpose matrix solvers on GPUs rarely exploit these domain-specific properties, thereby encountering bottlenecks in data storage, memory access, parallelism, computational efficiency, and hardware utilization. Therefore, we propose CUTh-Solver, a co-designed GPU-accelerated Preconditioned Conjugate Gradient (PCG)-based sparse solver framework for Symmetric Positive Definite (SPD) systems arising from high-resolution steady-state and transient 3D IC thermal simulation. For data storage, CUTh-Solver condenses the Diagonal (DIA) storage format to remove redundancy. To optimize the memory access, CUTh-Solver employs diagonal-wise SpMV to achieve coalesced memory access. We further observe a critical conflict between parallelism and preconditioning quality and thus adopt a high-parallelism preconditioning strategy. To improve computational efficiency and hardware utilization, we employ an adaptive fine-grained mixed-precision strategy that leverages diverse floating-point units to avoid resource contention, enhancing throughput without compromising numerical stability. Experimental results show that CUTh-Solver achieves up to 25.8x speedup over GPU-accelerated COMSOL Multiphysics 6.4 and over 3x speedup over NVIDIA's native general-purpose libraries (AmgX, cuSPARSE, cuDSS). Ablation studies validate the individual contribution of each optimization. The code is available at: https://github.com/Chenghan-Wang/CUTh-Solver

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

CUTh-Solver packages standard GPU sparse-matrix tricks for thermal SPD systems and reports large speedups, but the abstract gives almost no accuracy numbers to back the mixed-precision path.


The paper's main contribution is a domain-specific GPU PCG solver that condenses the DIA format, switches to diagonal-wise SpMV for better coalescing, chooses a high-parallelism preconditioner, and layers on adaptive mixed precision. These are all known techniques, but the authors combine them for the regular sparsity patterns that come out of high-resolution 3D IC thermal models. They also ship the code on GitHub and run ablation studies, which is useful.

The speedups they list (up to 25.8x versus GPU COMSOL and more than 3x versus AmgX, cuSPARSE, and cuDSS) are the headline result. If the numbers hold and the solutions stay accurate, the work is a practical win for people who need faster steady-state and transient thermal analysis inside chip-design flows.

The soft spot is exactly what the stress-test note flags: the abstract asserts that the mixed-precision strategy and the preconditioner preserve numerical stability, yet it shows no residual norms, no iteration counts against a double-precision reference, and no error on hotspot temperatures. Without those checks it is impossible to judge whether the reported wall-clock gains come at an acceptable accuracy cost. The weakest assumption in the reader's note is therefore on target.
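The kind of comparison the letter asks for can be illustrated with a toy experiment. The sketch below runs a Jacobi-preconditioned CG twice on a small SPD system, once fully in double precision and once with the preconditioner output rounded to single precision, and records iteration counts and final residuals. Everything here is an assumption for illustration: the paper does not say its preconditioner is Jacobi, the `to_f32` rounding only emulates reduced precision, and `pcg_jacobi` and `spmv_tridiag` are hypothetical names.

```python
import math
import struct


def to_f32(v):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def spmv_tridiag(x):
    """y = A x for the SPD matrix tridiag(-1, 2, -1)."""
    n = len(x)
    return [2.0 * x[i] - (x[i - 1] if i else 0.0) - (x[i + 1] if i + 1 < n else 0.0)
            for i in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg_jacobi(b, tol=1e-8, max_iter=500, low_precision_precond=False):
    """Jacobi-preconditioned CG; returns (x, iterations, relative residual)."""
    inv_diag = 0.5  # the diagonal of A is 2 everywhere

    def precond(r):
        z = [inv_diag * ri for ri in r]
        return [to_f32(zi) for zi in z] if low_precision_precond else z

    x = [0.0] * len(b)
    r = b[:]
    z = precond(r)
    p = z[:]
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    rel = 1.0
    for k in range(1, max_iter + 1):
        Ap = spmv_tridiag(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        rel = math.sqrt(dot(r, r)) / b_norm   # recursively updated residual
        if rel < tol:
            return x, k, rel
        z = precond(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter, rel


b = [1.0] * 32
x64, it64, res64 = pcg_jacobi(b)
x32, it32, res32 = pcg_jacobi(b, low_precision_precond=True)
```

Comparing `it64` with `it32`, and `x64` with `x32`, is the sort of evidence that would show whether a reduced-precision path costs extra iterations or accuracy on a given matrix.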

This is not a theoretical linear-algebra paper. It is aimed at engineers who already run thermal simulations on 3D ICs and want something faster than the general libraries. The experimental setup looks reproducible because the code is public, and the claims are concrete enough that a referee could check them.

I would send it to peer review. The optimizations are straightforward to evaluate, the performance gap is large enough to matter, and the missing accuracy data is the sort of thing reviewers routinely ask for and can be supplied in revision.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces CUTh-Solver, a co-designed GPU-accelerated PCG solver for SPD linear systems from high-resolution steady-state and transient 3D IC thermal simulations. It condenses the DIA format, uses diagonal-wise SpMV for coalesced access, adopts a high-parallelism preconditioner, and applies adaptive fine-grained mixed precision; the authors report up to 25.8× speedup versus GPU-accelerated COMSOL 6.4 and >3× versus AmgX/cuSPARSE/cuDSS, supported by ablation studies, and release the code at https://github.com/Chenghan-Wang/CUTh-Solver.

Significance. If the numerical accuracy and stability of the mixed-precision and preconditioning choices are verified for the target thermal matrices, the domain-specific optimizations could meaningfully accelerate fine-grained 3D IC thermal analysis where general-purpose GPU libraries are currently bottlenecks. The open-source release is a clear strength that supports reproducibility.

major comments (1)

[Experimental Results] The central speedup claims (25.8× over COMSOL, >3× over AmgX/cuSPARSE/cuDSS) rest on the assertion that the condensed DIA format, diagonal-wise SpMV, high-parallelism preconditioner, and adaptive mixed precision preserve numerical stability and accuracy. However, no residual norms, iteration counts versus a double-precision reference, convergence plots, or hotspot temperature error metrics are reported to quantify any degradation for the SPD systems arising from high-resolution 3D IC models.

minor comments (1)

[Abstract] The abstract states that the optimizations avoid compromising numerical stability but does not preview any quantitative accuracy verification; a brief mention of the error metrics used would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The concern regarding verification of numerical stability and accuracy for the mixed-precision and preconditioning strategies is valid and directly impacts the strength of our speedup claims. We will address this by adding the requested quantitative metrics in the revised version.


Referee: [Experimental Results] The central speedup claims (25.8× over COMSOL, >3× over AmgX/cuSPARSE/cuDSS) rest on the assertion that the condensed DIA format, diagonal-wise SpMV, high-parallelism preconditioner, and adaptive mixed precision preserve numerical stability and accuracy. However, no residual norms, iteration counts versus a double-precision reference, convergence plots, or hotspot temperature error metrics are reported to quantify any degradation for the SPD systems arising from high-resolution 3D IC models.

Authors: We agree that explicit verification is necessary to substantiate the claim that the optimizations preserve accuracy. In the revised manuscript, we will include: (1) residual norm values at convergence for both our solver and a double-precision reference, (2) iteration counts comparing our adaptive mixed-precision PCG against full double-precision PCG on the same matrices, (3) convergence plots showing residual reduction over iterations, and (4) hotspot temperature error metrics (maximum and average absolute/relative errors) against a high-precision reference solution for the 3D IC test cases. These additions will quantify any potential degradation and confirm stability for the target SPD thermal matrices.

revision: yes
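The error metrics listed in item (4) are simple to compute once a reference field exists. The following is a minimal sketch under stated assumptions: temperatures are given as flat lists in kelvin, the reference values are nonzero, and the function name `hotspot_error_metrics` and the `hotspot_shift` entry (the difference in peak temperature) are additions of this example rather than anything defined in the paper or the rebuttal.

```python
def hotspot_error_metrics(t_test, t_ref):
    """Max and mean absolute/relative error of t_test against t_ref (kelvin)."""
    abs_err = [abs(a - b) for a, b in zip(t_test, t_ref)]
    rel_err = [e / abs(b) for e, b in zip(abs_err, t_ref)]
    n = len(abs_err)
    return {
        "max_abs": max(abs_err),
        "mean_abs": sum(abs_err) / n,
        "max_rel": max(rel_err),
        "mean_rel": sum(rel_err) / n,
        # Difference between the two peak temperatures, wherever they occur.
        "hotspot_shift": abs(max(t_test) - max(t_ref)),
    }


t_ref = [300.0, 320.0, 400.0, 310.0]    # made-up reference field
t_test = [300.0, 321.0, 398.0, 310.0]   # made-up solver output
m = hotspot_error_metrics(t_test, t_ref)
```

For these made-up values the maximum absolute error is 2.0 K and the mean absolute error is 0.75 K; reporting the peak-temperature difference separately matters because an average can hide an error concentrated at the hotspot.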

Circularity Check

0 steps flagged

No circularity: experimental speedups rest on independent benchmarks


The paper describes a PCG-based solver framework whose optimizations (condensed DIA storage, diagonal-wise SpMV, high-parallelism preconditioner, adaptive mixed precision) are presented as engineering choices whose value is measured by wall-clock timings against COMSOL, AmgX, cuSPARSE, and cuDSS. It offers no equations, fitted parameters, or first-principles derivations whose outputs reduce by construction to the inputs; the reported speedups (up to 25.8× and more than 3×) are empirical outcomes, not quantities obtained by renaming or self-referential fitting. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or the described chain of reasoning. The argument is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard numerical linear algebra assumptions plus the domain observation that thermal matrices have exploitable regular sparsity; no new physical entities or ad-hoc constants are introduced.

axioms (2)

Domain assumption: The linear systems arising from 3D IC thermal simulation are Symmetric Positive Definite (SPD). Stated directly in the abstract as the target class for the PCG solver.
Domain assumption: Matrices exhibit regular sparsity patterns that permit condensed DIA storage and diagonal-wise SpMV without changing the mathematical result. Invoked to justify the storage and memory-access optimizations.
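Both assumptions can be checked on a toy model. The sketch below assembles a 7-point conduction matrix on a small structured grid and inspects its diagonals and symmetry. The 7-point finite-difference stencil, the uniform conductance `g`, the ambient coupling `g_ambient`, and the function name `grid_conduction_entries` are all assumptions of this example; the paper's actual discretization may differ.

```python
def grid_conduction_entries(nx, ny, nz, g=1.0, g_ambient=0.1):
    """Nonzeros {(row, col): value} of a 7-point conduction matrix on an nx*ny*nz grid."""
    def idx(i, j, k):
        return i + nx * (j + ny * k)

    entries = {}
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                row = idx(i, j, k)
                diag = g_ambient  # coupling to ambient makes each row strictly dominant
                for di, dj, dk in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                                   (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                    a, b, c = i + di, j + dj, k + dk
                    if 0 <= a < nx and 0 <= b < ny and 0 <= c < nz:
                        entries[(row, idx(a, b, c))] = -g
                        diag += g
                entries[(row, row)] = diag
    return entries


entries = grid_conduction_entries(4, 3, 2)
# Distinct diagonals occupied by nonzeros: 0, +-1, +-nx, +-nx*ny.
offsets = sorted({col - row for row, col in entries})
is_symmetric = all(entries.get((c, r)) == v for (r, c), v in entries.items())
```

For this 4x3x2 grid the nonzeros fall on exactly seven diagonals (`[-12, -4, -1, 0, 1, 4, 12]`), which is what makes a diagonal storage format compact. The matrix is symmetric and strictly diagonally dominant with a positive diagonal, which is sufficient for it to be SPD.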



[1] [1]

A ppa study for heterogeneous 3-d ic options: Monolithic, hybrid bonding, and microbumping,

J. Kim, L. Zhu, H. M. Torun, M. Swaminathan, and S. K. Lim, “A ppa study for heterogeneous 3-d ic options: Monolithic, hybrid bonding, and microbumping,”IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 3, pp. 401–412, 2023

2023

[2] [2]

Thermal analysis of 3d stacking and beol technologies with functional partitioning of many-core risc-v soc,

M. Naeim, H. Oprins, S. Das, G. Van Der Plas, Y . Dai, P. Chen, C. Kao, D. Biswas, and D. Milojevic, “Thermal analysis of 3d stacking and beol technologies with functional partitioning of many-core risc-v soc,” in2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2024, pp. 33–38

2024

[3] [3]

Thermal performance analysis of mempool risc-v multicore soc,

S. Venkateswarlu, S. Mishra, H. Oprins, B. Vermeersch, M. Brunion, J.- H. Han, M. R. Stan, P. Weckx, and F. Catthoor, “Thermal performance analysis of mempool risc-v multicore soc,”IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 11, pp. 1668–1676, 2022

2022

[4] [4]

Thermal analysis of advanced back- end-of-line structures and the impact of design parameters,

X. Chang, H. Oprins, M. Lofrano, B. Vermeersch, I. Ciofi, O. V . Pedreira, Z. Tokei, and I. De Wolf, “Thermal analysis of advanced back- end-of-line structures and the impact of design parameters,” in2022 21st IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (iTherm). IEEE, 2022, pp. 1–8

2022

[5] [5]

Multiscale thermal impact of bspdn: Soc hotspot challenges and partial mitigation,

B. Vermeersch, S. Mishra, M. Brunion, O. Zografos, M. Lofrano, H. Oprins, J. Myers, Z. Tokei, and G. Hellings, “Multiscale thermal impact of bspdn: Soc hotspot challenges and partial mitigation,” in2024 IEEE International Electron Devices Meeting (IEDM), 2024, pp. 1–4

2024

[6] [6]

Rapid estimation of anisotropic thermal conductivity in rdl for 2.5 d chiplet design,

Y . Li, J. Liu, D. Lu, W. Zhang, R. X.-K. Gao, E. Liu, M. D. Rotaru, D. Rahul, and N. Sridhar, “Rapid estimation of anisotropic thermal conductivity in rdl for 2.5 d chiplet design,” in2025 IEEE 75th Electronic Components and Technology Conference (ECTC). IEEE, 2025, pp. 1541–1546

2025

[7] [7]

Fast and accurate machine learning prediction of back-end-of-line thermal resistances in backside power delivery and chiplet architectures,

P. R. Chowdhury, A. Jain, D. Chidambarrao, K. Acharya, and A. Ogino, “Fast and accurate machine learning prediction of back-end-of-line thermal resistances in backside power delivery and chiplet architectures,” in2025 IEEE 75th Electronic Components and Technology Conference (ECTC). IEEE, 2025, pp. 1577–1582

2025

[8] [8]

A 20-year retrospective on power and thermal modeling and management,

D. Atienza, K. Zhu, D. Huang, and L. Costero, “A 20-year retrospective on power and thermal modeling and management,”IEEE Design & Test, 2025

2025

[9] [9]

Pact: An extensible parallel thermal simulator for emerging integration and cooling technologies,

Z. Yuan, P. Shukla, S. Chetoui, S. Nemtzow, S. Reda, and A. K. Coskun, “Pact: An extensible parallel thermal simulator for emerging integration and cooling technologies,”IEEE Transactions on Computer- Aided Design of Integrated Circuits and Systems, vol. 41, no. 4, pp. 1048–1061, 2021

2021

[10] [10]

3d-ice 4.0: Accurate and efficient thermal modeling for 2.5 d/3d heterogeneous chiplet systems,

K. Zhu, D. Huang, L. Costero, and D. Atienza, “3d-ice 4.0: Accurate and efficient thermal modeling for 2.5 d/3d heterogeneous chiplet systems,” in2026 Design, Automation & Test in Europe Conference (DATE). IEEE, 2026, pp. 1–7

2026

[11] [11]

Amgx: A library for gpu accelerated algebraic multigrid and preconditioned iterative methods,

M. Naumov, M. Arsaev, P. Castonguay, J. Cohen, J. Demouth, J. Eaton, S. Layton, N. Markovskiy, I. Reguly, N. Sakharnykh, V. Sellappan, and R. Strzodka, “Amgx: A library for gpu accelerated algebraic multigrid and preconditioned iterative methods,” SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. S602–S626, 2015

2015

[12] [12]

Dcsolver: Accelerating sparse iterative solvers via divide-and-conquer on gpus,

H. Qiu, C. Xu, J. Fang, J. Zhang, L. Deng, Z. Dai, Y. Ding, Y. Wang, Z. Han, Y. Che et al., “Dcsolver: Accelerating sparse iterative solvers via divide-and-conquer on gpus,” ACM Transactions on Architecture and Code Optimization, vol. 22, no. 3, pp. 1–25, 2025

2025

[13] [13]

A technical survey of sparse linear solvers in electronic design automation,

N. Rai, “A technical survey of sparse linear solvers in electronic design automation,” Journal of Circuits, Systems and Computers, 2026

2026

[14] [14]

Recg: Reram-accelerated sparse conjugate gradient,

M. Fan, X. Chen, D. Yang, Z. Jin, and W. Liu, “Recg: Reram-accelerated sparse conjugate gradient,” in Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC), 2024, pp. 1–6

2024

[15] [15]

From 2.5 d to 3d chiplet systems: Investigation of thermal implications with hotspot 7.0,

J.-H. Han, X. Guo, K. Skadron, and M. R. Stan, “From 2.5 d to 3d chiplet systems: Investigation of thermal implications with hotspot 7.0,” in IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (iTherm). IEEE, 2022, pp. 1–6

2022

[16] [16]

Mfit: Multi-fidelity thermal modeling for 2.5 d and 3d multi-chiplet architectures,

L. Pfromm, A. Kanani, H. Sharma, P. Solanki, E. Tervo, J. Park, J. Doppa, P. P. Pande, and U. Ogras, “Mfit: Multi-fidelity thermal modeling for 2.5 d and 3d multi-chiplet architectures,” ACM Transactions on Design Automation of Electronic Systems, 2024

2024

[17] [17]

Randomized cholesky factorization with threshold-based multisampling for power grid simulation,

Z. Liu and W. Yu, “Randomized cholesky factorization with threshold-based multisampling for power grid simulation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 43, no. 9, pp. 2687–2691, 2024

2024

[18] [18]

Multi-layer package power/ground planes synthesis with balanced dc ir drops: A game-theoretic optimization approach,

S. Liang, Z. Zhuang, K.-Y. Chao, B. Yu, and T.-Y. Ho, “Multi-layer package power/ground planes synthesis with balanced dc ir drops: A game-theoretic optimization approach,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2025

2025

[19] [19]

Efficient large-scale power grid analysis based on preconditioned krylov-subspace iterative methods,

T.-H. Chen and C. C.-P. Chen, “Efficient large-scale power grid analysis based on preconditioned krylov-subspace iterative methods,” in Proceedings of the 38th annual Design Automation Conference (DAC), 2001

2001

[20] [20]

pgrass-solver: a parallel iterative solver for scalable power grid analysis based on graph spectral sparsification,

Z. Liu and W. Yu, “pgrass-solver: a parallel iterative solver for scalable power grid analysis based on graph spectral sparsification,” in 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). IEEE, 2021, pp. 1–9

2021

[21] [21]

[Online]

COMSOL Multiphysics. [Online]. Available: https://www.comsol.com/release/6.4/gpu-acceleration

[22] [22]

[Online]

SuperLU. [Online]. Available: https://portal.nersc.gov/project/sparse/superlu/

[23] [23]

Algorithm 907: Klu, a direct sparse solver for circuit simulation problems,

T. A. Davis and E. Palamadai Natarajan, “Algorithm 907: Klu, a direct sparse solver for circuit simulation problems,” ACM Transactions on Mathematical Software (TOMS), vol. 37, no. 3, pp. 1–17, 2010

2010

[24] [24]

Aztecoo user guide

M. A. Heroux, “Aztecoo user guide.” Sandia National Laboratories, Tech. Rep., 2004

2004

[25] [25]

Amesos2 and belos: Direct and iterative solvers for large sparse linear systems,

E. Bavier, M. Hoemmen, S. Rajamanickam, and H. Thornquist, “Amesos2 and belos: Direct and iterative solvers for large sparse linear systems,” Scientific Programming, vol. 20, no. 3, pp. 241–255, 2012

2012

[26] [26]

Thermalscope: Multi-scale thermal analysis for nanometer-scale integrated circuits,

N. Allec, Z. Hassan, L. Shang, R. P. Dick, and R. Yang, “Thermalscope: Multi-scale thermal analysis for nanometer-scale integrated circuits,” in 2008 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2008, pp. 603–610

2008

[27] [27]

The mta: An advanced and versatile thermal simulator for integrated systems,

S. Ladenheim, Y.-C. Chen, M. Mihajlović, and V. F. Pavlidis, “The mta: An advanced and versatile thermal simulator for integrated systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 12, pp. 3123–3136, 2018

2018

[28] [28]

Porting hypre to heterogeneous computer architectures: Strategies and experiences,

R. D. Falgout, R. Li, B. Sjögreen, L. Wang, and U. M. Yang, “Porting hypre to heterogeneous computer architectures: Strategies and experiences,” Parallel Computing, vol. 108, p. 102840, 2021

2021

[29] [29]

An efficient leakage-aware thermal simulation approach for 3d-ics using corrected linearized model and algebraic multigrid,

C. Yan, H. Zhu, D. Zhou, and X. Zeng, “An efficient leakage-aware thermal simulation approach for 3d-ics using corrected linearized model and algebraic multigrid,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, 2017, pp. 1207–1212

2017

[30] [30]

Thpa: Thermal simulation for advanced ics,

B.-W. Chen, Y.-H. Lin, C.-Y. Lin, and Y.-M. Lee, “Thpa: Thermal simulation for advanced ics,” in 2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2026, pp. 1407–1413

2026

[31] [31]

Ic thermal simulation and modeling via efficient multigrid-based approaches,

P. Li, L. T. Pileggi, M. Asheghi, and R. Chandra, “Ic thermal simulation and modeling via efficient multigrid-based approaches,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 9, pp. 1763–1776, 2006

2006

[32] [32]

Fast electrical-thermal co-simulation using multigrid method for 3d integration,

J. Xie and M. Swaminathan, “Fast electrical-thermal co-simulation using multigrid method for 3d integration,” in 2012 IEEE 62nd Electronic Components and Technology Conference (ECTC). IEEE, 2012, pp. 651–657

2012

[33] [33]

Fast thermal analysis on gpu for 3d ics with integrated microchannel cooling,

Z. Feng and P. Li, “Fast thermal analysis on gpu for 3d ics with integrated microchannel cooling,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 8, pp. 1526–1539, 2012

2012

[34] [34]

Thermal simulator for advanced packaging and chiplet-based systems,

Y. Safari, A. Corbier, D. Al Saleh, F. R. Amik, and B. Vaisband, “Thermal simulator for advanced packaging and chiplet-based systems,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2025

2025

[35] [35]

Etla-3d: Equivalent thin layer aggregation based thermal fem for hybrid bonding f2f 3d ics,

C. Wang, Z. Zhuang, K. Zhu, D. Huang, L. Costero, R. Chen, D. Atienza, and T.-Y. Ho, “Etla-3d: Equivalent thin layer aggregation based thermal fem for hybrid bonding f2f 3d ics,” in 2026 Design, Automation & Test in Europe Conference (DATE). IEEE, 2026

2026

[36] [36]

Azul: An accelerator for sparse iterative solvers leveraging distributed on-chip memory,

A. Feldmann, C. Golden, Y. Yang, J. S. Emer, and D. Sanchez, “Azul: An accelerator for sparse iterative solvers leveraging distributed on-chip memory,” in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2024, pp. 643–656

2024

[37] [37]

Iterative methods for sparse linear systems

Y. Saad, Iterative methods for sparse linear systems. SIAM, 2003

2003

[38] [38]

R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge university press, 2012

2012

[39] [39]

D. R. Kincaid and E. W. Cheney, Numerical analysis: mathematics of scientific computing. American Mathematical Soc., 2009, vol. 2

2009

[40] [40]

G. H. Golub and C. F. Van Loan, Matrix computations. JHU press, 2013

2013

[41] [41]

Data-driven mixed precision sparse matrix vector multiplication for gpus,

K. Ahmad, H. Sundar, and M. Hall, “Data-driven mixed precision sparse matrix vector multiplication for gpus,” ACM Transactions on Architecture and Code Optimization, vol. 16, no. 4, pp. 1–24, 2019

2019

[42] [42]

Adaptive precision in block-jacobi preconditioning for iterative sparse linear system solvers,

H. Anzt, J. Dongarra, G. Flegar, N. J. Higham, and E. S. Quintana-Ortí, “Adaptive precision in block-jacobi preconditioning for iterative sparse linear system solvers,” Concurrency and Computation: Practice and Experience, vol. 31, no. 6, p. e4460, 2019

2019

[43] [43]

Self-attention to operator learning-based 3d-ic thermal simulation,

Z. Huang, H. Wang, W. Yang, M. Tang, D. Xie, T.-J. Lin, Y. Zhang, W. W. Xing, and L. He, “Self-attention to operator learning-based 3d-ic thermal simulation,” in 2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 2025, pp. 1–7

2025

[44] [44]

T-fusion: Thermal modeling of 3d ics with multi-fidelity fusion,

B. Zhang, W. Xing, X. Zhao, and Y. Sun, “T-fusion: Thermal modeling of 3d ics with multi-fidelity fusion,” in Proceedings of the 30th Asia and South Pacific Design Automation Conference (ASP-DAC), 2025, pp. 1406–1412

2025

[45] [45]

PETSc For partial differential equations: Numerical solutions in C and Python

E. Bueler, PETSc For partial differential equations: Numerical solutions in C and Python. SIAM, 2020

2020

[46] [46]

[Online]

NVIDIA cuDSS. [Online]. Available: https://developer.nvidia.com/cudss

[47] [47]

[Online]

NVIDIA cuDSS direct solvers github. [Online]. Available: https://github.com/NVIDIA/CUDALibrarySamples/tree/main/cuDSS

[48] [48]

[Online]

NVIDIA cuSPARSE. [Online]. Available: https://developer.nvidia.com/cusparse

[49] [49]

[Online]

NVIDIA cuSPARSE ICC(0)-PCG solver github. [Online]. Available: https://github.com/NVIDIA/CUDALibrarySamples/tree/main/cuSPARSE/cg

[50] [50]

[Online]

NVIDIA AmgX. [Online]. Available: https://developer.nvidia.com/amgx

[51] [51]

[Online]

NVIDIA AmgX github. [Online]. Available: https://github.com/NVIDIA/AMGX

[52] [52]

Gpu-accelerated preconditioned iterative linear solvers,

R. Li and Y. Saad, “Gpu-accelerated preconditioned iterative linear solvers,” The Journal of Supercomputing, vol. 63, no. 2, pp. 443–466, 2013

2013

[53] [53]

A customized precision format based on mantissa segmentation for accelerating sparse linear algebra,

T. Grützmacher, T. Cojean, G. Flegar, F. Göbel, and H. Anzt, “A customized precision format based on mantissa segmentation for accelerating sparse linear algebra,” Concurrency and Computation: Practice and Experience, vol. 32, no. 15, p. e5418, 2020

2020