GPU acceleration of plane-wave density functional theory calculations in Abinit

Ioanna-Maria Lygatsika; Lucas Baguet; Marc Sarraute; Marc Torrent; Pierre Kestener

arXiv: 2604.11139 · v2 · pith:WF7X5RZGnew · submitted 2026-04-13 · cond-mat.mtrl-sci

Pith reviewed 2026-05-10 15:00 UTC · model grok-4.3

keywords: GPU acceleration · plane-wave DFT · Abinit · Kohn-Sham equations · iterative diagonalization · LOBPCG · Chebyshev filtering · heterogeneous computing

The pith

Abinit achieves GPU speedups for plane-wave DFT by revising the Kohn-Sham iterative diagonalizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports the porting of Abinit to multi-GPU architectures for plane-wave density functional theory calculations. This requires algorithmic changes to the procedure that solves the Kohn-Sham equations so that it favors operations efficient on GPUs, notably linear algebra and fast Fourier transforms applied to wave functions held in distributed memory. Performance data are presented that contrast pure CPU nodes against heterogeneous CPU-GPU nodes and that directly compare the Locally Optimal Block Preconditioned Conjugate Gradient method against Chebyshev polynomial filtering on their ability to exploit GPU hardware. Readers would care because the work demonstrates how a production electronic-structure code can be adapted to modern GPU-equipped machines while keeping the same physical model.

Core claim

The Abinit implementation on multi-GPU architectures relies on algorithmic revisions of the iterative diagonalization procedure in the resolution of the Kohn-Sham problem to identify GPU-efficient mathematical operations (linear algebra, FFTs) applied to wave functions distributed in memory, and supplies detailed performance results comparing CPU nodes versus heterogeneous CPU-GPU nodes together with a comparison of LOBPCG and Chebyshev polynomial filtering in terms of GPU efficiency.

What carries the argument

Revised iterative diagonalization procedure that selects and applies GPU-efficient linear algebra and FFT operations to distributed wave functions.
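
To make the phrase "GPU-efficient operations" concrete, here is a minimal pure-Python sketch of a Chebyshev polynomial filter applied to a block of vectors. It is not Abinit's code: the names `matmul` and `chebyshev_filter`, the dense list-of-rows storage, and the 4x4 diagonal test Hamiltonian are illustrative assumptions, and the unscaled three-term recurrence is the textbook form, which may differ from the variant the paper implements. The point is only that each filter step is one Hamiltonian-times-block product plus elementwise updates, the kind of work that a GPU port can hand to vendor libraries.

```python
def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def chebyshev_filter(h, block, degree, lower, upper):
    """Return T_degree((H - c) / e) applied to every column of `block`.

    [lower, upper] is the spectral interval to damp; components of the block
    along eigenvalues outside it are amplified relative to those inside.
    """
    if degree < 1:
        raise ValueError("degree must be at least 1")
    c = 0.5 * (upper + lower)  # center of the damped interval
    e = 0.5 * (upper - lower)  # half-width of the damped interval

    def scaled_apply(x):
        # One application of (H - c) / e: a matrix-block product plus an
        # elementwise update.
        hx = matmul(h, x)
        return [[(hx_ij - c * x_ij) / e for hx_ij, x_ij in zip(hrow, xrow)]
                for hrow, xrow in zip(hx, x)]

    prev, curr = block, scaled_apply(block)  # T_0 X and T_1 X
    for _ in range(2, degree + 1):
        step = scaled_apply(curr)
        # Three-term recurrence: T_{k+1} X = 2 t T_k X - T_{k-1} X
        nxt = [[2.0 * s - p for s, p in zip(srow, prow)]
               for srow, prow in zip(step, prev)]
        prev, curr = curr, nxt
    return curr


# Hypothetical toy problem: H = diag(0, 1, 2, 3), one start vector of ones.
H = [[0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0, 3.0]]
X = [[1.0], [1.0], [1.0], [1.0]]

# Damp the interval [1, 3]: the eigenvalue-0 component grows to T_4(-2) = 97,
# while the components inside the interval keep magnitude 1.
filtered = chebyshev_filter(H, X, degree=4, lower=1.0, upper=3.0)
print(filtered)
```

In a plane-wave code the dense `matmul` would be replaced by the application of the Kohn-Sham Hamiltonian to the whole block, but the loop structure stays the same: no inner products or small eigenproblems are needed inside the filter itself.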

If this is right

Heterogeneous CPU-GPU nodes deliver higher throughput than CPU-only nodes for the same plane-wave DFT workloads.
Chebyshev polynomial filtering exploits GPU resources more effectively than LOBPCG in this setting.
Vendor library calls for linear algebra and FFTs become the dominant computational kernels after the revisions.
Large-scale electronic structure runs become feasible on GPU-equipped supercomputers without changing the underlying physics model.
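
The comparison between the two solvers can be framed with the roofline model, which the paper uses in its Figure 10. The sketch below shows only the bound itself: attainable throughput is the lower of the compute peak and memory bandwidth times arithmetic intensity. The hardware figures and kernel intensities are round placeholders chosen for illustration; they are neither vendor specifications nor measurements from the paper, and the function name `attainable_gflops` is hypothetical.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte):
    """Roofline bound: the lower of the compute peak and bandwidth * intensity."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


# Placeholder hardware figures (assumptions, not a real device's specs).
PEAK_GFLOPS = 10_000.0
BANDWIDTH_GB_PER_S = 1_500.0

# Intensity at which a kernel stops being limited by memory bandwidth.
ridge = PEAK_GFLOPS / BANDWIDTH_GB_PER_S

# Two invented kernels, one on each side of the ridge point.
for label, intensity in [("low-intensity kernel", 0.5),
                         ("high-intensity kernel", 50.0)]:
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_PER_S, intensity)
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{label}: at most {bound:.0f} GFLOP/s ({regime})")
```

A solver whose time is dominated by kernels to the right of the ridge point can approach the compute peak; one dominated by kernels to the left is capped by memory bandwidth no matter how fast the arithmetic units are.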

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pattern of favoring linear algebra and FFT operations could be applied to accelerate other plane-wave DFT packages.
Further gains may appear when the number of GPU nodes grows into the hundreds for systems containing thousands of atoms.
The reported speedups assume that data movement between CPU and GPU remains a minor fraction of total time.
Verification on a broader set of materials and properties would strengthen the case that accuracy is preserved.
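
The assumption about host-device data movement in the list above can be made quantitative with an Amdahl-style model. This is a back-of-the-envelope sketch, not something taken from the paper: the function `node_speedup`, its parameters, and the example fractions are all hypothetical, and the model ignores any overlap between transfers and computation.

```python
def node_speedup(offload_fraction, kernel_speedup, transfer_fraction=0.0):
    """Speedup of a GPU run over a CPU run under an Amdahl-style model.

    offload_fraction:  share of the CPU run time spent in offloaded kernels
    kernel_speedup:    how much faster those kernels run on the GPU
    transfer_fraction: host-device transfer time, as a share of CPU run time
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    if transfer_fraction < 0.0:
        raise ValueError("transfer_fraction must be non-negative")
    # GPU run time in units of the CPU run time.
    gpu_time = ((1.0 - offload_fraction)
                + offload_fraction / kernel_speedup
                + transfer_fraction)
    return 1.0 / gpu_time


# Invented example: 90% of the run offloaded, kernels 10x faster on the GPU.
print(node_speedup(0.9, 10.0))        # no transfer cost
print(node_speedup(0.9, 10.0, 0.1))   # transfers add 10% of the CPU run time
```

With these invented numbers the transfer term is comparable in size to the accelerated kernel time, so it removes a large part of the gain. That is why the assumption matters for interpreting any reported speedup.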

Load-bearing premise

The algorithmic revisions to the iterative diagonalization preserve numerical accuracy and convergence properties of the original CPU implementation.

What would settle it

Running the same input on the original CPU version and on the GPU version yields total energies or forces that differ beyond floating-point tolerance.
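
A check of this kind can be sketched in a few lines. The example assumes the two runs have already been parsed into dictionaries of named quantities; the function `compare_runs`, the dictionary layout, the tolerances, and every number below are illustrative assumptions, not Abinit output or results from the paper.

```python
import math


def compare_runs(cpu, gpu, rel_tol=1e-10, abs_tol=1e-12):
    """Return the names of quantities that differ between two runs.

    Each run is a dict mapping a quantity name to a float or a list of floats.
    """
    mismatches = []
    for name in sorted(cpu):
        if name not in gpu:
            mismatches.append(name)
            continue
        # Treat scalars as one-element lists so both cases share one path.
        a = cpu[name] if isinstance(cpu[name], (list, tuple)) else [cpu[name]]
        b = gpu[name] if isinstance(gpu[name], (list, tuple)) else [gpu[name]]
        same = len(a) == len(b) and all(
            math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
            for x, y in zip(a, b)
        )
        if not same:
            mismatches.append(name)
    return mismatches


# Invented numbers for illustration; not results from the paper.
cpu_run = {"total_energy": -1234.567890123456, "forces": [0.0012, -0.0034, 0.0]}
gpu_run = {"total_energy": -1234.567890123457, "forces": [0.0012, -0.0034, 0.0]}
bad_run = {"total_energy": -1234.5679, "forces": [0.0012, -0.0034, 0.0]}

print(compare_runs(cpu_run, gpu_run))  # no mismatches at these tolerances
print(compare_runs(cpu_run, bad_run))  # flags the total energy
```

The tolerances are the judgment call. Reordered floating-point sums on a GPU can change the last few digits without indicating a bug, so a bitwise comparison would be too strict, while a loose tolerance could hide a genuine change in convergence.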

Figures

Figures reproduced from arXiv: 2604.11139 by Ioanna-Maria Lygatsika, Lucas Baguet, Marc Sarraute, Marc Torrent, Pierre Kestener.

**Figure 2.** Host-device memory transfers in main

**Figure 3.** Row-to-column and column-to-row MPI transpositions of the wave function

**Figure 4.** Rayleigh-Ritz (RR) procedure and the names of

**Figure 6.** Parallel subspace iteration algorithms for computing

**Figure 7.** Hierarchy from the high-level wave function to the

**Figure 8.** Execution time and GPU speedup for Chebyshev filtering on a 255-atom Ti system (4096 electronic bands) over 10

**Figure 9.** Energy consumption and savings for Chebyshev

**Figure 10.** Roofline model for (a) the LOBPCG algorithm and (b) the Chebyshev filtering algorithm, using one NVIDIA A100

**Figure 11.** Execution times for a single SCF iteration on one

Original abstract

We report on the GPU port of the Abinit high-performance simulation code for plane-wave DFT calculations. Large-scale electronic structure calculations require computing the electronic wave function by solving the Kohn-Sham equations discretized over a large number of plane waves. Porting such calculations to GPU nodes relies not only on extensive usage of vendor libraries from a development perspective, but also on algorithmic revisions of the iterative diagonalization procedure in the resolution of the Kohn-Sham equations to identify GPU-efficient mathematical operations (linear algebra, FFTs) applied to the wave function distributed in memory. The present contribution discusses the Abinit implementation on multi-GPU architectures, providing detailed performance results for heterogeneous CPU-GPU nodes versus CPU nodes. Particular attention is given to comparing two diagonalization algorithms -- Locally Optimal Block Preconditioned Conjugate Gradient and Chebyshev polynomial filtering -- in terms of GPU efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Abinit now has a working GPU port for its plane-wave DFT with targeted changes to the diagonalization routines, and the benchmarks show practical speedups on mixed nodes, though the paper underplays checks that the GPU results match the CPU ones numerically.

The key takeaway is that Abinit has been ported to multi-GPU architectures with revisions to the LOBPCG and Chebyshev diagonalization routines to better suit GPU hardware, and the authors report performance gains when comparing CPU nodes to mixed CPU-GPU nodes.

The paper does a good job laying out the specific changes needed for GPU efficiency in the plane-wave DFT workflow. They focus on identifying GPU-friendly operations in the iterative solvers and provide concrete timing data from their tests. This kind of implementation report helps users who have access to GPU clusters and want to know how much faster their calculations might run in Abinit.

The main soft spot is the missing validation for numerical equivalence. As noted in the stress-test, there are no reported comparisons of energies, forces, or convergence behavior between the CPU and GPU implementations on the same inputs. The performance tables are there, but without those checks it's difficult to confirm that the speedups don't come with any trade-offs in accuracy or reliability. If the paper has those details in the full text, they aren't highlighted in the abstract or the main claims.

This work is for practitioners in materials science who use Abinit for large-scale calculations and for developers interested in HPC adaptations of DFT codes. It is not a new physical method, so it won't change how people think about electronic structure, but it can make existing methods more practical on modern hardware.

I would recommend sending it for peer review. The implementation choices and benchmarks are worth a careful look by experts in the field, even if some additional accuracy tests would improve it.

Referee Report

2 major / 2 minor

Summary. The manuscript reports the GPU porting of the Abinit plane-wave DFT code, emphasizing algorithmic revisions to the LOBPCG and Chebyshev polynomial filtering iterative diagonalization procedures to identify GPU-efficient operations (linear algebra and FFTs) on distributed wave functions. It provides performance benchmarks comparing CPU-only nodes to heterogeneous CPU-GPU nodes, with particular focus on the relative GPU efficiency of the two diagonalization algorithms.

Significance. If the central claims hold, this contribution would be significant for enabling scalable large-scale electronic structure calculations on modern GPU-accelerated HPC systems. The explicit comparison of LOBPCG versus Chebyshev filtering on GPUs offers practical guidance for algorithm selection in similar codes, and the reliance on vendor libraries plus distributed-memory considerations addresses key implementation challenges in the field.

major comments (2)

[§5, Performance results, Tables 1-3] The reported timings for LOBPCG and Chebyshev filtering on CPU versus CPU-GPU nodes do not include any side-by-side numerical equivalence metrics (e.g., total energy differences, eigenvalue residuals, or iteration counts to convergence) for identical inputs and systems. This is load-bearing for the claim that the algorithmic revisions preserve the original numerical accuracy and convergence properties while delivering the speedups.
[Implementation section, preceding the benchmarks] The description of the revisions to the iterative diagonalization does not specify how the GPU paths maintain mathematical equivalence to the CPU versions (e.g., identical preconditioning, filtering polynomials, or convergence criteria). Without this, it is unclear whether observed speedups reflect true acceleration or altered iteration behavior.
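
One of the equivalence metrics named in the major comments, the eigenvalue residual, is cheap to compute. The sketch below uses the common definition, the norm of H x - theta x with theta the Rayleigh quotient of x, normalized by the norm of x. Whether the paper or Abinit uses exactly this definition is not stated in the material above, so treat the formula, the function name `max_residual`, and the 2x2 test matrix as assumptions.

```python
import math


def max_residual(h, vectors):
    """Largest relative residual ||H x - theta x|| / ||x|| over `vectors`.

    h is a symmetric matrix stored as a list of rows; theta is the Rayleigh
    quotient (x . H x) / (x . x) of each trial vector x.
    """
    worst = 0.0
    for x in vectors:
        hx = [sum(h_ij * x_j for h_ij, x_j in zip(row, x)) for row in h]
        xx = sum(x_i * x_i for x_i in x)
        if xx == 0.0:
            raise ValueError("trial vectors must be nonzero")
        theta = sum(x_i * hx_i for x_i, hx_i in zip(x, hx)) / xx
        r2 = sum((hx_i - theta * x_i) ** 2 for hx_i, x_i in zip(hx, x))
        worst = max(worst, math.sqrt(r2 / xx))
    return worst


# Toy symmetric matrix with eigenvectors (1, 1) and (1, -1).
H = [[2.0, 1.0],
     [1.0, 2.0]]

print(max_residual(H, [[1.0, 1.0], [1.0, -1.0]]))  # exact eigenvectors
print(max_residual(H, [[1.0, 0.0]]))               # not an eigenvector
```

Reporting this number for the CPU and GPU runs at each SCF step, alongside iteration counts, would address the concern directly: matching residual histories indicate the same iteration behavior, not just the same final energy.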

minor comments (2)

[Abstract] The abstract states that 'detailed performance results' are provided yet contains no quantitative values, error bars, or system sizes; moving a representative timing or speedup figure into the abstract would improve immediate readability.
[Figures] Figure captions for the performance plots should explicitly state the number of nodes, basis-set sizes, and whether single- or double-precision arithmetic was used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript describing the GPU porting of Abinit's plane-wave DFT solver. The feedback correctly identifies the need for explicit numerical validation and clearer description of algorithmic equivalence, both of which we will address in the revision. Our point-by-point responses follow.

Referee: [§5, Performance results, Tables 1-3] The reported timings for LOBPCG and Chebyshev filtering on CPU versus CPU-GPU nodes do not include any side-by-side numerical equivalence metrics (e.g., total energy differences, eigenvalue residuals, or iteration counts to convergence) for identical inputs and systems. This is load-bearing for the claim that the algorithmic revisions preserve the original numerical accuracy and convergence properties while delivering the speedups.

Authors: We agree that side-by-side numerical equivalence metrics are necessary to substantiate that the GPU ports preserve accuracy and convergence behavior. In the revised manuscript we will add a dedicated subsection (or supplementary table) in §5 that reports, for each benchmark system and both diagonalization methods, the total energy, maximum eigenvalue residual, and number of iterations to convergence on CPU-only versus CPU-GPU nodes using identical input parameters. These quantities will be shown to agree to within double-precision round-off, confirming that the observed speedups arise from hardware acceleration rather than altered numerics.

Revision: yes

Referee: [Implementation section, preceding the benchmarks] The description of the revisions to the iterative diagonalization does not specify how the GPU paths maintain mathematical equivalence to the CPU versions (e.g., identical preconditioning, filtering polynomials, or convergence criteria). Without this, it is unclear whether observed speedups reflect true acceleration or altered iteration behavior.

Authors: We accept that the implementation section would benefit from greater explicitness on mathematical equivalence. We will revise the text to state that the GPU paths employ exactly the same preconditioning operators, the same Chebyshev polynomial degrees and filtering coefficients, and the identical convergence thresholds and stopping criteria as the CPU implementations. The algorithmic changes are limited to reordering and offloading linear-algebra and FFT kernels to vendor libraries while preserving the mathematical structure of both LOBPCG and Chebyshev filtering; iteration counts and residual histories therefore remain unchanged.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical implementation and benchmarking report

The paper is a porting and performance benchmarking study of Abinit on GPUs. It describes algorithmic revisions to iterative diagonalization (LOBPCG and Chebyshev filtering) for GPU efficiency and reports timing comparisons between CPU and heterogeneous nodes. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or self-citation load-bearing arguments exist. Central claims rest on measured wall-clock times and GPU utilization, which are direct empirical observations rather than quantities constructed from the paper's own inputs. The absence of any claimed mathematical equivalence or predictive derivation precludes circularity by the defined criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard Kohn-Sham DFT and established numerical linear algebra; no new free parameters, axioms beyond domain standards, or invented entities are introduced.

axioms (1)

Domain assumption: standard Kohn-Sham equations discretized in a plane-wave basis.
The paper assumes the validity of plane-wave DFT as implemented in Abinit without re-deriving it.
