D-PDLP: Scaling PDLP to Distributed Multi-GPU Systems

Dongdong Ge; Hongpei Li; Huikang Liu; Yicheng Huang; Yinyu Ye

arxiv: 2601.07628 · v3 · submitted 2026-01-12 · 🧮 math.OC · cs.DC

D-PDLP: Scaling PDLP to Distributed Multi-GPU Systems

Hongpei Li , Yicheng Huang , Huikang Liu , Dongdong Ge , Yinyu Ye This is my paper

Pith reviewed 2026-05-16 15:13 UTC · model grok-4.3

classification 🧮 math.OC cs.DC

keywords distributed PDHGmulti-GPU linear programmingtwo-dimensional matrix partitioningPrimal-Dual Hybrid Gradientscalabilitylarge-scale LPcuPDLPx

0 comments

The pith

D-PDLP extends the Primal-Dual Hybrid Gradient method to multi-GPU systems through two-dimensional grid partitioning of the constraint matrix to solve large linear programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces D-PDLP, a distributed framework that scales the PDHG algorithm for linear programming across multiple GPUs. It partitions the constraint matrix in two dimensions and applies block-wise random permutation together with nonzero-aware strategies to distribute the work. The goal is to overcome single-GPU memory and speed limits while keeping the same double-precision accuracy as the original method. A reader would care because many industrial linear programs are too large for one GPU, and this approach aims to make them solvable on ordinary clusters.

Core claim

D-PDLP extends the Primal-Dual Hybrid Gradient method to distributed multi-GPU systems via a practical two-dimensional grid partitioning of the constraint matrix, combined with block-wise random permutation and nonzero-aware matrix partitioning. This distributes the intensive computations of each PDHG iteration across GPUs with low communication overhead, yielding strong scalability and high performance while preserving full FP64 numerical accuracy on MIPLIB, Mittelmann, and real-world instances.

What carries the argument

Two-dimensional grid partitioning of the constraint matrix together with block-wise random permutation and nonzero-aware strategies, which spreads the matrix-vector products and updates required by PDHG iterations across GPUs.

If this is right

The framework achieves substantial speedups on massive LP instances by harnessing multiple GPUs.
Strong scalability occurs with relatively low communication overhead.
Full FP64 accuracy is retained, matching the numerical behavior of the single-node cuPDLPx solver.
Effective performance is observed on both standard benchmarks and huge-scale real-world datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same partitioning ideas might extend to other first-order methods that rely on repeated matrix-vector multiplications.
Users could solve previously intractable industrial LPs on commodity multi-GPU servers in practical time.
Varying the grid dimensions could be tested to reach even larger problem sizes on bigger clusters.
The load-balancing approach might apply to heterogeneous GPU setups without major redesign.

Load-bearing premise

The two-dimensional grid partitioning with block-wise random permutation and nonzero-aware strategies will preserve the convergence behavior and numerical stability of the original PDHG method for arbitrary large-scale LP instances without introducing unacceptable load imbalance or communication bottlenecks.

What would settle it

A large LP instance where the multi-GPU version either diverges, requires many more iterations than the single-GPU baseline, or produces a solution whose objective or feasibility error exceeds machine epsilon would show the claim does not hold.

Figures

Figures reproduced from arXiv: 2601.07628 by Dongdong Ge, Hongpei Li, Huikang Liu, Yicheng Huang, Yinyu Ye.

**Figure 2.** Figure 2: Comparison of matrix permutation strategies for distributed sparse optimization. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

read the original abstract

We present a distributed framework of the Primal-Dual Hybrid Gradient (PDHG) algorithm for solving massive-scale linear programming (LP) problems. Although PDHG-based solvers demonstrate strong performance on single-node GPU architectures, their applicability to industrial-scale instances is often limited by single-GPU computational throughput. To overcome these challenges, we propose D-PDLP, the first Distributed PDLP framework, which extends PDHG to a multi-GPU setting via a practical two-dimensional grid partitioning of the constraint matrix. To improve load balance and computational efficiency, we introduce a block-wise random permutation strategy combined with nonzero-aware matrix partitioning. By distributing the intensive computation required in PDHG iterations, the proposed framework harnesses multi-GPU parallelism to achieve substantial speedups with relatively low communication overhead. Extensive experiments on standard LP benchmarks (including MIPLIB and Mittelmann instances) as well as huge-scale real-world datasets show that our distributed implementation, built upon cuPDLPx, achieves strong scalability and high performance while preserving full FP64 numerical accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

D-PDLP gives the first explicit multi-GPU version of PDLP via 2D partitioning and block-wise permutation, but the scaling story still needs the actual numbers and equivalence checks.

read the letter

D-PDLP takes the single-node cuPDLPx solver and spreads the PDHG iterations across multiple GPUs with a two-dimensional grid split of the constraint matrix plus block-wise random permutation and nonzero-aware partitioning. That combination is new for this solver family and the paper spells out the communication pattern and how they keep the work distributed. The choice to stay in full FP64 is sensible for the target industrial problems. Experiments on MIPLIB, Mittelmann, and some large real-world instances are the right test bed. The engineering description of how the matvecs and dual updates are handled looks practical and low-overhead on paper. The soft spot is that the abstract supplies no speedup figures, iteration counts, or scaling plots, so it is still unclear whether the heuristics preserve the original convergence rate or just add synchronization cost. The stress-test point about possible load imbalance on irregular sparsity is reasonable; without seeing the per-GPU nonzero counts or a short argument that the distributed updates remain mathematically equivalent to the serial trajectory, the strong-scaling claim stays provisional. This is aimed at people already running large LPs on GPUs who need to move to clusters. A reader working on scalable optimization code would get concrete implementation ideas once the full results are visible. I would send it to peer review so referees can examine the code, the exact timings, and whether the partitioning choices hold up on the hardest instances.

Referee Report

2 major / 0 minor

Summary. The paper presents D-PDLP, a distributed multi-GPU extension of the Primal-Dual Hybrid Gradient (PDHG) algorithm for large-scale linear programming. It employs a two-dimensional grid partitioning of the constraint matrix together with block-wise random permutation and nonzero-aware strategies to improve load balance, built on top of cuPDLPx. The central claim is that this framework achieves strong scalability and high performance on MIPLIB, Mittelmann, and real-world instances while preserving full FP64 numerical accuracy and incurring only low communication overhead.

Significance. If the empirical claims hold, the work would be significant for extending single-GPU PDHG solvers to distributed settings, enabling solution of industrial-scale LPs that exceed single-device memory limits. The practical heuristics for partitioning and the reported preservation of FP64 accuracy are strengths, but the absence of detailed quantitative speedup numbers, iteration counts, or convergence diagnostics in the abstract weakens the assessed impact.

major comments (2)

[Abstract] Abstract: the claim of achieving 'strong scalability and high performance' rests on experiments whose quantitative results (speedup factors, iteration counts, wall-clock times, or error bars) are not supplied in the text, leaving the central performance assertion only partially supported.
[Distributed framework description] The two-dimensional grid partitioning with block-wise random permutation and nonzero-aware splits: no analysis or proof is provided that the distributed matvec updates remain mathematically equivalent to the serial PDHG trajectory, so it is unclear whether the method preserves the original convergence rate or step-size selection for matrices with irregular sparsity; this directly affects the claim that numerical accuracy and iteration behavior are unchanged.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline targeted revisions to strengthen the manuscript's claims and clarity.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of achieving 'strong scalability and high performance' rests on experiments whose quantitative results (speedup factors, iteration counts, wall-clock times, or error bars) are not supplied in the text, leaving the central performance assertion only partially supported.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we will incorporate representative metrics, including observed speedup factors (near-linear scaling to 8-16 GPUs), average iteration counts, wall-clock times, and a note on variance across MIPLIB, Mittelmann, and real-world instances. revision: yes
Referee: [Distributed framework description] The two-dimensional grid partitioning with block-wise random permutation and nonzero-aware splits: no analysis or proof is provided that the distributed matvec updates remain mathematically equivalent to the serial PDHG trajectory, so it is unclear whether the method preserves the original convergence rate or step-size selection for matrices with irregular sparsity; this directly affects the claim that numerical accuracy and iteration behavior are unchanged.

Authors: The 2D grid partitioning together with the communication pattern computes exactly the same floating-point matrix-vector products (A x and A^T y) as the serial implementation; results are assembled via all-reduce collectives in FP64 with no approximation or reordering of arithmetic. Consequently the PDHG iterates, step-size selection, and convergence behavior are identical regardless of sparsity pattern. We will add a dedicated subsection with pseudocode and a short equivalence argument to make this explicit. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical validation of distributed PDHG implementation

full rationale

The paper describes an engineering extension of the PDHG algorithm to multi-GPU via 2D grid partitioning, block-wise random permutation, and nonzero-aware splits. Central claims of strong scalability, low communication overhead, and preserved FP64 accuracy are supported by experiments on MIPLIB, Mittelmann, and real-world instances rather than any derivation. No equation, theorem, or prediction reduces to a fitted parameter, self-definition, or self-citation chain. Convergence preservation is asserted as a consequence of the partitioning design and verified empirically; it does not equate to its inputs by construction. The work is self-contained as a practical implementation contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework inherits standard convergence assumptions of the PDHG algorithm for linear programs and standard models of GPU-to-GPU communication; no new free parameters or invented entities are introduced in the abstract description.

axioms (1)

domain assumption PDHG iterations converge to the optimal solution for feasible linear programs under standard step-size rules
Invoked when claiming that distribution preserves solution quality and accuracy.

pith-pipeline@v0.9.0 · 5492 in / 1279 out tokens · 37788 ms · 2026-05-16T15:13:49.186442+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

cuhallar: A gpu accelerated low-rank augmented lagrangian method for large-scale semidefinite programming.arXiv preprint arXiv:2505.13719, 2025

Jacob M Aguirre, Diego Cifuentes, Vincent Guigues, Renato DC Monteiro, Victor Hugo Nascimento, and Arnesh Su- janani. cuhallar: A gpu accelerated low-rank augmented lagrangian method for large-scale semidefinite programming. arXiv preprint arXiv:2505.13719,

work page arXiv
[2]

, year = 2017, journal =

doi: 10.1137/141000671. URL https://epubs.siam.org/doi/10.1137/141000671. Kaihuang Chen, Defeng Sun, Yancheng Yuan, Guojun Zhang, and Xinyuan Zhao. Hpr-qp: A dual halpern peaceman- rachford method for solving large-scale convex composite quadratic programming.arXiv preprint arXiv:2507.02470,

work page doi:10.1137/141000671
[3]

New developments of ADMM-based interior point methods for linear programming and conic programming

Qi Deng, Qing Feng, Wenzhi Gao, Dongdong Ge, Bo Jiang, Yuntian Jiang, Jingsong Liu, Tianhao Liu, Chenyu Xue, Yinyu Ye, et al. New developments of ADMM-based interior point methods for linear programming and conic programming. arXiv preprint arXiv:2209.01793,

work page arXiv
[4]

New understandings and computation on augmented lagrangian methods for low-rank semidefinite programming.arXiv preprint arXiv:2505.15775, 2025

Lijun Ding, Haihao Lu, and Jinwen Yang. New understandings and computation on augmented lagrangian methods for low-rank semidefinite programming. arXiv preprint arXiv:2505.15775,

work page arXiv
[5]

Cardinal optimizer (copt) user guide.arXiv preprint arXiv:2208.14314, 2022

Dongdong Ge, Qi Huangfu, Zizhuo Wang, Jian Wu, and Yinyu Ye. Cardinal optimizer (copt) user guide.arXiv preprint arXiv:2208.14314,

work page arXiv
[6]

Miplib 2017: data-driven compilation of the 6th mixed-integer programming library

Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, et al. Miplib 2017: data-driven compilation of the 6th mixed-integer programming library. Mathematical Programming Computation, 13(3):443–490,

work page 2017
[7]

A low-rank ADMM splitting approach for semidefinite programming.arXiv preprint arXiv:2403.09133, 2024

Qiushi Han, Chenxi Li, Zhenwei Lin, Caihua Chen, Qi Deng, Dongdong Ge, Huikang Liu, and Yinyu Ye. A low-rank admm splitting approach for semidefinite programming. arXiv preprint arXiv:2403.09133, 2024a. Qiushi Han, Zhenwei Lin, Hanwen Liu, Caihua Chen, Qi Deng, Dongdong Ge, and Yinyu Ye. Accelerating low-rank factorization-based semidefinite programming a...

work page arXiv
[8]

Restarted primal-dual hybrid conjugate gradient method for large-scale quadratic programming.arXiv preprint arXiv:2405.16160, 2024

Yicheng Huang, Wanyu Zhang, Hongpei Li, Dongdong Ge, Huikang Liu, and Yinyu Ye. Restarted primal-dual hybrid conjugate gradient method for large-scale quadratic programming. arXiv preprint arXiv:2405.16160,

work page arXiv
[9]

A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025

Zhenwei Lin, Zikai Xiong, Dongdong Ge, and Yinyu Ye. Pdcs: A primal-dual large-scale conic programming solver with gpu enhancements. arXiv preprint arXiv:2505.00311,

work page arXiv
[10]

Pdhcg: A scalable first-order method for large-scale competitive market equilibrium computation

Huikang Liu, Yicheng Huang, Hongpei Li, Dongdong Ge, and Yinyu Ye. Pdhcg: A scalable first-order method for large-scale competitive market equilibrium computation. arXiv preprint arXiv:2506.06258,

work page arXiv
[11]

Haihao Lu and Jinwen Yang. cupdlp. jl: A gpu implementation of restarted primal-dual hybrid gradient for linear programming in julia. arXiv preprint arXiv:2311.12180, 2023a. Haihao Lu and Jinwen Yang. A practical and optimal first-order method for large-scale convex quadratic programming. arXiv preprint arXiv:2311.07710, 2023b. Haihao Lu and Jinwen Yang. ...

work page arXiv
[12]

cupdlpx: A further enhanced gpu-based first-order solver for linear programming

Haihao Lu, Zedong Peng, and Jinwen Yang. cupdlpx: A further enhanced gpu-based first-order solver for linear programming. arXiv preprint arXiv:2507.14051,

work page arXiv
[13]

Benchmarks for optimization software (2010)

Hans Mittelmann. Benchmarks for optimization software (2010). URL http://plato. asu. edu/bench. html,

work page 2010
[14]

Solving large multicommodity network flow problems on gpus.arXiv preprint arXiv:2501.17996, 2025

Fangzhao Zhang and Stephen Boyd. Solving large multicommodity network flow problems on gpus. arXiv preprint arXiv:2501.17996,

work page arXiv
[15]

UNDER REVIEW

12 PREPRINT. UNDER REVIEW. A Complete Related Works Traditional linear programming solvers, including the simplex method and interior-point methods (IPMs), are highly effective for small- to medium-scale problems. These methods deliver reliable, high-accuracy solutions across various applications and are supported by commercial solvers such as MOSEK [ApS,...

work page 2019

[1] [1]

cuhallar: A gpu accelerated low-rank augmented lagrangian method for large-scale semidefinite programming.arXiv preprint arXiv:2505.13719, 2025

Jacob M Aguirre, Diego Cifuentes, Vincent Guigues, Renato DC Monteiro, Victor Hugo Nascimento, and Arnesh Su- janani. cuhallar: A gpu accelerated low-rank augmented lagrangian method for large-scale semidefinite programming. arXiv preprint arXiv:2505.13719,

work page arXiv

[2] [2]

, year = 2017, journal =

doi: 10.1137/141000671. URL https://epubs.siam.org/doi/10.1137/141000671. Kaihuang Chen, Defeng Sun, Yancheng Yuan, Guojun Zhang, and Xinyuan Zhao. Hpr-qp: A dual halpern peaceman- rachford method for solving large-scale convex composite quadratic programming.arXiv preprint arXiv:2507.02470,

work page doi:10.1137/141000671

[3] [3]

New developments of ADMM-based interior point methods for linear programming and conic programming

Qi Deng, Qing Feng, Wenzhi Gao, Dongdong Ge, Bo Jiang, Yuntian Jiang, Jingsong Liu, Tianhao Liu, Chenyu Xue, Yinyu Ye, et al. New developments of ADMM-based interior point methods for linear programming and conic programming. arXiv preprint arXiv:2209.01793,

work page arXiv

[4] [4]

New understandings and computation on augmented lagrangian methods for low-rank semidefinite programming.arXiv preprint arXiv:2505.15775, 2025

Lijun Ding, Haihao Lu, and Jinwen Yang. New understandings and computation on augmented lagrangian methods for low-rank semidefinite programming. arXiv preprint arXiv:2505.15775,

work page arXiv

[5] [5]

Cardinal optimizer (copt) user guide.arXiv preprint arXiv:2208.14314, 2022

Dongdong Ge, Qi Huangfu, Zizhuo Wang, Jian Wu, and Yinyu Ye. Cardinal optimizer (copt) user guide.arXiv preprint arXiv:2208.14314,

work page arXiv

[6] [6]

Miplib 2017: data-driven compilation of the 6th mixed-integer programming library

Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, et al. Miplib 2017: data-driven compilation of the 6th mixed-integer programming library. Mathematical Programming Computation, 13(3):443–490,

work page 2017

[7] [7]

A low-rank ADMM splitting approach for semidefinite programming.arXiv preprint arXiv:2403.09133, 2024

Qiushi Han, Chenxi Li, Zhenwei Lin, Caihua Chen, Qi Deng, Dongdong Ge, Huikang Liu, and Yinyu Ye. A low-rank admm splitting approach for semidefinite programming. arXiv preprint arXiv:2403.09133, 2024a. Qiushi Han, Zhenwei Lin, Hanwen Liu, Caihua Chen, Qi Deng, Dongdong Ge, and Yinyu Ye. Accelerating low-rank factorization-based semidefinite programming a...

work page arXiv

[8] [8]

Restarted primal-dual hybrid conjugate gradient method for large-scale quadratic programming.arXiv preprint arXiv:2405.16160, 2024

Yicheng Huang, Wanyu Zhang, Hongpei Li, Dongdong Ge, Huikang Liu, and Yinyu Ye. Restarted primal-dual hybrid conjugate gradient method for large-scale quadratic programming. arXiv preprint arXiv:2405.16160,

work page arXiv

[9] [9]

A practical GPU-enhanced matrix- free primal-dual method for large-scale conic programs, 2025

Zhenwei Lin, Zikai Xiong, Dongdong Ge, and Yinyu Ye. Pdcs: A primal-dual large-scale conic programming solver with gpu enhancements. arXiv preprint arXiv:2505.00311,

work page arXiv

[10] [10]

Pdhcg: A scalable first-order method for large-scale competitive market equilibrium computation

Huikang Liu, Yicheng Huang, Hongpei Li, Dongdong Ge, and Yinyu Ye. Pdhcg: A scalable first-order method for large-scale competitive market equilibrium computation. arXiv preprint arXiv:2506.06258,

work page arXiv

[11] [11]

Haihao Lu and Jinwen Yang. cupdlp. jl: A gpu implementation of restarted primal-dual hybrid gradient for linear programming in julia. arXiv preprint arXiv:2311.12180, 2023a. Haihao Lu and Jinwen Yang. A practical and optimal first-order method for large-scale convex quadratic programming. arXiv preprint arXiv:2311.07710, 2023b. Haihao Lu and Jinwen Yang. ...

work page arXiv

[12] [12]

cupdlpx: A further enhanced gpu-based first-order solver for linear programming

Haihao Lu, Zedong Peng, and Jinwen Yang. cupdlpx: A further enhanced gpu-based first-order solver for linear programming. arXiv preprint arXiv:2507.14051,

work page arXiv

[13] [13]

Benchmarks for optimization software (2010)

Hans Mittelmann. Benchmarks for optimization software (2010). URL http://plato. asu. edu/bench. html,

work page 2010

[14] [14]

Solving large multicommodity network flow problems on gpus.arXiv preprint arXiv:2501.17996, 2025

Fangzhao Zhang and Stephen Boyd. Solving large multicommodity network flow problems on gpus. arXiv preprint arXiv:2501.17996,

work page arXiv

[15] [15]

UNDER REVIEW

12 PREPRINT. UNDER REVIEW. A Complete Related Works Traditional linear programming solvers, including the simplex method and interior-point methods (IPMs), are highly effective for small- to medium-scale problems. These methods deliver reliable, high-accuracy solutions across various applications and are supported by commercial solvers such as MOSEK [ApS,...

work page 2019