A Hardware Accelerator for the Goemans-Williamson Algorithm
Pith reviewed 2026-05-18 10:29 UTC · model grok-4.3
The pith
Higher internal precision in conjugate gradient reduces time to solution for large Max-Cut SDP relaxations by a factor that grows with system size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When using indirect matrix inversion methods like Conjugate Gradient, which have lower complexity than direct methods and are therefore used in very large problems, increasing the internal working precision reduces the time to solution by a factor that increases with the system size on a hardware architecture that runs natively on extended precision.
What carries the argument
Conjugate gradient solver using extended floating point precision inside interior point methods for the Goemans-Williamson semidefinite relaxation of Max-Cut.
If this is right
- Time to solution decreases more significantly for larger systems.
- The method enables practical use of worst-case guaranteed Max-Cut approximations in performance-critical scenarios.
- Hardware designs can leverage native extended precision to translate iteration reductions into direct speedups.
Where Pith is reading between the lines
- The technique may apply to other large-scale convex optimization tasks that rely on iterative linear solvers.
- Future hardware could integrate this precision strategy to improve classical solvers used alongside quantum optimization methods.
Load-bearing premise
The hardware can execute extended-precision arithmetic natively with negligible overhead outside the linear solver itself.
What would settle it
Build a prototype hardware accelerator and compare the actual runtime for solving Max-Cut SDP instances of increasing size using standard versus extended precision in the conjugate gradient step.
Figures
read the original abstract
The combinatorial problem Max-Cut has become a benchmark in the evaluation of local search heuristics for both quantum and classical optimisers. In contrast to local search, which only provides average-case performance guarantees, the convex semidefinite relaxation of Max-Cut by Goemans and Williamson, provides worst-case guarantees and is therefore suited to both the construction of benchmarks and in applications to performance-critic scenarios. We show how extended floating point precision can be incorporated in algebraic subroutines in convex optimisation, namely in indirect matrix inversion methods like Conjugate Gradient, which are used in Interior Point Methods in the case of very large problem sizes. Also, an estimate is provided of the expected acceleration of the time to solution for a hardware architecture that runs natively on extended precision. Specifically, when using indirect matrix inversion methods like Conjugate Gradient, which have lower complexity than direct methods and are therefore used in very large problems, we see that increasing the internal working precision reduces the time to solution by a factor that increases with the system size.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a hardware accelerator for the Goemans-Williamson algorithm applied to Max-Cut. It focuses on incorporating extended floating-point precision into indirect solvers such as Conjugate Gradient within Interior Point Methods for the SDP relaxation, which are used at large problem sizes. The central claim is that raising internal working precision reduces iteration counts sufficiently that time-to-solution improves by a factor that grows with system size; an estimate of the resulting wall-clock acceleration is provided for a hardware architecture that executes extended-precision arithmetic natively.
Significance. If the unshown convergence model, iteration counts, and hardware-overhead analysis can be supplied and validated, the work would demonstrate a concrete precision-iteration tradeoff that improves scaling of iterative linear solvers in SDP-based combinatorial optimization. This could inform the design of specialized accelerators for large-scale convex relaxations, particularly where direct methods become prohibitive.
major comments (2)
- [Abstract] Abstract: the estimate of expected acceleration is stated without any convergence model, measured iteration counts for CG at different precisions, or error analysis showing how higher precision reduces iterations enough to produce a speedup factor that increases with system size. The central scaling claim therefore rests on unshown calculations.
- [Abstract] The manuscript supplies no cycle counts, area overhead figures, or micro-architectural description demonstrating that extended-precision operations can be realized with negligible overhead relative to the linear-algebra kernel. Without this, the translation from iteration reduction to wall-clock speedup remains an unverified assumption.
minor comments (1)
- [Abstract] The abstract refers to 'an estimate' of acceleration but does not indicate whether this estimate is derived analytically or from simulation; clarifying the derivation method would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The points raised identify areas where additional supporting details would improve clarity and rigor. We address each major comment below and describe the changes we will incorporate in the revised version.
read point-by-point responses
-
Referee: [Abstract] Abstract: the estimate of expected acceleration is stated without any convergence model, measured iteration counts for CG at different precisions, or error analysis showing how higher precision reduces iterations enough to produce a speedup factor that increases with system size. The central scaling claim therefore rests on unshown calculations.
Authors: We agree that the abstract, being a high-level summary, does not present the supporting convergence model, iteration counts, or error analysis. The estimate in the manuscript is based on the established dependence of CG convergence on numerical stability and condition number, where extended precision reduces rounding-error accumulation that grows with problem size in SDP relaxations. In the revision we will add a concise convergence model and error analysis (drawing on standard CG theory) to the abstract or a new subsection, and we will include available iteration-count data from our simulations to substantiate that the speedup factor increases with system size. revision: yes
-
Referee: [Abstract] The manuscript supplies no cycle counts, area overhead figures, or micro-architectural description demonstrating that extended-precision operations can be realized with negligible overhead relative to the linear-algebra kernel. Without this, the translation from iteration reduction to wall-clock speedup remains an unverified assumption.
Authors: We acknowledge that the manuscript does not supply explicit cycle counts, area-overhead estimates, or a micro-architectural description. The current estimate assumes native extended-precision execution with low overhead, consistent with known VLSI trade-offs for floating-point units. In the revised manuscript we will add a dedicated paragraph or short appendix providing high-level cycle-count estimates, area-overhead figures relative to standard FP units, and a brief micro-architectural sketch to justify the negligible-overhead assumption and strengthen the link to wall-clock speedup. revision: yes
Circularity Check
No circularity: claim follows from standard CG complexity scaling without reduction to fitted inputs or self-citations
full rationale
The paper derives the claim that higher internal precision reduces CG iteration count (and thus time-to-solution by a factor growing with system size) from the known dependence of conjugate-gradient convergence on condition number and working precision in large-scale SDP relaxations inside Goemans-Williamson. No equation, prediction, or estimate is obtained by fitting a parameter to a subset of data and then renaming the fit as a prediction; no load-bearing step invokes a self-citation whose authors overlap with the present work; and the hardware-overhead assumption is stated separately rather than smuggled into the algebraic derivation. The result is therefore self-contained against external numerical-linear-algebra benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Higher internal floating-point precision reduces the iteration count of conjugate gradient when solving the Newton systems inside interior-point methods for the Max-Cut SDP.
Reference graph
Works this paper leans on
-
[1]
for extended precision computing. This RISC-V pro- cessor has been fitted with a variable and extended preci- sion floating point unit and a corresponding instruction set extension. It supports up to 512-bit floating point precision with performance depending on input and out- put precision. We have validated and measured this hard- FIG. 3. Ratio between ...
-
[2]
Combinatorial op- timization: theory and algorithms
KORTE, Bernhard; VYGEN, Jens. Combinatorial op- timization: theory and algorithms. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008
work page 2008
-
[3]
GENDREAU, Michel, et al. (ed.). Handbook of meta- heuristics. New York: Springer, 2010
work page 2010
-
[4]
Mathemati- cal foundation of quantum annealing
MORITA, Satoshi; NISHIMORI, Hidetoshi. Mathemati- cal foundation of quantum annealing. Journal of Mathe- matical Physics, 2008, vol. 49, no 12
work page 2008
-
[5]
Simulated quantum annealing can be exponentially faster than clas- sical simulated annealing
CROSSON, Elizabeth; HARROW, Aram W. Simulated quantum annealing can be exponentially faster than clas- sical simulated annealing. 2016 IEEE 57th Annual Sym- posium on Foundations of Computer Science (FOCS). IEEE, 2016. p. 714-723
work page 2016
-
[6]
Perspectives of quantum anneal- ing: Methods and implementations
HAUKE, Philipp, et al. Perspectives of quantum anneal- ing: Methods and implementations. Reports on Progress in Physics, 2020, vol. 83, no 5, p. 054401
work page 2020
-
[7]
A coherent Ising machine for 2000-node optimization problems
INAGAKI, Takahiro, et al. A coherent Ising machine for 2000-node optimization problems. Science, 2016, vol. 354, no 6312, p. 603-606
work page 2000
-
[8]
Coherent Ising ma- chines—optical neural networks operating at the quan- tum limit
YAMAMOTO, Yoshihisa, et al. Coherent Ising ma- chines—optical neural networks operating at the quan- tum limit. npj Quantum Information, 2017, vol. 3, no 1, p. 49
work page 2017
-
[9]
A time-division multiplex- ing Ising machine on FPGAs
YAMAMOTO, Kasho, et al. A time-division multiplex- ing Ising machine on FPGAs. Proceedings of the 8th In- ternational Symposium on Highly Efficient Accelerators and Reconfigurable Technologies. 2017. p. 1-6
work page 2017
-
[10]
High-speed sparse Ising model on FPGA
MINAMISAWA, Akira; IIMURA, Ryoma; KAWA- HARA, Takayuki. High-speed sparse Ising model on FPGA. 2019 IEEE 62nd International Midwest Sympo- sium on Circuits and Systems (MWSCAS). IEEE, 2019. p. 670-673
work page 2019
-
[11]
GPU-based Ising computing for solving max-cut combinatorial optimization problems
COOK, Chase, et al. GPU-based Ising computing for solving max-cut combinatorial optimization problems. Integration, 2019, vol. 69, p. 335-344
work page 2019
-
[12]
FPGA-based simulated bifurcation machine
TATSUMURA, Kosuke; DIXON, Alexander R.; GOTO, Hayato. FPGA-based simulated bifurcation machine. 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2019. p. 59-66
work page 2019
-
[13]
Scaling out Ising machines using a multi-chip ar- chitecture for simulated bifurcation
TATSUMURA, Kosuke; YAMASAKI, Masaya; GOTO, Hayato. Scaling out Ising machines using a multi-chip ar- chitecture for simulated bifurcation. Nature Electronics, 2021, vol. 4, no 3, p. 208-217
work page 2021
-
[14]
Application of digital annealer for faster combinatorial optimization
SAO, Masataka, et al. Application of digital annealer for faster combinatorial optimization. Fujitsu Scientific and Technical Journal, 2019, vol. 55, no 2, p. 45-51
work page 2019
-
[15]
Ising-FPGA: A spintronics-based reconfigurable Ising model solver
MONDAL, Ankit; SRIVASTAVA, Ankur. Ising-FPGA: A spintronics-based reconfigurable Ising model solver. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2020, vol. 26, no 1, p. 1-27
work page 2020
-
[16]
100,000-spin coherent Ising machine
HONJO, Toshimori, et al. 100,000-spin coherent Ising machine. Science advances, 2021, vol. 7, no 40, p. eabh0952
work page 2021
-
[17]
Ising machines as hardware solvers of combinatorial 7 optimization problems
MOHSENI, Naeimeh; MCMAHON, Peter L.; BYRNES, Tim. Ising machines as hardware solvers of combinatorial 7 optimization problems. Nature Reviews Physics, 2022, vol. 4, no 6, p. 363-379
work page 2022
-
[18]
All-to-all reconfigurability with sparse and higher-order Ising machines
NIKHAR, Srijan, et al. All-to-all reconfigurability with sparse and higher-order Ising machines. Nature Commu- nications, 2024, vol. 15, no 1, p. 8977
work page 2024
-
[19]
879- approximation algorithms for max cut and max 2sat
GOEMANS, Michel X.; WILLIAMSON, David P. . 879- approximation algorithms for max cut and max 2sat. Proceedings of the Twenty-sixth Annual ACM Sympo- sium on Theory of Computing. 1994. p. 422-431
work page 1994
-
[20]
GOEMANS, Michel X.; WILLIAMSON, David P. Im- proved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 1995, vol. 42, no 6, p. 1115-1145
work page 1995
-
[21]
Obstacles to variational quantum optimization from symmetry protection
BRAVYI, Sergey, et al. Obstacles to variational quantum optimization from symmetry protection. Physical review letters, 2020, vol. 125, no 26, p. 260505
work page 2020
-
[22]
On Khot’s unique games conjecture
TREVISAN, Luca. On Khot’s unique games conjecture. Bulletin of the American Mathematical Society, 2012, vol. 49, no 1, p. 91-111
work page 2012
-
[23]
Approximating the value of two power proof systems, with applications to max 2sat and max dicut
FEIGE, Uriel; GOEMANS, Michel. Approximating the value of two power proof systems, with applications to max 2sat and max dicut. Proceedings third Israel sym- posium on the theory of computing and systems. IEEE,
-
[24]
GOEMANS, Michel X.; WILLIAMSON, David. Approx- imation algorithms for MAX-3-CUT and other prob- lems via complex semidefinite programming.Proceedings of the thirty-third annual ACM symposium on Theory of computing. 2001. p. 443-452
work page 2001
-
[25]
ALIZADEH, Farid. Interior point methods in semidefi- nite programming with applications to combinatorial op- timization. SIAM journal on Optimization, 1995, vol. 5, no 1, p. 13-51
work page 1995
-
[26]
BOYD, Stephen P.; VANDENBERGHE, Lieven. Convex optimization. Cambridge university press, 2004
work page 2004
-
[27]
MPFR: A multiple-precision binary floating-point library with correct rounding
FOUSSE, Laurent, et al. MPFR: A multiple-precision binary floating-point library with correct rounding. ACM Transactions on Mathematical Software (TOMS), 2007, vol. 33, no 2, p. 13-es
work page 2007
-
[28]
A new stopping criterion for Krylov solvers applied in interior point meth- ods
ZANETTI, Filippo; GONDZIO, Jacek. A new stopping criterion for Krylov solvers applied in interior point meth- ods. SIAM Journal on Scientific Computing, 2023, vol. 45, no 2, p. A703-A728
work page 2023
-
[29]
Interior methods for nonlinear optimization
FORSGREN, Anders; GILL, Philip E.; WRIGHT, Mar- garet H. Interior methods for nonlinear optimization. SIAM review, 2002, vol. 44, no 4, p. 525-597
work page 2002
-
[30]
Iterative methods for solving lin- ear systems
GREENBAUM, Anne. Iterative methods for solving lin- ear systems. Society for Industrial and Applied Mathe- matics, 1997
work page 1997
-
[31]
E. Guthmuller et al., ”Xvpfloat: RISC-V ISA Extension for Variable Extended Precision Floating Point Compu- tation,” in IEEE Transactions on Computers, vol. 73, no. 7, pp. 1683-1697, July 2024
work page 2024
- [32]
-
[33]
E. Guthmuller, C. Fuguet, A. Bocco, J. Fereyre, A. Evans, Y. Durand, ”Variable and Extended Precision (VRP) Accelerator Implemented in a 22nm SoC,” 2025, Electronics Letters, 61 (1), art. no. e70255s
work page 2025
-
[34]
Roofline: An insightful visual performance model for multicore ar- chitectures,
S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore ar- chitectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009
work page 2009
-
[35]
K. Kim and M. -j. Park, ”Present and Future, Challenges of High Bandwith Memory (HBM),” 2024 IEEE Interna- tional Memory Workshop (IMW), Seoul, Korea, Republic of, 2024, pp. 1-4. 8 APPENDIX A. Primal-Dual Barrier Method for Semidefinite Programming By introducing the barrierµ, the original SDP cost function is modified with a log-determinant barrier func-...
work page 2024
-
[36]
Spatial locality: successive memory accesses are generally done to close locations, most often con- secutive memory addresses
-
[37]
Temporal locality: a memory location has a high chance to be accessed multiple times during the execution of the program. To exploit spatial locality, caches store blocks of data, calledcache linesorcache blocks, instead of a single vari- able. The VXP accelerator first level caches implements 64B cache lines, which is typical for general purpose pro- ces...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.