Lightweight Gaussian Process Inference in C++ on Metal and CUDA

Yu-Hsueh Fang

arxiv: 2605.17898 · v1 · pith:JKMLER6Unew · submitted 2026-05-18 · 💻 cs.LG


Pith reviewed 2026-05-20 11:59 UTC · model grok-4.3

keywords Gaussian processes · inference · C++ · Metal · CUDA · performance · sparse approximation · machine learning

The pith

LightGP delivers faster Gaussian process inference in dependency-free C++ with Metal and CUDA backends.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LightGP, a C++17 library for Gaussian process regression that removes the overhead and dependencies of Python deep-learning frameworks like GPyTorch and GPflow. It implements four inference methods covering exact and approximate problems from small to half-million point scales, with tuned paths for CPU via Accelerate or OpenBLAS, Apple Metal, and NVIDIA CUDA. Benchmarks report clear speed gains, including 2.6-8.7 times faster exact GP on Apple M4 CPU and 2.3-6.7 times faster on RTX 3060 CUDA for moderate sizes, plus specialized optimizations like fused kernel-vector products and FFT-accelerated SKI. A sympathetic reader would care because this approach makes high-performance GP modeling practical on a wider range of hardware with a minimal software footprint.

Core claim

LightGP is a single static C++ library providing exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and structured kernel interpolation with FFT inference. On an Apple M4 its CPU path is 2.6-8.7 times faster than GPyTorch CPU for exact GP and roughly 1.5 times faster for sparse GP at every tested scale; on an NVIDIA RTX 3060 its CUDA path is 2.3-6.7 times faster for exact GP up to N=2048. A fused matrix-free kernel-vector product on Metal reaches 32 times the speed of the explicit path at N=20,000 with only O(N) memory, while an FFT-accelerated SKI matrix-vector product via Accelerate runs in sub-millisecond time at N=200,000.
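To make the O(N)-memory claim concrete, here is a minimal Python sketch of a matrix-free kernel-vector product next to the explicit path. It is not LightGP's Metal kernel: the RBF kernel, the noise term, and all function names are assumptions chosen for illustration. It only shows that the product can be accumulated row by row without ever storing the N x N matrix.

```python
import math

def rbf(x, y, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between two points given as sequences."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return variance * math.exp(-0.5 * sq / lengthscale ** 2)

def explicit_matvec(X, v, noise=1e-2):
    """Explicit path: build the full N x N matrix K + noise*I, then multiply."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]

def matrix_free_matvec(X, v, noise=1e-2):
    """Matrix-free path: kernel entries are generated and consumed on the fly,
    so only the output vector (O(N) extra memory) is ever stored."""
    out = []
    for i, xi in enumerate(X):
        acc = noise * v[i]
        for xj, vj in zip(X, v):
            acc += rbf(xi, xj) * vj
        out.append(acc)
    return out

# Hypothetical toy inputs: 50 two-dimensional points and a test vector.
X = [(0.1 * i, math.sin(0.3 * i)) for i in range(50)]
v = [math.cos(0.7 * i) for i in range(50)]
max_diff = max(abs(a - b)
               for a, b in zip(explicit_matvec(X, v), matrix_free_matvec(X, v)))
```

Both paths do the same O(N^2) arithmetic; the difference is memory traffic. On a GPU the fused form avoids writing and re-reading the kernel matrix, which is a plausible source of the gap the paper reports.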

What carries the argument

The four inference paths (exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and FFT-accelerated structured kernel interpolation) implemented as a dependency-free C++17 static library with hardware-specific backends for CPU, Metal, and CUDA.
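For readers unfamiliar with the first of these paths, the following is a minimal sketch of exact GP regression via a Cholesky factorization, written in plain Python. The helper names, the 1-D RBF kernel, and the fixed hyperparameters are illustrative assumptions; LightGP's actual implementation relies on tuned BLAS/LAPACK-style routines and is not reproduced here.

```python
import math

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """1-D squared-exponential kernel (an assumed choice for this sketch)."""
    return variance * math.exp(-0.5 * (a - b) ** 2 / lengthscale ** 2)

def cholesky(A):
    """Return lower-triangular L with L L^T = A for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_lower_transposed(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and variance of the latent function at each test input."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y_train))  # (K + noise*I)^{-1} y
    means, variances = [], []
    for xs in x_test:
        k_star = [rbf(xs, xi) for xi in x_train]
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        w = solve_lower(L, k_star)
        variances.append(rbf(xs, xs) - sum(wi * wi for wi in w))
    return means, variances

# Toy data: noiseless samples of sin(x) on a coarse grid.
x_train = [0.5 * i for i in range(12)]
y_train = [math.sin(x) for x in x_train]
means, variances = gp_predict(x_train, y_train, [1.25, 2.75])
```

The factorization costs O(N^3) time and O(N^2) memory, which is why the exact path is benchmarked only at moderate N and the other three paths exist for larger problems.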

If this is right

Practitioners can run exact GP regression up to N around 2000 on consumer GPUs with substantially lower latency than current Python libraries.
Approximate methods such as sparse variational and SKI become practical at N=500,000 while staying within modest memory limits.
Python users obtain high-performance GP inference through a single pip install without installing or managing large deep-learning frameworks.
Edge and mobile devices using Apple Metal can execute both exact and approximate GP models more efficiently than on general CPU paths.
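The scaling argument for SKI rests on a structural fact: with a stationary kernel on a regular 1-D grid of inducing points, the grid kernel matrix is symmetric Toeplitz, and a Toeplitz matrix-vector product can be computed in O(m log m) by embedding it in a circulant matrix and using the FFT. The sketch below illustrates that step in plain Python. The small recursive FFT stands in for a tuned routine such as vDSP, and all names and sizes are assumptions, not LightGP code.

```python
import cmath
import math

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1.0 if invert else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(a):
    """Inverse FFT with the 1/n normalization applied once at the top level."""
    n = len(a)
    return [z / n for z in fft(a, invert=True)]

def toeplitz_matvec_fft(c, v):
    """Multiply the symmetric Toeplitz matrix with first column c by v."""
    m = len(c)
    p = 1
    while p < 2 * m - 1:          # circulant size: power of two, at least 2m - 1
        p *= 2
    col = [0.0] * p               # first column of the embedding circulant
    for k in range(m):
        col[k] = c[k]
        if k > 0:
            col[p - k] = c[k]     # wrap-around entries mirror the Toeplitz column
    x = list(v) + [0.0] * (p - m) # zero-pad the input vector
    fc, fx = fft(col), fft(x)
    y = ifft([a * b for a, b in zip(fc, fx)])
    return [y[i].real for i in range(m)]

# Check against the dense product on a small grid with an RBF kernel.
m = 16
c = [math.exp(-0.5 * (0.2 * k) ** 2) for k in range(m)]
v = [math.sin(0.5 * i) for i in range(m)]
dense = [sum(c[abs(i - j)] * v[j] for j in range(m)) for i in range(m)]
fast = toeplitz_matvec_fft(c, v)
max_err = max(abs(a - b) for a, b in zip(dense, fast))
```

In a full SKI product this step is wrapped between two sparse interpolation multiplies, so the overall cost grows roughly linearly in N plus O(m log m) in the grid size.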

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A minimal-dependency library could be embedded directly into other C++ or Python applications to reduce overall system complexity for Bayesian modeling tasks.
The same lightweight pattern might be applied to other probabilistic models that currently rely on heavy framework dispatch.
Extending the backends to additional accelerators could make fast GP inference available on a broader set of embedded and cloud hardware.

Load-bearing premise

Benchmark speedups assume that problem setups, data, and hyperparameter choices are identical between LightGP and GPyTorch with no undisclosed differences in implementation details or hardware tuning.

What would settle it

Running identical GP regression tasks on the same hardware and data with matching hyperparameters and observing that LightGP is not consistently faster (or is slower) than GPyTorch at the reported scales would falsify the performance claims.
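A check of this kind could be organized roughly as follows. This is a sketch of a generic timing harness, not the paper's benchmark script: the function names, the warm-up and repeat counts, and the toy stand-in workloads are all assumptions. The point it illustrates is that both implementations receive the same inputs, their outputs are compared, and the speedup is computed from median runtimes.

```python
import math
import statistics
import time

def median_runtime(fn, args, repeats=5, warmup=1):
    """Median wall-clock time of fn(*args) after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup_table(baseline, candidate, make_inputs, sizes, repeats=5):
    """Time two implementations on identical inputs at each problem size."""
    rows = []
    for n in sizes:
        args = make_inputs(n)  # the same inputs go to both implementations
        agree = math.isclose(baseline(*args), candidate(*args), rel_tol=1e-9)
        t_base = median_runtime(baseline, args, repeats)
        t_cand = median_runtime(candidate, args, repeats)
        rows.append({"n": n, "agree": agree, "baseline_s": t_base,
                     "candidate_s": t_cand, "speedup": t_base / t_cand})
    return rows

# Hypothetical stand-ins for the two libraries under test.
def baseline_sum_squares(values):
    total = 0
    for v in values:
        total += v * v
    return total

def candidate_sum_squares(values):
    return sum(v * v for v in values)

rows = speedup_table(baseline_sum_squares, candidate_sum_squares,
                     make_inputs=lambda n: (list(range(n)),),
                     sizes=[1000, 4000])
```

For GPU backends a real harness would also need to synchronize the device before reading the clock; that detail is omitted here.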

Original abstract

Gaussian process (GP) inference in Python is dominated by libraries such as GPyTorch and GPflow, which are built on deep-learning frameworks and inherit their dispatch overhead and dependency footprint. We present LightGP, a dependency-free C++17 library for GP regression with Python bindings, supporting Apple Metal and NVIDIA CUDA backends alongside tuned CPU paths via Apple Accelerate and OpenBLAS. LightGP provides four inference paths -- exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and structured kernel interpolation with FFT -- covering problems from $N{=}100$ to $N{=}500{,}000$. On an Apple M4, LightGP CPU is 2.6--8.7$\times$ faster than GPyTorch CPU for exact GP and ${\sim}1.5\times$ faster for sparse GP at every scale tested. On an NVIDIA RTX~3060, LightGP CUDA is 2.3--6.7$\times$ faster than GPyTorch CUDA for exact GP up to $N{=}2{,}048$, with GPyTorch closing the gap at $N{=}4{,}096$. A fused matrix-free kernel-vector product on Metal achieves 32$\times$ over the explicit path at $N{=}20{,}000$ with $O(N)$ memory, and an FFT-accelerated SKI matvec via Accelerate vDSP runs in sub-millisecond time at $N{=}200{,}000$. LightGP compiles as a single static library with zero external dependencies and is installable via \texttt{pip install lightgp

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LightGP is a lean C++ library for GP regression with reported speedups on Metal and CUDA, but the benchmark fairness needs explicit verification.

LightGP is a new C++17 library for Gaussian process inference that runs on Metal, CUDA, and optimized CPU paths without the overhead of deep-learning frameworks. It implements four main paths: exact Cholesky, conjugate gradients, sparse variational GP, and structured kernel interpolation with FFT. The library comes with Python bindings and compiles to a single static library with zero external dependencies.

What it does well is provide concrete performance numbers across different scales and backends. On an Apple M4, the CPU version is claimed to be 2.6 to 8.7 times faster than GPyTorch for exact GP and about 1.5 times faster for sparse GP. On an NVIDIA RTX 3060, the CUDA version shows a 2.3 to 6.7 times speedup for exact GP up to N=2048. Additional optimizations, such as fused matrix-free products on Metal and FFT-accelerated SKI, are highlighted for larger problems.

The soft spot is the reliance on benchmark comparisons. The speedups are only as good as the equivalence of the test setups. If GPyTorch was not run with exactly the same kernel, data, precision, and convergence criteria, the factors could be misleading. The abstract does not spell out the matching protocol, so the paper needs to demonstrate that the comparisons are fair.

This work is aimed at practitioners who want fast GP regression on specific hardware with minimal dependencies. Readers looking for implementation details or for performance figures on Apple silicon or CUDA might get value from it. It deserves a serious referee who will examine the code and the benchmark setup.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces LightGP, a dependency-free C++17 library for Gaussian process regression with Python bindings. It supports Apple Metal and NVIDIA CUDA backends along with tuned CPU paths via Apple Accelerate and OpenBLAS. Four inference methods are provided: exact Cholesky, matrix-free conjugate gradients, sparse variational free energy, and structured kernel interpolation with FFT, targeting problem sizes from N=100 to N=500,000. The central claims are concrete performance speedups over GPyTorch, including 2.6--8.7× on Apple M4 CPU for exact GP, ~1.5× for sparse GP, and 2.3--6.7× on NVIDIA RTX 3060 CUDA for exact GP up to N=2048.

Significance. If the reported speedups hold under matched experimental conditions, the work offers a practical contribution by demonstrating that a lightweight, single-static-library implementation can deliver substantial efficiency gains for GP inference across CPU, Metal, and CUDA without the dispatch overhead of deep-learning frameworks. The dependency-free design, support for multiple backends, and coverage of both exact and approximate methods at large scales are strengths that could benefit production and on-device applications.

major comments (2)

[Abstract] Abstract and benchmarking results: The headline speedup claims (2.6--8.7× on M4 CPU for exact GP and 2.3--6.7× on RTX 3060 CUDA) are load-bearing for the paper's contribution, yet the manuscript does not provide an explicit protocol confirming that LightGP and GPyTorch solve identical regression tasks. Details are needed on kernel choice, data sets, floating-point precision, hyperparameter optimization schedule, convergence criteria for CG or variational inference, number of inducing points, and whether GPyTorch defaults were used without undisclosed tuning differences.
[Benchmarking section] Benchmarking section: The claim of consistent speedups 'at every scale tested' requires tabulated experimental settings (e.g., CG iteration counts, inducing-point counts for sparse GP, data exclusion rules) to allow verification that comparisons control for all variables; without this, the factors cannot be confidently attributed to implementation or algorithmic advantages.

minor comments (1)

[Abstract] The installation command in the abstract should specify the exact pip package name and any build requirements for the Metal/CUDA backends to ensure reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify areas where additional detail will improve reproducibility and allow readers to better attribute the observed performance differences. We address each major comment below and will incorporate the requested information in the revised manuscript.

Referee: [Abstract] Abstract and benchmarking results: The headline speedup claims (2.6--8.7× on M4 CPU for exact GP and 2.3--6.7× on RTX 3060 CUDA) are load-bearing for the paper's contribution, yet the manuscript does not provide an explicit protocol confirming that LightGP and GPyTorch solve identical regression tasks. Details are needed on kernel choice, data sets, floating-point precision, hyperparameter optimization schedule, convergence criteria for CG or variational inference, number of inducing points, and whether GPyTorch defaults were used without undisclosed tuning differences.

Authors: We agree that an explicit experimental protocol is required to substantiate the speedup claims. In the revised manuscript we will add a dedicated 'Experimental Setup' subsection that states: the RBF kernel is used throughout, with length-scale and signal variance optimized by L-BFGS on the marginal likelihood; experiments employ both synthetic data drawn from a known GP prior and UCI regression datasets, with N ranging from 100 to 500,000; all computations use double-precision arithmetic; CG is run to a relative residual tolerance of 1e-6 or a hard limit of 1000 iterations; sparse variational inference uses 512 inducing points initialized by k-means; and GPyTorch runs employ the library defaults for the matching inference method and kernel, without additional hyperparameter tuning or custom convergence settings. We will also supply a short reproducibility script that reproduces the exact benchmark loops. revision: yes
Referee: [Benchmarking section] Benchmarking section: The claim of consistent speedups 'at every scale tested' requires tabulated experimental settings (e.g., CG iteration counts, inducing-point counts for sparse GP, data exclusion rules) to allow verification that comparisons control for all variables; without this, the factors cannot be confidently attributed to implementation or algorithmic advantages.

Authors: We accept that tabulated settings are necessary for independent verification. The revised benchmarking section will contain a new table that, for every method, backend, and problem size, lists the precise CG iteration count (or average when adaptive), the number of inducing points, any data subsampling or exclusion criteria, and the hyperparameter-optimization schedule. The accompanying text will explicitly confirm that LightGP and GPyTorch were executed on identical data partitions, with identical kernel initializations and identical stopping criteria, so that observed differences can be attributed to implementation choices. revision: yes
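The CG stopping rule quoted in the first response (a relative residual tolerance of 1e-6 or a hard limit of 1000 iterations) can be sketched as follows. This is a plain-Python illustration under assumed settings, not the authors' code: the operator, the noise level of 0.1, and all function names are hypothetical. The solver only ever calls a matrix-vector product, so it pairs naturally with a matrix-free kernel operator.

```python
import math

def conjugate_gradient(matvec, b, rel_tol=1e-6, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A given only v -> A v.
    Stops when ||r|| <= rel_tol * ||b|| or after max_iter iterations."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for the zero initial guess
    p = list(r)
    rs_old = sum(ri * ri for ri in r)
    b_norm = math.sqrt(rs_old)
    if b_norm == 0.0:
        return x, 0
    for it in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if math.sqrt(rs_new) <= rel_tol * b_norm:
            return x, it
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter

def make_operator(xs, lengthscale=1.0, noise=0.1):
    """Matrix-free v -> (K + noise*I) v for a 1-D RBF kernel (assumed setup)."""
    def matvec(v):
        out = []
        for i, xi in enumerate(xs):
            acc = noise * v[i]
            for xj, vj in zip(xs, v):
                acc += math.exp(-0.5 * ((xi - xj) / lengthscale) ** 2) * vj
            out.append(acc)
        return out
    return matvec

xs = [0.25 * i for i in range(40)]
b = [math.sin(x) for x in xs]
A = make_operator(xs)
x, iterations = conjugate_gradient(A, b)

# Recompute the true residual rather than trusting the recurrence.
residual = [bi - ai for bi, ai in zip(b, A(x))]
rel_residual = (math.sqrt(sum(r * r for r in residual))
                / math.sqrt(sum(bi * bi for bi in b)))
```

Reporting the iteration count alongside the timing, as the second response promises, matters because two libraries that stop at different tolerances are not doing the same amount of work.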

Circularity Check

0 steps flagged

No circularity: implementation and benchmarking paper with no derivation chain or self-referential claims

The paper describes a dependency-free C++ library for GP regression with CPU, Metal, and CUDA backends, plus empirical benchmarks against GPyTorch. No mathematical derivations, first-principles predictions, fitted parameters, or uniqueness theorems are claimed. Performance numbers are direct timing measurements on specific hardware; they do not reduce to any self-definition, fitted input renamed as prediction, or self-citation load-bearing step. The work is self-contained as an engineering artifact whose claims rest on reproducible code and explicit benchmark protocols rather than any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an engineering implementation paper. No new mathematical axioms, free parameters, or invented entities are introduced; it relies on standard linear algebra routines and existing GP inference algorithms.

pith-pipeline@v0.9.0 · 5820 in / 960 out tokens · 33685 ms · 2026-05-20T11:59:18.679357+00:00 · methodology

