ML-Based Optimum Sub-system Size Heuristic for the GPU Implementation of the Tridiagonal Partition Method

Milena Veneva

arxiv: 2510.27351 · v1 · pith:FFTQPGDGnew · submitted 2025-10-31 · 💻 cs.DC

ML-Based Optimum Sub-system Size Heuristic for the GPU Implementation of the Tridiagonal Partition Method

Milena Veneva This is my paper

Pith reviewed 2026-05-22 11:54 UTC · model grok-4.3

classification 💻 cs.DC

keywords machine learningkNNCUDAparallel partition algorithmtridiagonal systemsGPU implementationsub-system sizeheuristic

0 comments

The pith

A k-nearest neighbors model trained on empirical data predicts near-optimal sub-system sizes for the CUDA implementation of the parallel partition method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a machine learning heuristic that selects the best sub-system size for solving tridiagonal linear systems on GPUs with the parallel partition algorithm. Computational experiments first identify the fastest sub-system sizes across many problem sizes by direct measurement. A kNN classifier is then trained on these measurements to predict a good size for any new system size. The predictions match the measured optima closely enough that the GPU code runs efficiently without repeated tuning. The same trained-model approach is extended to choose both the sub-system size and the number of steps in a recursive version of the algorithm.

Core claim

A kNN model trained on empirically determined optimum sub-system sizes produces predictions that are acceptably close to the measured optima for the CUDA implementation of the parallel partition algorithm, and the same approach extends to predicting the number of recursive steps.

What carries the argument

kNN classification model trained to map system size to optimum sub-system size using the set of measured optima as training examples.

If this is right

The trained kNN model supplies a sub-system size for any new SLAE size without requiring fresh exhaustive timing experiments.
The same model supplies both the per-step sub-system size and the total number of recursive steps for the recursive parallel partition algorithm.
Statistical comparison of predicted versus measured optima shows the heuristic remains acceptably accurate across the tested range of system sizes.
Users facing varying linear system sizes can invoke the model at runtime to keep the GPU implementation close to peak performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same empirical-plus-kNN pattern could be reused to tune other performance-sensitive parameters in GPU linear-algebra kernels.
Collecting timing data from additional hardware generations would let one model serve across different GPU architectures.
Embedding the predictor inside a solver library would allow automatic adaptation of sub-system size as problem size changes during a computation.
The approach reduces the need for manual parameter search, lowering the barrier for non-expert users to obtain good GPU performance on tridiagonal problems.

Load-bearing premise

The empirically measured optimum sub-system sizes for the tested problem sizes form a representative training set that lets kNN generalize to new, unseen sizes.

What would settle it

For a new SLAE size outside the training data, execute the CUDA parallel partition code once with the kNN-predicted sub-system size and once with the true optimum found by exhaustive search; if the predicted choice produces substantially longer runtime, the claim fails.

Figures

Figures reproduced from arXiv: 2510.27351 by Milena Veneva.

**Figure 2.** Figure 2: Results from the kNN classification model for the optimum sub-system size. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Operations of the non-recursive partition method (top), and the recursive partition method with one recursive step [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison between the times for the partition method with different number of recursions. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Results from the kNN classification model for the optimum number of recursive steps. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Results from the kNN classification model for the optimum sub-system size; FP32. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

This paper presents a machine learning (ML)-based heuristic for finding the optimum sub-system size for the CUDA implementation of the parallel partition algorithm. Computational experiments for different system of linear algebraic equation (SLAE) sizes are conducted, and the optimum sub-system size for each of them is found empirically. To estimate a model for the sub-system size, we perform the k-nearest neighbors (kNN) classification method. Statistical analysis of the results is done. By comparing the predicted values with the actual data, the algorithm is deemed to be acceptably good. Next, the heuristic is expanded to work for the recursive parallel partition algorithm as well. An algorithm for determining the optimum sub-system size for each recursive step is formulated. A kNN model for predicting the optimum number of recursive steps for a particular SLAE size is built.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper measures optimal sub-system sizes for a CUDA tridiagonal partition solver, fits a kNN model to predict them for new SLAE sizes, and sketches an extension to the recursive case, but leaves the generalization evidence thin.

read the letter

The core of this paper is straightforward engineering: the authors run experiments to find the best sub-system size for a range of SLAE sizes on their GPU implementation of the parallel partition method, then train a kNN classifier on those empirical points to predict the parameter for unseen sizes. They also outline how to apply the same approach to pick the number of recursive steps and build a second kNN model for that choice. The comparison of predicted versus measured values leads them to call the heuristic acceptably good, with some statistical analysis added in.

Referee Report

2 major / 2 minor

Summary. The paper proposes an ML-based heuristic using k-nearest neighbors (kNN) to predict the optimum sub-system size for the CUDA implementation of the parallel partition method applied to tridiagonal systems of linear algebraic equations (SLAEs). Computational experiments empirically determine these optima for different SLAE sizes; a kNN model is trained on the resulting data, evaluated via comparison of predicted and actual values (deemed acceptably good), and extended to the recursive variant of the algorithm with an additional kNN model to predict the optimum number of recursive steps.

Significance. If the kNN predictions generalize reliably beyond the training data and yield measurable performance gains on GPU hardware, the heuristic could reduce the need for exhaustive tuning in parallel tridiagonal solvers, which are common in scientific computing. The experimental-plus-kNN workflow is conceptually sound, but the absence of reported quantitative metrics or validation details currently limits the assessed practical significance.

major comments (2)

[Abstract] Abstract: the central claim that the kNN model produces predictions that are 'acceptably good' (and extends to recursive steps) lacks any quantitative error metrics (e.g., MAE, accuracy, or R²), cross-validation procedure, or description of how training and test splits were formed. This leaves the accuracy claim unsupported by visible evidence.
[Abstract] Abstract and experimental description: the kNN predictions appear to be generated from the same empirical optima used to define the targets, with no mention of a held-out test set or external validation on unseen SLAE sizes. This creates a circularity risk where reported agreement may reflect in-sample interpolation rather than independent generalization.

minor comments (2)

[Abstract] Abstract: specify the number, range, and spacing of SLAE sizes tested, the value of k in kNN, and any hardware-specific factors (e.g., GPU architecture) that may affect optimum sub-system size.
[Abstract] Abstract: clarify whether the statistical analysis includes measures of variance or sensitivity to input parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and agree that revisions are needed to strengthen the presentation of the kNN evaluation results.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the kNN model produces predictions that are 'acceptably good' (and extends to recursive steps) lacks any quantitative error metrics (e.g., MAE, accuracy, or R²), cross-validation procedure, or description of how training and test splits were formed. This leaves the accuracy claim unsupported by visible evidence.

Authors: We agree that the manuscript does not currently report quantitative error metrics such as MAE or classification accuracy, nor does it describe the cross-validation procedure or train-test split methodology. In the revised version we will add these details, including MAE values for sub-system size predictions, accuracy for the recursive-step model, and a description of the k-fold cross-validation used to support the claim that the predictions are acceptably good. revision: yes
Referee: [Abstract] Abstract and experimental description: the kNN predictions appear to be generated from the same empirical optima used to define the targets, with no mention of a held-out test set or external validation on unseen SLAE sizes. This creates a circularity risk where reported agreement may reflect in-sample interpolation rather than independent generalization.

Authors: The referee correctly notes that the current evaluation compares predictions to the same empirical optima used for training, without an explicit held-out test set. This does limit claims of generalization. We will revise the experimental section to include results on a held-out set of SLAE sizes not present in the training data and will report the corresponding error metrics to demonstrate out-of-sample performance. revision: yes

Circularity Check

1 steps flagged

kNN 'predictions' of optimum sub-system sizes reduce to in-sample fit on the same empirical measurements used to define the targets

specific steps

fitted input called prediction [Abstract]
"Computational experiments for different system of linear algebraic equation (SLAE) sizes are conducted, and the optimum sub-system size for each of them is found empirically. To estimate a model for the sub-system size, we perform the k-nearest neighbors (kNN) classification method. ... By comparing the predicted values with the actual data, the algorithm is deemed to be acceptably good."

The targets (optimum sub-system sizes) are defined by the empirical measurements; the kNN model is fitted to those same measurements; the 'predictions' are then compared back to the identical actual data. The reported closeness is therefore a goodness-of-fit statistic on the training set rather than an out-of-sample prediction.

full rationale

The paper first measures optimum sub-system sizes empirically for various SLAE sizes, trains a kNN model on those measurements, and then reports that the model's outputs are 'acceptably good' by direct comparison to the identical measured values. This matches the fitted-input-called-prediction pattern: the reported agreement is a measure of how well the model reproduces its own training inputs rather than an independent prediction on unseen data. No held-out test set, cross-validation, or external validation is described in the provided text. The extension to recursive steps follows the same structure. The central claim therefore contains a partial circularity that weakens the generalization assertion, but the paper still performs real computational experiments and reports statistical analysis, so the circularity is not total.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that a small number of experimental runs suffice to train a kNN model that generalizes, plus the standard assumption that Euclidean distance in the feature space (system size) is a meaningful similarity measure for optimum sub-system size.

free parameters (1)

k (number of neighbors)
The hyperparameter k in the kNN classifier must be chosen; its value is not derived from first principles and affects the smoothness of the predicted optimum sizes.

axioms (1)

domain assumption The relationship between SLAE size and optimal sub-system size is sufficiently smooth and local that nearest-neighbor lookup on observed points yields useful predictions.
Invoked when the authors decide to use kNN rather than a parametric model or analytical formula.

pith-pipeline@v0.9.0 · 5669 in / 1446 out tokens · 85616 ms · 2026-05-22T11:54:37.418418+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1]

M., and Berndt M., and Moulton J

Austin T. M., and Berndt M., and Moulton J. D.A Memory Efficient Parallel Tridiagonal Solver.Preprint LA-VR-03-4149, 13 p. (2004). 9

work page 2004
[2]

NVIDIA,NVIDIA CUDA C++ Programming Guide.https://docs.nvidia.com/cuda/cuda-c-programming-guide/ (2025)

work page 2025
[3]

Techpowerup,NVIDIA GeForce RTX 2080 Ti Specifications, https://www.techpowerup.com/gpu-specs/geforce-rtx-2080-ti.c3305 (2018)

work page 2080
[4]

NVIDIA,GeForce RTX 20 Series, https://www.nvidia.com/en-eu/geforce/20-series/ (2018)

work page 2018
[5]

ML-Based Optimum Number of CUDA Streams for the GPU Implementation of the Tridiagonal Partition Method

Veneva, M. and Imamura, T.,ML-Based Optimum Number of CUDA Streams for the GPU Implementation of the Tridiagonal Partition Method, Phys. Part. Nuclei,56(6), pp. 1643–1648, doi: 10.1134/S1063779625701023, arXiv: 2501.05938 [cs.DC] (2025)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1134/s1063779625701023 2025
[6]

NVIDIA,CUDA C++ Best Practices Guide, Release 12.9, https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ (2025)

work page 2025
[7]

NVIDIA,NVIDIA RTX A5000 Datasheet, https://nvdam.widen.net/s/wrqrqt75vh/nvidia-rtx-a5000-datasheet (2021)

work page 2021
[8]

NVIDIA,Introducing NVIDIA RTX A5000 Graphics Card, https://www.nvidia.com/en-us/design-visualization/rtx-a5000/ (2021)

work page 2021
[9]

[10]QUDA: A library for QCD on GPUs, https://lattice.github.io/quda/

Clark C.,GPU Computing with QUDA, NVIDIA (2013). [10]QUDA: A library for QCD on GPUs, https://lattice.github.io/quda/

work page 2013
[10]

[12]CUDA occupancy calculator, https://xmartlabs.github.io/cuda-calculator/

NVIDIA,Thrust: The C++ Parallel Algorithms Library, https://docs.nvidia.com/cuda/thrust/index.html. [12]CUDA occupancy calculator, https://xmartlabs.github.io/cuda-calculator/

work page
[11]

Jeshani T.,Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs, Master’s thesis (2023)

work page 2023
[12]

Sato K., Takizawa H., Komatsu K., and Kobayashi H..Automatic tuning of CUDA execution parameters for stencil processing, In: Software Automatic Tuning, From Concepts to State-of-the-Art Results (2010)

work page 2010
[13]

Nonparametric Discrimination: Consistency Properties, International Statistical Review/Revue Internationale de Statistique,57(3) pp

Fix H., and Joseph L.,Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties, International Statistical Review/Revue Internationale de Statistique,57(3) pp. 238–47, doi: 10.2307/1403797 (1989)

work page doi:10.2307/1403797 1989
[14]

Korstanje J.,The k-Nearest Neighbors (kNN) Algorithm in Python, https://realpython.com/knn-python/

work page
[15]

Banerjee P.,kNN Classifier Tutorial, https://www.kaggle.com/code/prashant111/knn-classifier-tutorial (2020)

work page 2020
[16]

2825–2830, (2011)

Pedregosa F., and Varoquaux G., and Gramfort A., and Michel V ., and Thirion B., and Grisel O., and Blondel M., and Prettenhofer P., and Weiss R., and Dubourg V ., and Vanderplas J., and Passos A., and Cournapeau D., and Brucher M., and Perrot M., and Duchesnay ´E.,Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, pp. 282...

work page 2011
[17]

Techpowerup,NVIDIA GeForce RTX 4080, https://www.techpowerup.com/gpu-specs/geforce-rtx-4080.c3888 (2022). 10

work page 2022

[1] [1]

M., and Berndt M., and Moulton J

Austin T. M., and Berndt M., and Moulton J. D.A Memory Efficient Parallel Tridiagonal Solver.Preprint LA-VR-03-4149, 13 p. (2004). 9

work page 2004

[2] [2]

NVIDIA,NVIDIA CUDA C++ Programming Guide.https://docs.nvidia.com/cuda/cuda-c-programming-guide/ (2025)

work page 2025

[3] [3]

Techpowerup,NVIDIA GeForce RTX 2080 Ti Specifications, https://www.techpowerup.com/gpu-specs/geforce-rtx-2080-ti.c3305 (2018)

work page 2080

[4] [4]

NVIDIA,GeForce RTX 20 Series, https://www.nvidia.com/en-eu/geforce/20-series/ (2018)

work page 2018

[5] [5]

ML-Based Optimum Number of CUDA Streams for the GPU Implementation of the Tridiagonal Partition Method

Veneva, M. and Imamura, T.,ML-Based Optimum Number of CUDA Streams for the GPU Implementation of the Tridiagonal Partition Method, Phys. Part. Nuclei,56(6), pp. 1643–1648, doi: 10.1134/S1063779625701023, arXiv: 2501.05938 [cs.DC] (2025)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1134/s1063779625701023 2025

[6] [6]

NVIDIA,CUDA C++ Best Practices Guide, Release 12.9, https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ (2025)

work page 2025

[7] [7]

NVIDIA,NVIDIA RTX A5000 Datasheet, https://nvdam.widen.net/s/wrqrqt75vh/nvidia-rtx-a5000-datasheet (2021)

work page 2021

[8] [8]

NVIDIA,Introducing NVIDIA RTX A5000 Graphics Card, https://www.nvidia.com/en-us/design-visualization/rtx-a5000/ (2021)

work page 2021

[9] [9]

[10]QUDA: A library for QCD on GPUs, https://lattice.github.io/quda/

Clark C.,GPU Computing with QUDA, NVIDIA (2013). [10]QUDA: A library for QCD on GPUs, https://lattice.github.io/quda/

work page 2013

[10] [10]

[12]CUDA occupancy calculator, https://xmartlabs.github.io/cuda-calculator/

NVIDIA,Thrust: The C++ Parallel Algorithms Library, https://docs.nvidia.com/cuda/thrust/index.html. [12]CUDA occupancy calculator, https://xmartlabs.github.io/cuda-calculator/

work page

[11] [11]

Jeshani T.,Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs, Master’s thesis (2023)

work page 2023

[12] [12]

Sato K., Takizawa H., Komatsu K., and Kobayashi H..Automatic tuning of CUDA execution parameters for stencil processing, In: Software Automatic Tuning, From Concepts to State-of-the-Art Results (2010)

work page 2010

[13] [13]

Nonparametric Discrimination: Consistency Properties, International Statistical Review/Revue Internationale de Statistique,57(3) pp

Fix H., and Joseph L.,Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties, International Statistical Review/Revue Internationale de Statistique,57(3) pp. 238–47, doi: 10.2307/1403797 (1989)

work page doi:10.2307/1403797 1989

[14] [14]

Korstanje J.,The k-Nearest Neighbors (kNN) Algorithm in Python, https://realpython.com/knn-python/

work page

[15] [15]

Banerjee P.,kNN Classifier Tutorial, https://www.kaggle.com/code/prashant111/knn-classifier-tutorial (2020)

work page 2020

[16] [16]

2825–2830, (2011)

Pedregosa F., and Varoquaux G., and Gramfort A., and Michel V ., and Thirion B., and Grisel O., and Blondel M., and Prettenhofer P., and Weiss R., and Dubourg V ., and Vanderplas J., and Passos A., and Cournapeau D., and Brucher M., and Perrot M., and Duchesnay ´E.,Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, pp. 282...

work page 2011

[17] [17]

Techpowerup,NVIDIA GeForce RTX 4080, https://www.techpowerup.com/gpu-specs/geforce-rtx-4080.c3888 (2022). 10

work page 2022