Packing Entries to Diagonals for Homomorphic Sparse-Matrix Vector Multiplication

Deniz Elbek; Erkay Sava\c{s}; Kamer Kaya; Kemal Mutluergil

arxiv: 2604.04683 · v1 · submitted 2026-04-06 · 💻 cs.CR · cs.DS

Packing Entries to Diagonals for Homomorphic Sparse-Matrix Vector Multiplication

Kemal Mutluergil , Deniz Elbek , Kamer Kaya , Erkay Sava\c{s} This is my paper

Pith reviewed 2026-05-10 19:59 UTC · model grok-4.3

classification 💻 cs.CR cs.DS

keywords homomorphic encryptionsparse matrix-vector multiplicationdiagonal packingrow column permutationoptimization heuristicsSuiteSparse matricesencrypted linear algebra

0 comments

The pith

Permuting rows and columns packs nonzeros of sparse matrices into fewer cyclic diagonals for cheaper homomorphic matrix-vector multiplication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that reordering the rows and columns of a sparse matrix can concentrate its nonzero entries into fewer cyclic diagonals. This is important because the popular Halevi-Shoup homomorphic encryption method for performing matrix-vector multiplication has a cost that increases with the number of such diagonals. The authors develop heuristics that start with graph-based orderings and refine them using local search swaps, along with a strategy to remove dense rows or columns when needed. Tests on 175 real-world sparse matrices show an average 5.5-fold reduction in diagonal count, with some cases seeing much larger gains and corresponding drops in encryption costs.

Core claim

We formalise the two-dimensional diagonal packing problem and give practical heuristics that combine bandwidth reduction, anti-bandwidth maximisation, and spectral methods for initial orderings with 2OPT and 3OPT swaps for improvement. A dense row and column elimination approach is added with a cost model for the homomorphic scheme. On 175 SuiteSparse matrices the methods reduce the number of occupied cyclic diagonals by 5.5 times on average and by as much as 45.6 times in one case, with one instance showing a 23.7 times cost reduction when elimination is used.

What carries the argument

The two-dimensional diagonal packing problem (2DPP) addressed by graph-based permutation heuristics and iterative local optimization to minimize the count of occupied cyclic diagonals in the Halevi-Shoup homomorphic encryption scheme.

If this is right

Fewer occupied diagonals lower the computational cost of encrypted sparse matrix-vector multiplication.
The heuristics can be applied to improve performance on any sparse matrix without altering its numerical values.
Eliminating dense rows or columns provides extra savings when reordering alone falls short.
The cost model helps decide when elimination is worth applying alongside permutation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The packing approach could be adapted for other homomorphic encryption schemes that have similar diagonal-based costs.
Applying these orderings might benefit parallel or distributed versions of homomorphic linear algebra.
Domain-specific matrices, such as those from graph algorithms, may see even larger gains from tailored initial orderings.
Future work could integrate the optimization directly into HE libraries for automatic benefit.

Load-bearing premise

The encryption cost grows strictly with the number of occupied cyclic diagonals and the proposed heuristics reach near-optimal packings without adding other significant overheads in the encryption process.

What would settle it

Executing the homomorphic multiplication on a SuiteSparse matrix before and after reordering and observing that the runtime does not decrease proportionally to the reported diagonal reduction.

read the original abstract

Homomorphic encryption (HE) enables computation over encrypted data but incurs a substantial overhead. For sparse-matrix vector multiplication, the widely used Halevi and Shoup (2014) scheme has a cost linear in the number of occupied cyclic diagonals, which may be many due to the irregular nonzero pattern of the matrix. In this work, we study how to permute the rows and columns of a sparse matrix so that its nonzeros are packed into as few cyclic diagonals as possible. We formalise this as the two-dimensional diagonal packing problem (2DPP), introduce the two-dimensional circular bandsize metric, and give an integer programming formulation that yields optimal solutions for small instances. For large matrices, we propose practical ordering heuristics that combine graph-based initial orderings - based on bandwidth reduction, anti-bandwidth maximisation, and spectral analysis - and an iterative-improvement-based optimization phase employing 2OPT and 3OPT swaps. We also introduce a dense row/column elimination strategy and an HE-aware cost model that quantifies the benefits of isolating dense structures. Experiments on 175 sparse matrices from the SuiteSparse collection show that our ordering-optimisation variants can reduce the diagonal count by $5.5\times$ on average ($45.6\times$ for one instance). In addition, the dense row/column elimination approach can be useful for cases where the proposed permutation techniques are not sufficient; for instance, in one case, the additional elimination helped to reduce the encrypted multiplication cost by $23.7\times$ whereas without elimination, the improvement was only $1.9\times$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Reordering sparse matrices to pack nonzeros into fewer cyclic diagonals gives measurable reductions in HE matrix-vector cost on SuiteSparse data, but the gains rest on an untested assumption that permutations add no extra rotations or alignment work.

read the letter

The paper shows a concrete way to permute rows and columns of sparse matrices so their nonzeros fit into fewer cyclic diagonals under the Halevi-Shoup scheme. This directly targets the linear cost in occupied diagonals and produces average 5.5x reductions across 175 SuiteSparse matrices, with one case reaching 45.6x and another 23.7x when dense-row elimination is added on top of the ordering.

Referee Report

3 major / 3 minor

Summary. The paper addresses efficient homomorphic sparse matrix-vector multiplication under the Halevi-Shoup (2014) scheme by permuting rows and columns of sparse matrices to minimize the number of occupied cyclic diagonals. It formalizes the two-dimensional diagonal packing problem (2DPP), introduces the two-dimensional circular bandsize metric, provides an integer programming formulation for optimal solutions on small instances, proposes practical heuristics combining graph-based initial orderings (bandwidth reduction, anti-bandwidth maximization, spectral) with 2OPT/3OPT iterative improvement, and adds a dense row/column elimination strategy along with an HE-aware cost model. Experiments on 175 SuiteSparse matrices report an average 5.5× reduction in diagonal count (up to 45.6×) and, in one case, a 23.7× encrypted multiplication cost reduction when elimination is applied.

Significance. If the HE-aware cost model accurately captures runtime (i.e., cost remains strictly proportional to occupied diagonals with no unmodeled rotation or alignment overheads) and the heuristics consistently approach optimality, the work could meaningfully accelerate privacy-preserving linear algebra on irregular sparse data. The public benchmark suite and concrete average/best-case numbers are strengths; however, the central empirical claims rest on the unverified linearity assumption and lack of full HE implementation or optimality-gap quantification.

major comments (3)

[Cost model section / abstract] Cost model (abstract and § on HE-aware cost model): the claim that multiplication cost is strictly linear in the number of occupied cyclic diagonals after permutation is load-bearing for all reported speedups (5.5× average, 23.7× in the elimination case), yet the model appears to omit potential extra homomorphic rotations needed to realign the permuted input vector, changes in multiplicative depth from altered diagonal offsets, or increased key-switching frequency. Without explicit accounting or empirical validation on an actual HE implementation, the translation from diagonal reduction to runtime savings is not guaranteed.
[Experiments / abstract] Experiments (abstract and results section): the reported 5.5× average diagonal reduction and 23.7× cost reduction are presented without details on exact baselines (e.g., which prior ordering or naive diagonal count), statistical significance tests across the 175 matrices, or variance, making it difficult to assess whether the heuristics reliably outperform standard bandwidth-reduction methods.
[IP formulation and heuristics sections] Heuristics evaluation (IP formulation and ordering-optimisation sections): an IP solver is introduced for small instances to obtain optimal solutions, but the manuscript does not report the optimality gap achieved by the 2OPT/3OPT heuristics on those instances; this gap is needed to substantiate the “near-optimal” claim that underpins the practical utility for large matrices.

minor comments (3)

[Formalization section] Clarify the precise definition and computation of the two-dimensional circular bandsize metric when introducing the 2DPP formalization.
[Heuristics section] Specify how the graph-based initial orderings (bandwidth reduction, anti-bandwidth, spectral) are adapted or combined for the cyclic-diagonal objective rather than the classical 1D bandwidth objective.
[Figures] Ensure all before/after diagonal-packing figures include explicit axis labels, matrix dimensions, and the exact number of occupied diagonals shown.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: Cost model (abstract and § on HE-aware cost model): the claim that multiplication cost is strictly linear in the number of occupied cyclic diagonals after permutation is load-bearing for all reported speedups (5.5× average, 23.7× in the elimination case), yet the model appears to omit potential extra homomorphic rotations needed to realign the permuted input vector, changes in multiplicative depth from altered diagonal offsets, or increased key-switching frequency. Without explicit accounting or empirical validation on an actual HE implementation, the translation from diagonal reduction to runtime savings is not guaranteed.

Authors: We thank the referee for highlighting this important consideration. The Halevi-Shoup scheme's matrix-vector multiplication cost is modeled as linear in the number of occupied diagonals because each diagonal requires a separate rotation and multiplication step. We acknowledge that row/column permutations can necessitate additional input-vector rotations for realignment and may influence multiplicative depth or key-switching costs depending on the specific diagonal offsets. In the revised manuscript, we will expand the HE-aware cost model section to explicitly discuss these potential overheads, clarify the assumptions of the model, and note the limitations in the absence of a full HE implementation. This will provide a more balanced view of when diagonal reduction translates to practical runtime savings. revision: yes
Referee: Experiments (abstract and results section): the reported 5.5× average diagonal reduction and 23.7× cost reduction are presented without details on exact baselines (e.g., which prior ordering or naive diagonal count), statistical significance tests across the 175 matrices, or variance, making it difficult to assess whether the heuristics reliably outperform standard bandwidth-reduction methods.

Authors: We agree that greater transparency on the experimental methodology is warranted. The reported 5.5× average reduction is computed relative to the number of occupied cyclic diagonals in the original (unpermuted) matrix ordering from the SuiteSparse collection. The 23.7× cost reduction arises from combining our permutation heuristics with the dense row/column elimination strategy. In the revised manuscript, we will explicitly state these baselines, report variance and ranges of the reduction factors across all 175 matrices, and add comparisons against standard bandwidth-reduction heuristics such as Reverse Cuthill-McKee. We will also include descriptive statistics to illustrate consistency of the improvements. revision: yes
Referee: Heuristics evaluation (IP formulation and ordering-optimisation sections): an IP solver is introduced for small instances to obtain optimal solutions, but the manuscript does not report the optimality gap achieved by the 2OPT/3OPT heuristics on those instances; this gap is needed to substantiate the “near-optimal” claim that underpins the practical utility for large matrices.

Authors: We appreciate this observation regarding the missing quantitative evaluation. Although the integer programming formulation was developed to obtain optimal solutions for small instances, the optimality gaps of the 2OPT/3OPT heuristics were not reported in the submitted version. In the revised manuscript, we will compute and present these gaps for all small instances solvable by the IP solver, thereby providing concrete evidence to support the near-optimality claim for the heuristics on larger matrices. revision: yes

standing simulated objections not resolved

Empirical runtime validation of the cost model via a complete homomorphic encryption implementation, which would require substantial additional engineering effort outside the current scope of the ordering and modeling contributions.

Circularity Check

0 steps flagged

No significant circularity; algorithmic method with external validation

full rationale

The paper formalizes the 2D diagonal packing problem, provides an integer programming formulation for small instances, and develops graph-based heuristics (2OPT/3OPT) plus dense-row elimination for larger matrices. These are evaluated empirically on the independent SuiteSparse collection of 175 matrices, with cost quantified via the external Halevi-Shoup (2014) scheme. No derivation step reduces to a fitted parameter renamed as prediction, a self-definitional equation, or a load-bearing self-citation chain; the central claims remain independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard graph theory and local-search optimization without fitted parameters or unproven domain axioms beyond the external linearity assumption from Halevi-Shoup; the new metric and problem are defined directly from the HE cost structure.

pith-pipeline@v0.9.0 · 5598 in / 1215 out tokens · 62644 ms · 2026-05-10T19:59:43.711104+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1]

In: Data Science and Security

Antony TB, Naduvath S (2022) On circulant completion of graphs. In: Data Science and Security. Springer, p 181–188 Antony TB, Naduvath S (2024) Further studies on circulant completion of graphs. Proyecciones Journal of Mathematics 43(3):761–773. URL https://www.revistaproyecciones.cl/index.php/proyecciones/article/download/ 6034/4504/38716, also cited asP...

work page doi:10.1109/tmag.2007 2022
[2]

Elsevier, p 117–129, https://doi. org/10.1016/S0167-5060(08)70455-2, URL https://www.sciencedirect.com/science/ article/pii/S0167506008704552 Esposito A, Catalano M, Malucelli F, et al (1998) A new matrix bandwidth reduc- tion algorithm. Operations Research Letters 23(3):99–107. https://doi.org/10.1016/ S0167-6377(98)00040-6, URL https://www.sciencedirect...

work page doi:10.1016/s0167-5060(08)70455-2 1998
[3]

Cryptology ePrint Archive, Paper 2018/244, URL https://eprint.iacr.org/2018/244 Heinrich K, Hell P (1987) On the problems of bandsize

Springer Berlin Heidelberg, Berlin, Heidelberg, pp 554–571 Halevi S, Shoup V (2018) Faster homomorphic linear transformations in HElib. Cryptology ePrint Archive, Paper 2018/244, URL https://eprint.iacr.org/2018/244 Heinrich K, Hell P (1987) On the problems of bandsize. Graphs and Combinatorics 3:279–284 Helsgaun K (2000) An effective implementation of th...

work page doi:10.1016/s0377-2217(99)00284-2 2018
[4]

Discrete Mathematics 309(11):3541–3552 Reid JK, Scott JA (2002) Implementing hager’s exchange methods for matrix profile reduction

Curran Associates, Inc., pp 19104–19116, URL https://proceedings.neurips.cc/paper files/ paper/2023/file/3cc685788a311fa35d8d41df93e288ca-Paper-Conference.pdf Raspaud A, Schr¨ oder H, S` ykora O, et al (2009) Antibandwidth and cyclic antiband- width of meshes and hypercubes. Discrete Mathematics 309(11):3541–3552 Reid JK, Scott JA (2002) Implementing hage...

work page doi:10.1145/592843 2023
[5]

International Journal for Numerical Methods in Engineering 36(19):3351–3379

https://doi.org/10.1002/nla.1859, URL https://onlinelibrary.wiley.com/doi/ 43 abs/10.1002/nla.1859, https://onlinelibrary.wiley.com/doi/pdf/10.1002/nla.1859 Souza LT, Murray DW (1993) An alternative pseudoperipheral node finder for resequencing schemes. International Journal for Numerical Methods in Engineering 36(19):3351–3379. https://doi.org/10.1002/nm...

work page doi:10.1002/nla.1859 1993

[1] [1]

In: Data Science and Security

Antony TB, Naduvath S (2022) On circulant completion of graphs. In: Data Science and Security. Springer, p 181–188 Antony TB, Naduvath S (2024) Further studies on circulant completion of graphs. Proyecciones Journal of Mathematics 43(3):761–773. URL https://www.revistaproyecciones.cl/index.php/proyecciones/article/download/ 6034/4504/38716, also cited asP...

work page doi:10.1109/tmag.2007 2022

[2] [2]

Elsevier, p 117–129, https://doi. org/10.1016/S0167-5060(08)70455-2, URL https://www.sciencedirect.com/science/ article/pii/S0167506008704552 Esposito A, Catalano M, Malucelli F, et al (1998) A new matrix bandwidth reduc- tion algorithm. Operations Research Letters 23(3):99–107. https://doi.org/10.1016/ S0167-6377(98)00040-6, URL https://www.sciencedirect...

work page doi:10.1016/s0167-5060(08)70455-2 1998

[3] [3]

Cryptology ePrint Archive, Paper 2018/244, URL https://eprint.iacr.org/2018/244 Heinrich K, Hell P (1987) On the problems of bandsize

Springer Berlin Heidelberg, Berlin, Heidelberg, pp 554–571 Halevi S, Shoup V (2018) Faster homomorphic linear transformations in HElib. Cryptology ePrint Archive, Paper 2018/244, URL https://eprint.iacr.org/2018/244 Heinrich K, Hell P (1987) On the problems of bandsize. Graphs and Combinatorics 3:279–284 Helsgaun K (2000) An effective implementation of th...

work page doi:10.1016/s0377-2217(99)00284-2 2018

[4] [4]

Discrete Mathematics 309(11):3541–3552 Reid JK, Scott JA (2002) Implementing hager’s exchange methods for matrix profile reduction

Curran Associates, Inc., pp 19104–19116, URL https://proceedings.neurips.cc/paper files/ paper/2023/file/3cc685788a311fa35d8d41df93e288ca-Paper-Conference.pdf Raspaud A, Schr¨ oder H, S` ykora O, et al (2009) Antibandwidth and cyclic antiband- width of meshes and hypercubes. Discrete Mathematics 309(11):3541–3552 Reid JK, Scott JA (2002) Implementing hage...

work page doi:10.1145/592843 2023

[5] [5]

International Journal for Numerical Methods in Engineering 36(19):3351–3379

https://doi.org/10.1002/nla.1859, URL https://onlinelibrary.wiley.com/doi/ 43 abs/10.1002/nla.1859, https://onlinelibrary.wiley.com/doi/pdf/10.1002/nla.1859 Souza LT, Murray DW (1993) An alternative pseudoperipheral node finder for resequencing schemes. International Journal for Numerical Methods in Engineering 36(19):3351–3379. https://doi.org/10.1002/nm...

work page doi:10.1002/nla.1859 1993