Packing Entries to Diagonals for Homomorphic Sparse-Matrix Vector Multiplication
Pith reviewed 2026-05-10 19:59 UTC · model grok-4.3
The pith
Permuting rows and columns packs nonzeros of sparse matrices into fewer cyclic diagonals for cheaper homomorphic matrix-vector multiplication.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formalise the two-dimensional diagonal packing problem and give practical heuristics that combine bandwidth reduction, anti-bandwidth maximisation, and spectral methods for initial orderings with 2OPT and 3OPT swaps for improvement. A dense row and column elimination approach is added with a cost model for the homomorphic scheme. On 175 SuiteSparse matrices the methods reduce the number of occupied cyclic diagonals by 5.5 times on average and by as much as 45.6 times in one case, with one instance showing a 23.7 times cost reduction when elimination is used.
What carries the argument
The two-dimensional diagonal packing problem (2DPP) addressed by graph-based permutation heuristics and iterative local optimization to minimize the count of occupied cyclic diagonals in the Halevi-Shoup homomorphic encryption scheme.
If this is right
- Fewer occupied diagonals lower the computational cost of encrypted sparse matrix-vector multiplication.
- The heuristics can be applied to improve performance on any sparse matrix without altering its numerical values.
- Eliminating dense rows or columns provides extra savings when reordering alone falls short.
- The cost model helps decide when elimination is worth applying alongside permutation.
Where Pith is reading between the lines
- The packing approach could be adapted for other homomorphic encryption schemes that have similar diagonal-based costs.
- Applying these orderings might benefit parallel or distributed versions of homomorphic linear algebra.
- Domain-specific matrices, such as those from graph algorithms, may see even larger gains from tailored initial orderings.
- Future work could integrate the optimization directly into HE libraries for automatic benefit.
Load-bearing premise
The encryption cost grows strictly with the number of occupied cyclic diagonals and the proposed heuristics reach near-optimal packings without adding other significant overheads in the encryption process.
What would settle it
Executing the homomorphic multiplication on a SuiteSparse matrix before and after reordering and observing that the runtime does not decrease proportionally to the reported diagonal reduction.
read the original abstract
Homomorphic encryption (HE) enables computation over encrypted data but incurs a substantial overhead. For sparse-matrix vector multiplication, the widely used Halevi and Shoup (2014) scheme has a cost linear in the number of occupied cyclic diagonals, which may be many due to the irregular nonzero pattern of the matrix. In this work, we study how to permute the rows and columns of a sparse matrix so that its nonzeros are packed into as few cyclic diagonals as possible. We formalise this as the two-dimensional diagonal packing problem (2DPP), introduce the two-dimensional circular bandsize metric, and give an integer programming formulation that yields optimal solutions for small instances. For large matrices, we propose practical ordering heuristics that combine graph-based initial orderings - based on bandwidth reduction, anti-bandwidth maximisation, and spectral analysis - and an iterative-improvement-based optimization phase employing 2OPT and 3OPT swaps. We also introduce a dense row/column elimination strategy and an HE-aware cost model that quantifies the benefits of isolating dense structures. Experiments on 175 sparse matrices from the SuiteSparse collection show that our ordering-optimisation variants can reduce the diagonal count by $5.5\times$ on average ($45.6\times$ for one instance). In addition, the dense row/column elimination approach can be useful for cases where the proposed permutation techniques are not sufficient; for instance, in one case, the additional elimination helped to reduce the encrypted multiplication cost by $23.7\times$ whereas without elimination, the improvement was only $1.9\times$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses efficient homomorphic sparse matrix-vector multiplication under the Halevi-Shoup (2014) scheme by permuting rows and columns of sparse matrices to minimize the number of occupied cyclic diagonals. It formalizes the two-dimensional diagonal packing problem (2DPP), introduces the two-dimensional circular bandsize metric, provides an integer programming formulation for optimal solutions on small instances, proposes practical heuristics combining graph-based initial orderings (bandwidth reduction, anti-bandwidth maximization, spectral) with 2OPT/3OPT iterative improvement, and adds a dense row/column elimination strategy along with an HE-aware cost model. Experiments on 175 SuiteSparse matrices report an average 5.5× reduction in diagonal count (up to 45.6×) and, in one case, a 23.7× encrypted multiplication cost reduction when elimination is applied.
Significance. If the HE-aware cost model accurately captures runtime (i.e., cost remains strictly proportional to occupied diagonals with no unmodeled rotation or alignment overheads) and the heuristics consistently approach optimality, the work could meaningfully accelerate privacy-preserving linear algebra on irregular sparse data. The public benchmark suite and concrete average/best-case numbers are strengths; however, the central empirical claims rest on the unverified linearity assumption and lack of full HE implementation or optimality-gap quantification.
major comments (3)
- [Cost model section / abstract] Cost model (abstract and § on HE-aware cost model): the claim that multiplication cost is strictly linear in the number of occupied cyclic diagonals after permutation is load-bearing for all reported speedups (5.5× average, 23.7× in the elimination case), yet the model appears to omit potential extra homomorphic rotations needed to realign the permuted input vector, changes in multiplicative depth from altered diagonal offsets, or increased key-switching frequency. Without explicit accounting or empirical validation on an actual HE implementation, the translation from diagonal reduction to runtime savings is not guaranteed.
- [Experiments / abstract] Experiments (abstract and results section): the reported 5.5× average diagonal reduction and 23.7× cost reduction are presented without details on exact baselines (e.g., which prior ordering or naive diagonal count), statistical significance tests across the 175 matrices, or variance, making it difficult to assess whether the heuristics reliably outperform standard bandwidth-reduction methods.
- [IP formulation and heuristics sections] Heuristics evaluation (IP formulation and ordering-optimisation sections): an IP solver is introduced for small instances to obtain optimal solutions, but the manuscript does not report the optimality gap achieved by the 2OPT/3OPT heuristics on those instances; this gap is needed to substantiate the “near-optimal” claim that underpins the practical utility for large matrices.
minor comments (3)
- [Formalization section] Clarify the precise definition and computation of the two-dimensional circular bandsize metric when introducing the 2DPP formalization.
- [Heuristics section] Specify how the graph-based initial orderings (bandwidth reduction, anti-bandwidth, spectral) are adapted or combined for the cyclic-diagonal objective rather than the classical 1D bandwidth objective.
- [Figures] Ensure all before/after diagonal-packing figures include explicit axis labels, matrix dimensions, and the exact number of occupied diagonals shown.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: Cost model (abstract and § on HE-aware cost model): the claim that multiplication cost is strictly linear in the number of occupied cyclic diagonals after permutation is load-bearing for all reported speedups (5.5× average, 23.7× in the elimination case), yet the model appears to omit potential extra homomorphic rotations needed to realign the permuted input vector, changes in multiplicative depth from altered diagonal offsets, or increased key-switching frequency. Without explicit accounting or empirical validation on an actual HE implementation, the translation from diagonal reduction to runtime savings is not guaranteed.
Authors: We thank the referee for highlighting this important consideration. The Halevi-Shoup scheme's matrix-vector multiplication cost is modeled as linear in the number of occupied diagonals because each diagonal requires a separate rotation and multiplication step. We acknowledge that row/column permutations can necessitate additional input-vector rotations for realignment and may influence multiplicative depth or key-switching costs depending on the specific diagonal offsets. In the revised manuscript, we will expand the HE-aware cost model section to explicitly discuss these potential overheads, clarify the assumptions of the model, and note the limitations in the absence of a full HE implementation. This will provide a more balanced view of when diagonal reduction translates to practical runtime savings. revision: yes
-
Referee: Experiments (abstract and results section): the reported 5.5× average diagonal reduction and 23.7× cost reduction are presented without details on exact baselines (e.g., which prior ordering or naive diagonal count), statistical significance tests across the 175 matrices, or variance, making it difficult to assess whether the heuristics reliably outperform standard bandwidth-reduction methods.
Authors: We agree that greater transparency on the experimental methodology is warranted. The reported 5.5× average reduction is computed relative to the number of occupied cyclic diagonals in the original (unpermuted) matrix ordering from the SuiteSparse collection. The 23.7× cost reduction arises from combining our permutation heuristics with the dense row/column elimination strategy. In the revised manuscript, we will explicitly state these baselines, report variance and ranges of the reduction factors across all 175 matrices, and add comparisons against standard bandwidth-reduction heuristics such as Reverse Cuthill-McKee. We will also include descriptive statistics to illustrate consistency of the improvements. revision: yes
-
Referee: Heuristics evaluation (IP formulation and ordering-optimisation sections): an IP solver is introduced for small instances to obtain optimal solutions, but the manuscript does not report the optimality gap achieved by the 2OPT/3OPT heuristics on those instances; this gap is needed to substantiate the “near-optimal” claim that underpins the practical utility for large matrices.
Authors: We appreciate this observation regarding the missing quantitative evaluation. Although the integer programming formulation was developed to obtain optimal solutions for small instances, the optimality gaps of the 2OPT/3OPT heuristics were not reported in the submitted version. In the revised manuscript, we will compute and present these gaps for all small instances solvable by the IP solver, thereby providing concrete evidence to support the near-optimality claim for the heuristics on larger matrices. revision: yes
- Empirical runtime validation of the cost model via a complete homomorphic encryption implementation, which would require substantial additional engineering effort outside the current scope of the ordering and modeling contributions.
Circularity Check
No significant circularity; algorithmic method with external validation
full rationale
The paper formalizes the 2D diagonal packing problem, provides an integer programming formulation for small instances, and develops graph-based heuristics (2OPT/3OPT) plus dense-row elimination for larger matrices. These are evaluated empirically on the independent SuiteSparse collection of 175 matrices, with cost quantified via the external Halevi-Shoup (2014) scheme. No derivation step reduces to a fitted parameter renamed as prediction, a self-definitional equation, or a load-bearing self-citation chain; the central claims remain independent of the paper's own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Antony TB, Naduvath S (2022) On circulant completion of graphs. In: Data Science and Security. Springer, p 181–188 Antony TB, Naduvath S (2024) Further studies on circulant completion of graphs. Proyecciones Journal of Mathematics 43(3):761–773. URL https://www.revistaproyecciones.cl/index.php/proyecciones/article/download/ 6034/4504/38716, also cited asP...
-
[2]
Elsevier, p 117–129, https://doi. org/10.1016/S0167-5060(08)70455-2, URL https://www.sciencedirect.com/science/ article/pii/S0167506008704552 Esposito A, Catalano M, Malucelli F, et al (1998) A new matrix bandwidth reduc- tion algorithm. Operations Research Letters 23(3):99–107. https://doi.org/10.1016/ S0167-6377(98)00040-6, URL https://www.sciencedirect...
-
[3]
Springer Berlin Heidelberg, Berlin, Heidelberg, pp 554–571 Halevi S, Shoup V (2018) Faster homomorphic linear transformations in HElib. Cryptology ePrint Archive, Paper 2018/244, URL https://eprint.iacr.org/2018/244 Heinrich K, Hell P (1987) On the problems of bandsize. Graphs and Combinatorics 3:279–284 Helsgaun K (2000) An effective implementation of th...
-
[4]
Curran Associates, Inc., pp 19104–19116, URL https://proceedings.neurips.cc/paper files/ paper/2023/file/3cc685788a311fa35d8d41df93e288ca-Paper-Conference.pdf Raspaud A, Schr¨ oder H, S` ykora O, et al (2009) Antibandwidth and cyclic antiband- width of meshes and hypercubes. Discrete Mathematics 309(11):3541–3552 Reid JK, Scott JA (2002) Implementing hage...
-
[5]
International Journal for Numerical Methods in Engineering 36(19):3351–3379
https://doi.org/10.1002/nla.1859, URL https://onlinelibrary.wiley.com/doi/ 43 abs/10.1002/nla.1859, https://onlinelibrary.wiley.com/doi/pdf/10.1002/nla.1859 Souza LT, Murray DW (1993) An alternative pseudoperipheral node finder for resequencing schemes. International Journal for Numerical Methods in Engineering 36(19):3351–3379. https://doi.org/10.1002/nm...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.