Community-Aware Vertex Ordering for Reference-Based Graph Compression: A Cross-Encoder Empirical Study
Pith reviewed 2026-05-22 01:39 UTC · model grok-4.3
The pith
A two-stage Leiden+LLP vertex ordering reduces bits per edge by 0.3 to 5.4 in reference-based graph compression, with gains largely independent of encoder choice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Leiden+LLP two-stage ordering improves reference-based compression on poorly ordered graphs by 0.3 to 5.4 bits per edge on every dataset and encoder measured, and that this gain is largely insensitive to encoder parameterization. Four independently tuned encoders agree on the Leiden+LLP versus plain-LLP difference within roughly plus or minus 0.04 bits per edge on four of five weakly ordered datasets. On URL-ordered web graphs the benefit persists for adaptive encoders but can be mildly negative for encoders already tuned to URL residual structure. Three new encoders that select among up to 28 decompositions per vertex improve over BVGraph high-compression by 2-
What carries the argument
The two-stage Leiden+LLP vertex ordering: global LLP seeding followed by Leiden community detection and then per-cluster LLP on each induced subgraph, which improves neighbor-list locality for reference encoding.
Load-bearing premise
That the combination of global LLP seeding, Leiden community detection, and per-cluster LLP will reliably produce orderings with superior locality for reference encoding than plain LLP or URL order, and that this holds beyond the five tested weakly ordered datasets.
What would settle it
A new weakly ordered graph on which the Leiden+LLP ordering produces equal or higher bits per edge than plain LLP or URL ordering when measured with the same encoder.
Figures
read the original abstract
Reference-based graph compression encodes each vertex's neighbor list relative to a recent vertex, exploiting locality to compress large directed graphs. The dominant tool, WebGraph's BVGraph, fixes a single encoding pipeline and relies on a separately chosen vertex ordering -- typically URL-lexicographic or Layered Label Propagation (LLP). The interaction between ordering and encoder is rarely measured. We propose a two-stage Leiden+LLP vertex ordering -- global LLP to seed labels, Leiden community detection, then per-cluster LLP on each induced subgraph -- and study how it interacts with reference-based compression. On graphs with poor initial vertex order, reordering saves 0.3 to 5.4 bits per edge on every dataset and encoder we measured. The size of that gain is largely insensitive to the encoder: on four of five weakly ordered datasets, four independently parameterised encoders agree on the Leiden+LLP-vs-plain-LLP gain within roughly +/- 0.04 bpe. On URL-ordered web crawls, where the distributed ordering already encodes locality, adaptive encoders still benefit from reordering, but encoders tuned to URL-induced residual structure (BV-HC, CG at K>1) are mildly hurt by it. To quantify how much encoder choice matters once ordering is fixed, we contribute three reference-based encoders -- BG, CS, and CG -- that perform per-vertex cost-optimal selection from up to 28 candidate decompositions. Each is run under its own best-tested ordering. The best of the three improves over BVGraph high-compression by 2-9% on every dataset tested, with the encoder-level gain consistently smaller than the ordering-level gain on weakly ordered datasets. The encoder framework also yields a self-delimiting bitstream that supports low-overhead random access.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that a two-stage Leiden+LLP vertex ordering (global LLP seeding, Leiden community detection, then per-cluster LLP) improves reference-based compression of directed graphs. On graphs with poor initial order, reordering yields savings of 0.3–5.4 bits per edge across all tested datasets and encoders; these gains are largely insensitive to encoder choice, with four independently parameterized encoders agreeing on the Leiden+LLP vs. plain-LLP delta within roughly ±0.04 bpe on four of five weakly ordered datasets. The authors introduce three new reference-based encoders (BG, CS, CG) that perform per-vertex cost-optimal selection from up to 28 candidate decompositions; the best of these improves over BVGraph high-compression by 2–9 % on every dataset, with encoder-level gains consistently smaller than ordering-level gains on weakly ordered graphs. The framework also produces a self-delimiting bitstream supporting low-overhead random access.
Significance. If the reported empirical patterns hold, the work is significant because it supplies concrete evidence that vertex ordering exerts a larger and more stable effect on compressed size than incremental changes to the reference encoder itself. The cross-encoder comparison on multiple real-world graphs strengthens the robustness of the ordering claim, and the new encoders plus self-delimiting random-access support constitute practical engineering contributions to the reference-based compression literature.
major comments (1)
- [Abstract] Abstract: the central claim that ordering gains are 'largely insensitive to the encoder' and that four encoders agree within roughly ±0.04 bpe on the Leiden+LLP-vs-plain-LLP delta is load-bearing for the conclusion that ordering dominates encoder choice, yet the abstract supplies no variance estimates, error bars, statistical tests, or details on how the four encoders were selected and parameterized; this absence prevents assessment of whether the reported stability is statistically anchored or an artifact of the chosen subset.
minor comments (2)
- [Abstract] Abstract: the phrase 'on every dataset and encoder we measured' would be clearer if the number of datasets and the identity of the four encoders were stated explicitly rather than left implicit.
- [Abstract] Abstract: the parenthetical '(K for CG encoder)' in the free-parameters list is not defined in the provided text; a brief inline definition or forward reference would aid readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for the positive evaluation of the work's significance. We address the single major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that ordering gains are 'largely insensitive to the encoder' and that four encoders agree within roughly ±0.04 bpe on the Leiden+LLP-vs-plain-LLP delta is load-bearing for the conclusion that ordering dominates encoder choice, yet the abstract supplies no variance estimates, error bars, statistical tests, or details on how the four encoders were selected and parameterized; this absence prevents assessment of whether the reported stability is statistically anchored or an artifact of the chosen subset.
Authors: We agree that the abstract's brevity leaves the claim without supporting detail. The reported ±0.04 bpe figure is the maximum observed difference among the four measured deltas (Leiden+LLP minus plain-LLP) on the four datasets, not a statistical variance or error bar. The four encoders are BVGraph high-compression together with the three new encoders (BG, CS, CG) introduced in the paper; each was parameterized independently to minimize bits per edge for the ordering under test, as described in the experimental setup. No formal statistical tests were performed, as the study reports consistent empirical patterns across real-world graphs rather than inferential claims. We will revise the abstract to name the encoders and to state that the agreement is the observed range in our experiments. This change will be incorporated in the next version. revision: yes
Circularity Check
No circularity: empirical measurements on external datasets with no derivation chain or fitted predictions.
full rationale
The paper is an empirical study that measures compressed sizes (bits per edge) on external graph datasets using a proposed two-stage Leiden+LLP ordering and three new encoders (BG, CS, CG). All reported gains, insensitivity claims, and comparisons are direct experimental outcomes against baselines like BVGraph and URL order. No equations, first-principles derivations, parameter fits renamed as predictions, or self-citation chains appear in the provided text; results rest on observable performance differences rather than reducing to internal definitions or inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- K for CG encoder
- Number of candidate decompositions
axioms (1)
- domain assumption Leiden community detection identifies clusters whose induced subgraphs exhibit sufficient locality for reference-based neighbor encoding
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a two-stage Leiden+LLP vertex ordering—global LLP to seed labels, Leiden community detection, then per-cluster LLP on each induced subgraph—and study how it interacts with reference-based compression.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
On graphs with poor initial vertex order, reordering saves 0.3 to 5.4 bits per edge... the size of that gain is largely insensitive to the encoder.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Alberto Apostolico and Guido Drovandi. 2009. Graph Compression by BFS.Algorithms2, 3 (2009), 1031–1044. doi:10.3390/a2031031
-
[2]
Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre
Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast Unfolding of Communities in Large Networks. Journal of Statistical Mechanics: Theory and Experiment2008, 10 (2008), P10008. doi:10.1088/1742-5468/2008/10/P10008
-
[3]
Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. 2011. Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks. InProceedings of the 20th International Conference on World Wide Web (WWW). ACM, 587–596. doi:10.1145/ 1963405.1963488
-
[4]
Paolo Boldi, Massimo Santini, and Sebastiano Vigna. 2009. Permuting Web and Social Graphs.Internet Mathematics6, 3 (2009), 257–283. doi:10.1080/15427951.2009.10129184
-
[5]
Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph Framework I: Compression Techniques. InProceedings of the 13th International Conference on World Wide Web (WWW). ACM, 595–601. doi:10.1145/988672.988752
-
[6]
Paolo Boldi and Sebastiano Vigna. 2005. Codes for the World Wide Web.Internet Mathematics2, 4 (2005), 405–427. doi:10.1080/15427951.2005.10129111
-
[7]
Brisaboa, Susana Ladra, and Gonzalo Navarro
Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2009. 𝑘 2-Trees for Compact Web Graph Representation.String Processing and Information Retrieval (SPIRE)(2009), 18–30. doi:10.1007/978-3-642-03784-9_3
-
[8]
Brisaboa, Susana Ladra, and Gonzalo Navarro
Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2014. Compact Representation of Web Graphs with Extended Functionality.Information Systems39 (2014), 152–174. doi:10.1016/j.is.2013.08.003 26 Jimmy Dubuisson
-
[9]
Zheng Chen, Feng Zhang, Jiawei Guan, Jidong Zhai, Xipeng Shen, Huanchen Zhang, Wentong Shu, and Xiaoyong Du. 2023. CompressGraph: Efficient Parallel Graph Analytics with Rule-Based Compression.Proceedings of the ACM on Management of Data (SIGMOD)1, 1 (2023). doi:10.1145/3588684
-
[10]
Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. 2009. On Compressing Social Networks. InProceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 219–228. doi:10.1145/1557019.1557049
-
[11]
Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing Graphs and Indexes with Recursive Graph Bisection. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1535–1544. doi:10.1145/2939672.2939862
-
[12]
Aviezri S. Fraenkel and Shmuel T. Klein. 1996. Robust Universal Complete Codes for Transmission and Compression.Discrete Applied Mathematics 64, 1 (1996), 31–55. doi:10.1016/0166-218X(93)E0145-O
-
[13]
Laboratory for Web Algorithmics. 2024. LAW Datasets. https://law.di.unimi.it/datasets.php
work page 2024
-
[14]
Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data
work page 2014
-
[15]
Traag, Ludo Waltman, and Nees Jan van Eck
Vincent A. Traag, Ludo Waltman, and Nees Jan van Eck. 2019. From Louvain to Leiden: Guaranteeing Well-Connected Communities.Scientific Reports9, 1 (2019), 5233. doi:10.1038/s41598-019-41695-z
-
[16]
Sebastiano Vigna. 2013. Quasi-Succinct Indices. InProceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM). ACM, 83–92. doi:10.1145/2433396.2433409
-
[17]
Zheng Xu et al. 2024. Improving Graph Compression for Efficient Resource-Constrained Graph Analytics.Proceedings of the VLDB Endowment17, 9 (2024), 2212–2225
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.