pith. sign in

arxiv: 2605.21510 · v1 · pith:ND3G2VP5new · submitted 2026-05-13 · 💻 cs.SI · cs.LG

Community-Aware Vertex Ordering for Reference-Based Graph Compression: A Cross-Encoder Empirical Study

Pith reviewed 2026-05-22 01:39 UTC · model grok-4.3

classification 💻 cs.SI cs.LG
keywords graph compressionvertex orderingreference encodingcommunity detectionLeiden algorithmLayered Label Propagationweb graphsdirected graphs
0
0 comments X

The pith

A two-stage Leiden+LLP vertex ordering reduces bits per edge by 0.3 to 5.4 in reference-based graph compression, with gains largely independent of encoder choice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reordering vertices via global Layered Label Propagation seeding, Leiden community detection, and per-cluster LLP produces orderings with better locality than plain LLP or URL order. On graphs that start with weak ordering, this yields consistent compression savings across multiple datasets and several different reference-based encoders. The study separates the contribution of ordering from encoder design by testing new per-vertex cost-optimal encoders and finds that ordering improvements are larger and more stable than encoder improvements on weakly ordered inputs. A reader would care because large directed graphs such as web crawls or social networks dominate storage and query costs, so even modest bits-per-edge reductions translate directly into smaller files and faster access.

Core claim

The central claim is that the Leiden+LLP two-stage ordering improves reference-based compression on poorly ordered graphs by 0.3 to 5.4 bits per edge on every dataset and encoder measured, and that this gain is largely insensitive to encoder parameterization. Four independently tuned encoders agree on the Leiden+LLP versus plain-LLP difference within roughly plus or minus 0.04 bits per edge on four of five weakly ordered datasets. On URL-ordered web graphs the benefit persists for adaptive encoders but can be mildly negative for encoders already tuned to URL residual structure. Three new encoders that select among up to 28 decompositions per vertex improve over BVGraph high-compression by 2-

What carries the argument

The two-stage Leiden+LLP vertex ordering: global LLP seeding followed by Leiden community detection and then per-cluster LLP on each induced subgraph, which improves neighbor-list locality for reference encoding.

Load-bearing premise

That the combination of global LLP seeding, Leiden community detection, and per-cluster LLP will reliably produce orderings with superior locality for reference encoding than plain LLP or URL order, and that this holds beyond the five tested weakly ordered datasets.

What would settle it

A new weakly ordered graph on which the Leiden+LLP ordering produces equal or higher bits per edge than plain LLP or URL ordering when measured with the same encoder.

Figures

Figures reproduced from arXiv: 2605.21510 by Jimmy Dubuisson.

Figure 1
Figure 1. Figure 1: Erdős–Rényi 𝐺 (10000, 𝑝): bpe vs. density. Dashed line: theoretical entropy ℎ(𝑝)/𝑝 [10]. (1) Leiden+llp ordering (right panel) reduces bpe by 3.5–5.4 bits, dwarfing all method differences. CG achieved the best bpe at every 𝜇 by 0.2–0.4 bpe over CS. (2) Without reordering (left panel), CS is the most robust encoder, while BV catches up at high 𝜇 where community structure is weak. These are single-run result… view at source ↗
Figure 2
Figure 2. Figure 2: Synthetic web graphs (10000 vertices): bpe vs. density. Best-known parameters with Fibonacci encoding. BV. BG peaks at 𝑤=64 (2.330 bpe, 18.4 𝜇s/edge) and CS at 𝑤=256 (2.308 bpe, 53.0 𝜇s/edge). Notably, CG’s BPE curve is U-shaped: beyond 𝑤=64, its fixed-width reference header (1 + ⌈log2 𝑤⌉ bits per vertex) costs more than the marginal reference improvement. The fast cost model (cost_model=1) trades ∼0.2 bpe… view at source ↗
Figure 3
Figure 3. Figure 3: LFR benchmark (10000 vertices, avg degree 15): [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: CNR-2000: bpe and encoding speed vs. window size 𝑤. BG/CS use Leiden+llp; CG uses LAW ordering. Left: bpe with BV (dashed) and BV-HC (dotted) thresholds. Right: encoding speed (𝜇s/edge, log scale). All single-threaded [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗
read the original abstract

Reference-based graph compression encodes each vertex's neighbor list relative to a recent vertex, exploiting locality to compress large directed graphs. The dominant tool, WebGraph's BVGraph, fixes a single encoding pipeline and relies on a separately chosen vertex ordering -- typically URL-lexicographic or Layered Label Propagation (LLP). The interaction between ordering and encoder is rarely measured. We propose a two-stage Leiden+LLP vertex ordering -- global LLP to seed labels, Leiden community detection, then per-cluster LLP on each induced subgraph -- and study how it interacts with reference-based compression. On graphs with poor initial vertex order, reordering saves 0.3 to 5.4 bits per edge on every dataset and encoder we measured. The size of that gain is largely insensitive to the encoder: on four of five weakly ordered datasets, four independently parameterised encoders agree on the Leiden+LLP-vs-plain-LLP gain within roughly +/- 0.04 bpe. On URL-ordered web crawls, where the distributed ordering already encodes locality, adaptive encoders still benefit from reordering, but encoders tuned to URL-induced residual structure (BV-HC, CG at K>1) are mildly hurt by it. To quantify how much encoder choice matters once ordering is fixed, we contribute three reference-based encoders -- BG, CS, and CG -- that perform per-vertex cost-optimal selection from up to 28 candidate decompositions. Each is run under its own best-tested ordering. The best of the three improves over BVGraph high-compression by 2-9% on every dataset tested, with the encoder-level gain consistently smaller than the ordering-level gain on weakly ordered datasets. The encoder framework also yields a self-delimiting bitstream that supports low-overhead random access.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that a two-stage Leiden+LLP vertex ordering (global LLP seeding, Leiden community detection, then per-cluster LLP) improves reference-based compression of directed graphs. On graphs with poor initial order, reordering yields savings of 0.3–5.4 bits per edge across all tested datasets and encoders; these gains are largely insensitive to encoder choice, with four independently parameterized encoders agreeing on the Leiden+LLP vs. plain-LLP delta within roughly ±0.04 bpe on four of five weakly ordered datasets. The authors introduce three new reference-based encoders (BG, CS, CG) that perform per-vertex cost-optimal selection from up to 28 candidate decompositions; the best of these improves over BVGraph high-compression by 2–9 % on every dataset, with encoder-level gains consistently smaller than ordering-level gains on weakly ordered graphs. The framework also produces a self-delimiting bitstream supporting low-overhead random access.

Significance. If the reported empirical patterns hold, the work is significant because it supplies concrete evidence that vertex ordering exerts a larger and more stable effect on compressed size than incremental changes to the reference encoder itself. The cross-encoder comparison on multiple real-world graphs strengthens the robustness of the ordering claim, and the new encoders plus self-delimiting random-access support constitute practical engineering contributions to the reference-based compression literature.

major comments (1)
  1. [Abstract] Abstract: the central claim that ordering gains are 'largely insensitive to the encoder' and that four encoders agree within roughly ±0.04 bpe on the Leiden+LLP-vs-plain-LLP delta is load-bearing for the conclusion that ordering dominates encoder choice, yet the abstract supplies no variance estimates, error bars, statistical tests, or details on how the four encoders were selected and parameterized; this absence prevents assessment of whether the reported stability is statistically anchored or an artifact of the chosen subset.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'on every dataset and encoder we measured' would be clearer if the number of datasets and the identity of the four encoders were stated explicitly rather than left implicit.
  2. [Abstract] Abstract: the parenthetical '(K for CG encoder)' in the free-parameters list is not defined in the provided text; a brief inline definition or forward reference would aid readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for the positive evaluation of the work's significance. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that ordering gains are 'largely insensitive to the encoder' and that four encoders agree within roughly ±0.04 bpe on the Leiden+LLP-vs-plain-LLP delta is load-bearing for the conclusion that ordering dominates encoder choice, yet the abstract supplies no variance estimates, error bars, statistical tests, or details on how the four encoders were selected and parameterized; this absence prevents assessment of whether the reported stability is statistically anchored or an artifact of the chosen subset.

    Authors: We agree that the abstract's brevity leaves the claim without supporting detail. The reported ±0.04 bpe figure is the maximum observed difference among the four measured deltas (Leiden+LLP minus plain-LLP) on the four datasets, not a statistical variance or error bar. The four encoders are BVGraph high-compression together with the three new encoders (BG, CS, CG) introduced in the paper; each was parameterized independently to minimize bits per edge for the ordering under test, as described in the experimental setup. No formal statistical tests were performed, as the study reports consistent empirical patterns across real-world graphs rather than inferential claims. We will revise the abstract to name the encoders and to state that the agreement is the observed range in our experiments. This change will be incorporated in the next version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on external datasets with no derivation chain or fitted predictions.

full rationale

The paper is an empirical study that measures compressed sizes (bits per edge) on external graph datasets using a proposed two-stage Leiden+LLP ordering and three new encoders (BG, CS, CG). All reported gains, insensitivity claims, and comparisons are direct experimental outcomes against baselines like BVGraph and URL order. No equations, first-principles derivations, parameter fits renamed as predictions, or self-citation chains appear in the provided text; results rest on observable performance differences rather than reducing to internal definitions or inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claims rest on the empirical effectiveness of standard community-detection and label-propagation algorithms plus the assumption that the tested graphs are representative of real-world directed networks; no new mathematical axioms or invented physical entities are introduced.

free parameters (2)
  • K for CG encoder
    Parameter K>1 used when tuning CG to URL-induced residual structure.
  • Number of candidate decompositions
    Up to 28 candidate decompositions from which per-vertex selection is made in BG, CS, and CG.
axioms (1)
  • domain assumption Leiden community detection identifies clusters whose induced subgraphs exhibit sufficient locality for reference-based neighbor encoding
    Invoked when the paper describes the second stage of the proposed ordering pipeline.

pith-pipeline@v0.9.0 · 5857 in / 1384 out tokens · 128308 ms · 2026-05-22T01:39:30.184105+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1]

    Alberto Apostolico and Guido Drovandi. 2009. Graph Compression by BFS.Algorithms2, 3 (2009), 1031–1044. doi:10.3390/a2031031

  2. [2]

    Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre

    Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast Unfolding of Communities in Large Networks. Journal of Statistical Mechanics: Theory and Experiment2008, 10 (2008), P10008. doi:10.1088/1742-5468/2008/10/P10008

  3. [3]

    Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. 2011. Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks. InProceedings of the 20th International Conference on World Wide Web (WWW). ACM, 587–596. doi:10.1145/ 1963405.1963488

  4. [4]

    Paolo Boldi, Massimo Santini, and Sebastiano Vigna. 2009. Permuting Web and Social Graphs.Internet Mathematics6, 3 (2009), 257–283. doi:10.1080/15427951.2009.10129184

  5. [5]

    Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph Framework I: Compression Techniques. InProceedings of the 13th International Conference on World Wide Web (WWW). ACM, 595–601. doi:10.1145/988672.988752

  6. [6]

    Paolo Boldi and Sebastiano Vigna. 2005. Codes for the World Wide Web.Internet Mathematics2, 4 (2005), 405–427. doi:10.1080/15427951.2005.10129111

  7. [7]

    Brisaboa, Susana Ladra, and Gonzalo Navarro

    Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2009. 𝑘 2-Trees for Compact Web Graph Representation.String Processing and Information Retrieval (SPIRE)(2009), 18–30. doi:10.1007/978-3-642-03784-9_3

  8. [8]

    Brisaboa, Susana Ladra, and Gonzalo Navarro

    Nieves R. Brisaboa, Susana Ladra, and Gonzalo Navarro. 2014. Compact Representation of Web Graphs with Extended Functionality.Information Systems39 (2014), 152–174. doi:10.1016/j.is.2013.08.003 26 Jimmy Dubuisson

  9. [9]

    Zheng Chen, Feng Zhang, Jiawei Guan, Jidong Zhai, Xipeng Shen, Huanchen Zhang, Wentong Shu, and Xiaoyong Du. 2023. CompressGraph: Efficient Parallel Graph Analytics with Rule-Based Compression.Proceedings of the ACM on Management of Data (SIGMOD)1, 1 (2023). doi:10.1145/3588684

  10. [10]

    Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. 2009. On Compressing Social Networks. InProceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 219–228. doi:10.1145/1557019.1557049

  11. [11]

    Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing Graphs and Indexes with Recursive Graph Bisection. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1535–1544. doi:10.1145/2939672.2939862

  12. [12]

    Fraenkel and Shmuel T

    Aviezri S. Fraenkel and Shmuel T. Klein. 1996. Robust Universal Complete Codes for Transmission and Compression.Discrete Applied Mathematics 64, 1 (1996), 31–55. doi:10.1016/0166-218X(93)E0145-O

  13. [13]

    Laboratory for Web Algorithmics. 2024. LAW Datasets. https://law.di.unimi.it/datasets.php

  14. [14]

    Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data

  15. [15]

    Traag, Ludo Waltman, and Nees Jan van Eck

    Vincent A. Traag, Ludo Waltman, and Nees Jan van Eck. 2019. From Louvain to Leiden: Guaranteeing Well-Connected Communities.Scientific Reports9, 1 (2019), 5233. doi:10.1038/s41598-019-41695-z

  16. [16]

    Sebastiano Vigna. 2013. Quasi-Succinct Indices. InProceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM). ACM, 83–92. doi:10.1145/2433396.2433409

  17. [17]

    Zheng Xu et al. 2024. Improving Graph Compression for Efficient Resource-Constrained Graph Analytics.Proceedings of the VLDB Endowment17, 9 (2024), 2212–2225