pith. sign in

arxiv: 1907.01032 · v2 · pith:LELJ3DQYnew · submitted 2019-07-01 · 💻 cs.IR · cs.DS

On Slicing Sorted Integer Sequences

Pith reviewed 2026-05-25 11:17 UTC · model grok-4.3

classification 💻 cs.IR cs.DS
keywords sorted integer sequencesinverted index compressionlist intersectionuniverse partitioningrecursive slicingspace time tradeoffpartitioning paradigms
0
0 comments X

The pith

Recursive slicing of the number universe for sorted integer sequences yields faster list intersections and unions than cardinality partitioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines two partitioning methods for compressing sorted integer sequences used in retrieval systems: dividing by the number of elements in each part or by splitting the overall range of numbers. It finds that recursive splitting of the range, though resulting in slightly larger compressed sizes than some existing methods, leads to substantially faster intersections and unions. This is relevant because these operations are key to query processing in systems like web search engines. The work provides the first direct experimental comparison of the two paradigms in the context of inverted index compression.

Core claim

A solution that recursively slices the universe of representation of a sequence achieves compact storage and fast query execution. Albeit larger than some state-of-the-art representations, this slicing approach substantially improves the performance of list intersections and unions while operating in compressed space, offering an excellent space/time trade-off.

What carries the argument

Recursive universe slicing, which partitions the sequence by repeatedly dividing the range of possible values to balance storage and access speed.

If this is right

  • List intersections and unions execute faster in compressed space.
  • Space/time trade-off improves for inverted index compression problems.
  • Partitioning by universe receives more attention as a viable alternative to cardinality partitioning.
  • Query resolution in large-scale retrieval systems can benefit from this method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating this into existing search engines could reduce query latency without increasing memory much.
  • The recursive slicing might apply to other data structures involving sorted integers, such as in databases.
  • Further optimizations could combine this with other compression techniques for even better results.

Load-bearing premise

That implementing and testing the recursive slicing will show measurable speed gains over current methods for list operations.

What would settle it

Running the slicing method on standard inverted index benchmarks and finding no speedup or even slowdown in intersection and union times compared to cardinality-based approaches.

Figures

Figures reproduced from arXiv: 1907.01032 by Giulio Ermanno Pibiri.

Figure 1
Figure 1. Figure 1: Example of a sequence of 32 values drawn from a universe of size 56 as partitioned by cardinality (a) [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Two different intersection algorithms, respectively suitable for sequences that are partitioned by [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The upper part (a) of the picture shows the recursive slicing of the universe [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: C++ macros to support the vectorized implementation of the intersection between small sorted arrays. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The C++ skeleton of the vectorized implementation of the intersection between small sorted arrays. The [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Plots (a), (c) and (e) show the percentage of integers covered by the full chunks (FC), dense chunks [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Space/time trade-off curve for the different representations summarized in Table [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
read the original abstract

Representing sorted integer sequences in small space is a central problem for large-scale retrieval systems such as Web search engines. Efficient query resolution, e.g., intersection or random access, is achieved by carefully partitioning the sequences. In this work we describe and compare two different partitioning paradigms: partitioning by cardinality and partitioning by universe. Although the ideas behind such paradigms have been known in the coding and algorithmic community since many years, inverted index compression has extensively adopted the former paradigm, whereas the latter has received only little attention. As a result, an experimental comparison between these two is missing for the setting of inverted index compression. We also propose and implement a solution that recursively slices the universe of representation of a sequence to achieve compact storage and attain to fast query execution. Albeit larger than some state-of-the-art representations, this slicing approach substantially improves the performance of list intersections and unions while operating in compressed space, thus offering an excellent space/time trade-off for the problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript compares cardinality-based and universe-based partitioning paradigms for representing sorted integer sequences in inverted indexes. It proposes and implements a recursive universe-slicing technique, presents an experimental comparison (previously missing from the literature), and claims that the slicing method, while larger than some state-of-the-art representations, delivers substantially faster intersections and unions in compressed space and therefore an excellent space/time trade-off.

Significance. If the reported speedups hold under standard inverted-index workloads and baselines, the work supplies the first direct experimental comparison of the two partitioning paradigms and introduces a practical alternative that could improve query latency in large-scale retrieval systems without sacrificing compression.

major comments (1)
  1. The central performance claim rests entirely on the new experiments for recursive slicing; the abstract itself notes that 'an experimental comparison between these two is missing' and that the paradigm 'has received only little attention'. Without the full experimental section (datasets, baselines, implementation details, and statistical significance), it is impossible to assess whether the reported gains are robust to baseline choice or implementation overhead.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. We address the major comment point by point below.

read point-by-point responses
  1. Referee: The central performance claim rests entirely on the new experiments for recursive slicing; the abstract itself notes that 'an experimental comparison between these two is missing' and that the paradigm 'has received only little attention'. Without the full experimental section (datasets, baselines, implementation details, and statistical significance), it is impossible to assess whether the reported gains are robust to baseline choice or implementation overhead.

    Authors: The manuscript contains a full experimental section that details the datasets (standard TREC collections used for inverted-index evaluation), the baselines (representative methods from both cardinality-based and universe-based partitioning, including prior state-of-the-art codecs), the implementation of the recursive universe-slicing technique, and the measured intersection/union times in compressed space. This section supplies the direct experimental comparison that the abstract notes was previously absent from the literature. The reported speedups are therefore grounded in these concrete results rather than in the absence of experiments. revision: no

Circularity Check

0 steps flagged

No circularity: algorithmic proposal without derived quantities or self-referential steps

full rationale

The paper introduces a recursive universe-slicing technique for sorted integer sequences and contrasts cardinality-based vs. universe-based partitioning. No equations, fitted parameters, predictions, or first-principles derivations appear that could reduce to their own inputs by construction. The contribution consists of a new algorithmic method whose performance claims rest on implementation and experimental comparison rather than any self-definitional, fitted-input, or self-citation load-bearing chain. The abstract and description treat the approach as an independent engineering proposal whose value is assessed externally via runtime and space measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an algorithmic proposal relying only on standard assumptions from coding theory and data structures; no free parameters, domain-specific axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5682 in / 910 out tokens · 26933 ms · 2026-05-25T11:17:54.679283+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    Vo Ngoc Anh and Alistair Moffat. 2005. Inverted Index Compression Using Word-Aligned Binary Codes. Information Retrieval Journal 8, 1 (2005), 151–166

  2. [2]

    Vo Ngoc Anh and Alistair Moffat. 2010. Index compression using 64-bit words. Software: Practice and Experience 40, 2 (2010), 131–147. On Slicing Sorted Integer Sequences 19

  3. [3]

    Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Y

    Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Y. Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the 12th ACM International Conference on Information and Knowledge Management. 426–434

  4. [4]

    Barla Cambazoglu and Ricardo Baeza-Yates

    B. Barla Cambazoglu and Ricardo Baeza-Yates. 2015. Scalability Challenges in Web Search Engines . Morgan & Claypool Publishers

  5. [5]

    Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin. 2016. Better bitmap performance with Roaring bitmaps. Software: practice and experience 46, 5 (2016), 709–719

  6. [6]

    Cormen, Charles E

    Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3rd ed.). MIT Press

  7. [7]

    Jeffrey Dean. 2009. Challenges in building large-scale information retrieval systems: invited talk. In Proceedings of the 2nd International Conference on Web Search and Data Mining

  8. [8]

    Peter Elias. 1974. Efficient Storage and Retrieval by Content and Address of Static Files. J. ACM 21, 2 (1974), 246–260

  9. [9]

    Peter Elias. 1975. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 21, 2 (1975), 194–203

  10. [10]

    Robert Mario Fano. 1971. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971)

  11. [11]

    Solomon Golomb. 1966. Run-length encodings. IEEE Transactions on Information Theory 12, 3 (1966), 399–401

  12. [12]

    Daniel Lemire and Leonid Boytsov. 2015. Decoding billions of integers per second through vectorization. 45, 1 (2015), 1–29

  13. [13]

    Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O’Hara, François Saint-Jacques, and Gregory Ssi-Yan-Kai

  14. [14]

    Software: Practice and Experience 48, 4 (2018), 867–895

    Roaring bitmaps: Implementation of an optimized software library. Software: Practice and Experience 48, 4 (2018), 867–895

  15. [15]

    Daniel Lemire, Nathan Kurz, and Christoph Rupp. 2018. Stream-VByte: faster byte-oriented integer compression. Inform. Process. Lett. 130 (2018), 1–6

  16. [16]

    Daniel Lemire, Gregory Ssi-Yan-Kai, and Owen Kaser. 2016. Consistently faster and smaller compressed bitmaps with roaring. Software: Practice and Experience 46, 11 (2016), 1547–1569

  17. [17]

    Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and Rossano Venturini. 2017. Faster BlockMax WAND with Variable-sized Blocks. In Proceedings of the International ACM Conference on Research and Development in Information Retrieval. 625–634

  18. [18]

    2008.Introduction to Information Retrieval

    Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008.Introduction to Information Retrieval. Cambridge University Press

  19. [19]

    Alistair Moffat. 2008. Compressing integer sequences and sets. In Encyclopedia of algorithms. Springer, 1–99

  20. [20]

    Alistair Moffat and Matthias Petri. 2017. ANS-Based Index Compression. In Proceedings of the ACM on Conference on Information and Knowledge Management . 677–686

  21. [21]

    Alistair Moffat and Matthias Petri. 2018. Index Compression Using Byte-Aligned ANS Coding and Two-Dimensional Contexts. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining . 405–413

  22. [22]

    Alistair Moffat and Lang Stuiver. 1996. Exploiting Clustering in Inverted File Compression. In Data Compression Conference. 82–91

  23. [23]

    Alistair Moffat and Lang Stuiver. 2000. Binary Interpolative Coding for Effective Index Compression. Information Retrieval Journal 3, 1 (2000), 25–47

  24. [24]

    Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned Elias-Fano Indexes. InProceedings of the 37th International Conference on Research and Development in Information Retrieval . 273–282

  25. [25]

    Prashant Pandey, Michael A Bender, and Rob Johnson. 2017. A fast x86 implementation of select. arXiv preprint arXiv:1706.00990 (2017)

  26. [26]

    Giulio Ermanno Pibiri, Matthias Petri, and Alistair Moffat. 2019. Fast Dictionary-Based Compression for Inverted Indexes. In International ACM Conference on Web Search and Data Mining . 9

  27. [27]

    Giulio Ermanno Pibiri and Rossano Venturini. 2017. Clustered Elias-Fano indexes. ACM Transactions on Information Systems 36, 1, Article 2 (2017), 33 pages

  28. [28]

    Giulio Ermanno Pibiri and Rossano Venturini. 2017. Dynamic Elias-Fano Representation. In Proceedings of the 28-th Annual Symposium on Combinatorial Pattern Matching . 30:1–30:14

  29. [29]

    Giulio Ermanno Pibiri and Rossano Venturini. 2018. Inverted Index Compression.Encyclopedia of Big Data Technologies (2018), 1–8. 20 Giulio Ermanno Pibiri

  30. [30]

    Giulio Ermanno Pibiri and Rossano Venturini. 2019. On Optimally Partitioning Variable-Byte Codes. IEEE Transactions on Knowledge and Data Engineering (2019), 1–12

  31. [31]

    Jeff Plaisance, Nathan Kurz, and Daniel Lemire. 2015. Vectorized VByte Decoding. In International Symposium on Web Algorithms

  32. [32]

    Stephen Robertson and Sparck Jones. 1976. Relevance weighting of search terms. Journal of the American Society for Information Science 27, 3 (1976), 129–146

  33. [33]

    Benjamin Schlegel, Thomas Willhalm, and Wolfgang Lehner. 2011. Fast Sorted-Set Intersection using SIMD Instructions.. In ADMS@ VLDB. 1–8

  34. [34]

    Falk Scholer, Hugh E Williams, John Yiannis, and Justin Zobel. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 222–229

  35. [35]

    Fabrizio Silvestri. 2007. Sorting Out the Document Identifier Assignment Problem. In Proceedings of the 29th European Conference on IR Research . 101–112

  36. [36]

    Alexander Stepanov, Anil Gangolli, Daniel Rose, Ryan Ernst, and Paramjit Oberoi. 2011. SIMD-based decoding of posting lists. In Proceedings of the 20th International Conference on Information and Knowledge Management . 317–326

  37. [37]

    Larry H Thiel and HS Heaps. 1972. Program design for retrospective searches on large data bases. Information Storage and Retrieval 8, 1 (1972), 1–20

  38. [38]

    Andrew Trotman. 2014. Compression, SIMD, and postings lists. In Proceedings of the 2014 Australasian Document Computing Symposium. ACM, 50

  39. [39]

    Peter van Emde Boas. 1975. Preserving Order in a Forest in less than Logarithmic Time. In Proceedings of the 16-th Annual Symposium on Foundations of Computer Science . 75–84

  40. [40]

    Peter van Emde Boas. 1977. Preserving Order in a Forest in Less Than Logarithmic Time and Linear Space. Inform. Process. Lett. 6, 3 (1977), 80–82

  41. [41]

    Sebastiano Vigna. 2013. Quasi-succinct indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining. 83–92

  42. [42]

    Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, and Steven Swanson. 2017. An experimental study of bitmap compression vs. inverted list compression. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 993–1008

  43. [43]

    Ian Witten, Alistair Moffat, and Timothy Bell. 1999. Managing gigabytes: compressing and indexing documents and images (2nd ed.). Morgan Kaufmann

  44. [44]

    Hao Yan, Shuai Ding, and Torsten Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International Conference on World Wide Web . 401–410

  45. [45]

    Zhang, X

    J. Zhang, X. Long, and T. Suel. 2008. Performance of compressed inverted list caching in search engines. InInternational World Wide Web Conference (WWW). 387–396

  46. [46]

    Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. Comput. Surveys 38, 2 (2006), 1–56

  47. [47]

    Marcin Zukowski, Sándor Héman, Niels Nes, and Peter Boncz. 2006. Super-Scalar RAM-CPU Cache Compression. In Proceedings of the 22nd International Conference on Data Engineering . 59–70