pith. sign in

arxiv: 2605.20683 · v2 · pith:7H7XAVGBnew · submitted 2026-05-20 · 💻 cs.IR

Layer-wise Token Compression for Efficient Document Reranking

Pith reviewed 2026-05-22 09:15 UTC · model grok-4.3

classification 💻 cs.IR
keywords token compressioncross-encoder rerankingtransformer efficiencyinformation retrievaladaptive poolingdocument rankingMS MARCO
0
0 comments X

The pith

Applying adaptive token pooling at intermediate transformer layers speeds up cross-encoder rerankers without loss of ranking quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that token compression, when applied right at the embedding layer, disrupts a cross-encoder's ability to judge query-document relevance, yet the same compression performed at middle layers leaves relevance modeling intact. By reducing the token count for all subsequent layers, the approach cuts the quadratic cost of attention and feed-forward steps. Experiments on MS MARCO passage and document ranking show that these compressed models match the effectiveness of their full counterparts while raising queries per second by up to 25 percent on passages and 116 percent on documents. The same layer-wise pattern transfers to listwise LLM rerankers and produces even larger speed gains on long contexts. Models trained with the compression also outperform uncompressed versions when ranking long documents, indicating that the technique may encourage length-invariant representations.

Core claim

The central claim is that adaptive token pooling inserted at intermediate transformer layers, rather than at the initial embedding layer, reduces the effective sequence length for later computations while preserving the cross-encoder's capacity to model query-document interactions, thereby delivering higher inference throughput on both passage and document ranking tasks without degrading standard effectiveness metrics.

What carries the argument

Layer-wise Token Compression (LTC), which inserts adaptive token pooling operations at selected intermediate layers to shrink the token sequence before the remaining transformer blocks.

If this is right

  • Ranking effectiveness on MS MARCO passage and document tasks stays comparable to the uncompressed models.
  • Inference throughput rises by up to 25 percent for passage reranking and up to 116 percent for document reranking.
  • The identical compression pattern applies directly to listwise LLM rerankers and yields larger relative speed gains on long inputs.
  • Models trained with compression outperform their uncompressed counterparts when used for long-document ranking tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Early-layer compression appears to destroy interaction patterns that middle layers would otherwise build, which explains why only later placement succeeds.
  • The length-invariance benefit observed on long documents suggests compression acts as implicit regularization against over-reliance on document length cues.
  • The same selective reduction could be explored in other transformer pipelines where early layers capture coarse features and later layers refine them.
  • Learned or query-dependent policies for choosing compression layers might further improve the speed-quality trade-off.

Load-bearing premise

The approach assumes that query-document matching signals are already sufficiently formed by the middle layers so that later token reduction does not erase critical interaction information.

What would settle it

An experiment that applies middle-layer compression on the MS MARCO document ranking task and measures a statistically significant drop in NDCG@10 relative to the uncompressed baseline at matched computational cost would refute the claim that ranking quality is preserved.

Figures

Figures reproduced from arXiv: 2605.20683 by Ivano Lauriola, Shengyao Zhuang, Zhichao Xu.

Figure 1
Figure 1. Figure 1: Effect of LTC to pointwise passage ranking. Left: [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effect of zero-shot compression for pointwise [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of LTC to listwise document ranking. Left: [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
read the original abstract

Transformer-based document cross-encoder rerankers are a central component of modern information retrieval systems. Despite their success, these models suffer from high computational costs due to processing long query-document sequences at inference time. A known approach to improve efficiency is token compression, which consists of aggregating groups of tokens together in the initial embedding layer, reducing the effective number of tokens, and making the computation faster. While token compression has proven to be successful for bi-encoder retrievers, we empirically observed that this approach may be ineffective for cross-encoder rerankers. In this paper, we propose Layer-wise Token Compression (LTC), which applies adaptive token pooling at intermediate transformer layers. Through extensive ablation studies on MS MARCO passage and document ranking tasks, we demonstrate that compression at middle layers preserves ranking quality while increasing inference QPS by up to 25% for passage ranking and up to 116% for document ranking. We also extend LTC to listwise LLM rerankers and show that the same approach can be easily applied to long-context listwise reranking, where the QPS improvements are even greater. More surprisingly, when applying rerankers trained on short passages to long-document ranking tasks, models trained with compression outperform their uncompressed counterparts, suggesting that compression may act as a beneficial regularizer that encourages length-invariant representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes Layer-wise Token Compression (LTC) for transformer-based cross-encoder rerankers. Rather than applying token compression at the initial embedding layer (observed to degrade cross-encoder performance), LTC performs adaptive token pooling at selected intermediate layers. Ablation studies on MS MARCO passage and document ranking tasks show that middle-layer compression preserves ranking quality while yielding inference QPS gains of up to 25% (passages) and 116% (documents). The approach is extended to listwise LLM rerankers with larger gains, and models trained with compression outperform uncompressed baselines when transferred from short-passage to long-document ranking, suggesting a regularizing effect toward length-invariant representations.

Significance. If the empirical findings prove robust, the work offers practical value for efficient neural reranking in production IR pipelines, especially for long-context and LLM-based listwise settings. The key insight that layer position critically affects compression viability for cross-encoders (unlike bi-encoders) and the incidental regularizer benefit are useful observations. The paper supplies ablation results across tasks and an extension to LLMs, which bolsters applicability. However, the absence of variance estimates and statistical tests in the reported efficiency and effectiveness numbers weakens the strength of the central efficiency-quality trade-off claim.

major comments (3)
  1. [§4.2 and Table 2] §4.2 and Table 2: The reported QPS gains (25% for passage ranking, 116% for document ranking) are presented as point estimates without error bars, standard deviations across runs, or statistical significance tests. Because the central claim is that quality is preserved while efficiency improves, the lack of these controls makes it impossible to determine whether the gains exceed experimental noise.
  2. [§3.1 and §3.2] §3.1 and §3.2: The precise definition of the adaptive pooling operation, the criterion used to choose which intermediate layers receive compression, and the exact compression ratios tested are not fully specified. These details are load-bearing for reproducing the reported result that middle-layer compression succeeds while initial-layer compression fails.
  3. [§5.3] §5.3: The interpretation that LTC functions as a regularizer producing length-invariant representations rests solely on improved transfer performance from short to long documents. No supporting measurements (e.g., length-score correlation or representation similarity across lengths) are provided, leaving the mechanistic claim under-supported relative to its prominence in the abstract.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'up to 25%' and 'up to 116%' should be accompanied by the specific compression ratios and layer indices that achieve these maxima so readers can assess the operating range.
  2. [Related Work] Related Work: A short paragraph contrasting why initial-layer pooling harms cross-encoder query-document interaction modeling (while succeeding for bi-encoders) would sharpen the motivation for LTC.
  3. [§3.2] Notation: The manuscript uses 'adaptive token pooling' without an explicit equation; adding a concise formal definition (e.g., in §3.2) would improve clarity for readers unfamiliar with the pooling variant.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback and the recommendation for minor revision. We appreciate the points raised regarding the robustness of our efficiency results, the need for greater methodological detail, and the support for our interpretation of the regularizer effect. We address each major comment below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [§4.2 and Table 2] §4.2 and Table 2: The reported QPS gains (25% for passage ranking, 116% for document ranking) are presented as point estimates without error bars, standard deviations across runs, or statistical significance tests. Because the central claim is that quality is preserved while efficiency improves, the lack of these controls makes it impossible to determine whether the gains exceed experimental noise.

    Authors: We agree that variance estimates and statistical tests would strengthen the central efficiency-quality trade-off claim. In the revised manuscript, we will rerun the key experiments across multiple random seeds (at least 3 runs per configuration), report standard deviations for both NDCG/MRR and QPS values, and include paired statistical significance tests (e.g., t-tests) comparing compressed and baseline models. This will allow readers to assess whether the reported gains exceed experimental variability. revision: yes

  2. Referee: [§3.1 and §3.2] §3.1 and §3.2: The precise definition of the adaptive pooling operation, the criterion used to choose which intermediate layers receive compression, and the exact compression ratios tested are not fully specified. These details are load-bearing for reproducing the reported result that middle-layer compression succeeds while initial-layer compression fails.

    Authors: We thank the referee for highlighting this reproducibility gap. In the revised version, we will expand Sections 3.1 and 3.2 with: (1) the full mathematical definition of the adaptive pooling, including the token importance scoring function and aggregation rule; (2) the explicit layer-selection criterion (derived from preliminary ablations showing early-layer degradation); and (3) the precise compression ratios (e.g., pooling factors of 2× or 4×) applied at each chosen layer. We will also include pseudocode for the LTC procedure to ensure full reproducibility. revision: yes

  3. Referee: [§5.3] §5.3: The interpretation that LTC functions as a regularizer producing length-invariant representations rests solely on improved transfer performance from short to long documents. No supporting measurements (e.g., length-score correlation or representation similarity across lengths) are provided, leaving the mechanistic claim under-supported relative to its prominence in the abstract.

    Authors: We acknowledge that the regularizer interpretation currently relies primarily on the transfer results. While these results provide suggestive evidence, we agree additional mechanistic support would be valuable. In the revision, we will either tone down the language in the abstract and §5.3 to emphasize the observational nature of the finding, or add a brief supporting analysis (e.g., length-relevance score correlations or representation similarity metrics across document lengths) if space allows. We view this as an interesting direction for future work rather than a fully substantiated mechanism. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper advances an empirical method for layer-wise token compression in cross-encoder rerankers, validated through ablation studies on MS MARCO passage and document ranking tasks. Central claims rest on observed performance preservation at intermediate layers (with QPS gains) and extension to LLM rerankers, without any mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the result to the paper's own inputs by construction. The findings are externally falsifiable via standard IR benchmarks and do not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Based solely on the abstract, the paper introduces LTC as a new technique without listing explicit free parameters or background axioms beyond standard transformer assumptions; the core addition is the invented method itself.

invented entities (1)
  • Layer-wise Token Compression (LTC) no independent evidence
    purpose: Apply adaptive token pooling at intermediate transformer layers to reduce sequence length for cross-encoder efficiency
    Presented as the central novel contribution that differs from initial-layer compression

pith-pipeline@v0.9.0 · 5767 in / 1302 out tokens · 28520 ms · 2026-05-22T09:15:21.506576+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 10 internal anchors

  1. [1]

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Compu...

  2. [2]

    Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs.CL] https://arxiv.org/abs/1611.09268

  3. [3]

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Fe- ichtenhofer, and Judy Hoffman. 2023. Token Merging: Your ViT But Faster. InThe Eleventh International Conference on Learning Representations. https: //openreview.net/forum?id=JroZRaRw7Eu

  4. [4]

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Kem- ing Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. 2025. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv:2406.02069 [cs.CL] https://arxiv.org/abs/2406.02069

  5. [5]

    Haodong Chen, Shengyao Zhuang, Zheng Yao, Guido Zuccon, and Teerapong Leelanupab. 2026. Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking. arXiv:2602.22591 [cs.IR] https://arxiv.org/ abs/2602.22591

  6. [6]

    Shijie Chen, Bernal Jimenez Gutierrez, and Yu Su. 2025. Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers. InThe Thirteenth Inter- national Conference on Learning Representations. https://openreview.net/forum? id=yzloNYH3QN

  7. [7]

    Zijian Chen, Ronak Pradeep, and Jimmy Lin. 2025. Accelerating Listwise Rerank- ing: Reproducing and Enhancing FIRST. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (Padua, Italy)(SIGIR ’25). Association for Computing Machinery, New York, NY, USA, 3165–3172. doi:10.1145/3726302.3730287

  8. [8]

    Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapt- ing Language Models to Compress Contexts. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 3829–3846. doi:10.18653/v1/2023.emnlp-main.232

  9. [9]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2021. Overview of the TREC 2020 deep learning track. arXiv:2102.07662 [cs.IR] https://arxiv.org/ abs/2102.07662

  10. [10]

    Voorhees

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. arXiv:2003.07820 [cs.IR] https://arxiv.org/abs/2003.07820

  11. [11]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems35 (2022), 30318–30332

  12. [12]

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. OPTQ: Accurate Quantization for Generative Pre-trained Transformers. InThe Eleventh International Conference on Learning Representations. https://openreview.net/ forum?id=tcbBPnfwxS

  13. [13]

    Revanth Gangi Reddy, JaeHyeok Doo, Yifei Xu, Md Arafat Sultan, Deevya Swain, Avirup Sil, and Heng Ji. 2024. FIRST: Faster Improved Listwise Reranking with Single Token Decoding. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational L...

  14. [14]

    Luyu Gao, Zhuyun Dai, and Jamie Callan. 2021. Rethink Training of BERT Rerankers in Multi-stage Retrieval Pipeline. InAdvances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 – April 1, 2021, Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg, 280–286. doi:10.1007/978-3-030-72240-1_26

  15. [15]

    Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Tevatron: An Efficient and Flexible Toolkit for Neural Retrieval. InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval(Taipei, Taiwan)(SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 3120–3124. doi:10.1145/3539618.3591805

  16. [16]

    Raje, Venkatesan T

    Saurabh Goyal, Anamitra Roy Choudhury, Saurabh M. Raje, Venkatesan T. Chakaravarthy, Yogish Sabharwal, and Ashish Verma. 2020. PoWER-BERT: accel- erating BERT inference via progressive word-vector elimination. InProceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 346, 10 pages

  17. [17]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [stat.ML] https://arxiv.org/abs/1503.02531

  18. [18]

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. 2024. Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems37 (2024), 1270–1303

  19. [19]

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. InInternational Conference on Learning Representations. https: //openreview.net/forum?id=nZeVKeeFYf9

  20. [20]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv:2310.068...

  21. [21]

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLM- Lingua: Compressing Prompts for Accelerated Inference of Large Language Mod- els. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 1...

  22. [22]

    Yann LeCun, John Denker, and Sara Solla. 1989. Optimal brain damage.Advances in neural information processing systems2 (1989)

  23. [23]

    Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. Compressing Context to Enhance Inference Efficiency of Large Language Models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 6342–6353. doi:10.18653/v1...

  24. [24]

    Zongqian Li, Yinhong Liu, Yixuan Su, and Nigel Collier. 2024. Prompt Com- pression for Large Language Models: A Survey. arXiv:2410.12388 [cs.CL] https://arxiv.org/abs/2410.12388

  25. [25]

    Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python Toolkit for Reproducible Infor- mation Retrieval Research with Sparse and Dense Representations. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval(Virtual Event, Canada)(SIG...

  26. [26]

    Zheng Liu, Chaofan Li, Shitao Xiao, Chaozhuo Li, Chen Jason Zhang, Hao Liao, Defu Lian, and Yingxia Shao. 2025. Fitting Into Any Shape: A Flexible LLM-Based Re-Ranker With Configurable Depth and Width. InProceedings of the ACM on Web Conference 2025(Sydney NSW, Australia)(WWW ’25). Association for Computing Machinery, New York, NY, USA, 3942–3951. doi:10....

  27. [27]

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmet- ric 2bit Quantization for KV Cache. InProceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235), Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, ...

  28. [28]

    Xueguang Ma, Luyu Gao, Shengyao Zhuang, Jiaqi Samantha Zhan, Jamie Callan, and Jimmy Lin. 2025. Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality. InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval(Padua, Italy)(SIGIR ’25). Association for Computing Machiner...

  29. [29]

    Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. 2023. Zero-Shot List- wise Document Reranking with a Large Language Model. arXiv:2305.02156 [cs.IR] https://arxiv.org/abs/2305.02156

  30. [30]

    Paul Michel, Omer Levy, and Graham Neubig. 2019. Are Sixteen Heads Really Better than One?. InAdvances in Neural Information Processing Systems, H. Wal- lach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/ 2019/file/2c601ad9d2ff9bc8b282670cdd5...

  31. [31]

    Jesse Mu, Xiang Li, and Noah Goodman. 2023. Learning to compress prompts with gist tokens.Advances in Neural Information Processing Systems36 (2023), 19327–19352

  32. [32]

    Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. CoRRabs/1901.04085 (2019). arXiv:1901.04085 http://arxiv.org/abs/1901.04085

  33. [33]

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023. RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models. arXiv:2309.15088 [cs.IR] https://arxiv.org/abs/2309.15088

  34. [34]

    Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023. RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! arXiv:2312.02724 [cs.IR] https://arxiv.org/abs/2312.02724

  35. [35]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs.CL] https://arxiv.org/abs/1910.01108

  36. [36]

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT Good at Search? In- vestigating Large Language Models as Re-Ranking Agents. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for ...

  37. [37]

    Haoran Wei, Yaofeng Sun, and Yukun Li. 2025. DeepSeek-OCR: Contexts Optical Compression.arXiv preprint arXiv:2510.18234(2025)

  38. [38]

    Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, and Ben- jamin Van Durme. 2025. Rank1: Test-Time Compute for Reranking in Information Retrieval. arXiv:2502.18418 [cs.IR] https://arxiv.org/abs/2502.18418

  39. [39]

    Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. 2024. Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning. InThe Twelfth International Conference on Learning Representations. https://openreview. net/forum?id=09iOdaeOzp

  40. [40]

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. InThe Twelfth Inter- national Conference on Learning Representations. https://openreview.net/forum? id=NG7sS51zVF

  41. [41]

    Zhichao Xu. 2026. RankMamba: Benchmarking Mamba’s Document Ranking Performance in the Era of Transformers. arXiv:2403.18276 [cs.IR] https://arxiv. org/abs/2403.18276

  42. [42]

    Zhichao Xu, Ashim Gupta, Tao Li, Oliver Bentham, and Vivek Srikumar. 2024. Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression. InFindings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Compu- tational Linguistics, Miami, Florida, USA, 15359–1539...

  43. [43]

    Zhichao Xu, Zhiqi Huang, Shengyao Zhuang, and Vivek Srikumar. 2025. Distilla- tion versus Contrastive Learning: How to Train Your Rerankers. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Kentaro Inui, Sakriani Sakt...

  44. [44]

    Zhichao Xu, Fengran Mo, Zhiqi Huang, Crystina Zhang, Puxuan Yu, Bei Wang, Jimmy Lin, and Vivek Srikumar. 2025. A Survey of Model Architectures in Information Retrieval. arXiv:2502.14822 [cs.IR] https://arxiv.org/abs/2502.14822

  45. [45]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  46. [46]

    Eugene Yang, Andrew Yates, Kathryn Ricci, Orion Weller, Vivek Chari, Ben- jamin Van Durme, and Dawn Lawrie. 2025. Rank-K: Test-Time Reasoning for Listwise Reranking. arXiv:2505.14432 [cs.IR] https://arxiv.org/abs/2505.14432

  47. [47]

    Andrew Yates, Rodrigo Nogueira, and Jimmy Lin. 2021. Pretrained Transformers for Text Ranking: BERT and Beyond. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval(Virtual Event, Canada)(SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 2666–2668. doi:10.1145/3404835.3462812

  48. [48]

    Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. In2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS). IEEE, 36–39. doi:10. 1109/emc2-nips53020.2019.00016

  49. [49]

    Dun Zhang, Ziyang Zeng, Yudong Zhou, and Shuyang Lu. 2025. Jasper-Token- Compression-600M Technical Report. arXiv:2511.14405 [cs.IR] https://arxiv.org/ abs/2511.14405

  50. [50]

    Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon

  51. [51]

    arXiv:2503.06034 [cs.IR] https://arxiv.org/abs/2503

    Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning. arXiv:2503.06034 [cs.IR] https://arxiv.org/abs/2503. 06034

  52. [52]

    Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, and Guido Zuccon. 2024. A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval(Washington DC, USA)(SIGIR ’24). Association for Computing Machinery, Ne...

  53. [53]

    Shengyao Zhuang and Guido Zuccon. 2021. Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion. arXiv:2108.08513 [cs.IR] https://arxiv.org/abs/2108.08513