Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation
Pith reviewed 2026-06-26 21:18 UTC · model grok-4.3
The pith
Aggregating independently encoded chunks into one document vector reduces evidence dilution and improves retrieval on long documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Document-side early compression weakens short but decisive spans when a long document is encoded into a single vector; DICE counters this by encoding chunks independently and aggregating them back into one vector, producing a representation that stays closer to the strongest chunk-level evidence and yields higher retrieval accuracy while preserving the conventional ranking interface.
What carries the argument
DICE, the training-free aggregation of independently encoded chunk vectors into a single document vector.
If this is right
- Retrieval metrics rise on long-document benchmarks, with the largest absolute gains on passages exceeding 4k tokens.
- The Evidence Dilution Index drops relative to the single-vector baseline in 92.8 percent of the 12,779 filtered samples.
- The same gains appear across four different frozen backbone models without any retraining.
- The standard one-query-one-document interface remains unchanged, so existing rankers require no modification.
Where Pith is reading between the lines
- The main bottleneck addressed is document encoding rather than query encoding or the ranking stage itself.
- Fixed aggregation rules might generalize to other single-vector tasks such as long-text classification or clustering.
- The method could be layered on top of models that already support longer contexts to reduce residual dilution.
Load-bearing premise
Independent chunk encodings can be aggregated in a fixed way that reliably preserves the strongest internal evidence without model-specific tuning or post-hoc selection.
What would settle it
A controlled test on documents whose gold label is determined by one short decisive chunk, showing that the aggregated vector ranks the gold document lower than the single-vector baseline on a majority of such cases.
Figures
read the original abstract
Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper diagnoses long-document dense retrieval failures as 'document-side early compression,' where short decisive spans are diluted in a single document vector. It introduces the Evidence Dilution Index (EDI) to quantify the gap between document-level and strongest chunk-level evidence, then proposes DICE: a training-free method that splits documents into chunks, encodes them independently with a frozen encoder, aggregates the chunk vectors into one document vector, and retains the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval on four backbones, with large gains on >4k-token slices (Dream Passkey >4k: 30.0→90.0; Needle >4k: 23.3→74.0) and lower EDI than the baseline in 92.8% of 12,779 filtered samples.
Significance. If the aggregation operator is a fixed, training-free function without implicit query- or gold-dependent selection, the work identifies a practical, interface-preserving lever for long-document retrieval that requires no retraining. The concrete numeric gains on long slices and the high EDI win rate provide falsifiable evidence that document encoding choices matter; the introduction of EDI as a diagnostic is a useful addition to the retrieval literature.
major comments (3)
- [Method / DICE] Method section (DICE description): no equation or pseudocode defines the aggregation operator that combines independently encoded chunk vectors into a single document vector. Without an explicit definition (e.g., mean, per-dimension max, or any form of selection/weighting), it is impossible to verify that the 92.8% EDI reduction and the >4k-token jumps arise from the claimed early-compression diagnosis rather than from the particular operator chosen.
- [Experiments / LongEmbed] Experimental results (LongEmbed tables): chunking parameters (length, stride, overlap) and the exact aggregation rule are not stated as fixed across the four backbones or as part of the training-free claim. Any dependence on these choices would affect the reported EDI win rate and the Passkey/Needle >4k improvements, yet no ablation isolates their contribution.
- [EDI definition] EDI definition and evaluation: the paper reports that DICE yields lower EDI than the single-vector baseline in 92.8% of cases, but does not provide the precise formula for EDI (how the 'strongest chunk-level evidence' is identified and compared) or confirm that this comparison is performed identically for both systems. This definition is load-bearing for the central diagnostic claim.
minor comments (2)
- [Abstract] Abstract and introduction should include a one-line mathematical sketch of the aggregation step so readers can immediately assess whether it is a standard pooling operator.
- [Experiments] The filtering criteria that reduce the test set to 12,779 samples are not described; this affects reproducibility of the 92.8% statistic.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript to improve clarity and reproducibility.
read point-by-point responses
-
Referee: [Method / DICE] Method section (DICE description): no equation or pseudocode defines the aggregation operator that combines independently encoded chunk vectors into a single document vector. Without an explicit definition (e.g., mean, per-dimension max, or any form of selection/weighting), it is impossible to verify that the 92.8% EDI reduction and the >4k-token jumps arise from the claimed early-compression diagnosis rather than from the particular operator chosen.
Authors: We agree that an explicit mathematical definition is required. The aggregation operator in DICE is mean pooling over the independently encoded chunk vectors. We will add both the equation and pseudocode to the Method section in the revision. revision: yes
-
Referee: [Experiments / LongEmbed] Experimental results (LongEmbed tables): chunking parameters (length, stride, overlap) and the exact aggregation rule are not stated as fixed across the four backbones or as part of the training-free claim. Any dependence on these choices would affect the reported EDI win rate and the Passkey/Needle >4k improvements, yet no ablation isolates their contribution.
Authors: We acknowledge the need for explicit statement of these hyperparameters and an ablation. All experiments used a fixed chunk length of 512 tokens with stride 256 (50% overlap) and mean aggregation, applied identically across backbones. We will document these choices in the Experiments section and add an ablation table on chunk size and aggregation variants. revision: yes
-
Referee: [EDI definition] EDI definition and evaluation: the paper reports that DICE yields lower EDI than the single-vector baseline in 92.8% of cases, but does not provide the precise formula for EDI (how the 'strongest chunk-level evidence' is identified and compared) or confirm that this comparison is performed identically for both systems. This definition is load-bearing for the central diagnostic claim.
Authors: We agree the precise EDI formula must be stated. EDI is computed as sim(q, d) - max_i sim(q, c_i) where c_i are the chunks of the gold document d; the max is taken over the identical chunk set for both the baseline and DICE. We will insert the formal definition and confirmation of identical evaluation procedure in the revised manuscript. revision: yes
Circularity Check
No circularity: empirical evaluation on external benchmarks
full rationale
The paper defines EDI as a diagnostic metric and proposes the training-free DICE aggregation method, then reports direct empirical comparisons (EDI lower in 92.8% of 12,779 samples; retrieval gains on LongEmbed slices) against single-vector baselines across four backbones. No equations, fitted parameters, or derivations are presented that reduce to self-definition or self-citation; all load-bearing claims rest on held-out benchmark measurements rather than internal construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Dense retrieval systems encode queries and documents into fixed-length vectors whose similarity determines ranking quality.
invented entities (1)
-
Evidence Dilution Index (EDI)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. https://api.semanticscholar.org/CorpusID:269009682 Llm2vec: Large language models are secretly powerful text encoders . ArXiv, abs/2404.05961
arXiv 2024
-
[2]
Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, and Nicolas Flores-Herr. 2025. https://api.semanticscholar.org/CorpusID:278960026 Rethinking chunk size for long-document retrieval: A multi-dataset analysis . ArXiv, abs/2505.21700
arXiv 2025
-
[3]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. https://api.semanticscholar.org/CorpusID:267413218 M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation . In Annual Meeting of the Association for Computational Linguistics
2024
-
[4]
Junying Chen, Qingcai Chen, Dongfang Li, and Yutao Huang. 2022. https://api.semanticscholar.org/CorpusID:253734670 Sedr: Segment representation learning for long documents dense retrieval . ArXiv, abs/2211.10841
arXiv 2022
-
[5]
Sedigheh Eslami, M. V. Gaiduk, Markus Krimmel, Louis Milliken, Bo Wang, and Denis A. Bykov. 2026. https://api.semanticscholar.org/CorpusID:285470145 Diffusion-pretrained dense and contextual embeddings . ArXiv, abs/2602.11151
arXiv 2026
-
[6]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...
Pith/arXiv arXiv 2024
-
[7]
Michael Gunther, Isabelle Mohr, Bo Wang, and Han Xiao. 2024. https://api.semanticscholar.org/CorpusID:272524899 Late chunking: Contextual chunk embeddings using long-context embedding models . ArXiv, abs/2409.04701
arXiv 2024
-
[8]
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. 2024. https://api.semanticscholar.org/CorpusID:269032933 Ruler: What's the real context size of your long-context language models? ArXiv, abs/2404.06654
Pith/arXiv arXiv 2024
-
[9]
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. https://api.semanticscholar.org/CorpusID:249097975 Unsupervised dense information retrieval with contrastive learning . Trans. Mach. Learn. Res., 2022
2021
-
[10]
Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L \'e lio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth \'e e Lacroix, and William El Sayed. 2023. https://api.sem...
Pith/arXiv arXiv 2023
-
[11]
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.550 Dense passage retrieval for open-domain question answering . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769--6781, Online. Ass...
-
[12]
Omar Khattab and Matei Zaharia. 2020. https://doi.org/10.1145/3397271.3401075 Colbert: Efficient and effective passage search via contextualized late interaction over bert . In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, page 39–48, New York, NY, USA. Association for Computing...
-
[13]
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/c4bf73386022473a652a18941e9ea6f8-Paper-Conference.pdf Nv-embed: Improved techniques for training llms as generalist embedding models . In International Conference on Learning Representations...
2025
-
[14]
u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt\
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\" u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt\" a schel, Sebastian Riedel, and Douwe Kiela. 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf Retrieval-augmented generation for knowledge-intens...
2020
-
[15]
Jiaqi Li, Mengmeng Wang, Zilong Zheng, and Muhan Zhang. 2024. https://doi.org/10.18653/v1/2024.acl-long.859 L oo GLE : Can long-context language models understand long contexts? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16304--16333, Bangkok, Thailand. Association for Computat...
-
[16]
Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. https://api.semanticscholar.org/CorpusID:260682258 Towards general text embeddings with multi-stage contrastive learning . ArXiv, abs/2308.03281
Pith/arXiv arXiv 2023
-
[17]
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. https://doi.org/10.1162/tacl_a_00638 Lost in the middle: How language models use long contexts . Transactions of the Association for Computational Linguistics, 12:157--173
-
[18]
Niklas Muennighoff, Hongjin SU, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/70cf215430492f7d34830a24e744b3f1-Paper-Conference.pdf Generative representational instruction tuning . In International Conference on Learning Representations, volume 2025, pages 45544--45613
2025
-
[19]
Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://doi.org/10.18653/v1/2023.eacl-main.148 MTEB : Massive text embedding benchmark . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics
-
[20]
Nils Reimers and Iryna Gurevych. 2019. https://doi.org/10.18653/v1/D19-1410 Sentence- BERT : Sentence embeddings using S iamese BERT -networks . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982--3992, Hong Kong, Chi...
-
[21]
Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. https://doi.org/10.18653/v1/2022.naacl-main.272 C ol BERT v2: Effective and efficient retrieval via lightweight late interaction . In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language ...
-
[22]
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. https://api.semanticscholar.org/CorpusID:254366618 Text embeddings by weakly-supervised contrastive pre-training . ArXiv, abs/2212.03533
Pith/arXiv arXiv 2022
-
[23]
Lawrie, and Luca Soldaini
Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn J. Lawrie, and Luca Soldaini. 2024. https://api.semanticscholar.org/CorpusID:268667440 Followir: Evaluating and teaching information retrieval models to follow instructions . In North American Chapter of the Association for Computational Linguistics
2024
-
[24]
Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian yun Nie. 2023. https://api.semanticscholar.org/CorpusID:271114619 C-pack: Packed resources for general chinese embeddings . Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
2023
-
[25]
Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, and 25 others. 2024. https://api.semanticscholar.org/CorpusID:274859421 Qwen2.5 technical report . ArXiv, abs/2412.15115
Pith/arXiv arXiv 2024
-
[26]
Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. 2025. https://api.semanticscholar.org/CorpusID:280700361 Dream 7b: Diffusion large language models . ArXiv, abs/2508.15487
Pith/arXiv arXiv 2025
-
[27]
Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, and Chen Zhao. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.213 Diffusion vs. autoregressive language models: A text embedding perspective . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4273--4303, Suzhou, China. Association for Comput...
-
[28]
Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.47 L ong E mbed: Extending embedding models for long context retrieval . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 802--816, Miami, Florida, USA. Association for Computati...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.