pith. machine review for the scientific record.

arxiv: 2604.27037 · v1 · submitted 2026-04-29 · 💻 cs.IR · cs.CL

Recognition: unknown

Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 11:12 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords Hypencoder · reproducibility · bi-encoder · neural retrieval · first-stage retrieval · query latency · adversarial robustness · hypernetwork

The pith

Reproducing the Hypencoder confirms its non-linear q-net scorer beats standard bi-encoders on retrieval benchmarks while an efficient search algorithm cuts latency with little accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reproduces the Hypencoder retrieval framework, which replaces fixed inner-product scoring with a query-specific neural network generated by a hypernetwork. It verifies that this design yields better results than a comparable bi-encoder on in-domain and out-of-domain tasks. The work also validates an efficient search procedure that lowers query latency substantially while preserving most of the performance. Results on harder benchmarks are mixed, partly because of checkpoint and fine-tuning differences. Additional tests examine alternative encoders, direct latency comparisons with Faiss, and resistance to adversarial attacks.

Core claim

The Hypencoder, which uses a hypernetwork to generate the weights of a query-specific neural scoring network, reproducibly outperforms a similarly trained bi-encoder baseline on in-domain and out-of-domain benchmarks, and its proposed efficient search algorithm reduces query latency with only minimal performance degradation. On hard tasks the advantage holds for DL-Hard and FollowIR but not TREC TOT, where checkpoint incompatibility and fine-tuning sensitivity prevent full verification. Performance gains when swapping in pre-trained encoders depend on the encoder and fine-tuning choices; standard Faiss-based bi-encoder retrieval remains faster in both exhaustive and approximate settings; and the non-linear q-net scoring shows no consistent robustness disadvantage relative to inner-product scoring under adversarial evaluation.

What carries the argument

The q-net: a query-specific neural network for relevance scoring, whose weights are produced by a hypernetwork from contextualized query embeddings. This enables expressive non-linear scoring while keeping query and document encodings independent.
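The weight-generation idea can be sketched in a few lines of numpy. Everything here is illustrative: the dimensions, the fixed linear generator maps, and the two-layer ReLU q-net are assumptions for exposition, not the paper's actual architecture or sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32   # embedding dimension (illustrative, not the paper's setting)
H = 16   # q-net hidden width (illustrative)

# "Hypernetwork": fixed linear maps that turn a query embedding into
# the parameters of a small query-specific MLP (the q-net).
W1_gen = rng.normal(0, 0.1, size=(D, D * H))  # generates layer-1 weights
b1_gen = rng.normal(0, 0.1, size=(D, H))      # generates layer-1 bias
w2_gen = rng.normal(0, 0.1, size=(D, H))      # generates output weights

def make_qnet(q_emb):
    """Generate q-net parameters from one query embedding."""
    W1 = (q_emb @ W1_gen).reshape(D, H)
    b1 = q_emb @ b1_gen
    w2 = q_emb @ w2_gen
    return W1, b1, w2

def qnet_score(q_emb, doc_embs):
    """Non-linear relevance scores; document encodings stay query-independent."""
    W1, b1, w2 = make_qnet(q_emb)
    hidden = np.maximum(doc_embs @ W1 + b1, 0.0)  # ReLU layer
    return hidden @ w2                            # one score per document

q = rng.normal(size=D)
docs = rng.normal(size=(100, D))   # precomputed, query-independent doc embeddings
scores = qnet_score(q, docs)
top5 = np.argsort(-scores)[:5]     # ranked retrieval with non-linear scoring
```

The point of the design is visible in the code: `docs` is encoded once and reused across queries, while the scoring function itself changes per query.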

If this is right

  • Hypencoder performance gains when integrating alternative pre-trained encoders depend on the specific encoder and the fine-tuning strategy used.
  • Standard bi-encoder retrieval with Faiss indexing remains faster than the Hypencoder under both exhaustive and efficient search conditions.
  • The q-net's non-linear scoring does not produce a consistent robustness disadvantage relative to inner-product scoring under adversarial evaluation.
  • Partial support on hard tasks indicates that checkpoint compatibility and fine-tuning sensitivity affect whether the Hypencoder advantage appears on every difficult benchmark.
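The bi-encoder side of the Faiss comparison in the second bullet is easy to picture: exhaustive inner-product retrieval is one matrix-vector product plus a top-k selection, which is what an exhaustive Faiss flat inner-product index computes. A minimal numpy stand-in, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, K = 32, 10_000, 10   # dim, corpus size, result depth (illustrative)

doc_embs = rng.normal(size=(N, D)).astype(np.float32)
query = rng.normal(size=D).astype(np.float32)

# Exhaustive inner-product search: one GEMV plus a partial top-k select.
scores = doc_embs @ query
topk = np.argpartition(-scores, K)[:K]       # unordered top-K candidates
topk = topk[np.argsort(-scores[topk])]       # sort just those K by score
```

A q-net scorer must instead run a small MLP forward pass per document; that per-document cost gap is what motivates the Hypencoder's graph-based efficient search, and why a plain bi-encoder with Faiss can still be faster.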

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed sensitivity to checkpoints suggests that future neural retrieval papers should release exact training scripts and final model weights to enable tighter reproductions.
  • If further latency optimizations close the gap with Faiss-based bi-encoders, the q-net approach could become practical for production first-stage retrieval where accuracy matters more than raw speed.
  • The lack of a consistent adversarial robustness penalty opens the possibility that non-linear scoring can be added to other retrieval architectures without introducing new attack surfaces.
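The robustness framing in the last bullet can be probed with a toy experiment: perturb the query representation and measure how much the top-k result list changes. Gaussian noise is a crude stand-in for a real adversarial attack, and all sizes here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, K = 32, 1_000, 10   # dim, corpus size, result depth (illustrative)

docs = rng.normal(size=(N, D))
q = rng.normal(size=D)

def topk_ids(scores, k=K):
    """Indices of the k highest-scoring documents, as a set."""
    idx = np.argpartition(-scores, k)[:k]
    return set(idx.tolist())

clean = topk_ids(docs @ q)

# Perturb the query embedding; a real attack would optimize this direction.
q_adv = q + 0.1 * rng.normal(size=D)
attacked = topk_ids(docs @ q_adv)

overlap = len(clean & attacked) / K   # 1.0 = ranking fully preserved
drop = 1.0 - overlap                  # crude relative-degradation proxy
```

Running the same probe with an inner-product scorer and a non-linear scorer, as the paper does with stronger attacks, is what supports the "no consistent robustness disadvantage" finding.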

Load-bearing premise

The reproduction setup, including model checkpoints, training data order, and fine-tuning hyperparameters, matches the original Hypencoder implementation closely enough for direct performance comparison.

What would settle it

Re-training the Hypencoder from the original starting checkpoints and showing whether it exceeds the bi-encoder baseline by a clear margin on the same in-domain and out-of-domain benchmarks.

Figures

Figures reproduced from arXiv: 2604.27037 by Arne Eichholtz, Jutte Vijverberg, Mohammad Aliannejadi, Tobias Groot, Yongkang Li.

Figure 1. Comparison of retrieval and reranking paradigms; standard bi-encoders are limited by simple vector similarity.
Figure 2. Average per-query latency (ms) vs. corpus size.
Figure 3. Neighbor graph construction time (seconds) vs. corpus size.
Figure 4. Relative performance drop (%) under adversarial attack.
Original abstract

The Hypencoder, proposed by Killingback et al., is a retrieval framework that replaces the fixed inner-product scoring function used in standard bi-encoders with a query-specific neural network (the q-net), whose weights are generated by a hypernetwork from the contextualized query embeddings. This design enables more expressive relevance estimation while preserving independent query and document encoding. In this work, we conduct a reproducibility study of the Hypencoder and extend the original analysis in three directions. Our reproduction confirms that the Hypencoder outperforms a similarly trained bi-encoder baseline on in-domain and out-of-domain benchmarks, and that the proposed efficient search algorithm substantially reduces query latency with minimal performance loss. On hard retrieval tasks, we find partial support: the Hypencoder outperforms the baseline on DL-Hard and FollowIR, but not on TREC TOT, where checkpoint incompatibility and fine-tuning sensitivity complicate full verification. Beyond reproduction, we investigate three extensions: (i) integrating alternative pre-trained encoders into the Hypencoder framework, where we find that performance gains depend on the encoder and fine-tuning strategy; (ii) comparing query latency against a Faiss-based bi-encoder pipeline, revealing that standard bi-encoder retrieval remains faster under both exhaustive and efficient search settings; and (iii) evaluating adversarial robustness, where we find that the q-net's non-linear scoring does not provide a consistent robustness disadvantage over inner-product scoring. Our code is publicly available at https://github.com/arneeichholtz/Hypencoder-reprod.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a reproducibility study of the Hypencoder framework, which replaces inner-product scoring in bi-encoders with a query-specific neural network (q-net) whose weights are generated by a hypernetwork. The authors confirm that the reproduced Hypencoder outperforms a similarly trained bi-encoder baseline on in-domain and out-of-domain benchmarks (with partial support on hard tasks) and that the proposed efficient search algorithm reduces query latency with minimal performance loss. They extend the work by testing alternative encoders, comparing latency to Faiss-based pipelines (where standard bi-encoders remain faster), and evaluating adversarial robustness (no consistent disadvantage found). Public code is released at the provided GitHub link.

Significance. This work is significant for providing independent verification and extensions to the original Hypencoder claims. The public code, benchmark results, and direct empirical comparisons (with no circular derivations) are strengths that support community reuse. If the out-of-domain generalization holds under matched conditions, the findings indicate that non-linear q-net scoring can yield measurable gains over standard bi-encoders in first-stage retrieval.

major comments (2)
  1. [Hard retrieval tasks / out-of-domain benchmarks] Hard tasks results (DL-Hard, FollowIR, TREC TOT): checkpoint incompatibility on TREC TOT prevents matched comparison, which is load-bearing for the out-of-domain generalization claim in the abstract and results section. The paper should detail the exact mismatches in training data order, initialization, or hyperparameters and, if possible, provide an aligned run or sensitivity analysis to strengthen attribution of gains to the q-net architecture rather than setup differences.
  2. [Latency analysis / extension (ii)] Latency comparison to Faiss-based bi-encoder: the finding that standard bi-encoder retrieval remains faster under both exhaustive and efficient search settings qualifies the efficiency claims for the proposed Hypencoder search algorithm. This should be more explicitly framed in the discussion of latency reductions to avoid overstating practical advantages.
minor comments (2)
  1. Tables reporting benchmark results should explicitly distinguish reproduced numbers from original paper values and note any fine-tuning differences.
  2. Clarify the exact pre-trained encoder variants and fine-tuning strategies tested in extension (i) to make the dependence on encoder choice easier to interpret.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the positive assessment and constructive feedback. We address the major comments point by point below.

Point-by-point responses
  1. Referee: [Hard retrieval tasks / out-of-domain benchmarks] Hard tasks results (DL-Hard, FollowIR, TREC TOT): checkpoint incompatibility on TREC TOT prevents matched comparison, which is load-bearing for the out-of-domain generalization claim in the abstract and results section. The paper should detail the exact mismatches in training data order, initialization, or hyperparameters and, if possible, provide an aligned run or sensitivity analysis to strengthen attribution of gains to the q-net architecture rather than setup differences.

    Authors: We thank the referee for this observation. The manuscript already qualifies the TREC TOT results due to checkpoint incompatibility in the abstract and results section. In the revision, we have added a new paragraph in Section 4.3 explicitly detailing the mismatches in training data order, initialization seeds, and hyperparameter settings between our reproduction and the original Hypencoder checkpoints. We have also included a sensitivity analysis on the compatible DL-Hard and FollowIR runs to isolate the contribution of the q-net. However, the fundamental incompatibility of the TREC TOT checkpoints prevents an aligned run, so we have further emphasized the partial nature of the hard-task support and clarified that the primary out-of-domain claims rest on the matched benchmarks. revision: partial

  2. Referee: [Latency analysis / extension (ii)] Latency comparison to Faiss-based bi-encoder: the finding that standard bi-encoder retrieval remains faster under both exhaustive and efficient search settings qualifies the efficiency claims for the proposed Hypencoder search algorithm. This should be more explicitly framed in the discussion of latency reductions to avoid overstating practical advantages.

    Authors: We agree that the comparison should be framed more explicitly. In the revised discussion (Section 5.2), we now state upfront that although the proposed efficient search algorithm reduces Hypencoder query latency with only minimal performance loss, standard bi-encoder retrieval with Faiss remains faster under both exhaustive and approximate settings. This qualification is presented as a direct limitation on the practical efficiency gains of the Hypencoder approach. revision: yes

standing simulated objections (unresolved)
  • Providing a fully aligned run on TREC TOT due to checkpoint incompatibility

Circularity Check

0 steps flagged

No circularity: empirical reproducibility study with no derivations or fitted predictions

full rationale

The paper conducts a reproducibility study of the existing Hypencoder model, performing direct empirical comparisons against baselines on public benchmarks (in-domain and out-of-domain). It reports performance metrics, latency measurements, and robustness evaluations without any mathematical derivations, first-principles predictions, or parameter-fitting steps that could reduce to self-definition or self-citation. Claims rest on experimental results and code release; the noted checkpoint incompatibility on TREC TOT is a transparency issue about reproduction fidelity, not a circular reduction in any derivation chain. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical reproducibility study. It relies on standard IR evaluation practices such as benchmark dataset validity and nDCG/MRR metrics but introduces no free parameters, new axioms, or invented entities.
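The ledger mentions nDCG among the standard IR metrics the paper relies on. For reference, a minimal self-contained nDCG@k implementation (the standard formula, not code from the paper):

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0

# Relevance grades of documents in the order a system returned them
# (hypothetical grades, purely for illustration).
score = ndcg_at_k([3, 2, 3, 0, 1, 2], 6)
```

An ideally ordered list scores exactly 1.0; any misordering pushes the value below 1.0, which is what makes the metric comparable across queries.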

pith-pipeline@v0.9.0 · 5593 in / 1239 out tokens · 82424 ms · 2026-05-07T11:12:27.513515+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

73 extracted references · 45 canonical work pages · 4 internal anchors

  1. [1]

    Jaime Arguello, Samarth Bhargav, Fernando Diaz, Evangelos Kanoulas, and Bhaskar Mitra. 2023. Overview of the TREC 2023 Tip-of-the-Tongue Track. In The Thirty-Second Text REtrieval Conference Proceedings (TREC 2023), Gaithersburg, MD, USA, November 14-17.

  2. [2]

    Samarth Bhargav, Georgios Sidiropoulos, and Evangelos Kanoulas. 2022. ’It’s on the tip of my tongue’: A new Dataset for Known-Item Retrieval. In WSDM ’22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21-25, 2022, K. Selcuk Candan, Huan Liu, Leman Akoglu, Xin Luna Dong, and Jiliang Tang...

  3. [3]

    Alexander Bondarenko, Maik Fröbe, Meriem Beloucif, Lukas Gienapp, Yamen Ajjour, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, Martin Potthast, et al. 2020. Overview of Touché 2020: argument retrieval. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, 384-395.

  4. [4]

    Vera Boteva, Demian Gholipour Ghalandari, Artem Sokolov, and Stefan Riezler. 2016. A Full-Text Learning to Rank Dataset for Medical Information Retrieval. In Advances in Information Retrieval - 38th European Conference on IR Research, ECIR 2016, Padua, Italy, March 20-23, 2016, Proceedings (Lecture Notes in Computer Science, Vol. 9626). Springer, 716-722. doi:10.1007/978-3-319-30671-1_58

  6. [6]

    Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. 2018. SMASH: One-Shot Model Architecture Search through HyperNetworks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=rydeCEhs-

  7. [7]

    Vinod Kumar Chauhan, Jiandong Zhou, Ping Lu, Soheila Molaei, and David A. Clifton. 2024. A brief review of hypernetworks in deep learning. Artif. Intell. Rev. 57, 9 (2024), 250. doi:10.1007/S10462-024-10862-8

  8. [8]

    Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Association for Computational Linguistics, 2270-2282. doi:10.18653/V...

  10. [10]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2021. Overview of the TREC 2020 deep learning track. arXiv:2102.07662 [cs.IR] https://arxiv.org/abs/2102.07662

  11. [11]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. CoRR abs/2003.07820 (2020). arXiv:2003.07820 https://arxiv.org/abs/2003.07820

  12. [12]

    Thomas Diggelmann, Jordan L. Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. 2020. CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims. CoRR abs/2012.00614 (2020). arXiv:2012.00614 https://arxiv.org/abs/2012.00614

  13. [13]

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. (2024). arXiv:2401.08281 [cs.LG]

  14. [14]

    Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, Fernando Diaz, Chirag Shah, Torsten Suel, Pablo Castells, Rosie Jones... doi:10.1145/3404835.3463098

  16. [16]

    Jiafeng Guo, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W. Bruce Croft, and Xueqi Cheng. 2020. A deep look into neural ranking models for information retrieval. Information Processing & Management 57, 6 (2020), 102067.

  17. [17]

    David Ha, Andrew M. Dai, and Quoc V. Le. 2017. HyperNetworks. In International Conference on Learning Representations. https://openreview.net/forum?id=rkpACe1lx

  18. [18]

    Tim Hagen, Harrisen Scells, and Martin Potthast. 2024. Revisiting Query Variation Robustness of Transformer Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 4283-4296. doi:10.18653/v1/2024.findings-emnlp.248

  19. [19]

    Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. 2017. DBpedia-Entity v2: A Test Collection for Entity Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017. ACM, 1265-1268. ...

  20. [20]

    Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 113-122.

  21. [21]

    Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. First Quora Dataset Release: Question Pairs. https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

  22. [22]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022. Unsupervised Dense Information Retrieval with Contrastive Learning. Trans. Mach. Learn. Res. 2022 (2022). https://openreview.net/forum?id=jKN1pXi7b0

  23. [23]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535-547.

  24. [24]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Ya...

  25. [25]

    Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, Jimmy X. Huang, Yi Chang, Xueqi Cheng, Jaap Kamps, Vaness...

  26. [26]

    Julian Killingback, Hansi Zeng, and Hamed Zamani. 2025. Hypencoder: Hypernetworks for Information Retrieval. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025, Padua, Italy, July 13-18, 2025, Nicola Ferro, Maria Maistro, Gabriella Pasi, Omar Alonso, Andrew Trotman, and Suzan Ver...

  27. [27]

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a Benchmark for Question Answering Resear... https://aclanthology.org/Q19-1026/

  28. [28]

    Yongkang Li. 2026. Understanding and Enhancing Robustness in Dense Information Retrieval. In Advances in Information Retrieval - 48th European Conference on Information Retrieval, ECIR 2026, Delft, The Netherlands, March 29 - April 2, 2026, Proceedings, Part III (Lecture Notes in Computer Science). Springer, 599-607. doi:10.1007/978-3-032-21324-2_51

  29. [29]

    Yongkang Li, Panagiotis Eustratiadis, and Evangelos Kanoulas. 2025. Reproducing HotFlip for Corpus Poisoning Attacks in Dense Retrieval. In Advances in Information Retrieval - 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part IV (Lecture Notes in Computer Science, Vol. 15575). Springer, 95-111...

  30. [30]

    Yongkang Li, Panagiotis Eustratiadis, Simon Lupart, and Evangelos Kanoulas. 2025. Unsupervised Corpus Poisoning Attacks in Continuous Space for Dense Retrieval. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025, Padua, Italy, July 13-18, 2025, Nicola Ferro, Maria Maistro, Gabriella Pasi, Omar Alonso, Andrew Trotman, and Suzan Verberne (Eds.). ACM, 2452-2462. ...

  32. [32]

    Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2021. Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2010.06467 [cs.IR] https://arxiv.org/abs/2010.06467

  33. [33]

    Sheng-Chieh Lin, Akari Asai, Minghan Li, Barlas Oguz, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, and Xilun Chen. 2023. How to Train Your Dragon: Diverse Augmentation Towards Generalizable Dense Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.)....

  34. [34]

    Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021. In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP, RepL4NLP@ACL-IJCNLP 2021, Online, August 6, 2021, Anna Rogers, Iacer Calixto, Ivan Vulic, Naomi Saphra, Nora Kassner, Oana-Maria Ca...

  35. [35]

    Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2024. Fine-Tuning LLaMA for Multi-Stage Text Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, Grace Hui Yang, Hongning Wang, Sam Han, Claudia Hauff, Guido Zuccon, and ...

  36. [36]

    Iain Mackie, Jeffrey Dalton, and Andrew Yates. 2021. How Deep is your Learning: the DL-HARD Annotated Deep Learning Dataset. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, Fernando Diaz, Chirag Shah, Torsten Suel, Pablo Castells, Rosie Jones, and T...

  37. [37]

    Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW’18 open challenge: financial opinion mining and question answering. In Companion Proceedings of the The Web Conference 2018. 1941-1942.

  38. [38]

    John X. Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander M. Rush. 2023. Text Embeddings Reveal (Almost) As Much As Text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, 12448-12460. doi:10.18653/V1/2023.EMNLP-MAIN.765

  40. [40]

    Aviv Navon, Aviv Shamsian, Ethan Fetaya, and Gal Chechik. 2021. Learning the Pareto Front with Hypernetworks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=NjF772F4ZZR

  41. [41]

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human-generated machine reading comprehension dataset. (2016).

  42. [42]

    Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv:1901.04085 [cs.IR] https://arxiv.org/abs/1901.04085

  43. [43]

    Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020), Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Lingu...

  44. [44]

    Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-Stage Document Ranking with BERT. CoRR abs/1910.14424 (2019). arXiv:1910.14424 http://arxiv.org/abs/1910.14424

  45. [45]

    Gustavo Penha, Arthur Câmara, and Claudia Hauff. 2022. Evaluating the Robustness of Retrieval Pipelines with Query Variation Generators. In Advances in Information Retrieval - 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 13185), Matthias Hagen, Suzan...

  46. [46]

    Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, Lucy Lu Wang, and William R. Hersh. 2021. Searching for scientific evidence in a pandemic: An overview of TREC-COVID. Journal of Biomedical Informatics 121 (2021), 103865.

  47. [47]

    Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994 (NIST Special Publication, Vol. 500-225), Donna K. Harman (Ed.). National Institute of Standards and Technology (NIST), 109-12...

  48. [48]

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. 2022. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States...

  49. [49]

    Dominykas Seputis, Yongkang Li, Karsten Langerak, and Serghei Mihailov. 2025. Rethinking the Privacy of Text Embeddings: A Reproducibility Study of "Text Embeddings Reveal (Almost) As Much As Text". In Proceedings of the Nineteenth ACM Conference on Recommender Systems, RecSys 2025, Prague, Czech Republic, September 22-26, 2025, Mária Bieliková, Pavel Kord...

  50. [50]

    Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. 2021. Personalized Federated Learning using Hypernetworks. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 9489-9502. http://proceeding...

  51. [51]

    Jinyan Su, Preslav Nakov, and Claire Cardie. 2025. Corpus Poisoning via Approximate Greedy Gradient Descent. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, 427...

  52. [52]

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino...

  53. [53]

    Panuthep Tasawong, Wuttikorn Ponwitayarat, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Ekapol Chuangsuwanich, and Sarana Nutanong. 2023. Typo-Robust Representation Learning for Dense Retrieval. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki ...

  54. [54]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models.arXiv preprint arXiv:2104.08663(2021)

  55. [55]

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal

  56. [56]

    FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), Mari- lyn A. Walker, Heng Ji, and Amanda Stent (Eds.). A...

  57. [57]

    Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F. Grewe. 2020. Continual learning with hypernetworks. In8th International Con- ference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30,

  58. [58]

    https://openreview.net/forum?id=SJgwNerKvB

    OpenReview.net. https://openreview.net/forum?id=SJgwNerKvB

  59. [59]

    Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. Retrieval of the Best Counterargument without Prior Topic Knowledge. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computationa...

  60. [60]

    David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or Fiction: Verifying Scientific Claims. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). As...

  61. [61]

    Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. InProceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011, Wei-Ying Ma, Jian-Yun Nie, Ricardo Baeza-Yates, Tat-Seng Chua, and W. Bruce Croft (Eds.). AC...

  62. [62]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2023. SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Ann...

  63. [63]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Improving Text Embeddings with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds...

  64. [64]

    Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, and Luca Soldaini. 2024. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. arXiv:2403.15246 [cs.IR] https://arxiv.org/abs/2403.15246

  65. [65]

    Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn J. Lawrie, and Luca Soldaini. 2025. FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Langu...

  66. [66]

    Orion Weller, Benjamin Van Durme, Dawn J. Lawrie, Ashwin Paranjape, Yuhao Zhang, and Jack Hessel. 2025. Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net. https://openreview.net/forum?id=odvSjn416y

  67. [67]

    Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2022. RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder. arXiv:2205.12035 [cs.CL] https://arxiv.org/abs/2205.12035

  68. [68]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/foru...

  69. [69]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, D...

  70. [70]

    Chris Zhang, Mengye Ren, and Raquel Urtasun. 2019. Graph HyperNetworks for Neural Architecture Search. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=rkgW0oA9FX

  71. [71]

    Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. 2023. Poisoning Retrieval Corpora by Injecting Adversarial Passages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistic...

  72. [72]

    Shengyao Zhuang and Guido Zuccon. 2021. Dealing with Typos for BERT-based Passage Retrieval and Ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.)...

  73. [73]

    Shengyao Zhuang and Guido Zuccon. 2022. CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos (SIGIR '22). Association for Computing Machinery, New York, NY, USA, 1444–1454. doi:10.1145/3477495.3531951