pith. sign in

arxiv: 2407.04573 · v4 · pith:RYVYPHH3new · submitted 2024-07-05 · 💻 cs.IR · cs.CL

Vector Retrieval with Similarity and Diversity: How Hard Is It?

Pith reviewed 2026-05-25 08:40 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords vector retrievalsimilarity and diversityNP-completeheuristic algorithmmaximal marginal relevancedeterminantal point processesdense vector search
0
0 comments X

The pith

Maximizing the dot product between a query vector and the sum of selected vectors defines an NP-complete retrieval problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines Vector Retrieval with Similarity and Diversity as the task of selecting a fixed-size set of vectors that maximizes the dot product of the query with their vector sum. This single objective is shown to encode both individual similarity and collective diversity. The authors prove that solving this optimization exactly is NP-complete. They then introduce a parameter-free heuristic to find good solutions in practice. Experiments across datasets indicate the heuristic improves on Maximal Marginal Relevance and determinantal point processes using both geometric scores and LLM-based judgments.

Core claim

The paper establishes that the VRSD problem—maximizing the dot-product similarity between a query vector and the sum of a selected set of candidate vectors—is NP-complete. It introduces a parameter-free heuristic to solve it and shows through evaluations that the heuristic outperforms baselines such as MMR and k-DPP on multiple datasets using both objective metrics and subjective LLM assessments.

What carries the argument

The VRSD optimization problem, which encodes both similarity and diversity by maximizing the query's dot product with the sum of selected vectors.

If this is right

  • Exact solutions for balanced similarity-diversity retrieval cannot be computed efficiently in the worst case.
  • Practical systems must rely on heuristics or approximations for this objective.
  • A parameter-free method removes the need for manual tuning that affects methods like MMR.
  • The heuristic delivers measurable gains over MMR and k-DPP on standard retrieval benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The sum-based objective may relate to existing set-selection problems studied in submodular optimization.
  • Approximation algorithms with provable ratios could be developed specifically for this formulation.
  • The approach might transfer to other embedding spaces where collective coverage matters, such as in recommendation or summarization tasks.

Load-bearing premise

That the dot product between the query and the sum of selected vectors sufficiently captures both individual similarity to the query and diversity among the selected items.

What would settle it

Discovery of a polynomial-time exact algorithm for the VRSD problem on general instances, or repeated cases where the proposed heuristic underperforms a well-tuned MMR baseline on held-out data.

Figures

Figures reproduced from arXiv: 2407.04573 by Dong Deng, Hang Gao, Yongfeng Zhang.

Figure 1
Figure 1. Figure 1: An analysis of the Maximal Marginal Relevance. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
read the original abstract

Dense vector retrieval is an important building block of modern machine learning systems, underlying applications ranging from semantic search to retrieval-augmented generation and knowledge-intensive reasoning. Beyond retrieving items that are individually similar to a query, many applications require a set of results that is also diverse, complementary, and collectively informative. Balancing similarity and diversity is therefore central to effective retrieval, but remains challenging to optimize in a stable and theoretically grounded way. Maximal Marginal Relevance (MMR) is a widely adopted heuristic for this problem, yet its reliance on a manually tuned parameter leads to optimization fluctuations and unpredictable retrieval results. More broadly, existing methods provide limited theoretical insight into how similarity and diversity interact in dense vector spaces, leaving the joint optimization problem insufficiently understood. To address these challenges, this paper introduces a novel approach that characterizes both constraints simultaneously by maximizing the similarity between the query vector and the sum of the selected candidate vectors. We formally define this optimization problem, Vector Retrieval with Similarity and Diversity (VRSD), and prove that it is NP-complete, establishing a rigorous theoretical bound on the inherent difficulty of this dual-objective retrieval. Subsequently, we present a parameter-free heuristic algorithm to solve VRSD. Extensive evaluations on multiple datasets, incorporating both objective geometric metrics and LLM-simulated subjective assessments, demonstrate that our VRSD heuristic consistently outperforms established baselines, including MMR and Determinantal Point Processes (k-DPP).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper defines Vector Retrieval with Similarity and Diversity (VRSD) as the problem of selecting a cardinality-k set S of vectors to maximize the dot product between a query q and the sum of the vectors in S. It claims to prove that VRSD is NP-complete and proposes a parameter-free heuristic for solving it. The heuristic is evaluated on multiple datasets against baselines including MMR and k-DPP, using both geometric metrics and LLM-simulated subjective assessments, with claims of consistent outperformance.

Significance. If the NP-completeness result held, the work would supply a formal complexity characterization of a dual-objective retrieval task and a tuning-free algorithmic alternative to MMR, both of which would be useful contributions to the information-retrieval literature. The parameter-free property of the heuristic and the use of LLM-based evaluation would be additional strengths.

major comments (1)
  1. [Abstract and §3] Abstract and §3 (VRSD definition): the objective is stated as maximizing dot(q, sum_{v in S} v) subject to |S|=k. By linearity of the inner product this equals sum_{v in S} dot(q, v), whose optimum is obtained by sorting all candidates by their individual dot products with q and taking the top k. The problem is therefore solvable in O(n log n) time and lies in P, rendering any NP-completeness proof impossible for the stated formulation. The same linearity also shows that the objective contains no cross terms or penalties that could enforce diversity.
minor comments (1)
  1. [Experiments] The experimental section would benefit from explicit pseudocode or a reference implementation link for the proposed heuristic to support reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and for identifying this critical issue in our formulation. We acknowledge the validity of the observation and will revise the manuscript to correct the definition of VRSD, the complexity analysis, and related claims.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (VRSD definition): the objective is stated as maximizing dot(q, sum_{v in S} v) subject to |S|=k. By linearity of the inner product this equals sum_{v in S} dot(q, v), whose optimum is obtained by sorting all candidates by their individual dot products with q and taking the top k. The problem is therefore solvable in O(n log n) time and lies in P, rendering any NP-completeness proof impossible for the stated formulation. The same linearity also shows that the objective contains no cross terms or penalties that could enforce diversity.

    Authors: We agree with the referee's analysis. By linearity of the inner product, the stated objective reduces exactly to selecting the k vectors with the largest individual dot products with q. This is solvable in O(n log n) time via sorting and contains no mechanism for diversity. Consequently, the NP-completeness claim cannot hold for the formulation as written in the abstract and §3. In the revised manuscript we will (1) replace the current VRSD objective with one that explicitly incorporates a diversity term (e.g., subtracting a pairwise similarity penalty among selected vectors), (2) re-establish the complexity result under the corrected definition, (3) update the abstract, §3, the heuristic description, and all experimental claims, and (4) re-run the evaluations to ensure the new formulation is meaningfully distinct from simple top-k retrieval. We view this as a necessary correction rather than a defense of the original text. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper defines VRSD via a new objective (maximizing dot-product similarity between query and vector sum of selected set) and claims an independent NP-completeness proof for that formulation. No steps reduce by construction to fitted parameters, self-citations, or prior ansatzes from the authors; the objective and complexity claim are presented as self-contained definitions without load-bearing reductions to inputs. This is the normal case of an independent theoretical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced beyond standard computational-complexity assumptions for the NP-completeness proof.

pith-pipeline@v0.9.0 · 5773 in / 1064 out tokens · 33423 ms · 2026-05-25T08:40:36.478450+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

  1. [1]

    Think you have solved direct-answer question answering? try arc-da, the direct-answer ai2 reasoning challenge

    Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark. Think you have solved direct-answer question answering? try arc-da, the direct-answer ai2 reasoning challenge. arXiv preprint arXiv:2102.03315,

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  3. [3]

    Reading wikipedia to answer open- domain questions

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open- domain questions. In 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, pp. 1870–1879. Association for Computational Linguistics (ACL),

  4. [4]

    Uprise: Universal prompt retrieval for improving zero-shot evaluation

    Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Weiwei Deng, and Qi Zhang. Uprise: Universal prompt retrieval for improving zero-shot evaluation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12318–12337,

  5. [5]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    10 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186,

  6. [6]

    Leveraging passage retrieval with generative models for open domain question answering

    Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In EACL 2021-16th Conference of the European Chapter of the Association for Computational Linguistics, pp. 874–880. Association for Computational Linguistics,

  7. [7]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics,

  8. [8]

    Jiachang Liu, Dinghan Shen, Yizhe Zhang, William B Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100–114,

  9. [9]

    Man Luo, Xin Xu, Zhuyun Dai, Panupong Pasupat, Mehran Kazemi, Chitta Baral, Vaiva Imbra- saite, and Vincent Y Zhao. Dr. icl: Demonstration-retrieved in-context learning. arXiv preprint arXiv:2305.14128,

  10. [10]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391,

  11. [11]

    Controllable semantic parsing via retrieval augmentation

    Panupong Pasupat, Yuan Zhang, and Kelvin Guu. Controllable semantic parsing via retrieval augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7683–7698,

  12. [12]

    Rocketqa: An optimized training approach to dense passage retrieval for open-domain question answering

    Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. Rocketqa: An optimized training approach to dense passage retrieval for open-domain question answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo...

  13. [13]

    Sentence-bert: Sentence embeddings using siamese bert-networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992,

  14. [14]

    Rocketqav2: A joint training method for dense passage retrieval and passage re-ranking

    Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. Rocketqav2: A joint training method for dense passage retrieval and passage re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2825–2835,

  15. [15]

    Learning to retrieve prompts for in-context learning

    Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2655–2671,

  16. [16]

    Constrained language models yield few-shot semantic parsers

    Richard Shin, Christopher Lin, Sam Thomson, Charles Chen Jr, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7699–7715,

  17. [17]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533,

  18. [18]

    Learning to retrieve in-context examples for large language models

    Liang Wang, Nan Yang, and Furu Wei. Learning to retrieve in-context examples for large language models. arXiv preprint arXiv:2307.07164,

  19. [19]

    Complementary explanations for effective in-context learning

    Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, Ves Stoyanov, Greg Durrett, and Ramakanth Pasunuru. Complementary explanations for effective in-context learning. Findings of the Association for Computational Linguistics: ACL 2023,