One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
Pith reviewed 2026-05-22 06:26 UTC · model grok-4.3
The pith
Embedding model rankings can be made to favor any model in the study simply by changing the evaluation prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Instruction embedding models exhibit large performance differences across plausible task instructions for the same dataset. When the authors evaluate six models over fifteen prompts per dataset, the default single prompt used in existing benchmarks can either understate or overstate results relative to the full distribution, and any model can be moved into first place on the leaderboard by selecting favorable prompts.
What carries the argument
The spread of retrieval or similarity scores that each model produces when the identical task is rephrased in fifteen different ways.
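A minimal sketch of how that spread could be summarised for one model on one dataset. The function name `spread`, the choice of index 0 as the default prompt, and the example scores are all assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def spread(per_prompt_scores, default_index=0):
    """Summarise how one model's score varies across prompt rephrasings.

    per_prompt_scores: one score per prompt for the same model and dataset.
    default_index: position of the benchmark's default prompt (assumed to be 0).
    """
    default = per_prompt_scores[default_index]
    lowest, highest = min(per_prompt_scores), max(per_prompt_scores)
    average = mean(per_prompt_scores)
    return {
        "default": default,
        "mean": average,
        "std": pstdev(per_prompt_scores),
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
        # Positive: the default prompt overstates the average score.
        # Negative: the default prompt understates it.
        "default_minus_mean": default - average,
    }


# Hypothetical scores, truncated to five prompts for brevity.
example_scores = [0.62, 0.58, 0.66, 0.49, 0.64]
summary = spread(example_scores)
print(summary["range"], summary["default_minus_mean"])
```

Reporting the range and the default-minus-mean gap next to the default score is one simple way to show whether a single-prompt number sits near the middle or at an extreme of the distribution.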
If this is right
- Single-prompt scores reported in existing benchmarks do not reflect the range of outcomes users will actually observe.
- Any of the six models can be placed at the top of the ranking by choosing the best prompt for each task.
- Future benchmarks should either average across many prompts or publish sensitivity numbers next to point estimates.
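The second consequence above can be checked mechanically once per-prompt scores are available. The sketch below uses one possible reading of "choosing prompts favorably": the target model is scored with its best prompt on every dataset while each rival keeps the default prompt. That reading, the function names, and the toy numbers are assumptions for illustration, not the paper's procedure or data.

```python
from statistics import mean


def leaderboard_score(scores, model, pick):
    """Average over datasets of the score that `pick` selects from each prompt list."""
    return mean(pick(prompt_scores) for prompt_scores in scores[model].values())


def can_reach_first(scores, target, default_index=0):
    """True if `target` with its best prompts beats every rival on default prompts."""
    favorable = leaderboard_score(scores, target, max)
    rivals = [
        leaderboard_score(scores, model, lambda s: s[default_index])
        for model in scores
        if model != target
    ]
    return all(favorable > rival for rival in rivals)


# Hypothetical layout: scores[model][dataset] is a list of per-prompt scores,
# with the default prompt at index 0.
scores = {
    "model_a": {"ds1": [0.70, 0.62, 0.75], "ds2": [0.55, 0.60, 0.50]},
    "model_b": {"ds1": [0.68, 0.74, 0.60], "ds2": [0.52, 0.58, 0.49]},
}

for name in scores:
    print(name, can_reach_first(scores, name))
```

In this toy data model_a leads on default prompts, yet model_b overtakes it once its own best prompts are selected, which is the kind of reversal the paper reports.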
Where Pith is reading between the lines
- Training procedures for these models could be adjusted to reduce dependence on exact wording rather than chasing peak scores on one prompt.
- The same sensitivity pattern is likely to appear in other instruction-following systems beyond embedding models.
- Application developers may need to test several prompt variants before trusting a published leaderboard score for their use case.
Load-bearing premise
The fifteen prompts written for each dataset stand in for the full range of instructions that actual users would write.
What would settle it
Re-running the full set of experiments on a much larger collection of prompts drawn from real user logs and finding that model orderings remain stable across that wider sample.
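One way to quantify whether model orderings "remain stable" is a rank correlation between the orderings produced by different prompts. The sketch below hand-rolls Kendall's tau-a (no tie correction) to stay dependency-free; the choice of this statistic and the function names are assumptions, since the paper's own stability measure is not described here.

```python
from itertools import combinations


def kendall_tau(scores_x, scores_y):
    """Kendall tau-a between two score vectors over the same models.

    Returns 1.0 for identical orderings and -1.0 for fully reversed ones.
    Ties count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_x)), 2):
        product = (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_x) * (len(scores_x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_pairwise_tau(score_matrix):
    """Average tau over all prompt pairs; score_matrix[p][m] is model m under prompt p."""
    taus = [kendall_tau(a, b) for a, b in combinations(score_matrix, 2)]
    return sum(taus) / len(taus)
```

A mean pairwise tau close to 1 on a larger, user-derived prompt sample would count against the paper's claim; values well below 1 would support it.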
Original abstract
Instruction embedding models have become common among state-of-the-art models, yet they are evaluated using a single prompt per task. This single-point evaluation ignores a central problem of the instruction-based approach, namely sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, for a total of 990 evaluations. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can systematically either understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
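The figure of 990 follows from the full model, dataset, and prompt grid. A short sketch of that arithmetic, with integer indices standing in for the actual models, datasets, and prompts (which are not named here):

```python
from itertools import product

N_MODELS, N_DATASETS, N_PROMPTS = 6, 11, 15

# Every (model, dataset, prompt) combination is one evaluation run.
grid = list(product(range(N_MODELS), range(N_DATASETS), range(N_PROMPTS)))
print(len(grid))  # 990

# A single-prompt benchmark covers only one prompt per (model, dataset) pair.
single_prompt_cells = N_MODELS * N_DATASETS
print(single_prompt_cells)  # 66
```

Under this count, a conventional single-prompt leaderboard observes 66 of the 990 cells in the grid.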
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that instruction-tuned embedding models are highly sensitive to prompt phrasing, making single-prompt evaluations unreliable. Through experiments on 6 models across 11 datasets using 15 task-specific prompts each (990 total evaluations), it shows that default prompts can understate or overstate performance and that leaderboard rankings are not robust—any model can be made to rank first with favorable prompt choices. The authors argue that benchmarks should incorporate prompt robustness via multi-prompt evaluation or sensitivity reporting.
Significance. If the empirical observations hold, this work is significant for highlighting a systematic weakness in current evaluation practices for instruction-based embedding models. The scale of the study (990 runs) provides concrete evidence of rank instability, which could prompt the community to adopt more robust benchmarking standards. The study is purely empirical, with no circular derivations, which strengthens confidence in the reported score distributions and rank reversals.
major comments (1)
- The central claim that any model can be promoted to first place (and thus that single-prompt evaluation undermines leaderboards) depends on the 15 prompts per dataset capturing enough variation to reflect real evaluation fragility. The manuscript provides no external validation of the prompt set against actual user instructions, crowdsourced phrasings, or linguistic diversity metrics, leaving open whether the rank reversals are an artifact of the prompt-generation procedure rather than a general property of these models.
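As an illustration of the kind of linguistic diversity metric the referee has in mind, the sketch below computes the mean pairwise Jaccard distance between the token sets of a prompt collection. This particular metric, the whitespace tokenisation, and the example prompts are assumptions chosen for simplicity; the manuscript does not report any such measure.

```python
from itertools import combinations


def jaccard_distance(prompt_a, prompt_b):
    """1 minus the token-set overlap of two prompts (lowercased, whitespace-split)."""
    tokens_a = set(prompt_a.lower().split())
    tokens_b = set(prompt_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def mean_pairwise_distance(prompts):
    """Average Jaccard distance over all unordered prompt pairs."""
    pairs = list(combinations(prompts, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical prompts written for this example only.
prompts = [
    "Retrieve passages that answer the question",
    "Given a question, find relevant passages",
    "Find documents supporting the claim",
]
print(mean_pairwise_distance(prompts))
```

A value near 0 would indicate near-duplicate phrasings, while a value near 1 would indicate little lexical overlap. Lexical distance is only a rough proxy: two prompts can share few tokens and still mean the same thing.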
minor comments (2)
- Add explicit details on the statistical testing used to support claims about score misrepresentation and rank changes, as the current description leaves verification of variance and significance unclear.
- Clarify the exact method for generating the 15 task-specific prompts (templates, manual variation, etc.) to allow reproducibility and assessment of coverage.
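For the first minor comment, one option the authors could report is a paired bootstrap confidence interval on the score difference between two models, resampling over prompts. The sketch assumes both models were scored on the same prompts in the same order; the function name, defaults, and example scores are hypothetical and not drawn from the paper.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-prompt difference (a minus b).

    scores_a[i] and scores_b[i] must come from the same prompt.
    Returns (observed_mean_difference, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample prompts with replacement and record the mean difference each time.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Hypothetical per-prompt scores for two models on one dataset.
model_a = [0.70, 0.72, 0.68, 0.71]
model_b = [0.60, 0.61, 0.59, 0.62]
print(paired_bootstrap_ci(model_a, model_b))
```

If the interval excludes zero, the ordering of the two models is unlikely to be an artifact of which prompts happened to be sampled. An interval that straddles zero signals that the ranking depends on the prompt choice.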
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the significance of our empirical findings. We address the major comment below.
Point-by-point responses
Referee: The central claim that any model can be promoted to first place (and thus that single-prompt evaluation undermines leaderboards) depends on the 15 prompts per dataset capturing enough variation to reflect real evaluation fragility. The manuscript provides no external validation of the prompt set against actual user instructions, crowdsourced phrasings, or linguistic diversity metrics, leaving open whether the rank reversals are an artifact of the prompt-generation procedure rather than a general property of these models.
Authors: We agree that external validation of the prompt set (e.g., via crowdsourcing or linguistic diversity metrics) would strengthen the generalizability of our results. Our 15 prompts were manually authored to span common variations in length, specificity, formality, and structure drawn from model documentation and typical user queries for embedding tasks. The consistent observation of rank instability across all six models and eleven datasets indicates that sensitivity is not limited to an idiosyncratic prompt set. We will revise the manuscript to (1) provide a detailed appendix describing the prompt-generation process and (2) add an explicit limitations paragraph acknowledging the lack of crowdsourced validation while recommending such validation for future benchmarks.
Revision: partial
Circularity Check
No circularity: pure empirical reporting of prompt sensitivity scores
The paper performs a direct empirical evaluation of six embedding models on eleven datasets using fifteen task-specific prompts each, computing and comparing performance scores and rankings across these fixed inputs. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the reported chain; the central claim that any model can reach first place follows from exhaustive enumeration of the chosen prompt set rather than from any reduction to prior outputs or self-citations. The prompt construction itself is presented as an experimental design choice without being derived from the resulting scores, keeping the analysis self-contained against the observed data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The fifteen prompts per dataset are representative of plausible user instructions.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We show that reported scores misrepresent the distribution of scores over plausible prompts... by choosing prompts favorably, any model in our study can be promoted to first place."
- IndisputableMonolith/Foundation/DimensionForcing.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "the leaderboard ranking is not robust to prompt selection"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.