
arxiv: 2605.22544 · v1 · pith:Q24F4WTDnew · submitted 2026-05-21 · 💻 cs.CL · cs.IR

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

Pith reviewed 2026-05-22 06:26 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords prompt sensitivity · embedding models · instruction tuning · benchmark evaluation · model rankings · performance variation · task instructions · robustness testing

The pith

Embedding model rankings can be made to favor any participant simply by changing the evaluation prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that instruction-tuned embedding models change their measured performance when the same task is described with different wording, so a single fixed prompt cannot give a trustworthy picture of how well any model works. The authors test this by running six models on eleven datasets with fifteen distinct prompts each, producing 990 separate evaluations that expose wide score spreads. If the finding holds, then current leaderboards systematically misrepresent real capability and let rankings be rearranged at will through prompt selection. Practitioners who rely on these models for search or retrieval therefore face unpredictable results once users start writing their own instructions.
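The count of 990 follows from the full cross product of models, datasets, and prompts. A minimal sketch of that arithmetic, using placeholder identifiers rather than the paper's actual model and dataset names:

```python
from itertools import product

N_MODELS, N_DATASETS, N_PROMPTS = 6, 11, 15

# Placeholder identifiers: the real model and dataset names are not used here.
models = [f"model_{i}" for i in range(N_MODELS)]
datasets = [f"dataset_{j}" for j in range(N_DATASETS)]

# One evaluation per (model, dataset, prompt index) combination.
grid = list(product(models, datasets, range(N_PROMPTS)))

# Distinct task-specific prompts: each dataset has its own set of fifteen.
task_prompts = N_DATASETS * N_PROMPTS

print(len(grid))     # 990
print(task_prompts)  # 165
```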

Core claim

Instruction embedding models exhibit large performance differences across plausible task instructions for the same dataset. When the authors evaluate six models over fifteen prompts per dataset, the default single prompt used in existing benchmarks can either understate or overstate results relative to the full distribution, and any model can be moved into first place on the leaderboard by selecting favorable prompts.
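The Figure 5 caption defines prompt inflation as a reported score above the median of the prompt distribution and gives a 95% bootstrapped interval. The sketch below is one way such a quantity could be computed. The mirror-image definition of deflation, the percentile bootstrap, the function names, and all scores are assumptions made for illustration, not the authors' procedure or data.

```python
import random
from statistics import median


def classify(reported, prompt_scores):
    """Compare the reported (default-prompt) score with the median over prompts."""
    mid = median(prompt_scores)
    if reported > mid:
        return "inflation"
    if reported < mid:
        return "deflation"
    return "representative"


def inflation_probability(pairs, n_boot=2000, seed=0):
    """Share of (reported, prompt_scores) pairs showing inflation,
    with a percentile-bootstrap 95% interval over the pairs."""
    flags = [classify(r, s) == "inflation" for r, s in pairs]
    n = len(flags)
    rng = random.Random(seed)
    # Resample the per-task flags with replacement and recompute the share.
    boots = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_boot))
    lower = boots[int(0.025 * n_boot)]
    upper = boots[int(0.975 * n_boot) - 1]
    return sum(flags) / n, lower, upper


# Hypothetical scores for one model on four tasks (not taken from the paper).
pairs = [
    (0.71, [0.60, 0.65, 0.68, 0.70, 0.72]),  # reported above the median 0.68
    (0.55, [0.50, 0.52, 0.54, 0.58, 0.60]),  # reported above the median 0.54
    (0.40, [0.42, 0.45, 0.47, 0.48, 0.50]),  # reported below the median 0.47
    (0.83, [0.75, 0.78, 0.80, 0.82, 0.85]),  # reported above the median 0.80
]
estimate, lower, upper = inflation_probability(pairs)
print(estimate)  # 0.75
```

With only four tasks the interval is very wide, which is itself a reminder that a point estimate of inflation needs its uncertainty reported alongside it.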

What carries the argument

The spread of retrieval or similarity scores that each model produces when the identical task is rephrased in fifteen different ways.
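The Figure 3 caption summarizes this spread with a coefficient of variation (there called "coefficient of variance"). A minimal sketch of that statistic follows. It uses the sample standard deviation, which is an assumption, and hypothetical scores.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Sample standard deviation divided by the mean of the scores."""
    mu = mean(scores)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(scores) / mu


# Hypothetical per-prompt scores for two models on one task.
per_prompt = {
    "model_a": [0.70, 0.71, 0.69, 0.70, 0.70],  # nearly insensitive to wording
    "model_b": [0.55, 0.75, 0.62, 0.80, 0.68],  # strongly sensitive to wording
}
for name, scores in per_prompt.items():
    print(name, round(coefficient_of_variation(scores), 3))
```

Dividing by the mean makes the spread comparable across tasks whose metrics sit on different scales, which is presumably why a relative measure is preferred over a raw standard deviation.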

If this is right

  • Single-prompt scores reported in existing benchmarks do not reflect the range of outcomes users will actually observe.
  • Any of the six models can be placed at the top of the ranking by choosing the best prompt for each task.
  • Future benchmarks must either average across many prompts or publish sensitivity numbers next to point estimates.
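The second point can be illustrated with a toy leaderboard. According to the Figure 4 caption, ranks are computed via Borda scores across tasks, and one scenario gives a single model its best prompt while the others keep the default. The sketch below reproduces that scenario on hypothetical scores. The function names, the tie handling, and the data are assumptions, not the authors' code.

```python
def borda_ranking(scores):
    """scores: {task: {model: score}} -> models sorted by Borda points, best first.
    On each task a model earns one point per model it outscores (no ties assumed)."""
    points = {}
    for task_scores in scores.values():
        ordered = sorted(task_scores, key=task_scores.get)  # worst first
        for pts, model in enumerate(ordered):
            points[model] = points.get(model, 0) + pts
    return sorted(points, key=lambda m: (-points[m], m))


def promote(target, default, per_prompt):
    """Give `target` its best prompt on every task; all others keep the default."""
    return {
        task: {
            m: (max(per_prompt[task][m]) if m == target else s)
            for m, s in model_scores.items()
        }
        for task, model_scores in default.items()
    }


# Hypothetical default-prompt scores and per-prompt scores for three models.
default = {
    "task_a": {"A": 0.70, "B": 0.68, "C": 0.65},
    "task_b": {"A": 0.60, "B": 0.58, "C": 0.55},
}
per_prompt = {
    "task_a": {"A": [0.66, 0.70, 0.71], "B": [0.64, 0.68, 0.72], "C": [0.60, 0.65, 0.73]},
    "task_b": {"A": [0.57, 0.60, 0.61], "B": [0.55, 0.58, 0.62], "C": [0.50, 0.55, 0.63]},
}

print(borda_ranking(default))  # ['A', 'B', 'C']
for model in "ABC":
    print(model, borda_ranking(promote(model, default, per_prompt)))
```

In this toy setup every model reaches first place once it alone is allowed its best prompt, which mirrors the kind of reversal the paper reports.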

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training procedures for these models could be adjusted to reduce dependence on exact wording rather than chasing peak scores on one prompt.
  • The same sensitivity pattern is likely to appear in other instruction-following systems beyond embedding models.
  • Application developers may need to test several prompt variants before trusting a published leaderboard score for their use case.

Load-bearing premise

The fifteen prompts written for each dataset stand in for the full range of instructions that actual users would write.

What would settle it

Re-running the full set of experiments on a much larger collection of prompts drawn from real user logs and finding that model orderings remain stable across that wider sample.
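One way to quantify whether model orderings "remain stable" is the average rank correlation between the orderings induced by different prompts. The sketch below uses Kendall's tau-a for this. The choice of statistic, the function names, and the scores are assumptions for illustration. The paper is not stated to use this measure.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_pairwise_tau(score_rows):
    """Average tau over all pairs of prompts; each row holds one score per model."""
    taus = [kendall_tau(a, b) for a, b in combinations(score_rows, 2)]
    return sum(taus) / len(taus)


# Hypothetical scores: rows are prompts, columns are three models.
stable = [[0.70, 0.60, 0.50], [0.72, 0.61, 0.50], [0.69, 0.65, 0.40]]
unstable = [[0.70, 0.60, 0.50], [0.50, 0.60, 0.70]]

print(mean_pairwise_tau(stable))    # 1.0
print(mean_pairwise_tau(unstable))  # -1.0
```

A value near 1 across a large sample of real user prompts would support stable rankings, while values well below 1 would indicate that orderings depend on wording.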

Figures

Figures reproduced from arXiv: 2605.22544 by Kenneth Enevoldsen, Yevhen Kostiuk.

Figure 1: Overview of our approach, showing the currently unreported distribution of performance over reasonable prompts against the point estimate that is the reported result. Below that we show three possible outcomes: the reported result is representative of the distribution, lower than it (prompt deflation), or higher than it (prompt inflation).
Figure 2: Example of prompt deflation and inflation on a sample of evaluated tasks and models. For all the scores and distributions see Appendix A.
Figure 3: Prompt sensitivity: Coefficient of variance distributions across the models and prompts. We see that certain models have a high degree of variance across prompts.
Figure 4: Leaderboard sensitivity under adversarial prompt selection. We show for each model two scenarios. Left: The model uses its best prompt while others use the default. Right: The model uses its best prompt while others use their worst. Ranks are computed via Borda scores across all tasks. In both cases we compare against the default configuration (MTEB Reported).
Figure 5: Probability of prompt inflation, i.e. reported score higher than the median, across tasks. The uncertainty is a 95% bootstrapped confidence interval.
Figure 6: Detailed scores per model, metric, and task. For some tasks MTEB does not currently report a score hence …
Figure 7: Comparison of human vs machine generated prompt score distribution for Classification and Retrieval …
Figure 8: Comparison of human vs machine generated prompt score distribution for Clustering and STS tasks. Red …
Original abstract

Instruction embedding models have become common among state-of-the-art models, however are evaluated using a single prompt per task. The single-point evaluation ignores a main problem of the instruction-based approach namely: sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, a total of 990. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can both systematically understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that instruction-tuned embedding models are highly sensitive to prompt phrasing, making single-prompt evaluations unreliable. Through experiments on 6 models across 11 datasets using 15 task-specific prompts each (990 total evaluations), it shows that default prompts can understate or overstate performance and that leaderboard rankings are not robust—any model can be made to rank first with favorable prompt choices. The authors argue that benchmarks should incorporate prompt robustness via multi-prompt evaluation or sensitivity reporting.

Significance. If the empirical observations hold, this work is significant for highlighting a systematic weakness in current evaluation practices for instruction-based embedding models. The scale of the study (990 runs) provides concrete evidence of rank instability, which could prompt the community to adopt more robust benchmarking standards. The purely empirical nature with no circular derivations strengthens the reporting of observed score distributions and reversals.

major comments (1)
  1. The central claim that any model can be promoted to first place (and thus that single-prompt evaluation undermines leaderboards) depends on the 15 prompts per dataset capturing enough variation to reflect real evaluation fragility. The manuscript provides no external validation of the prompt set against actual user instructions, crowdsourced phrasings, or linguistic diversity metrics, leaving open whether the rank reversals are an artifact of the prompt-generation procedure rather than a general property of these models.
minor comments (2)
  1. Add explicit details on the statistical testing used to support claims about score misrepresentation and rank changes, as the current description leaves verification of variance and significance unclear.
  2. Clarify the exact method for generating the 15 task-specific prompts (templates, manual variation, etc.) to allow reproducibility and assessment of coverage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the significance of our empirical findings. We address the major comment below.

Point-by-point responses
  1. Referee: The central claim that any model can be promoted to first place (and thus that single-prompt evaluation undermines leaderboards) depends on the 15 prompts per dataset capturing enough variation to reflect real evaluation fragility. The manuscript provides no external validation of the prompt set against actual user instructions, crowdsourced phrasings, or linguistic diversity metrics, leaving open whether the rank reversals are an artifact of the prompt-generation procedure rather than a general property of these models.

    Authors: We agree that external validation of the prompt set (e.g., via crowdsourcing or linguistic diversity metrics) would strengthen the generalizability of our results. Our 15 prompts were manually authored to span common variations in length, specificity, formality, and structure drawn from model documentation and typical user queries for embedding tasks. The consistent observation of rank instability across all six models and eleven datasets indicates that sensitivity is not limited to an idiosyncratic prompt set. We will revise the manuscript to (1) provide a detailed appendix describing the prompt-generation process and (2) add an explicit limitations paragraph acknowledging the lack of crowdsourced validation while recommending such validation for future benchmarks.

    revision: partial

Circularity Check

0 steps flagged

No circularity: pure empirical reporting of prompt sensitivity scores

full rationale

The paper performs a direct empirical evaluation of six embedding models on eleven datasets using fifteen task-specific prompts each, computing and comparing performance scores and rankings across these fixed inputs. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the reported chain; the central claim that any model can reach first place follows from exhaustive enumeration of the chosen prompt set rather than from any reduction to prior outputs or self-citations. The prompt construction itself is presented as an experimental design choice without being derived from the resulting scores, keeping the analysis self-contained against the observed data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical distribution of scores across the chosen prompts rather than on new theoretical constructs, fitted parameters, or invented entities.

axioms (1)
  • domain assumption: The fifteen prompts per dataset are representative of plausible user instructions.
    This assumption is required to generalize the observed sensitivity beyond the specific prompts tested.

pith-pipeline@v0.9.0 · 5680 in / 1080 out tokens · 62481 ms · 2026-05-22T06:26:26.805403+00:00 · methodology


