One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

Kenneth Enevoldsen; Yevhen Kostiuk

arxiv: 2605.22544 · v1 · pith:Q24F4WTDnew · submitted 2026-05-21 · 💻 cs.CL · cs.IR

Pith reviewed 2026-05-22 06:26 UTC · model grok-4.3

classification 💻 cs.CL cs.IR

keywords prompt sensitivity · embedding models · instruction tuning · benchmark evaluation · model rankings · performance variation · task instructions · robustness testing

The pith

Embedding model rankings can be made to favor any participant simply by changing the evaluation prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that instruction-tuned embedding models change their measured performance when the same task is described with different wording, so a single fixed prompt cannot give a trustworthy picture of how well any model works. The authors test this by running six models on eleven datasets with fifteen distinct prompts each, producing 990 separate evaluations that expose wide score spreads. If the finding holds, then current leaderboards systematically misrepresent real capability and let rankings be rearranged at will through prompt selection. Practitioners who rely on these models for search or retrieval therefore face unpredictable results once users start writing their own instructions.

Core claim

Instruction embedding models exhibit large performance differences across plausible task instructions for the same dataset. When the authors evaluate six models over fifteen prompts per dataset, the default single prompt used in existing benchmarks can either understate or overstate results relative to the full distribution, and any model can be moved into first place on the leaderboard by selecting favorable prompts.
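The size of the evaluation grid follows directly from the counts stated above. A minimal sketch of that arithmetic, assuming placeholder model and dataset names (the identifiers below are hypothetical and do not come from the paper):

```python
from itertools import product

# Hypothetical placeholder names; only the counts (6, 11, 15) come from the text.
models = [f"model_{i}" for i in range(6)]
datasets = [f"dataset_{j}" for j in range(11)]
prompts_per_dataset = 15

# One evaluation run per (model, dataset, prompt index) triple.
runs = list(product(models, datasets, range(prompts_per_dataset)))
total_runs = len(runs)

# Distinct task-specific prompts written across all datasets.
total_prompts = len(datasets) * prompts_per_dataset

print(total_runs, total_prompts)  # 990 165
```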

What carries the argument

The spread of retrieval or similarity scores that each model produces when the identical task is rephrased in fifteen different ways.
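One common way to put a number on this spread is the coefficient of variation, which the Figure 3 caption names as the paper's sensitivity measure. The sketch below is illustrative only: the scores are invented, and the choice of the population standard deviation (rather than the sample one) is an assumption, since the source does not say which the authors use.

```python
from statistics import mean, pstdev

def coefficient_of_variation(scores):
    """Population standard deviation divided by the mean of the scores."""
    mu = mean(scores)
    if mu == 0:
        raise ValueError("CV is undefined when the mean score is zero")
    return pstdev(scores) / mu

# Invented scores for two hypothetical models on one dataset,
# each evaluated under five rephrasings of the same task.
scores_by_prompt = {
    "model_a": [0.62, 0.64, 0.63, 0.61, 0.65],
    "model_b": [0.70, 0.48, 0.66, 0.52, 0.59],
}

cv_by_model = {
    name: coefficient_of_variation(scores)
    for name, scores in scores_by_prompt.items()
}
for name, cv in cv_by_model.items():
    print(f"{name}: CV = {cv:.3f}")
```

A higher value means the model's score moves more, relative to its average, when only the wording of the instruction changes.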

If this is right

Single-prompt scores reported in existing benchmarks do not reflect the range of outcomes users will actually observe.
Any of the six models can be placed at the top of the ranking by choosing the best prompt for each task.
Future benchmarks must either average across many prompts or publish sensitivity numbers next to point estimates.
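The rank-promotion claim can be illustrated with a toy example. The Figure 4 caption says ranks are computed via Borda scores across all tasks; the exact Borda variant is not given, so the version below (one point per model strictly outscored on a task) is an assumption, and every score and model name is invented.

```python
def borda_ranking(scores):
    """scores: {task: {model: score}} -> list of models, best first.

    Each model earns one point per model it strictly outscores on a task;
    points are summed over tasks. Ties in total points are broken by name.
    """
    points = {}
    for per_model in scores.values():
        for model, s in per_model.items():
            beaten = sum(
                1 for other, t in per_model.items() if other != model and s > t
            )
            points[model] = points.get(model, 0) + beaten
    return sorted(points, key=lambda m: (-points[m], m))

def promote(target, default, best):
    """Target model uses its best prompt per task; all others keep the default."""
    return {
        task: {m: (best[task][m] if m == target else s) for m, s in per_model.items()}
        for task, per_model in default.items()
    }

# Invented scores for three hypothetical models on two tasks.
default = {
    "task_1": {"A": 0.70, "B": 0.66, "C": 0.60},
    "task_2": {"A": 0.59, "B": 0.58, "C": 0.50},
}
best = {
    "task_1": {"A": 0.72, "B": 0.71, "C": 0.73},
    "task_2": {"A": 0.60, "B": 0.60, "C": 0.61},
}

print("default:", borda_ranking(default))
for model in ("A", "B", "C"):
    print(f"{model} promoted:", borda_ranking(promote(model, default, best)))
```

In this toy setup each of the three models reaches first place once it alone is allowed its best prompt, which mirrors the "best prompt versus others' default" scenario described in the Figure 4 caption.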

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training procedures for these models could be adjusted to reduce dependence on exact wording rather than chasing peak scores on one prompt.
The same sensitivity pattern is likely to appear in other instruction-following systems beyond embedding models.
Application developers may need to test several prompt variants before trusting a published leaderboard score for their use case.

Load-bearing premise

The fifteen prompts written for each dataset stand in for the full range of instructions that actual users would write.

What would settle it

Re-running the full set of experiments on a much larger collection of prompts drawn from real user logs and finding that model orderings remain stable across that wider sample.
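Whether model orderings "remain stable" could be quantified with a rank correlation between the ordering under one prompt sample and the ordering under another. Kendall's tau is one common choice; the source does not name a statistic, so the sketch below is an assumption about how such a check might look, using hypothetical model labels.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    if len(rank_a) < 2 or len(set(rank_a)) != len(rank_a):
        raise ValueError("need at least two distinct items")
    if set(rank_a) != set(rank_b) or len(rank_a) != len(rank_b):
        raise ValueError("both rankings must order the same items")
    pos_a = {m: i for i, m in enumerate(rank_a)}
    pos_b = {m: i for i, m in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Positive product: both rankings order the pair the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical orderings of three models under two different prompt samples.
ranking_small_sample = ["A", "B", "C"]
ranking_large_sample = ["B", "A", "C"]
print(round(kendall_tau(ranking_small_sample, ranking_large_sample), 3))  # 0.333
```

A value near 1 would indicate that the ordering barely changes between prompt samples; values near 0 or below would indicate the instability the paper reports.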

Figures

Figures reproduced from arXiv: 2605.22544 by Kenneth Enevoldsen, Yevhen Kostiuk.

**Figure 1.** Overview of our approach, showing the currently unreported distribution of performance over reasonable prompts against the point estimate that is the reported result. Below that we show three possible outcomes: the reported result is representative of the distribution, lower than it (prompt deflation), or higher than it (prompt inflation).

**Figure 2.** Example of prompt deflation and inflation on a sample of evaluated tasks and models. For all the scores and distributions see Appendix A.

**Figure 3.** Prompt sensitivity: coefficient of variation distributions across the models and prompts. We see that certain models have a high degree of variance across prompts.

**Figure 4.** Leaderboard sensitivity under adversarial prompt selection. We show two scenarios for each model. Left: The model uses its best prompt while others use the default. Right: The model uses its best prompt while others use their worst. Ranks are computed via Borda scores across all tasks. In both cases we compare against the default configuration (MTEB Reported). Under default prompts, KaLM and Qwen3 occupy …

**Figure 5.** Probability of prompt inflation, i.e. reported score higher than the median, across tasks. The uncertainty is a 95% bootstrapped confidence interval.

**Figure 6.** Detailed scores per model, metric, and task. For some tasks MTEB does not currently report a score hence …

**Figure 7.** Comparison of human vs machine generated prompt score distribution for Classification and Retrieval

**Figure 8.** Comparison of human vs machine generated prompt score distribution for Clustering and STS tasks. Red …
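The inflation and deflation labels used in the Figure 1, 2, and 5 captions can be expressed as a small comparison against the median over prompts. The sketch below follows the Figure 5 wording (inflation means the reported score is higher than the median); the `tolerance` band for calling a score "representative" is a hypothetical parameter added for illustration, and the scores are invented.

```python
from statistics import median

def classify_reported(reported, prompt_scores, tolerance=0.0):
    """Label a reported score relative to the median score over prompts.

    `tolerance` is an illustrative assumption: scores within this distance
    of the median are treated as representative.
    """
    med = median(prompt_scores)
    if reported > med + tolerance:
        return "inflation"
    if reported < med - tolerance:
        return "deflation"
    return "representative"

# Invented scores for one model on one task under three prompts.
prompt_scores = [0.50, 0.55, 0.60]
for reported in (0.62, 0.55, 0.48):
    print(reported, classify_reported(reported, prompt_scores))
```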

Original abstract

Instruction embedding models have become common among state-of-the-art models, however are evaluated using a single prompt per task. The single-point evaluation ignores a main problem of the instruction-based approach namely: sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, a total of 990. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can both systematically understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Prompt choice can flip any embedding model to the top of the leaderboard, so single-prompt scores are not reliable.

The main thing to know is that this paper documents a clear practical problem: instruction-tuned embedding models show large score swings across different but plausible prompts, and the authors demonstrate that you can select prompts to push any of the six models they tested into first place on the rankings. They ran 990 evaluations across 11 datasets and 15 task-specific prompts per dataset, which is enough scale to make the variability visible rather than anecdotal. The default prompt sometimes overstates and sometimes understates performance, so the usual single-prompt leaderboard numbers do not represent the distribution of results users would actually see.

That part of the work is straightforward and useful. The empirical design is simple and the numbers are reported directly, which makes the central observation easy to check. The finding that rankings are not robust is new in this specific setting for embedding models, even if prompt sensitivity has been noted elsewhere for LLMs.

The soft spot is the prompt set itself. The fifteen prompts per dataset were constructed by the authors, and the paper does not compare them to real user logs, crowdsourced instructions, or any measure of how typical they are. If the chosen variations happen to hit the kinds of syntactic or semantic differences that matter most, the rank reversals could look larger than they would be in ordinary use. That does not invalidate the result, but it limits how far the claim generalizes without further checks.

The work is aimed at people who build or select embedding models for retrieval and downstream NLP tasks, and at anyone maintaining or using benchmarks like MTEB. Readers who care about evaluation reliability will find it worth their time. It is solid enough on the empirical side to deserve a serious referee, even though the prompt representativeness question will need attention in revision.

Referee Report

1 major / 2 minor

Summary. The paper claims that instruction-tuned embedding models are highly sensitive to prompt phrasing, making single-prompt evaluations unreliable. Through experiments on 6 models across 11 datasets using 15 task-specific prompts each (990 total evaluations), it shows that default prompts can understate or overstate performance and that leaderboard rankings are not robust—any model can be made to rank first with favorable prompt choices. The authors argue that benchmarks should incorporate prompt robustness via multi-prompt evaluation or sensitivity reporting.

Significance. If the empirical observations hold, this work is significant for highlighting a systematic weakness in current evaluation practices for instruction-based embedding models. The scale of the study (990 runs) provides concrete evidence of rank instability, which could prompt the community to adopt more robust benchmarking standards. The purely empirical nature with no circular derivations strengthens the reporting of observed score distributions and reversals.

major comments (1)

The central claim that any model can be promoted to first place (and thus that single-prompt evaluation undermines leaderboards) depends on the 15 prompts per dataset capturing enough variation to reflect real evaluation fragility. The manuscript provides no external validation of the prompt set against actual user instructions, crowdsourced phrasings, or linguistic diversity metrics, leaving open whether the rank reversals are an artifact of the prompt-generation procedure rather than a general property of these models.

minor comments (2)

Add explicit details on the statistical testing used to support claims about score misrepresentation and rank changes, as the current description leaves verification of variance and significance unclear.
Clarify the exact method for generating the 15 task-specific prompts (templates, manual variation, etc.) to allow reproducibility and assessment of coverage.
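As an illustration of the kind of statistical detail the first minor comment asks for, the Figure 5 caption mentions a 95% bootstrapped confidence interval on the probability of prompt inflation. The sketch below shows one way such an interval could be computed; the percentile method, the choice of tasks as the resampling unit, and the flag values are all assumptions, since the source does not describe the procedure.

```python
import random

def bootstrap_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a list of 0/1 flags.

    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(flags)
    # Resample the flags with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(flags) / n, lower, upper

# Invented flags for eleven tasks: 1 if the reported score exceeded the
# median score over prompts on that task, 0 otherwise.
inflated = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0]
point, lower, upper = bootstrap_ci(inflated)
print(f"P(inflation) = {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With only eleven tasks the interval is wide, which is itself a reason to report it next to the point estimate.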

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the significance of our empirical findings. We address the major comment below.

Referee: The central claim that any model can be promoted to first place (and thus that single-prompt evaluation undermines leaderboards) depends on the 15 prompts per dataset capturing enough variation to reflect real evaluation fragility. The manuscript provides no external validation of the prompt set against actual user instructions, crowdsourced phrasings, or linguistic diversity metrics, leaving open whether the rank reversals are an artifact of the prompt-generation procedure rather than a general property of these models.

Authors: We agree that external validation of the prompt set (e.g., via crowdsourcing or linguistic diversity metrics) would strengthen the generalizability of our results. Our 15 prompts were manually authored to span common variations in length, specificity, formality, and structure drawn from model documentation and typical user queries for embedding tasks. The consistent observation of rank instability across all six models and eleven datasets indicates that sensitivity is not limited to an idiosyncratic prompt set. We will revise the manuscript to (1) provide a detailed appendix describing the prompt-generation process and (2) add an explicit limitations paragraph acknowledging the lack of crowdsourced validation while recommending such validation for future benchmarks.

revision: partial

Circularity Check

0 steps flagged

No circularity: pure empirical reporting of prompt sensitivity scores

The paper performs a direct empirical evaluation of six embedding models on eleven datasets using fifteen task-specific prompts each, computing and comparing performance scores and rankings across these fixed inputs. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the reported chain; the central claim that any model can reach first place follows from exhaustive enumeration of the chosen prompt set rather than from any reduction to prior outputs or self-citations. The prompt construction itself is presented as an experimental design choice without being derived from the resulting scores, keeping the analysis self-contained against the observed data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical distribution of scores across the chosen prompts rather than on new theoretical constructs, fitted parameters, or invented entities.

axioms (1)

domain assumption: The fifteen prompts per dataset are representative of plausible user instructions.
This assumption is required to generalize the observed sensitivity beyond the specific prompts tested.

pith-pipeline@v0.9.0 · 5680 in / 1080 out tokens · 62481 ms · 2026-05-22T06:26:26.805403+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We show that reported scores misrepresent the distribution of scores over plausible prompts... by choosing prompts favorably, any model in our study can be promoted to first place."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "the leaderboard ranking is not robust to prompt selection"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 7 internal anchors

[1] Brittlebench: Quantifying LLM robustness via prompt sensitivity. 2026.
[2] Wahle, Jan Philip and Ruas, Terry and Xu, Yang and Gipp, Bela. Paraphrase Types Elicit Prompt Engineering Capabilities. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.617
[3] The language of prompting: What linguistic properties make a prompt successful? 2023.
[4] Frank, Manuel and Afli, Haithem. PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.130
[5] OpenAI GPT-5 System Card. 2025.
[6] Anthropic. 2026.
[7] Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
[8] On Benchmark Hacking in ML Contests: Modeling, Insights and Design. arXiv preprint arXiv:2604.22230.
[9] Simmons, Joseph P. and Nelson, Leif D. and Simonsohn, Uri. Psychological Science.
[10] gpt-oss-120b & gpt-oss-20b Model Card. 2025.
[11] Multilingual E5 Text Embeddings: A Technical Report. arXiv preprint arXiv:2402.05672.
[12] The statistical crisis in science. The best writing on mathematics (Pitici M, ed).
[13] Hua, Andong and Tang, Kenan and Gu, Chenhe and Gu, Jindong and Wong, Eric and Qin, Yao. Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1006
[14] jina-embeddings-v5-text: Task-Targeted Embedding Distillation. arXiv preprint arXiv:2602.15547.
[15] Jiang, Ting and Jiao, Jian and Huang, Shaohan and Zhang, Zihan and Wang, Deqing and Zhuang, Fuzhen and Wei, Furu and Huang, Haizhen and Deng, Denvy and Zhang, Qi. PromptBERT: Improving BERT Sentence Embeddings with Prompts. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.603
[16] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176.
[17] Nandan Thakur and Nils Reimers and Andreas R. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[18] Kamalloo, Ehsan and Thakur, Nandan and Lassance, Carlos and Ma, Xueguang and Yang, Jheng-Hong and Lin, Jimmy. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024. doi:10.1145/3626772.3657862
[19] One Embedder, Any Task: Instruction-Finetuned Text Embeddings. 2023.
[20] C-Pack: Packaged Resources To Advance General Chinese Embedding. 2023.
[21] Xinping Zhao and Xinshuo Hu and Zifei Shan and Shouzheng Huang and Yao Zhou and Xin Zhang and Zetian Sun and zhenyu liu and Dongfang Li and Xinyuan Wei and Youcheng Pan and Yang Xiang and Meishan Zhang and Haofen Wang and Jun Yu and Baotian Hu and Min Zhang. Ka. 2026.
[22] KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model. 2025.
[23] Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils. MTEB: Massive Text Embedding Benchmark. arXiv preprint arXiv:2210.07316.
[24] A-mem: Agentic memory for llm agents. Advances in Neural Information Processing Systems.
[25] Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems.
[26] MMTEB: Massive Multilingual Text Embedding Benchmark. arXiv preprint arXiv:2502.13595.
[27] Mizrahi, Moran and Kaplan, Guy and Malkin, Dan and Dror, Rotem and Shahaf, Dafna and Stanovsky, Gabriel. State of What Art? A Call for Multi-Prompt LLM Evaluation. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00681
[28] Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches. 2018.
[29] Zhang, Xinyu and Thakur, Nandan and Ogundepo, Odunayo and Kamalloo, Ehsan and Alfonso-Hermelo, David and Li, Xiaoguang and Liu, Qun and Rezagholizadeh, Mehdi and Lin, Jimmy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl_a_00595/2157340/tacl_a_00595.pdf
[30] TWEAC: Transformer with Extendable QA Agent Classifiers. arXiv:2104.07081.
[31] HuggingFace's Transformers: State-of-the-art Natural Language Processing. 2020.
[32] Chen, Xi and Zeynali, Ali and Camargo, Chico and Fl. Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). 2022.
[33] Bandhakavi, Anil and Wiratunga, Nirmalie and P, Deepak and Massie, Stewart. Generating a Word-Emotion Lexicon from … doi:10.3115/v1/S14-1002
[34] Bi. Proceedings of the 9th International Workshop on Semantic Evaluation ( … doi:10.18653/v1/S15-2010
[35] O. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021.
[36] Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher. Learning Word Vectors for Sentiment Analysis.
[37] Barbieri, Francesco and Espinosa Anke, Luis and Camacho-Collados, Jose.
[38] Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Mittal, Arpit. FEVER: a large-scale dataset for Fact Extraction and VERification. doi:10.18653/v1/N18-1074
[39] Nandan Thakur and Luiz Bonifacio and Maik. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval.
