Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs

Hua, Andong, Tang, Kenan, Gu, Chenhe, Gu, Jindong, Wong, Eric, Qin, Yao · 2025 · DOI 10.18653/v1/2025.emnlp-main.1006

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics

cs.CL · 2026-05-13 · accept · novelty 7.0

LLMs can provide cost-effective annotation of credibility in Danish asylum texts but produce inconsistent errors that vary by model and prompt, requiring checks beyond single-model accuracy.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

cs.CL · 2026-05-21 · accept · novelty 6.0

Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

cs.CL · 2026-06-05 · unverdicted · novelty 4.0

On a controlled Turkish dataset of 147 examples, few-shot prompting lets some LLMs match or beat a supervised BERT baseline for LVC detection, though results are highly sensitive to prompt design.

citing papers explorer

Showing 4 of 4 citing papers.

LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
cs.CL · 2026-05-13 · accept · none · ref 77
LLMs can provide cost-effective annotation of credibility in Danish asylum texts but produce inconsistent errors that vary by model and prompt, requiring checks beyond single-model accuracy.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
cs.CL · 2026-05-07 · unverdicted · none · ref 11
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
cs.CL · 2026-05-21 · accept · none · ref 13
Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification
cs.CL · 2026-06-05 · unverdicted · none · ref 13
On a controlled Turkish dataset of 147 examples, few-shot prompting lets some LLMs match or beat a supervised BERT baseline for LVC detection, though results are highly sensitive to prompt design.
