Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
4 Pith papers cite this work.
fields: cs.CL (4)
years: 2026 (4)
citing papers explorer
- LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
  LLMs can provide cost-effective annotation of credibility in Danish asylum texts but produce inconsistent errors that vary by model and prompt, requiring checks beyond single-model accuracy.
- Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
  A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
- One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
  Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
- Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification
  On a controlled Turkish dataset of 147 examples, few-shot prompting lets some LLMs match or beat a supervised BERT baseline for LVC detection, though results are highly sensitive to prompt design.