What did i do wrong? quantifying llms' sensitivity and consistency to prompt engineering

Federico Errica, Giuseppe Siracusano, Davide Sanvito, Roberto Bifulco · 2024 · arXiv 2406.12334

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

cs.LG · 2025-12-14 · unverdicted · novelty 7.0

GRPO fine-tuning with entropy-based stability rewards reduces output variability in LLMs for investment and job recommendations compared to baseline models.

PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

cs.CL · 2025-12-11 · conditional · novelty 7.0

PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with modest compute.

cs.SE · 2026-05-08 · unverdicted · novelty 6.0

SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

citing papers explorer

Showing 4 of 4 citing papers.

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization cs.LG · 2025-12-14 · unverdicted · none · ref 9
GRPO fine-tuning with entropy-based stability rewards reduces output variability in LLMs for investment and job recommendations compared to baseline models.
PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data cs.CL · 2025-12-11 · conditional · none · ref 27
PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with modest compute.
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization cs.SE · 2026-05-08 · unverdicted · none · ref 14
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
Position: AI Evaluations Should be Grounded on a Theory of Capability cs.AI · 2025-09-23 · conditional · none · ref 18
AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

What did i do wrong? quantifying llms' sensitivity and consistency to prompt engineering

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer