Standard errors in LLM pipelines underestimate uncertainty by 40-60 percent, and the coverage of naive 95 percent confidence intervals drops as sample size grows. TEE correction maintains coverage, halves estimation error on MMLU, and raises human agreement on Chatbot Arena.
1 Pith paper cites this work (field: cs.CL; year: 2026).
Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking