Demystifying prediction powered inference. arXiv preprint arXiv:2601.20819
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: method (1)
polarities: use method (1)
citing papers explorer
- In-Context Learning for the Imputation of Public Opinion Data with Large Language Models
  ICL with LLMs reduces absolute imputation error for survey data versus MICE PMM across MCAR/MAR/MNAR mechanisms and yields narrower intervals with near-nominal coverage.
- Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research
  Multi-task PPI framework uses cross-task recalibration to improve inference power across related tasks, with a proof that gains require nonlinear proxy-ground-truth structure, shown on synthetic data and a 2024 election LM audit case study.
- Calibeating Prediction-Powered Inference
  Post-hoc calibration of miscalibrated black-box predictions on a labeled sample improves efficiency of prediction-powered inference for semisupervised mean estimation.
- Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation
  GLIDE is a Python library that packages multiple PPI estimators and samplers for reliable GenAI evaluation and reports annotation savings in an agentic case study.