Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.
Demystifying Prompts in Language Models via Perplexity Estimation
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
background 1polarities
background 1representative citing papers
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
GSM-SEM is a reusable framework for creating semantically variant augmentations of math benchmarks like GSM8K that alter facts but preserve answers and difficulty, with evaluations showing LLM performance drops of up to 28% on the new variants.
Uncertainty-based gating for few-shot reranking reduces compute by 15-80% and improves performance by up to 2% across 8 LLMs, 7 NLU datasets, and 9 MT settings.
citing papers explorer
-
Forecasting Downstream Performance of LLMs With Proxy Metrics
Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.
-
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
-
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
GSM-SEM is a reusable framework for creating semantically variant augmentations of math benchmarks like GSM8K that alter facts but preserve answers and difficulty, with evaluations showing LLM performance drops of up to 28% on the new variants.
-
When Reranking Hurts: Uncertainty-Based Gating for Few-Shot Reranking
Uncertainty-based gating for few-shot reranking reduces compute by 15-80% and improves performance by up to 2% across 8 LLMs, 7 NLU datasets, and 9 MT settings.