OLMES: A Standard for Language Model Evaluations

Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, Hannaneh Hajishirzi · 2025 · DOI 10.18653/v1/2025.findings-naacl.282

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Forecasting Downstream Performance of LLMs With Proxy Metrics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

cs.LG · 2026-06-05 · unverdicted · novelty 5.0

GRASP is a scalable method for subset-level data attribution in pretraining that models interactions via a geometry-aware quadratic penalty and claims to double rank correlation while cutting costs.

citing papers explorer

Showing 2 of 2 citing papers.

Forecasting Downstream Performance of LLMs With Proxy Metrics
cs.CL · 2026-05-18 · unverdicted · none · ref 92

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution
cs.LG · 2026-06-05 · unverdicted · none · ref 25