Holistic evaluation of language models. Transactions on Machine Learning Research

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. · 2023

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.

Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents

cs.AI · 2026-04-27 · unverdicted · novelty 5.0

PSA-Eval reframes evaluation of trilingual public-space agents around traceable failures and regression testing, revealing cross-language score drift in a pilot despite high average performance.

citing papers explorer

Showing 2 of 2 citing papers.

Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation · cs.CL · 2026-05-04 · unverdicted · none · ref 17
Failure-Centered Runtime Evaluation for Deployed Trilingual Public-Space Agents · cs.AI · 2026-04-27 · unverdicted · none · ref 5
