Can we trust AI benchmarks? An interdisciplinary review of current issues in AI evaluation. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 8(1):850–864, Oct

Maria Eriksson, Erasmo Purificato, Arman Noroozian, João Vinagre, Guillaume Chaslot, Emilia Gomez, David Fernandez-Llorca · 2025

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

An Interpretable and Scalable Framework for Evaluating Large Language Models

stat.ML · 2026-05-07 · unverdicted · novelty 6.0

A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.

citing papers explorer

Showing 1 of 1 citing paper.

An Interpretable and Scalable Framework for Evaluating Large Language Models

stat.ML · 2026-05-07 · unverdicted · none · ref 23
