In Proceedings of the International Conference on Machine Learning

tinyBenchmarks: Evaluating LLMs with fewer examples

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

Item-level Reliable Change Index analysis shows that LLM version upgrades result in bidirectional performance shifts on individual questions, making aggregate accuracy gains the net residual of improvements and deteriorations.
