Item-level Reliable Change Index analysis shows that LLM version upgrades result in bidirectional performance shifts on individual questions, making aggregate accuracy gains the net residual of improvements and deteriorations.
In Proceedings of the International Conference on Machine Learning
Cited by 1 Pith paper (field: cs.CL; year: 2026; verdict: unverdicted).
Representative citing paper:
Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation