Lost in benchmarks? Rethinking large language model benchmarking with item response theory. In Proceedings of the AAAI Conference on Artificial Intelligence 40 (2026), 35085–35093

Zhou, H · 2026

1 Pith paper cites this work. Polarity classification is still being indexed.

representative citing papers

AI scientists produce results without reasoning scientifically

cs.AI · 2026-04-20 · conditional · novelty 7.0 · ref 49

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
