All Physics samples are chosen randomly (ensuring they are text-only) from the original HLE dataset, leading to 168 items

Humanity’s Last Exam (HLE) (Physics, Chemistry, Biology subsets) [Phan et al., 2025]

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · ref 26

Effort and ability appraisals match or beat confidence in predicting LLM failures, with effort giving less overoptimistic and more stable signals across model sizes and task types.
