Evaluation uses direct string matching and reports an accuracy score.

GRE Reading Comprehension: 32 items. Ground truth is available as multiple-choice options, which are also provided in the prompt.
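A minimal sketch of how direct string matching against multiple-choice ground truth could yield an accuracy score. The function names `normalize` and `exact_match_accuracy`, and the whitespace and case normalization step, are assumptions for illustration; the source does not say whether answers are normalized before comparison.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    # Assumption: trim whitespace and ignore case before comparing.
    # A stricter evaluation might compare the raw strings instead.
    return answer.strip().upper()


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of predictions that match their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical answers for four multiple-choice items.
score = exact_match_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"])
print(score)
```

With a 32-item set such as the one described above, each item contributes 1/32 (about 3.1 percentage points) to the score, so small differences between models should be read with caution.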

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Beyond Confidence: Rethinking Self-Assessments for Performance Prediction in LLMs

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Effort and ability appraisals match or beat confidence in predicting LLM failures, with effort giving less overoptimistic and more stable signals across model sizes and task types.

