arxiv: 2604.19781 · v1 · submitted 2026-03-29 · 💻 cs.CY · cs.AI· cs.CL

Recognition: unknown

Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment

Tyler Burleigh

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:09 UTC · model grok-4.3

classification 💻 cs.CY cs.AIcs.CL

keywords automated scoringcascade systemsverbalized confidencesmall language modelseducational assessmentcost efficiencymodel routing

0 comments

The pith

Small language models can route student scoring tasks to larger models using their verbalized numerical confidence, matching large-model accuracy at 76 percent lower cost and 61 percent lower latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether small language models can indicate when their scoring predictions are likely wrong by stating a numerical confidence score, and whether that signal can route easy cases to the small model and hard cases to a larger one in an automated educational assessment system. It tests this on 2,100 expert-scored decisions from student-AI math conversations using pairs of small and large models from three families. The central finding is that only small models whose confidence scores show real variation across items produce cascades whose accuracy is statistically indistinguishable from the large model alone, while models with near-constant confidence cannot close the accuracy gap. This matters because automated scoring at scale must balance accuracy against the high cost and latency of always using the largest available model.

Core claim

Verbalized confidence serves as an effective routing signal in cascade scoring systems when small language models produce sufficiently varied confidence values; the best such cascades reach kappa 0.802 versus 0.819 for the large model alone, at 76 percent lower cost and 61 percent lower latency. Confidence discrimination varies sharply across small models, with the strongest reaching AUROC 0.857 and the weakest producing a near-degenerate distribution. Lower confidence also aligns with items where human annotators disagreed or took longer to score.

What carries the argument

Verbalized numerical confidence as a routing signal that decides whether a small language model handles a scoring task or escalates it to a larger model.

If this is right

Small language models with strong confidence variance enable practitioners to move along a cost-accuracy frontier by adjusting the escalation threshold.
Small language models whose confidence is nearly constant cannot produce cascades that close the accuracy gap no matter what threshold is chosen.
Confidence values track human scoring difficulty, so lower-confidence items are also the ones that take annotators longer and produce more disagreement.
Cascades built from the strongest small models incur no statistically detectable kappa loss relative to always using the large model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same routing logic could be tested on other text-based judgment tasks such as content moderation or clinical note review where cost and latency constraints are similar.
Improving confidence calibration in small models would directly widen the set of tasks for which cheap cascades become viable.
Production systems could monitor the variance of confidence scores on incoming data as a quick diagnostic for whether a given small model remains useful for routing.

Load-bearing premise

The verbalized numerical confidence produced by small language models is a stable signal that reliably tracks actual correctness across different model families and student response data.

What would settle it

A new collection of student responses in which small-model confidence shows no correlation with actual scoring errors or with human annotator disagreement and scoring time.

Figures

Figures reproduced from arXiv: 2604.19781 by Tyler Burleigh.

**Figure 1.** Figure 1: The scoring task for item F-IF.B.6 (algebra). The annotator sees the math problem, rubric criterion with evaluation guidance, and the student-AI conversation, then judges whether the student’s responses satisfy the criterion. 2.3 Human uncertainty measures We define two complementary proxies for human scoring uncertainty, which serve as validation targets for RQ2. Inter-rater disagreement. With 3 annotat… view at source ↗

**Figure 3.** Figure 3: Mean verbalized confidence for unanimous vs. split scoring decisions, by small LM. A larger gap indicates greater sensitivity to human scoring difficulty. All three small LMs report lower confidence on split decisions (all p <.001), but the effect sizes vary widely. Claude Haiku shows the largest gap (d = 1.16): its confidence drops meaningfully when humans disagree, making it the most sensitive to scorin… view at source ↗

**Figure 2.** Figure 2: Distribution of verbalized confidence for accurate (green) vs. inaccurate (red) predictions, by small LM. Claude Haiku shows clear separation between the two distributions; Gemini Lite clusters near 1.0 regardless of accuracy. Claude Haiku produces the strongest confidence signal (AUROC 0.857, “excellent” discrimination per the Hosmer et al. scale), with clear separation visible in the histogram above. GP… view at source ↗

**Figure 5.** Figure 5: Cost-accuracy tradeoff for each model family. Cascade systems (circles) approach always-large accuracy (diamonds) at near-always-small cost (squares). Lines connect strategies within each family. The cascade positions each family between its always-small and always-large baselines. Claude comes closest to alwayslarge accuracy at near-always-small cost (7% escalation rate). GPT sits mid-range due to its … view at source ↗

**Figure 6.** Figure 6: Latency distributions for always-large vs. confidence cascade scoring, by model family. The cascade produces a bimodal distribution: a fast mode from small-LM-only decisions and a slow tail from escalated decisions. All three cascades reduce median latency compared to alwayslarge scoring, ranging from 61% (Claude) to 82% (Gemini). The spread reflects the speed gap between tiers: Gemini Pro is exceptional… view at source ↗

**Figure 7.** Figure 7: Reliability diagrams for each small LM. Bars show actual accuracy per confidence bin; the dashed diagonal represents perfect calibration. Bars above the line indicate underconfidence. 5.1.1 By ground-truth certainty Ground truth is less certain for split decisions (where annotators disagreed 2/1) than for unanimous ones. Calibration metrics computed against majority vote therefore embed measurement noise… view at source ↗

read the original abstract

Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819) at 76% lower cost and 61% lower latency. Confidence discrimination is the bottleneck: the two small LMs with meaningful confidence variance yielded cascades with no statistically detectable kappa loss, while the third -- whose confidence was near-degenerate -- could not close the accuracy gap regardless of threshold. Small LMs with strong discrimination let practitioners trade cost for accuracy along the frontier; those without it do not.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Verbalized confidence from small LMs routes scoring tasks effectively in cascades when discrimination is strong, delivering near-large-model accuracy at much lower cost on this dataset, though threshold selection details are missing from the abstract.

read the letter

The main thing to know is that this paper shows verbalized confidence can serve as a practical routing signal in cascade systems for automated scoring of student math responses. On 2,100 expert decisions, the best small-large pair reached kappa 0.802 versus the large model's 0.819 while cutting cost 76% and latency 61%. The key variable is how well each small model's confidence separates correct from incorrect cases, with AUROC ranging from 0.857 down to near-degenerate.

Referee Report

2 major / 2 minor

Summary. The paper evaluates verbalized numerical confidence from small language models as a routing signal in cascade scoring systems for educational assessment. On 2,100 expert-scored decisions from student-AI math conversations, it tests model pairs (GPT-5.4, Claude 4.5+, Gemini 3.1) and reports that the best small LM achieves AUROC 0.857 for confidence discrimination; cascades using strong discriminators reach kappa 0.802 (vs. 0.819 for the large LM alone) at 76% lower cost and 61% lower latency, with no statistically detectable accuracy loss, while weak discriminators cannot close the gap.

Significance. If the central empirical result holds under proper validation, the work demonstrates a practical route to cost- and latency-efficient automated scoring by exploiting small-LM confidence variance. The use of real expert annotations, concrete AUROC/kappa/cost metrics, and the observation that confidence tracks human scoring difficulty are strengths. However, the headline claim of retained accuracy at reduced cost is load-bearing on the threshold-selection procedure, which is not detailed in the provided abstract and risks optimistic bias if performed on the full evaluation set.

major comments (2)

[Evaluation / Results] Evaluation / Results section: the procedure for selecting the confidence threshold (or the number of thresholds tested) is not described. If the threshold that yields kappa 0.802 with no detectable loss was chosen by searching over the same 2,100 expert-scored decisions used for final reporting, rather than via nested cross-validation or a held-out validation set, the reported retention of accuracy is likely inflated by selection bias. The statistical test for 'no detectable loss' must also account for multiple comparisons or data reuse.
[Methods] Methods: the exact data splits, model prompting templates for eliciting verbalized confidence, and the definition of 'no statistically detectable kappa loss' (including the test statistic and power) are not specified. These details are required to assess whether the AUROC 0.857 and kappa values generalize beyond the particular 2,100 decisions.

minor comments (2)

[Abstract] Abstract: the phrase 'the two small LMs with meaningful confidence variance' should be replaced by the specific model names or identifiers for clarity.
[Results] The paper should report the exact number of candidate thresholds examined and whether any correction for multiple testing was applied when claiming 'no detectable loss'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of methodological transparency. We address each major comment below and have revised the manuscript to provide the requested details on threshold selection, data splits, prompting, and statistical definitions.

read point-by-point responses

Referee: [Evaluation / Results] Evaluation / Results section: the procedure for selecting the confidence threshold (or the number of thresholds tested) is not described. If the threshold that yields kappa 0.802 with no detectable loss was chosen by searching over the same 2,100 expert-scored decisions used for final reporting, rather than via nested cross-validation or a held-out validation set, the reported retention of accuracy is likely inflated by selection bias. The statistical test for 'no detectable loss' must also account for multiple comparisons or data reuse.

Authors: We agree the original description was insufficient and could raise concerns about selection bias. In the revised manuscript we now specify that threshold selection was performed via nested cross-validation: an outer 5-fold CV loop for final reporting, with an inner loop on each training partition used to select the threshold maximizing kappa subject to no significant loss versus the large model alone. We have also updated the statistical procedure to a paired bootstrap test with Bonferroni correction across the three candidate thresholds, confirming no detectable loss (adjusted p > 0.05). These changes eliminate the risk of optimistic bias from data reuse. revision: yes
Referee: [Methods] Methods: the exact data splits, model prompting templates for eliciting verbalized confidence, and the definition of 'no statistically detectable kappa loss' (including the test statistic and power) are not specified. These details are required to assess whether the AUROC 0.857 and kappa values generalize beyond the particular 2,100 decisions.

Authors: We have expanded the Methods section and added a new appendix. Data splits are now stated as a 70/30 train/test partition with 5-fold cross-validation performed only on the training portion for threshold tuning. Full prompting templates for verbalized confidence (including the exact instruction to output a numerical score from 0-100) are reproduced verbatim. The definition of no statistically detectable kappa loss is clarified as a McNemar test on paired predictions with a pre-specified power analysis (85% power to detect a kappa difference of 0.03 at alpha = 0.05). These additions allow direct assessment of generalizability. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation against external annotations

full rationale

The paper reports an empirical study that evaluates cascade routing performance by comparing small-LM verbalized confidence against 2,100 independently expert-scored decisions. No equations, fitted parameters, or self-citation chains are used to derive the headline kappa or cost figures; thresholds and AUROC values are computed directly from the held-out human labels. The analysis contains no self-definitional steps, no renaming of known results, and no load-bearing reliance on prior author work that would reduce the central claims to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that verbalized confidence is a usable routing signal plus one tunable threshold; no new entities are introduced and no free parameters beyond the escalation threshold are fitted in the reported results.

free parameters (1)

confidence threshold
Value chosen to decide escalation; directly affects cost-accuracy tradeoff and must be set per model pair.

axioms (1)

domain assumption Verbalized confidence from small LMs correlates with actual correctness and human scoring difficulty
Invoked when treating the stated number as a reliable routing signal; appears in the evaluation of discrimination and cascade performance.

pith-pipeline@v0.9.0 · 5552 in / 1306 out tokens · 48656 ms · 2026-05-14T21:09:31.779983+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 4 internal anchors

[1]

Api Pricing

OpenAI. Api Pricing. https://developers.openai.com/api/docs/pricing, 2026. URLhttps://developers.openai.com/api/docs/ pricing. Accessed 2026-03-22

work page 2026
[2]

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Con- fidence Elicitation in LLMs. InInternational Confer- ence on Learning Representations (ICLR), 2024. doi: 10.48550/arXiv.2306.13063. arXiv:2306.13063

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.13063 2024
[3]

Revisiting Uncertainty Esti- mation and Calibration of Large Language Models

Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, and Chang Xu. Revisiting Uncertainty Esti- mation and Calibration of Large Language Models. 2025. doi: 10.48550/arXiv.2505.23854. arXiv:2505.23854

work page doi:10.48550/arxiv.2505.23854 2025
[4]

Lamb, Jialin Yu, Philip H

Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A. Lamb, Jialin Yu, Philip H. S. Torr, and Chang Xu. Can Large Language Models Express Uncer- tainty Like Human? 2025. doi: 10.48550/arXiv.2509. 24202. arXiv:2509.24202

work page doi:10.48550/arxiv.2509 2025
[5]

Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design

Scott Frohn, Tyler Burleigh, and Jing Chen. Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design. In Preprint– DoSmallLanguageModelsKnowWhenThey’reWrong? Confidence-BasedCascadeScoring forEducational Assessment 11 Artificial Intelligence in Education, volume VI ofLecture Notes in Artificial Inte...

work page doi:10.1007/978-3-031-98465-5_ 2025
[7]

Williamson

Chaitanya Ramineni and David M. Williamson. Au- tomated Essay Scoring: Psychometric Guidelines and Practices.Assessing Writing, 18(1):25–39, 2013. doi: 10.1016/j.asw.2012.10.004

work page doi:10.1016/j.asw.2012.10.004 2013
[8]

Foltz, Lynn A

Peter W. Foltz, Lynn A. Streeter, Karen E. Lochbaum, and Thomas K. Landauer.Automated Scoring of Essays with the Intelligent Essay Assessor, pages 68–88. Routledge,

work page
[9]

doi: 10.4324/9780203122761

work page doi:10.4324/9780203122761
[10]

Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation

Hiroaki Funayama, Shota Sasaki, Yuichiroh Matsub- ayashi, Tomoya Mizumoto, Jun Suzuki, Masato Mita, and Kentaro Inui. Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation. InProceed- ings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 237–243. Association for C...

work page doi:10.18653/v1/2020.acl-srw.32 2020
[11]

Balancing Cost and Quality: An Exploration of Human- in-the-Loop Frameworks for Automated Short Answer Scoring

Hiroaki Funayama, Tasuku Sato, Yuichiroh Matsub- ayashi, Tomoya Mizumoto, Jun Suzuki, and Kentaro Inui. Balancing Cost and Quality: An Exploration of Human- in-the-Loop Frameworks for Automated Short Answer Scoring. InInternational Conference on Artificial Intelli- gence in Education, pages 465–476. Springer, 2022. doi: 10.48550/arXiv.2206.08288. arXiv:2206.08288

work page doi:10.48550/arxiv.2206.08288 2022
[12]

Self-regulated learning processes in secondary education: A network analysis of trace-based measures

Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, and Qi Fu. Human- AI Collaborative Essay Scoring: A Dual-Process Frame- work with LLMs. InProceedings of the 15th International Learning Analytics and Knowledge Conference, pages 293–305, 2025. doi: 10.1145/3706468.3706507

work page doi:10.1145/3706468.3706507 2025
[13]

A Survey of Con- fidence Estimation and Calibration in Large Language Models

Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A Survey of Con- fidence Estimation and Calibration in Large Language Models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computa- tional Linguistics (NAACL), pages 6577–6595, 2024. doi: 10.18653/v1/2024.naacl-long.366

work page doi:10.18653/v1/2024.naacl-long.366 2024
[14]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Lingjiao Chen, Matei Zaharia, and James Zou. Frugal- gpt: How to Use Large Language Models While Re- ducing Cost and Improving Performance. 2023. doi: 10.48550/arXiv.2305.05176. arXiv:2305.05176

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.05176 2023
[15]

Confident or Seek Stronger: Exploring Uncertainty-Based On-Device LLM Rout- ing

Yu-Neng Chuang, Leisheng Yu, Guanchu Wang, Lizhe Zhang, Zirui Liu, Xuanting Cai, Yang Sui, Vladimir Braverman, and Xia Hu. Confident or Seek Stronger: Exploring Uncertainty-Based On-Device LLM Rout- ing. 2025. doi: 10.48550/arXiv.2502.04428. arXiv:2502.04428

work page doi:10.48550/arxiv.2502.04428 2025
[16]

Pre-Pilot Optimization of Conversation-Based Assessment Items Using Synthetic Response Data

Tyler Burleigh, Jing Chen, and Kristen DiCerbo. Pre-Pilot Optimization of Conversation-Based Assessment Items Using Synthetic Response Data. InProceedings of the Ar- tificial Intelligence in Measurement and Education Con- ference (AIME-Con): Coordinated Session Papers, pages 61–68. National Council on Measurement in Education (NCME), 2025. ISBN 979-8-218-...

work page 2025
[17]

Innovating As- sessment with Conversational Agents: A Technology- Enhanced Approach to Formative Assessments

Seyma Yildirim-Erbasli and Okan Bulut. Innovating As- sessment with Conversational Agents: A Technology- Enhanced Approach to Formative Assessments. In2023 IEEE International Conference on Advanced Learning Technologies (ICALT), pages 331–335, 2023. doi: 10. 1109/ICALT58122.2023.00103

work page arXiv 2023
[18]

Joseph L. Fleiss. Measuring Nominal Scale Agreement Among Many Raters.Psychological Bulletin, 76(5):378– 382, 1971. doi: 10.1037/h0031619

work page doi:10.1037/h0031619 1971
[19]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data.Biometrics, 33(1):159–174, 1977. doi: 10.2307/2529310

work page doi:10.2307/2529310 1977
[20]

Lawrence Erlbaum Associates, 2nd edi- tion, 1988

Jacob Cohen.Statistical Power Analysis for the Behav- ioral Sciences. Lawrence Erlbaum Associates, 2nd edi- tion, 1988. ISBN 978-0-8058-0283-2

work page 1988
[21]

On Verbalized Confidence Scores for LLMs

Daniel Yang, Yao-Hung Hubert Tsai, and Makoto Ya- mada. On Verbalized Confidence Scores for LLMs. 2024. doi: 10.48550/arXiv.2412.14737. arXiv:2412.14737

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.14737 2024
[22]

Do Language Models Mirror Human Confidence? InFindings of the Associa- tion for Computational Linguistics: ACL 2025, 2025

Changye Xu, Bingbing Wen, Bohan Han, Robert Wolfe, Lucy Lu Wang, and Bill Howe. Do Language Models Mirror Human Confidence? InFindings of the Associa- tion for Computational Linguistics: ACL 2025, 2025. doi: 10.18653/v1/2025.findings-acl.1316. arXiv:2506.00582

work page doi:10.18653/v1/2025.findings-acl.1316 2025
[23]

Hosmer, Stanley Lemeshow, and Rodney X

David W. Hosmer, Stanley Lemeshow, and Rodney X. Sturdivant.Assessing the Fit of the Model, pages 153–

work page
[24]

ISBN 978-1-118-54838-

John Wiley & Sons, 2013. ISBN 978-1-118-54838-

work page 2013
[25]

doi: 10.1002/9781118548387.ch5

work page doi:10.1002/9781118548387.ch5
[26]

Wein- berger

Chuan Guo, GeoffPleiss, Yu Sun, and Kilian Q. Wein- berger. On Calibration of Modern Neural Networks. InProceedings of the 34th International Conference on Machine Learning (ICML), volume 70 ofPMLR, pages 1321–1330, 2017

work page 2017
[27]

A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

Jacob Cohen. A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement, 20 (1):37–46, 1960. doi: 10.1177/001316446002000104

work page doi:10.1177/001316446002000104 1960
[28]

Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrit- tum, Ankit Singh Rawat, Aditya Krishna Menon, and Sanjiv Kumar. Language Model Cascades: Token-Level Uncertainty and Beyond. 2024. doi: 10.48550/arXiv. 2404.10136. arXiv:2404.10136

work page internal anchor Pith review doi:10.48550/arxiv 2024
[29]

Zellinger and Matt Thomson

Michael J. Zellinger and Matt Thomson. Efficiently De- ploying LLMs with Controlled Risk. 2024. doi: 10. 48550/arXiv.2410.02173. arXiv:2410.02173

work page arXiv 2024
[30]

DeGroot and Stephen E

Morris H. DeGroot and Stephen E. Fienberg. The Com- parison and Evaluation of Forecasters.The Statistician, 32(1/2):12–22, 1983. doi: 10.2307/2987588

work page doi:10.2307/2987588 1983