pith. machine review for the scientific record. sign in

arxiv: 2604.19781 · v1 · submitted 2026-03-29 · 💻 cs.CY · cs.AI· cs.CL

Recognition: unknown

Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:09 UTC · model grok-4.3

classification 💻 cs.CY cs.AIcs.CL
keywords automated scoringcascade systemsverbalized confidencesmall language modelseducational assessmentcost efficiencymodel routing
0
0 comments X

The pith

Small language models can route student scoring tasks to larger models using their verbalized numerical confidence, matching large-model accuracy at 76 percent lower cost and 61 percent lower latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether small language models can indicate when their scoring predictions are likely wrong by stating a numerical confidence score, and whether that signal can route easy cases to the small model and hard cases to a larger one in an automated educational assessment system. It tests this on 2,100 expert-scored decisions from student-AI math conversations using pairs of small and large models from three families. The central finding is that only small models whose confidence scores show real variation across items produce cascades whose accuracy is statistically indistinguishable from the large model alone, while models with near-constant confidence cannot close the accuracy gap. This matters because automated scoring at scale must balance accuracy against the high cost and latency of always using the largest available model.

Core claim

Verbalized confidence serves as an effective routing signal in cascade scoring systems when small language models produce sufficiently varied confidence values; the best such cascades reach kappa 0.802 versus 0.819 for the large model alone, at 76 percent lower cost and 61 percent lower latency. Confidence discrimination varies sharply across small models, with the strongest reaching AUROC 0.857 and the weakest producing a near-degenerate distribution. Lower confidence also aligns with items where human annotators disagreed or took longer to score.

What carries the argument

Verbalized numerical confidence as a routing signal that decides whether a small language model handles a scoring task or escalates it to a larger model.

If this is right

  • Small language models with strong confidence variance enable practitioners to move along a cost-accuracy frontier by adjusting the escalation threshold.
  • Small language models whose confidence is nearly constant cannot produce cascades that close the accuracy gap no matter what threshold is chosen.
  • Confidence values track human scoring difficulty, so lower-confidence items are also the ones that take annotators longer and produce more disagreement.
  • Cascades built from the strongest small models incur no statistically detectable kappa loss relative to always using the large model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing logic could be tested on other text-based judgment tasks such as content moderation or clinical note review where cost and latency constraints are similar.
  • Improving confidence calibration in small models would directly widen the set of tasks for which cheap cascades become viable.
  • Production systems could monitor the variance of confidence scores on incoming data as a quick diagnostic for whether a given small model remains useful for routing.

Load-bearing premise

The verbalized numerical confidence produced by small language models is a stable signal that reliably tracks actual correctness across different model families and student response data.

What would settle it

A new collection of student responses in which small-model confidence shows no correlation with actual scoring errors or with human annotator disagreement and scoring time.

Figures

Figures reproduced from arXiv: 2604.19781 by Tyler Burleigh.

Figure 1
Figure 1. Figure 1: The scoring task for item F-IF.B.6 (algebra). The an￾notator sees the math problem, rubric criterion with evaluation guidance, and the student-AI conversation, then judges whether the student’s responses satisfy the criterion. 2.3 Human uncertainty measures We define two complementary proxies for human scoring un￾certainty, which serve as validation targets for RQ2. Inter-rater disagreement. With 3 annotat… view at source ↗
Figure 3
Figure 3. Figure 3: Mean verbalized confidence for unanimous vs. split scoring decisions, by small LM. A larger gap indicates greater sensitivity to human scoring difficulty. All three small LMs report lower confidence on split decisions (all p <.001), but the effect sizes vary widely. Claude Haiku shows the largest gap (d = 1.16): its confidence drops mean￾ingfully when humans disagree, making it the most sensitive to scorin… view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of verbalized confidence for accurate (green) vs. inaccurate (red) predictions, by small LM. Claude Haiku shows clear separation between the two distributions; Gemini Lite clusters near 1.0 regardless of accuracy. Claude Haiku produces the strongest confidence signal (AU￾ROC 0.857, “excellent” discrimination per the Hosmer et al. scale), with clear separation visible in the histogram above. GP… view at source ↗
Figure 5
Figure 5. Figure 5: Cost-accuracy tradeoff for each model family. Cas￾cade systems (circles) approach always-large accuracy (dia￾monds) at near-always-small cost (squares). Lines connect strategies within each family. The cascade positions each family between its always-small and always-large baselines. Claude comes closest to always￾large accuracy at near-always-small cost (7% escalation rate). GPT sits mid-range due to its … view at source ↗
Figure 6
Figure 6. Figure 6: Latency distributions for always-large vs. confidence cascade scoring, by model family. The cascade produces a bi￾modal distribution: a fast mode from small-LM-only decisions and a slow tail from escalated decisions. All three cascades reduce median latency compared to always￾large scoring, ranging from 61% (Claude) to 82% (Gemini). The spread reflects the speed gap between tiers: Gemini Pro is exceptional… view at source ↗
Figure 7
Figure 7. Figure 7: Reliability diagrams for each small LM. Bars show actual accuracy per confidence bin; the dashed diagonal repre￾sents perfect calibration. Bars above the line indicate under￾confidence. 5.1.1 By ground-truth certainty Ground truth is less certain for split decisions (where annotators disagreed 2/1) than for unanimous ones. Calibration metrics computed against majority vote therefore embed measurement noise… view at source ↗
read the original abstract

Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819) at 76% lower cost and 61% lower latency. Confidence discrimination is the bottleneck: the two small LMs with meaningful confidence variance yielded cascades with no statistically detectable kappa loss, while the third -- whose confidence was near-degenerate -- could not close the accuracy gap regardless of threshold. Small LMs with strong discrimination let practitioners trade cost for accuracy along the frontier; those without it do not.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates verbalized numerical confidence from small language models as a routing signal in cascade scoring systems for educational assessment. On 2,100 expert-scored decisions from student-AI math conversations, it tests model pairs (GPT-5.4, Claude 4.5+, Gemini 3.1) and reports that the best small LM achieves AUROC 0.857 for confidence discrimination; cascades using strong discriminators reach kappa 0.802 (vs. 0.819 for the large LM alone) at 76% lower cost and 61% lower latency, with no statistically detectable accuracy loss, while weak discriminators cannot close the gap.

Significance. If the central empirical result holds under proper validation, the work demonstrates a practical route to cost- and latency-efficient automated scoring by exploiting small-LM confidence variance. The use of real expert annotations, concrete AUROC/kappa/cost metrics, and the observation that confidence tracks human scoring difficulty are strengths. However, the headline claim of retained accuracy at reduced cost is load-bearing on the threshold-selection procedure, which is not detailed in the provided abstract and risks optimistic bias if performed on the full evaluation set.

major comments (2)
  1. [Evaluation / Results] Evaluation / Results section: the procedure for selecting the confidence threshold (or the number of thresholds tested) is not described. If the threshold that yields kappa 0.802 with no detectable loss was chosen by searching over the same 2,100 expert-scored decisions used for final reporting, rather than via nested cross-validation or a held-out validation set, the reported retention of accuracy is likely inflated by selection bias. The statistical test for 'no detectable loss' must also account for multiple comparisons or data reuse.
  2. [Methods] Methods: the exact data splits, model prompting templates for eliciting verbalized confidence, and the definition of 'no statistically detectable kappa loss' (including the test statistic and power) are not specified. These details are required to assess whether the AUROC 0.857 and kappa values generalize beyond the particular 2,100 decisions.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'the two small LMs with meaningful confidence variance' should be replaced by the specific model names or identifiers for clarity.
  2. [Results] The paper should report the exact number of candidate thresholds examined and whether any correction for multiple testing was applied when claiming 'no detectable loss'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of methodological transparency. We address each major comment below and have revised the manuscript to provide the requested details on threshold selection, data splits, prompting, and statistical definitions.

read point-by-point responses
  1. Referee: [Evaluation / Results] Evaluation / Results section: the procedure for selecting the confidence threshold (or the number of thresholds tested) is not described. If the threshold that yields kappa 0.802 with no detectable loss was chosen by searching over the same 2,100 expert-scored decisions used for final reporting, rather than via nested cross-validation or a held-out validation set, the reported retention of accuracy is likely inflated by selection bias. The statistical test for 'no detectable loss' must also account for multiple comparisons or data reuse.

    Authors: We agree the original description was insufficient and could raise concerns about selection bias. In the revised manuscript we now specify that threshold selection was performed via nested cross-validation: an outer 5-fold CV loop for final reporting, with an inner loop on each training partition used to select the threshold maximizing kappa subject to no significant loss versus the large model alone. We have also updated the statistical procedure to a paired bootstrap test with Bonferroni correction across the three candidate thresholds, confirming no detectable loss (adjusted p > 0.05). These changes eliminate the risk of optimistic bias from data reuse. revision: yes

  2. Referee: [Methods] Methods: the exact data splits, model prompting templates for eliciting verbalized confidence, and the definition of 'no statistically detectable kappa loss' (including the test statistic and power) are not specified. These details are required to assess whether the AUROC 0.857 and kappa values generalize beyond the particular 2,100 decisions.

    Authors: We have expanded the Methods section and added a new appendix. Data splits are now stated as a 70/30 train/test partition with 5-fold cross-validation performed only on the training portion for threshold tuning. Full prompting templates for verbalized confidence (including the exact instruction to output a numerical score from 0-100) are reproduced verbatim. The definition of no statistically detectable kappa loss is clarified as a McNemar test on paired predictions with a pre-specified power analysis (85% power to detect a kappa difference of 0.03 at alpha = 0.05). These additions allow direct assessment of generalizability. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation against external annotations

full rationale

The paper reports an empirical study that evaluates cascade routing performance by comparing small-LM verbalized confidence against 2,100 independently expert-scored decisions. No equations, fitted parameters, or self-citation chains are used to derive the headline kappa or cost figures; thresholds and AUROC values are computed directly from the held-out human labels. The analysis contains no self-definitional steps, no renaming of known results, and no load-bearing reliance on prior author work that would reduce the central claims to the inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that verbalized confidence is a usable routing signal plus one tunable threshold; no new entities are introduced and no free parameters beyond the escalation threshold are fitted in the reported results.

free parameters (1)
  • confidence threshold
    Value chosen to decide escalation; directly affects cost-accuracy tradeoff and must be set per model pair.
axioms (1)
  • domain assumption Verbalized confidence from small LMs correlates with actual correctness and human scoring difficulty
    Invoked when treating the stated number as a reliable routing signal; appears in the evaluation of discrimination and cascade performance.

pith-pipeline@v0.9.0 · 5552 in / 1306 out tokens · 48656 ms · 2026-05-14T21:09:31.779983+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 4 internal anchors

  1. [1]

    Api Pricing

    OpenAI. Api Pricing. https://developers.openai.com/api/docs/pricing, 2026. URLhttps://developers.openai.com/api/docs/ pricing. Accessed 2026-03-22

  2. [2]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Con- fidence Elicitation in LLMs. InInternational Confer- ence on Learning Representations (ICLR), 2024. doi: 10.48550/arXiv.2306.13063. arXiv:2306.13063

  3. [3]

    Revisiting Uncertainty Esti- mation and Calibration of Large Language Models

    Linwei Tao, Yi-Fan Yeh, Minjing Dong, Tao Huang, Philip Torr, and Chang Xu. Revisiting Uncertainty Esti- mation and Calibration of Large Language Models. 2025. doi: 10.48550/arXiv.2505.23854. arXiv:2505.23854

  4. [4]

    Lamb, Jialin Yu, Philip H

    Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A. Lamb, Jialin Yu, Philip H. S. Torr, and Chang Xu. Can Large Language Models Express Uncer- tainty Like Human? 2025. doi: 10.48550/arXiv.2509. 24202. arXiv:2509.24202

  5. [5]

    Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design

    Scott Frohn, Tyler Burleigh, and Jing Chen. Automated Scoring of Short Answer Questions with Large Language Models: Impacts of Model, Item, and Rubric Design. In Preprint– DoSmallLanguageModelsKnowWhenThey’reWrong? Confidence-BasedCascadeScoring forEducational Assessment 11 Artificial Intelligence in Education, volume VI ofLecture Notes in Artificial Inte...

  6. [7]

    Williamson

    Chaitanya Ramineni and David M. Williamson. Au- tomated Essay Scoring: Psychometric Guidelines and Practices.Assessing Writing, 18(1):25–39, 2013. doi: 10.1016/j.asw.2012.10.004

  7. [8]

    Foltz, Lynn A

    Peter W. Foltz, Lynn A. Streeter, Karen E. Lochbaum, and Thomas K. Landauer.Automated Scoring of Essays with the Intelligent Essay Assessor, pages 68–88. Routledge,

  8. [9]

    doi: 10.4324/9780203122761

  9. [10]

    Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation

    Hiroaki Funayama, Shota Sasaki, Yuichiroh Matsub- ayashi, Tomoya Mizumoto, Jun Suzuki, Masato Mita, and Kentaro Inui. Preventing Critical Scoring Errors in Short Answer Scoring with Confidence Estimation. InProceed- ings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 237–243. Association for C...

  10. [11]

    Balancing Cost and Quality: An Exploration of Human- in-the-Loop Frameworks for Automated Short Answer Scoring

    Hiroaki Funayama, Tasuku Sato, Yuichiroh Matsub- ayashi, Tomoya Mizumoto, Jun Suzuki, and Kentaro Inui. Balancing Cost and Quality: An Exploration of Human- in-the-Loop Frameworks for Automated Short Answer Scoring. InInternational Conference on Artificial Intelli- gence in Education, pages 465–476. Springer, 2022. doi: 10.48550/arXiv.2206.08288. arXiv:2206.08288

  11. [12]

    Self-regulated learning processes in secondary education: A network analysis of trace-based measures

    Changrong Xiao, Wenxing Ma, Qingping Song, Sean Xin Xu, Kunpeng Zhang, Yufang Wang, and Qi Fu. Human- AI Collaborative Essay Scoring: A Dual-Process Frame- work with LLMs. InProceedings of the 15th International Learning Analytics and Knowledge Conference, pages 293–305, 2025. doi: 10.1145/3706468.3706507

  12. [13]

    A Survey of Con- fidence Estimation and Calibration in Large Language Models

    Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, and Iryna Gurevych. A Survey of Con- fidence Estimation and Calibration in Large Language Models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computa- tional Linguistics (NAACL), pages 6577–6595, 2024. doi: 10.18653/v1/2024.naacl-long.366

  13. [14]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugal- gpt: How to Use Large Language Models While Re- ducing Cost and Improving Performance. 2023. doi: 10.48550/arXiv.2305.05176. arXiv:2305.05176

  14. [15]

    Confident or Seek Stronger: Exploring Uncertainty-Based On-Device LLM Rout- ing

    Yu-Neng Chuang, Leisheng Yu, Guanchu Wang, Lizhe Zhang, Zirui Liu, Xuanting Cai, Yang Sui, Vladimir Braverman, and Xia Hu. Confident or Seek Stronger: Exploring Uncertainty-Based On-Device LLM Rout- ing. 2025. doi: 10.48550/arXiv.2502.04428. arXiv:2502.04428

  15. [16]

    Pre-Pilot Optimization of Conversation-Based Assessment Items Using Synthetic Response Data

    Tyler Burleigh, Jing Chen, and Kristen DiCerbo. Pre-Pilot Optimization of Conversation-Based Assessment Items Using Synthetic Response Data. InProceedings of the Ar- tificial Intelligence in Measurement and Education Con- ference (AIME-Con): Coordinated Session Papers, pages 61–68. National Council on Measurement in Education (NCME), 2025. ISBN 979-8-218-...

  16. [17]

    Innovating As- sessment with Conversational Agents: A Technology- Enhanced Approach to Formative Assessments

    Seyma Yildirim-Erbasli and Okan Bulut. Innovating As- sessment with Conversational Agents: A Technology- Enhanced Approach to Formative Assessments. In2023 IEEE International Conference on Advanced Learning Technologies (ICALT), pages 331–335, 2023. doi: 10. 1109/ICALT58122.2023.00103

  17. [18]

    Joseph L. Fleiss. Measuring Nominal Scale Agreement Among Many Raters.Psychological Bulletin, 76(5):378– 382, 1971. doi: 10.1037/h0031619

  18. [19]

    Richard Landis and Gary G

    J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data.Biometrics, 33(1):159–174, 1977. doi: 10.2307/2529310

  19. [20]

    Lawrence Erlbaum Associates, 2nd edi- tion, 1988

    Jacob Cohen.Statistical Power Analysis for the Behav- ioral Sciences. Lawrence Erlbaum Associates, 2nd edi- tion, 1988. ISBN 978-0-8058-0283-2

  20. [21]

    On Verbalized Confidence Scores for LLMs

    Daniel Yang, Yao-Hung Hubert Tsai, and Makoto Ya- mada. On Verbalized Confidence Scores for LLMs. 2024. doi: 10.48550/arXiv.2412.14737. arXiv:2412.14737

  21. [22]

    Do Language Models Mirror Human Confidence? InFindings of the Associa- tion for Computational Linguistics: ACL 2025, 2025

    Changye Xu, Bingbing Wen, Bohan Han, Robert Wolfe, Lucy Lu Wang, and Bill Howe. Do Language Models Mirror Human Confidence? InFindings of the Associa- tion for Computational Linguistics: ACL 2025, 2025. doi: 10.18653/v1/2025.findings-acl.1316. arXiv:2506.00582

  22. [23]

    Hosmer, Stanley Lemeshow, and Rodney X

    David W. Hosmer, Stanley Lemeshow, and Rodney X. Sturdivant.Assessing the Fit of the Model, pages 153–

  23. [24]

    ISBN 978-1-118-54838-

    John Wiley & Sons, 2013. ISBN 978-1-118-54838-

  24. [25]

    doi: 10.1002/9781118548387.ch5

  25. [26]

    Wein- berger

    Chuan Guo, GeoffPleiss, Yu Sun, and Kilian Q. Wein- berger. On Calibration of Modern Neural Networks. InProceedings of the 34th International Conference on Machine Learning (ICML), volume 70 ofPMLR, pages 1321–1330, 2017

  26. [27]

    A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960

    Jacob Cohen. A Coefficient of Agreement for Nominal Scales.Educational and Psychological Measurement, 20 (1):37–46, 1960. doi: 10.1177/001316446002000104

  27. [28]

    Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks

    Neha Gupta, Harikrishna Narasimhan, Wittawat Jitkrit- tum, Ankit Singh Rawat, Aditya Krishna Menon, and Sanjiv Kumar. Language Model Cascades: Token-Level Uncertainty and Beyond. 2024. doi: 10.48550/arXiv. 2404.10136. arXiv:2404.10136

  28. [29]

    Zellinger and Matt Thomson

    Michael J. Zellinger and Matt Thomson. Efficiently De- ploying LLMs with Controlled Risk. 2024. doi: 10. 48550/arXiv.2410.02173. arXiv:2410.02173

  29. [30]

    DeGroot and Stephen E

    Morris H. DeGroot and Stephen E. Fienberg. The Com- parison and Evaluation of Forecasters.The Statistician, 32(1/2):12–22, 1983. doi: 10.2307/2987588