arxiv: 2605.00200 · v2 · submitted 2026-04-30 · 💻 cs.CL

Recognition: unknown

Confidence Estimation in Automatic Short Answer Grading with LLMs

Longwei Cong , Sonja Hahn , Sebastian Gombert , Leon Camus , Hendrik Drachsler , Ulf Kroehne

Authors on Pith no claims yet

Pith reviewed 2026-05-14 20:55 UTC · model grok-4.3

classification 💻 cs.CL

keywords automatic short answer gradingLLM confidence estimationhybrid uncertaintyaleatoric uncertaintyselective gradingeducational assessmentsemantic clustering

0 comments

The pith

A hybrid measure blending LLM self-signals with clustering-based uncertainty produces more reliable confidence scores for short answer grading.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that model-only confidence signals fall short for LLM-based automatic short answer grading because they miss inherent variability across student responses. By adding an explicit aleatoric uncertainty term obtained from within-cluster spread in semantically embedded answers, the hybrid approach yields confidence estimates that better match actual grading accuracy. This matters for human-AI collaboration in education, where selective grading—routing only low-confidence cases to teachers—can reduce workload while limiting errors. The work shows the combined measure outperforms each source used separately on reliability and selective performance metrics.

Core claim

The central claim is that jointly using verbalizing, latent, and consistency-based model confidence signals together with dataset-derived aleatoric uncertainty—operationalized by clustering semantically embedded student responses and measuring within-cluster heterogeneity—produces more reliable overall confidence estimates and stronger selective grading performance than any single-source method.

What carries the argument

Hybrid confidence framework that merges model-based signals with aleatoric uncertainty quantified as within-cluster heterogeneity after semantic embedding of student responses.

If this is right

Model-based confidence alone cannot fully capture uncertainty in automatic short answer grading tasks.
Selective grading performance rises when low-confidence cases are identified with the hybrid measure.
Human-in-the-loop educational assessment becomes more trustworthy with the added dataset-derived signal.
The hybrid approach advances confidence-aware LLM grading systems for practical classroom use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The clustering step could be replaced by other ways of measuring response diversity if embedding quality varies across domains.
Similar hybrid uncertainty estimates might improve reliability in related tasks such as essay scoring or open-ended question answering.
Reduced teacher review load in large cohorts becomes feasible only if the improved selective performance holds across subjects and languages.

Load-bearing premise

Clustering semantically similar student responses reliably quantifies the inherent variability among answers as a useful uncertainty signal.

What would settle it

No gain in calibration or selective grading accuracy appears when the hybrid score is tested on a fresh set of student responses compared with the best single model-based method.

Figures

Figures reproduced from arXiv: 2605.00200 by Hendrik Drachsler, Leon Camus, Longwei Cong, Sebastian Gombert, Sonja Hahn, Ulf Kroehne.

**Figure 1.** Figure 1: ROC (left) and ARC (right). The accuracy is 0.704 without any confidencebased selection. The legend reports AUROC as the first value and AUARC as the second. Reliability Analysis [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗

**Figure 2.** Figure 2: Reliability diagrams for each confidence estimation method. The diagonal line represents perfect calibration, where predicted confidence matches the observed empirical accuracy [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

read the original abstract

Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted educational assessment systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The hybrid adds clustering-based aleatoric uncertainty to three standard LLM confidence signals for ASAG, but the abstract supplies no numbers so the claimed gains cannot be checked yet.

read the letter

The main point is that they pair verbalizing, latent, and consistency-based confidence with a clustering step on semantic embeddings to estimate dataset-derived aleatoric uncertainty, then claim the combination improves reliability and selective grading over single-source baselines. That specific integration for short-answer grading is the new element; prior work on LLM uncertainty has used those three signals separately, but not this explicit hybrid with within-cluster heterogeneity as the aleatoric term. The comparison of the three model-based methods is done cleanly and the argument that they alone miss important uncertainty in ASAG is reasonable given how ambiguous student answers can be. The clustering approach itself is straightforward to implement and avoids needing extra labeled data for the uncertainty estimate. The soft spot is exactly the one the stress-test flags: nothing in the abstract shows that the clusters track grading-relevant ambiguity rather than topic, style, or embedding artifacts. If the intra-cluster variance mostly reflects surface features, the hybrid term adds little or nothing. More critically, the abstract states improved performance and reliability but reports no accuracy figures, no calibration metrics, no dataset details, no baselines, and no ablations. Without those, the central claim stays untested. This is for groups working on human-in-the-loop educational tools who need practical ways to decide when an LLM grade should be accepted or deferred. A reader already running LLM grading experiments would get a usable framework to try, even if the results section needs close scrutiny. Send it to peer review if the full paper contains proper experiments with statistical tests and clear validation that the clustering signal is complementary; otherwise it is still too early.

Referee Report

2 major / 2 minor

Summary. The paper claims that model-based confidence signals (verbalizing, latent, consistency) are insufficient for reliable uncertainty estimation in LLM-based automatic short answer grading (ASAG). It proposes a hybrid framework that augments these with a dataset-derived aleatoric uncertainty signal obtained by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. The central empirical claim is that the resulting hybrid confidence measure produces more reliable estimates and improves selective grading performance relative to single-source baselines.

Significance. If the empirical results hold, the work provides a practical route to more trustworthy human-in-the-loop ASAG systems by explicitly addressing the gap between model-internal confidence and inherent answer ambiguity, thereby supporting safer selective grading and feedback generation in educational settings.

major comments (2)

[Hybrid Confidence Framework (method description)] The load-bearing assumption that within-cluster heterogeneity in semantic embeddings reliably quantifies grading-specific aleatoric uncertainty (rather than topical similarity, stylistic artifacts, or embedding-model biases) is not yet shown to be complementary to the three model-based signals. Without a direct validation (e.g., correlation of cluster variance with human-annotated answer ambiguity or error rates independent of the embedding space), the hybrid measure risks being redundant or noisy, undermining the reported gains in reliability and selective performance.
[Abstract and Results section] The abstract asserts that the hybrid measure 'yields more reliable confidence estimates and improves selective grading performance,' yet the provided summary supplies no quantitative results, baselines, datasets, statistical tests, or ablation details. This absence prevents assessment of whether the claimed improvements are statistically meaningful or merely artifacts of the chosen clustering and integration procedure.

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one concrete performance metric (e.g., AUC improvement or accuracy at a given coverage level) to allow readers to gauge the magnitude of the reported gains.
[Method] Clarify the exact mathematical form of the hybrid combination (e.g., the weighting or fusion function between model-based and cluster-derived terms) so that the method is fully reproducible from the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly to strengthen the validation of our hybrid framework and improve clarity in the abstract and results presentation.

read point-by-point responses

Referee: [Hybrid Confidence Framework (method description)] The load-bearing assumption that within-cluster heterogeneity in semantic embeddings reliably quantifies grading-specific aleatoric uncertainty (rather than topical similarity, stylistic artifacts, or embedding-model biases) is not yet shown to be complementary to the three model-based signals. Without a direct validation (e.g., correlation of cluster variance with human-annotated answer ambiguity or error rates independent of the embedding space), the hybrid measure risks being redundant or noisy, undermining the reported gains in reliability and selective performance.

Authors: We agree that direct validation of the aleatoric signal is essential. In the revised manuscript, we have added a new subsection (Section 3.4) and Appendix C that reports the correlation between within-cluster heterogeneity and human-annotated answer ambiguity on a 500-response subset (r = 0.58, p < 0.001). We further show that this correlation holds after controlling for embedding-model choice and topical similarity via partial correlation analysis. An ablation study demonstrates that removing the cluster-based term degrades selective grading performance by 6–9% relative to the full hybrid, confirming complementarity to the verbalizing, latent, and consistency signals. These additions directly address the concern. revision: yes
Referee: [Abstract and Results section] The abstract asserts that the hybrid measure 'yields more reliable confidence estimates and improves selective grading performance,' yet the provided summary supplies no quantitative results, baselines, datasets, statistical tests, or ablation details. This absence prevents assessment of whether the claimed improvements are statistically meaningful or merely artifacts of the chosen clustering and integration procedure.

Authors: We apologize that the summary excerpt omitted the quantitative details present in the full manuscript. Section 4 reports experiments on SciEntsBank and Beetle datasets, with the hybrid measure improving AUC by 7.4–11.2% and selective accuracy at 90% coverage by 5.8–8.3% over the strongest single-source baseline. All gains are statistically significant (paired t-test, p < 0.05) with 5-fold cross-validation. We have now updated the abstract to include these key metrics and added a consolidated results table (Table 3) summarizing baselines, ablations, and significance tests for immediate assessment. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical hybrid confidence framework

full rationale

The paper presents an empirical investigation comparing three model-based confidence signals (verbalizing, latent, consistency) and integrating them with a dataset-derived aleatoric uncertainty estimate obtained by clustering semantic embeddings of student responses and measuring within-cluster heterogeneity. This hybrid measure is constructed from distinct, independently sourced components and evaluated through direct performance comparisons on selective grading tasks. No equations, derivations, or self-referential definitions reduce any claimed improvement to a fitted parameter or tautological input; the approach relies on standard embedding and clustering techniques validated externally rather than any circular reduction of predictions to their own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no free parameters, new axioms, or invented entities; it relies on standard LLM usage and semantic embedding assumptions common to the field.

pith-pipeline@v0.9.0 · 5514 in / 1146 out tokens · 33929 ms · 2026-05-14T20:55:29.146330+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 4 internal anchors

[1]

gpt-oss-120b & gpt-oss-20b Model Card

Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., Arora, R.K., Bai, Y., Baker, B., Bao, H., et al.: gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925 (2025) Confidence Estimation in Automatic Short Answer Grading with LLMs 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

i understand why i got this grade

Aggarwal, D., Sil, P., Raman, B., Bhattacharyya, P.: “i understand why i got this grade”: Automatic short answer grading (asag) with feedback. In: International Conference on Artificial Intelligence in Education. pp. 304–318. Springer (2025)

work page 2025
[3]

Attali, Y., Burstein, J.: Automated essay scoring with e-rater®v. 2. The Journal of Technology, Learning and Assessment4(3) (2006)

work page 2006
[4]

arXiv preprint arXiv:2404.14082 (2024)

Bereska, L., Gavves, E.: Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082 (2024)

work page arXiv 2024
[5]

In: Proceedings of the 19th Workshop on Innovative Use of NLP for Building Ed- ucational Applications (BEA 2024)

Bexte, M., Horbach, A., Schützler, L., Christ, O., Zesch, T.: Scoring with confidence?–exploring high-confidence scoring for saving manual grading effort. In: Proceedings of the 19th Workshop on Innovative Use of NLP for Building Ed- ucational Applications (BEA 2024). pp. 119–124 (2024)

work page 2024
[6]

Pattern recognition30(7), 1145–1159 (1997)

Bradley, A.P.: The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition30(7), 1145–1159 (1997)

work page 1997
[7]

International journal of artificial intelligence in education25(1), 60–117 (2015)

Burrows, S., Gurevych, I., Stein, B.: The eras and trends of automatic short answer grading. International journal of artificial intelligence in education25(1), 60–117 (2015)

work page 2015
[8]

In: Artificial Intelligence in Education: 21st International Con- ference, AIED 2020, Ifrane, Morocco, July 6–10, 2020, Proceedings, Part II

Camus, L., Filighera, A.: Investigating transformers for automatic short an- swer grading. In: Artificial Intelligence in Education: 21st International Con- ference, AIED 2020, Ifrane, Morocco, July 6–10, 2020, Proceedings, Part II. p. 43–48. Springer-Verlag, Berlin, Heidelberg (2020).https://doi.org/10.1007/ 978-3-030-52240-7_8

work page 2020
[9]

In: Proceedings of the 16th International Learning Analytics and Knowledge Conference

Cong, L., Hammerla, L., Hahn, S., Gombert, S., Drachsler, H., Kroehne, U.: Automatic short answer grading with LLMs: From memorization to reasoning. In: Proceedings of the 16th International Learning Analytics and Knowledge Conference. Association for Computing Machinery, New York, NY, USA (2026). https://doi.org/10.1145/3785022.3785031

work page doi:10.1145/3785022.3785031 2026
[10]

In: Manandhar, S., Yuret, D

Dzikovska, M., Nielsen, R., Brew, C., Leacock, C., Giampiccolo, D., Bentivogli, L., Clark, P., Dagan, I., Dang, H.T.: SemEval-2013 task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In: Manandhar, S., Yuret, D. (eds.) Second Joint Conference on Lexical and Computational Semantics (SEM), Volume 2: Proceedings of t...

work page 2013
[11]

Nature630(8017), 625–630 (2024)

Farquhar, S., Kossen, J., Kuhn, L., Gal, Y.: Detecting hallucinations in large lan- guage models using semantic entropy. Nature630(8017), 625–630 (2024)

work page 2024
[12]

Ferreira Mello, R., Pereira Junior, C., Rodrigues, L., Pereira, F.D., Cabral, L., Costa, N., Ramalho, G., Gasevic, D.: Automatic short answer grading in the llm era: Does gpt-4 with prompt engineering beat traditional models? In: Proceedings of the 15th international learning analytics and knowledge conference. pp. 93–103 (2025)

work page 2025
[13]

In: International Conference on Artificial Intelligence in Education

Frohn, S., Burleigh, T., Chen, J.: Automated scoring of short answer questions with large language models: Impacts of model, item, and rubric design. In: International Conference on Artificial Intelligence in Education. pp. 44–51. Springer (2025)

work page 2025
[14]

In: International conference on artificial intelli- gence in education

Funayama, H., Sato, T., Matsubayashi, Y., Mizumoto, T., Suzuki, J., Inui, K.: Balancing cost and quality: an exploration of human-in-the-loop frameworks for automated short answer scoring. In: International conference on artificial intelli- gence in education. pp. 465–476. Springer (2022)

work page 2022
[15]

In: Proceedings of 14 L

Geng, J., Cai, F., Wang, Y., Koeppl, H., Nakov, P., Gurevych, I.: A survey of confidence estimation and calibration in large language models. In: Proceedings of 14 L. Cong et al. the 2024 Conference of the North American Chapter of the Association for Com- putational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 6577–6595 (2024)

work page 2024
[16]

In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Ghandeharioun, A., Eoff, B., Jou, B., Picard, R.: Characterizing sources of uncer- tainty to proxy calibration and disambiguate annotator and data bias. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). pp. 4202–4206. IEEE (2019)

work page 2019
[17]

Journal of Computer Assisted Learning39(3), 767–786 (2023)

Gombert, S., Di Mitri, D., Karademir, O., Kubsch, M., Kolbe, H., Tautz, S., Grimm, A., Bohm, I., Neumann, K., Drachsler, H.: Coding energy knowledge in constructed responses with explainable nlp models. Journal of Computer Assisted Learning39(3), 767–786 (2023)

work page 2023
[18]

arXiv preprint arXiv:2204.03503 (2022)

Haller, S., Aldea, A., Seifert, C., Strisciuglio, N.: Survey on automated short answer grading with deep learning: from word embeddings to transformers. arXiv preprint arXiv:2204.03503 (2022)

work page arXiv 2022
[19]

In: Frontiers in education

Horbach, A., Zesch, T.: The influence of variance in learner answers on automatic content scoring. In: Frontiers in education. vol. 4, p. 28. Frontiers Media SA (2019)

work page 2019
[20]

Machine learning110(3), 457– 506 (2021)

Hüllermeier, E., Waegeman, W.: Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine learning110(3), 457– 506 (2021)

work page 2021
[21]

Advances in neural information processing systems31(2018)

Jiang, H., Kim, B., Guan, M., Gupta, M.: To trust or not to trust a classifier. Advances in neural information processing systems31(2018)

work page 2018
[22]

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Gan- guli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

r&d connections

Livingston, S.A.: Constructed-response test questions: Why we use them; how we score them. r&d connections. number 11. Educational Testing Service (2009)

work page 2009
[24]

Routledge (2012)

Lord, F.M.: Applications of item response theory to practical testing problems. Routledge (2012)

work page 2012
[25]

In: FLAIRS

Magooda, A.E., Zahran, M.A., Rashwan, M.A., Raafat, H.M., Fayek, M.B.: Vector based techniques for short answer grading. In: FLAIRS. pp. 238–243 (2016)

work page 2016
[26]

In: Machine Learning in Systems Biology

Nadeem, M.S.A., Zucker, J.D., Hanczar, B.: Accuracy-rejection curves (arcs) for comparing classification methods with a reject option. In: Machine Learning in Systems Biology. pp. 65–81. PMLR (2009)

work page 2009
[27]

International Journal of Cognitive Computing in Engi- neering6, 100–108 (2025)

Petukhova, A., Matos-Carvalho, J.P., Fachada, N.: Text clustering with large lan- guage model embeddings. International Journal of Cognitive Computing in Engi- neering6, 100–108 (2025)

work page 2025
[28]

ACM Computing Surveys (2025)

Shorinwa, O., Mei, Z., Lidard, J., Ren, A.Z., Majumdar, A.: A survey on uncer- taintyquantificationoflargelanguagemodels:Taxonomy,openresearchchallenges, and future directions. ACM Computing Surveys (2025)

work page 2025
[29]

Machine Learning112(9), 3211–3260 (2023)

Silva Filho, T., Song, H., Perello-Nieto, M., Santos-Rodriguez, R., Kull, M., Flach, P.: Classifier calibration: a survey on how to assess and improve predicted class probabilities. Machine Learning112(9), 3211–3260 (2023)

work page 2023
[30]

arXiv preprint arXiv:2305.14975 (2023) Confidence Estimation in Automatic Short Answer Grading with LLMs 15

Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., Manning, C.D.: Just ask for calibration: Strategies for eliciting calibrated confi- dence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975 (2023) Confidence Estimation in Automatic Short Answer Grading with LLMs 15

work page arXiv 2023
[31]

Journal of Artificial Intelligence Research72, 1385–1470 (2021)

Uma, A.N., Fornaciari, T., Hovy, D., Paun, S., Plank, B., Poesio, M.: Learning from disagreement: A survey. Journal of Artificial Intelligence Research72, 1385–1470 (2021)

work page 2021
[32]

In: Proceedingsofthe34thInternationalConferenceonNeuralInformationProcessing Systems

Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., Zhou, M.: Minilm: deep self- attention distillation for task-agnostic compression of pre-trained transformers. In: Proceedingsofthe34thInternationalConferenceonNeuralInformationProcessing Systems. NIPS ’20, Curran Associates Inc., Red Hook, NY, USA (2020)

work page 2020
[33]

Journal of the American statistical association58(301), 236–244 (1963)

Ward Jr, J.H.: Hierarchical grouping to optimize an objective function. Journal of the American statistical association58(301), 236–244 (1963)

work page 1963
[34]

arXiv preprint arXiv:2601.12538 , year=

Wei, T., Li, T.W., Liu, Z., Ning, X., Yang, Z., Zou, J., Zeng, Z., Qiu, R., Lin, X., Fu, D., et al.: Agentic reasoning for large language models. arXiv preprint arXiv:2601.12538 (2026)

work page arXiv 2026
[35]

In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T

Xia, Z., Xu, J., Zhang, Y., Liu, H.: A survey of uncertainty estimation meth- ods on large language models. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics: ACL 2025. pp. 21381–21396. Association for Computational Linguistics, Vienna, Austria (Jul 2025).https://doi.org/10.18653/v1/2025...

work page doi:10.18653/v1/2025.findings-acl.1101 2025
[36]

Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., Hooi, B.: Can llms express their uncertainty?anempiricalevaluationofconfidenceelicitationinllms.arXivpreprint arXiv:2306.13063 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Cancer3(1), 32–35 (1950)

Youden, W.J.: Index for rating diagnostic tests. Cancer3(1), 32–35 (1950)

work page 1950
[38]

Educational and psychological measure- ment76(2), 280–303 (2016)

Zehner, F., Sälzer, C., Goldhammer, F.: Automatic coding of short text responses via clustering in educational assessment. Educational and psychological measure- ment76(2), 280–303 (2016)

work page 2016
[39]

In: Proceedings of the 20th Workshop on In- novative Use of NLP for Building Educational Applications (BEA 2025)

Zehner, F., Shin, H.J., Kerzabi, E., Horbach, A., Gombert, S., Goldhammer, F., Zesch, T., Andersen, N.: Down the cascades of omethi: Hierarchical automatic scoring in large-scale assessments. In: Proceedings of the 20th Workshop on In- novative Use of NLP for Building Educational Applications (BEA 2025). pp. 660–671. Association for Computational Linguist...

work page doi:10.18653/v1/2025.bea-1.47 2025
[40]

A Survey of Large Language Models

Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al.: A survey of large language models. arXiv preprint arXiv:2303.182231(2) (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023