pith. machine review for the scientific record.

arxiv: 2604.19789 · v1 · submitted 2026-04-01 · 💻 cs.AI · cond-mat.mtrl-sci

Recognition: no theorem link

From Data to Theory: Autonomous Large Language Model Agents for Materials Science

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 22:36 UTC · model grok-4.3

classification 💻 cs.AI cond-mat.mtrl-sci
keywords LLM agents · materials science · equation discovery · autonomous modeling · Hall-Petch relation · data-driven theory · AI-assisted discovery

The pith

An autonomous LLM agent selects equation forms, generates code, and validates materials theories directly from data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a large language model agent can autonomously build predictive theories in materials science by choosing an equation structure, writing and executing its own fitting code, and checking the match against input data. On standard relations such as the Hall-Petch equation relating grain size to strength and the Paris law for fatigue crack growth, the agent recovers the correct form and produces reliable predictions on unseen data. For more specialized cases like Kuhn's equation for the HOMO-LUMO gap in molecules, success depends on the base model used. The same system can also propose entirely new functional forms, for example a strain-dependent correction to the HOMO-LUMO gap, while still requiring human oversight because it sometimes returns incorrect or incomplete equations even when numerical errors appear small.
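The Hall-Petch case is, mechanically, a two-parameter nonlinear fit of yield strength against grain size. The agent's generated code is not reproduced in this review, but a minimal sketch of that kind of fit, using synthetic data in place of the paper's experimental dataset, might look like:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hall-Petch relation: yield strength grows with the inverse square root
# of grain size d. sigma0 (friction stress) and k (strengthening
# coefficient) are the parameters the agent fits.
def hall_petch(d, sigma0, k):
    return sigma0 + k / np.sqrt(d)

# Synthetic grain-size (um) vs. yield-strength (MPa) data for
# illustration only; the paper fits real experimental datasets.
d = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 100.0])
sigma = hall_petch(d, 70.0, 600.0) + np.random.default_rng(0).normal(0.0, 5.0, d.size)

(sigma0_fit, k_fit), _ = curve_fit(hall_petch, d, sigma, p0=(50.0, 100.0))
```

Validating the theory then amounts to evaluating `hall_petch` with the fitted parameters on held-out grain sizes, which is the step the review credits the agent with performing autonomously.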

Core claim

An LLM agent equipped with reasoning steps and code-execution tools can identify the governing equation for well-known materials relationships such as the Hall-Petch and Paris laws, fit parameters to data, and apply the resulting theory to new datasets, while also generating candidate new equations for properties like strain effects on molecular orbital gaps.

What carries the argument

Step-by-step reasoning agent that selects an equation form, calls external tools to generate and run fitting code, evaluates fit quality, and records its decisions for later inspection.
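As a hypothetical sketch (the callable and tool names here are illustrative assumptions, not the authors' actual interface), that Thought-Action-Observation loop could be structured as:

```python
# Illustrative Thought -> Action -> Observation loop. `llm` is any callable
# returning a dict like {"thought": ..., "action": ..., "input": ...};
# `tools` maps tool names to functions. Both are assumptions, not the
# paper's actual API.
def run_agent(llm, tools, task, max_iters=10):
    state = {"history": [], "result": None}
    for _ in range(max_iters):
        step = llm(task, state["history"])             # Thought + proposed Action
        if step["action"] not in tools:
            # Log the failure and continue rather than aborting the run,
            # matching the error handling the paper describes for Figure 2.
            state["history"].append(("error", "unknown tool: " + step["action"]))
            continue
        observation = tools[step["action"]](step["input"])   # Observation
        state["history"].append((step["thought"], step["action"], observation))
        if step["action"] == "finish":                 # agent declares the fit done
            state["result"] = observation
            break
    return state
```

The recorded `history` doubles as the inspectable decision trace the review mentions.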

If this is right

  • Established materials laws can be recovered and deployed automatically on fresh experimental or simulation data.
  • Novel functional relationships can be proposed for properties where no prior equation exists, such as strain-dependent shifts in electronic gaps.
  • Model choice affects reliability for specialized cases, with stronger base models recovering correct forms more often.
  • Numerical goodness-of-fit alone is insufficient to accept a returned equation, so explicit validation steps remain necessary.
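The last point is easy to demonstrate: a rival functional form with a free exponent can match noisy Hall-Petch-type data about as well as the true inverse-square-root form, so R² alone cannot arbitrate. A small synthetic illustration (not from the paper):

```python
import numpy as np
from scipy.optimize import curve_fit

def r2(y, y_hat):
    # Coefficient of determination.
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
d = np.linspace(2.0, 50.0, 20)
y = 70.0 + 600.0 / np.sqrt(d) + rng.normal(0.0, 5.0, d.size)  # true form: d**-0.5

hp = lambda d, a, k: a + k / np.sqrt(d)       # the true Hall-Petch form
pw = lambda d, a, k, n: a + k * d ** (-n)     # rival form with a free exponent

p_hp, _ = curve_fit(hp, d, y)
p_pw, _ = curve_fit(pw, d, y, p0=(70.0, 600.0, 0.5), maxfev=10000)

# Both fits come out with R^2 near 1, so the fit statistic cannot tell
# the correct exponent -1/2 apart from a freely fitted one.
r2_hp, r2_pw = r2(y, hp(d, *p_hp)), r2(y, pw(d, *p_pw))
```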

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent structure could be applied to other data-rich domains such as chemistry or condensed-matter physics where functional forms are sought from observations.
  • Pairing the agent with automated experimental platforms might close the loop between data collection and theory update.
  • Hybrid systems that let a human quickly correct or reject proposed equations would likely be more robust than fully autonomous operation in the near term.

Load-bearing premise

The base language model will reliably pick the correct functional form and produce accurate executable code even for relationships that are less common or more complex.

What would settle it

Run the agent on a dataset generated from a known but infrequently cited materials equation and check whether it recovers the true functional form, rather than an alternative that merely fits the numbers.
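A sketch of that settling experiment, using an Arrhenius-type law purely as an illustrative stand-in for the "infrequently cited" equation: fit both the true form and a numerically flexible alternative on one temperature range, then compare them where they must extrapolate.

```python
import numpy as np

rng = np.random.default_rng(2)
true = lambda T: 1e6 * np.exp(-4000.0 / T)        # hidden generator (illustrative)
T_train = np.linspace(300.0, 500.0, 15)
T_test = np.linspace(550.0, 800.0, 10)            # extrapolation region
y_train = true(T_train) * (1.0 + rng.normal(0.0, 0.02, T_train.size))

# Candidate 1: the true Arrhenius form, fitted in log space.
slope, intercept = np.polyfit(1.0 / T_train, np.log(y_train), 1)
arr_pred = lambda T: np.exp(intercept + slope / T)

# Candidate 2: a cubic polynomial that fits the training range numerically.
cubic = np.polyfit(T_train, y_train, 3)

rel_err = lambda pred: np.mean(np.abs(pred - true(T_test)) / true(T_test))
err_arr = rel_err(arr_pred(T_test))
err_cubic = rel_err(np.polyval(cubic, T_test))
# The true form extrapolates far better; a form that merely fits does not.
```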

Figures

Figures reproduced from arXiv: 2604.19789 by Samuel Onimpa Alfred, Veera Sundararaghavan.

Figure 1
Figure 1: Schematic overview of the three-component autonomous fitting framework. The Reasoning Engine, Tool Registry, and Agent State interact through a closed-loop, iterative workflow comprising sequential Thought, Action, and Observation steps at each iteration. view at source ↗
Figure 2
Figure 2: Structured response format. At each iteration, the LLM responds with a Thought, an Action specifying a tool and its input, and an Observation that captures the tool's output. This structured output is parsed, validated against the tool registry, and executed. If the LLM returns malformed JSON, the agent logs the error, records the failure, and continues to the next iteration, preventing temporary format errors from halting the run. view at source ↗
Figure 3
Figure 3: Sample tool entry for the generate_function tool. Each tool in the registry is defined by a description, an input schema, and an execution function. view at source ↗
Figure 4
Figure 4: Sample task-specific system prompt. The LLM is initialized with a prompt that defines its role, the target physical context, and the expected reasoning structure. view at source ↗
Figure 5
Figure 5: Logical substructures of the agent's state. The state is organized into modular fields that are updated by tool executions. view at source ↗
Figure 6
Figure 6: Agent-generated Hall-Petch fitting results. Experimental data points (blue circles) are shown together with the fitted Hall-Petch line (red line). Both executions succeeded without failure. view at source ↗
Figure 7
Figure 7: Agent-generated Paris law fitting results. Experimental data points (blue circles) are shown together with the fitted Paris line (red line). view at source ↗
Figure 8
Figure 8: DFT (LDA) HOMO-LUMO gaps for helicenes as a function of chain length. The solid blue curve represents the fit produced by the GPT-5-powered agent using Kuhn's equation. view at source ↗
Figure 9
Figure 9: Reasoning trace showing PDF extraction failure and subsequent successful equation extraction from HTML. view at source ↗
Figure 10
Figure 10: Prompt provided to the agent to model strain effects. The LLM is initialized with a prompt that defines its role, the physical context (modifying the Kuhn equation for strain effects), and the expected reasoning-action-observation workflow. view at source ↗
Figure 11
Figure 11: Representative output plot from multiple agent runs showing the strain-modified Kuhn equation predictions. view at source ↗
read the original abstract

We present an autonomous large language model (LLM) agent for end-to-end, data-driven materials theory development. The model can choose an equation form, generate and run its own code, and test how well the theory matches the data without human intervention. The framework combines step-by-step reasoning with expert-supplied tools, allowing the agent to adjust its approach as needed while keeping a clear record of its decisions. For well-established materials relationships such as the Hall-Petch equation and Paris law, the agent correctly identifies the governing equation and makes reliable predictions on new datasets. For more specialized relationships, such as Kuhn's equation for the HOMO-LUMO gap of conjugated molecules as a function of length, performance depends more strongly on the underlying model, with GPT-5 showing better recovery of the correct equation. Beyond known theories, the agent can also suggest new predictive relationships, illustrated here by a strain-dependent law for changes in the HOMO-LUMO gap. At the same time, the results show that careful validation remains essential, because the agent can still return incorrect, incomplete, or inconsistent equations even when the numerical fit appears strong. Overall, these results highlight both the promise and the current limitations of autonomous LLM agents for AI-assisted scientific modeling and discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an autonomous LLM agent framework for end-to-end materials theory development that selects equation forms, generates and executes code, and validates fits against data without human intervention. It reports successful recovery of established relations such as the Hall-Petch equation and Paris law with reliable predictions on new datasets, variable success on specialized cases like Kuhn's HOMO-LUMO gap equation (better with GPT-5), and the ability to propose novel relations such as a strain-dependent HOMO-LUMO gap law, while acknowledging that incorrect or inconsistent equations can still produce strong numerical fits.

Significance. If the reliability and autonomy claims can be substantiated with systematic metrics, the approach could meaningfully accelerate data-driven theory discovery in materials science by automating the hypothesis generation and testing cycle, reducing reliance on manual equation selection and code writing.

major comments (3)
  1. [Results on established relationships] Results on established relationships (Hall-Petch and Paris law cases): the manuscript asserts that the agent 'correctly identifies the governing equation' and makes reliable predictions but reports neither the number of trials, success fraction, nor distribution of iteration depths or failure modes, leaving the central claim of reliable autonomy without quantitative support.
  2. [Model dependence section] Model dependence section: performance on specialized relationships is shown to vary with the base LLM, yet no ablation or comparison across models is provided for the core Hall-Petch and Paris law demonstrations, which undercuts the generality of the 'autonomous' and 'without human intervention' claims.
  3. [Validation and limitations discussion] Validation and limitations discussion: the paper correctly notes that incorrect or inconsistent equations can yield strong numerical fits, but provides no frequency analysis, error taxonomy, or proposed automated safeguards, which directly affects the trustworthiness of the end-to-end pipeline.
minor comments (2)
  1. [Abstract] The abstract would benefit from explicit mention of the specific datasets or cross-validation protocols used to test predictions on new data.
  2. [Methods] Methods description should include more detail on the exact tool-calling interface and stopping criteria to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We have carefully considered each major comment and revised the manuscript to address the concerns regarding quantitative support, model generality, and validation safeguards. Below we provide point-by-point responses.

read point-by-point responses
  1. Referee: Results on established relationships (Hall-Petch and Paris law cases): the manuscript asserts that the agent 'correctly identifies the governing equation' and makes reliable predictions but reports neither the number of trials, success fraction, nor distribution of iteration depths or failure modes, leaving the central claim of reliable autonomy without quantitative support.

    Authors: We acknowledge that the original manuscript lacked detailed quantitative metrics on the reliability of the agent's performance for established relationships. To address this, we have conducted and reported additional analysis from 15 independent runs for each case. The revised manuscript now includes success fractions (Hall-Petch: 13/15 correct identification, Paris law: 12/15), average iteration depths (3.2 for Hall-Petch, 4.1 for Paris law), and a categorization of failure modes (primarily early termination due to code errors or suboptimal equation selection). These additions provide the necessary quantitative support for the autonomy claims. revision: yes

  2. Referee: Model dependence section: performance on specialized relationships is shown to vary with the base LLM, yet no ablation or comparison across models is provided for the core Hall-Petch and Paris law demonstrations, which undercuts the generality of the 'autonomous' and 'without human intervention' claims.

    Authors: We agree that extending the model comparison to the core demonstrations strengthens the generality argument. In the revised manuscript, we have added results from experiments using three different LLMs (GPT-4, GPT-5, and Llama-3) for the Hall-Petch and Paris law cases. The new data show that while performance is robust across models for these established relations (success rates ranging from 70-85%), the specialized cases remain more sensitive. This supports the framework's autonomy independent of the specific base model, with the 'without human intervention' aspect preserved as the agent operates identically. revision: yes

  3. Referee: Validation and limitations discussion: the paper correctly notes that incorrect or inconsistent equations can yield strong numerical fits, but provides no frequency analysis, error taxonomy, or proposed automated safeguards, which directly affects the trustworthiness of the end-to-end pipeline.

    Authors: This is a valid point that highlights an important limitation. We have expanded the limitations section with a frequency analysis drawn from our experimental logs, indicating that in about 25% of trials, the agent produced equations with high R² but physical inconsistencies. We introduce an error taxonomy classifying issues into functional form errors, parameterization mistakes, and validation failures. Additionally, we propose and implement automated safeguards such as symbolic differentiation for consistency checks and unit validation using the provided tools. These revisions enhance the discussion on trustworthiness. revision: yes
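One of the safeguards this simulated response names, a symbolic consistency check, could in principle look like the following sketch; the equation forms and symbols are placeholders, not the authors' implementation:

```python
import sympy as sp

# Placeholder forms: `kuhn` stands in for the unstrained gap law and
# `candidate` for an agent-proposed strain correction; neither is the
# paper's actual equation.
N, eps, A, B, C = sp.symbols("N epsilon A B C", positive=True)
kuhn = A + B / N
candidate = A + B / N + C * eps / N

# Safeguard: the proposed law must reduce to the unstrained law at zero
# strain, regardless of how well it fits numerically.
residual = sp.simplify(candidate.subs(eps, 0) - kuhn)
consistent = residual == 0
```

A fit with high R² but a nonzero residual here would be flagged as physically inconsistent rather than accepted.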

Circularity Check

0 steps flagged

No circularity: agent framework relies on external data and LLM reasoning without self-referential fitting

full rationale

The paper presents a methodological framework where an LLM agent selects equation forms, generates code, and validates fits against independent datasets for known relations such as Hall-Petch and Paris law. No load-bearing step reduces by construction to a fitted parameter or self-citation chain; the demonstrations use external benchmarks and tool execution rather than deriving results from internally defined inputs. The central claims rest on observable agent behavior on held-out data, not on renaming or smuggling ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that current LLMs possess sufficient step-by-step reasoning to select and implement scientific equations autonomously when supplied with expert tools.

axioms (1)
  • domain assumption LLMs can perform reliable step-by-step reasoning to select equation forms and generate executable code for materials relationships
    This underpins the agent's ability to operate without human intervention as stated in the abstract.
invented entities (1)
  • Autonomous LLM agent for materials theory development no independent evidence
    purpose: To choose equation forms, generate and run code, and validate theories against data without human intervention
    The framework itself is the novel construct introduced to achieve end-to-end automation.

pith-pipeline@v0.9.0 · 5527 in / 1300 out tokens · 42561 ms · 2026-05-13T22:36:35.432997+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 3 internal anchors

  1. [1]

    The deformation and ageing of mild steel: III discussion of results

    EO Hall. The deformation and ageing of mild steel: III discussion of results. Proceedings of the Physical Society. Section B, 64(9):747–753, 1951

  2. [2]

    On the reaction velocity of the inversion of cane sugar by acids

    Svante Arrhenius. On the reaction velocity of the inversion of cane sugar by acids. In Selected readings in chemical kinetics, pages 31–35. Elsevier, 1967

  3. [3]

    Perspective: composition–structure–property mapping in high-throughput experiments: turning data into knowledge

    Jason R Hattrick-Simpers, John M Gregoire, and A Gilad Kusne. Perspective: composition–structure–property mapping in high-throughput experiments: turning data into knowledge. APL Materials, 4(5):053211, 2016

  4. [4]

    Commentary: The materials project: A materials genome approach to accelerating materials innovation

    Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, et al. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL materials, 1(1), 2013

  5. [5]

    New frontiers for the materials genome initiative

    Juan J de Pablo, Nicholas E Jackson, Michael A Webb, Long-Qing Chen, Joel E Moore, Dane Morgan, Ryan Jacobs, Tresa Pollock, Darrell G Schlom, Eric S Toberer, et al. New frontiers for the materials genome initiative. npj Computational Materials, 5(1):41, 2019

  6. [6]

    A data-informed knowledge discovery framework to predict fatigue properties of additively manufactured ti–6al–4v, in718 and alsi10mg alloys using fatigue databases

    Samuel Onimpa Alfred and Mehdi Amiri. A data-informed knowledge discovery framework to predict fatigue properties of additively manufactured ti–6al–4v, in718 and alsi10mg alloys using fatigue databases. Progress in Additive Manufacturing, 10(9):6183–6210, 2025

  7. [7]

    Predicting thermodynamic stability of inorganic compounds using ensemble machine learning based on electron configuration

    Hao Zou, Haochen Zhao, Mingming Lu, Jiong Wang, Zeyu Deng, and Jianxin Wang. Predicting thermodynamic stability of inorganic compounds using ensemble machine learning based on electron configuration. Nature Communications, 16(1):203, 2025

  8. [8]

    Genetic programming as a means for programming computers by natural selection

    John R Koza. Genetic programming as a means for programming computers by natural selection. Statistics and computing, 4(2):87–112, 1994

  9. [9]

    Deep symbolic regression for recurrent sequences

    Stéphane d’Ascoli, Pierre-Alexandre Kamienny, Guillaume Lample, and François Charton. Deep symbolic regression for recurrent sequences. arXiv preprint arXiv:2201.04600, 2022

  10. [10]

    Symbolic regression via neural-guided genetic programming population seeding

    T Nathan Mundhenk, Mikel Landajuela, Ruben Glatt, Claudio P Santiago, Daniel M Faissol, and Brenden K Petersen. Symbolic regression via neural-guided genetic programming population seeding. arXiv preprint arXiv:2111.00053, 2021

  11. [11]

    Ai feynman: A physics-inspired method for symbolic regression

    Silviu-Marian Udrescu and Max Tegmark. Ai feynman: A physics-inspired method for symbolic regression. Science advances, 6(16):eaay2631, 2020

  12. [12]

    Matscibert: A materials domain language model for text mining and information extraction

    Tanishq Gupta, Mohd Zaki, NM Anoop Krishnan, and Mausam. Matscibert: A materials domain language model for text mining and information extraction. npj Computational Materials, 8(1):102, 2022

  13. [13]

    Opportunities and challenges of text mining in materials research

    Olga Kononova, Tanjin He, Haoyan Huo, Amalie Trewartha, Elsa A Olivetti, and Gerbrand Ceder. Opportunities and challenges of text mining in materials research. Iscience, 24(3), 2021

  14. [14]

    Structured information extraction from scientific text with large language models

    John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S Rosen, Gerbrand Ceder, Kristin A Persson, and Anubhav Jain. Structured information extraction from scientific text with large language models. Nature communications, 15(1):1418, 2024

  15. [15]

    Extracting accurate materials data from research papers with conversational language models and prompt engineering

    Maciej P Polak and Dane Morgan. Extracting accurate materials data from research papers with conversational language models and prompt engineering. Nature Communications, 15(1):1569, 2024

  16. [16]

    Data extraction from polymer literature using large language models

    Sonakshi Gupta, Akhlak Mahmood, Pranav Shetty, Aishat Adeboye, and Rampi Ramprasad. Data extraction from polymer literature using large language models. Communications materials, 5(1):269, 2024

  17. [17]

    Agent-based learning of materials datasets from the scientific literature

    Mehrad Ansari and Seyed Mohamad Moosavi. Agent-based learning of materials datasets from the scientific literature. Digital Discovery, 3(12):2607–2617, 2024

  18. [18]

    Llm-based ai agents for automated extraction of material properties and structural features

    Subham Ghosh and Abhishek Tewari. Llm-based ai agents for automated extraction of material properties and structural features. Computational Materials Science, 265:114521, 2026

  19. [19]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

  20. [20]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  21. [21]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  22. [22]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023

  23. [23]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis, 2023. URL https://arxiv.org/abs/2305.15334

  24. [24]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023

  25. [25]

    Mathematics and Machine Learning Toolbox, 2025

    The MathWorks Inc. Mathematics and Machine Learning Toolbox, 2025. Version 2025a

  26. [26]

    Large language models (llms) with matlab toolbox

    The MathWorks, Inc. Large language models (llms) with matlab toolbox. https://github.com/matlab-deep-learning/llms-with-matlab, 2026. Accessed: 2026-03-31

  27. [27]

    Influence of grain size on the compressive deformation of wrought mg–3al–1zn

    MR Barnett, Zohreh Keshavarz, AG Beer, and Dale Atwell. Influence of grain size on the compressive deformation of wrought mg–3al–1zn. Acta materialia, 52(17):5093–5103, 2004

  28. [28]

    Fatigue crack growth in l-pbf ti-6al-4v: Influence of notch orientation, stress ratio, and volumetric defects

    Mikyle Paul, Sajith Soman, Shuai Shao, and Nima Shamsaei. Fatigue crack growth in l-pbf ti-6al-4v: Influence of notch orientation, stress ratio, and volumetric defects. International Journal of Fatigue, 198:109027, 2025

  29. [29]

    Computational study of optical absorption spectra of helicenes as applied to strain sensing

    Veera Sundararaghavan, Vikas Varshney, and Davide Simone. Computational study of optical absorption spectra of helicenes as applied to strain sensing. arXiv preprint arXiv:2303.03490, 2023

  30. [30]

    A critical analysis of crack propagation laws

    Paul Paris and Fazil Erdogan. A critical analysis of crack propagation laws. Journal of Basic engineering, 85(4):528–533, 1963

  31. [31]

    A quantum-mechanical theory of light absorption of organic dyes and similar compounds

    Hans Kuhn. A quantum-mechanical theory of light absorption of organic dyes and similar compounds. The Journal of chemical physics, 17(12):1198–1212, 1949

  32. [32]

    Are llms ready for real-world materials discovery?

    Santiago Miret and Nandan M Krishnan. Are llms ready for real-world materials discovery? arXiv preprint arXiv:2402.05200, 2024

  33. [33]

    Llm-sr: Scientific equation discovery via programming with large language models

    Parshin Shojaee, Kazem Meidani, Shashank Gupta, Amir Barati Farimani, and Chandan K Reddy. Llm-sr: Scientific equation discovery via programming with large language models. arXiv preprint arXiv:2404.18400, 2024

  34. [34]

    Sr-llm: An incremental symbolic regression framework driven by llm-based retrieval-augmented generation

    Zelin Guo, Siqi Wang, Yonglin Tian, Jing Yang, Hui Yu, Xiaoxiang Na, Levente Kovács, Li Li, Petros A Ioannou, and Fei-Yue Wang. Sr-llm: An incremental symbolic regression framework driven by llm-based retrieval-augmented generation. Proceedings of the National Academy of Sciences, 122(52):e2516995122, 2025

  35. [35]

    Llm-feynman: leveraging large language models for universal scientific formula and theory discovery

    Zhilong Song, Qionghua Zhou, Chunjin Ren, Chongyi Ling, Minggang Ju, and Jinlan Wang. Llm-feynman: leveraging large language models for universal scientific formula and theory discovery. arXiv preprint arXiv:2503.06512, 2025

  36. [36]

    Self-evaluation improves selective generation in large language models

    Jie Ren, Yao Zhao, Tu Vu, Peter J Liu, and Balaji Lakshminarayanan. Self-evaluation improves selective generation in large language models. In Proceedings on, pages 49–64. PMLR, 2023