arxiv: 2405.02079 · v3 · submitted 2024-05-03 · 💻 cs.CL · cs.AI

Argumentative Large Language Models for Explainable and Contestable Claim Verification

Pith reviewed 2026-05-24 01:14 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords ArgLLMs · argumentation frameworks · claim verification · explainable AI · contestable AI · large language models · formal reasoning

The pith

ArgLLMs augment LLMs by constructing argumentation frameworks to support explainable and contestable claim verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces argumentative large language models (ArgLLMs) that build argumentation frameworks from their encoded knowledge to handle claim verification. These frameworks then underpin formal reasoning steps that produce the final decision. Standard LLMs can draw on broad knowledge but typically cannot supply outputs that users can faithfully trace or correct when wrong. A sympathetic reader would care because the method seeks to add transparency and correctability to LLM-based decisions without discarding their zero-shot capabilities.

Core claim

ArgLLMs construct argumentation frameworks, which then serve as the basis for formal reasoning in support of decision-making. The interpretable nature of these argumentation frameworks and formal reasoning means that any decision made by ArgLLMs may be explained and contested. The approach is tested experimentally on claim verification and assessed against newly defined properties of contestability.

What carries the argument

Argumentation frameworks (networks of arguments and attacks between them) constructed by the LLM to enable formal reasoning and make decisions explainable and contestable.
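
As a minimal sketch of what such a framework could look like in code, the Python below models a claim with one supporting and one attacking argument, computes a final strength, and thresholds it at 0.5 (the cut-off quoted in the Figure 4 text further down this page). The DF-QuAD-style aggregation, the names `Argument`, `final_strength` and `verdict`, and the example strengths are all assumptions made for illustration; this page does not state which semantics the paper uses.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Argument:
    """Hypothetical node type: a statement with an intrinsic strength in [0, 1]."""
    text: str
    base: float
    supporters: list = field(default_factory=list)
    attackers: list = field(default_factory=list)


def aggregate(strengths):
    # Probabilistic sum of child strengths; yields 0.0 for an empty list.
    return 1.0 - prod(1.0 - s for s in strengths)


def final_strength(arg):
    # Assumed DF-QuAD-style update: move the base score towards 0 when
    # attackers dominate and towards 1 when supporters dominate.
    att = aggregate([final_strength(a) for a in arg.attackers])
    sup = aggregate([final_strength(s) for s in arg.supporters])
    if att >= sup:
        return arg.base - arg.base * (att - sup)
    return arg.base + (1.0 - arg.base) * (sup - att)


def verdict(claim, threshold=0.5):
    # The claim is classified as true only if its final strength exceeds the threshold.
    return final_strength(claim) > threshold


claim = Argument("The claim under verification", base=0.5)
claim.supporters.append(Argument("A supporting argument", base=0.75))
claim.attackers.append(Argument("An attacking argument", base=0.25))
print(final_strength(claim), verdict(claim))
```

Because every step is an explicit arithmetic update on named arguments, the verdict can be traced back to the individual supporters and attackers that produced it.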

If this is right

  • Claim verification decisions become traceable through the explicit arguments and attacks in the framework.
  • Users can contest a decision by pointing to specific flaws or missing attacks in the constructed framework.
  • ArgLLMs are evaluated experimentally against state-of-the-art claim-verification techniques.
  • Novel properties are defined to measure contestability and applied to assess the ArgLLM outputs.
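
The contestation bullet above can be illustrated with a small sketch: a user adds an attacking argument that the model did not produce, and the verdict is recomputed from the modified framework. The dictionary layout, the `strength` function, the DF-QuAD-style update and all numbers are hypothetical; only the 0.5 threshold comes from the Figure 4 text on this page.

```python
from math import prod


def strength(node):
    """node is a dict: {"base": float, "att": [nodes], "sup": [nodes]}."""

    def agg(children):
        # Probabilistic sum of the children's final strengths (0.0 if none).
        return 1.0 - prod(1.0 - strength(c) for c in children)

    att = agg(node.get("att", []))
    sup = agg(node.get("sup", []))
    base = node["base"]
    if att >= sup:
        return base - base * (att - sup)
    return base + (1.0 - base) * (sup - att)


# Framework as generated: one supporter, no attackers.
claim = {"base": 0.5, "sup": [{"base": 0.5}], "att": []}
before = strength(claim) > 0.5

# Contestation: the user supplies an attacker the framework was missing.
claim["att"].append({"base": 0.75})
after = strength(claim) > 0.5

print(before, after)
```

In this toy case the added attacker outweighs the existing supporter, so the claim's final strength drops below the threshold and the decision flips.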

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same construction of argumentation frameworks could be applied to other LLM decision tasks that require user oversight.
  • Requiring structured argumentation might constrain hallucinated claims by forcing consistency across supporting and attacking arguments.
  • Human-in-the-loop experiments could test whether the added contestability changes user trust or final accuracy.

Load-bearing premise

That LLMs can reliably construct argumentation frameworks that faithfully capture relevant knowledge and enable effective contestation without introducing hallucinations or structural biases.

What would settle it

Cases where the frameworks produced by ArgLLMs lead to claim-verification decisions that cannot be meaningfully explained or contested, or where the formal reasoning yields results inconsistent with the framework.

Figures

Figures reproduced from arXiv: 2405.02079 by Adam Dejl, Antonio Rago, Deniz Gorur, Francesca Toni, Gabriel Freedman, Xiang Yin.

Figure 1. Comparison of our approach (ArgLLM, here in combination with Mixtral) with existing alternatives. The example claim is adapted from TruthfulQA. (arXiv:2405.02079v3 [cs.CL], 18 Apr 2025)
Figure 2. Pipeline for ArgLLMs (in comparison with baselines, see §4 and §5 for the details).
Figure 3. Prompt used for Γ. {“supporting”/“attacking”} and {“support”/“attack”} are determined by θ. ArgLLMs and the baselines. We evaluate all prompts, and the combinations thereof, in a pilot experiment on two validation sets of 200 samples each, taken from TruthfulClaim and StrategyClaim, using Mistral and Mixtral models (selected as two substantially different open-source models). In this evaluation, we separat…
Figure 4. In order to assess accuracy for claim verification we use a similar threshold as for the Est. Confidence baseline: if the input claim’s final strength is greater than 0.5, it is classified as true, and otherwise as false. The four variations we consider arise from the hyperparameters θ and the choice of intrinsic strength for the input claim by E. For θ, we consider two options, Depth=1 and Depth=2: …
Figure 5. An example of contestation, in the ArgLLM with Mixtral, for a claim taken from StrategyClaim. Before contestation,
Figure 9. Prompt used for direct questioning on confidence
Figure 10. Prompts used for chain-of-thought baseline.
Figure 11. Prompt used for argument strength attribution for
Figure 12. ChatGPT Argument Generator prompt. 11.2 Role-player prompts: Role-player prompts followed a prompting strategy where the LLMs were expected to act like debater for the Argument
Figure 16. OPRO Argument Miner prompt
Figure 15. Role-Player: Analyst Uncertainty Estimator
Figure 18. An illustration of a user adding an additional sup
Original abstract

The profusion of knowledge encoded in large language models (LLMs) and their ability to apply this knowledge zero-shot in a range of settings makes them promising candidates for use in decision-making. However, they are currently limited by their inability to provide outputs which can be faithfully explained and effectively contested to correct mistakes. In this paper, we attempt to reconcile these strengths and weaknesses by introducing argumentative LLMs (ArgLLMs), a method for augmenting LLMs with argumentative reasoning. Concretely, ArgLLMs construct argumentation frameworks, which then serve as the basis for formal reasoning in support of decision-making. The interpretable nature of these argumentation frameworks and formal reasoning means that any decision made by ArgLLMs may be explained and contested. We evaluate ArgLLMs' performance experimentally in comparison with state-of-the-art techniques, in the context of the decision-making task of claim verification. We also define novel properties to characterise contestability and assess ArgLLMs formally in terms of these properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces argumentative LLMs (ArgLLMs) that augment standard LLMs by having them construct argumentation frameworks, which then support formal reasoning for decision-making tasks such as claim verification. The approach aims to improve explainability and contestability of LLM outputs. The manuscript includes an experimental evaluation comparing ArgLLMs to state-of-the-art methods on claim verification and defines novel formal properties to characterize contestability, with an assessment of ArgLLMs against those properties.

Significance. If the empirical results and formal assessment hold, the work offers a concrete mechanism for injecting interpretable argumentative structure into LLM pipelines, directly addressing the explainability and contestability limitations highlighted in the abstract. The combination of framework construction with formal reasoning properties provides a pathway for contestable decisions that is more structured than post-hoc explanation techniques.

major comments (2)
  1. [§4] §4 (experimental evaluation on claim verification): the reported comparisons with SOTA techniques should include explicit metrics for framework faithfulness (e.g., agreement with human-annotated argument structures) in addition to task accuracy, because the central claim that the frameworks enable reliable formal reasoning rests on this property being preserved by the LLM construction step.
  2. [§5] §5 (formal properties for contestability): the assessment that ArgLLMs satisfy the defined properties appears to assume that the constructed frameworks are always well-formed and complete; if the LLM occasionally produces inconsistent or incomplete frameworks, the formal guarantees would not transfer, and this edge case should be quantified in the evaluation.
minor comments (3)
  1. [Abstract] The abstract and introduction use 'ArgLLMs' without an initial definition or expansion on first use; add a parenthetical expansion at first mention.
  2. [Figure 1] Figure 1 (argumentation framework example) would benefit from a caption that explicitly links the nodes/attacks to the claim-verification task rather than a generic illustration.
  3. [Related Work] Related-work section omits recent papers on LLM-augmented argumentation (e.g., works on debate or structured reasoning with LLMs from 2023-2024); add 2-3 citations for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [§4] §4 (experimental evaluation on claim verification): the reported comparisons with SOTA techniques should include explicit metrics for framework faithfulness (e.g., agreement with human-annotated argument structures) in addition to task accuracy, because the central claim that the frameworks enable reliable formal reasoning rests on this property being preserved by the LLM construction step.

    Authors: We thank the referee for this observation. Our evaluation in §4 prioritizes end-to-end task accuracy on claim verification to enable direct comparison with SOTA methods. We agree that explicit faithfulness metrics would strengthen claims about the reliability of the constructed frameworks for formal reasoning. However, the current experiments do not include human-annotated argument structures, and collecting such annotations would constitute substantial additional work. We will revise the manuscript to discuss this as a limitation and propose it as future work, while noting that poor framework quality would be expected to degrade task performance. revision: partial

  2. Referee: [§5] §5 (formal properties for contestability): the assessment that ArgLLMs satisfy the defined properties appears to assume that the constructed frameworks are always well-formed and complete; if the LLM occasionally produces inconsistent or incomplete frameworks, the formal guarantees would not transfer, and this edge case should be quantified in the evaluation.

    Authors: We agree that this is an important consideration. The formal assessment in §5 applies to well-formed frameworks, and the manuscript does not currently quantify cases where the LLM produces inconsistent or incomplete outputs. We will add an empirical analysis in the revised evaluation section reporting the frequency of such edge cases across the datasets. This will clarify the practical scope of the contestability guarantees. revision: yes
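
One way the edge-case frequency mentioned in this response could be measured is to run a structural check over every generated framework and count the failures. The sketch below is hypothetical: the function name `framework_issues`, the edge-list input format, and the assumption that a well-formed framework is a tree rooted at the claim are illustrative choices, not details taken from the paper.

```python
def framework_issues(root, strengths, edges):
    """Return a list of structural problems found in one framework.

    strengths: dict mapping argument name -> intrinsic strength
    edges: list of (source, target, kind), kind in {"attack", "support"}
    """
    issues = []
    for name, s in strengths.items():
        if not 0.0 <= s <= 1.0:
            issues.append(f"strength out of range: {name}")

    parent = {}
    for src, dst, kind in edges:
        if kind not in ("attack", "support"):
            issues.append(f"unknown relation: {src}->{dst}")
        if src not in strengths or dst not in strengths:
            issues.append(f"edge uses undeclared argument: {src}->{dst}")
        if src in parent:
            issues.append(f"argument has more than one target: {src}")
        parent[src] = dst

    # Every argument should reach the claim by following its outgoing edge;
    # the `seen` set stops the walk if it runs into a cycle.
    for name in strengths:
        seen, node = set(), name
        while node != root and node in parent and node not in seen:
            seen.add(node)
            node = parent[node]
        if node != root:
            issues.append(f"not connected to the claim: {name}")
    return issues


ok = framework_issues(
    "claim",
    {"claim": 0.5, "a": 0.7, "b": 0.4},
    [("a", "claim", "support"), ("b", "a", "attack")],
)
bad = framework_issues(
    "claim",
    {"claim": 0.5, "a": 1.2, "b": 0.4},
    [("a", "claim", "support")],
)
print(ok, bad)
```

Counting the frameworks for which the returned list is non-empty would give one possible estimate of how often the formal guarantees do not apply in practice.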

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces ArgLLMs as an augmentation of LLMs with argumentation frameworks for claim verification, evaluates them experimentally against baselines, and defines novel contestability properties for formal assessment. No derivation reduces a claimed prediction or result to a fitted parameter, self-defined quantity, or self-citation chain; the central claims rest on empirical performance and externally grounded argumentation concepts rather than internal redefinitions or forced outputs. The approach is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that LLMs can generate useful argumentation frameworks from their encoded knowledge, with ArgLLMs as the primary new construct; no free parameters or additional invented entities are specified in the abstract.

axioms (1)
  • domain assumption: LLMs encode knowledge that can be applied zero-shot in a range of settings
    Explicitly stated in the abstract as the basis for using LLMs in decision-making.
invented entities (1)
  • ArgLLMs (no independent evidence)
    purpose: Augment LLMs with argumentative reasoning to enable explainable and contestable decisions
    New method introduced by the paper to address limitations of standard LLMs.

pith-pipeline@v0.9.0 · 5713 in / 1100 out tokens · 24644 ms · 2026-05-24T01:14:35.872000+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AI at the Front Lines of Platform Governance: Using LLMs to Support Illegal Content Reporting under the Digital Services Act

    cs.HC 2026-05 unverdicted novelty 5.0

    EvalAI providing pro/con arguments improves provision-level accuracy and reduces misclassification distance in DSA illegal content reporting under AI error conditions versus conventional XAI.

  2. Contestable AI needs Computational Argumentation

    cs.AI 2024-05 unverdicted novelty 3.0

    Contestable AI needs dynamic processes enabled by computational argumentation instead of static models.
