Argumentative Large Language Models for Explainable and Contestable Claim Verification

Adam Dejl; Antonio Rago; Deniz Gorur; Francesca Toni; Gabriel Freedman; Xiang Yin

arxiv: 2405.02079 · v3 · submitted 2024-05-03 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-24 01:14 UTC · model grok-4.3

keywords: ArgLLMs · argumentation frameworks · claim verification · explainable AI · contestable AI · large language models · formal reasoning

The pith

ArgLLMs augment LLMs by constructing argumentation frameworks to support explainable and contestable claim verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces argumentative large language models (ArgLLMs) that build argumentation frameworks from their encoded knowledge to handle claim verification. These frameworks then underpin formal reasoning steps that produce the final decision. Standard LLMs can draw on broad knowledge but typically cannot supply outputs that users can faithfully trace or correct when wrong. A sympathetic reader would care because the method seeks to add transparency and correctability to LLM-based decisions without discarding their zero-shot capabilities.

Core claim

ArgLLMs construct argumentation frameworks, which then serve as the basis for formal reasoning in support of decision-making. The interpretable nature of these argumentation frameworks and formal reasoning means that any decision made by ArgLLMs may be explained and contested. The approach is tested experimentally on claim verification and assessed against newly defined properties of contestability.

What carries the argument

Argumentation frameworks (networks of arguments and attacks between them) constructed by the LLM to enable formal reasoning and make decisions explainable and contestable.
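
As a rough illustration of what such a framework might look like in code, the sketch below represents a small tree of arguments with support and attack relations, gives each argument an intrinsic strength, and computes a final strength bottom-up. The DF-QuAD-style aggregation is an assumption chosen for illustration, since this summary does not state which semantics the paper uses. The names `Argument`, `aggregate`, `final_strength`, and `verify`, along with the example arguments and numbers, are hypothetical. The 0.5 decision threshold follows the Figure 4 caption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    text: str
    base: float  # intrinsic strength in [0, 1]
    supporters: List["Argument"] = field(default_factory=list)
    attackers: List["Argument"] = field(default_factory=list)


def aggregate(strengths):
    """Combine child strengths as 1 - prod(1 - s); an empty list gives 0."""
    remaining = 1.0
    for s in strengths:
        remaining *= 1.0 - s
    return 1.0 - remaining


def final_strength(arg):
    """Assumed DF-QuAD-style semantics over a tree-shaped framework."""
    att = aggregate([final_strength(a) for a in arg.attackers])
    sup = aggregate([final_strength(s) for s in arg.supporters])
    if att >= sup:
        # Attackers dominate: move the base strength towards 0.
        return arg.base - arg.base * (att - sup)
    # Supporters dominate: move the base strength towards 1.
    return arg.base + (1.0 - arg.base) * (sup - att)


def verify(claim, threshold=0.5):
    """Classify the claim as true when its final strength exceeds the threshold."""
    return final_strength(claim) > threshold


# Hypothetical example: one supporter and one attacker of the input claim.
claim = Argument(
    text="Example claim",
    base=0.5,
    supporters=[Argument("Supporting argument", base=0.8)],
    attackers=[Argument("Attacking argument", base=0.3)],
)
strength = final_strength(claim)  # 0.5 + (1 - 0.5) * (0.8 - 0.3) = 0.75
decision = verify(claim)
```

Because the final strength is a deterministic function of the framework, every step from the individual arguments to the decision can be inspected, which is the property the explainability claim relies on.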

If this is right

Claim verification decisions become traceable through the explicit arguments and attacks in the framework.
Users can contest a decision by pointing to specific flaws or missing attacks in the constructed framework.
ArgLLMs are evaluated experimentally against state-of-the-art claim-verification techniques.
Novel properties are defined to measure contestability and applied to assess the ArgLLM outputs.
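
The contestation step in the list above can be sketched as follows: a user adds an attacker that the framework was missing, and the decision is recomputed from the modified framework. This is a minimal sketch under stated assumptions, not the paper's implementation. The dictionary layout, the function names `strength` and `contest_with_attacker`, the example arguments, and the DF-QuAD-style semantics are all hypothetical choices made for illustration.

```python
def strength(framework, node):
    """Final strength of `node` in a tree-shaped framework (assumed DF-QuAD-style)."""
    def agg(values):
        remaining = 1.0
        for v in values:
            remaining *= 1.0 - v
        return 1.0 - remaining

    entry = framework[node]
    att = agg([strength(framework, a) for a in entry["attackers"]])
    sup = agg([strength(framework, s) for s in entry["supporters"]])
    base = entry["base"]
    if att >= sup:
        return base - base * (att - sup)
    return base + (1.0 - base) * (sup - att)


def contest_with_attacker(framework, target, name, base):
    """Return a copy of the framework with a user-supplied attacker on `target`."""
    updated = {
        key: {
            "base": value["base"],
            "attackers": list(value["attackers"]),
            "supporters": list(value["supporters"]),
        }
        for key, value in framework.items()
    }
    updated[name] = {"base": base, "attackers": [], "supporters": []}
    updated[target]["attackers"].append(name)
    return updated


# Hypothetical framework: the claim has a single supporter.
framework = {
    "claim": {"base": 0.5, "attackers": [], "supporters": ["s1"]},
    "s1": {"base": 0.6, "attackers": [], "supporters": []},
}
before = strength(framework, "claim")  # 0.5 + 0.5 * 0.6 = 0.8, above 0.5

# The user contests the decision by adding a strong attacker on the claim.
contested = contest_with_attacker(framework, "claim", "user_attack", 0.9)
after = strength(contested, "claim")   # 0.5 - 0.5 * (0.9 - 0.6) = 0.35, below 0.5
```

Returning a copy rather than mutating the framework keeps the pre-contestation state available, so the two decisions can be compared side by side.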

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same construction of argumentation frameworks could be applied to other LLM decision tasks that require user oversight.
Requiring structured argumentation might constrain hallucinated claims by forcing consistency across supporting and attacking arguments.
Human-in-the-loop experiments could test whether the added contestability changes user trust or final accuracy.

Load-bearing premise

That LLMs can reliably construct argumentation frameworks that faithfully capture relevant knowledge and enable effective contestation without introducing hallucinations or structural biases.

What would settle it

Cases where the frameworks produced by ArgLLMs lead to claim-verification decisions that cannot be meaningfully explained or contested, or where the formal reasoning yields results inconsistent with the framework.

Figures

Figures reproduced from arXiv: 2405.02079 (v3, 18 Apr 2025) by Adam Dejl, Antonio Rago, Deniz Gorur, Francesca Toni, Gabriel Freedman, Xiang Yin.

**Figure 1.** Comparison of our approach (ArgLLM, here in combination with Mixtral) with existing alternatives. The example claim is adapted from TruthfulQA.

**Figure 2.** Pipeline for ArgLLMs (in comparison with baselines, see §4 and §5 for the details).

**Figure 3.** Prompt used for Γ. {“supporting”/“attacking”} and {“support”/“attack”} are determined by θ. ArgLLMs and the baselines. We evaluate all prompts, and the combinations thereof, in a pilot experiment on two validation sets of 200 samples each, taken from TruthfulClaim and StrategyClaim, using Mistral and Mixtral models (selected as two substantially different open-source models). In this evaluation, we separat…

**Figure 4.** In order to assess accuracy for claim verification we use a similar threshold as for the Est. Confidence baseline: if the input claim’s final strength is greater than 0.5 it is classified as true, and otherwise as false. The four variations we consider arise from the hyperparameters θ and the choice of intrinsic strength for the input claim by E. For θ, we consider two options, Depth=1 and Depth=2: …

**Figure 5.** An example of contestation, in the ArgLLM with Mixtral, for a claim taken from StrategyClaim. Before contestation,

**Figure 9.** Prompt used for direct questioning on confidence

**Figure 10.** Prompts used for chain-of-thought baseline.

**Figure 11.** Prompt used for argument strength attribution for

**Figure 12.** ChatGPT Argument Generator prompt. 11.2 Role-player prompts: Role-player prompts followed a prompting strategy where the LLMs were expected to act like debater for the Argument

**Figure 15.** Role-Player: Analyst Uncertainty Estimator

**Figure 16.** OPRO Argument Miner prompt

**Figure 18.** An illustration of a user adding an additional sup

Original abstract

The profusion of knowledge encoded in large language models (LLMs) and their ability to apply this knowledge zero-shot in a range of settings makes them promising candidates for use in decision-making. However, they are currently limited by their inability to provide outputs which can be faithfully explained and effectively contested to correct mistakes. In this paper, we attempt to reconcile these strengths and weaknesses by introducing argumentative LLMs (ArgLLMs), a method for augmenting LLMs with argumentative reasoning. Concretely, ArgLLMs construct argumentation frameworks, which then serve as the basis for formal reasoning in support of decision-making. The interpretable nature of these argumentation frameworks and formal reasoning means that any decision made by ArgLLMs may be explained and contested. We evaluate ArgLLMs' performance experimentally in comparison with state-of-the-art techniques, in the context of the decision-making task of claim verification. We also define novel properties to characterise contestability and assess ArgLLMs formally in terms of these properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces argumentative LLMs (ArgLLMs) that augment standard LLMs by having them construct argumentation frameworks, which then support formal reasoning for decision-making tasks such as claim verification. The approach aims to improve explainability and contestability of LLM outputs. The manuscript includes an experimental evaluation comparing ArgLLMs to state-of-the-art methods on claim verification and defines novel formal properties to characterize contestability, with an assessment of ArgLLMs against those properties.

Significance. If the empirical results and formal assessment hold, the work offers a concrete mechanism for injecting interpretable argumentative structure into LLM pipelines, directly addressing the explainability and contestability limitations highlighted in the abstract. The combination of framework construction with formal reasoning properties provides a pathway for contestable decisions that is more structured than post-hoc explanation techniques.

major comments (2)

[§4] §4 (experimental evaluation on claim verification): the reported comparisons with SOTA techniques should include explicit metrics for framework faithfulness (e.g., agreement with human-annotated argument structures) in addition to task accuracy, because the central claim that the frameworks enable reliable formal reasoning rests on this property being preserved by the LLM construction step.
[§5] §5 (formal properties for contestability): the assessment that ArgLLMs satisfy the defined properties appears to assume that the constructed frameworks are always well-formed and complete; if the LLM occasionally produces inconsistent or incomplete frameworks, the formal guarantees would not transfer, and this edge case should be quantified in the evaluation.

minor comments (3)

[Abstract] The abstract and introduction use 'ArgLLMs' without an initial definition or expansion on first use; add a parenthetical expansion at first mention.
[Figure 1] Figure 1 (argumentation framework example) would benefit from a caption that explicitly links the nodes/attacks to the claim-verification task rather than a generic illustration.
[Related Work] Related-work section omits recent papers on LLM-augmented argumentation (e.g., works on debate or structured reasoning with LLMs from 2023-2024); add 2-3 citations for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [§4] §4 (experimental evaluation on claim verification): the reported comparisons with SOTA techniques should include explicit metrics for framework faithfulness (e.g., agreement with human-annotated argument structures) in addition to task accuracy, because the central claim that the frameworks enable reliable formal reasoning rests on this property being preserved by the LLM construction step.

Authors: We thank the referee for this observation. Our evaluation in §4 prioritizes end-to-end task accuracy on claim verification to enable direct comparison with SOTA methods. We agree that explicit faithfulness metrics would strengthen claims about the reliability of the constructed frameworks for formal reasoning. However, the current experiments do not include human-annotated argument structures, and collecting such annotations would constitute substantial additional work. We will revise the manuscript to discuss this as a limitation and propose it as future work, while noting that poor framework quality would be expected to degrade task performance.
revision: partial

Referee: [§5] §5 (formal properties for contestability): the assessment that ArgLLMs satisfy the defined properties appears to assume that the constructed frameworks are always well-formed and complete; if the LLM occasionally produces inconsistent or incomplete frameworks, the formal guarantees would not transfer, and this edge case should be quantified in the evaluation.

Authors: We agree that this is an important consideration. The formal assessment in §5 applies to well-formed frameworks, and the manuscript does not currently quantify cases where the LLM produces inconsistent or incomplete outputs. We will add an empirical analysis in the revised evaluation section reporting the frequency of such edge cases across the datasets. This will clarify the practical scope of the contestability guarantees.
revision: yes
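
One way the edge-case frequency discussed in the second response could be measured is to run a structural check over each generated framework and count the failures. The sketch below is an assumption-laden illustration: what counts as "well-formed" here (strengths in [0, 1], no dangling references, no cycles reachable from the claim) is a hypothetical choice, since this summary does not state the paper's criteria. The names `check_framework` and `malformed_rate` and the dictionary layout are likewise hypothetical.

```python
def check_framework(framework, root):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for name, entry in framework.items():
        if not 0.0 <= entry["base"] <= 1.0:
            problems.append(f"{name}: base strength outside [0, 1]")
        for child in entry["attackers"] + entry["supporters"]:
            if child not in framework:
                problems.append(f"{name}: references unknown argument {child}")
    if root not in framework:
        problems.append(f"root {root} is missing")
        return problems

    # Depth-first walk from the root; meeting a node already on the
    # current path means the framework contains a cycle.
    def walk(node, path):
        if node in path:
            problems.append(f"cycle through {node}")
            return
        entry = framework.get(node)
        if entry is None:
            return  # dangling reference, already reported above
        for child in entry["attackers"] + entry["supporters"]:
            walk(child, path | {node})

    walk(root, frozenset())
    return problems


def malformed_rate(frameworks, root="claim"):
    """Share of frameworks for which at least one check fails."""
    if not frameworks:
        return 0.0
    bad = sum(1 for fw in frameworks if check_framework(fw, root))
    return bad / len(frameworks)


# Hypothetical examples: one well-formed framework and one with a cycle.
good = {
    "claim": {"base": 0.5, "attackers": ["a1"], "supporters": []},
    "a1": {"base": 0.7, "attackers": [], "supporters": []},
}
cyclic = {
    "claim": {"base": 0.5, "attackers": ["a1"], "supporters": []},
    "a1": {"base": 0.7, "attackers": ["claim"], "supporters": []},
}
rate = malformed_rate([good, cyclic])
```

A check of this kind only covers structural problems. It says nothing about whether the arguments are faithful to the model's knowledge, which is the separate concern raised in the first comment.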

Circularity Check

0 steps flagged

No significant circularity

The paper introduces ArgLLMs as an augmentation of LLMs with argumentation frameworks for claim verification, evaluates them experimentally against baselines, and defines novel contestability properties for formal assessment. No derivation reduces a claimed prediction or result to a fitted parameter, self-defined quantity, or self-citation chain; the central claims rest on empirical performance and externally grounded argumentation concepts rather than internal redefinitions or forced outputs. The approach is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the domain assumption that LLMs can generate useful argumentation frameworks from their encoded knowledge, with ArgLLMs as the primary new construct; no free parameters or additional invented entities are specified in the abstract.

axioms (1)

domain assumption: LLMs encode knowledge that can be applied zero-shot in a range of settings
Explicitly stated in the abstract as the basis for using LLMs in decision-making.

invented entities (1)

ArgLLMs · no independent evidence
purpose: Augment LLMs with argumentative reasoning to enable explainable and contestable decisions
New method introduced by the paper to address limitations of standard LLMs.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI at the Front Lines of Platform Governance: Using LLMs to Support Illegal Content Reporting under the Digital Services Act
cs.HC · 2026-05 · unverdicted · novelty 5.0

EvalAI providing pro/con arguments improves provision-level accuracy and reduces misclassification distance in DSA illegal content reporting under AI error conditions versus conventional XAI.

Contestable AI needs Computational Argumentation
cs.AI · 2024-05 · unverdicted · novelty 3.0

Contestable AI needs dynamic processes enabled by computational argumentation instead of static models.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 2 Pith papers · 8 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. CoRR , abs/2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Axiomatic foundations of acceptability semantics

Leila Amgoud and Jonathan Ben - Naim. Axiomatic foundations of acceptability semantics. In KR , pages 2--11, 2016

work page 2016
[3]

Using arguments for making and explaining decisions

Leila Amgoud and Henri Prade. Using arguments for making and explaining decisions. Artif. Intell. , 173(3-4):413--436, 2009

work page 2009
[4]

Acceptability semantics for weighted argumentation frameworks

Leila Amgoud, Jonathan Ben - Naim, Dragan Doder, and Srdjan Vesic. Acceptability semantics for weighted argumentation frameworks. In IJCAI , pages 56--62, 2017

work page 2017
[5]

Measuring the intensity of attacks in argumentation graphs with shapley value

Leila Amgoud, Jonathan Ben - Naim, and Srdjan Vesic. Measuring the intensity of attacks in argumentation graphs with shapley value. In IJCAI , pages 63--69, 2017

work page 2017
[6]

Towards artificial argumentation

Katie Atkinson, Pietro Baroni, Massimiliano Giacomin, Anthony Hunter, Henry Prakken, Chris Reed, Guillermo Ricardo Simari, Matthias Thimm, and Serena Villata. Towards artificial argumentation. AI Mag. , 38(3):25--36, 2017

work page 2017
[7]

How many properties do we need for gradual argumentation? In AAAI , pages 1736--1743, 2018

Pietro Baroni, Antonio Rago, and Francesca Toni. How many properties do we need for gradual argumentation? In AAAI , pages 1736--1743, 2018

work page 2018
[8]

From fine-grained properties to broad principles for gradual argumentation: A principled spectrum

Pietro Baroni, Antonio Rago, and Francesca Toni. From fine-grained properties to broad principles for gradual argumentation: A principled spectrum. Int. J. Approx. Reason. , 105:252--286, 2019

work page 2019
[9]

a is b" fail to learn

Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: Llms trained on "a is b" fail to learn "b is a". CoRR , abs/2309.12288, 2023

work page arXiv 2023
[10]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. In AAAI , pages 17682--17690, 2024

work page 2024
[11]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS , 2020

work page 2020
[12]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

S \' e bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4 . CoRR , abs/2303.12712, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

On the acceptability of arguments in bipolar argumentation frameworks

Claudette Cayrol and Marie - Christine Lagasquie - Schiex. On the acceptability of arguments in bipolar argumentation frameworks. In ECSQARU , pages 378--389, 2005

work page 2005
[14]

Exploring the potential of large language models in computational argumentation

Guizhen Chen, Liying Cheng, Luu Anh Tuan, and Lidong Bing. Exploring the potential of large language models in computational argumentation. CoRR , abs/2311.09022, 2023

work page arXiv 2023
[15]

Argumentative XAI: A survey

Kristijonas Cyras, Antonio Rago, Emanuele Albini, Pietro Baroni, and Francesca Toni. Argumentative XAI: A survey. In IJCAI , pages 4392--4399, 2021

work page 2021
[16]

QLoRA : Efficient finetuning of quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA : Efficient finetuning of quantized LLMs . In NeurIPS , 2023

work page 2023
[17]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. CoRR , abs/2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Evaluating superhuman models with consistency checks

Lukas Fluri, Daniel Paleka, and Florian Tram \` e r. Evaluating superhuman models with consistency checks. CoRR , abs/2306.09983, 2023

work page arXiv 2023
[19]

Rodr \' guez, Diego Letzen, Maria Vanina Martinez, and Laura Alonso Alemany

Dami \' a n Ariel Furman, Pablo Torres, Jos \' e A. Rodr \' guez, Diego Letzen, Maria Vanina Martinez, and Laura Alonso Alemany. High-quality argumentative information in low resources approaches improve counter-narrative generation. In EMNLP , pages 2942--2956, 2023

work page 2023
[20]

Did aristotle use a laptop? A question answering benchmark with implicit reasoning strategies

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. CoRR , abs/2101.02235, 2021

work page arXiv 2021
[21]

Which argument is more convincing? analyzing and predicting convincingness of web arguments using bidirectional LSTM

Ivan Habernal and Iryna Gurevych. Which argument is more convincing? analyzing and predicting convincingness of web arguments using bidirectional LSTM . In ACL , 2016

work page 2016
[22]

arXiv preprint arXiv:2402.18563 , year=

Danny Halawi, Fred Zhang, Chen Yueh - Han, and Jacob Steinhardt. Approaching human-level forecasting with language models. CoRR , abs/2402.18563, 2024

work page arXiv 2024
[23]

Hammond and David B

Kristian J. Hammond and David B. Leake. Large language models need symbolic AI . In NeSy , pages 204--209, 2023

work page 2023
[24]

Beyond explainability: justifiability and contestability of algorithmic decision systems

Cl \' e ment Henin and Daniel Le M \' e tayer. Beyond explainability: justifiability and contestability of algorithmic decision systems. AI Soc. , 37(4):1397--1410, 2022

work page 2022
[25]

Leveraging large language models to generate answer set programs

Adam Ishay, Zhun Yang, and Joohyung Lee. Leveraging large language models to generate answer set programs. In KR , pages 374--383, 2023

work page 2023
[26]

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. CoRR , abs/2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. CoRR , abs/2401.04088, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

What disease does this patient have? A large-scale open domain question answering dataset from medical exams

Di Jin, Eileen Pan, Nassim Oufattole, Wei - Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. CoRR , abs/2009.13081, 2020

work page arXiv 2009
[29]

Contribution functions for quantitative bipolar argumentation graphs: A principle-based analysis

Timotheus Kampik, Nico Potyka, Xiang Yin, Kristijonas Cyras, and Francesca Toni. Contribution functions for quantitative bipolar argumentation graphs: A principle-based analysis. Int. J. Approx. Reason. , 173:109255, 2024

work page 2024
[30]

Towards a framework for evaluating explanations in automated fact verification

Neema Kotonya and Francesca Toni. Towards a framework for evaluating explanations in automated fact verification. In LREC/COLING , pages 16364--16377, 2024

work page 2024
[31]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In ICLR , 2023

work page 2023
[32]

Tetreault

Anne Lauscher, Lily Ng, Courtney Napoles, and Joel R. Tetreault. Rhetoric, logic, and dialectic: Advancing theory-based argument quality assessment in natural language processing. In COLING , pages 4563--4574, 2020

work page 2020
[33]

Contestable AI Needs Computational Argumentation

Francesco Leofante, Hamed Ayoobi, Adam Dejl, Gabriel Freedman, Deniz Gorur, Junqi Jiang, Guilherme Paulino-Passos, Antonio Rago, Anna Rapberger, Fabrizio Russo, Xiang Yin, Dekai Zhang, and Francesca Toni. Contestable AI Needs Computational Argumentation . In Proceedings of the 21st International Conference on Principles of Knowledge Representation and Rea...

work page 2024
[34]

u ttler, Mike Lewis, Wen - tau Yih, Tim Rockt \

Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \" u ttler, Mike Lewis, Wen - tau Yih, Tim Rockt \" a schel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS , 2020

work page 2020
[35]

Vera Liao and Jennifer Wortman Vaughan

Q. Vera Liao and Jennifer Wortman Vaughan. AI Transparency in the Age of LLMs : A Human - Centered Research Roadmap . Harvard Data Science Review , (Special Issue 5), may 31 2024

work page 2024
[36]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. CoRR , abs/2109.07958, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[37]

Conceptualising contestability: Perspectives on contesting algorithmic decisions

Henrietta Lyons, Eduardo Velloso, and Tim Miller. Conceptualising contestability: Perspectives on contesting algorithmic decisions. Proc. ACM Hum. Comput. Interact. , 5( CSCW1 ):106:1--106:25, 2021

work page 2021
[38]

Why do humans reason? arguments for an argumentative theory

Hugo Mercier and Dan Sperber. Why do humans reason? arguments for an argumentative theory. Behavioral and Brain Sciences , 34(2):57–74, 2011

work page 2011
[39]

The Enigma of Reason

Hugo Mercier and Dan Sperber. The Enigma of Reason . Penguin, 2018

work page 2018
[40]

Gemma: Open Models Based on Gemini Research and Technology

Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi \` e re, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, et al. Gemma: Open models based on gemini research and technology. CoRR , abs/2403.08295, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Explainable AI is dead, long live explainable ai!: Hypothesis-driven decision support using evaluative AI

Tim Miller. Explainable AI is dead, long live explainable ai!: Hypothesis-driven decision support using evaluative AI . In FAccT , pages 333--342, 2023

work page 2023
[42]

Gpt-4o mini: advancing cost-efficient intelligence, 2024

OpenAI. Gpt-4o mini: advancing cost-efficient intelligence, 2024

work page 2024
[43]

Autoplan: Automatic planning of interactive decision-making tasks with large language models

Siqi Ouyang and Lei Li. Autoplan: Automatic planning of interactive decision-making tasks with large language models. In EMNLP , pages 3114--3128, 2023

work page 2023
[44]

Continuous dynamical systems for weighted bipolar argumentation

Nico Potyka. Continuous dynamical systems for weighted bipolar argumentation. In KR , pages 148--157, 2018

work page 2018
[45]

Discontinuity-free decision support with quantitative argumentation debates

Antonio Rago, Francesca Toni, Marco Aurisicchio, and Pietro Baroni. Discontinuity-free decision support with quantitative argumentation debates. In KR , pages 63--73, 2016

work page 2016
[46]

Interactive explanations by conflict resolution via argumentative exchanges

Antonio Rago, Hengzhi Li, and Francesca Toni. Interactive explanations by conflict resolution via argumentative exchanges. In KR , pages 582--592, 2023

work page 2023
[47]

Gemma 2: Improving Open Language Models at a Practical Size

Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, L \'e onard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ram \'e , et al. Gemma 2: Improving open language models at a practical size. CoRR , abs/2408.00118, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. , 1(5):206--215, 2019

work page 2019
[49]

Teler: A general taxonomy of LLM prompts for benchmarking complex tasks

Shubhra Kanti Karmaker Santu and Dongji Feng. Teler: A general taxonomy of LLM prompts for benchmarking complex tasks. In EMNLP , pages 14197--14203, 2023

work page 2023
[50]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi - Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In NeurIPS , 2023

work page 2023
[51]

Talking about large language models

Murray Shanahan. Talking about large language models. Commun. ACM , 67(2):68--79, 2024

work page 2024
[52]

Finding convincing arguments using scalable bayesian preference learning

Edwin Simpson and Iryna Gurevych. Finding convincing arguments using scalable bayesian preference learning. Trans. Assoc. Comput. Linguistics , 6:357--371, 2018

work page 2018
[53]

Entailer: Answering questions with faithful and truthful chains of reasoning

Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. Entailer: Answering questions with faithful and truthful chains of reasoning. In EMNLP , pages 2078--2093, 2022

work page 2078
[54]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. In NeurIPS , 2023

work page 2023
[55]

Explainable claim verification via knowledge-grounded reasoning with large language models

Haoran Wang and Kai Shu. Explainable claim verification via knowledge-grounded reasoning with large language models. In EMNLP , pages 6288--6304, 2023

work page 2023
[56]

Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models

Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Lunting Fan, Lingfei Wu, and Qingsong Wen. Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models. CoRR , abs/2310.16340, 2023

work page arXiv 2023
[57]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS , 2022

work page 2022
[58]

Evaluating mathematical reasoning beyond accuracy

Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, and Pengfei Liu. Evaluating mathematical reasoning beyond accuracy. CoRR , abs/2404.05692, 2024

work page arXiv 2024
[59]

Le, Denny Zhou, and Xinyun Chen

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. In ICLR , 2024

work page 2024
[60]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS , 2023

work page 2023
[61]

Argument attribution explanations in quantitative bipolar argumentation frameworks

Xiang Yin, Nico Potyka, and Francesca Toni. Argument attribution explanations in quantitative bipolar argumentation frameworks. In ECAI , pages 2898--2905, 2023

work page 2023
[62]

CE-QArg: Counterfactual Explanations for Quantitative Bipolar Argumentation Frameworks

Xiang Yin, Nico Potyka, and Francesca Toni. CE-QArg: Counterfactual Explanations for Quantitative Bipolar Argumentation Frameworks . In KR , pages 697--707, 8 2024

work page 2024
[63]

Explaining arguments' strength: Unveiling the role of attacks and supports

Xiang Yin, Nico Potyka, and Francesca Toni. Explaining arguments' strength: Unveiling the role of attacks and supports. In IJCAI , pages 3622--3630, 2024

work page 2024
[64]

Towards llm-based fact verification on news claims with a hierarchical step-by-step prompting method

Xuan Zhang and Wei Gao. Towards llm-based fact verification on news claims with a hierarchical step-by-step prompting method. In IJCNLP , pages 996--1011, 2023

[65] Hugh Zhang and David C. Parkes. Chain-of-thought reasoning is a policy improvement operator. CoRR, abs/2309.08589, 2023.

[66] Haodi Zhang, Jiahong Li, Yichi Wang, and Yuanfeng Song. Integrating automated knowledge extraction with large language models for explainable medical decision-making. In BIBM, pages 1710–1717, 2023.


