Argumentation for Explainable and Globally Contestable Decision Support with LLMs

Adam Dejl; Francesca Toni; Matthew Williams

arxiv: 2603.14643 · v2 · submitted 2026-03-15 · 💻 cs.AI · cs.CL

Pith reviewed 2026-05-15 10:47 UTC · model grok-4.3

keywords: argumentation frameworks · large language models · explainable AI · decision support · global contestability · glioblastoma treatment · computational argumentation

The pith

ArgEval maps task-specific decision spaces into reusable argumentation frameworks so LLMs can give explainable recommendations that users can contest and correct globally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ArgEval as a way to move beyond one-off explanations for single LLM outputs. It first charts the full set of decision options in a domain, builds an ontology of those options, and then creates a single general argumentation framework for each option. For any new case the framework is instantiated to produce a recommendation plus supporting arguments. Because the underlying frameworks are shared, altering an argument or relation updates the logic for every future case rather than fixing only the current instance. The authors demonstrate the approach on glioblastoma treatment selection and report that the resulting guidance matches current clinical standards.

Core claim

ArgEval shifts from mining arguments for isolated cases to systematically mapping task-specific decision spaces, constructing option ontologies, and building general argumentation frameworks that can be instantiated for individual recommendations while remaining modifiable at the framework level to support global contestability.

What carries the argument

General argumentation frameworks (AFs) built from task-specific decision spaces and option ontologies; these structures are instantiated per case for explanations yet remain shared so that changes propagate to all future uses.
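
A minimal sketch of how a shared framework could be instantiated per case and edited globally. It assumes Dung-style abstract argumentation with grounded semantics, which this summary does not confirm as the semantics ArgEval uses. `GeneralAF`, `recommend`, the applicability predicates, and the argument names `claim`, `risk`, and `mitigation` are hypothetical placeholders, not the paper's API and not clinical content.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple

Case = Dict[str, object]
Attack = Tuple[str, str]  # (attacker, target)


@dataclass
class GeneralAF:
    """Shared framework for one decision option (hypothetical structure)."""
    option: str
    # Argument name -> predicate deciding whether the argument applies to a case.
    arguments: Dict[str, Callable[[Case], bool]] = field(default_factory=dict)
    attacks: Set[Attack] = field(default_factory=set)

    def instantiate(self, case: Case) -> Tuple[Set[str], Set[Attack]]:
        """Keep the arguments that apply to this case and the attacks among them."""
        args = {name for name, applies in self.arguments.items() if applies(case)}
        atts = {(x, y) for (x, y) in self.attacks if x in args and y in args}
        return args, atts


def grounded_extension(args: Set[str], attacks: Set[Attack]) -> Set[str]:
    """Iterate the characteristic function from the empty set to its least fixed point."""
    accepted: Set[str] = set()
    while True:
        defeated = {y for (x, y) in attacks if x in accepted}
        # An argument is acceptable if every one of its attackers is defeated.
        new = {a for a in args
               if all(x in defeated for (x, y) in attacks if y == a)}
        if new == accepted:
            return accepted
        accepted = new


def recommend(af: GeneralAF, case: Case) -> Tuple[bool, Set[str]]:
    """Recommend the option if its root argument 'claim' is accepted for this case."""
    args, atts = af.instantiate(case)
    accepted = grounded_extension(args, atts)
    return "claim" in accepted, accepted


# Shared framework for a placeholder option.
af = GeneralAF(option="option_A")
af.arguments["claim"] = lambda c: True                        # root: recommend option_A
af.arguments["risk"] = lambda c: bool(c.get("risk_factor"))   # applies only to some cases
af.attacks.add(("risk", "claim"))

# Global contestation: one edit to the shared framework changes every later case.
af.arguments["mitigation"] = lambda c: bool(c.get("mitigated"))
af.attacks.add(("mitigation", "risk"))
```

Calling `recommend(af, case)` returns both the decision and the accepted arguments, so the second element can serve as the case-level explanation. Because the edit is made on `af` itself rather than on one instantiation, it applies to all cases evaluated afterwards.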

If this is right

Recommendations come with explicit argument chains that users can inspect for any specific case.
A single edit to a shared framework alters the reasoning applied to every subsequent case.
The method is no longer restricted to pre-defined binary choices.
Corrections are no longer purely local fixes that leave the underlying decision logic unchanged and prone to repeating the same mistakes.
The glioblastoma evaluation shows the instantiated outputs align with clinical practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same mapping process could be applied to other high-stakes domains such as loan approval or sentencing support by building domain-specific ontologies once.
If the general frameworks prove stable, repeated LLM errors could be addressed by one-time framework updates rather than case-by-case prompting.
Integration with existing clinical decision-support tools could be tested by checking whether ArgEval outputs improve physician acceptance rates.
Scaling the ontology construction step may require hybrid human-AI workflows whose cost and consistency remain unexamined in the current work.

Load-bearing premise

The constructed general argumentation frameworks accurately encode the relevant domain knowledge without systematic bias or loss of accuracy when applied to individual cases.

What would settle it

A set of new glioblastoma cases in which ArgEval's instantiated recommendations diverge from established clinical guidelines even though the general frameworks have not been modified.

Original abstract

Large language models (LLMs) exhibit strong general capabilities, but their deployment in high-stakes domains is hindered by their opacity and unpredictability. Recent work has taken meaningful steps towards addressing these issues by augmenting LLMs with post-hoc reasoning based on computational argumentation, providing faithful explanations and enabling users to contest incorrect decisions. However, this paradigm is limited to pre-defined binary choices and only supports local contestation for specific instances, leaving the underlying decision logic unchanged and prone to repeated mistakes. In this paper, we introduce ArgEval, a framework that shifts from instance-specific reasoning to structured evaluation of general decision options. Rather than mining arguments solely for individual cases, ArgEval systematically maps task-specific decision spaces, builds corresponding option ontologies, and constructs general argumentation frameworks (AFs) for each option. These frameworks can then be instantiated to provide explainable recommendations for specific cases while still supporting global contestability through modification of the shared AFs. We investigate the effectiveness of ArgEval on treatment recommendation for glioblastoma, an aggressive brain tumour, and show that it can produce explainable guidance aligned with clinical practice.
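
The mapping step in the abstract (decision space, then option ontology, then one general AF per option) can be pictured with a small sketch. It assumes the ontology is a tree whose leaves are the concrete options and that only leaves receive a framework; neither detail is stated in the abstract. `OptionNode` and all option names are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class OptionNode:
    """One node of a hypothetical option ontology."""
    name: str
    children: List["OptionNode"] = field(default_factory=list)

    def leaves(self) -> Iterator["OptionNode"]:
        """Yield the concrete options (nodes without children), depth first."""
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.leaves()


# Placeholder decision space: two option families, one of them refined further.
root = OptionNode("all_options", [
    OptionNode("family_A", [OptionNode("A1"), OptionNode("A2")]),
    OptionNode("family_B"),
])

# One (initially empty) general framework per concrete option.
frameworks: Dict[str, dict] = {
    leaf.name: {"arguments": set(), "attacks": set()} for leaf in root.leaves()
}
```

Keeping the frameworks keyed by option name is one way to make them reusable across cases: a new case looks up each option's framework instead of mining arguments from scratch.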

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ArgEval, a framework that maps task-specific decision spaces, builds corresponding option ontologies, and constructs general argumentation frameworks (AFs) for each option. These AFs are instantiated for specific cases to generate explainable LLM-based recommendations while supporting global contestability via modifications to the shared AFs. The approach is evaluated on glioblastoma treatment recommendation and claimed to produce guidance aligned with clinical practice.

Significance. If the general AFs faithfully encode clinical knowledge, ArgEval would address a key limitation of prior post-hoc argumentation methods for LLMs by enabling both instance-level explanations and global updates to decision logic, which is significant for high-stakes domains such as medical decision support.

major comments (3)

[Abstract] Abstract: The central claim that ArgEval produces 'explainable guidance aligned with clinical practice' supplies no metrics, dataset size, baseline comparisons, or error analysis, preventing verification of the result.
[ArgEval framework description] ArgEval framework (construction of general AFs): The process for deriving arguments and attacks from decision spaces and option ontologies, including the roles of LLMs versus experts and any completeness checks against clinical guidelines, is not described; without these details the fidelity assumption cannot be assessed.
[Evaluation] Evaluation section: No quantitative validation of instantiated AFs (e.g., fidelity metrics, systematic bias checks, or comparison to expert-derived frameworks) is provided, which is load-bearing for the claim that case instantiations remain clinically accurate.

minor comments (1)

[Abstract] The abstract would be clearer if it briefly listed the glioblastoma treatment options considered in the decision space.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the clarity and rigor of our presentation. We address each major comment point by point below. Where the comments identify gaps in description or evidence, we have revised the manuscript to incorporate additional details and quantitative elements.

Point-by-point responses

Referee: [Abstract] Abstract: The central claim that ArgEval produces 'explainable guidance aligned with clinical practice' supplies no metrics, dataset size, baseline comparisons, or error analysis, preventing verification of the result.

Authors: We agree that the abstract lacks sufficient detail on the evaluation. The evaluation is a qualitative case study demonstrating alignment with clinical practice through specific glioblastoma treatment examples. In the revised manuscript, we will update the abstract to specify the evaluation approach, including the number of cases considered, the qualitative validation against clinical guidelines, and the framework's primary focus on explainability and contestability rather than predictive benchmarking.
Revision: yes

Referee: [ArgEval framework description] ArgEval framework (construction of general AFs): The process for deriving arguments and attacks from decision spaces and option ontologies, including the roles of LLMs versus experts and any completeness checks against clinical guidelines, is not described; without these details the fidelity assumption cannot be assessed.

Authors: We accept this point and will substantially expand the framework description section. The revised text will explicitly detail the construction process: mapping decision spaces with domain input, building option ontologies via LLM-assisted extraction from medical literature followed by expert curation, generating arguments and attacks through structured LLM prompting with expert oversight for relevance and completeness, and performing cross-validation against established clinical guidelines (e.g., NCCN for glioblastoma). This will enable readers to assess the fidelity of the general AFs.
Revision: yes

Referee: [Evaluation] Evaluation section: No quantitative validation of instantiated AFs (e.g., fidelity metrics, systematic bias checks, or comparison to expert-derived frameworks) is provided, which is load-bearing for the claim that case instantiations remain clinically accurate.

Authors: We recognize that the current evaluation relies on illustrative cases without quantitative support. In the revised manuscript, we will augment the evaluation section with quantitative validation, including fidelity metrics such as agreement rates between instantiated AF recommendations and expert judgments across the case set, checks for systematic biases in argument selection, and a comparison against a small number of independently expert-constructed argumentation frameworks. These additions will directly support the clinical accuracy claims.
Revision: yes
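
For illustration, the agreement rate mentioned in the last response could be computed along the following lines. This is a sketch under stated assumptions: the rebuttal does not specify a metric beyond "agreement rates", so Cohen's kappa is added here as one common chance-corrected choice, and the labels and lists are toy placeholders, not results from the paper.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(system: Sequence[str], expert: Sequence[str]) -> float:
    """Fraction of cases where the system's recommendation equals the expert's."""
    if len(system) != len(expert) or not system:
        raise ValueError("need two non-empty label sequences of equal length")
    return sum(s == e for s, e in zip(system, expert)) / len(system)


def cohens_kappa(system: Sequence[str], expert: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(system)
    observed = agreement_rate(system, expert)
    sys_counts, exp_counts = Counter(system), Counter(expert)
    labels = sys_counts.keys() | exp_counts.keys()
    expected = sum(sys_counts[l] * exp_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Toy placeholder labels, for illustration only.
system_recs = ["A", "A", "B", "B"]
expert_recs = ["A", "B", "B", "B"]
rate = agreement_rate(system_recs, expert_recs)
kappa = cohens_kappa(system_recs, expert_recs)
```

Reporting a chance-corrected figure next to the raw rate matters when one option dominates the case set, since raw agreement can then look high even for an uninformative recommender.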

Circularity Check

0 steps flagged

ArgEval is a new methodological construction with no circular derivation chain

full rationale

The paper introduces ArgEval as a novel framework that maps task-specific decision spaces, builds option ontologies, and constructs general argumentation frameworks (AFs) for instantiation on specific cases. No equations, fitted parameters, predictions, or derivations are described that reduce by construction to the paper's own inputs. The central claims rest on the proposed construction process itself rather than self-definitional loops, fitted-input renamings, or load-bearing self-citations. This is a standard non-circular outcome for a framework-introduction paper whose claims are methodological proposals rather than reductions of results to prior fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on domain assumptions about LLM argument generation and the fidelity of general AFs; no free parameters or invented physical entities are mentioned.

axioms (2)

Domain assumption: LLMs can produce faithful arguments that can be organized into argumentation frameworks for decision options.
Basis: Implicit in the augmentation of LLMs with post-hoc reasoning based on computational argumentation.
Domain assumption: General AFs built from task-specific decision spaces can be instantiated for specific cases while preserving explainability and enabling global modification.
Basis: Central design claim of ArgEval as stated in the abstract.

invented entities (1)

ArgEval framework (no independent evidence)
Purpose: To shift from instance-specific to general decision-option evaluation using shared argumentation frameworks.
Basis: Newly introduced construct in the paper; no independent evidence outside this work is provided.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Neurosymbolic Learning for Inference-Time Argumentation
cs.AI · 2026-05 · unverdicted · novelty 6.0

ITA trains LLMs to generate and score arguments for ternary claim verification and uses argumentation semantics to derive faithful true/false/uncertain predictions from those structures.