Argumentation for Explainable and Globally Contestable Decision Support with LLMs
Pith reviewed 2026-05-15 10:47 UTC · model grok-4.3
The pith
ArgEval maps task-specific decision spaces into reusable argumentation frameworks so LLMs can give explainable recommendations that users can contest and correct globally.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ArgEval shifts from mining arguments for isolated cases to systematically mapping task-specific decision spaces, constructing option ontologies, and building general argumentation frameworks that can be instantiated for individual recommendations while remaining modifiable at the framework level to support global contestability.
What carries the argument
General argumentation frameworks (AFs) built from task-specific decision spaces and option ontologies; these structures are instantiated per case for explanations yet remain shared so that changes propagate to all future uses.
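The split between a shared framework and its per-case instantiation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the page does not say which argumentation semantics ArgEval uses, so the sketch assumes a Dung-style abstract framework evaluated under grounded semantics. The names `GeneralAF`, `instantiate`, and `recommend`, the case features, and the thresholds are hypothetical and carry no clinical meaning.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple

Case = Dict[str, object]


@dataclass
class GeneralAF:
    """Shared framework for one decision option (hypothetical structure)."""
    option: str
    # argument name -> predicate that decides whether it applies to a case
    arguments: Dict[str, Callable[[Case], bool]] = field(default_factory=dict)
    # (attacker, target) pairs between argument names
    attacks: Set[Tuple[str, str]] = field(default_factory=set)

    def instantiate(self, case: Case):
        """Keep the arguments that apply to this case and the attacks among them."""
        args = {name for name, applies in self.arguments.items() if applies(case)}
        atts = {(x, y) for (x, y) in self.attacks if x in args and y in args}
        return args, atts


def grounded_extension(args, attacks):
    """Iterate the characteristic function from the empty set to its least fixed point."""
    accepted = set()
    while True:
        # Arguments attacked by an accepted argument are defeated.
        defeated = {y for (x, y) in attacks if x in accepted}
        # An argument is acceptable when every one of its attackers is defeated.
        new = {a for a in args
               if all(x in defeated for (x, y) in attacks if y == a)}
        if new == accepted:
            return accepted
        accepted = new


def recommend(af: GeneralAF, case: Case) -> bool:
    """Recommend the option when its root claim survives in the grounded extension."""
    args, atts = af.instantiate(case)
    return "claim" in grounded_extension(args, atts)


# Illustrative shared framework; feature names and thresholds are made up.
af = GeneralAF(option="option_A")
af.arguments["claim"] = lambda c: True
af.arguments["poor_status"] = lambda c: c["performance_status"] >= 3
af.attacks.add(("poor_status", "claim"))

case_1 = {"performance_status": 1, "age": 55}
case_2 = {"performance_status": 1, "age": 82}
before = [recommend(af, case_1), recommend(af, case_2)]

# Global contestation: one edit to the shared framework affects every later case.
af.arguments["advanced_age"] = lambda c: c["age"] >= 80
af.attacks.add(("advanced_age", "claim"))
after = [recommend(af, case_1), recommend(af, case_2)]
```

In this sketch the edit is made once on `af`, and the recommendation for `case_2` changes without touching any per-case state, which is the propagation property the section describes.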
If this is right
- Recommendations come with explicit argument chains that users can inspect for any specific case.
- A single edit to a shared framework alters the reasoning applied to every subsequent case.
- The method is no longer restricted to pre-defined binary choices.
- Corrections are no longer purely local: a fix changes the underlying decision logic, so the same mistake is less likely to recur.
- The glioblastoma evaluation shows the instantiated outputs align with clinical practice.
Where Pith is reading between the lines
- The same mapping process could be applied to other high-stakes domains such as loan approval or sentencing support by building domain-specific ontologies once.
- If the general frameworks prove stable, repeated LLM errors could be addressed by one-time framework updates rather than case-by-case prompting.
- Integration with existing clinical decision-support tools could be tested by checking whether ArgEval outputs improve physician acceptance rates.
- Scaling the ontology construction step may require hybrid human-AI workflows whose cost and consistency remain unexamined in the current work.
Load-bearing premise
The constructed general argumentation frameworks accurately encode the relevant domain knowledge without systematic bias or loss of accuracy when applied to individual cases.
What would settle it
A set of new glioblastoma cases in which ArgEval's instantiated recommendations diverge from established clinical guidelines even though the general frameworks have not been modified.
Original abstract
Large language models (LLMs) exhibit strong general capabilities, but their deployment in high-stakes domains is hindered by their opacity and unpredictability. Recent work has taken meaningful steps towards addressing these issues by augmenting LLMs with post-hoc reasoning based on computational argumentation, providing faithful explanations and enabling users to contest incorrect decisions. However, this paradigm is limited to pre-defined binary choices and only supports local contestation for specific instances, leaving the underlying decision logic unchanged and prone to repeated mistakes. In this paper, we introduce ArgEval, a framework that shifts from instance-specific reasoning to structured evaluation of general decision options. Rather than mining arguments solely for individual cases, ArgEval systematically maps task-specific decision spaces, builds corresponding option ontologies, and constructs general argumentation frameworks (AFs) for each option. These frameworks can then be instantiated to provide explainable recommendations for specific cases while still supporting global contestability through modification of the shared AFs. We investigate the effectiveness of ArgEval on treatment recommendation for glioblastoma, an aggressive brain tumour, and show that it can produce explainable guidance aligned with clinical practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ArgEval, a framework that maps task-specific decision spaces, builds corresponding option ontologies, and constructs general argumentation frameworks (AFs) for each option. These AFs are instantiated for specific cases to generate explainable LLM-based recommendations while supporting global contestability via modifications to the shared AFs. The approach is evaluated on glioblastoma treatment recommendation and claimed to produce guidance aligned with clinical practice.
Significance. If the general AFs faithfully encode clinical knowledge, ArgEval would address a key limitation of prior post-hoc argumentation methods for LLMs by enabling both instance-level explanations and global updates to decision logic, which is significant for high-stakes domains such as medical decision support.
major comments (3)
- [Abstract] Abstract: The central claim that ArgEval produces 'explainable guidance aligned with clinical practice' supplies no metrics, dataset size, baseline comparisons, or error analysis, preventing verification of the result.
- [ArgEval framework description] ArgEval framework (construction of general AFs): The process for deriving arguments and attacks from decision spaces and option ontologies, including the roles of LLMs versus experts and any completeness checks against clinical guidelines, is not described; without these details the fidelity assumption cannot be assessed.
- [Evaluation] Evaluation section: No quantitative validation of instantiated AFs (e.g., fidelity metrics, systematic bias checks, or comparison to expert-derived frameworks) is provided, which is load-bearing for the claim that case instantiations remain clinically accurate.
minor comments (1)
- [Abstract] The abstract would be clearer if it briefly listed the glioblastoma treatment options considered in the decision space.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for improving the clarity and rigor of our presentation. We address each major comment point by point below. Where the comments identify gaps in description or evidence, we have revised the manuscript to incorporate additional details and quantitative elements.
- Referee: [Abstract] Abstract: The central claim that ArgEval produces 'explainable guidance aligned with clinical practice' supplies no metrics, dataset size, baseline comparisons, or error analysis, preventing verification of the result.
  Authors: We agree that the abstract lacks sufficient detail on the evaluation. The evaluation is a qualitative case study demonstrating alignment with clinical practice through specific glioblastoma treatment examples. In the revised manuscript, we will update the abstract to specify the evaluation approach, including the number of cases considered, the qualitative validation against clinical guidelines, and the framework's primary focus on explainability and contestability rather than predictive benchmarking.
  Revision: yes
- Referee: [ArgEval framework description] ArgEval framework (construction of general AFs): The process for deriving arguments and attacks from decision spaces and option ontologies, including the roles of LLMs versus experts and any completeness checks against clinical guidelines, is not described; without these details the fidelity assumption cannot be assessed.
  Authors: We accept this point and will substantially expand the framework description section. The revised text will explicitly detail the construction process: mapping decision spaces with domain input, building option ontologies via LLM-assisted extraction from medical literature followed by expert curation, generating arguments and attacks through structured LLM prompting with expert oversight for relevance and completeness, and performing cross-validation against established clinical guidelines (e.g., NCCN for glioblastoma). This will enable readers to assess the fidelity of the general AFs.
  Revision: yes
- Referee: [Evaluation] Evaluation section: No quantitative validation of instantiated AFs (e.g., fidelity metrics, systematic bias checks, or comparison to expert-derived frameworks) is provided, which is load-bearing for the claim that case instantiations remain clinically accurate.
  Authors: We recognize that the current evaluation relies on illustrative cases without quantitative support. In the revised manuscript, we will augment the evaluation section with quantitative validation, including fidelity metrics such as agreement rates between instantiated AF recommendations and expert judgments across the case set, checks for systematic biases in argument selection, and a comparison against a small number of independently expert-constructed argumentation frameworks. These additions will directly support the clinical accuracy claims.
  Revision: yes
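An agreement rate of the kind the authors mention could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the function names and the example labels are hypothetical placeholders, and no data here comes from the paper. Cohen's kappa is added as one common chance-corrected companion to raw agreement; the page itself mentions only agreement rates.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(system: Sequence[str], expert: Sequence[str]) -> float:
    """Fraction of cases where the system's option matches the expert's option."""
    if len(system) != len(expert) or not system:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(s == e for s, e in zip(system, expert)) / len(system)


def cohens_kappa(system: Sequence[str], expert: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(system)
    p_observed = agreement_rate(system, expert)
    sys_counts, exp_counts = Counter(system), Counter(expert)
    labels = set(system) | set(expert)
    # Chance agreement: product of the two raters' marginal label frequencies.
    p_expected = sum(sys_counts[l] * exp_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Placeholder labels for four hypothetical cases (not data from the paper).
system_labels = ["A", "A", "B", "B"]
expert_labels = ["A", "B", "B", "B"]
rate = agreement_rate(system_labels, expert_labels)
kappa = cohens_kappa(system_labels, expert_labels)
```

Reporting both numbers matters when one option dominates the case set, since raw agreement can then look high even if the system adds little beyond always choosing the majority option.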
Circularity Check
ArgEval is a new methodological construction with no circular derivation chain
full rationale
The paper introduces ArgEval as a novel framework that maps task-specific decision spaces, builds option ontologies, and constructs general argumentation frameworks (AFs) for instantiation on specific cases. No equations, fitted parameters, predictions, or derivations are described that reduce by construction to the paper's own inputs. The central claims rest on the proposed construction process itself rather than self-definitional loops, fitted-input renamings, or load-bearing self-citations. This is a standard non-circular outcome for a framework-introduction paper whose claims are methodological proposals rather than reductions of results to prior fitted quantities.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs can produce faithful arguments that can be organized into argumentation frameworks for decision options
- domain assumption: General AFs built from task-specific decision spaces can be instantiated for specific cases while preserving explainability and enabling global modification
invented entities (1)
- ArgEval framework: no independent evidence
Forward citations
Cited by 1 Pith paper
- Neurosymbolic Learning for Inference-Time Argumentation: ITA trains LLMs to generate and score arguments for ternary claim verification and uses argumentation semantics to derive faithful true/false/uncertain predictions from those structures.