GRASP: Deterministic argument ranking in interaction graphs

Antonio Orvieto; Diganta Misra; Rediet Abebe; Volkan Cevher

arXiv: 2605.19141 · v1 · pith:YRANOBAInew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · cs.CL · cs.CY · cs.HC


Pith reviewed 2026-05-20 11:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CY · cs.HC

keywords argument ranking · LLM-as-a-Judge · interaction graphs · attack support propagation · structural sufficiency · deterministic ranking · reproducibility


The pith

Local pairwise judgments on argument attacks and supports produce more consistent global rankings than holistic LLM verdicts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that asking large language models for a single overall score on a debate leads to high disagreement across models. It instead decomposes debates into explicit attack and support links between individual arguments and collects local judgments on those links. These local judgments prove more stable across models than global scores. A deterministic propagation process then combines the links into one global ranking. The resulting scores track how well each argument holds up under the debate's structure rather than how persuasive or factually accurate it appears.

Core claim

GRASP aggregates stable local interaction judgments into a global ranking via a convergent attack-defense propagation operator. Local pairwise judgments on attacks and supports are shown to be more reproducible across models than holistic verdicts. GRASP scores measure structural sufficiency, a defense-aware notion of argument robustness over the explicit interaction graph, and do not correlate with human convincingness labels.

What carries the argument

The convergent attack-defense propagation operator that iteratively updates argument strengths according to supporting and attacking relations until a unique ranking emerges.
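
The paper's actual operator is not reproduced in this summary. As a minimal sketch of what an iterative, defense-aware update on an attack graph can look like, the following uses the h-categoriser rule from the gradual-semantics literature (Besnard and Hunter) as a stand-in. The function names `propagate` and `rank`, the tolerance, and the toy graph are all assumptions made for illustration, and support links are left out for brevity.

```python
from typing import Dict, List, Tuple


def propagate(
    arguments: List[str],
    attacks: List[Tuple[str, str]],
    tol: float = 1e-9,
    max_iter: int = 1000,
) -> Dict[str, float]:
    """Iterate s(a) = 1 / (1 + sum of attacker scores) until scores stop moving.

    This is the h-categoriser update, used here as a stand-in.
    It is NOT claimed to be GRASP's operator.
    """
    # Map each argument to the list of arguments that attack it.
    attackers: Dict[str, List[str]] = {a: [] for a in arguments}
    for source, target in attacks:
        attackers[target].append(source)

    scores = {a: 1.0 for a in arguments}  # hypothetical uniform initialization
    for _ in range(max_iter):
        new_scores = {
            a: 1.0 / (1.0 + sum(scores[b] for b in attackers[a]))
            for a in arguments
        }
        delta = max((abs(new_scores[a] - scores[a]) for a in arguments), default=0.0)
        scores = new_scores
        if delta < tol:
            break
    return scores


def rank(scores: Dict[str, float]) -> List[str]:
    """Sort arguments by descending score, breaking ties by name."""
    return sorted(scores, key=lambda a: (-scores[a], a))


# Toy chain: a attacks b, b attacks c. Argument a therefore defends c.
chain_scores = propagate(["a", "b", "c"], [("a", "b"), ("b", "c")])
chain_ranking = rank(chain_scores)
print(chain_ranking)  # ['a', 'c', 'b']
```

In the toy chain, `c` ends up above `b` because its only attacker is itself weakened by `a`, which is the kind of defense effect a structure-based score is meant to capture.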

Load-bearing premise

Pairwise LLM judgments on attacks and supports remain stable across models and the propagation operator converges to a unique ranking that reflects argumentative structure rather than model artifacts.

What would settle it

Applying GRASP to the same debate graph with two different LLMs and obtaining substantially different final rankings would indicate that local judgments are not reproducible enough to support the method.
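
One way to make "substantially different final rankings" measurable is a rank-correlation statistic between the two judges' outputs. The sketch below computes Kendall's tau for two strict rankings of the same arguments. The function name `kendall_tau` and the example rankings are hypothetical, and the summary does not say which statistic the paper uses.

```python
from itertools import combinations
from typing import List


def kendall_tau(ranking_a: List[str], ranking_b: List[str]) -> float:
    """Kendall's tau between two strict rankings of the same items.

    Returns 1.0 for identical order and -1.0 for fully reversed order.
    """
    if set(ranking_a) != set(ranking_b) or len(ranking_a) != len(set(ranking_a)):
        raise ValueError("rankings must be permutations of the same items")
    if len(ranking_a) < 2:
        raise ValueError("need at least two items to compare")

    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}

    concordant = 0
    discordant = 0
    for x, y in combinations(ranking_a, 2):
        # The pair is concordant when both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical final rankings from two judge models on the same debate.
tau_same = kendall_tau(["a", "b", "c"], ["a", "b", "c"])
tau_reversed = kendall_tau(["a", "b", "c"], ["c", "b", "a"])
tau_one_swap = kendall_tau(["a", "b", "c"], ["a", "c", "b"])
```

A tau near 1 across judge pairs would be consistent with the reproducibility claim, while values near 0 or below would count against it.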

Figures

Figures reproduced from arXiv: 2605.19141 by Antonio Orvieto, Diganta Misra, Rediet Abebe, Volkan Cevher.

**Figure 2.** Convergence vs. attack-graph density for GRASP and GRASP- [...]

**Figure 3.** Attack graphs for the same debate under different judge models ( [...]

**Figure 4.** Rank dynamics of volatile arguments for mt 048 x-ai grok-4 under GRASP using the attack graph induced by openai/gpt-5.2-chat. Rank dynamics under GRASP. Using the attack graph produced by openai/gpt-5.2-chat, we track GRASP scores over iterations and visualize the rank trajectories of the most volatile arguments ( [...]

**Figure 5.** Toy argumentation graph illustrating non [...]

**Figure 6.** Canonical structural archetypes used in syn [...]

**Figure 7.** Attack graphs induced by the same judge model, [...]

**Figure 8.** Relationship between similarity of induced attack graphs [...]

Original abstract

Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging (a common LLM-as-a-Judge practice where a model provides a global verdict on a debate) suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate's complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack-defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human "convincingness" labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency, a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRASP turns local LLM attack/support judgments into global rankings via graph propagation, but convergence on cyclic graphs needs explicit checks.


The main thing to know is that GRASP builds an explicit interaction graph from pairwise LLM judgments on attacks and supports, then applies a deterministic propagation operator to produce a global ranking. It claims this yields more reproducible results than holistic LLM verdicts and measures structural sufficiency rather than persuasion or convincingness to humans. That separation is the clearest practical angle here, especially for settings where you want auditable robustness over rhetorical appeal.

The approach looks new as a specific aggregation rule for argument graphs in the LLM-as-a-Judge literature. It does a solid job documenting inter-model disagreement on holistic scores and showing that local interaction labels hold up better across models. The sociotechnical point about not conflating structure with human convincingness is worth keeping in mind for AI safety or deliberative system work.

On the soft spots, the abstract leans heavily on 'we show' statements without numbers, error bars, or the actual operator math. The stress-test note on convergence is worth taking seriously: if the propagation lacks a clear fixed-point guarantee or behaves differently on graphs with cycles and inconsistent labels, the reproducibility advantage could shrink or depend on initialization. I would check the full paper for the iteration details and any tests on realistic cyclic debates. The citation pattern is standard and does not raise red flags.

This is for readers building or evaluating automated argument systems, particularly those focused on consistency in LLM judges or graph methods for debate analysis. Someone looking for a concrete alternative to opaque holistic scoring would get direct value. It deserves a serious referee because the motivation is timely and the framework is spelled out enough to review properly. I would send it to peer review to get the convergence claims and empirical results examined.

Referee Report

3 major / 1 minor

Summary. The paper proposes GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates local pairwise LLM judgments of attacks and supports in an argument interaction graph via a propagation operator to produce global argument rankings. It claims that local interaction judgments are more reproducible than holistic LLM-as-a-Judge verdicts, that the resulting GRASP rankings are more consistent, and that GRASP scores capture structural sufficiency rather than correlating with human convincingness or rhetorical appeal.

Significance. If the reproducibility claims and convergence properties hold with supporting evidence, GRASP would provide a transparent, auditable alternative to opaque holistic LLM judging by explicitly separating argumentative structure from persuasion. The distinction between structural robustness and human convincingness labels is a useful sociotechnical observation, though its impact depends on validation across models and graph structures.

major comments (3)

Abstract: The central claim that 'local interaction judgments are more reproducible than holistic rankings' is asserted without any quantitative results, agreement metrics (e.g., Cohen's kappa or Krippendorff's alpha), error bars, or dataset details; this absence makes the reproducibility advantage impossible to evaluate from the provided text.
Abstract: The attack-defense propagation operator is described as 'convergent' and producing a 'unique ranking,' yet no fixed-point theorem, contraction-mapping argument, initialization independence proof, or explicit handling of cycles (mutual attacks or support loops) is supplied; without these, the operator may depend on starting values or fail to yield a unique attractor on cyclic graphs, directly undermining the determinism and reproducibility claims.
Abstract: The statement that GRASP scores 'do not correlate with human convincingness labels' is presented without the correlation coefficient, sample size, or statistical test used; this weakens the claim that GRASP measures structural sufficiency rather than model-specific artifacts.
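
For readers unfamiliar with the agreement metrics named in the first major comment, the following is a minimal sketch of Cohen's kappa applied to two judges' labels on the same candidate edges. The label names (`"attack"`, `"none"`) and the four-edge example are invented for illustration and are not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items on which both judges agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n

    # Expected agreement if each judge labelled independently
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both judges used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two judge models on the same four argument pairs.
judge_1 = ["attack", "attack", "none", "none"]
judge_2 = ["attack", "none", "none", "none"]
kappa = cohens_kappa(judge_1, judge_2)
print(kappa)  # 0.5
```

Here the judges agree on three of four edges (0.75 observed), but half of that agreement is expected by chance, so kappa is 0.5.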

minor comments (1)

Abstract: The acronym expansion 'Gradual Ranking with Attacks and Support Propagation' is clear, but the manuscript should define the propagation operator formally (e.g., as an iterative update rule) in the main text with explicit notation for attack and support weights.
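
To make the requests for a formal update rule and a convergence argument concrete, here is a generic sketch of the kind of statement a contraction-mapping argument would supply. All notation ($s$, $F$, $q$, $b$, $\alpha$, $W$, $S$, $A$) is hypothetical and chosen for illustration only; the summary does not give the paper's operator, and the affine example is not claimed to be GRASP.

```latex
% Generic iterative update on a score vector for n arguments
s^{(t+1)} = F\bigl(s^{(t)}\bigr), \qquad s^{(0)} \in X \subseteq \mathbb{R}^n .

% Banach fixed-point theorem: if X is closed, F(X) \subseteq X, and
\| F(x) - F(y) \| \le q \, \| x - y \| \quad \text{for all } x, y \in X, \qquad 0 \le q < 1,

% then F has a unique fixed point s^* = F(s^*) in X, and for every start s^{(0)} \in X
\| s^{(t)} - s^* \| \le q^{\,t} \, \| s^{(0)} - s^* \| .

% Illustrative affine instance (an assumption, not the paper's rule):
% base scores b, support matrix S, attack matrix A, damping \alpha
F(s) = b + \alpha W s, \qquad W = S - A,

% which is a contraction in the sup norm on X = \mathbb{R}^n whenever
|\alpha| \, \| W \|_\infty < 1 .
```

Under a condition of this kind, cycles in the graph pose no special difficulty, because uniqueness and independence from initialization follow from the contraction bound rather than from the graph being acyclic.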

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify opportunities to make the abstract more self-contained while preserving the paper's core claims. We address each major comment below and will revise the abstract and, where appropriate, add supporting details or formal elements to the main text or appendix.


Referee: Abstract: The central claim that 'local interaction judgments are more reproducible than holistic rankings' is asserted without any quantitative results, agreement metrics (e.g., Cohen's kappa or Krippendorff's alpha), error bars, or dataset details; this absence makes the reproducibility advantage impossible to evaluate from the provided text.

Authors: We agree that the abstract would be strengthened by including a concise reference to the quantitative evidence. The full manuscript reports these results in Section 4, including agreement metrics (Cohen's kappa and Krippendorff's alpha) computed over multiple LLMs and datasets, with error bars from repeated trials. In the revision we will add a brief clause to the abstract summarizing the reproducibility improvement and directing readers to the experimental section for full metrics and dataset descriptions. revision: yes
Referee: Abstract: The attack-defense propagation operator is described as 'convergent' and producing a 'unique ranking,' yet no fixed-point theorem, contraction-mapping argument, initialization independence proof, or explicit handling of cycles (mutual attacks or support loops) is supplied; without these, the operator may depend on starting values or fail to yield a unique attractor on cyclic graphs, directly undermining the determinism and reproducibility claims.

Authors: The manuscript currently supports convergence through extensive empirical evaluation on graphs containing cycles (Section 3). We acknowledge that a formal fixed-point argument is not present in the submitted version. In the revised manuscript we will add a short appendix containing a proof sketch based on contraction properties for the propagation operator and explicit handling of cycles via stabilization, thereby addressing the concern about initialization dependence and uniqueness. revision: yes
Referee: Abstract: The statement that GRASP scores 'do not correlate with human convincingness labels' is presented without the correlation coefficient, sample size, or statistical test used; this weakens the claim that GRASP measures structural sufficiency rather than model-specific artifacts.

Authors: The full paper presents this analysis in Section 5, reporting Pearson and Spearman correlation coefficients near zero together with sample size and p-values from the statistical test. We will revise the abstract to include a short parenthetical reference to these statistics so that the sociotechnical distinction between structural sufficiency and human convincingness is supported directly in the abstract. revision: yes
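
The Spearman coefficient mentioned in the last response is the Pearson correlation of rank-transformed values. A minimal pure-Python sketch follows. The helper names are hypothetical, and the example uses toy numbers rather than anything reported in the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy inputs only: structural scores vs. made-up human labels.
rho_up = spearman([0.1, 0.4, 0.9], [1, 2, 3])
rho_down = spearman([1, 2, 3], [3, 2, 1])
tied = average_ranks([10, 20, 20, 30])
```

A coefficient near zero on real data would support the claim that the structural score and human convincingness measure different things, provided the sample size and test are reported alongside it.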

Circularity Check

0 steps flagged

No significant circularity in GRASP derivation


The paper defines GRASP as a deterministic aggregation of local pairwise interaction judgments via a convergent attack-defense propagation operator. No equations or claims in the provided text reduce the operator's convergence or the resulting global ranking to a fitted parameter, self-citation chain, or definitional tautology. The reproducibility advantage over holistic judging is presented as an empirical observation rather than a constructed equivalence, and the distinction from human convincingness labels is explicitly non-correlative. The framework is therefore self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the stability of local LLM interaction judgments and the convergence properties of the propagation operator; no free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption: Local pairwise judgments on attacks and supports are stable and more reproducible than holistic verdicts.
Rationale: This premise is required for the claim that GRASP produces more consistent global rankings.
domain assumption: The attack-defense propagation operator converges to a unique ranking.
Rationale: Convergence is invoked to guarantee deterministic global output from local inputs.
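
The second assumption can at least be probed empirically on cyclic graphs. The sketch below runs a stand-in update (the h-categoriser rule, not GRASP's operator) from several random starting points on a small graph with a mutual attack and checks that all runs land on the same scores. The graph, the seed, the tolerance, and the name `iterate` are assumptions for illustration.

```python
import random
from typing import Dict, List, Tuple


def iterate(
    attackers: Dict[str, List[str]],
    init: Dict[str, float],
    tol: float = 1e-12,
    max_iter: int = 1000,
) -> Tuple[Dict[str, float], bool]:
    """Apply s(a) = 1 / (1 + sum of attacker scores) from a given start.

    Returns the final scores and whether the tolerance was reached.
    """
    scores = dict(init)
    for _ in range(max_iter):
        new_scores = {
            a: 1.0 / (1.0 + sum(scores[b] for b in attackers[a]))
            for a in attackers
        }
        delta = max(abs(new_scores[a] - scores[a]) for a in attackers)
        scores = new_scores
        if delta < tol:
            return scores, True
    return scores, False


# Hypothetical cyclic graph: a and b attack each other, and a also attacks c.
attackers = {"a": ["b"], "b": ["a"], "c": ["a"]}

rng = random.Random(0)  # fixed seed so the check is repeatable
runs = [
    iterate(attackers, {name: rng.random() for name in attackers})
    for _ in range(5)
]

all_converged = all(converged for _, converged in runs)
# Largest disagreement between any run and the first run, over all arguments.
spread = max(
    abs(scores[name] - runs[0][0][name])
    for scores, _ in runs
    for name in attackers
)
```

For this particular rule the mutual-attack pair settles at the solution of s = 1 / (1 + s), about 0.618, regardless of the start. A check of this kind is evidence, not a proof, and says nothing about how the paper's own operator behaves.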



Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages · 6 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Ranking-based semantics for argumentation frame- works

Leila Amgoud and Jonathan Ben-Naim. Ranking-based semantics for argumentation frame- works. InInternational Conference on Scalable Uncertainty Management, pages 134–147. Springer, 2013

work page 2013
[3]

Claude haiku 4.5 system card

Anthropic. Claude haiku 4.5 system card. Technical report, Anthropic, October 2025. URL https://www.anthropic.com/claude-haiku-4-5-system-card

work page 2025
[4]

Claude opus 4.5 system card

Anthropic. Claude opus 4.5 system card. Technical report, Anthropic, November 2025. URL https://www.anthropic.com/claude-opus-4-5-system-card . Accessed: 2026-05- 05

work page 2025
[5]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

An introduction to argumentation semantics.The knowledge engineering review, 26(4):365–410, 2011

Pietro Baroni, Martin Caminada, and Massimiliano Giacomin. An introduction to argumentation semantics.The knowledge engineering review, 26(4):365–410, 2011

work page 2011
[7]

On the input/output behavior of argumentation frameworks.Artificial Intelligence, 217:144–197, 2014

Pietro Baroni, Guido Boella, Federico Cerutti, Massimiliano Giacomin, Leendert Van Der Torre, and Serena Villata. On the input/output behavior of argumentation frameworks.Artificial Intelligence, 217:144–197, 2014

work page 2014
[8]

How many properties do we need for gradual argumentation? InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

Pietro Baroni, Antonio Rago, and Francesca Toni. How many properties do we need for gradual argumentation? InProceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

work page 2018
[9]

Audiences in argumentation frameworks.Artificial Intelligence, 171(1):42–71, 2007

Trevor JM Bench-Capon, Sylvie Doutre, and Paul E Dunne. Audiences in argumentation frameworks.Artificial Intelligence, 171(1):42–71, 2007

work page 2007
[10]

An extension-based argument- ranking semantics: Social rankings in abstract argumentation

Lars Bengel, Giovanni Buraglio, Jan Maly, and Kenneth Skiba. An extension-based argument- ranking semantics: Social rankings in abstract argumentation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 14790–14797, 2025

work page 2025
[11]

A logic-based theory of deductive arguments.Artificial Intelligence, 128(1-2):203–235, 2001

Philippe Besnard and Anthony Hunter. A logic-based theory of deductive arguments.Artificial Intelligence, 128(1-2):203–235, 2001

work page 2001
[12]

Power index-based semantics for ranking arguments in abstract argumentation frameworks.Intelligenza Artificiale, 13(2):137–154, 2020

Stefano Bistarelli and Carlo Taticchi. Power index-based semantics for ranking arguments in abstract argumentation frameworks.Intelligenza Artificiale, 13(2):137–154, 2020

work page 2020
[13]

A comparative study of ranking-based semantics for abstract argumentation

Elise Bonzon, J´erˆome Delobelle, S´ebastien Konieczny, and Nicolas Maudet. A comparative study of ranking-based semantics for abstract argumentation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016

work page 2016
[14]

The snli corpus

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. The snli corpus. 2015

work page 2015
[15]

Must read: A systematic survey of computational persuasion, 2025

Nimet Beyza Bozdag, Shuhaib Mehri, Xiaocheng Yang, Hyeonjeong Ha, Zirui Cheng, Esin Durmus, Jiaxuan You, Heng Ji, Gokhan Tur, and Dilek Hakkani-T¨ur. Must read: A systematic survey of computational persuasion, 2025. URLhttps://arxiv.org/abs/2505.07775

work page arXiv 2025
[16]

Using thematic analysis in psychology.Qualitative research in psychology, 3(2):77–101, 2006

Virginia Braun and Victoria Clarke. Using thematic analysis in psychology.Qualitative research in psychology, 3(2):77–101, 2006

work page 2006
[17]

Francesco Bullo, 2022

Francesco Bullo.Contraction theory for dynamical systems. Francesco Bullo, 2022. 11

work page 2022
[18]

Graduality in argumentation.Journal of Artificial Intelligence Research, 23:245–297, 2005

Claudette Cayrol and Marie-Christine Lagasquie-Schiex. Graduality in argumentation.Journal of Artificial Intelligence Research, 23:245–297, 2005

work page 2005
[19]

Ampersand: Argument mining for persuasive online discussions

Tuhin Chakrabarty, Christopher Hidey, Smaranda Muresan, Kathleen Mckeown, and Alyssa Hwang. Ampersand: Argument mining for persuasive online discussions. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2933–2943, 2019

work page 2019
[20]

arXiv preprint arXiv:2508.18076 , year=

Khaoula Chehbouni, Mohammed Haddou, Jackie Chi Kit Cheung, and Golnoosh Farnadi. Nei- ther valid nor reliable? investigating the use of llms as judges.arXiv preprint arXiv:2508.18076, 2025

work page arXiv 2025
[21]

Exploring the potential of large language models in computational argumentation

Guizhen Chen, Liying Cheng, Luu Anh Tuan, and Lidong Bing. Exploring the potential of large language models in computational argumentation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2309–2330, 2024

work page 2024
[22]

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, and Timothy T. Rogers. Debate: A large-scale benchmark for role-playing llm agents in multi-agent, long-form debates, 2025. URLhttps://arxiv.org/abs/2510.25110

work page arXiv 2025
[23]

Evaluating arguments and making meta-arguments.Informal Logic, 21(2), 2001

Daniel H Cohen. Evaluating arguments and making meta-arguments.Informal Logic, 21(2), 2001

work page 2001
[24]

T Edward Damer.Attacking faulty reasoning

work page
[25]

On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games.Artificial intelligence, 77(2):321–357, 1995

Phan Minh Dung. On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games.Artificial intelligence, 77(2):321–357, 1995

work page 1995
[26]

Exploring the role of prior beliefs for argument persuasion

Esin Durmus and Claire Cardie. Exploring the role of prior beliefs for argument persuasion. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1035–1045, 2018

work page 2018
[27]

A corpus for modeling user and language effects in argumenta- tion on online debating.arXiv preprint arXiv:1906.11310, 2019

Esin Durmus and Claire Cardie. A corpus for modeling user and language effects in argumenta- tion on online debating.arXiv preprint arXiv:1906.11310, 2019

work page arXiv 1906
[28]

Equilibrium states in numerical argumentation networks.Logica Universalis, 9(4):411–473, 2015

Dov M Gabbay and Odinaldo Rodrigues. Equilibrium states in numerical argumentation networks.Logica Universalis, 9(4):411–473, 2015

work page 2015
[29]

Gemini 3 flash model card

Gemini Team, Google. Gemini 3 flash model card. Technical report, Google DeepMind, December 2025. URL https://storage.googleapis.com/deepmind-media/Model-C ards/Gemini-3-Flash-Model-Card.pdf. Accessed: 2026-05-05

work page 2025
[30]

Routledge, 2017

Barney Glaser and Anselm Strauss.Discovery of grounded theory: Strategies for qualitative research. Routledge, 2017

work page 2017
[31]

Assessing the sufficiency of arguments through conclusion generation

Timon Gurcke, Milad Alshomary, and Henning Wachsmuth. Assessing the sufficiency of arguments through conclusion generation. In Khalid Al-Khatib, Yufang Hou, and Manfred Stede, editors,Proceedings of the 8th Workshop on Argument Mining, pages 67–77, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.186 53/v1/...

work page 2021
[32]

Explaining length bias in llm-based preference evaluations.arXiv preprint arXiv:2407.01085, 2024

Zhengyu Hu, Linxin Song, Jieyu Zhang, Zheyuan Xiao, Tianfu Wang, Zhengyu Chen, Nicholas Jing Yuan, Jianxun Lian, Kaize Ding, and Hui Xiong. Explaining length bias in llm-based preference evaluations.arXiv preprint arXiv:2407.01085, 2024

work page arXiv 2024
[33]

A new status index derived from sociometric analysis.Psychometrika, 18(1):39–43, 1953

Leo Katz. A new status index derived from sociometric analysis.Psychometrika, 18(1):39–43, 1953

work page 1953
[34]

Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, and Goran Nenadic

Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, and Goran Nenadic. Large language models in argument mining: A survey.arXiv preprint arXiv:2506.16383, 2025. 12

work page arXiv 2025
[35]

Exploring the role of argument structure in online debate persuasion.arXiv preprint arXiv:2010.03538, 2020

Jialu Li, Esin Durmus, and Claire Cardie. Exploring the role of argument structure in online debate persuasion.arXiv preprint arXiv:2010.03538, 2020

work page arXiv 2010
[36]

Argumentation computation with large language models: A benchmark study.arXiv preprint arXiv:2412.16725, 2024

Zhaoqun Li, Xiaotong Fang, Chen Chen, Mengze Li, and Beishui Liao. Argumentation computation with large language models: A benchmark study.arXiv preprint arXiv:2412.16725, 2024

work page arXiv 2024
[37]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models.arXiv preprint arXiv:2211.09110, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[38]

Abstract weighted based gradual semantics in argumentation theory.arXiv preprint arXiv:2401.11472, 2024

Assaf Libman, Nir Oren, and Bruno Yun. Abstract weighted based gradual semantics in argumentation theory.arXiv preprint arXiv:2401.11472, 2024

work page arXiv 2024
[39]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Aligning with human judgement: The role of pairwise preference in large language model evaluators.arXiv preprint arXiv:2403.16950, 2024

Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vuli´c, Anna Korhonen, and Nigel Collier. Aligning with human judgement: The role of pairwise preference in large language model evaluators.arXiv preprint arXiv:2403.16950, 2024

work page arXiv 2024
[41]

Nora McDonald, Sarita Schoenebeck, and Andrea Forte. Reliability and inter-rater reliability in qualitative research: Norms and guidelines for cscw and hci practice.Proceedings of the ACM on human-computer interaction, 3(CSCW):1–23, 2019

work page 2019
[42]

The llama 4 herd: The beginning of a new era of natively multimodal ai innovation

Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ , April 2025. Accessed: 2026-05-05

work page 2025
[43]

Unveiling the power of argument arrangement in online persuasive discussions

Nailia Mirzakhmedova, Johannes Kiesel, Khalid Al Khatib, and Benno Stein. Unveiling the power of argument arrangement in online persuasive discussions. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 15659–15671, 2023

work page 2023
[44]

Are large lan- guage models reliable argument quality annotators? InConference on Advances in Robust Argumentation Machines, pages 129–146

Nailia Mirzakhmedova, Marcel Gohsen, Chia Hao Chang, and Benno Stein. Are large lan- guage models reliable argument quality annotators? InConference on Advances in Robust Argumentation Machines, pages 129–146. Springer, 2024

work page 2024
[45]

Mistral small creative model card

Mistral AI Team. Mistral small creative model card. https://docs.mistral.ai/models/m odel-cards/mistral-small-creative-25-12, December 2025. Accessed: 2026-05-05

work page 2025
[46]

Adversarial nli: A new benchmark for natural language understanding

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial nli: A new benchmark for natural language understanding. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 4885–4901, 2020

work page 2020
[47]

Probing neural network comprehension of natural language arguments.arXiv preprint arXiv:1907.07355, 2019

Timothy Niven and Hung-Yu Kao. Probing neural network comprehension of natural language arguments.arXiv preprint arXiv:1907.07355, 2019

work page arXiv 1907
[48]

Update to gpt-5 system card: Gpt-5.2

OpenAI. Update to gpt-5 system card: Gpt-5.2. Technical report, OpenAI, December 2025. URL https://openai.com/index/gpt-5-system-card-update-gpt-5-2/ . Accessed: 2026-05-05

work page 2025
[49]

Inferring attack relations for gradual semantics.Argument & Computation, 14(3):327–345, 2023

Nir Oren and Bruno Yun. Inferring attack relations for gradual semantics.Argument & Computation, 14(3):327–345, 2023

work page 2023
[50]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[51]

Towards debate automation: a recurrent model for predicting debate winners

Peter Potash and Anna Rumshisky. Towards debate automation: a recurrent model for predicting debate winners. InProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2465–2475, 2017. 13

work page 2017
[52]

Ranking passages for argument convinc- ingness

Peter Potash, Adam Ferguson, and Timothy J Hazen. Ranking passages for argument convinc- ingness. InProceedings of the 6th Workshop on Argument Mining, pages 146–155, 2019

work page 2019
[53]

Qwen3-max: Just scale it

Qwen Team. Qwen3-max: Just scale it. https://qwen.ai/blog?id=qwen3-max, September

work page
[54]

Accessed: 2026-05-05

work page 2026
[55]

On gradual semantics for assumption-based argumentation.arXiv preprint arXiv:2507.10076, 2025

Anna Rapberger, Fabrizio Russo, Antonio Rago, and Francesca Toni. On gradual semantics for assumption-based argumentation.arXiv preprint arXiv:2507.10076, 2025

work page arXiv 2025
[56]

Can language models recognize convincing arguments?arXiv preprint arXiv:2404.00750, 2024

Paula Rescala, Manoel Horta Ribeiro, Tiancheng Hu, and Robert West. Can language models recognize convincing arguments?arXiv preprint arXiv:2404.00750, 2024

work page arXiv 2024
[57]

Can llms judge de- bates? evaluating non-linear reasoning via argumentation theory semantics.arXiv preprint arXiv:2509.15739, 2025

Reza Sanayei, Srdjan Vesic, Eduardo Blanco, and Mihai Surdeanu. Can llms judge de- bates? evaluating non-linear reasoning via argumentation theory semantics.arXiv preprint arXiv:2509.15739, 2025

work page arXiv 2025
[58]

Identifying argumentative discourse structures in persuasive essays

Christian Stab and Iryna Gurevych. Identifying argumentative discourse structures in persuasive essays. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 46–56, 2014

work page 2014
[59]

Large language models are in- consistent and biased evaluators

Rickard Stureborg, Dimitris Alikaniotis, and Yoshi Suhara. Large language models are incon- sistent and biased evaluators.arXiv preprint arXiv:2405.01724, 2024

work page arXiv 2024
[60]

Amir Taubenfeld, Yaniv Dover, Roi Reichart, and Ariel Goldstein. Systematic biases in LLM simulations of debates. arXiv preprint arXiv:2402.04049, 2024.

[61]

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2), pages 404–430, 2025.

[62]

Assaf Toledo, Shai Gretz, Edo Cohen-Karlik, Roni Friedman, Elad Venezian, Dan Lahav, Michal Jacovi, Ranit Aharonov, and Noam Slonim. Automatic argument quality assessment – new datasets and methods. arXiv preprint arXiv:1909.01007, 2019.

[63]

Henning Wachsmuth and Till Werner. Intrinsic quality assessment of arguments. arXiv preprint arXiv:2010.12473, 2020.

[64]

Henning Wachsmuth, Nona Naderi, Yufang Hou, Yonatan Bilu, Vinodkumar Prabhakaran, Tim Alberdingk Thijm, Graeme Hirst, and Benno Stein. Computational argumentation quality assessment in natural language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 176–187, 2017.

[65]

xAI Team. Grok 4 model card. Technical report, xAI, August 2025. URL https://data.x.ai/2025-08-20-grok-4-model-card.pdf. Accessed: 2026-05-05.

[66]

Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. MiMo-V2-Flash technical report. arXiv preprint arXiv:2601.02780, 2026.

[67]

Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, et al. Justice or prejudice? Quantifying biases in LLM-as-a-judge. arXiv preprint arXiv:2410.02736, 2024.

[68]

Puxuan Yu, Daniel Cohen, Hemank Lamba, Joel Tetreault, and Alejandro Jaimes. Explain then rank: Scale calibration of neural rankers using natural language explanations from LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22716–22730, 2025.

[69]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

This House would ban the use of AI in primary and secondary education.

This House would ban stablecoins pegged to national currencies.

This House would mandate all businesses to accept only digital payments.

This House would require electric vehicle manufacturers to refuse sales in countries with poor environmental records.

This House would allow individuals to erase morally distressing memories.

This House would ban facial recognition technology in public spaces.

This House would require social media companies to make their recommendation algorithms public.

Economics & Labor

This House would abolish the minimum wage law.

This House would allow the sale and purchase of human organs.

This House would ban sovereign wealth funds from investing in private equity.

This House would require companies to make the salaries of all their employees publicly available.