Can LLMs Help Decentralized Dispute Arbitration? A Case Study of UMA-Resolved Markets on Polymarket

Juncen Zhou; Junhao Wen; Junjie Huang

arxiv: 2604.15674 · v1 · submitted 2026-04-17 · 💻 cs.CY

Pith reviewed 2026-05-10 08:14 UTC · model grok-4.3

classification 💻 cs.CY

keywords large language models · dispute resolution · prediction markets · Polymarket · UMA · decentralized arbitration · Web3 · oracle mechanisms

The pith

Web-enabled LLMs achieve 89.58% agreement with UMA resolutions once disputes are raised on Polymarket.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can assist in resolving disputes in Web3 prediction markets like Polymarket, which relies on UMA for on-chain arbitration. It tests two capabilities: using event rules to predict which markets will face disputes in advance, and reproducing UMA's final decisions after a dispute has already been initiated. LLMs prove unreliable at the prediction task but reach 89.58% agreement with UMA outcomes once disputes occur, with stable results across repeated evaluations. This matters because disputed events on Polymarket have involved over 972 million dollars in trading volume, so faster or auxiliary resolution methods could reduce friction in collective forecasting systems.

Core claim

In a case study of UMA-resolved markets on Polymarket, web-enabled large language models were unable to predict which events would face disputes but achieved 89.58% agreement with UMA's final resolutions once disputes occurred, along with strong stability.

What carries the argument

Direct comparison of LLM-generated resolutions against UMA on-chain voting outcomes for disputed prediction market events, using web search to supply context equivalent to voter information.

If this is right

LLMs could function as an auxiliary input to support or accelerate UMA-style on-chain dispute resolution.
High agreement rates indicate that LLMs capture reasoning patterns similar to those used by UMA voters.
The inability to forecast disputes limits proactive applications but leaves post-initiation assistance viable.
Stable performance across multiple runs suggests the approach is robust to minor variations in event details.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid systems that combine LLM outputs with human or on-chain voting could lower costs in decentralized arbitration without replacing existing mechanisms.
The same evaluation method might apply to dispute resolution on other blockchain platforms that use oracle-based or voting-based arbitration.
Future work could test whether restricting web access or altering prompt formats changes the observed agreement level.

Load-bearing premise

Web-enabled LLMs receive the same relevant information and context as UMA voters and the studied disputed events form a representative sample for measuring agreement.
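One way to make this premise checkable is to log every retrieved document with a timestamp and discard anything published at or after the moment UMA voters decided. The sketch below is a minimal illustration under that assumption; the `Evidence` record, the `filter_to_cutoff` helper, the cutoff time, and the example URLs are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Evidence:
    """One retrieved document (hypothetical record, not from the paper)."""
    url: str
    published_at: datetime  # expected to be timezone-aware
    text: str


def filter_to_cutoff(evidence, cutoff):
    """Keep only items published strictly before the cutoff timestamp."""
    return [item for item in evidence if item.published_at < cutoff]


# Illustrative data only: a made-up resolution time and two made-up documents.
cutoff = datetime(2025, 7, 1, 12, 0, tzinfo=timezone.utc)
docs = [
    Evidence("https://example.com/before",
             datetime(2025, 6, 30, tzinfo=timezone.utc), "pre-vote report"),
    Evidence("https://example.com/after",
             datetime(2025, 7, 2, tzinfo=timezone.utc), "post-vote recap"),
]
visible = filter_to_cutoff(docs, cutoff)
print([item.url for item in visible])
```

A filter like this only bounds what the model is shown; it does not by itself prove that UMA voters saw the same material, since they may also have had evidence from the dispute process that no web search returns.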

What would settle it

A fresh collection of disputed Polymarket events where the same LLM evaluation process yields agreement with UMA resolutions well below 89.58%.
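"Well below" can be given a concrete meaning with a one-sided exact binomial test. The sketch below is one possible reading, not a procedure taken from the paper: it treats 89.58% as a fixed reference rate and asks how likely a fresh sample would be to show so few agreements by chance. The function name and the sample counts are hypothetical.

```python
import math


def binomial_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    if not (0 <= k <= n) or not (0.0 <= p <= 1.0):
        raise ValueError("need 0 <= k <= n and 0 <= p <= 1")
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k + 1))


REFERENCE_RATE = 0.8958  # agreement rate reported in the abstract

# Hypothetical fresh samples of 50 disputed events each (made-up counts).
p_low = binomial_lower_tail(30, 50, REFERENCE_RATE)
p_close = binomial_lower_tail(45, 50, REFERENCE_RATE)
print(f"30/50 agreements: p = {p_low:.2e}")
print(f"45/50 agreements: p = {p_close:.3f}")
```

Because the reference rate is itself an estimate from an unreported sample size, a two-sample comparison would be the more careful design once that sample size is known.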

Figures

Figures reproduced from arXiv: 2604.15674 by Juncen Zhou, Junhao Wen, Junjie Huang.

**Figure 2.** Distribution of Polymarket disputed Events by Cat

Original abstract

Web3 prediction markets, exemplified by Polymarket, have gained prominence for leveraging collective intelligence to forecast a wide range of social, political, and sports events. However, among the thousands of prediction market events, consensus disputes still arise due to imperfections in market mechanisms. On Polymarket alone, the trading volume involving disputed events has reached $972,370,804.71, underscoring the critical need for objective and efficient dispute resolution. In this study, we introduce large language models (LLMs) to: (1) evaluate whether web-enabled LLMs can reproduce the decision quality of UMA's on-chain voting process once a dispute has been raised, and (2) predict, based on event rules, which market events are likely to face future disputes before they occur. Our findings show that LLMs are unable to reliably predict which events will become disputed in advance; however, once a dispute is initiated, web-enabled LLMs achieve 89.58% agreement with UMA's final resolutions and demonstrate strong stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs reach 89.58% agreement with UMA resolutions once disputes occur but cannot predict them ahead, and the supporting details remain too thin to evaluate the claim. The paper contributes a concrete case study on live Polymarket data. It tests whether web-enabled LLMs can reproduce UMA outcomes after a dispute is raised and reports a clear negative result on pre-dispute prediction. That negative finding is useful because it shows a practical limit rather than claiming broad success. The stability observation across LLM runs also adds a small but usable data point for anyone examining consistency in automated arbitration.

The main weaknesses lie in the evaluation itself. The abstract states the agreement rate without reporting sample size, exact prompting method, agreement metric, or statistical tests. More critically, there is no confirmation that the LLMs received precisely the same facts, timestamps, and resolution criteria available to UMA voters at the time of decision. If the models were given post-dispute web summaries or broader search results, the 89.58% figure no longer measures substitution for human arbitration. The set of disputed events may also be small or biased toward cases with unusually clear public information, and no baseline agreement rate on non-disputed markets is provided.

Readers working on prediction-market design or AI applications in decentralized governance would find the numbers worth examining. The work is not theoretically novel but supplies an empirical anchor that can be checked against public data. It deserves peer review so that referees can request the missing methodological details and assess whether the information-parity assumption holds.

Referee Report

3 major / 2 minor

Summary. The paper examines whether web-enabled LLMs can assist in decentralized dispute resolution for Polymarket prediction markets arbitrated by UMA. It evaluates two tasks: (1) using LLMs to predict in advance which market events are likely to face disputes based on event rules, and (2) having LLMs resolve already-disputed events and comparing their outputs to UMA's final on-chain resolutions. The central empirical claims are that LLMs fail to reliably predict future disputes but achieve 89.58% agreement with UMA resolutions once disputes occur, along with strong stability across queries.

Significance. If the agreement result is shown to be robust, this would indicate that LLMs could provide a scalable, low-cost complement to on-chain voting mechanisms in Web3 platforms, potentially addressing the $972M+ in disputed trading volume noted in the abstract. The work is timely given the growth of prediction markets and highlights a concrete limitation (poor predictive power) alongside a potential strength (post-dispute resolution), offering an applied case study at the intersection of AI and decentralized governance.

major comments (3)

Abstract and results section reporting the 89.58% figure: the agreement rate is presented without the sample size N of disputed events, the exact prompting template and web-search parameters supplied to the LLMs, the formal definition of the agreement metric, or any statistical tests (e.g., confidence intervals or comparison to a baseline). This information is load-bearing for interpreting whether the result supports the claim that LLMs can reproduce UMA decision quality.
Methodology section describing LLM inputs: the paper does not demonstrate or verify that the web results and context provided to the LLMs are equivalent to the information and resolution criteria actually available to UMA voters at the time of each dispute. If LLMs receive post-dispute summaries or broader search scopes, the measured agreement no longer tests the intended substitution claim.
Results on dispute prediction: the claim that LLMs are 'unable to reliably predict' disputes lacks a clear baseline (e.g., random or majority-class accuracy) and details on the feature set or prompting strategy used for the prediction task, making it difficult to assess whether the negative result is informative or merely reflects an under-powered experimental design.
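The first comment asks for a defined agreement metric and an interval around it. A minimal sketch of what that could look like follows, assuming agreement means an exact match between the LLM's outcome label and UMA's resolution for each event, and using a Wilson score interval (a standard choice for proportions, not one the paper is known to use). The function names are hypothetical, and the counts 43 of 48 are made up for illustration: they happen to round to 89.58%, but the paper's actual sample size is not reported here.

```python
import math


def agreement_rate(llm_outcomes, uma_outcomes):
    """Return (matches, n, rate) for paired per-event outcome labels."""
    if len(llm_outcomes) != len(uma_outcomes) or not uma_outcomes:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == b for a, b in zip(llm_outcomes, uma_outcomes))
    return matches, len(uma_outcomes), matches / len(uma_outcomes)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not (0 <= successes <= n):
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts only: 43 matches out of 48 disputed events.
low, high = wilson_interval(43, 48)
print(f"rate = {43 / 48:.4f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With a sample of this hypothetical size the interval spans well over ten percentage points, which is why the referee treats N as load-bearing.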

minor comments (2)

The abstract and later text refer to 'strong stability' without defining the stability metric (e.g., agreement across repeated independent runs or variance in outputs).
Table or figure presenting the 89.58% result should include per-event breakdowns or error analysis to allow readers to identify systematic failure modes.
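For the stability comment, one candidate definition is run-to-run consistency: for each event, the share of repeated runs that return the most common answer, plus the share of events on which all runs agree. The sketch below illustrates that definition only; the paper may define stability differently, and the function name, event ids, and outcomes are hypothetical.

```python
from collections import Counter


def run_stability(runs_by_event):
    """Map each event to its modal-answer share; also return the unanimous share.

    runs_by_event: dict of event id -> list of outcome labels, one per run.
    """
    if not runs_by_event:
        raise ValueError("need at least one event")
    per_event = {}
    for event_id, outcomes in runs_by_event.items():
        if not outcomes:
            raise ValueError(f"event {event_id!r} has no runs")
        modal_count = Counter(outcomes).most_common(1)[0][1]
        per_event[event_id] = modal_count / len(outcomes)
    unanimous = sum(1 for share in per_event.values() if share == 1.0)
    return per_event, unanimous / len(per_event)


# Made-up outcomes from three repeated runs on two events.
runs = {
    "event-a": ["Yes", "Yes", "Yes"],
    "event-b": ["Yes", "No", "Yes"],
}
per_event, unanimous_share = run_stability(runs)
print(per_event, unanimous_share)
```

Reporting the per-event shares alongside the headline agreement rate would also serve the second minor comment, since events with split runs are natural candidates for error analysis.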

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important areas for improving the transparency and rigor of our empirical analysis. We address each major comment in detail below and indicate the revisions we will make to the manuscript.

Referee: Abstract and results section reporting the 89.58% figure: the agreement rate is presented without the sample size N of disputed events, the exact prompting template and web-search parameters supplied to the LLMs, the formal definition of the agreement metric, or any statistical tests (e.g., confidence intervals or comparison to a baseline). This information is load-bearing for interpreting whether the result supports the claim that LLMs can reproduce UMA decision quality.

Authors: We acknowledge that these methodological and statistical details are crucial for evaluating the robustness of the 89.58% agreement rate. In the revised manuscript, we will explicitly report the sample size of disputed events analyzed, provide the exact prompting templates and web-search parameters used for the LLMs, formally define the agreement metric as the percentage of cases where the LLM's resolved outcome matches UMA's on-chain resolution, and include statistical tests such as 95% confidence intervals and comparisons to baseline accuracies. These additions will be incorporated into both the abstract and the results section to enhance interpretability.

Revision: yes

Referee: Methodology section describing LLM inputs: the paper does not demonstrate or verify that the web results and context provided to the LLMs are equivalent to the information and resolution criteria actually available to UMA voters at the time of each dispute. If LLMs receive post-dispute summaries or broader search scopes, the measured agreement no longer tests the intended substitution claim.

Authors: This point raises an important issue about the validity of our experimental setup. We designed the web searches to use queries based on the event descriptions and restricted results to information available prior to the UMA resolution date where possible. However, we recognize that it is challenging to perfectly match the exact information available to UMA voters, who may have access to additional context from community discussions or specific evidence submitted during the dispute process. In the revision, we will expand the methodology section to detail our search strategy, including any date filters applied, provide sample inputs to the LLMs, and explicitly discuss this as a limitation in the paper's discussion section. We believe this clarification will help readers assess the strength of the substitution claim.

Revision: partial

Referee: Results on dispute prediction: the claim that LLMs are 'unable to reliably predict' disputes lacks a clear baseline (e.g., random or majority-class accuracy) and details on the feature set or prompting strategy used for the prediction task, making it difficult to assess whether the negative result is informative or merely reflects an under-powered experimental design.

Authors: We agree that providing baselines and additional details is necessary to substantiate the claim that LLMs cannot reliably predict disputes. In the revised manuscript, we will include comparisons to a random baseline (expected accuracy of 50% for binary classification) and a majority-class baseline, describe the feature set consisting of the event rules and associated market metadata, and outline the prompting strategy employed. Furthermore, we will report the prediction accuracies with appropriate statistical measures to demonstrate that the performance does not exceed chance levels in a meaningful way.

Revision: yes
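The baselines promised in the last response matter most when disputes are rare, because a classifier that always answers "not disputed" can then score high raw accuracy while detecting nothing. The sketch below illustrates this with made-up labels; the helper names are hypothetical, and balanced accuracy is added here as one common complement to raw accuracy, not as a metric the authors commit to.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Share of events where the predicted label equals the actual label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_class_baseline(actual):
    """Accuracy obtained by always predicting the most frequent label."""
    if not actual:
        raise ValueError("need at least one label")
    return Counter(actual).most_common(1)[0][1] / len(actual)


def balanced_accuracy(predicted, actual):
    """Mean of per-class recall, which is 0.5 for any constant binary guess."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    recalls = []
    for label in set(actual):
        indices = [i for i, a in enumerate(actual) if a == label]
        recalls.append(sum(predicted[i] == label for i in indices) / len(indices))
    return sum(recalls) / len(recalls)


# Made-up example: 1 disputed event out of 10, and a model that never flags one.
actual = [True] + [False] * 9
predicted = [False] * 10
print(accuracy(predicted, actual),
      majority_class_baseline(actual),
      balanced_accuracy(predicted, actual))
```

In this toy case the 50% random baseline quoted in the response would understate the bar to clear: raw accuracy should be compared against the majority-class rate, and balanced accuracy against 0.5.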

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential loops

The paper reports an empirical case study measuring LLM agreement (89.58%) with UMA resolutions on disputed Polymarket events and testing LLM ability to predict future disputes in advance. No equations, fitted parameters, ansatzes, or derivations are present. The central claim is a direct statistical comparison of outputs against an external oracle (UMA votes), with no step that reduces by construction to the inputs or to a self-citation chain. The second task (pre-dispute prediction) is reported as unsuccessful, further removing any risk of tautological prediction. This is a standard empirical evaluation without the circular patterns enumerated in the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or theoretical constructs; the work is an empirical case study.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Agostino Capponi et al. 2025. DAO-AI: Evaluating Collective Decision-Making with Web-Enabled LLM Agents. arXiv preprint arXiv:2510.21117 (2025)

[2] Yiling Chen and David Pennock. 2010. A survey of prediction market design. In Algorithmic Game Theory. 1–33

[3] Andrea Chiarelli et al. 2023. A Systematic Literature Review of Blockchain Oracles. Aalto University Publication Series (2023)

[4] Zheng Chu, Jingchang Chen, Qianglong Chen, Haotian Wang, Kun Zhu, Xiyuan Du, Weijiang Yu, Ming Liu, and Bing Qin. 2024. BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering. In ACL. 1229–1248

[5] Botao Amber Hu, Yuhan Liu, and Helena Rong. 2025. Trustless Autonomy: Self-Sovereign Large Language Model Agents in Decentralized Systems. arXiv preprint arXiv:2505.09757 (2025)

[6] Joel Khalili and Kate Knibbs. 2025. Volodymyr Zelensky’s Clothing Has Sparked a Polymarket Rebellion. https://www.wired.com/story/volodymyr-zelensky-suit-polymarket-rebellion

[7] Kartikay Kumar and Muhammad Khan. 2021. Decentralized oracles: A comprehensive survey. IEEE Access 9 (2021), 92272–92294

[8] Ollie Liu, Deqing Fu, Dani Yogatama, and Willie Neiswanger. 2025. DeLLMa: Decision Making Under Uncertainty with Large Language Models. In ICLR

[9] Othman et al. 2013. A practical liquidity-sensitive automated market maker. TEAC 1, 3 (2013), 1–25

[10] Amir Pasdar and Young Choon Lee. 2023. A Survey on Blockchain Oracle Implementation. Comput. Surveys 55, 12 (2023), 1–36

[11] Oriane Peter and Kate Devlin. 2025. Decentralising LLM Alignment: A Case for Context, Pluralism, and Participation. In AAAI

[12] Jaromir Savelka et al. 2023. Large Language Models in Legal Reasoning: A Study on Real-World Case Interpretation. Journal of Artificial Intelligence and Law (2023)

[13] Sofia Eleni Spatharioti et al. 2025. Effects of LLM-based Search on Decision Making: Speed, Accuracy, and Overreliance. In CHI. ACM

[14] UMA Protocol. 2024. Understanding UMA’s Optimistic Oracle and Its Governance Mechanisms

[15] Zhepei Wei, Wei-Lin Chen, and Yu Meng. 2025. InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales. In ICLR

[16] Justin Wolfers and Eric Zitzewitz. 2004. Prediction Markets. Journal of Economic Perspectives 18, 2 (2004), 107–126

[17] Qianqian Xie et al. 2024. FinBen: A Holistic Financial Benchmark for Large Language Models. In NeurIPS

[18] Shunyu Yao et al. 2024. Lawyer GPT: A Legal Large Language Model with Enhanced Domain Knowledge and Reasoning Capabilities. In Proceedings of the 2024 3rd International Symposium on Robotics, Artificial Intelligence and Information Engineering. 108–112
