Can LLMs Help Decentralized Dispute Arbitration? A Case Study of UMA-Resolved Markets on Polymarket
Pith reviewed 2026-05-10 08:14 UTC · model grok-4.3
The pith
Web-enabled LLMs achieve 89.58% agreement with UMA resolutions once disputes are raised on Polymarket.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a case study of UMA-resolved markets on Polymarket, web-enabled large language models were unable to predict which events would face disputes but achieved 89.58% agreement with UMA's final resolutions once disputes occurred, along with strong stability.
What carries the argument
Direct comparison of LLM-generated resolutions against UMA on-chain voting outcomes for disputed prediction market events, using web search to supply context equivalent to voter information.
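The comparison described above can be sketched as a match rate with an interval around it. This is a minimal illustration, assuming agreement means the share of disputed events where the LLM's outcome equals UMA's final resolution (the definition the authors adopt in their rebuttal). The function names and the toy outcome lists are hypothetical and are not taken from the paper.

```python
import math

def agreement_rate(llm_outcomes, uma_outcomes):
    """Fraction of events where the LLM outcome equals the UMA resolution."""
    if len(llm_outcomes) != len(uma_outcomes) or not uma_outcomes:
        raise ValueError("need two equal-length, non-empty lists")
    matches = sum(l == u for l, u in zip(llm_outcomes, uma_outcomes))
    return matches / len(uma_outcomes)

def wilson_interval(matches, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = matches / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Toy data only: five made-up disputed events, not the paper's sample.
llm = ["Yes", "No", "Yes", "Yes", "No"]
uma = ["Yes", "No", "No", "Yes", "No"]

rate = agreement_rate(llm, uma)        # 4 of 5 outcomes match
low, high = wilson_interval(4, len(uma))
print(f"agreement={rate:.2%}, 95% CI=({low:.2%}, {high:.2%})")
```

The Wilson interval is used here because it behaves reasonably for small samples and for proportions near 1; the paper may report a different interval.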
If this is right
- LLMs could function as an auxiliary input to support or accelerate UMA-style on-chain dispute resolution.
- High agreement rates indicate that LLMs capture reasoning patterns similar to those used by UMA voters.
- The inability to forecast disputes limits proactive applications but leaves post-initiation assistance viable.
- Stable performance across multiple runs suggests the approach is robust to minor variations in event details.
Where Pith is reading between the lines
- Hybrid systems that combine LLM outputs with human or on-chain voting could lower costs in decentralized arbitration without replacing existing mechanisms.
- The same evaluation method might apply to dispute resolution on other blockchain platforms that use oracle-based or voting-based arbitration.
- Future work could test whether restricting web access or altering prompt formats changes the observed agreement level.
Load-bearing premise
Web-enabled LLMs receive the same relevant information and context as UMA voters and the studied disputed events form a representative sample for measuring agreement.
What would settle it
A fresh collection of disputed Polymarket events where the same LLM evaluation process yields agreement with UMA resolutions well below 89.58%.
Original abstract
Web3 prediction markets, exemplified by Polymarket, have gained prominence for leveraging collective intelligence to forecast a wide range of social, political, and sports events. However, among the thousands of prediction market events, consensus disputes still arise due to imperfections in market mechanisms. On Polymarket alone, the trading volume involving disputed events has reached $972,370,804.71, underscoring the critical need for objective and efficient dispute resolution. In this study, we introduce large language models (LLMs) to: (1) evaluate whether web-enabled LLMs can reproduce the decision quality of UMA's on-chain voting process once a dispute has been raised, and (2) predict, based on event rules, which market events are likely to face future disputes before they occur. Our findings show that LLMs are unable to reliably predict which events will become disputed in advance; however, once a dispute is initiated, web-enabled LLMs achieve 89.58% agreement with UMA's final resolutions and demonstrate strong stability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines whether web-enabled LLMs can assist in decentralized dispute resolution for Polymarket prediction markets arbitrated by UMA. It evaluates two tasks: (1) using LLMs to predict in advance which market events are likely to face disputes based on event rules, and (2) having LLMs resolve already-disputed events and comparing their outputs to UMA's final on-chain resolutions. The central empirical claims are that LLMs fail to reliably predict future disputes but achieve 89.58% agreement with UMA resolutions once disputes occur, along with strong stability across queries.
Significance. If the agreement result is shown to be robust, this would indicate that LLMs could provide a scalable, low-cost complement to on-chain voting mechanisms in Web3 platforms, potentially addressing the $972M+ in disputed trading volume noted in the abstract. The work is timely given the growth of prediction markets and highlights a concrete limitation (poor predictive power) alongside a potential strength (post-dispute resolution), offering an applied case study at the intersection of AI and decentralized governance.
major comments (3)
- Abstract and results section reporting the 89.58% figure: the agreement rate is presented without the sample size N of disputed events, the exact prompting template and web-search parameters supplied to the LLMs, the formal definition of the agreement metric, or any statistical tests (e.g., confidence intervals or comparison to a baseline). This information is load-bearing for interpreting whether the result supports the claim that LLMs can reproduce UMA decision quality.
- Methodology section describing LLM inputs: the paper does not demonstrate or verify that the web results and context provided to the LLMs are equivalent to the information and resolution criteria actually available to UMA voters at the time of each dispute. If LLMs receive post-dispute summaries or broader search scopes, the measured agreement no longer tests the intended substitution claim.
- Results on dispute prediction: the claim that LLMs are 'unable to reliably predict' disputes lacks a clear baseline (e.g., random or majority-class accuracy) and details on the feature set or prompting strategy used for the prediction task, making it difficult to assess whether the negative result is informative or merely reflects an under-powered experimental design.
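The baseline comparison requested in the last major comment could look roughly like the following. It is a sketch under stated assumptions: dispute prediction is treated as binary classification, the majority-class rate is treated as a fixed reference accuracy, and an exact one-sided binomial tail is used as the test. The labels and predictions are toy values, and the helper names are hypothetical.

```python
import math
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): exact one-sided p-value."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Toy data only: 1 = event was disputed, 0 = not disputed.
labels = [0] * 8 + [1] * 2
preds = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

correct = sum(p == y for p, y in zip(preds, labels))
accuracy = correct / len(labels)
baseline = majority_baseline(labels)

# Chance of getting at least `correct` hits if true accuracy equalled the baseline.
p_value = binomial_tail(correct, len(labels), baseline)
print(f"accuracy={accuracy:.2f}, baseline={baseline:.2f}, p={p_value:.3f}")
```

Because disputes are rare relative to all markets, a majority-class baseline is usually the harder reference to beat than 50%, which is why it is shown here alongside raw accuracy.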
minor comments (2)
- The abstract and later text refer to 'strong stability' without defining the stability metric (e.g., agreement across repeated independent runs or variance in outputs).
- Table or figure presenting the 89.58% result should include per-event breakdowns or error analysis to allow readers to identify systematic failure modes.
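One possible reading of the undefined "stability" metric is self-consistency across repeated runs on the same event. The sketch below assumes that reading: per-event stability is the share of runs that return the most common answer, and the overall score is the mean over events. The paper may define stability differently, and the event keys and outputs are made up.

```python
from collections import Counter

def event_stability(run_outputs):
    """Share of runs that agree with the most frequent answer for one event."""
    counts = Counter(run_outputs)
    return counts.most_common(1)[0][1] / len(run_outputs)

def mean_stability(runs_by_event):
    """Average per-event stability over all events."""
    scores = [event_stability(runs) for runs in runs_by_event.values()]
    return sum(scores) / len(scores)

# Toy data only: three repeated runs per event.
runs = {
    "event_a": ["Yes", "Yes", "Yes"],   # fully stable
    "event_b": ["No", "Yes", "No"],     # two of three runs agree
}

overall = mean_stability(runs)
print(f"mean stability={overall:.3f}")
```

Reporting the per-event scores next to the mean would also support the per-event breakdown requested in the second minor comment.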
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important areas for improving the transparency and rigor of our empirical analysis. We address each major comment in detail below and indicate the revisions we will make to the manuscript.
- Referee: Abstract and results section reporting the 89.58% figure: the agreement rate is presented without the sample size N of disputed events, the exact prompting template and web-search parameters supplied to the LLMs, the formal definition of the agreement metric, or any statistical tests (e.g., confidence intervals or comparison to a baseline). This information is load-bearing for interpreting whether the result supports the claim that LLMs can reproduce UMA decision quality.
  Authors: We acknowledge that these methodological and statistical details are crucial for evaluating the robustness of the 89.58% agreement rate. In the revised manuscript, we will explicitly report the sample size of disputed events analyzed, provide the exact prompting templates and web-search parameters used for the LLMs, formally define the agreement metric as the percentage of cases where the LLM's resolved outcome matches UMA's on-chain resolution, and include statistical tests such as 95% confidence intervals and comparisons to baseline accuracies. These additions will be incorporated into both the abstract and the results section to enhance interpretability.
  Revision: yes
- Referee: Methodology section describing LLM inputs: the paper does not demonstrate or verify that the web results and context provided to the LLMs are equivalent to the information and resolution criteria actually available to UMA voters at the time of each dispute. If LLMs receive post-dispute summaries or broader search scopes, the measured agreement no longer tests the intended substitution claim.
  Authors: This point raises an important issue about the validity of our experimental setup. We designed the web searches to use queries based on the event descriptions and restricted results to information available prior to the UMA resolution date where possible. However, we recognize that it is challenging to perfectly match the exact information available to UMA voters, who may have access to additional context from community discussions or specific evidence submitted during the dispute process. In the revision, we will expand the methodology section to detail our search strategy, including any date filters applied, provide sample inputs to the LLMs, and explicitly discuss this as a limitation in the paper's discussion section. We believe this clarification will help readers assess the strength of the substitution claim.
  Revision: partial
- Referee: Results on dispute prediction: the claim that LLMs are 'unable to reliably predict' disputes lacks a clear baseline (e.g., random or majority-class accuracy) and details on the feature set or prompting strategy used for the prediction task, making it difficult to assess whether the negative result is informative or merely reflects an under-powered experimental design.
  Authors: We agree that providing baselines and additional details is necessary to substantiate the claim that LLMs cannot reliably predict disputes. In the revised manuscript, we will include comparisons to a random baseline (expected accuracy of 50% for binary classification) and a majority-class baseline, describe the feature set consisting of the event rules and associated market metadata, and outline the prompting strategy employed. Furthermore, we will report the prediction accuracies with appropriate statistical measures to demonstrate that the performance does not exceed chance levels in a meaningful way.
  Revision: yes
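The second response above mentions restricting search results to information available before the UMA resolution date. A minimal sketch of such a filter follows, assuming each search result carries a timezone-aware publication timestamp that may be missing. The `SearchResult` type, the `filter_by_cutoff` helper, and the example URLs are hypothetical and do not describe the authors' actual pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class SearchResult:
    url: str
    published_at: Optional[datetime]  # None when no reliable date is known

def filter_by_cutoff(results: List[SearchResult], cutoff: datetime,
                     keep_undated: bool = False) -> List[SearchResult]:
    """Keep results published strictly before the cutoff.

    Undated results are dropped by default, since they cannot be shown
    to predate the resolution.
    """
    kept = []
    for result in results:
        if result.published_at is None:
            if keep_undated:
                kept.append(result)
        elif result.published_at < cutoff:
            kept.append(result)
    return kept

# Toy data only: a made-up resolution date and placeholder URLs.
cutoff = datetime(2025, 1, 10, tzinfo=timezone.utc)
results = [
    SearchResult("https://example.com/before", datetime(2025, 1, 5, tzinfo=timezone.utc)),
    SearchResult("https://example.com/after", datetime(2025, 1, 12, tzinfo=timezone.utc)),
    SearchResult("https://example.com/undated", None),
]

allowed = filter_by_cutoff(results, cutoff)
print([r.url for r in allowed])
```

Dropping undated pages is the conservative choice for testing the substitution claim; keeping them trades leakage risk for more context, which is the kind of decision the expanded methodology section would need to state.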
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential loops
The paper reports an empirical case study measuring LLM agreement (89.58%) with UMA resolutions on disputed Polymarket events and testing LLM ability to predict future disputes in advance. No equations, fitted parameters, ansatzes, or derivations are present. The central claim is a direct statistical comparison of outputs against an external oracle (UMA votes), with no step that reduces by construction to the inputs or to a self-citation chain. The second task (pre-dispute prediction) is reported as unsuccessful, further removing any risk of tautological prediction. This is a standard empirical evaluation without the circular patterns enumerated in the guidelines.