CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency
Pith reviewed 2026-05-21 18:58 UTC · model grok-4.3
The pith
CryptoBench reveals that LLMs retrieve cryptocurrency data well but struggle with predictive analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CryptoBench is a live, expert-curated benchmark built around fifty monthly questions that mirror professional crypto analyst workflows and are grouped into four quadrants: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. Evaluation of ten LLMs shows a clear retrieval-prediction imbalance in which models proficient at gathering information from on-chain sources and dashboards display pronounced weakness once tasks require forecasting or synthesis.
What carries the argument
The four-quadrant categorization system that isolates data-gathering performance from predictive synthesis performance.
If this is right
- Agents can appear factually competent through retrieval while remaining weak at forecasting.
- A performance hierarchy among LLMs becomes visible once they operate inside agentic frameworks on crypto tasks.
- The benchmark isolates the exact point where current models fall short of analyst-level work.
Where Pith is reading between the lines
- Separate training on chain-of-thought forecasting could reduce the observed gap.
- Real deployments might pair retrieval agents with dedicated prediction modules or external models.
- Monthly updates make it possible to measure whether newer models close the imbalance over successive releases.
Load-bearing premise
The expert-designed questions and their placement into retrieval versus prediction quadrants accurately reflect the real challenges faced by professional cryptocurrency analysts.
What would settle it
If models achieve comparable accuracy on prediction tasks as on retrieval tasks when scored against actual market outcomes and on-chain events, the claimed imbalance would not hold.
Figures
read the original abstract
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: \emph{extreme time-sensitivity}, \emph{a highly adversarial information environment}, and the critical need to synthesize data from \emph{diverse, specialized sources}, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a \textit{retrieval-prediction imbalance}, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CryptoBench, a dynamic benchmark consisting of 50 expert-curated questions per month for evaluating LLM agents in cryptocurrency analysis. Questions are categorized into a four-quadrant framework (Simple Retrieval, Complex Retrieval, Simple Prediction, Complex Prediction) designed to mirror professional analyst workflows. Evaluations of ten LLMs, both standalone and in agentic setups, reveal a performance hierarchy and a retrieval-prediction imbalance in which models perform well on retrieval but struggle with predictive analysis.
Significance. If the quadrant assignments prove stable and the evaluations are reproducible, CryptoBench could fill a gap in domain-specific agent benchmarks by emphasizing time-sensitive, adversarial, and multi-source synthesis challenges unique to cryptocurrency. The dynamic monthly update mechanism and the explicit separation of retrieval versus prediction capabilities are potentially valuable contributions for guiding improvements in LLM agent reasoning.
major comments (1)
- [Benchmark Construction] Benchmark Construction section: The manuscript states that tasks were 'rigorously categorized' within the four-quadrant system by 'crypto-native professionals' but supplies neither the explicit labeling rubric, inter-rater reliability statistics, nor annotated example questions with justifications. This is load-bearing for the central retrieval-prediction imbalance claim; without evidence that the quadrant distinctions are stable and not curator-specific, observed weaknesses in 'Simple Prediction' or 'Complex Prediction' tasks could be reclassified as retrieval failures rather than genuine predictive deficits.
minor comments (2)
- [Abstract] Abstract: Key quantitative results (e.g., accuracy or F1 scores per quadrant and per model) are referenced but not reported, making it difficult for readers to gauge the magnitude of the reported imbalance from the summary alone.
- The description of the agentic framework could clarify which tools or retrieval mechanisms were provided to the agents and whether they had access to the same data sources used in the benchmark construction.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on CryptoBench. We address the major comment on benchmark construction and categorization below, and we will incorporate clarifications to strengthen the manuscript.
read point-by-point responses
-
Referee: [Benchmark Construction] Benchmark Construction section: The manuscript states that tasks were 'rigorously categorized' within the four-quadrant system by 'crypto-native professionals' but supplies neither the explicit labeling rubric, inter-rater reliability statistics, nor annotated example questions with justifications. This is load-bearing for the central retrieval-prediction imbalance claim; without evidence that the quadrant distinctions are stable and not curator-specific, observed weaknesses in 'Simple Prediction' or 'Complex Prediction' tasks could be reclassified as retrieval failures rather than genuine predictive deficits.
Authors: We agree that additional documentation on the categorization process would improve transparency and address potential concerns about subjectivity. The four-quadrant assignments were made by crypto-native professionals according to explicit internal criteria: retrieval tasks require locating and extracting factual information from sources such as on-chain data or news, while prediction tasks require synthesizing that information into forward-looking statements; simple versus complex distinctions were based on the number of distinct sources and reasoning steps involved. Although formal inter-rater reliability statistics were not computed or reported in the initial submission, assignments were reviewed for consistency by multiple domain experts. In the revised manuscript we will add the full labeling rubric, several annotated example questions with quadrant justifications, and a note on the review process used to ensure stability. These additions will allow readers to evaluate the robustness of the distinctions independently while preserving the observed retrieval-prediction performance gap, which appears consistently across models and monthly question sets. revision: yes
Circularity Check
No circularity: empirical benchmark with no derivations or fitted quantities
full rationale
This is an empirical benchmark paper that introduces CryptoBench with 50 monthly questions categorized by crypto-native professionals into four quadrants (Simple Retrieval, Complex Retrieval, Simple Prediction, Complex Prediction). The central claims concern observed performance hierarchies and a retrieval-prediction imbalance from evaluating ten LLMs. No equations, parameter fitting, self-citations as load-bearing premises, or derivation chains exist that could reduce outputs to inputs by construction. The task categorization is presented as a methodological design choice to mirror workflows, without any claim that it derives from or is forced by prior results in the paper itself. The work is therefore self-contained against external benchmarks, with claims resting on described evaluations rather than self-referential logic.
Axiom & Free-Parameter Ledger
free parameters (1)
- Monthly question volume
axioms (1)
- domain assumption Expert-curated questions by crypto-native professionals mirror actual analyst workflows
Forward citations
Cited by 1 Pith paper
-
LATTICE: Evaluating Decision Support Utility of Crypto Agents
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
Reference graph
Works this paper leans on
-
[1]
Browsecomp: A simple yet challenging benchmark for browsing agents, 2025
Jason Wei. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. 1, 3
work page 2025
-
[2]
GAIA: a benchmark for General AI Assistants
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023. 1, 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
React: Synergizing reasoning and acting in language models, 2022
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2022. 2
work page 2022
-
[4]
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. 2
work page 2023
-
[5]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Hariharan, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiangyue Liao, Bo Yuan, Yuxiao Yao, Hongwei Deng, Xiaohu Tang, Yu Zhang, Huiyu Du, Huiyu Tan, Xiyan Li, Jingyi Ge, Zhenjie Zhang, Jiaqi Zhou, Ningyu Zhou, Yingqi Zhang, Yuhui Zhang, Lei Fan, Jinlin Chu, Guanying Liu, Lixin Zhu, Y...
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [7]
- [8]
-
[9]
Subbalakshmi, Jimin Huang, Lingfei Qian, Xueqing Peng, Jordan W
Haohang Li, Yupeng Cao, Yangyang Yu, Shashidhar Reddy Javaji, Zhiyang Deng, Yueru He, Yuechen Jiang, Zining Zhu, K.p. Subbalakshmi, Jimin Huang, Lingfei Qian, Xueqing Peng, Jordan W. Suchow, and Qianqian Xie. Investorbench: A benchmark for financial decision-making tasks with llm-based agent, 2025. 3
work page 2025
-
[10]
Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Yixiao Tian, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, et al. Futurex: An advanced live benchmark for llm agents in future prediction.arXiv preprint arXiv:2508.11987, 2025. 3
-
[11]
Onchain execution benchmark v0.1
CAIBA. Onchain execution benchmark v0.1. https://www.caiba.ai/blogs/6, July 2025. Accessed: 2025- 10-19. 3
work page 2025
-
[12]
Crypto named entity recognition benchmark v0.1
CAIBA. Crypto named entity recognition benchmark v0.1. https://www.caiba.ai/blogs/5, July 2025. Accessed: 2025-10-19. 3
work page 2025
-
[13]
CryptoTrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading
Yuan Li, Bingqiao Luo, Qian Wang, Nuo Chen, Xu Liu, and Bingsheng He. CryptoTrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1094–1106, Miami, Florida, USA, November 2024...
work page 2024
-
[14]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. 3
work page 2024
-
[15]
Finance agent benchmark: Benchmarking llms on real-world financial research tasks, 2025
Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu. Finance agent benchmark: Benchmarking llms on real-world financial research tasks, 2025. 3
work page 2025
-
[16]
Tu Vu, Satyen Kale, Mohit Iyyer, Xuezhi Wang, Kazuma Hashimoto, Graham Neubig, Maarten Bosma, Quoc V . Le, Christopher D. Manning, Andrew M. Dai, and Daniel Sohn. Freshllms: Refreshing large language models with search engine augmentation, 2023. 3 14 CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents
work page 2023
-
[17]
Livebench: A challenging, contamination-limited llm benchmark, 2025
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, and et al. Livebench: A challenging, contamination-limited llm benchmark, 2025. 3
work page 2025
-
[18]
Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, and Houfeng Wang. Time: A multi-level benchmark for temporal reasoning of llms in real-world scenarios.arXiv preprint arXiv:2505.12891, 2025. 3
-
[19]
Evaluating llms on real-world forecasting against expert forecasters, 2025
Janna Lu. Evaluating llms on real-world forecasting against expert forecasters, 2025. 3, 13
work page 2025
-
[20]
Forecastbench: A dynamic benchmark of ai forecasting capabilities, 2025
Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, and Philip Tetlock. Forecastbench: A dynamic benchmark of ai forecasting capabilities, 2025. 4, 13
work page 2025
-
[21]
Futurebench: Evaluating agents’ future prediction capabilities, 2025
Together.ai. Futurebench: Evaluating agents’ future prediction capabilities, 2025. 4
work page 2025
-
[22]
Openep: Open-ended future event prediction,
Yong Guan, Hao Peng, Xiaozhi Wang, Lei Hou, and Juanzi Li. Openep: Open-ended future event prediction,
-
[23]
Petraq Nako and Adam Jatowt. Navigating tomorrow: Reliably assessing large language models performance on future event prediction, 2025. 4
work page 2025
-
[24]
Bizfinbench: A business-driven real-world financial benchmark for evaluating llms, 2025
Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, and Ji Liu. Bizfinbench: A business-driven real-world financial benchmark for evaluating llms, 2025. 4
work page 2025
-
[25]
Financeqa: A benchmark for evaluating financial analysis capabilities of large language models, 2025
Spencer Mateega, Carlos Georgescu, and Danny Tang. Financeqa: A benchmark for evaluating financial analysis capabilities of large language models, 2025. 4
work page 2025
-
[26]
Capabilities of gpt-5 across critical domains, 2025
OpenAI Team. Capabilities of gpt-5 across critical domains, 2025. 9
work page 2025
- [27]
- [28]
- [29]
-
[30]
System card addendum: Claude opus 4.1, 2025
Anthropic Team. System card addendum: Claude opus 4.1, 2025. 9
work page 2025
-
[31]
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,
-
[32]
Claude sonnet 4.5 system card, 2025
Anthropic Team. Claude sonnet 4.5 system card, 2025. 9
work page 2025
-
[33]
Google DeepMind Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 9
work page 2025
-
[34]
gpt-oss-120b & gpt-oss-20b model card, 2025
OpenAI Team. gpt-oss-120b & gpt-oss-20b model card, 2025. 9
work page 2025
-
[35]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 9
work page 2025
-
[36]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 9
work page 2023
-
[37]
smolagents: A smol library to build great agentic systems, 2025
Aymeric Roucher, A Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. smolagents: A smol library to build great agentic systems, 2025. 10
work page 2025
-
[38]
Pitfalls in evaluating language model forecasters, 2025
Daniel Paleka, Shashwat Goel, Jonas Geiping, and Florian Tramèr. Pitfalls in evaluating language model forecasters, 2025. 13 15
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.