pith. sign in

arxiv: 2512.00417 · v5 · pith:LMDI7SLNnew · submitted 2025-11-29 · 💻 cs.CL

CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency

Pith reviewed 2026-05-21 18:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM agentscryptocurrencybenchmarkretrievalpredictionperformance evaluationdynamic benchmark
0
0 comments X

The pith

CryptoBench reveals that LLMs retrieve cryptocurrency data well but struggle with predictive analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CryptoBench, a dynamic benchmark that releases 50 new questions each month to test LLM agents on cryptocurrency tasks. Questions are sorted into four quadrants that separate simple and complex retrieval from simple and complex prediction. Direct and agent-based tests of ten models uncover a consistent pattern where retrieval succeeds far more often than prediction. This distinction matters because cryptocurrency work demands both accurate data pulls and timely forecasts in a fast-changing, adversarial setting. The observed imbalance suggests agents can sound informed while lacking the synthesis needed for actual analysis.

Core claim

CryptoBench is a live, expert-curated benchmark built around fifty monthly questions that mirror professional crypto analyst workflows and are grouped into four quadrants: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. Evaluation of ten LLMs shows a clear retrieval-prediction imbalance in which models proficient at gathering information from on-chain sources and dashboards display pronounced weakness once tasks require forecasting or synthesis.

What carries the argument

The four-quadrant categorization system that isolates data-gathering performance from predictive synthesis performance.

If this is right

  • Agents can appear factually competent through retrieval while remaining weak at forecasting.
  • A performance hierarchy among LLMs becomes visible once they operate inside agentic frameworks on crypto tasks.
  • The benchmark isolates the exact point where current models fall short of analyst-level work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separate training on chain-of-thought forecasting could reduce the observed gap.
  • Real deployments might pair retrieval agents with dedicated prediction modules or external models.
  • Monthly updates make it possible to measure whether newer models close the imbalance over successive releases.

Load-bearing premise

The expert-designed questions and their placement into retrieval versus prediction quadrants accurately reflect the real challenges faced by professional cryptocurrency analysts.

What would settle it

If models achieve comparable accuracy on prediction tasks as on retrieval tasks when scored against actual market outcomes and on-chain events, the claimed imbalance would not hold.

Figures

Figures reproduced from arXiv: 2512.00417 by Chaoyang He, Jason Ge, Jiacheng Guo, Jiashuo Liu, Jia Tian, Kaixuan Huang, Kanghong Zhan, Lin Yang, Mengdi Wang, Nicholas Deng, Qixin Xiao, Suozhi Huang, Tianyi Li, Wenhao Huang, Xiaochen Liu, Yifan Zhang, Yifu Lu, Zihao Li, Zixin Yao.

Figure 1
Figure 1. Figure 1: Combined Evaluation Scores of LLM and SmolAgent Frameworks between October [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Statistics of the CryptoBench dataset between October [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The CryptoBench Four-Quadrant Task Classification System. Tasks are categorized along two axes: Com [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The CryptoBench Dataset Construction and Dynamic Update Pipeline. The top panel illustrates the rigorous [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overall Performance Comparison between October [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Performance Breakdown between October 12th to November 11th. (a) Comparison of model accuracy on Simple versus Complex tasks. (b) Comparison of accuracy on Retrieval versus Prediction tasks, revealing distinct model strengths. As shown in Figure 6a, all models perform better on Simple tasks than on Complex ones, which is expected. However, the performance degradation varies. Grok-4 (Web) maintains the high… view at source ↗
Figure 7
Figure 7. Figure 7: Performance Profile by Macro Category between October [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance Breakdown between October 12th to November 11th. Performance Breakdown between October 12th to November 11th. (a) Performance by Task Quadrant. (b) Performance by Investor Focus, showing distinct model abilities on different classes. 5.5 Qualitative Analysis of Failure Modes Through manual analysis of incorrect responses, we identified several recurring failure modes. Below, we detail each with… view at source ↗
read the original abstract

This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: \emph{extreme time-sensitivity}, \emph{a highly adversarial information environment}, and the critical need to synthesize data from \emph{diverse, specialized sources}, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a \textit{retrieval-prediction imbalance}, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces CryptoBench, a dynamic benchmark consisting of 50 expert-curated questions per month for evaluating LLM agents in cryptocurrency analysis. Questions are categorized into a four-quadrant framework (Simple Retrieval, Complex Retrieval, Simple Prediction, Complex Prediction) designed to mirror professional analyst workflows. Evaluations of ten LLMs, both standalone and in agentic setups, reveal a performance hierarchy and a retrieval-prediction imbalance in which models perform well on retrieval but struggle with predictive analysis.

Significance. If the quadrant assignments prove stable and the evaluations are reproducible, CryptoBench could fill a gap in domain-specific agent benchmarks by emphasizing time-sensitive, adversarial, and multi-source synthesis challenges unique to cryptocurrency. The dynamic monthly update mechanism and the explicit separation of retrieval versus prediction capabilities are potentially valuable contributions for guiding improvements in LLM agent reasoning.

major comments (1)
  1. [Benchmark Construction] Benchmark Construction section: The manuscript states that tasks were 'rigorously categorized' within the four-quadrant system by 'crypto-native professionals' but supplies neither the explicit labeling rubric, inter-rater reliability statistics, nor annotated example questions with justifications. This is load-bearing for the central retrieval-prediction imbalance claim; without evidence that the quadrant distinctions are stable and not curator-specific, observed weaknesses in 'Simple Prediction' or 'Complex Prediction' tasks could be reclassified as retrieval failures rather than genuine predictive deficits.
minor comments (2)
  1. [Abstract] Abstract: Key quantitative results (e.g., accuracy or F1 scores per quadrant and per model) are referenced but not reported, making it difficult for readers to gauge the magnitude of the reported imbalance from the summary alone.
  2. The description of the agentic framework could clarify which tools or retrieval mechanisms were provided to the agents and whether they had access to the same data sources used in the benchmark construction.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on CryptoBench. We address the major comment on benchmark construction and categorization below, and we will incorporate clarifications to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark Construction section: The manuscript states that tasks were 'rigorously categorized' within the four-quadrant system by 'crypto-native professionals' but supplies neither the explicit labeling rubric, inter-rater reliability statistics, nor annotated example questions with justifications. This is load-bearing for the central retrieval-prediction imbalance claim; without evidence that the quadrant distinctions are stable and not curator-specific, observed weaknesses in 'Simple Prediction' or 'Complex Prediction' tasks could be reclassified as retrieval failures rather than genuine predictive deficits.

    Authors: We agree that additional documentation on the categorization process would improve transparency and address potential concerns about subjectivity. The four-quadrant assignments were made by crypto-native professionals according to explicit internal criteria: retrieval tasks require locating and extracting factual information from sources such as on-chain data or news, while prediction tasks require synthesizing that information into forward-looking statements; simple versus complex distinctions were based on the number of distinct sources and reasoning steps involved. Although formal inter-rater reliability statistics were not computed or reported in the initial submission, assignments were reviewed for consistency by multiple domain experts. In the revised manuscript we will add the full labeling rubric, several annotated example questions with quadrant justifications, and a note on the review process used to ensure stability. These additions will allow readers to evaluate the robustness of the distinctions independently while preserving the observed retrieval-prediction performance gap, which appears consistently across models and monthly question sets. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted quantities

full rationale

This is an empirical benchmark paper that introduces CryptoBench with 50 monthly questions categorized by crypto-native professionals into four quadrants (Simple Retrieval, Complex Retrieval, Simple Prediction, Complex Prediction). The central claims concern observed performance hierarchies and a retrieval-prediction imbalance from evaluating ten LLMs. No equations, parameter fitting, self-citations as load-bearing premises, or derivation chains exist that could reduce outputs to inputs by construction. The task categorization is presented as a methodological design choice to mirror workflows, without any claim that it derives from or is forced by prior results in the paper itself. The work is therefore self-contained against external benchmarks, with claims resting on described evaluations rather than self-referential logic.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the design choice of 50 questions monthly and the assumption that expert curation produces representative tasks; no new physical entities are postulated and no parameters are fitted to data in the usual sense.

free parameters (1)
  • Monthly question volume
    Design choice of 50 questions per month to enable ongoing dynamic evaluation.
axioms (1)
  • domain assumption Expert-curated questions by crypto-native professionals mirror actual analyst workflows
    Invoked when constructing the benchmark to ensure tasks reflect real-world demands.

pith-pipeline@v0.9.0 · 5860 in / 1329 out tokens · 66366 ms · 2026-05-21T18:58:32.346280+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LATTICE: Evaluating Decision Support Utility of Crypto Agents

    cs.CR 2026-04 unverdicted novelty 6.0

    LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Browsecomp: A simple yet challenging benchmark for browsing agents, 2025

    Jason Wei. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. 1, 3

  2. [2]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants.arXiv preprint arXiv:2311.12983, 2023. 1, 2, 3

  3. [3]

    React: Synergizing reasoning and acting in language models, 2022

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2022. 2

  4. [4]

    O’Brien, Carrie J

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. 2

  5. [5]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Hariharan, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023. 2, 3

  6. [6]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiangyue Liao, Bo Yuan, Yuxiao Yao, Hongwei Deng, Xiaohu Tang, Yu Zhang, Huiyu Du, Huiyu Tan, Xiyan Li, Jingyi Ge, Zhenjie Zhang, Jiaqi Zhou, Ningyu Zhou, Yingqi Zhang, Yuhui Zhang, Lei Fan, Jinlin Chu, Guanying Liu, Lixin Zhu, Y...

  7. [7]

    Wei et al

    J. Wei et al. Caia: Benchmarking intelligence under fire in cryptocurrency markets.arXiv preprint arXiv:2510.00332, 2025. 3

  8. [8]

    Liu et al

    X. Liu et al. Agent market arena: Live multi-market trading benchmark for llm agents.arXiv preprint arXiv:2510.11695, 2025. 3

  9. [9]

    Subbalakshmi, Jimin Huang, Lingfei Qian, Xueqing Peng, Jordan W

    Haohang Li, Yupeng Cao, Yangyang Yu, Shashidhar Reddy Javaji, Zhiyang Deng, Yueru He, Yuechen Jiang, Zining Zhu, K.p. Subbalakshmi, Jimin Huang, Lingfei Qian, Xueqing Peng, Jordan W. Suchow, and Qianqian Xie. Investorbench: A benchmark for financial decision-making tasks with llm-based agent, 2025. 3

  10. [10]

    Futurex: An advanced live benchmark for llm agents in future prediction.arXiv preprint arXiv:2508.11987, 2025

    Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Yixiao Tian, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, et al. Futurex: An advanced live benchmark for llm agents in future prediction.arXiv preprint arXiv:2508.11987, 2025. 3

  11. [11]

    Onchain execution benchmark v0.1

    CAIBA. Onchain execution benchmark v0.1. https://www.caiba.ai/blogs/6, July 2025. Accessed: 2025- 10-19. 3

  12. [12]

    Crypto named entity recognition benchmark v0.1

    CAIBA. Crypto named entity recognition benchmark v0.1. https://www.caiba.ai/blogs/5, July 2025. Accessed: 2025-10-19. 3

  13. [13]

    CryptoTrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading

    Yuan Li, Bingqiao Luo, Qian Wang, Nuo Chen, Xu Liu, and Bingsheng He. CryptoTrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1094–1106, Miami, Florida, USA, November 2024...

  14. [14]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. 3

  15. [15]

    Finance agent benchmark: Benchmarking llms on real-world financial research tasks, 2025

    Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu. Finance agent benchmark: Benchmarking llms on real-world financial research tasks, 2025. 3

  16. [16]

    Le, Christopher D

    Tu Vu, Satyen Kale, Mohit Iyyer, Xuezhi Wang, Kazuma Hashimoto, Graham Neubig, Maarten Bosma, Quoc V . Le, Christopher D. Manning, Andrew M. Dai, and Daniel Sohn. Freshllms: Refreshing large language models with search engine augmentation, 2023. 3 14 CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents

  17. [17]

    Livebench: A challenging, contamination-limited llm benchmark, 2025

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, and et al. Livebench: A challenging, contamination-limited llm benchmark, 2025. 3

  18. [18]

    Time: A multi-level benchmark for temporal reasoning of llms in real-world scenarios.arXiv preprint arXiv:2505.12891, 2025

    Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, and Houfeng Wang. Time: A multi-level benchmark for temporal reasoning of llms in real-world scenarios.arXiv preprint arXiv:2505.12891, 2025. 3

  19. [19]

    Evaluating llms on real-world forecasting against expert forecasters, 2025

    Janna Lu. Evaluating llms on real-world forecasting against expert forecasters, 2025. 3, 13

  20. [20]

    Forecastbench: A dynamic benchmark of ai forecasting capabilities, 2025

    Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, and Philip Tetlock. Forecastbench: A dynamic benchmark of ai forecasting capabilities, 2025. 4, 13

  21. [21]

    Futurebench: Evaluating agents’ future prediction capabilities, 2025

    Together.ai. Futurebench: Evaluating agents’ future prediction capabilities, 2025. 4

  22. [22]

    Openep: Open-ended future event prediction,

    Yong Guan, Hao Peng, Xiaozhi Wang, Lei Hou, and Juanzi Li. Openep: Open-ended future event prediction,

  23. [23]

    Navigating tomorrow: Reliably assessing large language models performance on future event prediction, 2025

    Petraq Nako and Adam Jatowt. Navigating tomorrow: Reliably assessing large language models performance on future event prediction, 2025. 4

  24. [24]

    Bizfinbench: A business-driven real-world financial benchmark for evaluating llms, 2025

    Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, and Ji Liu. Bizfinbench: A business-driven real-world financial benchmark for evaluating llms, 2025. 4

  25. [25]

    Financeqa: A benchmark for evaluating financial analysis capabilities of large language models, 2025

    Spencer Mateega, Carlos Georgescu, and Danny Tang. Financeqa: A benchmark for evaluating financial analysis capabilities of large language models, 2025. 4

  26. [26]

    Capabilities of gpt-5 across critical domains, 2025

    OpenAI Team. Capabilities of gpt-5 across critical domains, 2025. 9

  27. [27]

    Grok 4 model card, 2025

    xAI Team. Grok 4 model card, 2025. 9

  28. [28]

    Grok 4 fast model card, 2025

    xAI Team. Grok 4 fast model card, 2025. 9

  29. [29]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025. 9

  30. [30]

    System card addendum: Claude opus 4.1, 2025

    Anthropic Team. System card addendum: Claude opus 4.1, 2025. 9

  31. [31]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, and et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,

  32. [32]

    Claude sonnet 4.5 system card, 2025

    Anthropic Team. Claude sonnet 4.5 system card, 2025. 9

  33. [33]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

    Google DeepMind Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 9

  34. [34]

    gpt-oss-120b & gpt-oss-20b model card, 2025

    OpenAI Team. gpt-oss-120b & gpt-oss-20b model card, 2025. 9

  35. [35]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 9

  36. [36]

    Gonzalez, and Ion Stoica

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 9

  37. [37]

    smolagents: A smol library to build great agentic systems, 2025

    Aymeric Roucher, A Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. smolagents: A smol library to build great agentic systems, 2025. 10

  38. [38]

    Pitfalls in evaluating language model forecasters, 2025

    Daniel Paleka, Shashwat Goel, Jonas Geiping, and Florian Tramèr. Pitfalls in evaluating language model forecasters, 2025. 13 15