CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency

Chaoyang He; Jason Ge; Jiacheng Guo; Jiashuo Liu; Jia Tian; Kaixuan Huang; Kanghong Zhan; Lin Yang; Mengdi Wang; Nicholas Deng

arxiv: 2512.00417 · v5 · pith:LMDI7SLNnew · submitted 2025-11-29 · 💻 cs.CL


Jiacheng Guo, Suozhi Huang, Zixin Yao, Yifan Zhang, Yifu Lu, Jiashuo Liu, Zihao Li, Nicholas Deng, Qixin Xiao, Jia Tian, Kanghong Zhan, Tianyi Li, Xiaochen Liu, Jason Ge, Chaoyang He, Kaixuan Huang, Lin Yang, Wenhao Huang, Mengdi Wang


Pith reviewed 2026-05-21 18:58 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM agents · cryptocurrency · benchmark · retrieval · prediction · performance evaluation · dynamic benchmark


The pith

CryptoBench reveals that LLMs retrieve cryptocurrency data well but struggle with predictive analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CryptoBench, a dynamic benchmark that releases 50 new questions each month to test LLM agents on cryptocurrency tasks. Questions are sorted into four quadrants that separate simple and complex retrieval from simple and complex prediction. Direct and agent-based tests of ten models uncover a consistent pattern where retrieval succeeds far more often than prediction. This distinction matters because cryptocurrency work demands both accurate data pulls and timely forecasts in a fast-changing, adversarial setting. The observed imbalance suggests agents can sound informed while lacking the synthesis needed for actual analysis.

Core claim

CryptoBench is a live, expert-curated benchmark built around fifty monthly questions that mirror professional crypto analyst workflows and are grouped into four quadrants: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. Evaluation of ten LLMs shows a clear retrieval-prediction imbalance in which models proficient at gathering information from on-chain sources and dashboards display pronounced weakness once tasks require forecasting or synthesis.

What carries the argument

The four-quadrant categorization system that isolates data-gathering performance from predictive synthesis performance.
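To make the two axes concrete, here is a minimal Python sketch of how per-question results could be tagged by complexity and task type and then aggregated per quadrant or per axis. The `Result` record, the `accuracy_by` helper, and the toy data are hypothetical illustrations; they are not the paper's evaluation code or its results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One graded answer (hypothetical record layout)."""
    complexity: str  # "simple" or "complex"
    task_type: str   # "retrieval" or "prediction"
    correct: bool


def accuracy_by(results, key):
    """Group results by key(result) and return {group: fraction correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        group = key(r)
        totals[group] += 1
        hits[group] += int(r.correct)
    return {group: hits[group] / totals[group] for group in totals}


# Toy data for illustration only; not taken from the paper.
results = [
    Result("simple", "retrieval", True),
    Result("simple", "retrieval", True),
    Result("complex", "retrieval", True),
    Result("complex", "retrieval", False),
    Result("simple", "prediction", True),
    Result("simple", "prediction", False),
    Result("complex", "prediction", False),
    Result("complex", "prediction", False),
]

# One accuracy per quadrant, and one per task type (collapsing complexity).
by_quadrant = accuracy_by(results, lambda r: (r.complexity, r.task_type))
by_task_type = accuracy_by(results, lambda r: r.task_type)
```

Grouping by a key function keeps the same helper usable for the Simple versus Complex axis, the Retrieval versus Prediction axis, or the full quadrant.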

If this is right

Agents can appear factually competent through retrieval while remaining weak at forecasting.
A performance hierarchy among LLMs becomes visible once they operate inside agentic frameworks on crypto tasks.
The benchmark isolates the exact point where current models fall short of analyst-level work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Separate training on chain-of-thought forecasting could reduce the observed gap.
Real deployments might pair retrieval agents with dedicated prediction modules or external models.
Monthly updates make it possible to measure whether newer models close the imbalance over successive releases.

Load-bearing premise

The expert-designed questions and their placement into retrieval versus prediction quadrants accurately reflect the real challenges faced by professional cryptocurrency analysts.

What would settle it

If models achieve accuracy on prediction tasks comparable to their accuracy on retrieval tasks when scored against actual market outcomes and on-chain events, the claimed imbalance would not hold.
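One way to phrase this check numerically is to compare the two accuracies and ask whether their difference is distinguishable from zero. The sketch below uses a normal-approximation interval for the difference of two proportions; the function name, the 25/25 split of questions, and the counts are assumptions for illustration, since the page does not report per-quadrant sizes or scores.

```python
import math


def proportion_gap(hits_a, n_a, hits_b, n_b):
    """Return (gap, standard_error) for accuracy_a - accuracy_b."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Standard error of a difference of two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_a - p_b, se


# Hypothetical counts (retrieval = a, prediction = b); not from the paper.
gap, se = proportion_gap(hits_a=20, n_a=25, hits_b=10, n_b=25)

# Approximate 95% interval under a normal approximation (an assumption
# that is rough for samples this small).
low, high = gap - 1.96 * se, gap + 1.96 * se

# The imbalance is supported only if the whole interval lies above zero.
imbalance_supported = low > 0
```

With only 50 questions per month, intervals like this are wide, so pooling several monthly releases would likely be needed before concluding that the gap has closed.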

Figures

Figures reproduced from arXiv: 2512.00417 by Chaoyang He, Jason Ge, Jiacheng Guo, Jiashuo Liu, Jia Tian, Kaixuan Huang, Kanghong Zhan, Lin Yang, Mengdi Wang, Nicholas Deng, Qixin Xiao, Suozhi Huang, Tianyi Li, Wenhao Huang, Xiaochen Liu, Yifan Zhang, Yifu Lu, Zihao Li, Zixin Yao.

**Figure 2.** Statistics of the CryptoBench dataset between October …

**Figure 3.** The CryptoBench Four-Quadrant Task Classification System. Tasks are categorized along two axes: Com…

**Figure 4.** The CryptoBench Dataset Construction and Dynamic Update Pipeline. The top panel illustrates the rigorous …

**Figure 5.** Overall Performance Comparison between October …

**Figure 6.** Performance Breakdown between October 12th to November 11th. (a) Comparison of model accuracy on Simple versus Complex tasks. (b) Comparison of accuracy on Retrieval versus Prediction tasks, revealing distinct model strengths.
As shown in Figure 6a, all models perform better on Simple tasks than on Complex ones, which is expected. However, the performance degradation varies. Grok-4 (Web) maintains the high…

**Figure 7.** Performance Profile by Macro Category between October …

**Figure 8.** Performance Breakdown between October 12th to November 11th. (a) Performance by Task Quadrant. (b) Performance by Investor Focus, showing distinct model abilities on different classes.
5.5 Qualitative Analysis of Failure Modes. Through manual analysis of incorrect responses, we identified several recurring failure modes. Below, we detail each with…

Original abstract

This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CryptoBench adds a domain-specific dynamic benchmark for crypto LLM agents with a four-quadrant split, but the retrieval-prediction imbalance rests on expert labels that lack reported validation or examples.


CryptoBench introduces a live benchmark of 50 monthly questions for LLM agents in cryptocurrency, broken into Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. The authors ran ten models both standalone and in agent setups and report a clear hierarchy plus a retrieval-prediction imbalance where models handle data gathering better than forecasting.

The domain focus is the main new piece. Crypto analysis involves real-time on-chain data, DeFi dashboards, and an adversarial information setting that general agent benchmarks do not capture, so a targeted test set makes sense. The dynamic monthly refresh and the quadrant breakdown give a practical way to separate basic lookup from synthesis and prediction skills. The evaluation setup itself is straightforward and directly addresses the stated challenges of time sensitivity and source diversity.

The soft spots are in the supporting details. The abstract states the imbalance finding but supplies no scores, no model-by-model breakdowns, and no sample questions with their quadrant justifications. The expert curation by crypto-native professionals is presented as rigorous, yet there is no rubric, no inter-rater agreement measure, and no discussion of edge cases. If even a portion of the Simple Prediction items can be solved by retrieving historical patterns or signals, the claimed weakness in predictive analysis shrinks or disappears. That makes the central observation harder to interpret without the full methods and data.

This paper is mainly for groups working on LLM agents for finance, trading, or other high-stakes specialized domains. Anyone testing reliability in volatile, data-rich environments could use the quadrant idea as a starting point. It deserves peer review because a properly documented benchmark in this niche would be useful, provided the authors add the quantitative results, question examples, and curation validation that are missing from the current description.

Referee Report

1 major / 2 minor

Summary. The paper introduces CryptoBench, a dynamic benchmark consisting of 50 expert-curated questions per month for evaluating LLM agents in cryptocurrency analysis. Questions are categorized into a four-quadrant framework (Simple Retrieval, Complex Retrieval, Simple Prediction, Complex Prediction) designed to mirror professional analyst workflows. Evaluations of ten LLMs, both standalone and in agentic setups, reveal a performance hierarchy and a retrieval-prediction imbalance in which models perform well on retrieval but struggle with predictive analysis.

Significance. If the quadrant assignments prove stable and the evaluations are reproducible, CryptoBench could fill a gap in domain-specific agent benchmarks by emphasizing time-sensitive, adversarial, and multi-source synthesis challenges unique to cryptocurrency. The dynamic monthly update mechanism and the explicit separation of retrieval versus prediction capabilities are potentially valuable contributions for guiding improvements in LLM agent reasoning.

major comments (1)

[Benchmark Construction] The manuscript states that tasks were 'rigorously categorized' within the four-quadrant system by 'crypto-native professionals' but supplies no explicit labeling rubric, no inter-rater reliability statistics, and no annotated example questions with justifications. This is load-bearing for the central retrieval-prediction imbalance claim: without evidence that the quadrant distinctions are stable and not curator-specific, observed weaknesses in 'Simple Prediction' or 'Complex Prediction' tasks could be reclassified as retrieval failures rather than genuine predictive deficits.
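As an illustration of the kind of inter-rater statistic the referee asks for, the sketch below computes Cohen's kappa for two annotators who each assign one quadrant label per question. The abbreviations (SR, CR, SP, CP) and the example labels are hypothetical; the paper reports no such annotations, and other agreement measures could equally be used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical quadrant labels from two annotators (not from the paper).
annotator_1 = ["SR", "SR", "CR", "CR", "SP", "SP", "CP", "CP"]
annotator_2 = ["SR", "SR", "CR", "SP", "SP", "SP", "CP", "CR"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 6 of 8 items (0.75) against a chance agreement of 0.25, which gives a kappa of about 0.67. The two disagreements both sit on the retrieval/prediction or simple/complex boundary, which is exactly where the referee's concern applies.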

minor comments (2)

[Abstract] Key quantitative results (e.g., accuracy or F1 scores per quadrant and per model) are referenced but not reported, making it difficult for readers to gauge the magnitude of the imbalance from the summary alone.
The description of the agentic framework could clarify which tools or retrieval mechanisms were provided to the agents and whether they had access to the same data sources used in the benchmark construction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on CryptoBench. We address the major comment on benchmark construction and categorization below, and we will incorporate clarifications to strengthen the manuscript.


Referee: [Benchmark Construction] The manuscript states that tasks were 'rigorously categorized' within the four-quadrant system by 'crypto-native professionals' but supplies no explicit labeling rubric, no inter-rater reliability statistics, and no annotated example questions with justifications. This is load-bearing for the central retrieval-prediction imbalance claim: without evidence that the quadrant distinctions are stable and not curator-specific, observed weaknesses in 'Simple Prediction' or 'Complex Prediction' tasks could be reclassified as retrieval failures rather than genuine predictive deficits.

Authors: We agree that additional documentation on the categorization process would improve transparency and address potential concerns about subjectivity. The four-quadrant assignments were made by crypto-native professionals according to explicit internal criteria: retrieval tasks require locating and extracting factual information from sources such as on-chain data or news, while prediction tasks require synthesizing that information into forward-looking statements; simple versus complex distinctions were based on the number of distinct sources and reasoning steps involved. Although formal inter-rater reliability statistics were not computed or reported in the initial submission, assignments were reviewed for consistency by multiple domain experts. In the revised manuscript we will add the full labeling rubric, several annotated example questions with quadrant justifications, and a note on the review process used to ensure stability. These additions will allow readers to evaluate the robustness of the distinctions independently while preserving the observed retrieval-prediction performance gap, which appears consistently across models and monthly question sets.

revision: yes
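The criteria described in this simulated rebuttal can be written down as a small rule, which is one way a published rubric could be made checkable. The sketch below is hypothetical: the function name, its parameters, and especially the thresholds of one source and one reasoning step are assumptions, since the rebuttal names the criteria but gives no cut-offs.

```python
def assign_quadrant(forward_looking, num_sources, num_reasoning_steps,
                    max_simple_sources=1, max_simple_steps=1):
    """Map a question's attributes to one of the four quadrant names.

    The threshold defaults are illustrative assumptions, not values
    taken from the paper or the rebuttal.
    """
    # Prediction tasks end in a forward-looking statement; retrieval
    # tasks end in a located, extracted fact.
    task_type = "Prediction" if forward_looking else "Retrieval"
    # Complexity depends on how many sources and reasoning steps are needed.
    is_simple = (num_sources <= max_simple_sources
                 and num_reasoning_steps <= max_simple_steps)
    complexity = "Simple" if is_simple else "Complex"
    return f"{complexity} {task_type}"


# Hypothetical question attributes for illustration.
single_lookup = assign_quadrant(False, num_sources=1, num_reasoning_steps=1)
multi_source_forecast = assign_quadrant(True, num_sources=3, num_reasoning_steps=2)
```

Making the thresholds explicit parameters shows where curator judgment enters: a question near a cut-off changes quadrant if the thresholds shift, which is the stability question the referee raises.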

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted quantities

full rationale

This is an empirical benchmark paper that introduces CryptoBench with 50 monthly questions categorized by crypto-native professionals into four quadrants (Simple Retrieval, Complex Retrieval, Simple Prediction, Complex Prediction). The central claims concern observed performance hierarchies and a retrieval-prediction imbalance from evaluating ten LLMs. No equations, parameter fitting, self-citations as load-bearing premises, or derivation chains exist that could reduce outputs to inputs by construction. The task categorization is presented as a methodological design choice to mirror workflows, without any claim that it derives from or is forced by prior results in the paper itself. The work is therefore self-contained against external benchmarks, with claims resting on described evaluations rather than self-referential logic.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central contribution rests on the design choice of 50 questions monthly and the assumption that expert curation produces representative tasks; no new physical entities are postulated and no parameters are fitted to data in the usual sense.

free parameters (1)

Monthly question volume
Design choice of 50 questions per month to enable ongoing dynamic evaluation.

axioms (1)

domain assumption: Expert-curated questions by crypto-native professionals mirror actual analyst workflows
Invoked when constructing the benchmark to ensure tasks reflect real-world demands.

pith-pipeline@v0.9.0 · 5860 in / 1329 out tokens · 66366 ms · 2026-05-21T18:58:32.346280+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LATTICE: Evaluating Decision Support Utility of Crypto Agents
cs.CR · 2026-04 · unverdicted · novelty 6.0

LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

