LATTICE: Evaluating Decision Support Utility of Crypto Agents
Pith reviewed 2026-05-07 13:27 UTC · model grok-4.3
The pith
The LATTICE benchmark shows that real crypto copilots differ more on specific decision dimensions than on overall scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LATTICE defines six evaluation dimensions that capture key decision support properties and proposes sixteen task types that span the end-to-end crypto copilot workflow. It uses LLM judges to automatically score agent outputs on these dimensions and tasks for six real-world crypto copilots across 1,200 diverse queries. The results show that most copilots achieve comparable aggregate scores but differ more significantly on dimension-level and task-level performance, indicating meaningful trade-offs in decision support quality.
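The core claim hinges on aggregates masking dimension-level spread. A minimal sketch illustrates the pattern; the copilot names, dimension names, and scores below are invented for illustration and are not LATTICE's actual rubric or results:

```python
# Hypothetical dimension scores (0-10) for two copilots whose aggregate
# means are identical. Dimension names are illustrative placeholders,
# not LATTICE's actual six dimensions.
scores = {
    "copilot_a": {"accuracy": 8.0, "risk_explanation": 5.0, "clarity": 8.0},
    "copilot_b": {"accuracy": 6.0, "risk_explanation": 8.0, "clarity": 7.0},
}

def aggregate(dims: dict) -> float:
    """Unweighted mean across dimensions, i.e. a single leaderboard score."""
    return sum(dims.values()) / len(dims)

for name, dims in scores.items():
    print(name, round(aggregate(dims), 2))  # both print 7.0

# Per-dimension gaps reveal the trade-offs the aggregate hides.
for dim in scores["copilot_a"]:
    gap = scores["copilot_a"][dim] - scores["copilot_b"][dim]
    print(dim, gap)
```

Here both copilots score 7.0 in aggregate, yet a user who prioritizes risk explanation would clearly prefer copilot_b, which is the kind of trade-off the paper reports surfacing only at dimension level.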
What carries the argument
The LATTICE benchmark, which combines six decision-support dimensions and sixteen task types scored by LLM judges whose rubrics can be audited and revised without external data sources.
If this is right
- Users with distinct priorities may select different copilots based on dimension strengths rather than aggregate rankings.
- Production agents vary in orchestration and interface design, which the benchmark treats as part of overall quality.
- The open-sourced code and data allow direct replication and extension to new agents or tasks.
- Dimension-level and task-level breakdowns provide more actionable information than single overall scores.
Where Pith is reading between the lines
- The same evaluation approach without ground truth could be adapted to measure decision support in other agent domains such as personal finance or medical information.
- Developers could use the observed trade-offs to target improvements in specific dimensions like risk explanation or clarity for different user groups.
- Leaderboards that report only aggregates risk hiding the dimension variances the paper identifies, so future evaluations should always include the breakdowns.
Load-bearing premise
The LLM judge rubrics can be continually audited and updated to reflect what users actually value in decision support.
What would settle it
A side-by-side comparison of the same agent outputs scored by both the LLM judges and by human crypto users or experts, where low agreement would show the benchmark does not track real decision utility.
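Operationally, such a comparison reduces to measuring rank agreement between human and LLM-judge scores on the same outputs. Below is a minimal sketch using a hand-rolled Spearman correlation; the score vectors are invented, and a real study would use the benchmark's actual outputs (and, e.g., scipy.stats.spearmanr):

```python
# Spearman rank correlation between hypothetical human and LLM-judge
# scores on the same five agent outputs (values are invented).
def ranks(xs):
    """1-based average ranks; tied values share the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Pearson correlation of the rank vectors of a and b."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

human = [7, 5, 9, 4, 6]  # expert scores for five outputs
llm = [8, 6, 9, 3, 5]    # LLM-judge scores for the same outputs
print(round(spearman(human, llm), 3))  # prints 0.9
```

High rank correlation on a held-out sample would support the judges as proxies for human utility; low correlation would undercut the paper's dimension-level comparisons.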
Original abstract
We introduce LATTICE, a benchmark for evaluating the decision support utility of crypto agents in realistic user-facing scenarios. Prior crypto agent benchmarks mainly focus on reasoning-based or outcome-based evaluation, but do not assess agents' ability to assist user decision-making. LATTICE addresses this gap by: (1) defining six evaluation dimensions that capture key decision support properties; (2) proposing 16 task types that span the end-to-end crypto copilot workflow; and (3) using LLM judges to automatically score agent outputs based on these dimensions and tasks. Crucially, the dimensions and tasks are designed to be evaluable at scale using LLM judges, without relying on ground truth from expert annotators or external data sources. In lieu of these dependencies, LATTICE's LLM judge rubrics can be continually audited and updated given new dimensions, tasks, criteria, and human feedback, thus promoting reliable and extensible evaluation. While other benchmarks often compare foundation models sharing a generic agent framework, we use LATTICE to assess production-level agents used in actual crypto copilot products, reflecting the importance of orchestration and UI/UX design in determining agent quality. In this paper, we evaluate six real-world crypto copilots on 1,200 diverse queries and report breakdowns across dimensions, tasks, and query categories. Our experiments show that most of the tested copilots achieve comparable aggregate scores, but differ more significantly on dimension-level and task-level performance. This pattern suggests meaningful trade-offs in decision support quality: users with different priorities may be better served by different copilots than the aggregate rankings alone would indicate. To support reproducible research, we open-source all LATTICE code and data used in this paper.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LATTICE, a benchmark for crypto agent decision support utility. It defines six dimensions and 16 task types spanning the copilot workflow, employs LLM judges to score six production copilots on 1,200 queries without ground-truth annotations, and reports comparable aggregate scores but larger differences at dimension and task levels, implying user-specific trade-offs. The design emphasizes scalable LLM evaluation with future rubric auditing, and all code/data are open-sourced.
Significance. If the LLM-judge scores prove reliable, the work supplies a reproducible, extensible method for comparing real deployed crypto agents that accounts for orchestration and UX factors often ignored in model-only benchmarks. The observation that aggregate rankings obscure dimension-level variation is a useful corrective for practitioners. Open-sourcing the full evaluation harness and dataset is a concrete strength that enables external auditing and extension.
Major comments (2)
- [Abstract and Evaluation section] No inter-rater agreement, human-LLM correlation, or expert validation on any subset of the 1,200 queries is reported for the LLM-judge rubrics. Because the central claim—that dimension- and task-level spreads reveal meaningful decision-support trade-offs—rests entirely on these scores tracking actual user utility in a domain requiring current on-chain facts and regulatory nuance, the absence of such validation leaves the reported differences uninterpretable as evidence of real differences.
- [Benchmark Design] The assertion that the six dimensions and 16 tasks are 'designed to be evaluable at scale using LLM judges, without relying on ground truth' is offered as an advantage, yet no empirical test (e.g., agreement with expert annotators on a held-out sample) is provided to support that the rubrics capture decision utility rather than LLM-specific artifacts.
Minor comments (2)
- [Evaluation Methodology] The paper would benefit from an explicit table or appendix listing the precise rubric criteria and prompt templates used by the LLM judges for each dimension and task.
- [Results] Query-category breakdowns are mentioned but not illustrated; a supplementary figure showing score distributions across query types would clarify whether the observed dimension differences are driven by particular query classes.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on LATTICE. The comments highlight an important limitation in the current manuscript regarding validation of the LLM judges. We address each major comment below and commit to revisions that will strengthen the evidence for our claims.
Point-by-point responses
-
Referee: [Abstract and Evaluation section] No inter-rater agreement, human-LLM correlation, or expert validation on any subset of the 1,200 queries is reported for the LLM-judge rubrics. Because the central claim—that dimension- and task-level spreads reveal meaningful decision-support trade-offs—rests entirely on these scores tracking actual user utility in a domain requiring current on-chain facts and regulatory nuance, the absence of such validation leaves the reported differences uninterpretable as evidence of real differences.
Authors: We agree that the absence of reported validation metrics is a limitation that affects the strength of our central claim. The manuscript does not include inter-rater agreement, human-LLM correlation, or expert validation results. In the revised manuscript, we will add a dedicated subsection in the Evaluation section reporting a human validation study performed on a held-out subset of queries. This will include expert inter-rater agreement statistics and correlation between human and LLM scores, along with a discussion of how these results support (or qualify) the observed dimension- and task-level differences. We will also note the practical challenges of obtaining reliable ground-truth annotations in the crypto domain due to rapidly changing on-chain data and regulatory context.
Revision: yes
-
Referee: [Benchmark Design] The assertion that the six dimensions and 16 tasks are 'designed to be evaluable at scale using LLM judges, without relying on ground truth' is offered as an advantage, yet no empirical test (e.g., agreement with expert annotators on a held-out sample) is provided to support that the rubrics capture decision utility rather than LLM-specific artifacts.
Authors: The design choice to enable scalable evaluation without per-query ground truth stems from the impracticality of expert annotation at the scale of 1,200 queries in a domain with volatile facts. The rubrics were constructed to target observable decision-support properties rather than model-specific behaviors. Nevertheless, we acknowledge that without an empirical test against expert judgments, the risk of LLM artifacts cannot be fully ruled out. In the revision, we will include results from a small-scale expert annotation experiment on a held-out sample, reporting agreement metrics and any discrepancies. We will also emphasize that the open-sourced rubrics and data are intended to support ongoing human auditing and refinement, as already stated in the manuscript.
Revision: yes
Circularity Check
No circularity: the benchmark results are direct empirical outputs from author-defined rubrics, with no derivations or self-referential reductions.
Full rationale
The paper introduces LATTICE as a new benchmark consisting of six dimensions and 16 tasks, then applies LLM judges to score six production copilots on 1,200 queries. No equations, derivations, fitted parameters, or predictions appear anywhere in the text. The reported aggregate, dimension-level, and task-level scores are computed outputs from the defined rubrics rather than quantities that reduce to the inputs by construction. No self-citations are used to justify uniqueness or load-bearing premises, and the framework does not rename prior results or smuggle ansatzes. The central claim about trade-offs visible only at finer granularity follows directly from the observed score spreads and does not collapse into a tautology or self-defined loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM judges can be continually audited and updated to maintain reliability on new dimensions and tasks without ground-truth data.
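This auditability premise becomes concrete if rubrics are treated as versioned plain data rather than opaque prompts, so revisions are diffable and the history stays inspectable. A minimal sketch; the class, field names, and criteria strings are hypothetical, not LATTICE's actual schema:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Rubric:
    """A versioned judge rubric; criteria are auditable plain data."""
    dimension: str
    version: int
    criteria: tuple  # ordered scoring criteria, each a short string

def revise(rubric: Rubric, new_criteria: tuple) -> Rubric:
    """Produce a new rubric version, leaving the old one intact for audit."""
    return replace(rubric, version=rubric.version + 1, criteria=new_criteria)

# Hypothetical rubric for a risk-explanation dimension.
risk = Rubric("risk_explanation", 1, ("names key risks", "quantifies exposure"))
risk_v2 = revise(risk, risk.criteria + ("flags regulatory uncertainty",))
print(risk_v2.version, len(risk_v2.criteria))  # prints: 2 3
```

Because the rubric is frozen, each revision yields a new immutable object, which is what lets human feedback update the judges without invalidating earlier audit trails.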