pith. machine review for the scientific record.

arxiv: 2604.26235 · v1 · submitted 2026-04-29 · 💻 cs.CR · cs.AI · cs.CL

Recognition: unknown

LATTICE: Evaluating Decision Support Utility of Crypto Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 13:27 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL
keywords crypto agents · decision support · benchmark evaluation · LLM judges · copilot workflow · trade-offs · production agents · scalable assessment

The pith

The LATTICE benchmark shows that real crypto copilots differ more on specific decision dimensions than on overall scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LATTICE to measure how effectively crypto agents help users make decisions in practical scenarios. It sets out six dimensions of decision support quality and sixteen task types that cover the complete workflow from query to output. The evaluation applies LLM judges to score outputs from six production copilots across 1,200 queries, without needing external ground-truth labels. Results indicate that aggregate scores are similar across agents, yet performance spreads more widely when examined by individual dimension or task. This pattern implies that users facing different priorities would benefit from selecting agents according to their strongest dimensions rather than a single ranking.
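
As a rough illustration of the evaluation loop described above, the sketch below scores agent outputs on a set of dimensions with an LLM judge and averages per dimension. The dimension labels, the call_judge stub, and the rubric format are illustrative assumptions, not the authors' released LATTICE code.

    # Minimal sketch of a judge-based scoring loop, assuming rubric text per
    # dimension and an LLM client supplied by the reader. Not the paper's code.
    from statistics import mean

    DIMENSIONS = ["accuracy", "relevance", "risk_awareness",
                  "actionability", "clarity", "timeliness"]  # placeholder labels

    def call_judge(prompt: str) -> float:
        """Stand-in for an LLM judge call that returns a 0-100 score."""
        raise NotImplementedError("wire up an LLM client here")

    def score_output(query: str, output: str, rubric: dict) -> dict:
        """Score one agent output on every dimension using the rubric text."""
        scores = {}
        for dim in DIMENSIONS:
            prompt = (f"Rubric for '{dim}':\n{rubric[dim]}\n\n"
                      f"Query:\n{query}\n\nAgent output:\n{output}\n\n"
                      "Return a single score from 0 to 100.")
            scores[dim] = call_judge(prompt)
        return scores

    def evaluate(agent_outputs: dict, queries: dict, rubric: dict) -> dict:
        """agent_outputs[agent][query_id] is an output string; returns per-dimension means."""
        results = {}
        for agent, outputs in agent_outputs.items():
            per_dim = {dim: [] for dim in DIMENSIONS}
            for qid, output in outputs.items():
                for dim, s in score_output(queries[qid], output, rubric).items():
                    per_dim[dim].append(s)
            results[agent] = {dim: mean(vals) for dim, vals in per_dim.items()}
        return results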

Core claim

LATTICE defines six evaluation dimensions that capture key decision support properties and proposes sixteen task types that span the end-to-end crypto copilot workflow. It uses LLM judges to automatically score agent outputs on these dimensions and tasks for six real-world crypto copilots across 1,200 diverse queries. The results show that most copilots achieve comparable aggregate scores but differ more significantly on dimension-level and task-level performance, indicating meaningful trade-offs in decision support quality.

What carries the argument

The LATTICE benchmark, which combines six decision-support dimensions and sixteen task types scored by LLM judges whose rubrics can be audited and revised without external data sources.
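
Read that way, a rubric is just versioned data. The sketch below shows one plausible shape: criteria stored as plain fields with a changelog, so a revision prompted by human feedback is a reviewable diff rather than an opaque prompt edit. The field names and criteria text are assumptions for illustration, not the paper's actual rubrics.

    # Illustrative rubric object that supports auditing and revision without
    # external labels. Field names and criteria are assumed, not from the paper.
    from dataclasses import dataclass, field

    @dataclass
    class Rubric:
        dimension: str              # e.g. "risk_awareness"
        version: str                # bumped on every audited revision
        criteria: list[str]         # what the judge is asked to check
        changelog: list[str] = field(default_factory=list)

        def revise(self, new_criteria: list[str], note: str, new_version: str) -> "Rubric":
            """Return an updated rubric while keeping the audit trail."""
            return Rubric(
                dimension=self.dimension,
                version=new_version,
                criteria=new_criteria,
                changelog=self.changelog + [f"{self.version} -> {new_version}: {note}"],
            )

    risk = Rubric(
        dimension="risk_awareness",
        version="1.0",
        criteria=["flags volatility and liquidity risks",
                  "distinguishes facts from speculation"],
    )
    risk_v2 = risk.revise(
        new_criteria=risk.criteria + ["notes regulatory uncertainty where relevant"],
        note="added regulatory criterion after human feedback",
        new_version="1.1",
    )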

If this is right

  • Users with distinct priorities may select different copilots based on dimension strengths rather than aggregate rankings.
  • Production agents vary in orchestration and interface design, which the benchmark treats as part of overall quality.
  • The open-sourced code and data allow direct replication and extension to new agents or tasks.
  • Dimension-level and task-level breakdowns provide more actionable information than single overall scores (see the sketch following this list).
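
A toy illustration of the last point, with made-up numbers rather than results from the paper: two agents tie on the aggregate mean, yet the better choice flips depending on which dimension a user weights most.

    # Made-up scores (0-100); not data from the paper.
    from statistics import mean

    scores = {
        "agent_a": {"accuracy": 90, "risk_awareness": 60, "clarity": 75},
        "agent_b": {"accuracy": 70, "risk_awareness": 85, "clarity": 70},
    }

    for agent, dims in scores.items():
        print(agent, "aggregate:", round(mean(dims.values()), 1))  # both print 75.0

    def best_for(dimension: str) -> str:
        """Pick the agent with the highest score on a single dimension."""
        return max(scores, key=lambda a: scores[a][dimension])

    print("best for accuracy:", best_for("accuracy"))              # agent_a
    print("best for risk_awareness:", best_for("risk_awareness"))  # agent_b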

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same evaluation approach without ground truth could be adapted to measure decision support in other agent domains such as personal finance or medical information.
  • Developers could use the observed trade-offs to target improvements in specific dimensions like risk explanation or clarity for different user groups.
  • Leaderboards that report only aggregates risk hiding the dimension variances the paper identifies, so future evaluations should always include the breakdowns.

Load-bearing premise

The LLM judge rubrics can be continually audited and updated to reflect what users actually value in decision support.

What would settle it

A side-by-side comparison of the same agent outputs scored by both the LLM judges and by human crypto users or experts, where low agreement would show the benchmark does not track real decision utility.
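
One way that comparison could be scored, sketched with placeholder numbers rather than any data from the paper: collect human and LLM-judge scores for the same outputs and compute a rank correlation, where a low value would suggest the rubrics do not track real decision utility.

    # Spearman rank correlation between human and LLM-judge scores.
    # The score lists are placeholders, not measurements from the paper.

    def average_ranks(xs):
        """1-based ranks, with ties assigned their average rank."""
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        ranks = [0.0] * len(xs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                ranks[order[k]] = avg
            i = j + 1
        return ranks

    def spearman(a, b):
        ra, rb = average_ranks(a), average_ranks(b)
        ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
        cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
        sd_a = sum((x - ma) ** 2 for x in ra) ** 0.5
        sd_b = sum((y - mb) ** 2 for y in rb) ** 0.5
        return cov / (sd_a * sd_b)

    human_scores = [72, 85, 60, 90, 78]   # placeholder expert ratings per output
    judge_scores = [70, 88, 55, 92, 74]   # placeholder LLM-judge ratings

    print("Spearman rho:", round(spearman(human_scores, judge_scores), 3))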

Figures

Figures reproduced from arXiv: 2604.26235 by Aaron Chan, Angela Chen, Junyi Du, Tengfei Li, Tianyi Xiao, Xiang Ren.

Figure 1. Overview of LATTICE. The LATTICE benchmark is constructed through an iterative design process with input from domain experts, defining the evaluation dimensions, task taxonomy, query categories, LLM query generation prompts, and LLM judge rubrics. For each of the 16 tasks, an LLM query generator produces multiple queries across five query categories. These queries are executed on all tested crypto copilots…

Figure 2. Aggregate Results. Per-copilot mean score (0–100). At the same time, the separation between tiers indicates that the benchmark remains sensitive to larger differences in decision support quality. In particular, Elsa consistently trails the rest of the copilots by a substantial margin, forming a distinct bottom band in the aggregate results. Finally, aggregate scores partially reflect performance on boundar…

Figure 3. Query-Category-Level Results. Mean score (0–100) by query category. Win rates reproduced from the figure:

  Copilot     Win Rate (%)   Wins / Appearances
  Elsa        18.03           952 / 5280
  Gina        54.34          2869 / 5280
  June        36.42          1923 / 5280
  Minara      43.72          2309 / 5281
  Sorin       62.68          3310 / 5281
  Surf        55.55          2933 / 5280
  Wayfinder   50.19          2650 / 5280
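
The win-rate column in the Figure 3 table reads as wins divided by appearances; a quick arithmetic check on the Sorin row (numbers copied from the figure) reproduces the reported 62.68%.

    wins, appearances = 3310, 5281                        # Sorin row, Figure 3
    print(f"win rate: {100 * wins / appearances:.2f}%")   # prints 62.68%
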
Original abstract

We introduce LATTICE, a benchmark for evaluating the decision support utility of crypto agents in realistic user-facing scenarios. Prior crypto agent benchmarks mainly focus on reasoning-based or outcome-based evaluation, but do not assess agents' ability to assist user decision-making. LATTICE addresses this gap by: (1) defining six evaluation dimensions that capture key decision support properties; (2) proposing 16 task types that span the end-to-end crypto copilot workflow; and (3) using LLM judges to automatically score agent outputs based on these dimensions and tasks. Crucially, the dimensions and tasks are designed to be evaluable at scale using LLM judges, without relying on ground truth from expert annotators or external data sources. In lieu of these dependencies, LATTICE's LLM judge rubrics can be continually audited and updated given new dimensions, tasks, criteria, and human feedback, thus promoting reliable and extensible evaluation. While other benchmarks often compare foundation models sharing a generic agent framework, we use LATTICE to assess production-level agents used in actual crypto copilot products, reflecting the importance of orchestration and UI/UX design in determining agent quality. In this paper, we evaluate six real-world crypto copilots on 1,200 diverse queries and report breakdowns across dimensions, tasks, and query categories. Our experiments show that most of the tested copilots achieve comparable aggregate scores, but differ more significantly on dimension-level and task-level performance. This pattern suggests meaningful trade-offs in decision support quality: users with different priorities may be better served by different copilots than the aggregate rankings alone would indicate. To support reproducible research, we open-source all LATTICE code and data used in this paper.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LATTICE, a benchmark for crypto agent decision support utility. It defines six dimensions and 16 task types spanning the copilot workflow, employs LLM judges to score six production copilots on 1,200 queries without ground-truth annotations, and reports comparable aggregate scores but larger differences at dimension and task levels, implying user-specific trade-offs. The design emphasizes scalable LLM evaluation with future rubric auditing, and all code/data are open-sourced.

Significance. If the LLM-judge scores prove reliable, the work supplies a reproducible, extensible method for comparing real deployed crypto agents that accounts for orchestration and UX factors often ignored in model-only benchmarks. The observation that aggregate rankings obscure dimension-level variation is a useful corrective for practitioners. Open-sourcing the full evaluation harness and dataset is a concrete strength that enables external auditing and extension.

major comments (2)
  1. [Abstract and Evaluation section] No inter-rater agreement, human-LLM correlation, or expert validation on any subset of the 1,200 queries is reported for the LLM-judge rubrics. Because the central claim—that dimension- and task-level spreads reveal meaningful decision-support trade-offs—rests entirely on these scores tracking actual user utility in a domain requiring current on-chain facts and regulatory nuance, the absence of such validation leaves the reported differences uninterpretable as evidence of real differences.
  2. [Benchmark Design] The assertion that the six dimensions and 16 tasks are 'designed to be evaluable at scale using LLM judges, without relying on ground truth' is offered as an advantage, yet no empirical test (e.g., agreement with expert annotators on a held-out sample) is provided to support that the rubrics capture decision utility rather than LLM-specific artifacts.
minor comments (2)
  1. [Evaluation Methodology] The paper would benefit from an explicit table or appendix listing the precise rubric criteria and prompt templates used by the LLM judges for each dimension and task.
  2. [Results] Query-category breakdowns are mentioned but not illustrated; a supplementary figure showing score distributions across query types would clarify whether the observed dimension differences are driven by particular query classes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on LATTICE. The comments highlight an important limitation in the current manuscript regarding validation of the LLM judges. We address each major comment below and commit to revisions that will strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] No inter-rater agreement, human-LLM correlation, or expert validation on any subset of the 1,200 queries is reported for the LLM-judge rubrics. Because the central claim—that dimension- and task-level spreads reveal meaningful decision-support trade-offs—rests entirely on these scores tracking actual user utility in a domain requiring current on-chain facts and regulatory nuance, the absence of such validation leaves the reported differences uninterpretable as evidence of real differences.

    Authors: We agree that the absence of reported validation metrics is a limitation that affects the strength of our central claim. The manuscript does not include inter-rater agreement, human-LLM correlation, or expert validation results. In the revised manuscript, we will add a dedicated subsection in the Evaluation section reporting a human validation study performed on a held-out subset of queries. This will include expert inter-rater agreement statistics and correlation between human and LLM scores, along with a discussion of how these results support (or qualify) the observed dimension- and task-level differences. We will also note the practical challenges of obtaining reliable ground-truth annotations in the crypto domain due to rapidly changing on-chain data and regulatory context. revision: yes

  2. Referee: [Benchmark Design] The assertion that the six dimensions and 16 tasks are 'designed to be evaluable at scale using LLM judges, without relying on ground truth' is offered as an advantage, yet no empirical test (e.g., agreement with expert annotators on a held-out sample) is provided to support that the rubrics capture decision utility rather than LLM-specific artifacts.

    Authors: The design choice to enable scalable evaluation without per-query ground truth stems from the impracticality of expert annotation at the scale of 1,200 queries in a domain with volatile facts. The rubrics were constructed to target observable decision-support properties rather than model-specific behaviors. Nevertheless, we acknowledge that without an empirical test against expert judgments, the risk of LLM artifacts cannot be fully ruled out. In the revision, we will include results from a small-scale expert annotation experiment on a held-out sample, reporting agreement metrics and any discrepancies. We will also emphasize that the open-sourced rubrics and data are intended to support ongoing human auditing and refinement, as already stated in the manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark results are direct empirical outputs from author-defined rubrics with no derivations or self-referential reductions

full rationale

The paper introduces LATTICE as a new benchmark consisting of six dimensions and 16 tasks, then applies LLM judges to score six production copilots on 1,200 queries. No equations, derivations, fitted parameters, or predictions appear anywhere in the text. The reported aggregate, dimension-level, and task-level scores are computed outputs from the defined rubrics rather than quantities that reduce to the inputs by construction. No self-citations are used to justify uniqueness or load-bearing premises, and the framework does not rename prior results or smuggle ansatzes. The central claim about trade-offs visible only at finer granularity follows directly from the observed score spreads and does not collapse into a tautology or self-defined loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework assumes LLM judges can serve as reliable proxies for human assessment of decision support quality without external ground truth or expert labels. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: LLM judges can be continually audited and updated to maintain reliability for new dimensions and tasks without ground-truth data.
    Stated in the abstract as the basis for scalable evaluation.

pith-pipeline@v0.9.0 · 5619 in / 1362 out tokens · 44662 ms · 2026-05-07T13:27:02.674469+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 20 canonical work pages · 4 internal anchors

  1. [1] Jiacheng Guo, Suozhi Huang, Zixin Yao, Yifan Zhang, Yifu Lu, Jiashuo Liu, Zihao Li, Nicholas Deng, Qixin Xiao, Jia Tian, et al. Cryptobench: A dynamic benchmark for expert-level evaluation of LLM agents in cryptocurrency. arXiv preprint arXiv:2512.00417.

  2. [2] Anushri Eswaran, Oleg Golev, Darshan Tank, Sidhant Rahi, and Himanshu Tyagi. Cryptoanalystbench: Failures in multi-tool long-form LLM analysis. arXiv preprint arXiv:2602.11304.

  3. [3] Zeshi Dai, Zimo Peng, Zerui Cheng, and Ryan Yihe Li. When hallucination costs millions: Benchmarking AI agents in high-stakes adversarial financial markets. arXiv preprint arXiv:2510.00332.

  4. [4] Haohang Li, Yupeng Cao, Yangyang Yu, Shashidhar Reddy Javaji, Zhiyang Deng, Yueru He, Yuechen Jiang, Zining Zhu, Kp Subbalakshmi, Jimin Huang, et al. Investorbench: A benchmark for financial decision-making tasks with LLM-based agent. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

  5. [5] Zhiyuan Zeng, Jiashuo Liu, Siyuan Chen, Tianci He, Yali Liao, Yixiao Tian, Jinpeng Wang, Zaiyuan Wang, Yang Yang, Lingyue Yin, et al. Futurex: An advanced live benchmark for LLM agents in future prediction. arXiv preprint arXiv:2508.11987.

  6. [6] Yuan Li, Bingqiao Luo, Qian Wang, Nuo Chen, Xu Liu, and Bingsheng He. Cryptotrade: A reflective LLM-based agent to guide zero-shot cryptocurrency trading. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1094–1106.

  7. [7] Yichen Luo, Yebo Feng, Jiahua Xu, Paolo Tasca, and Yang Liu. LLM-powered multi-agent system for automated crypto portfolio management. arXiv preprint arXiv:2501.00826.

  8. [8] Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, and Wei Wang. Mirai: Evaluating LLM agents for event forecasting. arXiv preprint arXiv:2407.01231.

  9. [9] Avi Arora and Ritesh Malpani. Predictionmarketbench: A SWE-bench-style framework for backtesting trading agents on prediction markets. arXiv preprint arXiv:2602.00133.

  10. [10] Haofei Yu, Fenghai Li, and Jiaxuan You. Livetradebench: Seeking real-world alpha with large language models. arXiv preprint arXiv:2511.03628.

  11. [11] Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. Financebench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023.

  12. [12] Spencer Mateega, Carlos Georgescu, and Danny Tang. Financeqa: A benchmark for evaluating financial analysis capabilities of large language models. arXiv preprint arXiv:2501.18062.

  13. [13] Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, and Juanzi Li. Stockbench: Can LLM agents trade stocks profitably in real-world markets? arXiv preprint arXiv:2510.02209.

  14. [14] Tianyu Fan, Yuhao Yang, Yangqin Jiang, Yifei Zhang, Yuxuan Chen, and Chao Huang. AI-trader: Benchmarking autonomous agents in real-time financial markets. arXiv preprint arXiv:2512.10971.

  15. [15] Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu. Finance agent benchmark: Benchmarking LLMs on real-world financial research tasks. arXiv preprint arXiv:2508.00828.

  16. [16] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot Arena: An open platform for evaluating LLMs by human preference, 2024. URL https://arxiv.org/abs/2403.04132.

  17. [17] Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, and Ion Stoica. Judgebench: A benchmark for evaluating LLM-based judges. arXiv preprint arXiv:2410.12784.

  18. [18] Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, et al. From generation to judgment: Opportunities and challenges of LLM-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2757–2791, 2025b. Xiao Liu, Hao ... AgentBench: Evaluating LLMs as Agents.

  19. [19] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents. arXiv preprint arXiv:2506.11763, 2025.

  20. [20] Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, and Liquan Xiao. DSGBench: A diverse strategic game benchmark for evaluating LLM-based agents in complex decision-making environments. arXiv preprint arXiv:2503.06047.

  21. [21] Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. From LLM reasoning to autonomous AI agents: A comprehensive review. arXiv preprint arXiv:2504.19678.