MBABench evaluates LLM agents on end-to-end financial spreadsheet tasks and shows current models fail to meet professional finance standards, especially beyond simple calculations.
Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, and Ji Liu
3 Pith papers cite this work, all from 2026.
Representative citing papers:
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
citing papers explorer
- MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
  MBABench evaluates LLM agents on end-to-end financial spreadsheet tasks and shows current models fail to meet professional finance standards, especially beyond simple calculations.
- LATTICE: Evaluating Decision Support Utility of Crypto Agents
  LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
- EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
  EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.