Herculean: An Agentic Benchmark for Financial Intelligence

2026 · cs.AI · arXiv 2605.14355

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skilled benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.

representative citing papers

Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation

cs.AI · 2026-06-16 · unverdicted · novelty 7.0

CEO-Bench evaluates LLMs on CEO-level strategic resource reallocation via multi-role agent simulations, showing high structural validity but sharp divergence on strategic calibration across five frontier models on 13 scenarios.

AuditFraudBench: Benchmarking Audit Judgment in Detecting Fraudulent Misstatements

cs.CE · 2026-06-06 · unverdicted · novelty 7.0

AuditFraudBench is a new enforcement-grounded benchmark with three tasks for testing whether LLMs can detect fraudulent misstatements by reasoning over financial figures, disclosure framing, and known manipulation patterns.

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

cs.AI · 2026-06-02 · unverdicted · novelty 7.0

AuditFlow combines a graph-grounded symbolic environment with a multi-agent LLM setup to reach 82.09% joint audit accuracy on structured financial reports, 14.93 points above the strongest baseline.

citing papers explorer

AuditFraudBench: Benchmarking Audit Judgment in Detecting Fraudulent Misstatements

cs.CE · 2026-06-06 · unverdicted · none · ref 30 · internal anchor

AuditFraudBench is a new enforcement-grounded benchmark with three tasks for testing whether LLMs can detect fraudulent misstatements by reasoning over financial figures, disclosure framing, and known manipulation patterns.
