Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security?
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.AI (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: baseline (1)
polarities: baseline (1)
citing papers explorer
- BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
  BankerToolBench is a new open benchmark of end-to-end investment banking workflows developed with 502 bankers; even the best tested model (GPT-5.4) fails nearly half the expert rubric criteria and produces zero client-ready outputs.
- CHAINTRIX: A multi-pipeline LLM-augmented framework for automated smart-contract security auditing
  Chaintrix achieves 71.7% recall on 120 high-severity vulnerabilities in the EVMbench benchmark and outperforms the strongest frontier-model baseline by 26 percentage points through LLM pipelines grounded in a Cross-Contract Interaction Model and filtered by structural checks.