LegalAgentBench: Evaluating LLM Agents in Legal Domain. arXiv preprint arXiv:2412.17259
3 Pith papers cite this work.
citation-role summary
roles: dataset 1
citation-polarity summary
polarities: background 1
citing papers explorer
- Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models
  Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.
- ADAM: A Systematic Data Extraction Attack on Agent Memory via Adaptive Querying
  ADAM extracts data from LLM agent memory with up to 100% attack success rate by estimating data distribution and selecting queries via entropy guidance.
- From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
  A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.