Agents reach only 62.5% on real terminal tasks
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
Benchmark drawn from 80k recordings shows weak overlap with curated tests
Artificial Intelligence
Covers all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing), which have separate subject areas. In particular, includes Expert Systems, Theorem Proving (although this may overlap with Logic in Computer Science), Knowledge Representation, Planning, and Uncertainty in AI. Roughly includes material in ACM Subject Classes I.2.0, I.2.1, I.2.3, I.2.4, I.2.8, and I.2.11.
New benchmark shows performance gaps from easy Point Algebra to hard RCC-22, with no model getting everything right.
Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
The bound isolates neural error, interface gap, and mixing time so agents learn workflows without joint trajectories or centralized data.
SVFSearch tests multimodal models on paused short-video scenes that need gaming expertise and shows where retrieval and reasoning still fail
SVFSearch tests multimodal models on paused gaming scenes and finds retrieval helps but leaves a sizable performance gap.
Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)
A four-dimensional framework above VoID and DCAT supports reliable selection, composition, and failure diagnosis at planning time.
Distribution-Aware Algorithm Design with LLM Agents
LLM agents recover distribution-specific hints to compile solvers that match heuristic quality yet run hundreds of times faster on 21 target
MathAtlas: A Benchmark for Autoformalization in the Wild
52k statements from 103 textbooks plus a dependency graph expose the gap between current autoformalization and advanced mathematics.
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
Natural language goals drive attack selection and composition, unifying tests for ML and generative models with high success rates in case demonstrations.