TerminalWorld builds a scalable benchmark of 1,530 real terminal tasks from recordings and finds that frontier models and agents reach a pass rate of at most 62.5%, with only weak correlation to prior expert-curated sets.
Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.AI (1)
years: 2026 (1)
verdicts: CONDITIONAL (1)
representative citing papers
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks