Termigen: High-fidelity environment and robust trajectory synthesis for terminal agents
5 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (5)
Verdicts: UNVERDICTED (5)
citing papers explorer
- Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
  Terminal-World is a skill-based synthesis pipeline that generates 5,723 training environments and produces Terminal-World-32B, which outperforms baselines on Terminal-Bench 2.0 using only 1.2% of the data.
- LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents
  LiteCoder-Terminal-Gen creates synthetic terminal datasets that, after SFT and DMPO on Qwen models, yield 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively.
- Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World
  The Claw-Anything benchmark tests LLM agents on proactive assistance in complex simulated user digital environments with long histories, interdependent services, and noise; GPT-5.5 scores 34.5% pass@1.
- OpenComputer: Verifiable Software Worlds for Computer-Use Agents
  OpenComputer introduces a verifier-grounded framework with state verifiers, self-evolving layers, task synthesis, and auditable evaluation, covering 33 desktop apps and 1,000 tasks, to support computer-use AI agents.
- Toward Scalable Terminal Task Synthesis via Skill Graphs
  SkillSynth uses a scenario-mediated skill graph to sample workflow paths and generate executable terminal tasks, enabling controlled diversity in training trajectories for agents.