Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025
3 Pith papers cite this work. Polarity classification is still indexing.
years: 2025 (3)
representative citing papers
SiRA uses LLM world models for simulative reasoning to achieve up to 124% higher task completion and 32.2% navigation success versus reactive baselines in web environments.
Agents should invoke external tools only when epistemically necessary, per the introduced Theory of Agent framework that frames tool use as a decision under uncertainty.
citing papers explorer
- SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
  SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
- General Agentic Planning Through Simulative Reasoning with World Models
  SiRA uses LLM world models for simulative reasoning to achieve up to 124% higher task completion and 32.2% navigation success versus reactive baselines in web environments.
- Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary
  Agents should invoke external tools only when epistemically necessary, per the introduced Theory of Agent framework that frames tool use as a decision under uncertainty.