SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications?

Muxin Tian, Zhe Wang, Blair Yang, Zhenwei Tang, Kunlun Zhu, Honghua Dong, Hanchen Li, Xinni Xie, Guangjing Wang, Jiaxuan You · 2026 · arXiv 2602.09540

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

cs.SE · 2026-05-26 · unverdicted · novelty 6.0

RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20%, with no model finishing the full pipeline and costs varying by three orders of magnitude.

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks

cs.NI · 2026-04-29 · unverdicted · novelty 6.0

SWE-Bench 5G is the first benchmark for AI agents fixing bugs in 5G core network software, showing high diagnosis rates but low resolution that improves conditionally with specification context.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems · cs.SE · 2026-05-26 · unverdicted · none · ref 23
RAMP evaluates 15 models on production-like serial workflows and reports completion rates collapsing from 100% to 20%, with no model finishing the full pipeline and costs varying by three orders of magnitude.
Towards Direct Evaluation of Harness Optimizers via Priority Ranking · cs.AI · 2026-05-21 · unverdicted · none · ref 8
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks · cs.NI · 2026-04-29 · unverdicted · none · ref 16
SWE-Bench 5G is the first benchmark for AI agents fixing bugs in 5G core network software, showing high diagnosis rates but low resolution that improves conditionally with specification context.
