Recognition: no theorem link
MobileDev-Bench: A Benchmark for Issue Resolution in Mobile Application Development
Pith reviewed 2026-05-15 00:58 UTC · model grok-4.3
The pith
A benchmark of 407 mobile app issues shows frontier LLMs resolve under 6 percent even with perfect context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MobileDev-Bench establishes that end-to-end resolution of mobile application issues by LLMs remains limited to 3.23-5.69 percent on 407 tasks drawn from production apps, a rate substantially lower than rates reported on prior software engineering benchmarks.
What carries the argument
A set of 407 executable tasks, each linking a reported issue to test patches that automatically verify multi-file, multi-artifact fixes inside containerized mobile build environments.
If this is right
- Mobile fixes require coordinated edits across source, build, and resource files that exceed the scope of most existing repair benchmarks.
- Automated validation through executable tests is feasible for mobile environments and exposes gaps hidden by static checks alone.
- Current LLMs remain far from reliable end-to-end mobile issue resolution even when retrieval supplies all necessary files.
- Progress on mobile AI assistance will need methods that address heterogeneous artifact types and build-system constraints.
Where Pith is reading between the lines
- Specialized training or retrieval tuned to mobile framework patterns could raise success rates on these tasks.
- The benchmark could serve as a testbed for hybrid systems that combine LLMs with mobile-specific static analysis or build tools.
Load-bearing premise
The 407 tasks collected from 19 production apps represent typical mobile development issues and the supplied test patches correctly confirm fixes without bias or framework artifacts.
What would settle it
An experiment in which an LLM or agent reaches resolution rates above 20 percent on the same 407 tasks while using the provided test harness would falsify the central performance claim.
read the original abstract
Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on library-style repositories, leaving mobile application development largely unexplored despite its framework-specific build systems, heterogeneous artifact types, and coordinated multi-file fix requirements. We introduce MobileDev-Bench, a benchmark comprising 407 real-world issue-resolution tasks collected from 19 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs a verified developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantially greater patch complexity than prior benchmarks: fixes modify 12.9 files and 334.6 lines on average, and 41% of instances require coordinated changes across multiple artifact types, such as source, build configuration, and resource files. Evaluation of four frontier LLMs (Claude Sonnet 4.5, Qwen3-Coder, GPT-5.2, and Gemini 2.5 Flash) yields end-to-end resolution rates of only 3.23% - 4.23% under automated retrieval and at most 5.69% under oracle retrieval, well below resolution rates reported on existing benchmarks. We release MobileDev-Bench with task instances, an evaluation harness, and containerized environments to support reproducible research on AI-assisted mobile application development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MobileDev-Bench, a benchmark of 407 real-world issue-resolution tasks drawn from 19 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task includes a developer-reported issue paired with executable test patches that enable automated validation of model-generated fixes. The benchmark is characterized by high complexity (average 12.9 files and 334.6 lines changed, 41% multi-artifact fixes). Evaluation of four frontier LLMs (Claude Sonnet 4.5, Qwen3-Coder, GPT-5.2, Gemini 2.5 Flash) reports end-to-end resolution rates of 3.23–4.23% under automated retrieval and at most 5.69% under oracle retrieval, substantially below rates on prior benchmarks. The authors release the task instances, evaluation harness, and containerized environments.
Significance. If the construction and validation procedures hold, the benchmark would be a valuable addition to the field by filling a gap in mobile-specific, multi-artifact, framework-heterogeneous tasks that existing library-style benchmarks do not capture. The reported low resolution rates, if free of construction artifacts, would constitute a concrete, falsifiable signal that current LLMs struggle with coordinated changes across source, build, and resource files in realistic mobile environments. The public release of harness and containers supports reproducibility and follow-on work.
major comments (3)
- [§3] §3 (Benchmark Construction): The description of how the 407 tasks were collected from the 19 production apps provides no explicit sampling frame, inclusion/exclusion criteria, or stratification by app size or issue type. Without these, it is impossible to determine whether the low resolution rates reflect inherent LLM limitations or post-hoc selection of unusually complex or multi-artifact issues.
- [§4] §4 (Test Patch Validation): The claim that the supplied executable test patches correctly validate fixes and are free of framework-specific artifacts or selection bias lacks detail on verification procedures (e.g., minimality checks, pre-/post-fix test outcomes, or cross-framework consistency). This directly affects the central claim that the 3–5% rates are lower than prior benchmarks for reasons other than benchmark construction.
- [Table 2 / §5.2] Table 2 / §5.2 (Resolution Rates): The oracle-retrieval setting reports at most 5.69% success; however, the paper does not report per-model breakdowns or confidence intervals, nor does it compare against a simple baseline (e.g., retrieval-augmented generation with explicit multi-file prompting), making it difficult to isolate whether the gap is due to retrieval, reasoning, or patch generation.
minor comments (3)
- [Abstract / §1] The abstract and §1 use the term 'end-to-end resolution rates' without a precise definition of what constitutes a successful resolution (e.g., whether partial fixes or test-passing but non-idiomatic patches count).
- [Figure 1] Figure 1 (task distribution) would benefit from explicit labels for the three frameworks and a note on how 'multi-artifact' is operationalized.
- [§2] A small number of references to prior benchmarks (e.g., SWE-Bench) appear without page or section citations in the comparison paragraph.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have helped us strengthen the manuscript. We address each major comment point by point below and have made revisions to the paper where the concerns are valid.
read point-by-point responses
-
Referee: [§3] §3 (Benchmark Construction): The description of how the 407 tasks were collected from the 19 production apps provides no explicit sampling frame, inclusion/exclusion criteria, or stratification by app size or issue type. Without these, it is impossible to determine whether the low resolution rates reflect inherent LLM limitations or post-hoc selection of unusually complex or multi-artifact issues.
Authors: We acknowledge the need for greater transparency in the collection process. The 407 tasks were drawn from public GitHub repositories of 19 open-source mobile applications by querying for closed issues reported between 2022 and 2024 that included developer commits modifying at least three files and accompanied by executable tests. Inclusion criteria required: (1) verifiable failing tests pre-fix that pass post-fix, (2) involvement of at least two artifact types (source, build, resources), and (3) successful reproduction in our containerized environments. Exclusion criteria removed issues lacking tests, single-file changes, or those involving proprietary code. We stratified selection across the three frameworks (approximately 135 tasks each) and by repository size (small/medium/large based on total LOC). We will add a new subsection 3.1 explicitly describing this sampling frame, criteria, and stratification to allow readers to evaluate potential selection effects. revision: yes
-
Referee: [§4] §4 (Test Patch Validation): The claim that the supplied executable test patches correctly validate fixes and are free of framework-specific artifacts or selection bias lacks detail on verification procedures (e.g., minimality checks, pre-/post-fix test outcomes, or cross-framework consistency). This directly affects the central claim that the 3–5% rates are lower than prior benchmarks for reasons other than benchmark construction.
Authors: We agree that additional procedural details are required. In the revised manuscript we will expand §4 to document the following verification steps performed on all 407 test patches: (1) minimality checks, in which we iteratively removed individual test cases and confirmed that at least one test failed; (2) pre- and post-fix execution in the containerized environments showing 100% test failure before the developer fix and 100% pass after; and (3) cross-framework consistency tests confirming the harness produces equivalent pass/fail signals on Android, React Native, and Flutter setups. These steps were executed by the authors prior to release and support the claim that the low resolution rates are not artifacts of test construction. revision: yes
-
Referee: [Table 2 / §5.2] Table 2 / §5.2 (Resolution Rates): The oracle-retrieval setting reports at most 5.69% success; however, the paper does not report per-model breakdowns or confidence intervals, nor does it compare against a simple baseline (e.g., retrieval-augmented generation with explicit multi-file prompting), making it difficult to isolate whether the gap is due to retrieval, reasoning, or patch generation.
Authors: We will update Table 2 to report per-model success rates under both automated and oracle retrieval, accompanied by 95% confidence intervals obtained via bootstrap resampling (1,000 iterations). We have also added a simple baseline experiment using retrieval-augmented generation with an explicit multi-file prompt instructing the model to identify and edit all relevant files. This baseline achieves 1.8–2.4% success under oracle retrieval, indicating that the performance gap is driven primarily by reasoning and coordinated patch generation rather than retrieval alone. These results and the baseline description will be incorporated into §5.2. revision: yes
Circularity Check
No circularity: purely empirical benchmark construction and evaluation
full rationale
The paper is an empirical contribution that collects 407 tasks from 19 production mobile apps and evaluates frontier LLMs on them using executable test patches. No equations, fitted parameters, self-citations used as load-bearing premises, or derivational steps appear in the provided text. The central claims (low resolution rates of 3-5%) are direct measurements on the constructed dataset rather than reductions to prior self-referential results. Task selection criteria and test validity are external assumptions subject to empirical verification, not internal circular definitions or renamings of known results. This is the standard non-circular pattern for benchmark papers.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.