TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks
Pith reviewed 2026-05-16 09:39 UTC · model grok-4.3
The pith
TimeMachine-bench shows that LLMs can attempt repository-level code migrations but frequently produce unreliable fixes that exploit weak tests or add needless changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TimeMachine-bench is built by automatically selecting GitHub repositories whose tests start failing after dependency updates, with a fully automated construction process that supports live updates and a human-verified subset to confirm solvability. When agent-based baselines using eleven models were tested on the verified subset, the models showed some ability to perform migrations but displayed major reliability problems, including spurious solutions that pass tests only because coverage is low and unnecessary edits caused by suboptimal tool-use strategies.
What carries the argument
TimeMachine-bench, an automated collection of repository-level migration tasks drawn from real dependency updates in Python projects together with a human-verified test subset for evaluating agent systems.
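The selection rule described above reduces to a simple predicate: keep a repository when its tests pass under the old dependency set and fail under the updated one. The following is a minimal sketch of that idea, not the authors' pipeline. `Candidate`, `Runner`, `select_tasks`, the snapshot labels, and the toy outcomes are all hypothetical names and values introduced for illustration; a real runner would install dependencies and execute the test suite in a sandbox.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """A repository paired with two dependency snapshots (hypothetical structure)."""
    repo: str
    old_env: str  # label for the dependency set before the update
    new_env: str  # label for the dependency set after the update


# Hypothetical runner signature: True when the repo's tests pass in the given env.
Runner = Callable[[str, str], bool]


def is_migration_task(c: Candidate, run_tests: Runner) -> bool:
    """Keep a candidate only if tests pass before the update and fail after it."""
    return run_tests(c.repo, c.old_env) and not run_tests(c.repo, c.new_env)


def select_tasks(candidates: Iterable[Candidate], run_tests: Runner) -> List[Candidate]:
    """Filter candidates down to those that qualify as migration tasks."""
    return [c for c in candidates if is_migration_task(c, run_tests)]


# Toy stand-in for real test runs (illustrative outcomes only, not benchmark data).
_FAKE_RESULTS = {
    ("org/a", "old"): True, ("org/a", "new"): False,   # breaks after the update
    ("org/b", "old"): True, ("org/b", "new"): True,    # unaffected by the update
    ("org/c", "old"): False, ("org/c", "new"): False,  # already failing beforehand
}


def fake_runner(repo: str, env: str) -> bool:
    return _FAKE_RESULTS[(repo, env)]


candidates = [Candidate(r, "old", "new") for r in ("org/a", "org/b", "org/c")]
selected = select_tasks(candidates, fake_runner)
print([c.repo for c in selected])  # prints ['org/a']
```

Requiring a pass under the old environment excludes repositories that were already broken, so the failure can be attributed to the dependency update rather than to unrelated defects.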
If this is right
- LLMs achieve partial success on migration tasks but remain unreliable across the verified subset.
- Spurious solutions appear when test coverage is low enough for incorrect changes to still pass existing tests.
- Suboptimal tool-use strategies cause models to make unnecessary edits instead of minimal targeted fixes.
- The automated construction process allows the benchmark to incorporate new repositories and updates over time.
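A toy illustration of the spurious-solution failure mode listed above: when the only existing test exercises a single branch, a patch that migrates just that branch still passes. This is a constructed example, not a case from the benchmark. `join_v2`, `format_name_correct`, `format_name_spurious`, and both checks are hypothetical names, and the "updated dependency" is simulated by a local function whose keyword `sep` is assumed to have been renamed to `delimiter`.

```python
def join_v2(parts, delimiter=" "):
    """Stand-in for an updated dependency: the old `sep` keyword no longer exists."""
    return delimiter.join(parts)


def format_name_correct(first, last, formal=False):
    """Complete migration: both call sites use the new keyword."""
    if formal:
        return join_v2([last, first], delimiter=", ")
    return join_v2([first, last], delimiter=" ")


def format_name_spurious(first, last, formal=False):
    """Incomplete migration: only the branch reached by the test was updated."""
    if formal:
        return join_v2([last, first], sep=", ")  # still uses the removed keyword
    return join_v2([first, last], delimiter=" ")


def existing_test(fn):
    """The project's only test; it never calls the function with formal=True."""
    return fn("Ada", "Lovelace") == "Ada Lovelace"


def untested_path_check(fn):
    """An extra check on the branch the existing test does not cover."""
    try:
        return fn("Ada", "Lovelace", formal=True) == "Lovelace, Ada"
    except TypeError:
        return False


for fn in (format_name_correct, format_name_spurious):
    print(fn.__name__, existing_test(fn), untested_path_check(fn))
```

Both variants pass `existing_test`, but only the complete migration survives the check on the uncovered branch, which is why test passage alone cannot separate the two.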
Where Pith is reading between the lines
- Adding explicit coverage measurement to the benchmark tasks could reveal whether higher coverage changes the observed reliability gaps.
- The same construction method could be applied to other languages to test whether the reliability problems are Python-specific or general.
- Prompting agents to first run coverage tools before editing might reduce the unnecessary changes seen in the evaluations.
Load-bearing premise
The selected repositories and dependency updates accurately represent typical real-world migration scenarios that software engineers encounter in practice.
What would settle it
A manual audit of hundreds of real developer commits that update dependencies, checking whether the benchmark's failing-test pattern matches the failure modes and edit patterns seen in those commits. A poor match would undermine the benchmark's claim to represent real-world migrations.
With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks reflecting the day-to-day work of software engineers. Among these tasks, software migration, a critical process of adapting code to evolving environments, has been largely overlooked. In this study, we introduce TimeMachine-bench, a benchmark designed to evaluate software migration in real-world Python projects. Our benchmark consists of GitHub repositories whose tests begin to fail in response to dependency updates. The construction process is fully automated, enabling live updates of the benchmark. Furthermore, we curated a human-verified subset to ensure problem solvability. We evaluated agent-based baselines built on top of 11 models, including both strong open-weight and state-of-the-art LLMs on this verified subset. Our results indicated that, while LLMs show some promise for migration tasks, they continue to face substantial reliability challenges, including spurious solutions that exploit low test coverage and unnecessary edits stemming from suboptimal tool-use strategies. Our dataset and implementation are available at https://github.com/tohoku-nlp/timemachine-bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TimeMachine-bench, an automated benchmark for repository-level migration tasks in Python projects drawn from GitHub repositories where dependency updates cause test failures. It includes a human-verified subset to ensure solvability and evaluates agent-based baselines built on 11 LLMs, concluding that models show some promise but exhibit substantial reliability challenges including spurious solutions that exploit low test coverage and unnecessary edits from suboptimal tool-use strategies.
Significance. If the central findings hold after addressing verification gaps, the benchmark would be a valuable contribution to automated software engineering by providing a live, reproducible resource for assessing practical code migration capabilities and exposing specific LLM failure modes that can inform better agent designs.
major comments (2)
- [human-verified subset and results] The section describing the human-verified subset and results: the manuscript reports no coverage statistics (line or branch) for the test suites nor additional oracles such as manual semantic review. This is load-bearing for the claim of spurious solutions exploiting low test coverage, since success is defined solely by test passage and the distinction between correct migrations and incomplete validation therefore rests on an unverified assumption of test-suite completeness.
- [evaluation methodology] The evaluation methodology section: limited detail is provided on exact success metrics, baseline implementations, and how test coverage was measured, creating verification gaps that weaken the reliability-challenge conclusions.
minor comments (1)
- [abstract] The abstract could include more quantitative details such as the number of repositories, size of the verified subset, and specific performance numbers to strengthen the summary of findings.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We agree that the manuscript would benefit from additional quantitative details on test coverage and expanded descriptions of the evaluation methodology to better support our claims. We address each major comment below and will revise the manuscript accordingly.
- Referee: [human-verified subset and results] The section describing the human-verified subset and results: the manuscript reports no coverage statistics (line or branch) for the test suites nor additional oracles such as manual semantic review. This is load-bearing for the claim of spurious solutions exploiting low test coverage, since success is defined solely by test passage and the distinction between correct migrations and incomplete validation therefore rests on an unverified assumption of test-suite completeness.
Authors: We agree that coverage statistics are necessary to strengthen the claim regarding spurious solutions. In the revised manuscript, we will report average line and branch coverage (computed via coverage.py) for the test suites in the human-verified subset and the full benchmark. We will also clarify the human verification protocol: it consisted of manual inspection to confirm that each selected migration task produces reproducible test failures requiring code changes and is solvable by a competent developer. While we did not perform exhaustive semantic review of every model output, the verification established problem solvability rather than validating all fixes. We will explicitly note the limitations of test-passage-only evaluation and discuss how low coverage can enable spurious solutions. revision: yes
- Referee: [evaluation methodology] The evaluation methodology section: limited detail is provided on exact success metrics, baseline implementations, and how test coverage was measured, creating verification gaps that weaken the reliability-challenge conclusions.
Authors: We acknowledge that greater detail is required for reproducibility and to support the reliability-challenge findings. In the revision, we will expand the evaluation methodology section to (1) precisely define success metrics (test-passage rate after agent execution, with explicit criteria for partial vs. full success), (2) provide complete specifications of the agent baselines, including the 11 LLMs (with versions, prompting strategies, and tool-use configurations), and (3) describe the exact procedure and tooling used to measure test coverage. These additions will allow readers to verify the reported failure modes. revision: yes
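The coverage reporting promised in the responses above could be aggregated along the following lines. This is a rough sketch under stated assumptions, not the authors' procedure: it assumes each repository yields a totals dictionary with the keys `covered_lines`, `num_statements`, `covered_branches`, and `num_branches` (names chosen to mirror the `totals` section of a coverage.py JSON report with branch measurement enabled, which should be checked against the installed version), and it assumes "average" means an unweighted mean over repositories. The numbers are made up for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Dict, List, Optional


def percent(covered: int, total: int) -> Optional[float]:
    """Coverage in percent, or None when there is nothing to cover."""
    return 100.0 * covered / total if total else None


def average_coverage(reports: List[Dict[str, int]]) -> Dict[str, float]:
    """Unweighted mean of per-repository line and branch coverage."""
    line, branch = [], []
    for totals in reports:
        lp = percent(totals["covered_lines"], totals["num_statements"])
        bp = percent(totals.get("covered_branches", 0), totals.get("num_branches", 0))
        if lp is not None:
            line.append(lp)
        if bp is not None:  # skip repositories without any branches
            branch.append(bp)
    return {"line": mean(line), "branch": mean(branch)}


# Illustrative per-repository totals (assumed values, not benchmark data).
toy_reports = [
    {"covered_lines": 80, "num_statements": 100, "covered_branches": 10, "num_branches": 20},
    {"covered_lines": 30, "num_statements": 50, "covered_branches": 5, "num_branches": 20},
]

summary = average_coverage(toy_reports)
print(summary)  # prints {'line': 70.0, 'branch': 37.5}
```

An unweighted mean treats every repository equally; pooling covered and total counts across repositories instead would weight large projects more heavily, so whichever choice is made should be stated alongside the reported numbers.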
Circularity Check
No circularity in benchmark construction or empirical evaluation
The paper introduces TimeMachine-bench as an automated benchmark drawn from external GitHub repositories whose tests fail after dependency updates, with a human-verified subset for solvability. Model evaluations are performed directly on this data using agent-based baselines across 11 LLMs, reporting empirical outcomes such as spurious solutions and tool-use issues. No equations, derivations, fitted parameters, or predictions are present that reduce to self-referential inputs or self-citations by construction. The central claims rest on observable test-passage results from external data sources rather than any load-bearing self-definition or renaming of known results, rendering the work self-contained.