From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level
Pith reviewed 2026-05-16 17:18 UTC · model grok-4.3
The pith
Frontier models exhibit a consistent aggregation deficit in repository-level code reasoning, limited primarily by integration width across files.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RepoReason is a white-box diagnostic benchmark centered on abductive assertion verification at the repository level. An execution-driven mutation framework treats the environment as a semantic oracle to regenerate ground-truth states, eliminating memorization while retaining logical depth. Dynamic program slicing quantifies reasoning through three orthogonal metrics: ESV for reading load, MCL for simulation depth, and DFI for integration width. Comprehensive tests of frontier models such as Claude-4.5-Sonnet and DeepSeek-v3.1-Terminus reveal a prevalent aggregation deficit in which integration width constitutes the primary cognitive bottleneck.
What carries the argument
The RepoReason benchmark together with its execution-driven mutation framework and the three metrics (ESV reading load, MCL simulation depth, DFI integration width) produced by dynamic program slicing.
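The source gives no formulas for the three metrics (the referee report raises the same point), so the Python sketch below uses hypothetical proxies only: distinct executed lines for reading load, maximum call depth for simulation depth, and distinct functions entered for integration width. It also swaps in whole-run tracing via `sys.settrace` for real dynamic slicing, which would keep only the statements the assertion actually depends on. The function `trace_metrics`, the proxy names, and the toy functions are all illustrative assumptions, not the paper's definitions.

```python
import sys

def trace_metrics(func, *args):
    """Run func(*args) under a tracer and return (result, proxy metrics).

    The proxies are assumptions for illustration, not the paper's ESV/MCL/DFI.
    """
    executed = set()    # (filename, lineno) pairs that ran
    functions = set()   # (filename, function name) pairs that were entered
    depth = 0
    max_depth = 0

    def tracer(frame, event, arg):
        nonlocal depth, max_depth
        code = frame.f_code
        if event == "call":
            depth += 1
            max_depth = max(max_depth, depth)
            functions.add((code.co_filename, code.co_name))
        elif event == "line":
            executed.add((code.co_filename, frame.f_lineno))
        elif event == "return":
            depth -= 1
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, {
        "reading_load_proxy": len(executed),
        "simulation_depth_proxy": max_depth,
        "integration_width_proxy": len(functions),
    }

# Toy stand-ins for code spread across a repository.
def parse(text):
    values = []
    for tok in text.split(","):
        values.append(int(tok))
    return values

def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc

def report(text):
    return total(parse(text))

result, metrics = trace_metrics(report, "1,2,3")
print(result, metrics)
```

In a real repository the width proxy would more plausibly count distinct files or modules feeding the asserted value; functions are used here only so the example fits in one file.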
If this is right
- Agentic systems must develop stronger mechanisms for cross-file information aggregation to succeed at repository scale.
- The three-metric diagnostic allows developers to target specific reasoning weaknesses rather than relying on aggregate accuracy scores.
- White-box evaluation with mutated states provides more reliable signals for iterative model improvement than black-box testing.
- Next-generation agentic software engineering tools will need to prioritize integration capabilities to move beyond current limitations.
Where Pith is reading between the lines
- Training regimes that emphasize multi-file dependency tracking may be required to close the observed aggregation gap.
- The mutation approach could extend to other complex reasoning domains where data leakage is a concern.
- Limitations in integration width may underlie failures in long-horizon software maintenance tasks even when shorter contexts succeed.
Load-bearing premise
The execution-driven mutation framework using the runtime environment as a semantic oracle creates ground-truth states that test genuine reasoning without allowing memorization.
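A minimal Python sketch of this premise, under stated assumptions: the source does not describe the mutation operators, so a mutation here is just a small change to integer inputs, and "the environment as oracle" is modelled by re-executing the code to obtain the new expected value. The names `apply_discount`, `mutate_inputs`, and `regenerate_assertion` are hypothetical.

```python
import random

def apply_discount(price_cents, percent):
    """Toy code under test; stands in for repository code."""
    return price_cents - (price_cents * percent) // 100

def mutate_inputs(args, rng):
    """Hypothetical mutation operator: shift each integer by a small nonzero offset."""
    return tuple(a + rng.choice([-3, -2, -1, 1, 2, 3]) for a in args)

def regenerate_assertion(func, original_args, seed=0):
    """Mutate the inputs, then execute func to obtain the new ground truth."""
    rng = random.Random(seed)
    new_args = mutate_inputs(original_args, rng)
    # The runtime, not a stored answer, supplies the expected value.
    expected = func(*new_args)
    return new_args, expected

original_args = (2000, 15)
new_args, expected = regenerate_assertion(apply_discount, original_args)
print(f"assert apply_discount{new_args} == {expected}")
```

Because the expected value is recomputed after mutation, a model that memorized the original assertion gains nothing from it, while the logic it must follow to reach the answer is unchanged.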
What would settle it
A model that achieves high success rates on live repository-scale tasks while still failing the benchmark's high-integration-width (high-DFI) items would falsify the claim that integration width is the dominant bottleneck.
Original abstract
As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that utilizes the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system using dynamic program slicing, quantifying reasoning via three orthogonal metrics: $ESV$ (reading load), $MCL$ (simulation depth), and $DFI$ (integration width). Comprehensive evaluations of frontier models (e.g., Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus) reveal a prevalent aggregation deficit, where integration width serves as the primary cognitive bottleneck. Our findings provide granular white-box insights for optimizing the next generation of agentic software engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RepoReason, a white-box benchmark for repository-level agentic code reasoning via abductive assertion verification. It employs an execution-driven mutation framework that treats the environment as a semantic oracle to regenerate ground-truth states, aiming to remove memorization while retaining logical depth. Reasoning performance is decomposed into three metrics—ESV (reading load), MCL (simulation depth), and DFI (integration width)—derived from dynamic program slicing. Evaluations of frontier models (Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus, etc.) identify a prevalent aggregation deficit in which integration width (DFI) constitutes the dominant cognitive bottleneck.
Significance. If the mutation process demonstrably eliminates residual memorization while preserving authentic logical dependencies, the benchmark supplies actionable, fine-grained diagnostics that could directly inform architectural improvements for agentic systems in real-world software engineering. The orthogonal metric decomposition offers a concrete way to measure and target specific failure modes beyond aggregate accuracy.
major comments (3)
- [§3] §3 (Mutation Framework): The claim that the execution-driven mutation framework eliminates memorization rests on the environment acting as a semantic oracle, yet no quantitative validation is supplied (n-gram overlap, accuracy delta between original and mutated states, or ablation across mutation operators). Because the central finding that DFI is the primary bottleneck depends on the ground-truth states being free of surface patterns, this omission is load-bearing.
- [§4.2] §4.2 (Metric Definitions): The three metrics ESV, MCL, and DFI are introduced via dynamic program slicing, but the manuscript provides neither explicit formulas nor pseudocode for their computation. Without these, it is impossible to verify orthogonality or to reproduce how integration width is isolated as the dominant factor.
- [§5.1] Table 2 / §5.1 (Model Results): The reported performance gaps are presented without error bars, statistical significance tests, or controls for repository size and dependency density. This weakens the assertion that the observed aggregation deficit is a general cognitive limit rather than an artifact of the selected repositories.
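One of the validations requested above, n-gram overlap between original and mutated items, could be computed along the following lines. This is a sketch assuming whitespace tokenization and Jaccard similarity over token trigrams; the source does not say which overlap measure is intended, and the two example assertion strings are invented for illustration.

```python
def ngrams(text, n=3):
    """Return the set of token n-grams in text, using whitespace tokenization."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two texts (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # neither text is long enough to form an n-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Invented example pair: an original assertion and a mutated counterpart.
original = "assert apply_discount ( 2000 , 15 ) == 1700"
mutated = "assert apply_discount ( 1998 , 16 ) == 1679"

overlap = ngram_jaccard(original, mutated)
print(f"trigram Jaccard overlap: {overlap:.3f}")
```

Low overlap shows only that surface strings changed. The accuracy delta and operator ablation named in the same comment would still be needed to argue that memorization is removed.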
minor comments (2)
- [Abstract] The abstract and §2 use the terms 'ESV', 'MCL', and 'DFI' before they are defined; a brief parenthetical gloss on first use would improve readability.
- [Figure 3] Figure 3 (metric correlation matrix) is referenced but the caption does not state the number of repositories or the slicing depth used to generate the data.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity, reproducibility, and robustness of the claims.
- Referee: [§3] §3 (Mutation Framework): The claim that the execution-driven mutation framework eliminates memorization rests on the environment acting as a semantic oracle, yet no quantitative validation is supplied (n-gram overlap, accuracy delta between original and mutated states, or ablation across mutation operators). Because the central finding that DFI is the primary bottleneck depends on the ground-truth states being free of surface patterns, this omission is load-bearing.
  Authors: We agree that quantitative validation is necessary to substantiate the claim that the mutation framework removes memorization while preserving logical dependencies. In the revised manuscript, we will add n-gram overlap statistics comparing original and mutated states, accuracy deltas on model performance between original and mutated versions, and ablation results across mutation operators to demonstrate disruption of surface patterns.
  Revision: yes
- Referee: [§4.2] §4.2 (Metric Definitions): The three metrics ESV, MCL, and DFI are introduced via dynamic program slicing, but the manuscript provides neither explicit formulas nor pseudocode for their computation. Without these, it is impossible to verify orthogonality or to reproduce how integration width is isolated as the dominant factor.
  Authors: We acknowledge that explicit formulas and pseudocode are required for reproducibility and verification of orthogonality. We will add the mathematical definitions for ESV (reading load), MCL (simulation depth), and DFI (integration width), along with pseudocode for their computation from dynamic program slicing, in the revised Section 4.2, including a brief discussion of their orthogonality.
  Revision: yes
- Referee: [§5.1] Table 2 / §5.1 (Model Results): The reported performance gaps are presented without error bars, statistical significance tests, or controls for repository size and dependency density. This weakens the assertion that the observed aggregation deficit is a general cognitive limit rather than an artifact of the selected repositories.
  Authors: We will update Table 2 to include error bars and statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for the reported gaps. We will also add controls and analysis for repository size and dependency density, including correlations and stratified results across repository characteristics, to confirm the aggregation deficit is not an artifact of the selected set.
  Revision: yes
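The significance testing promised in the last response might look like the following Python sketch. It uses a paired sign-flip permutation test as a dependency-free stand-in for the paired t-test or Wilcoxon test named above, so it is a swapped-in technique rather than the authors' method. The per-repository accuracies are made-up illustrative numbers, not results from the paper, and the low-DFI versus high-DFI split is an assumed way of framing the comparison.

```python
import random

def paired_permutation_pvalue(a, b, trials=10_000, seed=0):
    """Two-sided p-value for mean(a - b) == 0.

    Randomly flips the sign of each paired difference and counts how often
    the resulting mean is at least as extreme as the observed one.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction keeps p above zero

# Hypothetical per-repository accuracies for one model (illustrative only).
low_dfi_acc = [0.82, 0.79, 0.85, 0.77, 0.81, 0.80, 0.84, 0.78]
high_dfi_acc = [0.61, 0.58, 0.66, 0.55, 0.60, 0.63, 0.59, 0.57]

p_value = paired_permutation_pvalue(low_dfi_acc, high_dfi_acc)
print(f"paired permutation p-value: {p_value:.4f}")
```

Pairing by repository is what lets the test absorb between-repository variation. Controls for repository size and dependency density would still need a stratified or regression-style analysis on top of this.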
Circularity Check
No significant circularity detected
The paper introduces RepoReason as a benchmark using an execution-driven mutation framework with the environment as semantic oracle to regenerate ground-truth states, along with orthogonal metrics ESV, MCL, and DFI for quantifying reasoning. No equations, derivations, or self-referential definitions appear in the abstract or described claims that would reduce any prediction or result to its inputs by construction. No self-citations are invoked as load-bearing for the aggregation deficit claim, and no uniqueness theorems or ansatzes from prior author work are referenced. The central findings are presented as outcomes of model evaluations on the constructed benchmark rather than forced equivalences. This constitutes a self-contained benchmark description without circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The environment can serve as a semantic oracle for regenerating ground-truth states via mutations.
invented entities (3)
- ESV (reading load): no independent evidence
- MCL (simulation depth): no independent evidence
- DFI (integration width): no independent evidence
Forward citations
Cited by 2 Pith papers
- Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair
  Reshaping outcome rewards, process signals, and rollout comparability in GRPO raises strict compile-and-semantic accuracy in agentic code repair from 0.385 to 0.535 under weak feedback.
- Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
  Nine LLM-agent audit rounds on a 7150-line prompt specification surface found 51 defects with non-monotonic convergence and a post-hoc seven-category taxonomy, showing single-file review misses defect classes.