pith. machine review for the scientific record.

arxiv: 2604.11088 · v1 · submitted 2026-04-13 · 💻 cs.AI · cs.CL

Recognition: unknown

Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 15:33 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords coding agents · instruction rules · negative constraints · context priming · agent performance · SWE-bench · guardrails · positive directives

The pith

Negative constraints raise AI coding agent success rates while positive instructions lower them, and random rules match expert ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Developers commonly supply natural language rule files to steer AI coding agents, yet it remains unclear whether these rules improve outcomes or introduce distortions. The study evaluates thousands of such rules across more than five thousand agent runs on a standard coding benchmark. It finds that rules deliver a net performance gain of 7 to 14 percentage points mainly by supplying general context rather than targeted instructions. Negative rules that prohibit specific actions prove helpful on their own, but positive rules that direct particular behaviors tend to reduce success. Collections of rules stay beneficial up to at least fifty items even though most individual rules are neutral or harmful.

Core claim

The paper shows that instruction rules improve coding agent performance primarily through context priming, since randomly generated rules produce gains equivalent to expert-curated ones. Negative constraints are the only rule category that helps when applied alone, whereas positive directives harm performance, a pattern examined through potential-based reward shaping. Although single rules often degrade results, larger sets remain effective without loss up to fifty rules, exposing the risk that well-intentioned guidance can distort agent behavior.

What carries the argument

Large-scale empirical comparison of rule types (negative constraints versus positive directives) and curation levels (expert-curated versus random) across thousands of agent runs on SWE-bench Verified.

Load-bearing premise

That measured differences in agent success rates are produced by the rules and their types rather than by unmeasured variations in prompt insertion, phrasing, or run-to-run randomness.

What would settle it

A follow-up test that keeps total prompt length fixed while moving the same rules to different positions or surrounding text and measures whether the reported performance gains disappear.
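
A minimal sketch of such a control, assuming a hypothetical run_agent(task, system_prompt) harness that returns pass/fail for a single SWE-bench task; the padding trick, the helper names, and the character budget are illustrative choices, not the paper's protocol.

    # Length-controlled position test: the same rules are injected at different
    # positions while every condition is padded to one fixed budget, so a gain
    # that survives cannot be explained by prompt length alone.
    # `run_agent`, `task.base_prompt`, and FILLER are hypothetical stand-ins.
    from statistics import mean

    FILLER = "This sentence is neutral padding and carries no instruction. "

    def build_prompt(base: str, rules: list[str], position: str, budget: int) -> str:
        block = "Guidelines:\n" + "\n".join(f"- {r}" for r in rules)
        if position == "start":
            prompt = f"{block}\n\n{base}"
        elif position == "end":
            prompt = f"{base}\n\n{block}"
        else:  # "none": a length-matched baseline with no rules at all
            prompt = base
        while len(prompt) < budget:  # equalize total length across conditions
            prompt += "\n" + FILLER
        return prompt

    def pass_rate(tasks, rules, position, budget, run_agent) -> float:
        outcomes = [
            run_agent(t, system_prompt=build_prompt(t.base_prompt, rules, position, budget))
            for t in tasks
        ]
        return mean(bool(o) for o in outcomes)

If the padded no-rule baseline catches up with the rule conditions, the 7 to 14 point gain reads as a length or formatting artifact; if the gap persists across positions, it is at least not explained by prompt length alone.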

Figures

Figures reproduced from arXiv: 2604.11088 by Bing Zhu, Guanghui Wang, Peiyang He, Wei Qiu, Xing Zhang, Yanwei Cui, Ziyuan Li.

Figure 1
Figure 1: Two key findings. (a) All rule conditions outperform the no-rule baseline by 7–14pp, but random rules match curated ones (both 63.8%), suggesting a context priming effect rather than content-specific instruction. (b) Per-rule ablation on 18 curated rules: the 3 “shaping” rules (green) are all negative constraints (“do not X”), while the 4 “distorting” rules (orange) are all positive directives (“do X”).
Figure 2
Figure 2: Taxonomy of 25,532 rules from 679 GitHub rule files. Project-specific rules dominate (64.9%), confirming that practitioners primarily use rule files for injecting repository context rather than behavioral guidance; negative constraints like “do not modify unrelated files” account for 3.0%. We experiment with the five transferable categories.
Figure 3
Figure 3: Rule quantity and type. (a) Pass rate vs. rule count (n = 58). Individual seed results shown as light dots; mean ± std as line with shading. No phase transition is observed; 50 rules slightly outperform 0 rules. (b) Pass rate by rule type (n = 58). Tool/process rules (state-dependent) score highest; architecture rules (state-independent) score lowest—a 10.4pp spread consistent with P1.
Figure 4
Figure 4: Individual rule effects. (a) Per-rule distortion: each of the 18 curated rules applied individually to 17 baseline-solved tasks. 14/18 rules break ≥2 tasks; the worst break 4/17 (24%). Most rules are individually harmful—they work only in ensemble. (b) Superposition test: observed vs. additive-expected pass rate for 5 rule pairs. 3/5 pairs are approximately additive (marked +), but 2 show strong non-linear…
Original abstract

Developers increasingly guide AI coding agents through natural language instruction files (e.g., CLAUDE.md, .cursorrules), yet no controlled study has measured whether these rules actually improve agent performance or which properties make a rule beneficial. We scrape 679 such files (25,532 rules) from GitHub and conduct the first large-scale empirical evaluation, running over 5,000 agent runs with a state-of-the-art coding agent on SWE-bench Verified. Rules improve performance by 7--14 percentage points, but random rules help as much as expert-curated ones -- suggesting rules work through context priming rather than specific instruction. Negative constraints ("do not refactor unrelated code") are the only individually beneficial rule type, while positive directives ("follow code style") actively hurt -- a pattern we analyze through the lens of potential-based reward shaping (PBRS). Moreover, individual rules are mostly harmful in isolation yet collectively helpful, with no degradation up to 50 rules. These findings expose a hidden reliability risk -- well-intentioned rules routinely degrade agent performance -- and provide a clear principle for safe agent configuration: constrain what agents must not do, rather than prescribing what they should.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript scrapes 679 natural language instruction files (25,532 rules) from GitHub and evaluates their effect on a state-of-the-art coding agent via over 5,000 runs on SWE-bench Verified. It claims rules raise success rates by 7-14 percentage points, that random rules perform equivalently to expert-curated ones (implying context priming rather than specific guidance), that only negative constraints are individually beneficial while positive directives harm performance, that rules are collectively helpful despite individual harm, and that performance does not degrade up to 50 rules; the pattern is interpreted through potential-based reward shaping.

Significance. If the central empirical claims survive rigorous controls and statistical validation, the work would be significant for AI agent configuration: it supplies large-scale evidence that guardrails function primarily via priming, identifies a concrete reliability risk in positive directives, and offers a simple principle (favor negative constraints) with direct implications for developer practice. The scale of the evaluation and the counter-intuitive random-rule result are notable strengths.

major comments (3)
  1. [Abstract / Results] Abstract and Results: the headline claims of 7-14 pp gains and type-specific effects (negative constraints help, positive hurt) are presented without statistical tests, standard errors, per-condition sample sizes, or controls for prompt length/token count. This is load-bearing for the priming-vs-guidance distinction and the rule-type conclusions, as unmeasured insertion artifacts or run variance could fully explain the deltas.
  2. [Methods] Methods: no description is given of the rule-insertion procedure (placement, formatting, length-matching between expert and random conditions), the categorization scheme for rule types, or how the 5,000 runs were allocated across conditions. These omissions prevent attribution of performance differences to rule semantics rather than prompt-engineering confounds.
  3. [Analysis / Discussion] Analysis: the appeal to potential-based reward shaping (PBRS) to explain why negative constraints outperform positive directives lacks an explicit mapping, formalization, or quantitative check against the observed data, leaving the interpretive framework unsupported for the shaping-vs-distortion claim.
minor comments (2)
  1. [Abstract] Abstract: the exact per-condition N and any length-matching protocol should be stated explicitly rather than only the aggregate 5,000 runs.
  2. [References] References: the SWE-bench Verified benchmark and the original SWE-bench paper should be cited with full bibliographic details.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas where the manuscript can be strengthened in terms of statistical rigor and methodological detail. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract / Results] Abstract and Results: the headline claims of 7-14 pp gains and type-specific effects (negative constraints help, positive hurt) are presented without statistical tests, standard errors, per-condition sample sizes, or controls for prompt length/token count. This is load-bearing for the priming-vs-guidance distinction and the rule-type conclusions, as unmeasured insertion artifacts or run variance could fully explain the deltas.

    Authors: We agree that providing statistical tests, standard errors, and controls is crucial for supporting the central claims. In the revised manuscript, we will include bootstrap-derived standard errors and 95% confidence intervals for all reported performance differences, per-condition sample sizes, and results from permutation tests assessing the significance of the observed deltas. Additionally, we will report average prompt token counts across conditions and include a length-controlled baseline to address potential insertion artifacts. These enhancements will be added to both the Abstract and Results sections; a schematic version of such tests is sketched after this list. revision: yes

  2. Referee: [Methods] Methods: no description is given of the rule-insertion procedure (placement, formatting, length-matching between expert and random conditions), the categorization scheme for rule types, or how the 5,000 runs were allocated across conditions. These omissions prevent attribution of performance differences to rule semantics rather than prompt-engineering confounds.

    Authors: We acknowledge the need for greater transparency in the Methods section. We will revise it to detail the rule-insertion procedure: rules are prepended to the system prompt as a bulleted list under a 'Guidelines' header, with formatting standardized. For length-matching, random rules were selected to match the token length distribution of expert rules. The categorization scheme involved labeling rules into positive directives, negative constraints, and other categories. Regarding run allocation, the over 5,000 runs consist of evaluations across multiple conditions on the SWE-bench tasks. These details will be added to ensure reproducibility and rule out confounds; a sketch of the insertion and length-matching steps follows this list. revision: yes

  3. Referee: [Analysis / Discussion] Analysis: the appeal to potential-based reward shaping (PBRS) to explain why negative constraints outperform positive directives lacks an explicit mapping, formalization, or quantitative check against the observed data, leaving the interpretive framework unsupported for the shaping-vs-distortion claim.

    Authors: The PBRS reference serves as a conceptual framework to interpret the empirical patterns rather than a rigorous theoretical model. In the revision, we will provide an explicit mapping in the Discussion: negative constraints can be viewed as adding a potential function that decreases the value of states leading to undesired behaviors, thereby shaping the policy without altering the optimal policy per PBRS theory. Positive directives may introduce shaping terms that distort the reward landscape. We will formalize this with a simple equation relating rule types to potential functions and compare it qualitatively to the observed data. However, a full quantitative validation is not feasible without internal access to the agent's reward function. We will explicitly note this limitation and frame the PBRS discussion as suggestive rather than definitive. revision: partial
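
To make the mapping sketched in response 3 concrete, the standard potential-based shaping identity (Ng, Harada and Russell, 1999) can be written out with an illustrative potential for a negative constraint; the specific form of Φ and the indicator bonus used for positive directives are assumptions of this sketch, not equations taken from the paper.

    % Potential-based shaping: adding F leaves the optimal policy unchanged.
    R'(s, a, s') = R(s, a, s') + F(s, s'),
    \qquad F(s, s') = \gamma\,\Phi(s') - \Phi(s).

    % Illustrative potential for a negative constraint ("do not do X"):
    % constraint-violating states sit at lower potential, so trajectories are
    % steered away from them without redefining what counts as task success.
    \Phi_{\mathrm{neg}}(s) =
    \begin{cases}
      -\lambda, & s \text{ violates the prohibition } X,\\
      0,        & \text{otherwise},
    \end{cases}
    \qquad \lambda > 0.

    % A positive directive behaves more like an action-dependent bonus,
    %   R'(s, a, s') = R(s, a, s') + b\,\mathbf{1}[a \in A_{\mathrm{directed}}],
    % which in general cannot be rewritten as \gamma\Phi(s') - \Phi(s) and can
    % therefore change which policy is optimal: the "distortion" of the title.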
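
For the statistical additions promised in response 1, here is a minimal sketch of a paired permutation test and a percentile bootstrap interval over per-task pass/fail outcomes, written against NumPy; the arrays, seed, and resample counts are placeholders rather than the authors' analysis code.

    # Paired significance check for one rule condition against the no-rule
    # baseline. `baseline` and `treated` are hypothetical 0/1 arrays of
    # per-task outcomes aligned by task; they stand in for the paper's data.
    import numpy as np

    rng = np.random.default_rng(0)

    def paired_permutation_test(baseline, treated, n_perm=10_000):
        """Two-sided p-value for the mean per-task difference under sign flips."""
        diffs = np.asarray(treated, float) - np.asarray(baseline, float)
        observed = diffs.mean()
        signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
        null = (signs * diffs).mean(axis=1)
        return float((np.abs(null) >= abs(observed)).mean())

    def bootstrap_ci(baseline, treated, n_boot=10_000, alpha=0.05):
        """Percentile bootstrap CI for the pass-rate difference, resampling tasks."""
        baseline = np.asarray(baseline, float)
        treated = np.asarray(treated, float)
        idx = rng.integers(0, baseline.size, size=(n_boot, baseline.size))
        deltas = treated[idx].mean(axis=1) - baseline[idx].mean(axis=1)
        return float(np.quantile(deltas, alpha / 2)), float(np.quantile(deltas, 1 - alpha / 2))

At the scale of the 58 tasks shown in Figure 3, a 7 to 14 point gap corresponds to roughly 4 to 8 additional solved tasks, so the width of these intervals matters as much as the point estimates.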
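
And for the insertion and length-matching procedure described in response 2, a sketch of how a 'Guidelines' block might be prepended and how random rules could be drawn to match the curated rules' token lengths; count_tokens, the tolerance, and the greedy matching are assumptions standing in for whatever tokenizer and protocol the authors actually used.

    # Sketch of the insertion step: rules are prepended to the system prompt as
    # a bulleted list under a 'Guidelines' header, and random rules are sampled
    # to roughly match the token-length distribution of the curated set.
    # `count_tokens` is a crude placeholder for a real model tokenizer.
    import random

    def count_tokens(text: str) -> int:
        return len(text.split())

    def insert_rules(system_prompt: str, rules: list[str]) -> str:
        block = "Guidelines:\n" + "\n".join(f"- {rule}" for rule in rules)
        return f"{block}\n\n{system_prompt}"

    def sample_length_matched(random_pool: list[str], curated: list[str], tol: int = 5) -> list[str]:
        """Pick one random rule per curated rule with a similar token count."""
        matched, pool = [], list(random_pool)
        for rule in curated:
            target = count_tokens(rule)
            candidates = [r for r in pool if abs(count_tokens(r) - target) <= tol] or pool
            choice = random.choice(candidates)
            matched.append(choice)
            pool.remove(choice)
        return matched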

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparison

Full rationale

The paper's core claims rest on scraping 679 rule files, executing over 5,000 agent runs on SWE-bench Verified, and reporting success-rate deltas across rule conditions. No equations, fitted parameters, or derivations appear in the reported results; performance differences are measured directly from agent executions rather than computed from any self-referential model. The reference to potential-based reward shaping is used only for post-hoc interpretation of observed patterns and is not required to establish the quantitative findings. Self-citations, if present, are not load-bearing for the headline results, which remain falsifiable against the external benchmark and independent of any prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Claims rest on the domain assumption that SWE-bench Verified is a representative proxy for real coding tasks and that the chosen agent reflects typical behavior; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption SWE-bench Verified is a reliable benchmark for measuring coding agent performance
    All quantitative claims are derived from runs on this benchmark.

pith-pipeline@v0.9.0 · 5523 in / 1221 out tokens · 84678 ms · 2026-05-10T15:33:36.613027+00:00 · methodology

discussion (0)

