Reinforcement Learning from Compiler and Language Server Feedback
Pith reviewed 2026-05-18 04:57 UTC · model grok-4.3
The pith
Compiler and language server feedback supplies stable process rewards for training coding agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) together with Lanser-CLI, a CLI-first orchestration layer that exposes this signal to agents and CI. RLCSF treats each tool interaction as a transition and computes a shaped process reward from deterministic changes in diagnostics, selector confidence, and edit safety. Lanser-CLI converts ephemeral LSP sessions into replayable Analysis Bundles with pinned environment metadata and stable content hashes. Its core mechanisms are robust selectors that go beyond file:line:col, deterministic bundle normalization, preview-first guarded mutations, and a reward functional whose potential-based component is replayable under frozen snapshots.
What carries the argument
The shaped process reward functional in RLCSF that assigns non-negative values to componentwise-improving transitions, supported by replayable Analysis Bundles from Lanser-CLI.
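As a rough illustration of what such a reward functional might look like in code, the sketch below follows the form quoted later on this page, rt = α(Dt−1 − Dt) + β St − γ(1 − αt). It is not the paper's implementation. Reading D as a diagnostic count, S as selector confidence, and αt as an edit-safety score in [0, 1] is an assumption, and the names `Step` and `process_reward` and the default weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Observations after one tool interaction (field meanings are assumed)."""
    diagnostics: int      # D_t: number of outstanding diagnostics
    selector_conf: float  # S_t: selector confidence, assumed in [0, 1]
    safety: float         # alpha_t in the quoted formula, assumed in [0, 1]


def process_reward(prev_diagnostics: int, step: Step,
                   alpha: float = 1.0, beta: float = 0.5,
                   gamma: float = 1.0) -> float:
    """r_t = alpha*(D_{t-1} - D_t) + beta*S_t - gamma*(1 - safety_t).

    The weights alpha, beta, gamma are illustrative defaults, not values
    taken from the paper.
    """
    diagnostic_gain = alpha * (prev_diagnostics - step.diagnostics)
    selector_bonus = beta * step.selector_conf
    safety_penalty = gamma * (1.0 - step.safety)
    return diagnostic_gain + selector_bonus - safety_penalty


if __name__ == "__main__":
    # Two diagnostics fixed, confident selector, fully safe edit.
    print(process_reward(5, Step(diagnostics=3, selector_conf=0.8, safety=1.0)))
```

Under these assumptions, a transition that does not increase the diagnostic count and is fully safe can never be scored below zero, which is the property the page attributes to componentwise-improving transitions.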
Load-bearing premise
Deterministic changes in diagnostics, selector confidence, and edit safety can be shaped into a stable, replayable reward signal without introducing bias from tool-specific quirks or requiring extensive manual tuning of the reward functional.
What would settle it
Training identical coding agents with and without the RLCSF reward signal on the same set of edit tasks and measuring whether the RLCSF version produces measurably fewer compilation errors or unresolved symbols.
Original abstract
Coding agents fail when text-level guesses outrun program facts: they hallucinate APIs, drift to the wrong symbol, and apply edits without evidence that the workspace remains valid. Compilers, type checkers, and language servers already compute the missing supervision signal, in the form of diagnostics, symbol resolution, type information, references, and refactoring preconditions, but expose it through interfaces designed for human-driven IDEs rather than learning loops. We introduce Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) together with Lanser-CLI, a CLI-first orchestration layer that exposes this signal to agents and CI. RLCSF treats each tool interaction as a transition and computes a shaped process reward from deterministic changes in diagnostics, selector confidence, and edit safety. Lanser-CLI, in turn, converts ephemeral LSP sessions into replayable Analysis Bundles with pinned environment metadata and stable content hashes. Its core mechanisms are robust selectors that go beyond file:line:col, deterministic bundle normalization, preview-first guarded mutations, and a reward functional whose potential-based component is replayable under frozen snapshots. We formalize determinism for canonical bundles and prove that componentwise-improving transitions receive non-negative reward in the undiscounted setting. Together, these pieces yield a practical substrate for process supervision of coding agents.
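The idea of deterministic bundle normalization with stable content hashes can be sketched as follows. This is a minimal illustration under stated assumptions, not Lanser-CLI's actual format: the bundle layout, the set of volatile fields, and the function names `normalize_bundle` and `bundle_hash` are all hypothetical. It drops session-specific fields, sorts diagnostics into a canonical order, serializes with sorted keys, and hashes the result.

```python
import hashlib
import json

# Fields assumed to vary between otherwise identical sessions (hypothetical).
VOLATILE_FIELDS = {"timestamp", "pid", "session_id"}


def normalize_bundle(bundle: dict) -> dict:
    """Return a copy with volatile fields removed and diagnostics sorted."""
    kept = {k: v for k, v in bundle.items() if k not in VOLATILE_FIELDS}
    kept["diagnostics"] = sorted(
        kept.get("diagnostics", []),
        key=lambda d: (d["file"], d["line"], d["col"], d["code"]),
    )
    return kept


def bundle_hash(bundle: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the normalized bundle."""
    canonical = json.dumps(
        normalize_bundle(bundle),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    d1 = {"file": "a.py", "line": 3, "col": 1, "code": "E1"}
    d2 = {"file": "b.py", "line": 7, "col": 5, "code": "E2"}
    first = {"timestamp": 1, "env": {"python": "3.12"}, "diagnostics": [d1, d2]}
    second = {"timestamp": 2, "env": {"python": "3.12"}, "diagnostics": [d2, d1]}
    # Same content, different session noise and ordering: hashes agree.
    print(bundle_hash(first) == bundle_hash(second))
```

Keeping the environment metadata inside the hashed payload means that a change in the pinned toolchain changes the hash, while timestamps and diagnostic ordering do not.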
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) along with the Lanser-CLI orchestration layer. It treats tool interactions as transitions and defines a shaped process reward based on deterministic changes in diagnostics, selector confidence, and edit safety. The work formalizes determinism for canonical bundles and proves that componentwise-improving transitions receive non-negative reward in the undiscounted setting, with the goal of providing a replayable substrate for process supervision of coding agents.
Significance. If the formal results hold and the reward functional can be realized without tool-specific bias, the approach could supply a stable, external-source process reward for training coding agents, reducing dependence on outcome-only signals or human preferences. The explicit proof of non-negative reward for improving transitions is a clear theoretical contribution that directly supports replayability claims.
major comments (1)
- Abstract and overall manuscript: the central claim that the described pieces 'yield a practical substrate for process supervision of coding agents' lacks any empirical results, implementation details, training curves, or ablation studies. The soundness of the reward functional and its stability under real LSP sessions therefore remains untested, which is load-bearing for the practicality assertion.
minor comments (1)
- The manuscript would benefit from explicit pseudocode or a diagram for the reward functional and the bundle normalization procedure to make the determinism claims easier to verify.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the theoretical contributions of the formalization and proof. We address the major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: Abstract and overall manuscript: the central claim that the described pieces 'yield a practical substrate for process supervision of coding agents' lacks any empirical results, implementation details, training curves, or ablation studies. The soundness of the reward functional and its stability under real LSP sessions therefore remains untested, which is load-bearing for the practicality assertion.
Authors: We agree that empirical validation would further strengthen the practicality claim. The current manuscript is a foundational paper whose primary contributions are the definition of the shaped process reward, the formalization of determinism for canonical bundles, and the proof that componentwise-improving transitions receive non-negative reward in the undiscounted case. These results directly support replayability and stability independent of any particular training run. Implementation details of Lanser-CLI, including robust selectors, bundle normalization, and the reward functional, are already described in Sections 3 and 4. To address the concern about real LSP sessions, we will add a new subsection with concrete walkthroughs of reward computation on sample edits using actual compiler and language-server diagnostics, together with a small-scale stability check across repeated sessions. We will also revise the abstract and conclusion to make the scope of the practicality claim more precise, emphasizing the theoretical substrate rather than end-to-end agent training results.
Revision: partial
Circularity Check
No significant circularity; derivation relies on external tool signals and independent formalization
The paper's central construction defines a process reward directly from deterministic changes in compiler diagnostics, selector confidence, and edit safety, which are external outputs rather than fitted parameters or self-referential quantities. It then formalizes determinism for canonical bundles and proves non-negative reward for componentwise-improving transitions in the undiscounted case. These steps are presented as first-principles results grounded in the properties of the tools and the transition model, without reducing to prior self-citations or redefining inputs as outputs. The overall substrate for process supervision is therefore self-contained against external benchmarks.
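The non-negativity property mentioned above can be illustrated with a short worked inequality. This is a sketch of one reading, not the paper's proof: it uses the reward form quoted in the Lean-theorem section of this page and assumes non-negative weights, a non-negative selector confidence $S_t$, and a fully safe edit ($\alpha_t = 1$, reading $\alpha_t$ as an edit-safety score, which is itself an assumption).

```latex
% Assumptions (not taken from the paper): \alpha, \beta, \gamma \ge 0,
% D_t \le D_{t-1}, S_t \ge 0, and \alpha_t = 1 for the transition.
\begin{align*}
r_t &= \alpha\,(D_{t-1} - D_t) + \beta\,S_t - \gamma\,(1 - \alpha_t) \\
    &\ge \alpha \cdot 0 + \beta \cdot 0 - \gamma \cdot 0 = 0 .
\end{align*}
% In the undiscounted setting the diagnostic term telescopes over an episode:
\begin{equation*}
\sum_{t=1}^{T} \alpha\,(D_{t-1} - D_t) = \alpha\,(D_0 - D_T),
\end{equation*}
% which is the behaviour of a potential-based term with \Phi(s_t) = -\alpha D_t.
```

The telescoping sum shows why this term depends only on the first and last snapshots of an episode, which fits the claim that the potential-based component can be replayed from frozen snapshots.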
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Language server and compiler diagnostics can be treated as deterministic transitions suitable for reward computation.
invented entities (2)
- RLCSF reward functional: no independent evidence
- Analysis Bundles: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $r_t = \alpha (D_{t-1} - D_t) + \beta S_t - \gamma (1 - \alpha_t)$ … Proposition 5.2 (Monotonicity of process reward under invariants)
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: potential-based reward shaping … replayable under frozen snapshots
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.