Reinforcement Learning from Compiler and Language Server Feedback

Lanser Contributors; Yifan Zhang

arXiv: 2510.22907 · v2 · submitted 2025-10-27 · 💻 cs.CL · cs.AI · cs.PL · cs.SE


Pith reviewed 2026-05-18 04:57 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.PL · cs.SE

keywords reinforcement learning · coding agents · compiler feedback · language servers · process supervision · reward shaping · LSP


The pith

Compiler and language server feedback supplies stable process rewards for training coding agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that outputs already computed by compilers, type checkers, and language servers can be converted into shaped rewards for reinforcement learning instead of letting agents rely on text-level guesses. This would matter because it grounds each edit in verifiable facts such as diagnostics, symbol resolution, and safety preconditions, reducing hallucinations and invalid workspace states. RLCSF formalizes tool interactions as transitions whose rewards are based on deterministic improvements in these signals. Lanser-CLI supports the loop by turning temporary sessions into replayable bundles that carry pinned metadata and content hashes. A sympathetic reader would care if the resulting substrate allows process supervision throughout code generation rather than only at final outcome.
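To make the idea of a replayable bundle with a stable content hash concrete, here is a minimal sketch in Python. It is not Lanser-CLI's actual format: the field names (`diagnostics`, `env`, `timestamp`, `pid`, `duration_ms`) and the choice of SHA-256 over canonical JSON are assumptions made purely for illustration. The point it shows is that dropping volatile fields and fixing the ordering lets two runs over the same frozen snapshot hash identically.

```python
import hashlib
import json

# Hypothetical volatile fields that would differ between otherwise identical runs.
VOLATILE_KEYS = {"timestamp", "pid", "duration_ms"}


def normalize(bundle: dict) -> dict:
    """Return a canonical copy: volatile keys dropped, diagnostics sorted."""
    clean = {k: v for k, v in bundle.items() if k not in VOLATILE_KEYS}
    clean["diagnostics"] = sorted(
        bundle.get("diagnostics", []),
        key=lambda d: (d["file"], d["line"], d["col"], d["code"]),
    )
    return clean


def content_hash(bundle: dict) -> str:
    """Hash the canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(normalize(bundle), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical sessions over the same snapshot: same facts, different
# arrival order and different wall-clock metadata.
run_a = {
    "env": {"server": "example-ls", "version": "1.0.0"},
    "timestamp": "2025-10-27T10:00:00Z",
    "diagnostics": [
        {"file": "b.py", "line": 4, "col": 1, "code": "E2"},
        {"file": "a.py", "line": 9, "col": 3, "code": "E1"},
    ],
}
run_b = {
    "env": {"server": "example-ls", "version": "1.0.0"},
    "timestamp": "2025-10-27T11:30:00Z",
    "diagnostics": [
        {"file": "a.py", "line": 9, "col": 3, "code": "E1"},
        {"file": "b.py", "line": 4, "col": 1, "code": "E2"},
    ],
}

same = content_hash(run_a) == content_hash(run_b)
```

Under these assumptions `same` is true, while any change to the pinned `env` block or to a diagnostic would change the hash.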

Core claim

We introduce Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) together with Lanser-CLI, a CLI-first orchestration layer that exposes this signal to agents and CI. RLCSF treats each tool interaction as a transition and computes a shaped process reward from deterministic changes in diagnostics, selector confidence, and edit safety. Lanser-CLI converts ephemeral LSP sessions into replayable Analysis Bundles with pinned environment metadata and stable content hashes. Its core mechanisms are robust selectors that go beyond file:line:col, deterministic bundle normalization, preview-first guarded mutations, and a reward functional whose potential-based component is replayable under frozen snapshots.

What carries the argument

The shaped process reward functional in RLCSF that assigns non-negative values to componentwise-improving transitions, supported by replayable Analysis Bundles from Lanser-CLI.
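A small Python sketch can illustrate what "non-negative values for componentwise-improving transitions" means for a potential-based reward. Everything concrete here is an assumption for illustration, not the paper's definition: the `Snapshot` fields, the weights, and the linear form of `potential` are hypothetical stand-ins for the three signals named in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical summary of a workspace state after a tool interaction."""
    error_count: int            # open diagnostics, lower is better
    selector_confidence: float  # in [0, 1], higher is better
    edit_safety: float          # in [0, 1], higher is better


# Assumed non-negative weights; the paper's actual coefficients are not given here.
W_DIAG, W_SEL, W_SAFE = 1.0, 0.5, 0.5


def potential(s: Snapshot) -> float:
    """Assumed linear potential: each term is oriented so larger is better."""
    return (-W_DIAG * s.error_count
            + W_SEL * s.selector_confidence
            + W_SAFE * s.edit_safety)


def shaping_reward(prev: Snapshot, cur: Snapshot, gamma: float = 1.0) -> float:
    """Potential-based shaping term: gamma * Phi(cur) - Phi(prev)."""
    return gamma * potential(cur) - potential(prev)


def improves_componentwise(prev: Snapshot, cur: Snapshot) -> bool:
    """True when no component gets worse between the two snapshots."""
    return (cur.error_count <= prev.error_count
            and cur.selector_confidence >= prev.selector_confidence
            and cur.edit_safety >= prev.edit_safety)


before = Snapshot(error_count=3, selector_confidence=0.6, edit_safety=0.5)
after = Snapshot(error_count=1, selector_confidence=0.8, edit_safety=0.5)

reward = shaping_reward(before, after)  # undiscounted: gamma = 1.0
```

With non-negative weights and `gamma = 1.0`, the reward is a weighted sum of per-component improvements, so it cannot be negative whenever `improves_componentwise` holds. Because the potential depends only on the two snapshots, replaying the same frozen snapshots reproduces the same value.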

Load-bearing premise

Deterministic changes in diagnostics, selector confidence, and edit safety can be shaped into a stable, replayable reward signal without introducing bias from tool-specific quirks or requiring extensive manual tuning of the reward functional.

What would settle it

Training identical coding agents with and without the RLCSF reward signal on the same set of edit tasks and measuring whether the RLCSF version produces measurably fewer compilation errors or unresolved symbols.

Original abstract

Coding agents fail when text-level guesses outrun program facts: they hallucinate APIs, drift to the wrong symbol, and apply edits without evidence that the workspace remains valid. Compilers, type checkers, and language servers already compute the missing supervision signal, in the form of diagnostics, symbol resolution, type information, references, and refactoring preconditions, but expose it through interfaces designed for human-driven IDEs rather than learning loops. We introduce Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) together with Lanser-CLI, a CLI-first orchestration layer that exposes this signal to agents and CI. RLCSF treats each tool interaction as a transition and computes a shaped process reward from deterministic changes in diagnostics, selector confidence, and edit safety. Lanser-CLI, in turn, converts ephemeral LSP sessions into replayable Analysis Bundles with pinned environment metadata and stable content hashes. Its core mechanisms are robust selectors that go beyond file:line:col, deterministic bundle normalization, preview-first guarded mutations, and a reward functional whose potential-based component is replayable under frozen snapshots. We formalize determinism for canonical bundles and prove that componentwise-improving transitions receive non-negative reward in the undiscounted setting. Together, these pieces yield a practical substrate for process supervision of coding agents.
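The abstract's "preview-first guarded mutations" can be pictured with a short Python sketch. This is one possible reading, not Lanser-CLI's interface: the function names `preview` and `apply_guarded`, the plan dictionary, and the use of a SHA-256 hash of the file text as the guard are all hypothetical. The sketch first produces a diff without touching anything, then applies the edit only if the content still matches what the preview was computed against.

```python
import difflib
import hashlib


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def preview(original: str, proposed: str, path: str = "file") -> dict:
    """Build an edit plan: a unified diff plus the hash of the text it assumes."""
    diff = "".join(difflib.unified_diff(
        original.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    ))
    return {"base_hash": sha256_text(original), "proposed": proposed, "diff": diff}


def apply_guarded(current: str, plan: dict) -> str:
    """Apply the plan only if the current text is the one that was previewed."""
    if sha256_text(current) != plan["base_hash"]:
        raise ValueError("workspace changed since preview; refusing to apply")
    return plan["proposed"]


plan = preview("x = 1\n", "x = 2\n", path="demo.py")
updated = apply_guarded("x = 1\n", plan)  # guard passes, edit is applied

try:
    apply_guarded("x = 1  # edited elsewhere\n", plan)
    drift_rejected = False
except ValueError:
    drift_rejected = True  # stale plan is refused instead of applied blindly
```

The design choice illustrated is that an agent never mutates the workspace on the strength of a stale view: the plan carries its own precondition, and a mismatch is reported rather than silently overwritten.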

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RLCSF formalizes compiler and LSP signals into a shaped process reward with a proof that improving transitions get non-negative reward, but the work stops at the sketch and offers no empirical checks.


The paper sets up RLCSF to convert compiler diagnostics, symbol resolution, and edit safety signals from language servers into a replayable process reward for coding agents. It adds Lanser-CLI to produce stable Analysis Bundles and proves that componentwise improvements yield non-negative reward in the undiscounted case, with determinism guarantees on the bundles. That formal piece is the clearest contribution. It moves past simple test-based or execution rewards by pulling in the richer, deterministic outputs that IDE tools already compute. The mechanisms for robust selectors, deterministic normalization, preview-first mutations, and the potential-based reward functional are laid out with enough structure to make the replayability claim hold on paper. The stress-test note is right that nothing in the formalization contradicts itself.

The soft spot is the complete lack of experiments or implementation results. The abstract and proof sketch stand alone, so we have no data on whether the reward stays stable when real LSP quirks appear, how much manual shaping the functional needs, or whether agents actually improve under this signal. The assumption that deterministic tool changes translate into bias-free, low-tuning rewards is plausible but untested.

This is for people building RL coding agents who care about process supervision rather than final-test rewards. A reader who wants formal grounding for agent reliability will find the proof and bundle design useful; anyone looking for validated gains will have to wait. Send it to peer review. The formal substrate is concrete enough to deserve referee attention even if the next version must add runs and ablation checks.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) along with the Lanser-CLI orchestration layer. It treats tool interactions as transitions and defines a shaped process reward based on deterministic changes in diagnostics, selector confidence, and edit safety. The work formalizes determinism for canonical bundles and proves that componentwise-improving transitions receive non-negative reward in the undiscounted setting, with the goal of providing a replayable substrate for process supervision of coding agents.
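The role of the "undiscounted" qualifier in that result can be seen from a short derivation. The following is a sketch under an assumed form of the potential, namely a non-negatively weighted sum of components each oriented so that larger is better; the symbols $\Phi$, $\phi_i$, $w_i$, and $F$ are introduced here for illustration and are not quoted from the manuscript.

```latex
% Assumed potential (illustrative, not the paper's definition):
%   \Phi(s) = \sum_i w_i \, \phi_i(s), \qquad w_i \ge 0 .
\begin{align*}
F(s, s') &= \gamma \, \Phi(s') - \Phi(s) \\
         &= \sum_i w_i \bigl( \phi_i(s') - \phi_i(s) \bigr) \;-\; (1 - \gamma) \, \Phi(s') .
\end{align*}
% Undiscounted case, \gamma = 1: the second term vanishes, so
%   \phi_i(s') \ge \phi_i(s) \ \text{for all } i \implies F(s, s') \ge 0 .
```

For $\gamma < 1$ the extra term $-(1-\gamma)\,\Phi(s')$ has no fixed sign, so under this assumed form the non-negativity guarantee need not carry over, which is consistent with the result being stated only for the undiscounted setting.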

Significance. If the formal results hold and the reward functional can be realized without tool-specific bias, the approach could supply a stable, external-source process reward for training coding agents, reducing dependence on outcome-only signals or human preferences. The explicit proof of non-negative reward for improving transitions is a clear theoretical contribution that directly supports replayability claims.

major comments (1)

Abstract and overall manuscript: the central claim that the described pieces 'yield a practical substrate for process supervision of coding agents' lacks any empirical results, implementation details, training curves, or ablation studies. The soundness of the reward functional and its stability under real LSP sessions therefore remains untested, which is load-bearing for the practicality assertion.

minor comments (1)

The manuscript would benefit from explicit pseudocode or a diagram for the reward functional and the bundle normalization procedure to make the determinism claims easier to verify.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the theoretical contributions of the formalization and proof. We address the major comment below and outline the revisions we will make.


Referee: Abstract and overall manuscript: the central claim that the described pieces 'yield a practical substrate for process supervision of coding agents' lacks any empirical results, implementation details, training curves, or ablation studies. The soundness of the reward functional and its stability under real LSP sessions therefore remains untested, which is load-bearing for the practicality assertion.

Authors: We agree that empirical validation would further strengthen the practicality claim. The current manuscript is a foundational paper whose primary contributions are the definition of the shaped process reward, the formalization of determinism for canonical bundles, and the proof that componentwise-improving transitions receive non-negative reward in the undiscounted case. These results directly support replayability and stability independent of any particular training run. Implementation details of Lanser-CLI, including robust selectors, bundle normalization, and the reward functional, are already described in Sections 3 and 4. To address the concern about real LSP sessions, we will add a new subsection with concrete walkthroughs of reward computation on sample edits using actual compiler and language-server diagnostics, together with a small-scale stability check across repeated sessions. We will also revise the abstract and conclusion to make the scope of the practicality claim more precise, emphasizing the theoretical substrate rather than end-to-end agent training results.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external tool signals and independent formalization


The paper's central construction defines a process reward directly from deterministic changes in compiler diagnostics, selector confidence, and edit safety, which are external outputs rather than fitted parameters or self-referential quantities. It then formalizes determinism for canonical bundles and proves non-negative reward for componentwise-improving transitions in the undiscounted case. These steps are presented as first-principles results grounded in the properties of the tools and the transition model, without reducing to prior self-citations or redefining inputs as outputs. The overall substrate for process supervision is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The approach rests on the assumption that LSP and compiler outputs provide sufficient, stable signals for reward shaping. No free parameters are explicitly introduced in the abstract. The main invented elements are the RLCSF reward functional and Analysis Bundles.

axioms (1)

domain assumption: Language server and compiler diagnostics can be treated as deterministic transitions suitable for reward computation.
Invoked when defining the shaped process reward from changes in diagnostics and edit safety.

invented entities (2)

RLCSF reward functional · no independent evidence
purpose: To compute shaped process rewards from tool feedback
New construct introduced to turn diagnostics into learning signals.

Analysis Bundles · no independent evidence
purpose: Replayable packages of LSP sessions with pinned metadata
Core mechanism for making ephemeral tool interactions reproducible.

pith-pipeline@v0.9.0 · 5754 in / 1259 out tokens · 22071 ms · 2026-05-18T04:57:14.338840+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: $r_t = \alpha (D_{t-1} - D_t) + \beta S_t - \gamma (1 - \alpha_t)$ … Proposition 5.2 (Monotonicity of process reward under invariants)

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: potential-based reward shaping … replayable under frozen snapshots

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.