arxiv: 2510.22907 · v2 · submitted 2025-10-27 · 💻 cs.CL · cs.AI · cs.PL · cs.SE

Reinforcement Learning from Compiler and Language Server Feedback

Pith reviewed 2026-05-18 04:57 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.PL · cs.SE
keywords reinforcement learning, coding agents, compiler feedback, language servers, process supervision, reward shaping, LSP

The pith

Compiler and language server feedback supplies stable process rewards for training coding agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that outputs already computed by compilers, type checkers, and language servers can be converted into shaped rewards for reinforcement learning instead of letting agents rely on text-level guesses. This would matter because it grounds each edit in verifiable facts such as diagnostics, symbol resolution, and safety preconditions, reducing hallucinations and invalid workspace states. RLCSF formalizes tool interactions as transitions whose rewards are based on deterministic improvements in these signals. Lanser-CLI supports the loop by turning temporary sessions into replayable bundles that carry pinned metadata and content hashes. A sympathetic reader would care if the resulting substrate allows process supervision throughout code generation rather than only at final outcome.

Core claim

We introduce Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) together with Lanser-CLI, a CLI-first orchestration layer that exposes this signal to agents and CI. RLCSF treats each tool interaction as a transition and computes a shaped process reward from deterministic changes in diagnostics, selector confidence, and edit safety. Lanser-CLI converts ephemeral LSP sessions into replayable Analysis Bundles with pinned environment metadata and stable content hashes. Its core mechanisms are robust selectors that go beyond file:line:col, deterministic bundle normalization, preview-first guarded mutations, and a reward functional whose potential-based component is replayable under frozen snapshots.

What carries the argument

The shaped process reward functional in RLCSF that assigns non-negative values to componentwise-improving transitions, supported by replayable Analysis Bundles from Lanser-CLI.
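
To make the idea concrete, here is a minimal sketch of a potential-based shaping term over the three signals named above. The paper's actual functional is not reproduced on this page, so the `Snapshot` fields, the weights, and the linear form of `potential` are all illustrative assumptions rather than the authors' definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical summary of a workspace state after one tool interaction."""
    error_count: int            # number of error-severity diagnostics
    selector_confidence: float  # assumed to lie in [0, 1]
    edit_safe: bool             # whether the previewed edit passed its preconditions


# Hypothetical non-negative weights; not taken from the paper.
W_DIAG, W_SEL, W_SAFE = 1.0, 0.5, 0.5


def potential(s: Snapshot) -> float:
    """Higher is better: fewer errors, higher confidence, safe edit."""
    return (-W_DIAG * s.error_count
            + W_SEL * s.selector_confidence
            + W_SAFE * (1.0 if s.edit_safe else 0.0))


def shaped_reward(before: Snapshot, after: Snapshot, gamma: float = 1.0) -> float:
    """Potential-based shaping term: gamma * potential(after) - potential(before)."""
    return gamma * potential(after) - potential(before)


before = Snapshot(error_count=3, selector_confidence=0.6, edit_safe=False)
after = Snapshot(error_count=1, selector_confidence=0.9, edit_safe=True)
reward = shaped_reward(before, after)  # positive, since every component improved
```

Under these assumptions, a transition that leaves the snapshot unchanged scores exactly zero when `gamma` is 1.0, and any transition that improves every component scores at least zero.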

Load-bearing premise

Deterministic changes in diagnostics, selector confidence, and edit safety can be shaped into a stable, replayable reward signal without introducing bias from tool-specific quirks or requiring extensive manual tuning of the reward functional.

What would settle it

Training identical coding agents with and without the RLCSF reward signal on the same set of edit tasks and measuring whether the RLCSF version produces measurably fewer compilation errors or unresolved symbols.

Original abstract

Coding agents fail when text-level guesses outrun program facts: they hallucinate APIs, drift to the wrong symbol, and apply edits without evidence that the workspace remains valid. Compilers, type checkers, and language servers already compute the missing supervision signal, in the form of diagnostics, symbol resolution, type information, references, and refactoring preconditions, but expose it through interfaces designed for human-driven IDEs rather than learning loops. We introduce Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) together with Lanser-CLI, a CLI-first orchestration layer that exposes this signal to agents and CI. RLCSF treats each tool interaction as a transition and computes a shaped process reward from deterministic changes in diagnostics, selector confidence, and edit safety. Lanser-CLI, in turn, converts ephemeral LSP sessions into replayable Analysis Bundles with pinned environment metadata and stable content hashes. Its core mechanisms are robust selectors that go beyond file:line:col, deterministic bundle normalization, preview-first guarded mutations, and a reward functional whose potential-based component is replayable under frozen snapshots. We formalize determinism for canonical bundles and prove that componentwise-improving transitions receive non-negative reward in the undiscounted setting. Together, these pieces yield a practical substrate for process supervision of coding agents.
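
The non-negativity result and its restriction to the undiscounted setting can be illustrated with a short worked sketch. It assumes, purely for illustration, that the potential is a non-negatively weighted sum of component scores $c_i$ oriented so that larger is better; the symbols $\Phi$, $F$, $c_i$, and $w_i$ are not taken from the paper, whose exact functional may differ.

```latex
\begin{align*}
F(s, s') &= \gamma\,\Phi(s') - \Phi(s),
  \qquad \Phi(s) = \sum_i w_i\, c_i(s), \quad w_i \ge 0, \\
\gamma = 1 \ \text{and}\ c_i(s') \ge c_i(s)\ \ \forall i
  &\;\Longrightarrow\;
  F(s, s') = \sum_i w_i \bigl(c_i(s') - c_i(s)\bigr) \ge 0, \\
\gamma < 1,\ s' = s,\ \Phi(s) > 0
  &\;\Longrightarrow\;
  F(s, s) = (\gamma - 1)\,\Phi(s) < 0 .
\end{align*}
```

The last line shows why such a guarantee would plausibly be stated only for the undiscounted case: with discounting, even a transition that changes nothing can receive a negative shaping term.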

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) along with the Lanser-CLI orchestration layer. It treats tool interactions as transitions and defines a shaped process reward based on deterministic changes in diagnostics, selector confidence, and edit safety. The work formalizes determinism for canonical bundles and proves that componentwise-improving transitions receive non-negative reward in the undiscounted setting, with the goal of providing a replayable substrate for process supervision of coding agents.

Significance. If the formal results hold and the reward functional can be realized without tool-specific bias, the approach could supply a stable, external-source process reward for training coding agents, reducing dependence on outcome-only signals or human preferences. The explicit proof of non-negative reward for improving transitions is a clear theoretical contribution that directly supports replayability claims.

major comments (1)
  1. Abstract and overall manuscript: the central claim that the described pieces 'yield a practical substrate for process supervision of coding agents' lacks any empirical results, implementation details, training curves, or ablation studies. The soundness of the reward functional and its stability under real LSP sessions therefore remains untested, which is load-bearing for the practicality assertion.
minor comments (1)
  1. The manuscript would benefit from explicit pseudocode or a diagram for the reward functional and the bundle normalization procedure to make the determinism claims easier to verify.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the theoretical contributions of the formalization and proof. We address the major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: Abstract and overall manuscript: the central claim that the described pieces 'yield a practical substrate for process supervision of coding agents' lacks any empirical results, implementation details, training curves, or ablation studies. The soundness of the reward functional and its stability under real LSP sessions therefore remains untested, which is load-bearing for the practicality assertion.

    Authors: We agree that empirical validation would further strengthen the practicality claim. The current manuscript is a foundational paper whose primary contributions are the definition of the shaped process reward, the formalization of determinism for canonical bundles, and the proof that componentwise-improving transitions receive non-negative reward in the undiscounted case. These results directly support replayability and stability independent of any particular training run. Implementation details of Lanser-CLI, including robust selectors, bundle normalization, and the reward functional, are already described in Sections 3 and 4. To address the concern about real LSP sessions, we will add a new subsection with concrete walkthroughs of reward computation on sample edits using actual compiler and language-server diagnostics, together with a small-scale stability check across repeated sessions. We will also revise the abstract and conclusion to make the scope of the practicality claim more precise, emphasizing the theoretical substrate rather than end-to-end agent training results. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external tool signals and independent formalization

Full rationale

The paper's central construction defines a process reward directly from deterministic changes in compiler diagnostics, selector confidence, and edit safety, which are external outputs rather than fitted parameters or self-referential quantities. It then formalizes determinism for canonical bundles and proves non-negative reward for componentwise-improving transitions in the undiscounted case. These steps are presented as first-principles results grounded in the properties of the tools and the transition model, without reducing to prior self-citations or redefining inputs as outputs. The overall substrate for process supervision is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on the assumption that LSP and compiler outputs provide sufficient, stable signals for reward shaping. No free parameters are explicitly introduced in the abstract. The main invented elements are the RLCSF reward functional and Analysis Bundles.

axioms (1)
  • domain assumption: Language server and compiler diagnostics can be treated as deterministic transitions suitable for reward computation.
    Invoked when defining the shaped process reward from changes in diagnostics and edit safety.
invented entities (2)
  • RLCSF reward functional (no independent evidence)
    purpose: To compute shaped process rewards from tool feedback
    New construct introduced to turn diagnostics into learning signals.
  • Analysis Bundles (no independent evidence)
    purpose: Replayable packages of LSP sessions with pinned metadata
    Core mechanism for making ephemeral tool interactions reproducible.
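
As a rough illustration of what deterministic normalization and a stable content hash could look like for such a bundle, the sketch below strips volatile fields, sorts lists, and hashes a canonical JSON serialization. The field names, the `VOLATILE_KEYS` set, and the choice of SHA-256 are assumptions made for this example; the actual Lanser-CLI bundle schema is not shown on this page. Sorting every list also assumes that list order carries no meaning, which would not hold for all LSP payloads.

```python
import hashlib
import json

# Hypothetical names for fields that vary between otherwise identical runs.
VOLATILE_KEYS = {"timestamp", "pid", "duration_ms"}


def normalize(value):
    """Drop volatile keys and sort lists so equal analyses serialize identically."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        items = [normalize(v) for v in value]
        return sorted(items, key=lambda v: json.dumps(v, sort_keys=True))
    return value


def content_hash(bundle: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the normalized bundle."""
    canonical = json.dumps(normalize(bundle), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


diag_1 = {"file": "a.py", "line": 1, "message": "unused import"}
diag_3 = {"file": "a.py", "line": 3, "message": "undefined name"}
env = {"server": "example-ls", "version": "1.2.3"}  # pinned environment metadata

run_a = {"environment": env, "timestamp": "2025-10-27T10:00:00Z",
         "diagnostics": [diag_3, diag_1]}
run_b = {"environment": env, "timestamp": "2025-10-27T11:30:00Z",
         "diagnostics": [diag_1, diag_3]}

# Same analysis, different wall-clock time and diagnostic order.
same = content_hash(run_a) == content_hash(run_b)
```

In this sketch, two sessions that differ only in timing and ordering hash identically, while any change to a diagnostic or to the pinned environment changes the hash.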

pith-pipeline@v0.9.0 · 5754 in / 1259 out tokens · 22071 ms · 2026-05-18T04:57:14.338840+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.