arxiv: 2603.23129 · v2 · pith:XS6AJHG4new · submitted 2026-03-24 · 💻 cs.LG

Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair

Pith reviewed 2026-05-15 07:40 UTC · model grok-4.3

classification 💻 cs.LG
keywords Gödel agent · policy repair · experience abstraction · small language models · self-improvement · reasoning benchmarks · code patches · meta-reasoning

The pith

A 7B model improves its policy on unseen reasoning tasks by abstracting failures into compact reusable code patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Polaris, a framework that lets small language models act as Gödel agents by inspecting their own policy and traces and making lasting updates. It converts specific failures into abstracted strategies through analysis and meta-reasoning, then applies minimal code patches that remain in the policy for future use. This differs from temporary response-level fixes by creating persistent, auditable changes that transfer across instances. The approach is tested on arithmetic, compositional, graduate-level, and creative-writing tasks, where the paper reports that the updated 7B model gains over its starting policy and other baselines. The loop relies on conservative checks to keep improvements stable.

Core claim

Polaris realizes recursive self-improvement for compact models by cycling through error explanation, strategy proposal, experience abstraction into reusable forms, and minimal code patch application, so that the revised policy produces better results on new benchmark instances without retraining.
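The cycle named in the core claim can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: `RepairState`, `repair_step`, and the `explain`, `abstract`, and `patch` callables are hypothetical names standing in for what the paper describes as model-driven steps, and the acceptance rule (keep a patch only if validation accuracy strictly improves) is an assumption about what a conservative check might look like. The default of three failed instances follows the N=3 setting mentioned in the figure captions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Task = Tuple[str, str]          # (question, expected answer)
Policy = Callable[[str], str]   # maps a question to an answer


@dataclass
class RepairState:
    policy: Policy
    strategies: List[str] = field(default_factory=list)  # abstracted experiences kept so far


def accuracy(policy: Policy, tasks: List[Task]) -> float:
    """Fraction of tasks answered exactly right (0.0 for an empty list)."""
    if not tasks:
        return 0.0
    return sum(policy(q) == a for q, a in tasks) / len(tasks)


def repair_step(
    state: RepairState,
    validation: List[Task],
    explain: Callable[[Policy, Task], str],   # hypothetical: error explanation
    abstract: Callable[[List[str]], str],     # hypothetical: experience abstraction
    patch: Callable[[Policy, str], Policy],   # hypothetical: minimal code patch
    n_failures: int = 3,
) -> RepairState:
    """One explain -> abstract -> patch pass; the patch is kept only if it helps."""
    failures = [t for t in validation if state.policy(t[0]) != t[1]][:n_failures]
    if not failures:
        return state
    reflections = [explain(state.policy, t) for t in failures]
    strategy = abstract(reflections)
    candidate = patch(state.policy, strategy)
    # Assumed conservative check: require a strict gain on the validation set.
    if accuracy(candidate, validation) > accuracy(state.policy, validation):
        return RepairState(candidate, state.strategies + [strategy])
    return state
```

Calling `repair_step` repeatedly with the returned state gives the cumulative behaviour described above: accepted strategies accumulate in `strategies`, and a rejected patch leaves the policy untouched.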

What carries the argument

Experience abstraction, which distills specific failures into compact transferable strategies, inside a Gödel-style loop of policy inspection and minimal code patch repair.
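To make "distills specific failures into compact transferable strategies" concrete, here is a deliberately simple stand-in. The simulated rebuttal later on this page describes the abstraction step as removing instance-specific identifiers and values and replacing them with generalized patterns; the paper attributes that step to the model itself. The regex-based function below is only an assumption-laden toy that shows the input and output shape, and `abstract_reflection` is a hypothetical name.

```python
import re

# Matches integers and simple decimals such as "100" or "3.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def abstract_reflection(text: str) -> str:
    """Replace each distinct number with a stable placeholder.

    The same number always maps to the same placeholder, so the
    relationship between quantities survives while the values do not.
    """
    seen = {}

    def placeholder(match: re.Match) -> str:
        token = match.group(0)
        if token not in seen:
            seen[token] = f"<NUM{len(seen) + 1}>"
        return seen[token]

    return NUMBER.sub(placeholder, text)


if __name__ == "__main__":
    print(abstract_reflection("Computed 100 + 5 * 8 but should be 100 - 5 * 8"))
```

A real abstraction step would also have to generalize entity names and phrasing, which is where the referee's question about instance-level memorization applies.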

If this is right

  • Policy patches persist and apply automatically to new instances inside each benchmark.
  • A 7B model equipped with Polaris records consistent accuracy lifts over its base version on the four evaluated tasks.
  • The agent performs meta-reasoning to explain its own errors and suggest revisions to its policy code.
  • Conservative checks during repair prevent patches from harming performance on unrelated instances.
  • Cumulative refinement occurs as multiple abstracted experiences are folded into the same policy over successive loops.
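The conservative-check bullet can be read as a per-instance regression gate. The sketch below is one possible form of such a gate, assuming exact-match scoring; `regressions`, `accept_patch`, and the `max_regressions` threshold are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple

Task = Tuple[str, str]          # (question, expected answer)
Policy = Callable[[str], str]


def regressions(old: Policy, new: Policy, tasks: List[Task]) -> List[Task]:
    """Tasks the old policy solved that the patched policy now gets wrong."""
    return [(q, a) for q, a in tasks if old(q) == a and new(q) != a]


def accept_patch(old: Policy, new: Policy, tasks: List[Task], max_regressions: int = 0) -> bool:
    """Accept the patch only if it breaks at most `max_regressions` solved tasks."""
    return len(regressions(old, new, tasks)) <= max_regressions
```

Unlike an aggregate accuracy comparison, this gate rejects a patch that fixes two instances while breaking one, which is the stricter reading of "prevent patches from harming performance on unrelated instances".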

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same abstraction-plus-patch cycle could be tried on code generation or planning benchmarks to test whether the transfer property holds outside the reported tasks.
  • Because the patches are small and human-readable, the method might support hybrid loops in which a person reviews or edits the proposed changes before they are committed.
  • If the abstraction step scales with model size, the framework might produce measurable gains even on models smaller than 7B.
  • Combining experience abstraction with existing response-level correction techniques could create a two-stage system that first fixes outputs then updates the underlying policy.

Load-bearing premise

Abstracting experiences from past failures produces strategies compact enough to transfer to new problems while the code patches raise performance without creating new errors on other tasks.

What would settle it

Apply the final patched policy to the complete test sets of MGSM, DROP, GPQA, and LitBench and measure whether accuracy stays the same or drops relative to the untouched base model.
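The settling test reduces to a paired comparison on identical instances. The following sketch assumes per-instance correctness flags are already available for both policies; the function names and the toy data in the usage note are illustrative only and carry no results from the paper.

```python
from typing import Dict, List, Tuple


def accuracy_delta(base_correct: List[bool], patched_correct: List[bool]) -> float:
    """Patched accuracy minus base accuracy over the same ordered instances."""
    if not base_correct or len(base_correct) != len(patched_correct):
        raise ValueError("need two non-empty, equally long result lists")
    return (sum(patched_correct) - sum(base_correct)) / len(base_correct)


def claim_survives(results: Dict[str, Tuple[List[bool], List[bool]]]) -> Dict[str, bool]:
    """True for a benchmark only if the patched policy is strictly more accurate."""
    return {name: accuracy_delta(base, patched) > 0 for name, (base, patched) in results.items()}
```

With `results` keyed by benchmark name (for example "MGSM" or "DROP"), any `False` entry corresponds to the outcome named above: accuracy that stays the same or drops relative to the untouched base model. A fuller analysis would add a significance test across runs, which this sketch omits.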

Figures

Figures reproduced from arXiv: 2603.23129 by Aditya Kakade, Shirish Karande, Vivek Srivastava.

Figure 1. Architectural overview of POLARIS. (a) Recursive self-improvement cycle: The agent selects actions based on its policy and goals, storing outputs and reasoning traces in Memory. Evaluation collects N failed tasks from the validation set, triggering the Policy Repair module. (b) Policy repair cycle: Through experience abstraction, the agent performs Failure Analysis on the N tasks, distills reusable strate…
Figure 2. Policy update example on the MGSM dataset. We highlight the updates in the current…
Figure 3. Successful evolution runs of POLARIS with performance improvement compared to the base policy and COT-SC. Policy Repair Iteration 0 shows the performance with the base policy. For policy repair and experience abstraction, we consider a set of three failed instances from the validation set of each dataset (N=3). Unsuccessful runs and the utility of POLARIS: Unsuccessful runs were caused by infrastructure is…
Figure 4. Successful evolution runs of POLARIS using Qwen3-8B model, with performance improvement compared to the base policy and COT-SC. Policy Repair Iteration 0 shows the performance with the base policy. For policy repair and experience abstraction, we consider a set of three failed instances from the validation set of each dataset (N=3). Reminder prompting to the agent: Call action evaluate on task only aft…
Figure 5. Prompt for analyzing failures on task samples through self-reflection.
Figure 6. Prompt for policy repair planning and abstraction. Agent synthesizes the generalized…
Figure 7. Prompt for generating code patches from policy repair strategies.
Figure 8. Prompt for integrating code patches into current policy.
Figure 9. Goal prompt of the agent with the capabilities, core methods, and the guiding principles.
Figure 10. Helper agent prompt that helps correct the output format to valid JSON during the eval…
Figure 11. Successful evolution runs of POLARIS with performance improvement compared to the base policy and COT-SC. Policy Repair Iteration 0 shows the performance with the base policy. For policy repair and experience abstraction, we consider a set of five failed instances from the validation set of each dataset (N=5).
Figure 12. No Improvement runs of POLARIS with performance compared to the base policy and COT-SC. Policy Repair Iteration 0 shows the performance with the base policy. For policy repair and experience abstraction, we consider a set of three failed instances from the validation set of each dataset (N=3). Panel (a) MGSM: Accuracy (%) against Policy Repair Iteration, series Run 1 and COT-SC…
Figure 13. No Improvement runs of POLARIS with performance compared to the base policy and COT-SC. Policy Repair Iteration 0 shows the performance with the base policy. For policy repair and experience abstraction, we consider a set of five failed instances from the validation set of each dataset (N=5).
Figure 14. Performance variance across datasets for successful and no-improvement (NI) runs of…
Figure 15. Performance variance across datasets for successful and no-improvement (NI) runs of…
Figure 16. An example of policy repair via experience abstraction with P…
Figure 17. An example of policy repair via experience abstraction with P…
Figure 18. An example of policy repair via experience abstraction with P…
Figure 19. An example of policy repair via experience abstraction with P…
Figure 20. Policy update example on the MGSM dataset. We highlight the updates in the current…
Figure 21. Policy update example on the DROP dataset. We highlight the updates in the current…
Figure 22. Policy update example on the DROP dataset. We highlight the updates in the current…
Figure 23. Policy update example on the GPQA dataset. We highlight the updates in the current…
Figure 24. Policy update example on the GPQA dataset. We highlight the updates in the current…
Figure 25. Policy update example on the LitBench dataset. We highlight the updates in the current…
Figure 26. Policy update example on the LitBench dataset. We highlight the updates in the current…
Figure 27. Failure case (Mistral-7B-Instruct-v0.3): the model emits a single response…
Figure 28. Failure mode (deepseek-coder-6.7b-instruct): a…
Figure 29. Failure case (Llama-3.1-8B-Instruct): while attempting to update solver via ac…
read the original abstract

Gödel agent realize recursive self-improvement: an agent inspects its own policy and traces and then modifies that policy in a tested loop. We introduce Polaris, a Gödel agent for compact models that performs policy repair via experience abstraction, turning failures into policy updates through a structured cycle of analysis, strategy formation, abstraction, and minimal code pat ch repair with conservative checks. Unlike response level self correction or parameter tuning, Polaris makes policy level changes with small, auditable patches that persist in the policy and are reused on unseen instances within each benchmark. As part of the loop, the agent engages in meta reasoning: it explains its errors, proposes concrete revisions to its own policy, and then updates the policy. To enable cumulative policy refinement, we introduce experience abstraction, which distills failures into compact, reusable strategies that transfer to unseen instances. On MGSM, DROP, GPQA, and LitBench (covering arithmetic reasoning, compositional inference, graduate-level problem solving, and creative writing evaluation), a 7-billion-parameter model equipped with Polaris achieves consistent gains over the base policy and competitive baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Polaris, a Gödel agent framework for small (7B) language models that enables recursive self-improvement via a structured loop of error analysis, strategy formation, experience abstraction, and minimal code-patch policy repair. Experience abstraction is presented as the key mechanism that distills specific failures into compact, reusable strategies that transfer to unseen instances. The central empirical claim is that equipping a 7B model with Polaris yields consistent gains over the base policy and competitive baselines on MGSM, DROP, GPQA, and LitBench.

Significance. If the empirical results and transfer claims hold, the work would be significant for agentic and meta-reasoning research: it offers an auditable, parameter-free route to persistent policy updates in compact models, avoiding both full fine-tuning and purely response-level correction. The emphasis on experience abstraction as a bridge from instance failures to general strategies addresses a practical bottleneck in self-improving agents. However, the absence of any quantitative numbers, ablations, or transfer metrics in the manuscript makes it impossible to evaluate whether these benefits are realized.

major comments (3)
  1. [Abstract] Abstract: the claim of 'consistent gains' on MGSM, DROP, GPQA, and LitBench is stated without any numerical results, error bars, ablation tables, or description of how transfer to unseen instances was measured or verified, leaving the central empirical claim unsupported by visible data.
  2. [§3–4] Experience abstraction description (throughout §3–4): the mechanism is described at a high level as converting failures into 'compact, reusable strategies,' but no concrete procedure, pseudocode, or example is given showing how instance-specific traces are stripped to guarantee generalization rather than producing instance-level patches; this directly bears on whether the reported gains reflect policy-level improvement or merely memorization of the repair-loop cases.
  3. [§5] Transfer evaluation (implied in §5): the manuscript does not report any held-out instance split, cross-benchmark transfer test, or regression check on unrelated tasks after patch application, which is required to substantiate the claim that abstracted strategies persist and improve performance on unseen instances.
minor comments (2)
  1. [Abstract] Abstract: 'pat ch' is a typographical error and should read 'patch'.
  2. [Abstract] Abstract: inconsistent formatting of 'Godel' / 'Godel agent'; standardize to 'Gödel agent' throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where the manuscript's empirical support and methodological details can be strengthened. We agree that adding explicit quantitative results, concrete procedures, and evaluation protocols will improve the paper and will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'consistent gains' on MGSM, DROP, GPQA, and LitBench is stated without any numerical results, error bars, ablation tables, or description of how transfer to unseen instances was measured or verified, leaving the central empirical claim unsupported by visible data.

    Authors: We agree that the abstract should include supporting numbers. In the revision we will insert the key performance deltas (with standard deviations from multiple runs) for each benchmark and a one-sentence description of the held-out transfer protocol used to verify generalization. revision: yes

  2. Referee: [§3–4] Experience abstraction description (throughout §3–4): the mechanism is described at a high level as converting failures into 'compact, reusable strategies,' but no concrete procedure, pseudocode, or example is given showing how instance-specific traces are stripped to guarantee generalization rather than producing instance-level patches; this directly bears on whether the reported gains reflect policy-level improvement or merely memorization of the repair-loop cases.

    Authors: We will expand §§3–4 with an explicit algorithm (including pseudocode) for the abstraction step that removes instance-specific identifiers and values, replacing them with generalized patterns. We will also add a worked example from MGSM showing an original failure trace, the resulting abstract strategy, and its successful application to a previously unseen problem, thereby clarifying that the mechanism targets policy-level rather than instance-level updates. revision: yes

  3. Referee: [§5] Transfer evaluation (implied in §5): the manuscript does not report any held-out instance split, cross-benchmark transfer test, or regression check on unrelated tasks after patch application, which is required to substantiate the claim that abstracted strategies persist and improve performance on unseen instances.

    Authors: We will add a dedicated subsection in §5 that details the held-out splits, reports quantitative transfer results (including cross-benchmark patch application), and includes regression checks on unrelated tasks. These additions will supply the missing metrics needed to demonstrate persistent policy improvement on unseen instances. revision: yes
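The held-out protocol promised in the third response depends on keeping repair-loop instances and evaluation instances disjoint. A minimal sketch of such a split follows; `split_holdout`, the default fraction, and the fixed seed are assumptions for illustration, since the page does not state how the authors partition each benchmark.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_holdout(
    items: Sequence[T], holdout_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Deterministically split items into (repair_pool, held_out) with no overlap."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # seeded so the split is reproducible
    n_holdout = int(round(len(items) * holdout_fraction))
    cut = len(items) - n_holdout
    repair_pool = [items[i] for i in order[:cut]]
    held_out = [items[i] for i in order[cut:]]
    return repair_pool, held_out
```

Failures used for abstraction would be drawn only from `repair_pool`, and transfer would be reported only on `held_out`, so a gain there cannot come from patches fitted to the same instances.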

Circularity Check

0 steps flagged

No significant circularity in the Polaris framework derivation

full rationale

The paper introduces Polaris as a new Gödel agent framework that performs policy repair via experience abstraction, turning failures into policy updates through analysis, strategy formation, abstraction, and minimal code patch repair. The central claims of consistent gains on MGSM, DROP, GPQA, and LitBench are presented as empirical outcomes from applying the framework to a 7B model, without any derivation chain, equations, or fitted parameters that reduce predictions to inputs by construction. No self-citations, uniqueness theorems, or ansatzes from prior work are invoked as load-bearing justifications. The experience abstraction mechanism is described as an independent construction that distills failures into reusable strategies, making the overall derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that small models can reliably perform meta-reasoning to generate and validate their own policy patches; experience abstraction is introduced as a new entity without independent falsifiable evidence outside the described loop.

axioms (1)
  • domain assumption: Small language models can perform reliable meta-reasoning to analyze errors and propose policy revisions.
    Invoked as the basis for the self-improvement loop in the abstract.
invented entities (1)
  • Experience abstraction (no independent evidence)
    purpose: Distill failures into compact reusable strategies that transfer to unseen instances.
    New construct introduced to enable cumulative policy refinement.

pith-pipeline@v0.9.0 · 5500 in / 1200 out tokens · 34591 ms · 2026-05-15T07:40:43.835059+00:00 · methodology
