SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Jiarui Yuan; Tailin Jin; Weize Chen; Zeyuan Liu

arXiv: 2602.04811 · v2 · submitted 2026-02-04 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-16 07:17 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG

keywords: self-evolution, knowledge internalization, SE-Bench, closed-book training, open-book paradox, RL gap, self-play, coding agents

The pith

Agents retain novel knowledge only when documentation is withheld during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates SE-Bench to isolate whether agents can compress new information into their weights rather than depend on external references. It does this by turning the NumPy library into a pseudo-novel package whose identifiers are randomized, so that simple coding tasks become trivial once the documentation is seen but impossible for any base model that has not internalized the API. Experiments show that keeping the documentation available during training actually reduces retention, so closed-book training is needed to push the knowledge into the model's weights. Standard reinforcement learning leaves gaps in internalization, which the paper attributes to PPO clipping and negative gradients, whereas self-play paired with supervised fine-tuning succeeds at learning from the model's own imperfect task generations.

Core claim

SE-Bench shows that true self-evolution rests on knowledge internalization, which is actively inhibited by the Open-Book Paradox when reference material remains available, only partially achieved by standard RL due to the RL Gap arising from PPO clipping and negative gradients, and made possible by Self-Play combined with SFT that lets models improve from their own noisy self-generated tasks.

What carries the argument

SE-Bench, the diagnostic environment that converts the NumPy library and its API documentation into a pseudo-novel package with randomized identifiers so that success without documentation access directly measures internalization.
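The renaming idea can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual pipeline: the helper names `build_obfuscation_map` and `obfuscate_text`, the 8-character lowercase naming scheme, and the three-function API slice are all assumptions made for the example.

```python
import random
import re
import string


def build_obfuscation_map(names, seed=0, length=8):
    """Map each original identifier to a unique random lowercase string."""
    rng = random.Random(seed)
    used = set(names)  # never emit a string that equals an original name
    mapping = {}
    for name in sorted(names):
        while True:
            candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            if candidate not in used:
                break
        used.add(candidate)
        mapping[name] = candidate
    return mapping


def obfuscate_text(text, mapping):
    """Replace whole-word occurrences of each identifier in code or documentation."""
    if not mapping:
        return text
    alternatives = "|".join(re.escape(name) for name in mapping)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


# Hypothetical three-function slice of an API, used only for illustration.
api_names = ["zeros", "arange", "reshape"]
mapping = build_obfuscation_map(api_names, seed=42)
doc = "reshape(arange(6), (2, 3)) returns a 2x3 array; zeros(3) returns a length-3 array."
obfuscated_doc = obfuscate_text(doc, mapping)
print(obfuscated_doc)
```

A whole-word regex like this also rewrites identifiers that happen to appear as ordinary English words in prose, so a real pipeline would likely need to operate on parsed code and structured docstrings rather than raw text.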

If this is right

Closed-book training is required to compress new knowledge into model weights rather than allowing reliance on external documentation.
Standard PPO reinforcement learning leaves incomplete internalization because of its clipping rule and handling of negative gradients.
Self-play paired with supervised fine-tuning lets models extract useful signal from their own noisy, self-generated tasks.
Self-evolution benchmarks must separate retention from reasoning difficulty by using deliberately trivial tasks once the new knowledge is internalized.
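The difference between open-book and closed-book training data can be made concrete with a small sketch. Everything named here is hypothetical (`TrainingExample`, `make_example`, the package name `zwlib`, and the function name `qvxlmrta`), and the page does not say how the paper produces its training solutions, so this only illustrates what "withholding the documentation from the prompt" means.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    prompt: str  # what the model conditions on during training
    target: str  # the solution text the training loss is computed on


def make_example(task, solution, api_doc, closed_book):
    """Build one training pair; closed_book drops the documentation from the prompt."""
    if closed_book:
        prompt = f"Task: {task}\nSolution:\n"
    else:
        prompt = f"Documentation:\n{api_doc}\n\nTask: {task}\nSolution:\n"
    return TrainingExample(prompt=prompt, target=solution)


# Hypothetical obfuscated API entry and task, for illustration only.
api_doc = "qvxlmrta(n): return an array of n zeros."
task = "Create an array of five zeros."
solution = "import zwlib\nresult = zwlib.qvxlmrta(5)"

open_book = make_example(task, solution, api_doc, closed_book=False)
closed_book = make_example(task, solution, api_doc, closed_book=True)
```

In the closed-book pair the obfuscated name appears only in the target, so the model cannot copy it from context and has to produce it from its weights.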

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training schedules for agents may need explicit closed-book phases after any open-book exposure to lock in retention.
Obfuscation methods similar to SE-Bench could test internalization of new facts in non-coding domains such as mathematics or scientific reasoning.
Hybrid curricula that alternate open-book and closed-book stages might combine fast acquisition with durable retention.

Load-bearing premise

The randomized identifiers create knowledge that lies entirely outside the model's pre-training distribution and that task failures result only from missing internalization rather than from problem difficulty.

What would settle it

Run the closed-book-trained model on the SE-Bench tasks with all documentation removed and check whether accuracy rises from near zero to near perfect while the base model stays near zero.
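A toy version of that check might look like the following. It is a sketch under stated assumptions: `pkg` is a stand-in object rather than the real obfuscated package, `qvxlmrta` is a made-up name for a renamed zeros-like function, and the two "models" are hard-coded functions that return a fixed program.

```python
from types import SimpleNamespace

# Hypothetical obfuscated package exposing one renamed function.
pkg = SimpleNamespace(qvxlmrta=lambda n: [0] * n)


def run_program(program):
    """Execute a candidate program against the stand-in package; return `result` or None."""
    namespace = {"pkg": pkg}
    try:
        exec(program, namespace)
    except Exception:
        return None  # unknown names, syntax errors, etc. count as failures
    return namespace.get("result")


def closed_book_accuracy(solve, tasks):
    """Fraction of tasks solved when the prompt contains no documentation."""
    if not tasks:
        return 0.0
    passed = sum(1 for task in tasks if run_program(solve(task["prompt"])) == task["expected"])
    return passed / len(tasks)


tasks = [{"prompt": "Store a list of three zeros in `result`.", "expected": [0, 0, 0]}]


def base_model(prompt):
    # Stand-in for a model that only knows the original, pre-obfuscation name.
    return "result = pkg.zeros(3)"


def trained_model(prompt):
    # Stand-in for a model that has internalized the renamed function.
    return "result = pkg.qvxlmrta(3)"


base_acc = closed_book_accuracy(base_model, tasks)
trained_acc = closed_book_accuracy(trained_model, tasks)
```

Executing untrusted model output with `exec` is acceptable only in a toy; a real harness would need sandboxing and timeouts.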

Original abstract

True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SE-Bench shows that closed-book training helps models retain an obfuscated NumPy API while open-book and standard RL do not, but the randomized identifiers may still leak enough structure to weaken the internalization claims.

The paper's main move is to create SE-Bench by taking NumPy, replacing every identifier with random strings, and packaging the docs as a new library. Models train on coding tasks with this package and then get tested closed-book on the same simple tasks. This produces three observations: having the reference docs during training reduces later retention (Open-Book Paradox), PPO-style RL leaves gaps because of clipping and negative gradients (RL Gap), and self-play plus SFT can learn from the model's own noisy outputs where pure RL cannot. They release the code and dataset, which makes the setup easy to check or extend. The design keeps task difficulty low so failures can be tied more directly to missing internalized knowledge rather than hard reasoning. That part is useful for anyone tracking lifelong-learning agents or tool-use systems. The soft spot is the assumption that randomized names plus obfuscated docs create knowledge truly outside the base model's distribution. Residual patterns in docstring structure or API shape could still let the model guess some behavior, which would mean the reported gaps mix leakage with genuine retention failure. The abstract gives no numbers on data splits, variance, or how many tasks were filtered, so the size of the effects is hard to judge. This is for groups working on agent self-evolution benchmarks. It is worth sending to peer review because the benchmark setup is concrete and the training-regime comparisons are testable even if the novelty claim needs tighter controls.

Referee Report

2 major / 2 minor

Summary. The paper introduces SE-Bench, a diagnostic benchmark that obfuscates the NumPy library and its documentation into a pseudo-novel package using randomized identifiers. Agents are trained to internalize this knowledge and then evaluated on simple closed-book coding tasks. The work reports three main findings: an Open-Book Paradox in which access to reference documentation during training inhibits retention (necessitating Closed-Book Training), an RL Gap in which standard RL (PPO) fails to achieve full internalization due to clipping and negative gradients, and evidence that Self-Play combined with SFT succeeds at learning from self-generated noisy tasks while RL does not.

Significance. If the central claims hold after addressing the novelty concern, SE-Bench would supply a controlled, reproducible testbed for knowledge internalization that cleanly separates retention from reasoning difficulty. The public release of code and dataset is a clear strength that supports follow-up work on lifelong-learning agents.

major comments (2)

[Benchmark construction] Benchmark construction (abstract and methods): the claim that randomized identifiers produce knowledge 'outside the model's pre-training distribution' is load-bearing for attributing the Open-Book Paradox and RL Gap to training regime rather than data leakage. No control is reported that quantifies how much syntactic or structural overlap with real NumPy remains recognizable to base models; without such evidence the attribution of zero closed-book performance to failed internalization is weakened.
[Results on RL training] RL Gap analysis (results section): the assertion that PPO clipping and negative gradients are the mechanistic cause of incomplete internalization requires more than correlational observation. An ablation that isolates clipping (or reports per-token gradient norms and advantage statistics) is needed to make the causal claim load-bearing for the recommendation against standard RL.
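The overlap control requested in the first major comment could be approximated as follows. This is only a sketch: it uses character n-grams as a tokenizer-free stand-in for token overlap, and the obfuscated names are invented for the example, not taken from the released dataset.

```python
def char_ngrams(name, n):
    """Set of all contiguous character n-grams in a name."""
    return {name[i:i + n] for i in range(len(name) - n + 1)}


def ngram_overlap(original_names, obfuscated_names, n=3):
    """Fraction of obfuscated n-grams that also occur in some original name."""
    original = set().union(*(char_ngrams(name, n) for name in original_names))
    obfuscated = set().union(*(char_ngrams(name, n) for name in obfuscated_names))
    if not obfuscated:
        return 0.0
    return len(obfuscated & original) / len(obfuscated)


def exact_matches(original_names, obfuscated_names):
    """Obfuscated names that collide with an original identifier."""
    return set(original_names) & set(obfuscated_names)


original_names = ["zeros", "arange", "reshape"]
clean_names = ["qvxlmrta", "kdjwpuhn", "bfytcgse"]  # hypothetical random renames
leaky_names = ["zeros_x"]                           # a rename that keeps the old stem

clean_overlap = ngram_overlap(original_names, clean_names)
leaky_overlap = ngram_overlap(original_names, leaky_names)
```

A lexical score of zero only rules out leakage through names. It says nothing about the structural cues the referee raises, such as argument order or docstring layout, which would need a separate probe.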

minor comments (2)

[Abstract] The phrase 'pseudo-novel package' is used without a precise definition; a short paragraph clarifying the exact randomization procedure and any retained structural cues would improve reproducibility.
[Evaluation protocol] Task difficulty is asserted to be 'trivial with the new API doc'; a table reporting human or oracle performance on the closed-book version would make this claim concrete.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript.

Referee: [Benchmark construction] Benchmark construction (abstract and methods): the claim that randomized identifiers produce knowledge 'outside the model's pre-training distribution' is load-bearing for attributing the Open-Book Paradox and RL Gap to training regime rather than data leakage. No control is reported that quantifies how much syntactic or structural overlap with real NumPy remains recognizable to base models; without such evidence the attribution of zero closed-book performance to failed internalization is weakened.

Authors: We agree that a quantitative control would strengthen the attribution. In the revised manuscript we will add a dedicated paragraph in the methods section detailing the randomization procedure (full replacement of all identifiers with unique random strings drawn from a disjoint character set) and report two controls: (1) exact token overlap statistics between the obfuscated vocabulary and the original NumPy names, and (2) zero-shot performance of the base model on the SE-Bench tasks. Because every identifier is replaced by a fresh random string with no lexical or n-gram overlap, we expect and will confirm zero closed-book accuracy; the new analysis will make this explicit and rule out any residual structural leakage. revision: yes
Referee: [Results on RL training] RL Gap analysis (results section): the assertion that PPO clipping and negative gradients are the mechanistic cause of incomplete internalization requires more than correlational observation. An ablation that isolates clipping (or reports per-token gradient norms and advantage statistics) is needed to make the causal claim load-bearing for the recommendation against standard RL.

Authors: We acknowledge that the current RL Gap discussion relies on observed training curves rather than a controlled ablation. In the revision we will add an explicit ablation comparing standard PPO against a clipping-free variant (using the same learning rate and advantage estimator) and will report per-token gradient norm histograms together with advantage statistics across training steps. These results will be placed in the results section to provide direct evidence for the role of clipping and negative gradients. revision: yes
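For readers unfamiliar with the clipping mechanism at issue, the standard PPO per-token surrogate is min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio between the new and old policy and A is the advantage. The sketch below shows where its derivative with respect to r vanishes. It illustrates the textbook objective only; the function names and the sample ratios and advantages are assumptions, and nothing here reproduces the paper's ablation or its treatment of negative gradients.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single token."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def is_clipped(ratio, advantage, eps=0.2):
    """True when the clipped branch is the active one, so the token gets no gradient."""
    if advantage > 0 and ratio > 1.0 + eps:
        return True
    if advantage < 0 and ratio < 1.0 - eps:
        return True
    return False


def ppo_clip_grad(ratio, advantage, eps=0.2):
    """Derivative of the term with respect to the ratio (away from the clip boundaries)."""
    return 0.0 if is_clipped(ratio, advantage, eps) else advantage


def clipped_fraction(ratios, advantages, eps=0.2):
    """Share of tokens whose gradient is zeroed by clipping."""
    if not ratios:
        return 0.0
    flags = [is_clipped(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(flags) / len(flags)


# Hypothetical per-token ratios and advantages, chosen to hit each case.
ratios = [1.0, 1.5, 0.5, 1.5]
advantages = [1.0, 1.0, -1.0, -1.0]
grads = [ppo_clip_grad(r, a) for r, a in zip(ratios, advantages)]
fraction = clipped_fraction(ratios, advantages)
```

A clipping-free variant would use the advantage as the derivative for every token, so comparing the two on logged ratios and advantages shows how much update signal clipping removes.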

Circularity Check

0 steps flagged

No circularity; benchmark definition and empirical results are independent

The paper defines SE-Bench via obfuscation of NumPy identifiers and documentation into a pseudo-novel package, then reports empirical outcomes on closed-book coding tasks under different training regimes (SFT, RL, self-play). No equations, fitted parameters, or derivations are present that reduce to their own inputs by construction. Claims such as the Open-Book Paradox and RL Gap are presented as observations from controlled experiments on the benchmark, not as predictions forced by prior fits or self-citations. The benchmark construction is logically prior to and independent of the measured retention rates, satisfying the criteria for a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on the assumption that randomized API identifiers create genuinely novel knowledge and that the chosen coding tasks are simple enough to isolate internalization from reasoning difficulty. No free parameters or invented entities are introduced.

axioms (2)

domain assumption: Randomized identifiers produce knowledge outside the pre-training distribution
Invoked in the construction of the pseudo-novel package; required for the claim that failures reflect lack of internalization rather than prior exposure.
domain assumption: Task difficulty is low enough that success depends only on knowledge access
Stated when describing the evaluation tasks as trivial with the new API doc but impossible without it.

pith-pipeline@v0.9.0 · 5556 in / 1410 out tokens · 28364 ms · 2026-05-16T07:17:35.550237+00:00 · methodology
