SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization
Pith reviewed 2026-05-16 07:17 UTC · model grok-4.3
The pith
Agents retain novel knowledge only when documentation is withheld during training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SE-Bench shows that true self-evolution rests on knowledge internalization. Internalization is actively inhibited when reference material remains available during training (the Open-Book Paradox), is only partially achieved by standard RL because of PPO clipping and negative gradients (the RL Gap), and becomes possible when Self-Play is combined with SFT, which lets models improve from their own noisy self-generated tasks.
What carries the argument
SE-Bench, the diagnostic environment that converts the NumPy library and its API documentation into a pseudo-novel package with randomized identifiers so that success without documentation access directly measures internalization.
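The obfuscation idea can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `build_obfuscation_map` and `obfuscate`, the alias length, and the use of lowercase random strings are all assumptions made for illustration, and the sketch rewrites plain text rather than a real package.

```python
import random
import re
import string


def build_obfuscation_map(names, seed=0, length=8):
    """Map each original identifier to a unique random lowercase alias.

    Hypothetical helper: the paper's actual randomization scheme may differ.
    """
    originals = set(names)
    rng = random.Random(seed)  # seeded so the mapping is reproducible
    mapping, used = {}, set()
    for name in sorted(originals):
        while True:
            alias = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            # Reject collisions with other aliases and with real identifiers.
            if alias not in used and alias not in originals:
                break
        used.add(alias)
        mapping[name] = alias
    return mapping


def obfuscate(text, mapping):
    """Replace whole-word occurrences of every mapped identifier in text."""
    if not mapping:
        return text
    # Longest names first so a short name never shadows a longer one.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


names = ["numpy", "zeros", "reshape", "arange"]
mapping = build_obfuscation_map(names)
print(obfuscate("numpy.arange(6).reshape(2, 3)", mapping))
```

The same mapping would have to be applied to both the code and its documentation so that the two stay consistent. Only the surface names change, while call structure such as `x.y(6).z(2, 3)` survives.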
If this is right
- Closed-book training is required to compress new knowledge into model weights rather than allowing reliance on external documentation.
- Standard PPO reinforcement learning leaves incomplete internalization because of its clipping rule and handling of negative gradients.
- Self-play paired with supervised fine-tuning lets models extract useful signal from their own noisy, self-generated tasks.
- Self-evolution benchmarks must separate retention from reasoning difficulty by using deliberately trivial tasks once the new knowledge is internalized.
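For readers unfamiliar with the clipping rule mentioned above, the following sketch evaluates the standard PPO clipped surrogate for a single token and estimates its gradient with respect to the probability ratio. It illustrates the textbook objective only. Whether this is the exact mechanism behind the RL Gap is the paper's claim, not something the sketch demonstrates, and the function names and the value `eps=0.2` are illustrative assumptions.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one token.

    ratio is pi_new(a|s) / pi_old(a|s); eps is the clipping range.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective)/d(ratio)."""
    up = ppo_clipped_objective(ratio + h, advantage, eps)
    down = ppo_clipped_objective(ratio - h, advantage, eps)
    return (up - down) / (2.0 * h)


# (ratio, advantage) pairs: one inside the trust region, three outside it.
for ratio, adv in [(1.0, 1.0), (1.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
    print(ratio, adv, round(grad_wrt_ratio(ratio, adv), 3))
```

The gradient vanishes when the advantage is positive and the ratio already exceeds `1 + eps`, and when the advantage is negative and the ratio is already below `1 - eps`. In those cases the token contributes no further learning signal in that update.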
Where Pith is reading between the lines
- Training schedules for agents may need explicit closed-book phases after any open-book exposure to lock in retention.
- Obfuscation methods similar to SE-Bench could test internalization of new facts in non-coding domains such as mathematics or scientific reasoning.
- Hybrid curricula that alternate open-book and closed-book stages might combine fast acquisition with durable retention.
Load-bearing premise
The randomized identifiers create knowledge that lies entirely outside the model's pre-training distribution and that task failures result only from missing internalization rather than from problem difficulty.
What would settle it
Run the closed-book-trained model on the SE-Bench tasks with all documentation removed and check whether accuracy rises from near zero to near perfect while the base model stays near zero.
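A check of this kind could be scored with a harness along the following lines. This is a sketch under stated assumptions: `closed_book_accuracy`, the `(prompt, expected)` task format, and the two toy solvers are hypothetical stand-ins, not the SE-Bench evaluation code, and the printed numbers are not results from the paper.

```python
def closed_book_accuracy(solver, tasks):
    """Fraction of tasks solved when the solver sees only the prompt.

    tasks is a list of (prompt, expected_output) pairs; no documentation
    is passed to the solver, which is what makes the setting closed-book.
    """
    if not tasks:
        return 0.0
    correct = 0
    for prompt, expected in tasks:
        try:
            if solver(prompt) == expected:
                correct += 1
        except Exception:
            # A crashing solution counts as a failure, not as an error.
            pass
    return correct / len(tasks)


# Toy stand-ins for a base model and a model that has internalized the API.
toy_tasks = [("sum of [1, 2, 3]", 6), ("max of [4, 9, 2]", 9)]
toy_answers = dict(toy_tasks)


def base_solver(prompt):
    raise NameError("unknown identifier")  # base model cannot recall the API


def trained_solver(prompt):
    return toy_answers[prompt]


print(closed_book_accuracy(base_solver, toy_tasks))
print(closed_book_accuracy(trained_solver, toy_tasks))
```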
Original abstract
True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SE-Bench, a diagnostic benchmark that obfuscates the NumPy library and its documentation into a pseudo-novel package using randomized identifiers. Agents are trained to internalize this knowledge and then evaluated on simple closed-book coding tasks. The work reports three main findings: an Open-Book Paradox in which access to reference documentation during training inhibits retention (necessitating Closed-Book Training), an RL Gap in which standard RL (PPO) fails to achieve full internalization due to clipping and negative gradients, and evidence that Self-Play combined with SFT succeeds at learning from self-generated noisy tasks while RL does not.
Significance. If the central claims hold after addressing the novelty concern, SE-Bench would supply a controlled, reproducible testbed for knowledge internalization that cleanly separates retention from reasoning difficulty. The public release of code and dataset is a clear strength that supports follow-up work on lifelong-learning agents.
major comments (2)
- [Benchmark construction] Benchmark construction (abstract and methods): the claim that randomized identifiers produce knowledge 'outside the model's pre-training distribution' is load-bearing for attributing the Open-Book Paradox and RL Gap to training regime rather than data leakage. No control is reported that quantifies how much syntactic or structural overlap with real NumPy remains recognizable to base models; without such evidence the attribution of zero closed-book performance to failed internalization is weakened.
- [Results on RL training] RL Gap analysis (results section): the assertion that PPO clipping and negative gradients are the mechanistic cause of incomplete internalization requires more than correlational observation. An ablation that isolates clipping (or reports per-token gradient norms and advantage statistics) is needed to make the causal claim load-bearing for the recommendation against standard RL.
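The overlap control requested in the first major comment could take a form like the sketch below, which measures how many character n-grams of the obfuscated identifiers also occur in the original names. Character n-grams are used here as a simple stand-in for tokenizer-level overlap. The function names and the example identifiers are hypothetical, and the sketch covers lexical overlap only, not the structural cues the comment also raises.

```python
def char_ngrams(s, n):
    """Set of all character n-grams of length n in s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_overlap(original_names, obfuscated_names, n=3):
    """Fraction of obfuscated n-grams that also appear in the original names."""
    orig = set().union(*(char_ngrams(x, n) for x in original_names))
    obf = set().union(*(char_ngrams(x, n) for x in obfuscated_names))
    if not obf:
        return 0.0
    return len(obf & orig) / len(obf)


# Hypothetical names: "zerosx" deliberately leaks part of an original name.
original = ["zeros", "reshape", "arange"]
obfuscated = ["qxvtk", "zerosx", "mwpld"]

print(set(original) & set(obfuscated))      # exact-name collisions
print(ngram_overlap(original, obfuscated))  # partial lexical leakage
```

A value near zero would support the claim of no lexical overlap. It would say nothing about whether argument order, call patterns, or docstring phrasing remain recognizable.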
minor comments (2)
- [Abstract] The phrase 'pseudo-novel package' is used without a precise definition; a short paragraph clarifying the exact randomization procedure and any retained structural cues would improve reproducibility.
- [Evaluation protocol] Task difficulty is asserted to be 'trivial with the new API doc'; a table reporting human or oracle performance on the closed-book version would make this claim concrete.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Benchmark construction] Benchmark construction (abstract and methods): the claim that randomized identifiers produce knowledge 'outside the model's pre-training distribution' is load-bearing for attributing the Open-Book Paradox and RL Gap to training regime rather than data leakage. No control is reported that quantifies how much syntactic or structural overlap with real NumPy remains recognizable to base models; without such evidence the attribution of zero closed-book performance to failed internalization is weakened.
Authors: We agree that a quantitative control would strengthen the attribution. In the revised manuscript we will add a dedicated paragraph in the methods section detailing the randomization procedure (full replacement of all identifiers with unique random strings drawn from a disjoint character set) and report two controls: (1) exact token overlap statistics between the obfuscated vocabulary and the original NumPy names, and (2) zero-shot performance of the base model on the SE-Bench tasks. Because every identifier is replaced by a fresh random string with no lexical or n-gram overlap, we expect and will confirm zero closed-book accuracy; the new analysis will make this explicit and rule out any residual structural leakage. revision: yes
- Referee: [Results on RL training] RL Gap analysis (results section): the assertion that PPO clipping and negative gradients are the mechanistic cause of incomplete internalization requires more than correlational observation. An ablation that isolates clipping (or reports per-token gradient norms and advantage statistics) is needed to make the causal claim load-bearing for the recommendation against standard RL.
Authors: We acknowledge that the current RL Gap discussion relies on observed training curves rather than a controlled ablation. In the revision we will add an explicit ablation comparing standard PPO against a clipping-free variant (using the same learning rate and advantage estimator) and will report per-token gradient norm histograms together with advantage statistics across training steps. These results will be placed in the results section to provide direct evidence for the role of clipping and negative gradients.
revision: yes
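One diagnostic that such an ablation might report is the fraction of tokens for which clipping is active. Under the standard PPO clipped surrogate, a token receives zero gradient when its advantage is positive and its probability ratio exceeds `1 + eps`, or when its advantage is negative and its ratio falls below `1 - eps`. The sketch below computes that fraction. The function name, `eps=0.2`, and the example values are assumptions for illustration, not the authors' code or data.

```python
def clip_fraction(ratios, advantages, eps=0.2):
    """Fraction of tokens whose clipped-surrogate gradient is zero."""
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        return 0.0
    clipped = 0
    for r, a in zip(ratios, advantages):
        # Clipping removes the gradient only when the update would push the
        # ratio further outside the trust region in the favored direction.
        if (a > 0 and r > 1.0 + eps) or (a < 0 and r < 1.0 - eps):
            clipped += 1
    return clipped / len(ratios)


# Illustrative per-token values, not measurements from the paper.
ratios = [1.0, 1.5, 0.5, 1.5, 0.5]
advantages = [1.0, 1.0, -1.0, -1.0, 1.0]
print(clip_fraction(ratios, advantages))
```

Tracking this quantity across training steps for both the standard and clipping-free runs would show how often clipping actually suppresses updates.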
Circularity Check
No circularity; benchmark definition and empirical results are independent
Full rationale
The paper defines SE-Bench via obfuscation of NumPy identifiers and documentation into a pseudo-novel package, then reports empirical outcomes on closed-book coding tasks under different training regimes (SFT, RL, self-play). No equations, fitted parameters, or derivations are present that reduce to their own inputs by construction. Claims such as the Open-Book Paradox and RL Gap are presented as observations from controlled experiments on the benchmark, not as predictions forced by prior fits or self-citations. The benchmark construction is logically prior to and independent of the measured retention rates, satisfying the criteria for a self-contained empirical study.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Randomized identifiers produce knowledge outside the pre-training distribution
- domain assumption: Task difficulty is low enough that success depends only on knowledge access