ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

Chengcao Yang; Jun Chen

arxiv: 2604.27644 · v2 · submitted 2026-04-30 · 💻 cs.LG · cs.AI· cs.PL

ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

Chengcao Yang , Jun Chen This is my paper

Pith reviewed 2026-05-08 03:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.PL

keywords self-playverifiable reasoningcurriculum learningformal verificationreinforcement learningtest-time training

0 comments

The pith

ANCORA lets a single policy generate its own verifiable problems and solve them, bootstrapping high performance in formal verification without any human solutions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ANCORA as a method for open-ended curriculum self-play in which one policy alternates between proposing novel verifiable specifications and solving them under verifier feedback. It shows that three specific stabilizers—a two-level group-relative advantage update, iterative self-distilled supervised fine-tuning onto valid outputs, and a UCB-guided curriculum DAG—together prevent the proposer from collapsing under sparse signals. With these stabilizers the system lifts pass@1 on Dafny2Verus from a 26.6 percent supervised baseline to 81.5 percent in zero-shot test-time training and transfers to other code benchmarks. The core argument is that learning to question, rather than answering fixed prompts, becomes feasible once the self-play loop is anchored to the manifold of verifiable outputs.

Core claim

ANCORA alternates a Proposer that synthesizes novel specifications with a Solver that produces verified solutions; a two-level group-relative update couples their advantages, iterative self-distilled SFT projects the model onto its valid-output manifold, and a UCB-guided Curriculum DAG expands the policy-induced problem set under self-composition, enabling the entire loop to bootstrap from zero human solutions and reach 81.5 percent pass@1 on Verus tasks.

What carries the argument

The three stabilizers (two-level group-relative update, iterative self-distilled SFT, and UCB-guided Curriculum DAG) that couple proposer and solver advantages and keep the generated problem set expanding without collapse.

If this is right

The policy can synthesize an expanding set of verifiable problems whose difficulty is controlled by the UCB-guided DAG.
Performance gains transfer to held-out benchmarks such as MBPP and HumanEval when training begins from Dafny2Verus seeds.
The same loop can be applied to any domain that supplies an automatic verifier, removing the need for human-annotated solution traces.
Test-time training with the anchored self-play loop improves 0-shot accuracy beyond 1-shot baselines that rely on external examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The Curriculum DAG structure may allow continuous, automatic curriculum expansion beyond the initial problem distribution.
The manifold-projection step could be adapted to other modalities where an automatic verifier exists, such as theorem proving or constraint satisfaction.
If the stabilizers generalize, self-play curricula might reduce dependence on large human-curated datasets for training verifiable reasoners.

Load-bearing premise

The three stabilizers are enough to stop the proposer from collapsing when verifier feedback is sparse and only the model’s own outputs are available.

What would settle it

Run the identical training loop on the same Verus tasks but disable one stabilizer (for example, remove the two-level group-relative update) and measure whether pass@1 falls back to the 26.6 percent SFT baseline within the same number of steps.

Figures

Figures reproduced from arXiv: 2604.27644 by Chengcao Yang, Jun Chen.

**Figure 1.** Figure 1: Learning paradigms. Left: SFT is bounded by static-data supervision. Middle: RLVR uses environment feedback but a fixed prompt pool. Right: ANCORA. The Proposer (Q-Gen) generates a novel spec x ′ from seed xseed via 1-shot ICL; the Solver (S-Gen) attempts implementation y. A coupled MLRL update shapes both personas of πθ via Solver outcomes rS ∈ {0, 1} K (§4.2.3–§4.3). Strictly filtered, novel, Solver-veri… view at source ↗

**Figure 2.** Figure 2: Iterative Self-Distilled SFT for valid manifold initialization. Direct RL from the base model view at source ↗

**Figure 3.** Figure 3: The UCB-guided Curriculum DAG. Top: Original seeds (blue) are protected roots; Proposer-generated specs (green) are admitted as children only after solved-only verification; highUCB frontier nodes (orange) become next-step seeds, low-UCB nodes (red, dashed) are deprioritized; dashed-violet edges mark structural kinship (Jaccard ≥ 0.70). Bottom: Selection first picks a root by UCB, then performs local stay… view at source ↗

**Figure 4.** Figure 4: Branch-freeze and root-only curriculum ablation in the test-time-training setting across view at source ↗

**Figure 5.** Figure 5: Training diagnostics over the first 1,500 RL steps of the canonical 2,000-step run; diagnostic view at source ↗

**Figure 6.** Figure 6: Proposer reward variant ablation on the 234-seed d2v241 pool (identical UCB/SFT; view at source ↗

**Figure 7.** Figure 7: Actor gradient-norm trajectories comparing the two Proposer/Solver branch-weighting view at source ↗

read the original abstract

We propose a paradigm shift toward open-ended curriculum self-play: rather than learning to answer on a fixed prompt set, a unified policy learns to question: generating verifiable problems, solving them, and turning verifier feedback into self-improvement without human-annotated solutions. We introduce ANCORA, in which the policy alternates between a Proposer that synthesizes novel specifications and a Solver that produces verified solutions, anchored by three load-bearing mechanisms: a two-level group-relative update coupling Proposer advantages across specifications with Solver advantages across solution attempts; iterative self-distilled SFT projecting the base model onto its valid-output manifold before RL; and a UCB-guided Curriculum DAG whose policy-induced problem set can provably expand under self-composition. Without these stabilizers, sparse verifier feedback drives Proposer collapse even under MLRL-aligned rewards; with them, ANCORA bootstraps a verifiable curriculum from zero human solutions. Instantiated in Verus, ANCORA lifts Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in test-time training (TTT, 0-shot), outperforming PSV self-play by 15.8 points despite PSV's 1-shot inference; in a transfer setting, training from Dafny2Verus seeds yields 36.2% and 17.2% pass@1 on held-out MBPP and HumanEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

read the letter

ANCORA gets a solid performance boost on verifiable code tasks by stabilizing self-play with group-relative updates, self-distilled SFT, and a curriculum DAG, though the paper could do more to show why each piece is essential. The reported 81.5% pass@1 on Dafny2Verus from a 26.6% SFT baseline, along with transfer gains, suggests the approach can build a curriculum from scratch. The new elements are the specific combination of those three mechanisms to handle sparse verifier feedback and prevent proposer collapse. The two-level relative advantage links the two roles, the iterative SFT projects onto valid outputs, and the UCB DAG allows the set of problems to expand reliably. This setup outperforms the PSV baseline and works in a 0-shot test-time training mode. The paper does a good job explaining the motivation and how these address known issues in self-play for open-ended tasks. The main soft spot is the experimental section. Details on controls, error bars, data splits, and ablations are missing or light, which makes it hard to verify that the stabilizers are doing the heavy lifting as claimed. The weakest point is the lack of evidence that all three are necessary rather than any one or two being sufficient. This does not invalidate the results but limits how much we can generalize from them. This paper is for people in AI-assisted formal methods and self-improvement loops for reasoning. A reader looking for ways to train models on verifiable tasks with less human input will get practical ideas and a clear baseline to build on. The work shows clear thinking about the stability problems in self-play and engages honestly with prior work like PSV. It deserves a serious referee. I would recommend sending it for peer review, with feedback focused on adding ablations and more rigorous reporting.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes ANCORA, a unified policy for open-ended curriculum self-play in verifiable reasoning. The policy alternates between a Proposer generating novel specifications and a Solver producing verified solutions, stabilized by a two-level group-relative advantage update, iterative self-distilled SFT to project onto the valid-output manifold, and a UCB-guided Curriculum DAG that expands under self-composition. The central empirical claim is that these mechanisms enable bootstrapping from zero human solutions, raising Dafny2Verus pass@1 from a 26.6% SFT baseline to 81.5% in 0-shot test-time training while outperforming PSV self-play by 15.8 points, with additional transfer gains on MBPP and HumanEval.

Significance. If the reported gains prove robust under controlled ablations and the three stabilizers demonstrably prevent Proposer collapse under sparse verifier feedback, the work would constitute a meaningful contribution to self-supervised curriculum learning for formal verification. The explicit framing of manifold-anchored self-play and the transfer results suggest potential broader impact on automated reasoning systems that can generate their own training distributions.

major comments (3)

[Experimental results section] The abstract and central claim rest on the assertion that the three mechanisms (two-level group-relative update, iterative self-distilled SFT, UCB-guided Curriculum DAG) are jointly sufficient to prevent Proposer collapse; however, the manuscript provides no ablation results isolating each component's contribution or showing collapse when any one is removed, leaving the load-bearing status of the stabilizers unverified.
[Curriculum DAG description] The claim that the Curriculum DAG 'can provably expand under self-composition' is presented as a key theoretical stabilizer, yet no theorem statement, proof sketch, or conditions for expansion appear in the provided text; without this, it is impossible to assess whether the UCB guidance actually guarantees non-collapse rather than relying on empirical tuning of the exploration coefficient.
[Results and baselines] Performance numbers (26.6% to 81.5% pass@1, 15.8-point gain over PSV) are reported without error bars, number of independent runs, exact data splits, or implementation details for the group sizes and UCB coefficient, undermining confidence that the gains are attributable to the proposed mechanisms rather than uncontrolled factors in the test-time training loop.

minor comments (2)

[Method] Notation for the two-level advantage update and the self-distillation projection step should be formalized with explicit equations to clarify how they couple Proposer and Solver updates without introducing hidden parameters.
[Transfer experiments] The transfer setting (Dafny2Verus seeds to MBPP/HumanEval) would benefit from a clearer description of the 0-shot vs. 1-shot inference protocols used for the PSV baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your thorough and constructive review. We appreciate the emphasis on verifying the load-bearing role of the proposed stabilizers and on improving reproducibility. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Experimental results section] The abstract and central claim rest on the assertion that the three mechanisms (two-level group-relative update, iterative self-distilled SFT, UCB-guided Curriculum DAG) are jointly sufficient to prevent Proposer collapse; however, the manuscript provides no ablation results isolating each component's contribution or showing collapse when any one is removed, leaving the load-bearing status of the stabilizers unverified.

Authors: We agree that isolating the contribution of each stabilizer is essential to substantiate the central claim. In the revised manuscript we will add a dedicated ablation subsection that removes each mechanism in turn (two-level group-relative update, iterative self-distilled SFT, and UCB-guided Curriculum DAG) while keeping all other factors fixed. We will report the resulting Proposer collapse rates and final pass@1 scores, thereby directly verifying the load-bearing status of the three components. revision: yes
Referee: [Curriculum DAG description] The claim that the Curriculum DAG 'can provably expand under self-composition' is presented as a key theoretical stabilizer, yet no theorem statement, proof sketch, or conditions for expansion appear in the provided text; without this, it is impossible to assess whether the UCB guidance actually guarantees non-collapse rather than relying on empirical tuning of the exploration coefficient.

Authors: The manuscript asserts that the policy-induced problem set can provably expand under self-composition, but we acknowledge that the formal statement and proof sketch were omitted from the main text. In the revision we will insert a new subsection containing (i) the precise theorem statement, (ii) a proof sketch establishing the expansion condition under self-composition, and (iii) the role of the UCB coefficient in guaranteeing non-collapse. This will clarify that the guarantee is not merely empirical. revision: yes
Referee: [Results and baselines] Performance numbers (26.6% to 81.5% pass@1, 15.8-point gain over PSV) are reported without error bars, number of independent runs, exact data splits, or implementation details for the group sizes and UCB coefficient, undermining confidence that the gains are attributable to the proposed mechanisms rather than uncontrolled factors in the test-time training loop.

Authors: We agree that additional statistical and implementation details are required for reproducibility. The revised manuscript will report (i) mean and standard deviation over at least five independent random seeds, (ii) the exact train/test splits used for Dafny2Verus, MBPP, and HumanEval, and (iii) all hyper-parameters including group sizes for the two-level advantage estimator and the precise UCB exploration coefficient. These additions will be placed in the experimental setup and results sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents ANCORA as an empirical self-play framework whose performance gains (26.6% to 81.5% pass@1) are reported as direct outcomes of the three described stabilizers rather than as quantities derived from fitted parameters or self-referential definitions. No equations appear in the provided text that equate a prediction to its own inputs by construction, and the central mechanisms (group-relative update, self-distilled SFT, UCB-guided DAG) are introduced as design choices whose sufficiency is asserted via experimental results, not via a load-bearing self-citation chain or uniqueness theorem imported from the authors' prior work. The claim that the Curriculum DAG 'can provably expand under self-composition' is stated at a high level without reducing to a prior fitted result or ansatz smuggled through citation. The derivation therefore remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The approach depends on several unstated RL hyperparameters and domain assumptions about verifier reliability and training stability that are not independently evidenced in the abstract.

free parameters (2)

UCB exploration coefficient
Controls problem selection in the Curriculum DAG; value not reported in abstract.
group sizes for relative advantage updates
Determines the two-level group-relative update; specific sizes not given.

axioms (2)

domain assumption The external verifier provides accurate and complete correctness signals for all generated solutions.
Required for the self-improvement loop to function without human oversight.
ad hoc to paper Iterative self-distilled SFT keeps the policy on the valid-output manifold without degrading solution quality.
Invoked to justify the projection step that prevents collapse.

invented entities (2)

Curriculum DAG no independent evidence
purpose: Organizes and expands the set of verifiable problems under self-composition.
New structure introduced to guide curriculum growth.
Manifold-anchored self-play no independent evidence
purpose: Prevents Proposer collapse by coupling updates and projecting onto valid outputs.
Core stabilizing concept not present in standard self-play.

pith-pipeline@v0.9.0 · 5556 in / 1609 out tokens · 98132 ms · 2026-05-08T03:03:22.699480+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages

[1]

Autoverus: Automated proof generation for rust code,

URLhttps://arxiv.org/abs/2508.03682. M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert,...

work page doi:10.1145/3763174 2021
[2]

Likelihood displacement: Biased repulsion gradients reduce πθ(y+) for correct outputs that share parameter-space structure with over-penalized negatives (LLD)

work page
[3]

Manifold erosion: Momentum drift during zero-gradient steps moves the policy away from the SFT manifold, reducingp M

work page
[4]

The expected sequencep (0) M , p(1) M ,

Starvation amplification: Reduced pM increases the fraction of zero-gradient groups (Theorem B.1), lengthening drift episodes and further amplifying distortion. The expected sequencep (0) M , p(1) M , . . .exhibits a monotone decreasing trend with attractorp ∗ = 0. 16 Sketch. Let pk =p (k) M , and let τk = 1/(N pk) be the expected inter-arrival of informa...

work page
[5]

SFT Warm-Up (Breaking Theorem B.1):Fine-tuning on expert demonstrations raises pM from near-zero to a level where N pM ≫1 , ensuring that most groups contain at least one valid sample for positive-advantage gradients

work page
[6]

This eliminates the most damaging source of directional misalignment

Pure Binary Rewards (Mitigating Theorem B.3):By using binary rewards R∈ {0,1} with filtering as accept/reject gating (rather than tiered heuristic scores), zero-hit groups produceexactly zerogradient rather than arbitrary off-manifold gradients. This eliminates the most damaging source of directional misalignment. Heuristic scores were found experi- menta...

work page 2025
[7]

We treat this as astructural hypothesisthat motivates the schedule rather than as a verified property of LLM RL

Smooth homotopy path (assumption-level).If Jguided and Jintrinsic are twice continuously differentiable and the Hessian ∇2 θJ(θ, λ) is non-singular along the path, the implicit function theorem produces a continuous curve θ∗(λ) from the guided to the intrinsic optimum. We treat this as astructural hypothesisthat motivates the schedule rather than as a ver...

work page
[8]

As λt →0 , continuity of ∇J in λ implies bt →0

Bias–variance decomposition.Writing θt+1 =θ t +η tˆgt, we decompose the stochastic gradient relative to the zero-shot target∇J 0(θt): ˆgt =∇J 0(θt)| {z } target +b t|{z} annealing bias +ξ t|{z} noise , where bt :=∇J λt(θt)− ∇J 0(θt) and ξt := ˆgt − ∇J λt(θt). As λt →0 , continuity of ∇J in λ implies bt →0 . Under bounded rewards and Lipschitz-smooth polic...

work page
[9]

prediction

ODE-tracking sketch (under stronger assumptions). Ifthe step sizes additionally satisfied the Robbins–Monro conditionsP t ηt =∞ ,P t η2 t <∞ (which AdamW doesnot, and our finite training horizon does not need to), the stochastic-approximation theory for slowly varying targets (e.g., Borkar 2008) would predict that θt tracks the ODE ˙θ=∇J 0(θ) and converge...

work page 2008
[10]

Parent core: Identify the core objective in plain language

work page
[11]

Direction target: Restate the requested direction (up/down)

work page
[12]

Solver-skill focus: Name the one reusable solver obligation the child should teach

work page
[13]

Minimal decomposition: For up, explain why no further decomposition is needed; for down, isolate the smallest blocker subproblem

work page
[14]

Stepping-stone rationale: Explain why this child is the right next step

work page
[15]

Next-step child choice: Formulate the minimal plan by choosing the best child

work page
[16]

Hard bans: - No rename-only mutation

Check: Verify the full reasoning chain for consistency. Hard bans: - No rename-only mutation. - No child whose extra property is already implied by the seed. - No child whose difficulty cannot be tied to a reusable S-Gen proof skill. E.2 Proposer Curriculum Template (Abridged) The curriculum template provides a seed specification and a direction (up or do...

work page
[17]

Longer responses are not necessarily pathological.In late training, some increase in response length reflects explicit invariant narration and postcondition checking rather than raw repetition

work page
[18]

spec recap + invariant-driven implementation + brief proof sketch

The dominant style is “spec recap + invariant-driven implementation + brief proof sketch.”This is consistent with the intended solver behavior in ANCORA, where reusable proof obligations are surfaced explicitly. At the same time, these examples also show that the model often producesover-explainedsolutions. The extra length typically comes from rephrasing...

work page 2008
[19]

Boundedness( Rt ∈[0,1] ): enforced by the two-gate construction (30); the three q_reward_modevariants (band_1of8,4p1p,exp_decay) each return values in[0,1]

work page
[20]

Solved-only admission(Lemma G.1): enforced by the best_ans guard in the trainer admission path, combined with thestrict_passrequirement of the solver-verification loop. 3.ϵ -packing(Lemma G.2): enforced by SeedPoolGate.check with the dedup thresh- olds specified in our main configuration ( max_pool_sim=0.72, max_sibling_sim=0.55, duplicate_struct_sim_thre...

work page
[21]

Empirically ˆτt =s_ability_ema evolves slowly (per-step changes <0.05 in the first 200 steps), giving an order-of-magnitude check ΥT ≲10 in our main run

Piecewise stationarity with variation budget ΥT : not verifiable from code alone; this is an assumption on thesolver learning dynamics, equivalent to requiring that policy updates do not induce unbounded drift between successive steps. Empirically ˆτt =s_ability_ema evolves slowly (per-step changes <0.05 in the first 200 steps), giving an order-of-magnitu...

work page
[22]

Hierarchical root/stay/child actions. mcts_select first chooses a root via _root_score (global denominator log(_total_visits+1)/(nt(q)+1)), then at each descent step compares _stay_score(q) against maxq′_child_score(q,q’). Stay and child scores useparent-local denominators log(nt(qparent)+1)/(nstay t (q)+1) and log(nt(qparent)+1)/(nedge t (qparent, q′)+1)...

work page
[23]

ucb_exploit_mode defaults to ratchet; our Main run and its Ratchet-4p1p variant both run ratchet

Default ratchet exploit. ucb_exploit_mode defaults to ratchet; our Main run and its Ratchet-4p1p variant both run ratchet. The Empirical-Band and Empirical-4p1p ablations override to empirical, and both crashed within 300 steps—see the negative ablation in §D.5, which also argues that the ratchet choice is the practical mechanism enforcing Assumption (4)

work page
[24]

Cold-start neighborhood prior. _node_value returns st(q)/nt(q) only on visited arms (nt(q)>0); on unvisited arms it falls back through a three-step neighborhood_pass_rate chain (own pass_rate→ mean child pass_rate→ global positive-pass-rate mean → constant 0.15). The surrogate abstracts this warm-start prior away; in the code it is what allows freshly ins...

work page
[25]

These areselection-timebonuses attached to Ut, distinct from the backed-up rewardR t of Equation (30)

Additive selection priors.Under empirical mode, _exploit returns ˆµt(q)+λZPD ZPD(q,ˆat); and _root_score/_child_score/_stay_score each further add a bounded kinship prior λkin ¯µ(KinN(q)). These areselection-timebonuses attached to Ut, distinct from the backed-up rewardR t of Equation (30). All three priors lie in[0,1]up to their (bounded) weights

work page
[26]

Both constrain the feasible action set; neither alters the arm space Xor the reward gate, so boundedness andϵ-packing remain intact

Adaptive difficulty window and root quota.Each step, an α=ucb_root_ratio fraction of prompts is reserved for root-level sampling (α= 0.50 in the main run, leaving the other 50% for DAG descent), and candidates are restricted to the difficulty window[ˆat−δ/2,ˆat+δ/2] around the solver-ability estimate ˆat. Both constrain the feasible action set; neither al...

work page
[27]

we were unable to find the license for the dataset we used

Batch repulsion and descend margin.Within-batch penalties discourage repeated root/edge/stay actions, and ucb_descend_margin biases the stay-vs-child comparison by a small additive constant. These alter selection probabilities inside a batch but preserve the one-step reward boundedness the regret proof uses. All six deviations are bounded admissible pertu...

work page
[28]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page

[1] [1]

Autoverus: Automated proof generation for rust code,

URLhttps://arxiv.org/abs/2508.03682. M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert,...

work page doi:10.1145/3763174 2021

[2] [2]

Likelihood displacement: Biased repulsion gradients reduce πθ(y+) for correct outputs that share parameter-space structure with over-penalized negatives (LLD)

work page

[3] [3]

Manifold erosion: Momentum drift during zero-gradient steps moves the policy away from the SFT manifold, reducingp M

work page

[4] [4]

The expected sequencep (0) M , p(1) M ,

Starvation amplification: Reduced pM increases the fraction of zero-gradient groups (Theorem B.1), lengthening drift episodes and further amplifying distortion. The expected sequencep (0) M , p(1) M , . . .exhibits a monotone decreasing trend with attractorp ∗ = 0. 16 Sketch. Let pk =p (k) M , and let τk = 1/(N pk) be the expected inter-arrival of informa...

work page

[5] [5]

SFT Warm-Up (Breaking Theorem B.1):Fine-tuning on expert demonstrations raises pM from near-zero to a level where N pM ≫1 , ensuring that most groups contain at least one valid sample for positive-advantage gradients

work page

[6] [6]

This eliminates the most damaging source of directional misalignment

Pure Binary Rewards (Mitigating Theorem B.3):By using binary rewards R∈ {0,1} with filtering as accept/reject gating (rather than tiered heuristic scores), zero-hit groups produceexactly zerogradient rather than arbitrary off-manifold gradients. This eliminates the most damaging source of directional misalignment. Heuristic scores were found experi- menta...

work page 2025

[7] [7]

We treat this as astructural hypothesisthat motivates the schedule rather than as a verified property of LLM RL

Smooth homotopy path (assumption-level).If Jguided and Jintrinsic are twice continuously differentiable and the Hessian ∇2 θJ(θ, λ) is non-singular along the path, the implicit function theorem produces a continuous curve θ∗(λ) from the guided to the intrinsic optimum. We treat this as astructural hypothesisthat motivates the schedule rather than as a ver...

work page

[8] [8]

As λt →0 , continuity of ∇J in λ implies bt →0

Bias–variance decomposition.Writing θt+1 =θ t +η tˆgt, we decompose the stochastic gradient relative to the zero-shot target∇J 0(θt): ˆgt =∇J 0(θt)| {z } target +b t|{z} annealing bias +ξ t|{z} noise , where bt :=∇J λt(θt)− ∇J 0(θt) and ξt := ˆgt − ∇J λt(θt). As λt →0 , continuity of ∇J in λ implies bt →0 . Under bounded rewards and Lipschitz-smooth polic...

work page

[9] [9]

prediction

ODE-tracking sketch (under stronger assumptions). Ifthe step sizes additionally satisfied the Robbins–Monro conditionsP t ηt =∞ ,P t η2 t <∞ (which AdamW doesnot, and our finite training horizon does not need to), the stochastic-approximation theory for slowly varying targets (e.g., Borkar 2008) would predict that θt tracks the ODE ˙θ=∇J 0(θ) and converge...

work page 2008

[10] [10]

Parent core: Identify the core objective in plain language

work page

[11] [11]

Direction target: Restate the requested direction (up/down)

work page

[12] [12]

Solver-skill focus: Name the one reusable solver obligation the child should teach

work page

[13] [13]

Minimal decomposition: For up, explain why no further decomposition is needed; for down, isolate the smallest blocker subproblem

work page

[14] [14]

Stepping-stone rationale: Explain why this child is the right next step

work page

[15] [15]

Next-step child choice: Formulate the minimal plan by choosing the best child

work page

[16] [16]

Hard bans: - No rename-only mutation

Check: Verify the full reasoning chain for consistency. Hard bans: - No rename-only mutation. - No child whose extra property is already implied by the seed. - No child whose difficulty cannot be tied to a reusable S-Gen proof skill. E.2 Proposer Curriculum Template (Abridged) The curriculum template provides a seed specification and a direction (up or do...

work page

[17] [17]

Longer responses are not necessarily pathological.In late training, some increase in response length reflects explicit invariant narration and postcondition checking rather than raw repetition

work page

[18] [18]

spec recap + invariant-driven implementation + brief proof sketch

The dominant style is “spec recap + invariant-driven implementation + brief proof sketch.”This is consistent with the intended solver behavior in ANCORA, where reusable proof obligations are surfaced explicitly. At the same time, these examples also show that the model often producesover-explainedsolutions. The extra length typically comes from rephrasing...

work page 2008

[19] [19]

Boundedness( Rt ∈[0,1] ): enforced by the two-gate construction (30); the three q_reward_modevariants (band_1of8,4p1p,exp_decay) each return values in[0,1]

work page

[20] [20]

Solved-only admission(Lemma G.1): enforced by the best_ans guard in the trainer admission path, combined with thestrict_passrequirement of the solver-verification loop. 3.ϵ -packing(Lemma G.2): enforced by SeedPoolGate.check with the dedup thresh- olds specified in our main configuration ( max_pool_sim=0.72, max_sibling_sim=0.55, duplicate_struct_sim_thre...

work page

[21] [21]

Empirically ˆτt =s_ability_ema evolves slowly (per-step changes <0.05 in the first 200 steps), giving an order-of-magnitude check ΥT ≲10 in our main run

Piecewise stationarity with variation budget ΥT : not verifiable from code alone; this is an assumption on thesolver learning dynamics, equivalent to requiring that policy updates do not induce unbounded drift between successive steps. Empirically ˆτt =s_ability_ema evolves slowly (per-step changes <0.05 in the first 200 steps), giving an order-of-magnitu...

work page

[22] [22]

Hierarchical root/stay/child actions. mcts_select first chooses a root via _root_score (global denominator log(_total_visits+1)/(nt(q)+1)), then at each descent step compares _stay_score(q) against maxq′_child_score(q,q’). Stay and child scores useparent-local denominators log(nt(qparent)+1)/(nstay t (q)+1) and log(nt(qparent)+1)/(nedge t (qparent, q′)+1)...

work page

[23] [23]

ucb_exploit_mode defaults to ratchet; our Main run and its Ratchet-4p1p variant both run ratchet

Default ratchet exploit. ucb_exploit_mode defaults to ratchet; our Main run and its Ratchet-4p1p variant both run ratchet. The Empirical-Band and Empirical-4p1p ablations override to empirical, and both crashed within 300 steps—see the negative ablation in §D.5, which also argues that the ratchet choice is the practical mechanism enforcing Assumption (4)

work page

[24] [24]

Cold-start neighborhood prior. _node_value returns st(q)/nt(q) only on visited arms (nt(q)>0); on unvisited arms it falls back through a three-step neighborhood_pass_rate chain (own pass_rate→ mean child pass_rate→ global positive-pass-rate mean → constant 0.15). The surrogate abstracts this warm-start prior away; in the code it is what allows freshly ins...

work page

[25] [25]

These areselection-timebonuses attached to Ut, distinct from the backed-up rewardR t of Equation (30)

Additive selection priors.Under empirical mode, _exploit returns ˆµt(q)+λZPD ZPD(q,ˆat); and _root_score/_child_score/_stay_score each further add a bounded kinship prior λkin ¯µ(KinN(q)). These areselection-timebonuses attached to Ut, distinct from the backed-up rewardR t of Equation (30). All three priors lie in[0,1]up to their (bounded) weights

work page

[26] [26]

Both constrain the feasible action set; neither alters the arm space Xor the reward gate, so boundedness andϵ-packing remain intact

Adaptive difficulty window and root quota.Each step, an α=ucb_root_ratio fraction of prompts is reserved for root-level sampling (α= 0.50 in the main run, leaving the other 50% for DAG descent), and candidates are restricted to the difficulty window[ˆat−δ/2,ˆat+δ/2] around the solver-ability estimate ˆat. Both constrain the feasible action set; neither al...

work page

[27] [27]

we were unable to find the license for the dataset we used

Batch repulsion and descend margin.Within-batch penalties discourage repeated root/edge/stay actions, and ucb_descend_margin biases the stay-vs-child comparison by a small additive constant. These alter selection probabilities inside a batch but preserve the one-step reward boundedness the regret proof uses. All six deviations are bounded admissible pertu...

work page

[28] [28]

Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page