Can Large Language Models Reinvent Foundational Algorithms?
Pith reviewed 2026-05-10 18:54 UTC · model grok-4.3
The pith
Large language models can reinvent foundational algorithms after targeted unlearning removes them from their knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After unlearning ten foundational algorithms from three open-weight models, the best model reinvents 50 percent of them with no hints, 70 percent with level-one hints, and 90 percent with level-two hints. High-level hints help but fail for complex cases, while test-time reinforcement learning succeeds for Strassen's algorithm at the highest hint level. A generative verifier sustains reasoning and prevents thought collapse.
What carries the argument
The Unlearn-and-Reinvent pipeline, which applies GRPO-based on-policy unlearning to erase a target algorithm and then prompts the model to reinvent it, with a generative verifier maintaining reasoning quality throughout.
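The two-phase control flow can be sketched as a toy, self-contained loop. Everything below is an illustrative stand-in, not the paper's actual components: `unlearn` mimics knowledge removal on a plain dict, and `generative_verifier` is a trivial placeholder for the LLM judge; only the structure (erase the target, then verifier-guided reinvention attempts) mirrors the pipeline description.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    solved: bool          # candidate passes the task's hidden tests
    reasoning_ok: bool    # trace judged coherent (no "thought collapse")

def unlearn(knowledge: dict, target: str) -> dict:
    """Stand-in for GRPO-based on-policy unlearning: drop the target
    algorithm while leaving all other knowledge intact."""
    return {k: v for k, v in knowledge.items() if k != target}

def generative_verifier(candidate: str) -> Verdict:
    """Stand-in verifier; in the paper this is an LLM judge scoring attempts."""
    return Verdict(solved=(candidate == "working-algorithm"), reasoning_ok=True)

def reinvent(candidates, max_attempts=4):
    """Verifier-in-the-loop search over reinvention attempts."""
    for candidate in candidates[:max_attempts]:
        verdict = generative_verifier(candidate)
        if verdict.solved and verdict.reasoning_ok:
            return candidate
    return None  # reinvention failed within the attempt budget

knowledge = unlearn({"dijkstra": "...", "sorting": "..."}, target="dijkstra")
assert "dijkstra" not in knowledge                 # phase 1: target erased
print(reinvent(["broken-attempt", "working-algorithm"]))  # phase 2
```

The verifier gates every attempt on both correctness and reasoning quality, which is the role the paper credits with preventing thought collapse.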
If this is right
- Success rates increase with more hints, indicating models can build on partial guidance to reach correct algorithms.
- The generative verifier is essential for avoiding reasoning failures during the reinvention process.
- Test-time reinforcement learning can overcome barriers for algorithms that standard prompting cannot.
- Limits persist for complicated algorithms even when step-by-step hints are provided.
- These results apply across multiple models and algorithms, showing a general pattern in current LLMs.
Where Pith is reading between the lines
- If unlearning proves incomplete, the reinvention success may partly reflect residual knowledge rather than pure rediscovery.
- Extending this method to test invention of entirely new algorithms could clarify the boundary between recall and creation.
- Improvements in unlearning techniques might raise the no-hint success rate in future models.
- Similar pipelines could assess LLM capacity for innovation in scientific domains beyond algorithms.
Load-bearing premise
The unlearning method fully removes the specific algorithmic knowledge from the model's parameters while preserving general reasoning and code generation skills.
What would settle it
Demonstrating that the model outputs the original algorithm's code or solution steps immediately after unlearning when given the problem statement would indicate the unlearning failed to erase the knowledge.
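That settling test can be sketched as a simple leakage probe on the unlearned model's raw output, in the spirit of the paper's own `know_or_not` rubric (which flags mentions of "Dijkstra", "A*", or priority-queue reconstructions). The keyword lists here are illustrative only; a real check would also execute the generated code and compare its behavior to the reference algorithm.

```python
# Illustrative leakage probe: does the post-unlearning output still name or
# reconstruct the erased algorithm? Signal lists are hypothetical examples.
LEAK_SIGNALS = {
    "dijkstra": ["dijkstra", "priority queue", "min-heap", "relaxation"],
}

def knowledge_leaked(algorithm: str, model_output: str) -> bool:
    """True if the output shows traces of the target algorithm, i.e.
    unlearning likely failed and later success may be residual recall."""
    text = model_output.lower()
    return any(signal in text for signal in LEAK_SIGNALS[algorithm])

print(knowledge_leaked("dijkstra", "Use a priority queue and relax edges"))  # True
print(knowledge_leaked("dijkstra", "Try all paths exhaustively"))            # False
```

Running this probe immediately after unlearning, with the bare problem statement and no hints, is the baseline the load-bearing premise needs.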
Original abstract
LLMs have shown strong potential to advance scientific discovery. Whether they possess the capacity for foundational innovation, however, remains an open question. In this work, we focus on a prerequisite for foundational innovation: can LLMs reinvent foundational algorithms in computer science? Our Unlearn-and-Reinvent pipeline applies LLM unlearning to remove a specific foundational algorithm, such as Dijkstra's or Euclid's algorithm, from an LLM's pretrained knowledge, and then tests whether the model can reinvent it in a controlled environment. To enable effective unlearning, we adopt a GRPO-based, on-policy unlearning method. Across 10 target algorithms, 3 strong open-weight models, and 3 hint levels, our experiments demonstrate that (1) the strongest model, Qwen3-4B-Thinking-2507, successfully reinvents 50% of the algorithms with no hint, 70% at hint level 1, and 90% at hint level 2; (2) a few high-level hints can enhance the reinvention success rate, but even step-by-step hints fail for the most complicated algorithms; and (3) test-time reinforcement learning enables successful reinvention for the Strassen algorithm at hint level 2. Through analyses of output trajectories and ablation studies, we find that the generative verifier in the reinvention phase plays a critical role in sustaining models' reasoning strength, helping to avoid the "thought collapse" phenomenon. These findings offer insights into both the potential and current limits of LLMs' innovative thinking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an 'Unlearn-and-Reinvent' pipeline that applies GRPO-based on-policy unlearning to erase specific foundational algorithms (Dijkstra, Euclid, Strassen, etc.) from LLMs, then evaluates whether the models can rediscover them in a controlled setting. Experiments span 10 target algorithms, 3 open-weight models, and 3 hint levels; the strongest model (Qwen3-4B-Thinking-2507) is reported to achieve 50% success with no hints, 70% at hint level 1, and 90% at hint level 2. Additional results highlight the utility of high-level hints, test-time RL for complex cases, and the role of a generative verifier in preventing 'thought collapse'.
Significance. If the unlearning step is shown to be effective and the observed successes reflect genuine reinvention rather than residual recall, the work would provide concrete evidence that current LLMs can rediscover foundational CS algorithms, with implications for AI-driven scientific discovery. The multi-model, multi-hint design and identification of verifier importance are strengths that could be built upon. However, the absence of direct post-unlearning verification substantially limits the interpretability of the quantitative claims.
major comments (3)
- [Abstract and Unlearn-and-Reinvent pipeline] Abstract and pipeline description: the central quantitative claims (50%/70%/90% reinvention rates for Qwen3-4B-Thinking-2507) rest on the premise that GRPO unlearning has removed the target algorithmic knowledge, yet no post-unlearning evaluation is reported in which the model is directly prompted to implement the algorithms and shown to fail before any reinvention phase or hints are provided. Without this baseline, the success rates cannot be unambiguously attributed to reinvention rather than incomplete erasure or partial recall.
- [Abstract] Abstract: success percentages are stated without accompanying details on the number of independent trials per algorithm, statistical significance tests, variance across runs, or controls for partial recall, which are required to support the cross-hint-level and cross-model comparisons as robust evidence.
- [Experimental results (Strassen case)] Findings on test-time RL: the statement that test-time reinforcement learning enables successful reinvention for the Strassen algorithm at hint level 2 lacks an ablation comparing RL-augmented runs against identical hint conditions without RL, making it impossible to isolate whether RL is the decisive factor or whether the base model already succeeds under those hints.
minor comments (2)
- [Abstract] The term 'thought collapse' is used in the abstract and findings but is not defined on first use; a concise definition or pointer to its introduction in the main text would improve readability.
- [Abstract] The abstract lists three numbered findings but does not indicate the total number of algorithms or models in the parenthetical summary, which would help readers quickly gauge experimental scale.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive suggestions. We agree that additional verification is needed to strengthen the claims about unlearning effectiveness and the specific contributions of test-time RL. Below, we provide point-by-point responses to the major comments. We will incorporate revisions to address these points in the updated manuscript.
Point-by-point responses
-
Referee: [Abstract and Unlearn-and-Reinvent pipeline] Abstract and pipeline description: the central quantitative claims (50%/70%/90% reinvention rates for Qwen3-4B-Thinking-2507) rest on the premise that GRPO unlearning has removed the target algorithmic knowledge, yet no post-unlearning evaluation is reported in which the model is directly prompted to implement the algorithms and shown to fail before any reinvention phase or hints are provided. Without this baseline, the success rates cannot be unambiguously attributed to reinvention rather than incomplete erasure or partial recall.
Authors: We fully agree that a direct post-unlearning evaluation is essential to confirm that the target knowledge has been effectively erased and to attribute reinvention successes to the model's ability to rediscover the algorithms rather than residual knowledge. In the original manuscript, we assumed the effectiveness of the GRPO-based unlearning based on prior work and the observed performance degradation with no hints. To address this concern rigorously, we will add a new subsection in the experimental results detailing the post-unlearning performance, where models are prompted to implement each algorithm without any hints or reinvention steps, demonstrating high failure rates across the 10 algorithms. This will provide the necessary baseline and enhance the interpretability of our quantitative claims. revision: yes
-
Referee: [Abstract] Abstract: success percentages are stated without accompanying details on the number of independent trials per algorithm, statistical significance tests, variance across runs, or controls for partial recall, which are required to support the cross-hint-level and cross-model comparisons as robust evidence.
Authors: We acknowledge that the abstract does not include details on the number of independent trials, statistical significance tests, variance across runs, or controls for partial recall. These aspects are partially addressed in the experimental section of the manuscript through repeated evaluations, but we agree they require more explicit reporting to support the comparisons. In the revision, we will include the number of trials conducted, report variance, perform and report statistical significance tests, and add controls for partial recall (such as checking output similarity to known algorithms). This will make the evidence more robust. revision: yes
-
Referee: [Experimental results (Strassen case)] Findings on test-time RL: the statement that test-time reinforcement learning enables successful reinvention for the Strassen algorithm at hint level 2 lacks an ablation comparing RL-augmented runs against identical hint conditions without RL, making it impossible to isolate whether RL is the decisive factor or whether the base model already succeeds under those hints.
Authors: We appreciate this observation, as isolating the contribution of test-time RL is important for the claim. In our setup, the Strassen algorithm is particularly complex, and preliminary checks indicated that even at hint level 2 without RL, the model failed to produce correct implementations. The RL component, using a generative verifier, was introduced to guide the search. To provide clearer evidence, we will include an explicit ablation in the revised manuscript, reporting success rates for Strassen at hint level 2 with and without the RL augmentation under identical conditions. This will confirm that RL is indeed the key enabler for this case. revision: yes
Circularity Check
No circularity: purely empirical study with external ground truth
Full rationale
The paper describes an experimental Unlearn-and-Reinvent pipeline that applies GRPO unlearning to remove specific algorithms from LLMs and then measures reinvention success rates against known external algorithms (Dijkstra, Euclid, Strassen, etc.). No derivation chain, equations, or first-principles claims exist that could reduce to inputs by construction. All reported outcomes (50%/70%/90% success at different hint levels) are direct experimental measurements, not predictions derived from fitted parameters or self-referential definitions. No load-bearing self-citations or ansatzes are invoked to justify core results.
Axiom & Free-Parameter Ledger
free parameters (1)
- hint level (no hint, level 1, or level 2)
axioms (1)
- Domain assumption: GRPO-based on-policy unlearning removes specific algorithmic knowledge from the pretrained model.