SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

Dianbo Liu; Nitin Vetcha

arxiv: 2605.20189 · v1 · pith:E2MF3PIZnew · submitted 2026-03-23 · 💻 cs.AI · cs.LG

Nitin Vetcha, Dianbo Liu

Pith reviewed 2026-05-21 11:16 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords autonomous agents · lifelong learning · continual adaptation · reinforcement learning · meta-learning · large language models · test-time adaptation

The pith

SOLAR lets an autonomous agent discover its own adaptation strategies by treating model weights as an environment for multi-level reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SOLAR as a method for large language models to handle streaming data and concept drift without relying on traditional fine-tuning or manual curation. It starts with a consolidated prior on common-sense knowledge and then applies multi-level reinforcement learning so the agent can explore and select modification strategies on its own parameters. The system keeps an evolving knowledge base that serves as episodic memory to retain what has already been learned while allowing new adaptations. A sympathetic reader would care because this setup aims to produce agents that improve over time in changing real-world conditions rather than requiring repeated human-guided retraining.

Core claim

SOLAR initiates with a strong prior over common-sense knowledge and then uses a multi-level reinforcement learning approach to autonomously discover adaptation strategies. It maintains an evolving knowledge base of valid modification strategies that implicitly acts as an episodic memory buffer, balancing plasticity for new tasks with stability for retained meta-knowledge. This enables efficient test-time adaptation to unseen domains while avoiding catastrophic forgetting.

What carries the argument

Multi-level reinforcement learning applied to model weights treated as an explorable environment, together with an evolving knowledge base of valid modification strategies.

If this is right

Enables efficient test-time adaptation to unseen domains without gradient-based retraining.
Outperforms strong baselines on common-sense, mathematical, medical, coding, social, and logical reasoning tasks.
Maintains balance between plasticity for new tasks and stability for prior meta-knowledge.
Supports open-ended autonomous agents capable of lifelong adaptation in evolving environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could lower the human effort needed to keep deployed models current as data streams change.
Similar self-optimization loops might be tested on non-language models to check if the same weight-exploration approach transfers.
Longer task sequences would show whether the knowledge base continues to grow without becoming unwieldy.

Load-bearing premise

Treating model weights as an environment that multi-level reinforcement learning can reliably explore will produce modification strategies that generalize across domains without causing instability or collapse.

What would settle it

Running SOLAR through a sequence of new domains and checking whether performance on the original tasks remains stable or degrades after each adaptation cycle.
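A minimal sketch of how such a check could be scripted is shown below. It is not taken from the paper: `adapt` and `evaluate` are hypothetical callables standing in for one SOLAR adaptation cycle and for a benchmark scorer, and the toy agent at the bottom exists only to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

State = TypeVar("State")


def retention_trace(
    state: State,
    domains: Sequence[str],
    original_tasks: Sequence[str],
    adapt: Callable[[State, str], State],
    evaluate: Callable[[State, str], float],
) -> List[Dict[str, float]]:
    """Score the original tasks before adaptation and after every adaptation cycle."""
    trace = [{task: evaluate(state, task) for task in original_tasks}]
    for domain in domains:
        state = adapt(state, domain)  # hypothetical: one adaptation cycle on a new domain
        trace.append({task: evaluate(state, task) for task in original_tasks})
    return trace


def max_forgetting(trace: List[Dict[str, float]]) -> Dict[str, float]:
    """Largest drop below the pre-adaptation score, per original task (0.0 if none)."""
    baseline = trace[0]
    return {
        task: max(0.0, max((baseline[task] - step[task] for step in trace[1:]), default=0.0))
        for task in baseline
    }


# Toy stand-ins (assumptions for illustration only): the "agent" is a dict of
# per-task scores, and adapting to "medical" costs 0.1 on "commonsense".
def toy_adapt(state: Dict[str, float], domain: str) -> Dict[str, float]:
    new_state = dict(state)
    new_state[domain] = 0.8
    if domain == "medical":
        new_state["commonsense"] -= 0.1
    return new_state


def toy_evaluate(state: Dict[str, float], task: str) -> float:
    return state.get(task, 0.0)


trace = retention_trace(
    {"commonsense": 0.9}, ["math", "medical"], ["commonsense"], toy_adapt, toy_evaluate
)
forgetting = max_forgetting(trace)
```

A flat trace, with `max_forgetting` near zero for every original task, would support the stability claim. A drop that grows with each cycle would count against it.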

Figures

Figures reproduced from arXiv: 2605.20189 by Dianbo Liu, Nitin Vetcha.

**Figure 1.** SOLAR’s methodology of weight-level meta-knowledge discovery and modification summarized (adapted from [34]). Adjacent text (5. Implementation, 5.1. Architecture): The primary architectural detail in SOLAR’s framework is the design of the weight-space exploration initializer. As mentioned in Section 4, we use a convolution-based decoder model for this purpose. We assume that we have access to either the unseen task’s description…

**Figure 2.** Details of the Parameter Tokenization Process. Adjacent text: These convolutions are divided into three categories: (i) width convolution, which operates on the (𝐶, 𝐿) dimension; (ii) height convolution, which operates on the (𝐿, 𝑁) dimension; and (iii) layer-wise convolution, which operates on the (𝑁, 𝐿) dimension, with notations Conv𝑊, Conv𝐻, and Conv𝐿. Each layer consists of two Conv𝑊, two Conv𝐻, and one Conv𝐿. Given this, the forward operation of t…

**Figure 3.** Details of the Hyper-Convolutional Decoder Architecture used. Adjacent text: Subsequently, prompt-checkpoint pairing is done as follows. Given a dataset 𝑃, it is first divided into non-overlapping prompt batches [𝑝_1, ..., 𝑝_𝑖, ..., 𝑝_𝐼]. Denote the trained LLM checkpoints of this dataset as 𝑀 = [𝑚_1, ..., 𝑚_𝑗, ..., 𝑚_𝐽]. Then a batch of prompts and a corresponding checkpoint are picked at random to create a pa…

**Figure 4.** Router Approach for TTS. Adjacent text: …which can take one of five values: avg_sim_score, avg_prompt_embed, max_confidence, majority_vote, or sum_logprobs (summing log probabilities); the former two belong to the router approach and the latter three constitute the ensemble approach. • For LS, we use [4], and the corresponding JSON object has the fields times and learning_rate. (6. Experiments, 6.1. Setup) As described in Section 5…

**Figure 5.** Details of the Prompt Selection Strategy used in Ablation Study. Adjacent text: Finally, greedy graph search is done to select the final prompt subset 𝑆. For this, start with 𝑆 = ∅ and at each round pick 𝑣* = arg max_{𝑣 ∉ 𝑆} 𝑓_𝒢(𝑣); 𝑣* is then added to 𝑆, and diversity penalties are updated only for the neighbors of 𝑣*. This process continues until |𝑆| reaches the target size, which in our case is 128. Fortunately, the influe…
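The greedy selection step quoted in the Figure 5 excerpt can be sketched roughly as follows. The excerpt does not give the form of the score 𝑓_𝒢 or of the diversity penalty, so the additive `penalty`, the `base_score` dictionary, and the function name `greedy_select` are assumptions made purely for illustration. Only the loop structure follows the excerpt: pick the best unselected node, add it to 𝑆, penalize only its neighbors, and stop at the target size.

```python
from typing import Dict, List, Set


def greedy_select(
    base_score: Dict[str, float],
    neighbors: Dict[str, Set[str]],
    target_size: int,
    penalty: float = 0.5,
) -> List[str]:
    """Greedily pick the highest-scoring node, then penalize only its neighbors."""
    score = dict(base_score)  # current score of every node (assumed form of f_G)
    selected: List[str] = []
    remaining = set(score)
    while remaining and len(selected) < target_size:
        # v* = argmax over nodes not yet selected; sorting makes tie-breaking deterministic
        best = max(sorted(remaining), key=lambda v: score[v])
        selected.append(best)
        remaining.remove(best)
        # Diversity penalty (assumed additive) is updated only for neighbors of v*
        for u in neighbors.get(best, set()):
            if u in remaining:
                score[u] -= penalty
    return selected


# Toy graph: "a" and "b" are near-duplicates, as are "c" and "d".
base = {"a": 1.0, "b": 0.9, "c": 0.5, "d": 0.45}
graph = {"a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c"}}
picked = greedy_select(base, graph, target_size=3)
```

In the toy graph the second pick is "c" rather than the higher-scoring "b", because "b" was penalized as a neighbor of "a". Per the excerpt, the paper's own target size is 128.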

Original abstract

Despite the remarkable success of large language models (LLMs), they still face bottlenecks while deploying in dynamic, real-world settings with primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without resulting in catastrophic forgetting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR) which is an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It initiates the process by consolidating a strong prior over common-sense knowledge making it effective for transfer-learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SOLAR combines meta-learning on weights with multi-level RL and a knowledge base for LLM continual adaptation, but the abstract leaves the core mechanisms and safeguards too vague to evaluate the claims.

The paper's main pitch is an agent that treats LLM weights as an RL environment, uses multi-level reinforcement learning to find adaptation strategies on the fly, and keeps an evolving store of successful modifications to handle new domains without full retraining. It reports better results than baselines across common-sense, math, medical, coding, social, and logical tasks while trying to balance plasticity and stability in streaming settings. That framing pulls together existing pieces (parameter meta-learning, RL for strategy search, and episodic memory) into a single lifelong-agent setup that, as far as the abstract indicates, has not been described in exactly this way before. The practical target is real: concept drift and expensive gradient updates are genuine deployment headaches for LLMs in changing environments, and any workable autonomous fix would matter for the subfield. The abstract does a clean job of stating the problem and the intended solution without overclaiming civilizational impact.

The soft spots are mostly about missing substance. No equations appear for the RL policy, the reward (accuracy plus stability?), or the knowledge-base update rule. The outperformance is asserted without error bars, ablations, or data details, so it is impossible to tell whether the multi-level RL actually discovers generalizable strategies or just fits the test distribution. The stress-test concern about instability in raw weight space lands because nothing shown constrains the action space to valid states or prevents interference; the knowledge base is supposed to help, but without the update mechanics it is hard to see how it avoids manual curation or collapse. If the full paper has reproducible code or formal checks on those points, that would change the picture, but the abstract alone does not.

This is the kind of work that belongs in a reading group for people building continual-learning agents, mainly for the idea of weight-space exploration plus memory. It is not ready to cite yet because the evidence is still promissory. A serious editor should send it to referees so the authors can supply the missing methods, controls, and stability analysis; the problem is worth the effort even if the current version needs heavy revision.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SOLAR, a Self-Optimizing Lifelong Autonomous Reasoner, which is an open-ended autonomous agent that leverages parameter-level meta-learning by treating model weights as an environment for exploration. It uses multi-level reinforcement learning to autonomously discover adaptation strategies and maintains an evolving knowledge base to balance plasticity and stability. The paper claims that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social, and logical reasoning tasks.

Significance. If the experimental results and the underlying mechanisms are rigorously demonstrated with full implementation details, this work could have high significance for the field of continual learning and autonomous agents, as it addresses key challenges like concept drift and catastrophic forgetting in dynamic environments without relying on gradient-based adaptation or extensive manual curation. The approach of treating weights as an RL environment and using an evolving knowledge base as episodic memory is novel if shown to be stable and generalizable.

major comments (2)

[Abstract] The claim that SOLAR 'outperforms strong baselines' on six reasoning domains is stated without accompanying methods, data details, error bars, ablation results, or statistical tests; that evidence is load-bearing for the central claim of autonomous strategy discovery via multi-level RL.
[Methods] RL framework description: No equations or pseudocode are provided for the multi-level RL policy, action space over model weights, reward function (e.g., validation accuracy plus stability term), or knowledge-base update rule, leaving open whether the method reliably constrains modifications to valid states and avoids instability or catastrophic interference.

minor comments (1)

[Abstract] The abstract uses several acronyms (LLM, FT, SOLAR) without initial expansion on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly to improve clarity, reproducibility, and support for our claims.

Referee: [Abstract] The claim that SOLAR 'outperforms strong baselines' on six reasoning domains is stated without accompanying methods, data details, error bars, ablation results, or statistical tests; that evidence is load-bearing for the central claim of autonomous strategy discovery via multi-level RL.

Authors: We agree that the abstract's performance claim would benefit from additional context. In the revised manuscript, we have updated the abstract to briefly reference the evaluation across the six reasoning domains using standard benchmarks, along with a note that results include error bars, ablations, and statistical tests. Full experimental details, data descriptions, and analyses remain in the Experiments section, where we have added the requested elements to strengthen the presentation of the central claim.

Revision: yes

Referee: [Methods] RL framework description: No equations or pseudocode are provided for the multi-level RL policy, action space over model weights, reward function (e.g., validation accuracy plus stability term), or knowledge-base update rule, leaving open whether the method reliably constrains modifications to valid states and avoids instability or catastrophic interference.

Authors: We acknowledge that the original submission lacked formal descriptions of the RL components. The revised manuscript now includes equations for the multi-level RL policy, the action space defined over model weight modifications, the reward function (task accuracy combined with a stability term), and the knowledge-base update rule. We have also added pseudocode for the overall SOLAR procedure in the Methods section. These additions clarify the constraints on state transitions and the mechanisms for maintaining stability.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces SOLAR as a novel agent architecture that treats model weights as an RL environment and uses multi-level reinforcement learning plus an evolving knowledge base for lifelong adaptation. No mathematical derivations, equations, or self-referential definitions appear in the abstract or method description. Performance claims rest on experimental comparisons across reasoning domains rather than any reduction of outputs to fitted inputs or self-citations by construction. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5758 in / 1195 out tokens · 25766 ms · 2026-05-21T11:16:47.592294+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies... maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity and stability."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Level III is a significantly challenging aspect... letting LLMs to explore the hypothesis space in its entirety"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 22 internal anchors

[1]

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020)

[2]

W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, Advances in neural information processing systems 29 (2016)

[3]

J. Hu, Z. Zhang, G. Chen, X. Wen, C. Shuai, W. Luo, B. Xiao, Y. Li, M. Tan, Test-time learning for large language models, 2025. URL: https://arxiv.org/abs/2505.20633. arXiv:2505.20633

[4]

Y. Hu, X. Zhang, X. Fang, Z. Chen, X. Wang, H. Zhang, G. Qi, Slot: Sample-specific language model optimization at test-time, 2025. URL: https://arxiv.org/abs/2505.12392. arXiv:2505.12392

[5]

Y. Zuo, K. Zhang, L. Sheng, S. Qu, G. Cui, X. Zhu, H. Li, Y. Zhang, X. Long, E. Hua, B. Qi, Y. Sun, Z. Ma, L. Yuan, N. Ding, B. Zhou, Ttrl: Test-time reinforcement learning, 2025. URL: https://arxiv.org/abs/2504.16084. arXiv:2504.16084

[6]

M. M. Moradi, H. Amer, S. Mudur, W. Zhang, Y. Liu, W. Ahmed, Continuous self-improvement of large language models by test-time training with verifier-driven sample selection, 2025. URL: https://arxiv.org/abs/2505.19475. arXiv:2505.19475

[7]

H. Lee, S. Oh, J. Kim, J. Shin, J. Tack, Revise: Learning to refine at test-time via intrinsic self-verification, 2025. URL: https://arxiv.org/abs/2502.14565. arXiv:2502.14565

[8]

J. Hübotter, L. Diaz-Bone, I. Hakimi, A. Krause, M. Hardt, Learning on the job: Test-time curricula for targeted reinforcement learning, 2025. URL: https://arxiv.org/abs/2510.04786. arXiv:2510.04786

[9]

R. Bertolissi, J. Hübotter, I. Hakimi, A. Krause, Local mixtures of experts: Essentially free test-time training via model merging, 2025. URL: https://arxiv.org/abs/2505.14136. arXiv:2505.14136

[10]

Z. Yang, N. Band, S. Li, E. Candès, T. Hashimoto, Synthetic continued pretraining, 2024. URL: https://arxiv.org/abs/2409.07431. arXiv:2409.07431

[11]

Y. Wang, X. Liu, X. Chen, S. O’Brien, J. Wu, J. McAuley, Self-updatable large language models by integrating context into model parameters, 2025. URL: https://arxiv.org/abs/2410.00487. arXiv:2410.00487

[12]

R. Wang, P. Ping, Z. Guo, X. Zhang, Q. Shi, L. Zhou, T. Ji, Loki: Low-damage knowledge implanting of large language models, 2025. URL: https://arxiv.org/abs/2505.22120. arXiv:2505.22120

[13]

C. F. Park, Z. Zhang, H. Tanaka, New News: System-2 fine-tuning for robust integration of new knowledge, 2025. URL: https://arxiv.org/abs/2505.01812. arXiv:2505.01812

[16]

E. C. Acikgoz, C. Qian, H. Ji, D. Hakkani-Tür, G. Tur, Self-improving llm agents at test-time, 2025. URL: https://arxiv.org/abs/2510.07841. arXiv:2510.07841

[17]

J.-C. Pang, P. Wang, K. Li, X.-H. Chen, J. Xu, Z. Zhang, Y. Yu, Language model self-improvement by reinforcement learning contemplation, 2023. URL: https://arxiv.org/abs/2305.14483. arXiv:2305.14483

[19]

A. Zweiger, J. Pari, H. Guo, E. Akyürek, Y. Kim, P. Agrawal, Self-adapting language models, 2025. URL: https://arxiv.org/abs/2506.10943. arXiv:2506.10943

[20]

M. Li, J. Lin, X. Zhao, W. Lu, P. Zhao, S. Wermter, D. Wang, Curriculum-rlaif: Curriculum alignment with reinforcement learning from ai feedback, 2025. URL: https://arxiv.org/abs/2505.20075. arXiv:2505.20075

[21]

W. Yuan, R. Y. Pang, K. Cho, X. Li, S. Sukhbaatar, J. Xu, J. Weston, Self-rewarding language models. URL: https://arxiv.org/abs/2401.10020. arXiv:2401.10020

[23]

H. Zhou, Y. Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y. Lee, G. Zhang, K. Shao, L. Yang, J. Wang, Memento: Fine-tuning llm agents without fine-tuning llms, 2025. URL: https://arxiv.org/abs/2508.16153. arXiv:2508.16153

[24]

A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, S. Levine, Meta-reinforcement learning of structured exploration strategies, 2018. URL: https://arxiv.org/abs/1802.07245. arXiv:1802.07245

[25]

K. Irie, I. Schlag, R. Csordás, J. Schmidhuber, A modern self-referential weight matrix that learns to modify itself, 2022. URL: https://arxiv.org/abs/2202.05780. arXiv:2202.05780

[26]

Z. Tao, T.-E. Lin, X. Chen, H. Li, Y. Wu, Y. Li, Z. Jin, F. Huang, D. Tao, J. Zhou, A survey on self-evolution of large language models, 2024. URL: https://arxiv.org/abs/2404.14387. arXiv:2404.14387

[27]

H. ang Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Y. Wu, H. Wang, H. Xiao, Y. Zhou, S. Zhang, J. Zhang, J. Xiang, Y. Fang, Q. Zhao, D. Liu, Q. Ren, C. Qian, Z. Wang, M. Hu, H. Wang, Q. Wu, H. Ji, M. Wang, A survey of self-evolving agents: On path to artificial super intelligence, 2025. URL: https://arxiv.org/abs/2507.21046. arXiv:...

[28]

K. Wang, D. Tang, W. Zhao, K. Schürholt, Z. Wang, Y. You, Recurrent diffusion for large-scale parameter generation, arXiv preprint arXiv:2501.11587 (2025)

[29]

Z. Liang, D. Tang, Y. Zhou, X. Zhao, M. Shi, W. Zhao, Z. Li, P. Wang, K. Schürholt, D. Borth, et al., Drag-and-drop llms: Zero-shot prompt-to-weights, arXiv preprint arXiv:2506.16406 (2025)

[30]

R. Charakorn, E. Cetin, Y. Tang, R. T. Lange, Text-to-lora: Instant transformer adaption, 2025. URL: https://arxiv.org/abs/2506.06105. arXiv:2506.06105

[31]

R. M. S. Khan, D. Tang, P. Li, K. Wang, T. Chen, Oral: Prompting your large-scale loras via conditional recurrent diffusion, 2025. URL: https://arxiv.org/abs/2503.24354. arXiv:2503.24354

[32]

X. Jin, K. Wang, D. Tang, W. Zhao, Y. Zhou, J. Tang, Y. You, Conditional lora parameter generation. URL: https://arxiv.org/abs/2408.01415. arXiv:2408.01415

[34]

Y. Shao, X. Lin, X. Long, S. Chen, M. Yan, Y. Liu, Z. Yan, A. Ma, H. Tang, J. Guo, Icm-fusion: In-context meta-optimized lora fusion for multi-task adaptation, 2025. URL: https://arxiv.org/abs/2508.04153. arXiv:2508.04153

[35]

Y. Shao, M. Yan, Y. Liu, S. Chen, W. Chen, X. Long, Z. Yan, L. Li, C. Zhang, N. Sebe, H. Tang, Y. Wang, H. Zhao, M. Wang, J. Guo, In-context meta lora generation, 2025. URL: https://arxiv.org/abs/2501.17635. arXiv:2501.17635

[36]

T. Zhang, Toward weight-level self-improving agents with meta-knowledge discovery, doi:10.36227/techrxiv.175744083.37752625/v1 (2025)

[37]

E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., Lora: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022, p. 3

[38]

Y. LeCun, A path towards autonomous machine intelligence version 0.9.2, 2022-06-27, Open Review 62 (2022) 1–62

[39]

Y. Liu, Y. Nan, W. Xu, X. Hu, L. Ye, Z. Qin, P. Liu, Alphago moment for model architecture discovery. URL: https://arxiv.org/abs/2507.18074. arXiv:2507.18074

[41]

C. Lu, S. Holt, C. Fanconi, A. J. Chan, J. Foerster, M. van der Schaar, R. T. Lange, Discovering preference optimization algorithms with and for large language models, 2024. URL: https://arxiv.org/abs/2406.08414. arXiv:2406.08414

[42]

C. Huang, W. Yu, X. Wang, H. Zhang, Z. Li, R. Li, J. Huang, H. Mi, D. Yu, R-zero: Self-evolving reasoning llm from zero data, arXiv preprint arXiv:2508.05004 (2025)

[43]

N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019)

[44]

D. Kunin, J. Sagastuy-Brena, S. Ganguli, D. L. Yamins, H. Tanaka, Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics, arXiv preprint arXiv:2012.04728 (2020)

[45]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020)

[46]

Qwen: A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, Z. Qi...

[47]

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hellaswag: Can a machine really finish your sentence?, 2019. URL: https://arxiv.org/abs/1905.07830. arXiv:1905.07830

[48]

C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova, Boolq: Exploring the surprising difficulty of natural yes/no questions, 2019. URL: https://arxiv.org/abs/1905.10044. arXiv:1905.10044

[49]

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL: https://arxiv.org/abs/1803.05457. arXiv:1803.05457

[50]

T. Mihaylov, P. Clark, T. Khot, A. Sabharwal, Can a suit of armor conduct electricity? a new dataset for open book question answering, arXiv preprint arXiv:1809.02789 (2018)

[51]

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, Y. Choi, Piqa: Reasoning about physical commonsense in natural language, 2019. URL: https://arxiv.org/abs/1911.11641. arXiv:1911.11641

[52]

K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, Winogrande: An adversarial winograd schema challenge at scale, 2019. URL: https://arxiv.org/abs/1907.10641. arXiv:1907.10641

[53]

L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, C. Potts, K. Sen, A. G. Dimakis, I. Stoica, D. Klein, M. Zaharia, O. Khattab, Gepa: Reflective prompt evolution can outperform reinforcement learning, 2025. URL: https://arxiv.org/abs/2507.19457. arXiv:2507.19457

[54]

J. Li, X. Dong, Y. Liu, Z. Yang, Q. Wang, X. Wang, S. Zhu, Z. Jia, Z. Zheng, Reflectevo: Improving meta introspection of small llms by learning self-reflection, 2025. URL: https://arxiv.org/abs/2505.16475. arXiv:2505.16475

[55]

L. Liu, C. Zhang, L. Wu, C. Zhao, Z. Hu, M. He, J. Fan, Instruct-of-reflection: Enhancing large language models iterative reflection capabilities via dynamic-meta instruction, 2025. URL: https://arxiv.org/abs/2503.00902. arXiv:2503.00902

[56]

M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, J. Zou, Textgrad: Automatic "differentiation" via text, 2024. URL: https://arxiv.org/abs/2406.07496. arXiv:2406.07496

[57]

X. Tang, Z. Lv, X. Cheng, J. Li, W. X. Zhao, Z. Wen, Z. Zhang, J. Zhou, Enhancing cross-task transfer of large language models via activation steering, 2025. URL: https://arxiv.org/abs/2507.13236. arXiv:2507.13236

[58]

T. Wu, J. Wang, Z. Zhao, N. Wong, Mixture-of-subspaces in low-rank adaptation, 2025. URL: https://arxiv.org/abs/2406.11909. arXiv:2406.11909

[59]

R. Wang, K. Dvijotham, I. R. Manchester, Norm-bounded low-rank adaptation, 2025. URL: https://arxiv.org/abs/2501.19050. arXiv:2501.19050

[60]

Z. Zhao, T. Shen, D. Zhu, Z. Li, J. Su, X. Wang, K. Kuang, F. Wu, Merging loras like playing lego: Pushing the modularity of lora to extremes through rank-wise clustering, 2024. URL: https://arxiv.org/abs/2409.16167. arXiv:2409.16167

[61]

L. Chen, M. Prabhudesai, K. Fragkiadaki, H. Liu, D. Pathak, Self-questioning language models. URL: https://arxiv.org/abs/2508.03682. arXiv:2508.03682

[63]

G. Zhang, F. Meng, G. Wan, Z. Li, K. Wang, Z. Yin, L. Bai, S. Yan, Latentevolve: Self-evolving test-time scaling in latent space, 2025. URL: https://arxiv.org/abs/2509.24771. arXiv:2509.24771

[64]

A. Karan, Y. Du, Reasoning with sampling: Your base model is smarter than you think, 2025. URL: https://arxiv.org/abs/2510.14901. arXiv:2510.14901

[65]

Z. Wang, D. Ma, X. Huang, D. Cai, T. Lan, J. Xu, H. Mi, X. Tang, Y. Wang, The end of manual decoding: Towards truly end-to-end language models, 2025. URL: https://arxiv.org/abs/2510.26697. arXiv:2510.26697

[66]

Bileschi, Noah Constant, Roman Novak, Rosanne Liu, Tris Warkentin, Yundi Qian, Yamini Bansal, Ethan Dyer, Behnam Neyshabur, Jascha Sohl-Dickstein, and Noah Fiedel

A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, X. Garcia, P. J. Liu, J. Harrison, J. Lee, K. Xu, A. Parisi, A. Kumar, A. Alemi, A. Rizkowsky, A. Nova, B. Adlam, B. Bohnet, G. Elsayed, H. Sedghi, I. Mordatch, I. Simpson, I. Gur, J. Snoek, J. Pennington, J. Hron, K. Kenealy, K. Swersky, K. Mahajan, L. Culp, L. Xiao, M. L. Bileschi, N. Constant, R...

[67]

S. Zheng, H. Wang, C. Huang, X. Wang, T. Chen, J. Fan, S. Hu, P. Ye, Decouple and orthogonalize: A data-free framework for lora merging, 2025. URL: https://arxiv.org/abs/2505.15875. arXiv:2505.15875

[68]

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021)

[69]

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, J. Steinhardt, Measuring mathematical problem solving with the math dataset, arXiv preprint arXiv:2103.03874 (2021)

[70]

Z. Zhang, Z. Jiang, L. Xu, H. Hao, R. Wang, Multiple-choice questions are efficient and robust llm evaluators, arXiv preprint arXiv:2405.11966 (2024)

[71]

T. T. Chung, L. Liu, M. Yu, D.-Y. Yeung, Divlogiceval: A framework for benchmarking logical reasoning evaluation in large language models, arXiv preprint arXiv:2509.15587 (2025)

[72]

M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, Socialiqa: Commonsense reasoning about social interactions, arXiv preprint arXiv:1904.09728 (2019)

[73]

D. N. Manh, T. P. Chau, N. Le Hai, T. T. Doan, N. V. Nguyen, Q. Pham, N. D. Bui, Codemmlu: A multi-task benchmark for assessing code understanding capabilities of codellms, CoRR (2024)

[1] [1]

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020)

[2] [2]

W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, Advances in neural information processing systems 29 (2016)

[3] [3]

J. Hu, Z. Zhang, G. Chen, X. Wen, C. Shuai, W. Luo, B. Xiao, Y. Li, M. Tan, Test-time learning for large language models, 2025. URL: https://arxiv.org/abs/2505.20633. arXiv:2505.20633

[4] [4]

Y. Hu, X. Zhang, X. Fang, Z. Chen, X. Wang, H. Zhang, G. Qi, Slot: Sample-specific language model optimization at test-time, 2025. URL: https://arxiv.org/abs/2505.12392. arXiv:2505.12392

[5] [5]

Y. Zuo, K. Zhang, L. Sheng, S. Qu, G. Cui, X. Zhu, H. Li, Y. Zhang, X. Long, E. Hua, B. Qi, Y. Sun, Z. Ma, L. Yuan, N. Ding, B. Zhou, Ttrl: Test-time reinforcement learning, 2025. URL: https://arxiv.org/abs/2504.16084. arXiv:2504.16084

[6] [6]

M. M. Moradi, H. Amer, S. Mudur, W. Zhang, Y. Liu, W. Ahmed, Continuous self-improvement of large language models by test-time training with verifier-driven sample selection, 2025. URL: https://arxiv.org/abs/2505.19475. arXiv:2505.19475

[7] [7]

H. Lee, S. Oh, J. Kim, J. Shin, J. Tack, Revise: Learning to refine at test-time via intrinsic self-verification, 2025. URL: https://arxiv.org/abs/2502.14565. arXiv:2502.14565

[8] [8]

J. Hübotter, L. Diaz-Bone, I. Hakimi, A. Krause, M. Hardt, Learning on the job: Test-time curricula for targeted reinforcement learning, 2025. URL: https://arxiv.org/abs/2510.04786. arXiv:2510.04786

[9] [9]

R. Bertolissi, J. Hübotter, I. Hakimi, A. Krause, Local mixtures of experts: Essentially free test-time training via model merging, 2025. URL: https://arxiv.org/abs/2505.14136. arXiv:2505.14136

[10] [10]

Z. Yang, N. Band, S. Li, E. Candès, T. Hashimoto, Synthetic continued pretraining, 2024. URL: https://arxiv.org/abs/2409.07431. arXiv:2409.07431

[11] [11]

Y. Wang, X. Liu, X. Chen, S. O’Brien, J. Wu, J. McAuley, Self-updatable large language models by integrating context into model parameters, 2025. URL: https://arxiv.org/abs/2410.00487. arXiv:2410.00487

[12] [12]

R. Wang, P. Ping, Z. Guo, X. Zhang, Q. Shi, L. Zhou, T. Ji, Loki: Low-damage knowledge implanting of large language models, 2025. URL: https://arxiv.org/abs/2505.22120. arXiv:2505.22120

[13] [13]

C. F. Park, Z. Zhang, H. Tanaka, New News: System-2 fine-tuning for robust integration of new knowledge, 2025. URL: https://arxiv.org/abs/2505.01812. arXiv:2505.01812

[14] [16]

E. C. Acikgoz, C. Qian, H. Ji, D. Hakkani-Tür, G. Tur, Self-improving llm agents at test-time, 2025. URL: https://arxiv.org/abs/2510.07841. arXiv:2510.07841

[15] [17]

J.-C. Pang, P. Wang, K. Li, X.-H. Chen, J. Xu, Z. Zhang, Y. Yu, Language model self-improvement by reinforcement learning contemplation, 2023. URL: https://arxiv.org/abs/2305.14483. arXiv:2305.14483

[16] [19]

A. Zweiger, J. Pari, H. Guo, E. Akyürek, Y. Kim, P. Agrawal, Self-adapting language models, 2025. URL: https://arxiv.org/abs/2506.10943. arXiv:2506.10943

[17] [20]

M. Li, J. Lin, X. Zhao, W. Lu, P. Zhao, S. Wermter, D. Wang, Curriculum-rlaif: Curriculum alignment with reinforcement learning from ai feedback, 2025. URL: https://arxiv.org/abs/2505.20075. arXiv:2505.20075

[18] [21]

W. Yuan, R. Y. Pang, K. Cho, X. Li, S. Sukhbaatar, J. Xu, J. Weston, Self-rewarding language models, URL: https://arxiv.org/abs/2401.10020. arXiv:2401.10020

[20] [23]

H. Zhou, Y. Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y. Lee, G. Zhang, K. Shao, L. Yang, J. Wang, Memento: Fine-tuning llm agents without fine-tuning llms, 2025. URL: https://arxiv.org/abs/2508.16153. arXiv:2508.16153

[21] [24]

A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, S. Levine, Meta-reinforcement learning of structured exploration strategies, 2018. URL: https://arxiv.org/abs/1802.07245. arXiv:1802.07245

[22] [25]

K. Irie, I. Schlag, R. Csordás, J. Schmidhuber, A modern self-referential weight matrix that learns to modify itself, 2022. URL: https://arxiv.org/abs/2202.05780. arXiv:2202.05780

[23] [26]

Z. Tao, T.-E. Lin, X. Chen, H. Li, Y. Wu, Y. Li, Z. Jin, F. Huang, D. Tao, J. Zhou, A survey on self-evolution of large language models, 2024. URL: https://arxiv.org/abs/2404.14387. arXiv:2404.14387

[24] [27]

H. ang Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Y. Wu, H. Wang, H. Xiao, Y. Zhou, S. Zhang, J. Zhang, J. Xiang, Y. Fang, Q. Zhao, D. Liu, Q. Ren, C. Qian, Z. Wang, M. Hu, H. Wang, Q. Wu, H. Ji, M. Wang, A survey of self-evolving agents: On path to artificial super intelligence, 2025. URL: https://arxiv.org/abs/2507.21046.arXiv:...

[25] [28]

K. Wang, D. Tang, W. Zhao, K. Schürholt, Z. Wang, Y. You, Recurrent diffusion for large-scale parameter generation, arXiv preprint arXiv:2501.11587 (2025)

[26] [29]

Z. Liang, D. Tang, Y. Zhou, X. Zhao, M. Shi, W. Zhao, Z. Li, P. Wang, K. Schürholt, D. Borth, et al., Drag-and-drop llms: Zero-shot prompt-to-weights, arXiv preprint arXiv:2506.16406 (2025)

[27] [30]

R. Charakorn, E. Cetin, Y. Tang, R. T. Lange, Text-to-lora: Instant transformer adaption, 2025. URL: https://arxiv.org/abs/2506.06105. arXiv:2506.06105

[28] [31]

R. M. S. Khan, D. Tang, P. Li, K. Wang, T. Chen, Oral: Prompting your large-scale loras via conditional recurrent diffusion, 2025. URL: https://arxiv.org/abs/2503.24354. arXiv:2503.24354

[29] [32]

X. Jin, K. Wang, D. Tang, W. Zhao, Y. Zhou, J. Tang, Y. You, Conditional lora parameter generation, URL: https://arxiv.org/abs/2408.01415. arXiv:2408.01415

[31] [34]

Y. Shao, X. Lin, X. Long, S. Chen, M. Yan, Y. Liu, Z. Yan, A. Ma, H. Tang, J. Guo, Icm-fusion: In-context meta-optimized lora fusion for multi-task adaptation, 2025. URL: https://arxiv.org/abs/2508.04153. arXiv:2508.04153

[32] [35]

Y. Shao, M. Yan, Y. Liu, S. Chen, W. Chen, X. Long, Z. Yan, L. Li, C. Zhang, N. Sebe, H. Tang, Y. Wang, H. Zhao, M. Wang, J. Guo, In-context meta lora generation, 2025. URL: https://arxiv.org/abs/2501.17635. arXiv:2501.17635

[33] [36]

T. Zhang, Toward weight-level self-improving agents with meta-knowledge discovery, 10.36227/techrxiv.175744083.37752625/v1 (2025)

[34] [37]

E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., Lora: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022, p. 3

[35] [38]

Y. LeCun, A path towards autonomous machine intelligence version 0.9.2, 2022-06-27, Open Review 62 (2022) 1–62

[36] [39]

Y. Liu, Y. Nan, W. Xu, X. Hu, L. Ye, Z. Qin, P. Liu, Alphago moment for model architecture discovery, URL: https://arxiv.org/abs/2507.18074. arXiv:2507.18074

[38] [41]

C. Lu, S. Holt, C. Fanconi, A. J. Chan, J. Foerster, M. van der Schaar, R. T. Lange, Discovering preference optimization algorithms with and for large language models, 2024. URL: https://arxiv.org/abs/2406.08414. arXiv:2406.08414

[39] [42]

C. Huang, W. Yu, X. Wang, H. Zhang, Z. Li, R. Li, J. Huang, H. Mi, D. Yu, R-zero: Self-evolving reasoning llm from zero data, arXiv preprint arXiv:2508.05004 (2025)

[40] [43]

N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019)

[41] [44]

D. Kunin, J. Sagastuy-Brena, S. Ganguli, D. L. Yamins, H. Tanaka, Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics, arXiv preprint arXiv:2012.04728 (2020)

[42] [45]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020)

[43] [46]

Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, Z. Qi...

[44] [47]

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hellaswag: Can a machine really finish your sentence?, 2019. URL: https://arxiv.org/abs/1905.07830. arXiv:1905.07830

[45] [48]

C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova, Boolq: Exploring the surprising difficulty of natural yes/no questions, 2019. URL: https://arxiv.org/abs/1905.10044. arXiv:1905.10044

[46] [49]

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL: https://arxiv.org/abs/1803.05457. arXiv:1803.05457

[47] [50]

T. Mihaylov, P. Clark, T. Khot, A. Sabharwal, Can a suit of armor conduct electricity? a new dataset for open book question answering, arXiv preprint arXiv:1809.02789 (2018)

[48] [51]

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, Y. Choi, Piqa: Reasoning about physical commonsense in natural language, 2019. URL: https://arxiv.org/abs/1911.11641. arXiv:1911.11641

[49] [52]

K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, Winogrande: An adversarial winograd schema challenge at scale, 2019. URL: https://arxiv.org/abs/1907.10641. arXiv:1907.10641

[50] [53]

L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, C. Potts, K. Sen, A. G. Dimakis, I. Stoica, D. Klein, M. Zaharia, O. Khattab, Gepa: Reflective prompt evolution can outperform reinforcement learning, 2025. URL: https://arxiv.org/abs/2507.19457. arXiv:2507.19457

[51] [54]

J. Li, X. Dong, Y. Liu, Z. Yang, Q. Wang, X. Wang, S. Zhu, Z. Jia, Z. Zheng, Reflectevo: Improving meta introspection of small llms by learning self-reflection, 2025. URL: https://arxiv.org/abs/2505.16475. arXiv:2505.16475

[52] [55]

L. Liu, C. Zhang, L. Wu, C. Zhao, Z. Hu, M. He, J. Fan, Instruct-of-reflection: Enhancing large language models iterative reflection capabilities via dynamic-meta instruction, 2025. URL: https://arxiv.org/abs/2503.00902. arXiv:2503.00902

[53] [56]

M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, J. Zou, Textgrad: Automatic "differentiation" via text, 2024. URL: https://arxiv.org/abs/2406.07496. arXiv:2406.07496

[54] [57]

X. Tang, Z. Lv, X. Cheng, J. Li, W. X. Zhao, Z. Wen, Z. Zhang, J. Zhou, Enhancing cross-task transfer of large language models via activation steering, 2025. URL: https://arxiv.org/abs/2507.13236. arXiv:2507.13236
