arxiv: 2605.20189 · v1 · pith:E2MF3PIZnew · submitted 2026-03-23 · 💻 cs.AI · cs.LG

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

Pith reviewed 2026-05-21 11:16 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords autonomous agents · lifelong learning · continual adaptation · reinforcement learning · meta-learning · large language models · test-time adaptation

The pith

SOLAR lets an autonomous agent discover its own adaptation strategies by treating model weights as an environment for multi-level reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SOLAR as a method for large language models to handle streaming data and concept drift without relying on traditional fine-tuning or manual curation. It starts with a consolidated prior on common-sense knowledge and then applies multi-level reinforcement learning so the agent can explore and select modification strategies on its own parameters. The system keeps an evolving knowledge base that serves as episodic memory to retain what has already been learned while allowing new adaptations. A sympathetic reader would care because this setup aims to produce agents that improve over time in changing real-world conditions rather than requiring repeated human-guided retraining.

Core claim

SOLAR initiates with a strong prior over common-sense knowledge and then uses a multi-level reinforcement learning approach to autonomously discover adaptation strategies. It maintains an evolving knowledge base of valid modification strategies that implicitly acts as an episodic memory buffer, balancing plasticity for new tasks with stability for retained meta-knowledge. This enables efficient test-time adaptation to unseen domains while avoiding catastrophic forgetting.

What carries the argument

Multi-level reinforcement learning applied to model weights treated as an explorable environment, together with an evolving knowledge base of valid modification strategies.
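
One way to picture the evolving knowledge base is as a bounded store of validated modification strategies. The sketch below is illustrative only: the names `StrategyRecord` and `KnowledgeBase`, the capacity limit, the score-based eviction, and the validity test (a strategy is kept only if it beats a baseline score) are all assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StrategyRecord:
    """Hypothetical record of one weight-modification strategy."""
    domain: str                 # e.g. "math", "code"
    strategy: Dict[str, float]  # e.g. {"learning_rate": 1e-4}
    score: float                # validation score observed after applying it


@dataclass
class KnowledgeBase:
    """Assumed bounded store acting as an episodic memory of strategies."""
    capacity: int = 100
    records: List[StrategyRecord] = field(default_factory=list)

    def add_if_valid(self, record: StrategyRecord, baseline_score: float) -> bool:
        # Assumption: a strategy counts as "valid" only if it beats the baseline.
        if record.score <= baseline_score:
            return False
        self.records.append(record)
        # Keep the highest-scoring records and evict the rest.
        self.records.sort(key=lambda r: r.score, reverse=True)
        del self.records[self.capacity:]
        return True

    def best_for(self, domain: str) -> Optional[StrategyRecord]:
        # Retrieve the best stored strategy for a domain, if any exists.
        matches = [r for r in self.records if r.domain == domain]
        return max(matches, key=lambda r: r.score) if matches else None
```

Evicting purely by score is the simplest choice here; a store that must protect older domains would more plausibly reserve slots per domain, which this sketch does not attempt.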

If this is right

  • Enables efficient test-time adaptation to unseen domains without gradient-based retraining.
  • Outperforms strong baselines on common-sense, mathematical, medical, coding, social, and logical reasoning tasks.
  • Maintains balance between plasticity for new tasks and stability for prior meta-knowledge.
  • Supports open-ended autonomous agents capable of lifelong adaptation in evolving environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could lower the human effort needed to keep deployed models current as data streams change.
  • Similar self-optimization loops might be tested on non-language models to check if the same weight-exploration approach transfers.
  • Longer task sequences would show whether the knowledge base continues to grow without becoming unwieldy.

Load-bearing premise

Treating model weights as an environment that multi-level reinforcement learning can reliably explore will produce modification strategies that generalize across domains without causing instability or collapse.

What would settle it

Running SOLAR through a sequence of new domains and checking whether performance on the original tasks remains stable or degrades after each adaptation cycle.
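
The check described above can be summarized with standard continual-learning bookkeeping. The following is a minimal sketch, assuming an accuracy matrix where `acc[i][j]` is the accuracy on domain `j` measured after the `i`-th adaptation cycle; the function names and the example numbers are hypothetical and do not come from the paper.

```python
from typing import List


def backward_transfer(acc: List[List[float]]) -> float:
    """Average change on earlier domains after the final adaptation cycle.

    acc[i][j] is accuracy on domain j after adapting to domain i (only
    entries with i >= j are read). Negative values indicate forgetting.
    """
    n = len(acc)
    if n < 2:
        raise ValueError("need at least two adaptation cycles")
    diffs = [acc[n - 1][j] - acc[j][j] for j in range(n - 1)]
    return sum(diffs) / len(diffs)


def max_forgetting(acc: List[List[float]]) -> float:
    """Largest drop from the best earlier accuracy to the final accuracy."""
    n = len(acc)
    if n < 2:
        raise ValueError("need at least two adaptation cycles")
    drops = []
    for j in range(n - 1):
        best_earlier = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best_earlier - acc[n - 1][j])
    return max(drops)


# Made-up numbers for three domains, for illustration only.
example = [
    [0.80, 0.00, 0.00],
    [0.78, 0.70, 0.00],
    [0.75, 0.68, 0.90],
]
print(backward_transfer(example), max_forgetting(example))
```

A backward transfer near zero across a long domain sequence would support the stability claim; a steadily more negative value would count against it.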

Figures

Figures reproduced from arXiv: 2605.20189 by Dianbo Liu, Nitin Vetcha.

Figure 1: SOLAR’s methodology of weight-level meta-knowledge discovery and modification, summarized (adapted from [34]).
5. Implementation. 5.1. Architecture. The primary architectural detail in SOLAR’s framework is the design of the weight-space exploration initializer. As mentioned in Section 4, we use a convolution-based decoder model for this purpose. We assume that we have access to either the unseen task’s description…
Figure 2: Details of the Parameter Tokenization Process.
These convolutions are divided into three categories: i) width convolution, which operates on the (𝐶, 𝐿) dimension; ii) height convolution, which operates on the (𝐿, 𝑁) dimension; and iii) layer-wise convolution, which operates on the (𝑁, 𝐿) dimension, denoted Conv𝑊, Conv𝐻, and Conv𝐿 respectively. Each layer consists of two Conv𝑊, two Conv𝐻 and one Conv𝐿. Given this, the forward operation of t…
Figure 3: Details of the Hyper-Convolutional Decoder Architecture used.
Subsequently, prompt-checkpoint pairing is done as follows. Given a dataset 𝑃, it is first divided into non-overlapping prompt batches [𝑝1, ⋯, 𝑝𝑖, ⋯, 𝑝𝐼]. Denote the trained LLM checkpoints of this dataset as 𝑀 = [𝑚1, ⋯, 𝑚𝑗, ⋯, 𝑚𝐽]. Then a batch of prompts and a corresponding checkpoint are picked at random to create a pa…
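
The pairing step in the excerpt above can be sketched as follows. This is a hedged illustration: the helper names `split_batches` and `sample_pair`, the fixed batch size, and the uniform random sampling are assumptions, since the excerpt is truncated before it specifies them.

```python
import random
from typing import List, Sequence, Tuple


def split_batches(prompts: Sequence[str], batch_size: int) -> List[List[str]]:
    """Divide a dataset's prompts into non-overlapping batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(prompts[i:i + batch_size])
            for i in range(0, len(prompts), batch_size)]


def sample_pair(batches: List[List[str]],
                checkpoints: Sequence[str],
                rng: random.Random) -> Tuple[List[str], str]:
    """Pick one prompt batch and one checkpoint trained on the same dataset.

    Assumption: both are drawn uniformly at random.
    """
    return rng.choice(batches), rng.choice(list(checkpoints))


rng = random.Random(0)  # seeded only to make the illustration repeatable
batches = split_batches(["p1", "p2", "p3", "p4", "p5"], batch_size=2)
batch, checkpoint = sample_pair(batches, ["m1", "m2"], rng)
```

Checkpoints are represented here as plain string identifiers; in practice they would be references to stored weights.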
Figure 4: Router Approach for TTS.
… which can take one of five values: avg_sim_score, avg_prompt_embed, max_confidence, majority_vote, or sum_logprobs (summing log probabilities). The former two belong to the router approach and the latter three constitute the ensemble approach. • For LS, we use [4] and the corresponding JSON object has fields times and learning_rate. 6. Experiments. 6.1. Setup. As described in Section 5…
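
The three ensemble options named in the excerpt above (max_confidence, majority_vote, sum_logprobs) might look roughly like this. The `Candidate` structure and the exact selection rules are assumptions for illustration; the excerpt gives only the option names, so this may differ from what the paper implements.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical output of one candidate model for a single question."""
    answer: str
    confidence: float            # assumed scalar confidence in [0, 1]
    token_logprobs: List[float]  # assumed per-token log probabilities


def majority_vote(cands: List[Candidate]) -> str:
    # Return the answer produced by the most candidates.
    return Counter(c.answer for c in cands).most_common(1)[0][0]


def max_confidence(cands: List[Candidate]) -> str:
    # Return the answer of the single most confident candidate.
    return max(cands, key=lambda c: c.confidence).answer


def sum_logprobs(cands: List[Candidate]) -> str:
    # Return the answer whose generation has the highest summed log probability.
    return max(cands, key=lambda c: sum(c.token_logprobs)).answer
```

Summed log probabilities favor shorter generations, so a length-normalized variant is a common alternative; whether the paper normalizes is not stated in the excerpt.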
Figure 5: Details of the Prompt Selection Strategy used in Ablation Study.
Finally, greedy graph search is done to select the final prompt subset 𝑆. For this, start with 𝑆 = ∅ and at each round pick 𝑣* = arg max_{𝑣 ∉ 𝑆} 𝑓_𝒢(𝑣); 𝑣* is then added to 𝑆, and diversity penalties are updated only for neighbors of 𝑣*. This process continues until |𝑆| reaches the target size, which in our case is 128. Fortunately, the influe…
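
The greedy selection loop in the excerpt above can be sketched as follows. The initial scores, the neighbor graph, and the subtractive form of the diversity penalty are assumptions; the excerpt does not define the scoring function, so `greedy_select` is a hypothetical stand-in rather than the paper's procedure.

```python
from typing import Dict, List, Set


def greedy_select(scores: Dict[str, float],
                  neighbors: Dict[str, Set[str]],
                  target_size: int,
                  penalty: float = 0.5) -> List[str]:
    """Greedily pick prompts, penalizing neighbors of each pick.

    scores:    assumed initial score of each candidate prompt
    neighbors: assumed similarity graph (prompt -> adjacent prompts)
    penalty:   assumed constant subtracted from a neighbor's score
    """
    gain = dict(scores)  # working copy; selected items are removed from it
    selected: List[str] = []
    while len(selected) < target_size and gain:
        v_star = max(gain, key=gain.get)  # arg max over unselected prompts
        selected.append(v_star)
        del gain[v_star]
        # Only the neighbors of the newly selected prompt are updated.
        for u in neighbors.get(v_star, ()):
            if u in gain:
                gain[u] -= penalty
    return selected
```

With a target size of 128, as in the excerpt, the loop runs at most 128 rounds, and each round touches only the neighbors of the chosen prompt.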
Original abstract

Despite the remarkable success of large language models (LLMs), they still face bottlenecks while deploying in dynamic, real-world settings with primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without resulting in catastrophic forgetting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR) which is an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It initiates the process by consolidating a strong prior over common-sense knowledge making it effective for transfer-learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SOLAR, a Self-Optimizing Lifelong Autonomous Reasoner, which is an open-ended autonomous agent that leverages parameter-level meta-learning by treating model weights as an environment for exploration. It uses multi-level reinforcement learning to autonomously discover adaptation strategies and maintains an evolving knowledge base to balance plasticity and stability. The paper claims that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social, and logical reasoning tasks.

Significance. If the experimental results and the underlying mechanisms are rigorously demonstrated with full implementation details, this work could have high significance for the field of continual learning and autonomous agents, as it addresses key challenges like concept drift and catastrophic forgetting in dynamic environments without relying on gradient-based adaptation or extensive manual curation. The approach of treating weights as an RL environment and using an evolving knowledge base as episodic memory is novel if shown to be stable and generalizable.

major comments (2)
  1. [Abstract] Abstract: The claim that SOLAR 'outperforms strong baselines' on six reasoning domains is stated without any accompanying methods, data details, error bars, ablation results, or statistical tests, which is load-bearing for the central claim of autonomous strategy discovery via multi-level RL.
  2. [Methods] RL framework description: No equations or pseudocode are provided for the multi-level RL policy, action space over model weights, reward function (e.g., validation accuracy plus stability term), or knowledge-base update rule, leaving open whether the method reliably constrains modifications to valid states and avoids instability or catastrophic interference.
minor comments (1)
  1. [Abstract] The abstract uses several acronyms (LLM, FT, SOLAR) without initial expansion on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly to improve clarity, reproducibility, and support for our claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that SOLAR 'outperforms strong baselines' on six reasoning domains is stated without any accompanying methods, data details, error bars, ablation results, or statistical tests, which is load-bearing for the central claim of autonomous strategy discovery via multi-level RL.

    Authors: We agree that the abstract's performance claim would benefit from additional context. In the revised manuscript, we have updated the abstract to briefly reference the evaluation across the six reasoning domains using standard benchmarks, along with a note that results include error bars, ablations, and statistical tests. Full experimental details, data descriptions, and analyses remain in the Experiments section, where we have added the requested elements to strengthen the presentation of the central claim.
    Revision: yes

  2. Referee: [Methods] RL framework description: No equations or pseudocode are provided for the multi-level RL policy, action space over model weights, reward function (e.g., validation accuracy plus stability term), or knowledge-base update rule, leaving open whether the method reliably constrains modifications to valid states and avoids instability or catastrophic interference.

    Authors: We acknowledge that the original submission lacked formal descriptions of the RL components. The revised manuscript now includes equations for the multi-level RL policy, the action space defined over model weight modifications, the reward function (task accuracy combined with a stability term), and the knowledge-base update rule. We have also added pseudocode for the overall SOLAR procedure in the Methods section. These additions clarify the constraints on state transitions and the mechanisms for maintaining stability.
    Revision: yes
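
To make the second exchange concrete, a reward of the kind both sides describe (task accuracy combined with a stability term) could take the form below. This is a hypothetical sketch: the function name, the hinge-style penalty, and the `stability_weight` parameter are assumptions, and the paper's actual reward is not available in the text reviewed here.

```python
def adaptation_reward(new_task_acc: float,
                      retained_acc_before: float,
                      retained_acc_after: float,
                      stability_weight: float = 1.0) -> float:
    """Reward = accuracy on the new task minus a penalty for forgetting.

    The penalty applies only when accuracy on previously learned tasks
    drops; improvements on old tasks are neither rewarded nor penalized.
    """
    drop = max(0.0, retained_acc_before - retained_acc_after)
    return new_task_acc - stability_weight * drop
```

A larger `stability_weight` pushes the policy toward conservative weight modifications, which is one way the plasticity and stability trade-off could be tuned.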

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces SOLAR as a novel agent architecture that treats model weights as an RL environment and uses multi-level reinforcement learning plus an evolving knowledge base for lifelong adaptation. No mathematical derivations, equations, or self-referential definitions appear in the abstract or method description. Performance claims rest on experimental comparisons across reasoning domains rather than any reduction of outputs to fitted inputs or self-citations by construction. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted from the provided text.


Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 22 internal anchors

  1. [1]

    Scaling Laws for Neural Language Models

    J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020)

  2. [2]

    W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, Advances in neural information processing systems 29 (2016)

  3. [3]

    J. Hu, Z. Zhang, G. Chen, X. Wen, C. Shuai, W. Luo, B. Xiao, Y. Li, M. Tan, Test-time learning for large language models, 2025. URL: https://arxiv.org/abs/2505.20633. arXiv:2505.20633

  4. [4]

    Y. Hu, X. Zhang, X. Fang, Z. Chen, X. Wang, H. Zhang, G. Qi, Slot: Sample-specific language model optimization at test-time, 2025. URL: https://arxiv.org/abs/2505.12392. arXiv:2505.12392

  5. [5]

    Y. Zuo, K. Zhang, L. Sheng, S. Qu, G. Cui, X. Zhu, H. Li, Y. Zhang, X. Long, E. Hua, B. Qi, Y. Sun, Z. Ma, L. Yuan, N. Ding, B. Zhou, Ttrl: Test-time reinforcement learning, 2025. URL: https://arxiv.org/abs/2504.16084. arXiv:2504.16084

  6. [6]

    M. M. Moradi, H. Amer, S. Mudur, W. Zhang, Y. Liu, W. Ahmed, Continuous self-improvement of large language models by test-time training with verifier-driven sample selection, 2025. URL: https://arxiv.org/abs/2505.19475. arXiv:2505.19475

  7. [7]

    H. Lee, S. Oh, J. Kim, J. Shin, J. Tack, Revise: Learning to refine at test-time via intrinsic self-verification, 2025. URL: https://arxiv.org/abs/2502.14565. arXiv:2502.14565

  8. [8]

    J. Hübotter, L. Diaz-Bone, I. Hakimi, A. Krause, M. Hardt, Learning on the job: Test-time curricula for targeted reinforcement learning, 2025. URL: https://arxiv.org/abs/2510.04786. arXiv:2510.04786

  9. [9]

    R. Bertolissi, J. Hübotter, I. Hakimi, A. Krause, Local mixtures of experts: Essentially free test-time training via model merging, 2025. URL: https://arxiv.org/abs/2505.14136. arXiv:2505.14136

  10. [10]

    Z. Yang, N. Band, S. Li, E. Candès, T. Hashimoto, Synthetic continued pretraining, 2024. URL: https://arxiv.org/abs/2409.07431. arXiv:2409.07431

  11. [11]

    Y. Wang, X. Liu, X. Chen, S. O’Brien, J. Wu, J. McAuley, Self-updatable large language models by integrating context into model parameters, 2025. URL: https://arxiv.org/abs/2410.00487. arXiv:2410.00487

  12. [12]

    R. Wang, P. Ping, Z. Guo, X. Zhang, Q. Shi, L. Zhou, T. Ji, Loki: Low-damage knowledge implanting of large language models, 2025. URL: https://arxiv.org/abs/2505.22120. arXiv:2505.22120

  13. [13]

    C. F. Park, Z. Zhang, H. Tanaka, New News: System-2 fine-tuning for robust integration of new knowledge, 2025. URL: https://arxiv.org/abs/2505.01812. arXiv:2505.01812

  14. [16]

    E. C. Acikgoz, C. Qian, H. Ji, D. Hakkani-Tür, G. Tur, Self-improving llm agents at test-time, 2025. URL: https://arxiv.org/abs/2510.07841. arXiv:2510.07841

  15. [17]

    J.-C. Pang, P. Wang, K. Li, X.-H. Chen, J. Xu, Z. Zhang, Y. Yu, Language model self-improvement by reinforcement learning contemplation, 2023. URL: https://arxiv.org/abs/2305.14483. arXiv:2305.14483

  16. [19]

    A. Zweiger, J. Pari, H. Guo, E. Akyürek, Y. Kim, P. Agrawal, Self-adapting language models, 2025. URL: https://arxiv.org/abs/2506.10943. arXiv:2506.10943

  17. [20]

    M. Li, J. Lin, X. Zhao, W. Lu, P. Zhao, S. Wermter, D. Wang, Curriculum-rlaif: Curriculum alignment with reinforcement learning from ai feedback, 2025. URL: https://arxiv.org/abs/2505.20075. arXiv:2505.20075

  18. [21]

    W. Yuan, R. Y. Pang, K. Cho, X. Li, S. Sukhbaatar, J. Xu, J. Weston, Self-rewarding language models. URL: https://arxiv.org/abs/2401.10020. arXiv:2401.10020

  20. [23]

    H. Zhou, Y. Chen, S. Guo, X. Yan, K. H. Lee, Z. Wang, K. Y. Lee, G. Zhang, K. Shao, L. Yang, J. Wang, Memento: Fine-tuning llm agents without fine-tuning llms, 2025. URL: https://arxiv.org/abs/2508.16153. arXiv:2508.16153

  21. [24]

    Meta-Reinforcement Learning of Structured Exploration Strategies

    A. Gupta, R. Mendonca, Y. Liu, P. Abbeel, S. Levine, Meta-reinforcement learning of structured exploration strategies, 2018. URL: https://arxiv.org/abs/1802.07245. arXiv:1802.07245

  22. [25]

    K. Irie, I. Schlag, R. Csordás, J. Schmidhuber, A modern self-referential weight matrix that learns to modify itself, 2022. URL: https://arxiv.org/abs/2202.05780. arXiv:2202.05780

  23. [26]

    A survey on self-evolution of large language models

    Z. Tao, T.-E. Lin, X. Chen, H. Li, Y. Wu, Y. Li, Z. Jin, F. Huang, D. Tao, J. Zhou, A survey on self-evolution of large language models, 2024. URL: https://arxiv.org/abs/2404.14387. arXiv:2404.14387

  24. [27]

    A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    H.-a. Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Y. Wu, H. Wang, H. Xiao, Y. Zhou, S. Zhang, J. Zhang, J. Xiang, Y. Fang, Q. Zhao, D. Liu, Q. Ren, C. Qian, Z. Wang, M. Hu, H. Wang, Q. Wu, H. Ji, M. Wang, A survey of self-evolving agents: On path to artificial super intelligence, 2025. URL: https://arxiv.org/abs/2507.21046. arXiv:...

  25. [28]

    K. Wang, D. Tang, W. Zhao, K. Schürholt, Z. Wang, Y. You, Recurrent diffusion for large-scale parameter generation, arXiv preprint arXiv:2501.11587 (2025)

  26. [29]

    Drag-and-drop llms: Zero-shot prompt-to-weights

    Z. Liang, D. Tang, Y. Zhou, X. Zhao, M. Shi, W. Zhao, Z. Li, P. Wang, K. Schürholt, D. Borth, et al., Drag-and-drop llms: Zero-shot prompt-to-weights, arXiv preprint arXiv:2506.16406 (2025)

  27. [30]

    R. Charakorn, E. Cetin, Y. Tang, R. T. Lange, Text-to-lora: Instant transformer adaption, 2025. URL: https://arxiv.org/abs/2506.06105. arXiv:2506.06105

  28. [31]

    R. M. S. Khan, D. Tang, P. Li, K. Wang, T. Chen, Oral: Prompting your large-scale loras via conditional recurrent diffusion, 2025. URL: https://arxiv.org/abs/2503.24354. arXiv:2503.24354

  29. [32]

    X. Jin, K. Wang, D. Tang, W. Zhao, Y. Zhou, J. Tang, Y. You, Conditional lora parameter generation. URL: https://arxiv.org/abs/2408.01415. arXiv:2408.01415

  31. [34]

    Y. Shao, X. Lin, X. Long, S. Chen, M. Yan, Y. Liu, Z. Yan, A. Ma, H. Tang, J. Guo, Icm-fusion: In-context meta-optimized lora fusion for multi-task adaptation, 2025. URL: https://arxiv.org/abs/2508.04153. arXiv:2508.04153

  32. [35]

    Y. Shao, M. Yan, Y. Liu, S. Chen, W. Chen, X. Long, Z. Yan, L. Li, C. Zhang, N. Sebe, H. Tang, Y. Wang, H. Zhao, M. Wang, J. Guo, In-context meta lora generation, 2025. URL: https://arxiv.org/abs/2501.17635. arXiv:2501.17635

  33. [36]

    T. Zhang, Toward weight-level self-improving agents with meta-knowledge discovery, 10.36227/techrxiv.175744083.37752625/v1 (2025)

  34. [37]

    E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., Lora: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022, p. 3

  35. [38]

    Y. LeCun, A path towards autonomous machine intelligence version 0.9.2, 2022-06-27, Open Review 62 (2022) 1–62

  36. [39]

    Y. Liu, Y. Nan, W. Xu, X. Hu, L. Ye, Z. Qin, P. Liu, Alphago moment for model architecture discovery. URL: https://arxiv.org/abs/2507.18074. arXiv:2507.18074

  38. [41]

    C. Lu, S. Holt, C. Fanconi, A. J. Chan, J. Foerster, M. van der Schaar, R. T. Lange, Discovering preference optimization algorithms with and for large language models, 2024. URL: https://arxiv.org/abs/2406.08414. arXiv:2406.08414

  39. [42]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    C. Huang, W. Yu, X. Wang, H. Zhang, Z. Li, R. Li, J. Huang, H. Mi, D. Yu, R-zero: Self-evolving reasoning llm from zero data, arXiv preprint arXiv:2508.05004 (2025)

  40. [43]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019)

  41. [44]

    D. Kunin, J. Sagastuy-Brena, S. Ganguli, D. L. Yamins, H. Tanaka, Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics, arXiv preprint arXiv:2012.04728 (2020)

  42. [45]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020)

  43. [46]

    Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, Z. Qi...

  44. [47]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hellaswag: Can a machine really finish your sentence?, 2019. URL: https://arxiv.org/abs/1905.07830. arXiv:1905.07830

  45. [48]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova, Boolq: Exploring the surprising difficulty of natural yes/no questions, 2019. URL: https://arxiv.org/abs/1905.10044. arXiv:1905.10044

  46. [49]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL: https://arxiv.org/abs/1803.05457. arXiv:1803.05457

  47. [50]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    T. Mihaylov, P. Clark, T. Khot, A. Sabharwal, Can a suit of armor conduct electricity? a new dataset for open book question answering, arXiv preprint arXiv:1809.02789 (2018)

  48. [51]

    Y. Bisk, R. Zellers, R. L. Bras, J. Gao, Y. Choi, Piqa: Reasoning about physical commonsense in natural language, 2019. URL: https://arxiv.org/abs/1911.11641. arXiv:1911.11641

  49. [52]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, Winogrande: An adversarial winograd schema challenge at scale, 2019. URL: https://arxiv.org/abs/1907.10641. arXiv:1907.10641

  50. [53]

    L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, C. Potts, K. Sen, A. G. Dimakis, I. Stoica, D. Klein, M. Zaharia, O. Khattab, Gepa: Reflective prompt evolution can outperform reinforcement learning, 2025. URL: https://arxiv.org/abs/2507.19457. arXiv:2507.19457

  51. [54]

    J. Li, X. Dong, Y. Liu, Z. Yang, Q. Wang, X. Wang, S. Zhu, Z. Jia, Z. Zheng, Reflectevo: Improving meta introspection of small llms by learning self-reflection, 2025. URL: https://arxiv.org/abs/2505.16475. arXiv:2505.16475

  52. [55]

    L. Liu, C. Zhang, L. Wu, C. Zhao, Z. Hu, M. He, J. Fan, Instruct-of-reflection: Enhancing large language models iterative reflection capabilities via dynamic-meta instruction, 2025. URL: https://arxiv.org/abs/2503.00902. arXiv:2503.00902

  53. [56]

    TextGrad: Automatic "Differentiation" via Text

    M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, J. Zou, Textgrad: Automatic "differentiation" via text, 2024. URL: https://arxiv.org/abs/2406.07496. arXiv:2406.07496

  54. [57]

    X. Tang, Z. Lv, X. Cheng, J. Li, W. X. Zhao, Z. Wen, Z. Zhang, J. Zhou, Enhancing cross-task transfer of large language models via activation steering, 2025. URL: https://arxiv.org/abs/2507.13236. arXiv:2507.13236

  55. [58]

    T. Wu, J. Wang, Z. Zhao, N. Wong, Mixture-of-subspaces in low-rank adaptation, 2025. URL: https://arxiv.org/abs/2406.11909. arXiv:2406.11909

  56. [59]

    R. Wang, K. Dvijotham, I. R. Manchester, Norm-bounded low-rank adaptation, 2025. URL: https://arxiv.org/abs/2501.19050. arXiv:2501.19050

  57. [60]

    Z. Zhao, T. Shen, D. Zhu, Z. Li, J. Su, X. Wang, K. Kuang, F. Wu, Merging loras like playing lego: Pushing the modularity of lora to extremes through rank-wise clustering, 2024. URL: https://arxiv.org/abs/2409.16167. arXiv:2409.16167

  58. [61]

    L. Chen, M. Prabhudesai, K. Fragkiadaki, H. Liu, D. Pathak, Self-questioning language models. URL: https://arxiv.org/abs/2508.03682. arXiv:2508.03682

  60. [63]

    G. Zhang, F. Meng, G. Wan, Z. Li, K. Wang, Z. Yin, L. Bai, S. Yan, Latentevolve: Self-evolving test-time scaling in latent space, 2025. URL: https://arxiv.org/abs/2509.24771. arXiv:2509.24771

  61. [64]

    Reasoning with Sampling: Your Base Model is Smarter Than You Think

    A. Karan, Y. Du, Reasoning with sampling: Your base model is smarter than you think, 2025. URL: https://arxiv.org/abs/2510.14901. arXiv:2510.14901

  62. [65]

    Z. Wang, D. Ma, X. Huang, D. Cai, T. Lan, J. Xu, H. Mi, X. Tang, Y. Wang, The end of manual decoding: Towards truly end-to-end language models, 2025. URL: https://arxiv.org/abs/2510.26697. arXiv:2510.26697

  63. [66]

    A. Singh, J. D. Co-Reyes, R. Agarwal, A. Anand, P. Patil, X. Garcia, P. J. Liu, J. Harrison, J. Lee, K. Xu, A. Parisi, A. Kumar, A. Alemi, A. Rizkowsky, A. Nova, B. Adlam, B. Bohnet, G. Elsayed, H. Sedghi, I. Mordatch, I. Simpson, I. Gur, J. Snoek, J. Pennington, J. Hron, K. Kenealy, K. Swersky, K. Mahajan, L. Culp, L. Xiao, M. L. Bileschi, N. Constant, R...

  64. [67]

    S. Zheng, H. Wang, C. Huang, X. Wang, T. Chen, J. Fan, S. Hu, P. Ye, Decouple and orthogonalize: A data-free framework for lora merging, 2025. URL: https://arxiv.org/abs/2505.15875. arXiv:2505.15875

  65. [68]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al., Training verifiers to solve math word problems, arXiv preprint arXiv:2110.14168 (2021)

  66. [69]

    Measuring Mathematical Problem Solving With the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, J. Steinhardt, Measuring mathematical problem solving with the math dataset, arXiv preprint arXiv:2103.03874 (2021)

  67. [70]

    Z. Zhang, Z. Jiang, L. Xu, H. Hao, R. Wang, Multiple-choice questions are efficient and robust llm evaluators, arXiv preprint arXiv:2405.11966 (2024)

  68. [71]

    T. T. Chung, L. Liu, M. Yu, D.-Y. Yeung, Divlogiceval: A framework for benchmarking logical reasoning evaluation in large language models, arXiv preprint arXiv:2509.15587 (2025)

  69. [72]

    M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, Socialiqa: Commonsense reasoning about social interactions, arXiv preprint arXiv:1904.09728 (2019)

  70. [73]

    D. N. Manh, T. P. Chau, N. Le Hai, T. T. Doan, N. V. Nguyen, Q. Pham, N. D. Bui, Codemmlu: A multi-task benchmark for assessing code understanding capabilities of codellms, CoRR (2024)