
arxiv: 2605.21318 · v1 · pith:DETCQ2VLnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI · cs.LG

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

Pith reviewed 2026-05-21 04:57 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords: prompt optimization · large language models · distributional overfitting · regularization · out-of-distribution generalization · text-space optimization · representational inefficiency

The pith

TextReg mitigates prompt distributional overfitting by regularizing text-space optimization to control capacity cost and scope narrowness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that iterative prompt optimization for large language models often produces prompts that grow longer while accumulating narrow, sample-specific rules, and that these prompts generalize poorly to new data distributions. It frames this as prompt distributional overfitting and explains it through representational inefficiency, a measure whose two factors, capacity cost and scope narrowness, grow together during optimization. TextReg counters this with a soft-penalty approach that applies regularized textual gradients through three components: Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. A sympathetic reader would care because more stable prompts could make LLM reasoning reliable across varied inputs without re-optimizing the prompt for each new distribution. If the method works, it shows that explicit control over representation growth in discrete text space can preserve broad applicability while retaining task performance.

Core claim

The authors argue that prompt distributional overfitting reflects a lack of representation control in discrete text-space optimization and formalize this through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness. They attribute the failure mode to the coupled growth of these factors during iterative rewriting and propose TextReg as a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across reasoning benchmarks this yields substantial gains in out-of-distribution (OOD) generalization.
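To make the first component more concrete, here is a minimal sketch of what a dual-evidence filter could look like. It assumes that "dual evidence" means support within the current batch plus recurrence in a RuleBank-style counter, as the Figure 2 caption suggests. The function name `purify_rules`, the thresholds, and the OR-combination of the two evidence sources are hypothetical choices made for illustration and are not taken from the paper.

```python
from collections import Counter


def purify_rules(candidate_rules, batch_support, rulebank_counts,
                 min_batch_support=2, min_recurrence=2):
    """Keep a candidate rule only if at least one evidence source backs it.

    The OR-combination and both thresholds are assumptions for illustration.
    """
    kept = []
    for rule in candidate_rules:
        # Evidence 1: enough samples in the current batch motivated this rule.
        local_ok = batch_support.get(rule, 0) >= min_batch_support
        # Evidence 2: the rule has recurred across earlier iterations.
        recurrent_ok = rulebank_counts.get(rule, 0) >= min_recurrence
        if local_ok or recurrent_ok:
            kept.append(rule)
    return kept


# Toy inputs (invented): per-rule support in the current batch and a
# RuleBank-style recurrence counter.
candidates = [
    "Track every swap in order before answering.",
    "If Alice starts with the red ball, answer (B).",
]
batch_support = {candidates[0]: 5, candidates[1]: 1}
rulebank_counts = Counter({candidates[0]: 3})

kept = purify_rules(candidates, batch_support, rulebank_counts)
print(kept)
```

In this toy run the sample-specific rule about Alice is dropped because it has little batch support and no recurrence, while the broadly applicable rule survives.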

What carries the argument

Representational inefficiency, the dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, which the regularization framework targets to prevent coupled growth.
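One way to read a soft-penalty objective over this dual-factor measure is sketched below. This is an illustrative form only, not the paper's equation: the symbols $\mathcal{L}_{\text{task}}$, $C$, $S$, $\lambda_c$, and $\lambda_s$ are introduced here as assumptions, and this summary does not state whether the paper combines the two factors additively.

```latex
\min_{p}\; \mathcal{L}_{\text{task}}(p) \;+\; \lambda_c\, C(p) \;+\; \lambda_s\, S(p),
\qquad \lambda_c,\ \lambda_s \ge 0
```

Here $p$ is the prompt, $C(p)$ stands for capacity cost, and $S(p)$ stands for scope narrowness. Setting $\lambda_c = \lambda_s = 0$ recovers unregularized prompt optimization, which is the regime the paper associates with coupled growth of both factors.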

Load-bearing premise

Prompt distributional overfitting is caused by the coupled growth of capacity cost and scope narrowness, and the proposed regularization components can control this growth without introducing new biases or harming in-distribution performance.

What would settle it

Running the optimization on the same reasoning benchmarks and checking whether TextReg prompts remain shorter and achieve the reported OOD accuracy gains without drops in in-distribution accuracy compared to TextGrad and REVOLVE.
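The check described above can be sketched as a small comparison routine. The `RunResult` fields, the `compare` helper, and the `id_tolerance` threshold are hypothetical names chosen for this sketch, and the numbers are placeholders rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    prompt_tokens: int   # length of the final optimized prompt
    id_acc: float        # in-distribution accuracy (%)
    ood_acc: float       # out-of-distribution accuracy (%)


def compare(candidate, baseline, id_tolerance=0.5):
    """Report the three quantities the settling test asks about."""
    return {
        "shorter": candidate.prompt_tokens <= baseline.prompt_tokens,
        "ood_gain": candidate.ood_acc - baseline.ood_acc,
        # In-distribution accuracy may drop by at most id_tolerance points.
        "id_stable": baseline.id_acc - candidate.id_acc <= id_tolerance,
    }


# Placeholder numbers, not results from the paper.
regularized = RunResult(prompt_tokens=120, id_acc=89.8, ood_acc=62.0)
unregularized = RunResult(prompt_tokens=480, id_acc=90.0, ood_acc=55.0)
report = compare(regularized, unregularized)
print(report)
```

The claim would be supported only if all three entries point the same way: a shorter prompt, a positive OOD gain, and in-distribution accuracy within tolerance.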

Figures

Figures reproduced from arXiv: 2605.21318 by B. Aditya Prakash, Haibo Jin, Haohan Wang, Lucheng Fu, Ye Yu, Yiqiao Jin, Yiyang Wang.

Figure 1: Problem Illustration. We illustrate prompt distributional overfitting in prompt optimization: I) conventional methods often produce long prompts saturated with narrow rules (left), which degrade on OOD inputs. II) Our goal is to instead yield compact prompts composed of broadly applicable rules (right), achieving stronger OOD generalization.
Figure 2: Overview of TextReg, which proceeds in three stages. (a) Left: Dual-Evidence Gradient Purification filters the raw task gradient via local batch and RuleBank recurrence evidence, yielding g̃_task. (b) Middle: Semantic Edit Regularization diagnoses capacity and scope degradation and synthesizes the regularization gradient g_reg. (c) Right: Regularization-Guided Prompt Update rewrites p_t into p_{t+1} by selectin…
Figure 4: Resilience analysis of TextReg under role-wise engine degradation, where each of the three LLM-driven roles in the optimization pipeline is replaced one at a time with a weaker Qwen2.5-7B-Instruct model. For an in-depth analysis, please refer to Section 5.4.
Figure 3: Ablation study of the three core components of TextReg: Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Update. Each bar reports mean out-of-distribution accuracy (%) across four OOD tasks (Tracking Shuffled Objects 5/7 obj, Logical Deduction 5/7 obj) on each test engine. See Section 5.3 for analysis.
Abstract

Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes TextReg, a regularization framework for prompt optimization in LLMs to mitigate prompt distributional overfitting. It defines representational inefficiency as a dual-factor measure of capacity cost and scope narrowness, attributes overfitting to the coupled growth of the two during iterative text-space optimization, introduces three components (Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update) to realize a soft-penalty objective, and reports OOD accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE on reasoning benchmarks.

Significance. If the empirical results hold under proper controls and the mechanism is directly validated, the work could meaningfully advance prompt optimization methods by offering a principled regularization approach in discrete text space. The dual-factor decomposition of inefficiency provides a potentially useful analytical lens for prompt evolution, though its practical impact depends on whether the gains are shown to stem from the claimed control rather than incidental effects.

major comments (2)
  1. [Experiments] Experiments section: The paper reports only final OOD accuracies without intermediate measurements of capacity cost and scope narrowness on the evolving prompts, without ablations isolating each regularizer's effect on these factors, and without confirming that in-distribution performance remains stable. This leaves the central causal claim—that the three components control distributional overfitting via representational inefficiency—unsupported by direct evidence.
  2. [Abstract and §3 (Method)] Method and abstract: No information is supplied on experimental controls, statistical tests, dataset details, or whether regularization hyperparameters were tuned on the same data used for final OOD reporting. This undermines the reliability of the claimed gains of +11.8% and +16.5%.
minor comments (1)
  1. [§3] The notation for the soft-penalty objective and the three regularization terms could be clarified with explicit equations showing how they combine, to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the empirical support and methodological transparency in our work on TextReg. We address each major comment below and commit to revisions that directly respond to the concerns raised.

  1. Referee: [Experiments] Experiments section: The paper reports only final OOD accuracies without intermediate measurements of capacity cost and scope narrowness on the evolving prompts, without ablations isolating each regularizer's effect on these factors, and without confirming that in-distribution performance remains stable. This leaves the central causal claim—that the three components control distributional overfitting via representational inefficiency—unsupported by direct evidence.

    Authors: We agree that the current presentation focuses on final OOD accuracies and does not include the requested intermediate analyses. In the revised manuscript we will add plots tracking capacity cost and scope narrowness over optimization iterations for both TextReg and baselines. We will also include component-wise ablations measuring the isolated impact of Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update on these two factors. In-distribution accuracy will be reported alongside OOD results to confirm that gains do not come at the expense of in-distribution performance. These additions will provide direct evidence linking the proposed regularizers to the control of representational inefficiency.
    Revision: yes

  2. Referee: [Abstract and §3 (Method)] Method and abstract: No information is supplied on experimental controls, statistical tests, dataset details, or whether regularization hyperparameters were tuned on the same data used for final OOD reporting. This undermines the reliability of the claimed gains of +11.8% and +16.5%.

    Authors: We acknowledge the absence of these details in the submitted version. The revised manuscript will expand the experimental protocol section to specify: (i) use of a held-out validation split for hyperparameter selection that is disjoint from both in-distribution training and OOD test sets; (ii) statistical reporting with means and standard deviations across at least five random seeds together with appropriate significance tests; (iii) full dataset descriptions including sizes, sources, and OOD construction procedures; and (iv) explicit confirmation that regularization hyperparameters were never tuned on the final OOD evaluation data. These clarifications will substantiate the reliability of the reported improvements.
    Revision: yes
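The statistical reporting promised in item (ii) could look like the following sketch, using only the Python standard library. The choice of an exact paired sign-flip permutation test is an assumption made here, since the rebuttal says only "appropriate significance tests". The per-seed accuracies are placeholders, not numbers from the paper.

```python
from itertools import product
from statistics import mean, stdev


def summarize(accs):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    return mean(accs), stdev(accs)


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired per-seed differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-seed OOD accuracies (%), one entry per seed, same seed order.
method_a = [60.0, 62.0, 61.0, 63.0, 59.0]
method_b = [55.0, 56.0, 54.0, 57.0, 53.0]

mean_a, std_a = summarize(method_a)
p_value = paired_sign_flip_pvalue(method_a, method_b)
print(f"A: {mean_a:.1f} +/- {std_a:.2f}, p = {p_value:.4f}")
```

One design point worth noting: with exactly five paired seeds, the smallest two-sided p-value this exact test can produce is 2/32 = 0.0625, so a 0.05 threshold is unreachable. A protocol that relies on such a test would need more than five seeds.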

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via new definitions and empirical claims


The paper introduces representational inefficiency as a dual-factor decomposition (capacity cost and scope narrowness) to formalize prompt distributional overfitting, then proposes three regularization components (Dual-Evidence Gradient Purification, Semantic Edit Regularization, Regularization-Guided Prompt Update) to realize a soft-penalty objective. These steps are presented as additive innovations rather than reductions of prior fitted quantities or self-citations. The central claims rest on reported OOD accuracy gains across benchmarks, without any equations or mechanisms shown to be equivalent to inputs by construction. No load-bearing self-citation chains, fitted-input predictions, or ansatz smuggling are evident in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract relies on the domain assumption that iterative LLM feedback optimization naturally produces distributional overfitting and introduces the new concept of representational inefficiency without external validation.

axioms (1)
  • domain assumption: LLMs are highly sensitive to the prompts used to specify task objectives and behavioral constraints
    Opening statement of the abstract that underpins the entire optimization problem.
invented entities (1)
  • representational inefficiency (no independent evidence)
    purpose: Dual-factor measure decomposing prompt inefficiency into capacity cost and scope narrowness
    Newly defined quantity used to attribute distributional overfitting to coupled growth during optimization.

pith-pipeline@v0.9.0 · 5734 in / 1254 out tokens · 32405 ms · 2026-05-21T04:57:48.036024+00:00 · methodology


