TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

B. Aditya Prakash; Haibo Jin; Haohan Wang; Lucheng Fu; Ye Yu; Yiqiao Jin; Yiyang Wang

arXiv: 2605.21318 · v1 · pith:DETCQ2VLnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI · cs.LG

Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin, B. Aditya Prakash, Haohan Wang

Pith reviewed 2026-05-21 04:57 UTC · model grok-4.3

keywords: prompt optimization, large language models, distributional overfitting, regularization, out-of-distribution generalization, text-space optimization, representational inefficiency

The pith

TextReg mitigates prompt distributional overfitting by regularizing text-space optimization to control capacity cost and scope narrowness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that iterative prompt optimization for large language models often leads to prompts that grow longer while accumulating narrow, sample-specific rules, resulting in poor generalization to new data distributions. It frames this as prompt distributional overfitting arising from representational inefficiency, measured as the coupled rise of capacity cost and scope narrowness during optimization. TextReg counters this with a soft-penalty approach that applies regularized textual gradients through three components: Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. A sympathetic reader would care because more stable prompts could make LLM reasoning tasks reliable across varied inputs without repeated retraining. If the method works, it shows that explicit control over representation growth in discrete text space can preserve broad applicability while retaining task performance.

Core claim

The authors argue that prompt distributional overfitting reflects a lack of representation control in discrete text-space optimization and formalize this through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness. They attribute the failure mode to the coupled growth of these factors during iterative rewriting and propose TextReg as a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across reasoning benchmarks this yields substantial gains in out

What carries the argument

Representational inefficiency, the dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, which the regularization framework targets to prevent coupled growth.

Load-bearing premise

Prompt distributional overfitting is caused by the coupled growth of capacity cost and scope narrowness, and the proposed regularization components can control this growth without introducing new biases or harming in-distribution performance.

What would settle it

Running the optimization on the same reasoning benchmarks and checking whether TextReg prompts remain shorter and achieve the reported OOD accuracy gains without drops in in-distribution accuracy compared to TextGrad and REVOLVE.
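The check described above can be sketched as a small comparison routine. This is only an illustration under stated assumptions: the `RunSummary` record, the `settles_claim` function, the `id_tolerance` threshold, and every number in the usage lines are hypothetical placeholders, not values or code from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical summary of one optimizer's final prompt on a benchmark."""
    method: str
    prompt_tokens: int    # length of the final optimized prompt
    id_accuracy: float    # in-distribution accuracy, in [0, 1]
    ood_accuracy: float   # out-of-distribution accuracy, in [0, 1]


def settles_claim(textreg, baselines, id_tolerance=0.01):
    """Return True only if, against every baseline, the TextReg run is
    shorter, more accurate OOD, and no worse in-distribution than the
    baseline minus `id_tolerance` (an assumed threshold)."""
    return all(
        textreg.prompt_tokens < b.prompt_tokens
        and textreg.ood_accuracy > b.ood_accuracy
        and textreg.id_accuracy >= b.id_accuracy - id_tolerance
        for b in baselines
    )


# Made-up placeholder numbers, for illustration only.
textreg_run = RunSummary("TextReg", 180, 0.85, 0.70)
baseline_runs = [
    RunSummary("TextGrad", 420, 0.84, 0.60),
    RunSummary("REVOLVE", 510, 0.83, 0.55),
]
verdict = settles_claim(textreg_run, baseline_runs)
```

Requiring all three conditions at once mirrors the wording of the test: a gain in OOD accuracy alone would not settle the question if the prompt grew or in-distribution accuracy dropped.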

Figures

Figures reproduced from arXiv: 2605.21318 by B. Aditya Prakash, Haibo Jin, Haohan Wang, Lucheng Fu, Ye Yu, Yiqiao Jin, Yiyang Wang.

**Figure 1.** Problem Illustration. We illustrate prompt distributional overfitting in prompt optimization: (I) conventional methods often produce long prompts saturated with narrow rules (left), which degrade on OOD inputs; (II) our goal is instead to yield compact prompts composed of broadly applicable rules (right), achieving stronger OOD generalization. In classical machine learning, overfitting is commonly mitigate…

**Figure 2.** Overview of TextReg, which proceeds in three stages. (a) Left: Dual-Evidence Gradient Purification filters the raw task gradient via local batch and RuleBank recurrence evidence, yielding g̃_task. (b) Middle: Semantic Edit Regularization diagnoses capacity and scope degradation and synthesizes the regularization gradient g_reg. (c) Right: Regularization-Guided Prompt Update rewrites p_t into p_{t+1} by selectin…

**Figure 3.** Ablation study of the three core components of TextReg: Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Update. Each bar reports mean out-of-distribution accuracy (%) across four OOD tasks (Tracking Shuffled Objects 5/7 obj, Logical Deduction 5/7 obj) on each test engine. See Section 5.3 for analysis. To address Q2, we disable each of TextReg’s three components in turn ( …

**Figure 4.** Resilience analysis of TextReg under role-wise engine degradation, where each of the three LLM-driven roles in the optimization pipeline is replaced one at a time with a weaker Qwen2.5-7B-Instruct model. For an in-depth analysis, please refer to Section 5.4.

Original abstract

Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TextReg adds a three-part regularization scheme to text prompt optimization and claims solid OOD gains, but the experiments skip direct checks on the claimed mechanism.

The paper's main contribution is a regularization framework called TextReg that targets prompt distributional overfitting in iterative text-space optimization. It defines representational inefficiency as the product of capacity cost and scope narrowness, then applies three components (Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update) to enforce a soft penalty during prompt rewriting. The reported results show OOD accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE on reasoning benchmarks, which is the clearest practical signal so far.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes TextReg, a regularization framework for prompt optimization in LLMs to mitigate prompt distributional overfitting. It defines representational inefficiency as a dual-factor measure combining capacity cost and scope narrowness, attributes overfitting to the coupled growth of these two factors during iterative text-space optimization, introduces three components (Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update) to realize a soft-penalty objective, and reports OOD accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE on reasoning benchmarks.

Significance. If the empirical results hold under proper controls and the mechanism is directly validated, the work could meaningfully advance prompt optimization methods by offering a principled regularization approach in discrete text space. The dual-factor decomposition of inefficiency provides a potentially useful analytical lens for prompt evolution, though its practical impact depends on whether the gains are shown to stem from the claimed control rather than incidental effects.

major comments (2)

[Experiments] Experiments section: The paper reports only final OOD accuracies without intermediate measurements of capacity cost and scope narrowness on the evolving prompts, without ablations isolating each regularizer's effect on these factors, and without confirming that in-distribution performance remains stable. This leaves the central causal claim—that the three components control distributional overfitting via representational inefficiency—unsupported by direct evidence.
[Abstract and §3 (Method)] Method and abstract: No information is supplied on experimental controls, statistical tests, dataset details, or whether regularization hyperparameters were tuned on the same data used for final OOD reporting. This undermines the reliability of the claimed gains of +11.8% and +16.5%.

minor comments (1)

[§3] The notation for the soft-penalty objective and the three regularization terms could be clarified with explicit equations showing how they combine, to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the empirical support and methodological transparency in our work on TextReg. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Experiments] Experiments section: The paper reports only final OOD accuracies without intermediate measurements of capacity cost and scope narrowness on the evolving prompts, without ablations isolating each regularizer's effect on these factors, and without confirming that in-distribution performance remains stable. This leaves the central causal claim—that the three components control distributional overfitting via representational inefficiency—unsupported by direct evidence.

Authors: We agree that the current presentation focuses on final OOD accuracies and does not include the requested intermediate analyses. In the revised manuscript we will add plots tracking capacity cost and scope narrowness over optimization iterations for both TextReg and baselines. We will also include component-wise ablations measuring the isolated impact of Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update on these two factors. In-distribution accuracy will be reported alongside OOD results to confirm that gains do not come at the expense of in-distribution performance. These additions will provide direct evidence linking the proposed regularizers to the control of representational inefficiency.

revision: yes

Referee: [Abstract and §3 (Method)] Method and abstract: No information is supplied on experimental controls, statistical tests, dataset details, or whether regularization hyperparameters were tuned on the same data used for final OOD reporting. This undermines the reliability of the claimed gains of +11.8% and +16.5%.

Authors: We acknowledge the absence of these details in the submitted version. The revised manuscript will expand the experimental protocol section to specify: (i) use of a held-out validation split for hyperparameter selection that is disjoint from both in-distribution training and OOD test sets; (ii) statistical reporting with means and standard deviations across at least five random seeds together with appropriate significance tests; (iii) full dataset descriptions including sizes, sources, and OOD construction procedures; and (iv) explicit confirmation that regularization hyperparameters were never tuned on the final OOD evaluation data. These clarifications will substantiate the reliability of the reported improvements.

revision: yes
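As an illustration of item (ii), the sketch below reports a mean and standard deviation across seeds and runs an exact paired sign-flip permutation test. The choice of test is an assumption made here for illustration (the rebuttal only says "appropriate significance tests"), the function names are hypothetical, and the per-seed accuracies are made-up placeholders rather than results from the paper.

```python
import itertools
import statistics


def summarize(accuracies):
    """Mean and sample standard deviation across random seeds."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def paired_sign_flip_pvalue(method_a, method_b):
    """Exact two-sided paired sign-flip permutation test.

    Under the null hypothesis the sign of each per-seed difference is
    arbitrary, so all 2^n sign assignments are enumerated and the
    fraction whose absolute summed difference is at least as large as
    the observed one is returned.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small slack for float rounding
            extreme += 1
    return extreme / total


# Made-up per-seed OOD accuracies for five seeds (placeholders only).
regularized = [0.70, 0.72, 0.71, 0.69, 0.73]
baseline = [0.60, 0.61, 0.59, 0.62, 0.58]
mean_acc, std_acc = summarize(regularized)
p_value = paired_sign_flip_pvalue(regularized, baseline)
```

One consequence of this particular test: with five seeds there are only 32 sign assignments, so the smallest attainable two-sided p-value is 2/32 = 0.0625. A protocol that wants to clear a 0.05 threshold with this test would need more seeds or a different test.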

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via new definitions and empirical claims

The paper introduces representational inefficiency as a dual-factor decomposition (capacity cost and scope narrowness) to formalize prompt distributional overfitting, then proposes three regularization components (Dual-Evidence Gradient Purification, Semantic Edit Regularization, Regularization-Guided Prompt Update) to realize a soft-penalty objective. These steps are presented as additive innovations rather than reductions of prior fitted quantities or self-citations. The central claims rest on reported OOD accuracy gains across benchmarks, without any equations or mechanisms shown to be equivalent to inputs by construction. No load-bearing self-citation chains, fitted-input predictions, or ansatz smuggling are evident in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract relies on the domain assumption that iterative LLM feedback optimization naturally produces distributional overfitting and introduces the new concept of representational inefficiency without external validation.

axioms (1)

domain assumption: LLMs are highly sensitive to the prompts used to specify task objectives and behavioral constraints
Opening statement of the abstract that underpins the entire optimization problem.

invented entities (1)

representational inefficiency (no independent evidence)
purpose: Dual-factor measure decomposing prompt inefficiency into capacity cost and scope narrowness
Newly defined quantity used to attribute distributional overfitting to coupled growth during optimization.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We define the representational inefficiency of a prompt as $I(p) = |p|_{\mathrm{tok}} \cdot (1 - \bar{s}(p))$ … multiplicative form emphasizes a mutually amplifying interaction between the two"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "$\min_p \; \mathcal{L}_{\mathcal{D}_{\mathrm{train}}}(p) + \lambda\, I(p)$"
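
The two quoted passages can be read together as a penalized objective: a training loss plus λ times the inefficiency I(p), where I(p) multiplies the token count by one minus an average scope score. The sketch below is a minimal illustration under several assumptions that are not confirmed by the text above: whitespace splitting stands in for a real tokenizer, the average scope score s̄(p) is taken as a number in [0, 1] supplied by the caller, and all function names are hypothetical.

```python
def count_tokens(prompt):
    """Stand-in for |p|_tok: whitespace tokens (assumption; a real
    tokenizer would give different counts)."""
    return len(prompt.split())


def inefficiency(prompt, mean_scope):
    """I(p) = |p|_tok * (1 - mean_scope).

    `mean_scope` plays the role of the average scope score of the
    prompt's rules, assumed here to lie in [0, 1] with 1 meaning
    broadly applicable.
    """
    if not 0.0 <= mean_scope <= 1.0:
        raise ValueError("mean_scope must be in [0, 1]")
    return count_tokens(prompt) * (1.0 - mean_scope)


def penalized_objective(train_loss, prompt, mean_scope, lam):
    """Soft-penalty objective: train_loss + lam * I(p)."""
    return train_loss + lam * inefficiency(prompt, mean_scope)


# Hypothetical prompts: a short broad one and a long narrow one.
compact = "Track each swap step by step"
bloated = compact + " and if Alice holds the red ball on turn three then answer B"
i_compact = inefficiency(compact, mean_scope=0.75)
i_bloated = inefficiency(bloated, mean_scope=0.25)
```

Because the two factors are multiplied, a prompt that is both longer and narrower is penalized more than the sum of the two changes would suggest, which is one way to read the "mutually amplifying interaction" in the quoted passage.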

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 13 internal anchors

[1]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

work page 1901
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Mascot: Towards multi-agent socio- collaborative companion systems.arXiv:2601.14230, 2026

Yiyang Wang, Yiqiao Jin, Alex Cabral, and Josiah Hester. Mascot: Towards multi-agent socio- collaborative companion systems.arXiv:2601.14230, 2026

work page arXiv 2026
[6]

Companioncast: A multi-agent conversational ai framework with spatial audio for social co-viewing experiences.ACM CHI 2026 Workshop on Human-Agent Collaboration, 2026

Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball, Alex Cabral, and Josiah Hester. Companioncast: A multi-agent conversational ai framework with spatial audio for social co-viewing experiences.ACM CHI 2026 Workshop on Human-Agent Collaboration, 2026

work page 2026
[7]

gradient descent

Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with “gradient descent” and beam search. InEMNLP, pages 7957–7968, 2023

work page 2023
[8]

Teach better or show smarter? on instructions and exemplars in automatic prompt optimization.NeurIPS, 37:58174–58244, 2024

Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan Ö Arık. Teach better or show smarter? on instructions and exemplars in automatic prompt optimization.NeurIPS, 37:58174–58244, 2024

work page 2024
[9]

TextGrad: Automatic "Differentiation" via Text

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic" differentiation" via text.arXiv preprint arXiv:2406.07496, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Prosa: Assessing and understanding the prompt sensitivity of llms

Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. Prosa: Assessing and understanding the prompt sensitivity of llms. InFindings of the Association for Compu- tational Linguistics: EMNLP 2024, pages 1950–1976, 2024. 11

work page 2024
[11]

Self-regulating prompts: Foundational model adaptation without forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. InICCV, pages 15190–15200, 2023

work page 2023
[12]

Same task, more tokens: the impact of input length on the reasoning performance of large language models

Mosh Levy, Alon Jacoby, and Yoav Goldberg. Same task, more tokens: the impact of input length on the reasoning performance of large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15339–15353, 2024

work page 2024
[13]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

work page 2024
[14]

Sara: Selective and adaptive retrieval-augmented generation with context compression

Yiqiao Jin, Kartik Sharma, Vineeth Rakesh, Yingtong Dou, Menghai Pan, Mahashweta Das, and Srijan Kumar. Sara: Selective and adaptive retrieval-augmented generation with context compression. InACL, 2026

work page 2026
[15]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[16]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Dis- entangling computation from reasoning for numerical reasoning tasks.arXiv preprint arXiv:2211.12588, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Tree of thoughts: Deliberate problem solving with large language models.Advances in neural informa- tion processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural informa- tion processing systems, 36:11809–11822, 2023

work page 2023
[20]

Autoprompt: Eliciting knowledge from language models with automatically generated prompts

Taylor Shin, Yasaman Razeghi, Robert L Logan IV , Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 4222–4235, 2020

work page 2020
[21]

Rlprompt: Optimizing discrete text prompts with reinforcement learning

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3369–3391, 2022

work page 2022
[22]

Large language models are human-level prompt engineers

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. InThe eleventh international conference on learning representations, 2022

work page 2022
[23]

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers.arXiv preprint arXiv:2309.08532, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rock- täschel. Promptbreeder: Self-referential self-improvement via prompt evolution.arXiv preprint arXiv:2309.16797, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vard- hamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Zhang, H

Peiyan Zhang, Haibo Jin, Leyang Hu, Xinnuo Li, Liying Kang, Man Luo, Yangqiu Song, and Haohan Wang. Revolve: Optimizing ai systems by tracking response evolution in textual optimization.arXiv preprint arXiv:2412.03092, 2024

work page arXiv 2024
[27]

Sipdo: Closed-loop prompt optimization via synthetic data feedback.arXiv preprint arXiv:2505.19514, 2025

Yaoning Yu, Ye Yu, Peiyan Zhang, Kai Wei, Haojing Luo, and Haohan Wang. Sipdo: Closed-loop prompt optimization via synthetic data feedback.arXiv preprint arXiv:2505.19514, 2025

work page arXiv 2025
[28]

Robust prompt optimization for large language models against distribution shifts

Moxin Li, Wenjie Wang, Fuli Feng, Yixin Cao, Jizhi Zhang, and Tat-Seng Chua. Robust prompt optimization for large language models against distribution shifts. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1539–1554, 2023

work page 2023
[29]

Beyond magic words: Sharpness-aware prompt evolving for robust large language models with tare.arXiv preprint arXiv:2509.24130, 2025

Guancheng Wan, Lucheng Fu, Haoxin Liu, Yiqiao Jin, Hui Yi Leong, Eric Hanchen Jiang, Hejia Geng, Jinhe Bi, Yunpu Ma, Xiangru Tang, et al. Beyond magic words: Sharpness-aware prompt evolving for robust large language models with tare.arXiv preprint arXiv:2509.24130, 2025

work page arXiv 2025
[30]

Dlpo: Towards a robust, efficient, and generalizable prompt optimization framework from a deep-learning perspective.arXiv preprint arXiv:2503.13413, 2025

Dengyun Peng, Yuhang Zhou, Qiguang Chen, Jinhao Liu, Jingjing Chen, Libo Qin, and Wanxiang Che. Dlpo: Towards a robust, efficient, and generalizable prompt optimization framework from a deep-learning perspective.arXiv preprint arXiv:2503.13413, 2025

work page arXiv 2025
[31]

Reflection-enhanced meta-optimization integrating textgrad-style prompt optimization with memory-driven self-evolution.arXiv preprint arXiv:2508.18749, 2025

Chunlong Wu and Zhibo Qu. Reflection-enhanced meta-optimization integrating textgrad-style prompt optimization with memory-driven self-evolution.arXiv preprint arXiv:2508.18749, 2025

work page arXiv 2025
[32]

Ridge regression: Biased estimation for nonorthogonal problems.Technometrics, 12(1):55–67, 1970

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems.Technometrics, 12(1):55–67, 1970

work page 1970
[33]

A simple weight decay can improve generalization.Advances in neural information processing systems, 4, 1991

Anders Krogh and John Hertz. A simple weight decay can improve generalization.Advances in neural information processing systems, 4, 1991

work page 1991
[34]

Regularization of neural networks using dropconnect

Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. InInternational conference on machine learning, pages 1058–1066. PMLR, 2013

work page 2013
[35]

Dropout: a simple way to prevent neural networks from overfitting.The journal of machine learning research, 15(1):1929–1958, 2014

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.The journal of machine learning research, 15(1):1929–1958, 2014

work page 1929
[36]

Regression shrinkage and selection via the lasso.Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996

Robert Tibshirani. Regression shrinkage and selection via the lasso.Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996

work page 1996
[37]

Regularization and variable selection via the elastic net.Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2):301–320, 2005

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net.Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2):301–320, 2005

work page 2005
[38]

Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping.Advances in neural information processing systems, 13, 2000

Rich Caruana, Steve Lawrence, and C Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping.Advances in neural information processing systems, 13, 2000

work page 2000
[39]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021

work page 2021
[40]

The power of scale for parameter-efficient prompt tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 3045–3059, 2021. 13

work page 2021
[41]

P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks

Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68, 2022

work page 2022
[42]

Challenging big-bench tasks and whether chain-of-thought can solve them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, 2023

work page 2023
[43]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on machine learning research, 2023

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on machine learning research, 2023

work page 2023
[44]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[45]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? InProceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 2080–2094, 2021

work page 2021
[46]

Solving general arithmetic word problems

Subhro Roy and Dan Roth. Solving general arithmetic word problems. InProceedings of the 2015 conference on empirical methods in natural language processing, pages 1743–1752, 2015

work page 2015
[47]

Mawps: A math word problem repository

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. Mawps: A math word problem repository. InProceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, pages 1152–1157, 2016

work page 2016
[48]

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024
[49]

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024
[50]

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
[51]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv e-prints, pages arXiv–2412, 2024
[52]

OpenAI. GPT-4o, 2025
[53]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

A Ethical Considerations and Broader Impact

This work proposes TextReg, a regularization framework for prompt optimization, and is methodological rathe...
[54]


Extract mid-level canonical {rule_scope} rules from the raw gradient. Remove references to specific entities, exact numbers, or particular examples. Preserve structural {rule_patterns}. Keep rules at mid-level abstraction

[55]

For each extracted rule, compare it with the existing RuleBank. If semantically equivalent to an existing rule (same structural pattern, not just similar wording), output an INCREMENT operation with that rule’s ID. Otherwise, output an INSERT operation with the canonical description. Input: [CURRENT RULEBANK] {rulebank_summary}; [RAW GRADIENT] {raw_gradient}. ...
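The INCREMENT/INSERT bookkeeping described in this prompt can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Rule` and `RuleBank` classes, the dictionary shape of an operation, the starting count of 1, and the summary format are all assumptions made for illustration, since the prompt's output format is truncated in the source. The semantic-equivalence judgment itself is left to the language model; the sketch only applies the operations it returns.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    rule_id: int
    description: str
    count: int = 1  # assumed support count; starts at 1 on insertion


@dataclass
class RuleBank:
    rules: dict = field(default_factory=dict)  # maps rule_id -> Rule
    next_id: int = 1

    def apply(self, op: dict) -> Rule:
        """Apply one operation of the assumed form {"op": ..., ...}."""
        kind = op["op"]
        if kind == "INCREMENT":
            # The model judged the new rule equivalent to an existing one.
            rule = self.rules[op["rule_id"]]  # KeyError if the ID is unknown
            rule.count += 1
            return rule
        if kind == "INSERT":
            # No equivalent rule exists, so store the canonical description.
            rule = Rule(self.next_id, op["description"])
            self.rules[rule.rule_id] = rule
            self.next_id += 1
            return rule
        raise ValueError(f"unknown operation: {kind!r}")

    def summary(self) -> str:
        """Render the bank as text, e.g. to fill {rulebank_summary}."""
        return "\n".join(
            f"[{r.rule_id}] (x{r.count}) {r.description}"
            for r in self.rules.values()
        )


bank = RuleBank()
bank.apply({"op": "INSERT", "description": "Verify units before combining quantities"})
bank.apply({"op": "INCREMENT", "rule_id": 1})
print(bank.summary())
```

Keeping a per-rule count is one plausible way to tell recurring rules from one-off, sample-specific ones; how the paper actually uses such counts is not stated in this excerpt.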


