DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
hub Mixed citations
EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers
Mixed citation behavior. Most common role is background (43%).
abstract
Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Presents a query-complexity framework for genetic algorithms with guided operators and shows necessity of multiple operators and tight bounds for diversity in solution pools.
LATTEArena introduces a 6-dimensional taxonomy and modular evaluation framework for LLM-powered tabular feature engineering, benchmarking 24 configurations to produce 17 findings on cost-effectiveness trade-offs with public artifacts.
TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
DEJA uses evolutionary optimization guided by an LLM-based Answer Utility Score to induce soft-failure responses in RAG systems, achieving over 79% soft attack success rate with under 15% hard failures and high stealth across models and datasets.
Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA and fact-checking datasets.
PromptEvolver recovers high-fidelity natural language prompts for given images by evolving them via genetic algorithm guided by a vision-language model, outperforming prior methods on benchmarks.
RAG-HAR combines retrieval-augmented generation with LLMs to deliver state-of-the-art human activity recognition across six benchmarks without any model training or fine-tuning.
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.
Proposes compiling preference pairs into readable natural-language specifications for inference-time LLM alignment, claiming outperformance over DPO on dense-preference domains.
iPOE generates and optimizes annotation guidelines from explanations to produce interpretable prompts, reporting up to 39% gains over baselines on four datasets with LLM explanations substituting for human ones.
NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.
OpenDeepThink uses Bradley-Terry aggregation of LLM pairwise judgments to rank and evolve parallel reasoning traces, improving Gemini 3.1 Pro Codeforces Elo by 405 points over eight rounds.
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and outperforms static baselines on GAIA, HLE, and DeepResearcher.
FitText embeds evolutionary retrieval of tool descriptions into the agent loop, yielding 2.7-10.6 point NDCG@5 gains on ToolRet and 26.7-point pass-rate gains on StableToolBench.
AgentGA optimizes agent seeds with genetic algorithms and parent-archive inheritance to improve autonomous code generation, beating a baseline on 15 of 16 Kaggle competitions.
MemJack achieves 71.48% attack success rate on unmodified COCO val2017 images against Qwen3-VL-Plus by coordinating agents to map visual entities to malicious intents, apply multi-angle camouflage, and filter refusals via iterative nullspace projection while transferring strategies through a shared
Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.
ToxSearch-S applies unsupervised speciation to evolutionary prompt search, maintaining capacity-limited species with exemplar leaders and species-aware selection to achieve higher peak toxicity and broader semantic coverage than standard methods.
FELA deploys specialized LLM agents in an evolutionary framework to generate, validate, and refine explainable features from heterogeneous industrial event logs, improving downstream model performance.
LLM-FE is a framework that treats feature engineering as LLM-driven program search with data feedback, reporting consistent gains over baselines on classification and regression tabular tasks.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.