arxiv: 2310.03714 · v1 · submitted 2023-10-05 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Pith reviewed 2026-05-11 18:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords DSPy · language model pipelines · declarative modules · prompt optimization · self-bootstrapping · compiler · few-shot prompting · performance improvement

The pith

DSPy turns a few lines of declarative code into language model pipelines that self-optimize and outperform few-shot and expert prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DSPy represents LM pipelines as graphs of declarative modules that invoke language models and can learn parameters by collecting their own demonstrations. A compiler then searches over possible module configurations to maximize a user-specified metric. This structure lets short programs build and improve sophisticated pipelines for math word problems, multi-hop retrieval, complex question answering, and agent loops. A sympathetic reader would care because the method replaces manual trial-and-error prompt writing with systematic, automatic optimization. Experiments show that compiled pipelines using GPT-3.5 or Llama2-13b-chat exceed standard few-shot baselines by large margins and often beat expert-written demonstrations.

Core claim

DSPy abstracts LM pipelines as text transformation graphs in which LMs are called through declarative, parameterized modules. The compiler optimizes any such pipeline for a given metric by automatically generating demonstrations and searching over module configurations and compositions of prompting, reasoning, and augmentation techniques. Succinct DSPy programs thereby produce pipelines that, after compilation, outperform standard few-shot prompting and expert-created demonstrations on tasks including math reasoning and multi-hop QA.

What carries the argument

Parameterized DSPy modules inside computational graphs, together with a compiler that collects demonstrations and searches configurations to maximize a target metric.
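
The idea of a module that learns by collecting its own demonstrations can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the DSPy API: `toy_lm`, `ToyModule`, `bootstrap_compile`, and the formatting quirk of the fake model are all hypothetical names and behavior invented here. The sketch covers only demonstration collection filtered by a metric; it does not attempt the search over module configurations that the paper also describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (question, gold answer)
Metric = Callable[[str, str], bool]  # (gold, prediction) -> passed?


def toy_lm(question: str, demos: List[Example]) -> str:
    """Hypothetical stand-in for an LM call. It always adds correctly,
    but without demonstrations it only uses the bare-number format
    when the sum is even (an arbitrary quirk chosen for illustration)."""
    a, b = (int(tok) for tok in question.split("+"))
    total = a + b
    if demos or total % 2 == 0:
        return str(total)
    return f"The answer is {total}."


@dataclass
class ToyModule:
    """Parameterized module: its only learnable parameter is `demos`."""
    demos: List[Example] = field(default_factory=list)

    def __call__(self, question: str) -> str:
        return toy_lm(question, self.demos)


def exact_match(gold: str, pred: str) -> bool:
    return gold == pred


def bootstrap_compile(module: ToyModule, trainset: List[Example],
                      metric: Metric, max_demos: int = 2) -> ToyModule:
    """Run the module on training inputs and keep only the
    self-generated outputs that pass the metric as demonstrations."""
    demos: List[Example] = []
    for question, gold in trainset:
        pred = module(question)
        if metric(gold, pred):
            demos.append((question, pred))
        if len(demos) >= max_demos:
            break
    return ToyModule(demos=demos)


def accuracy(module: ToyModule, dataset: List[Example], metric: Metric) -> float:
    return sum(metric(gold, module(q)) for q, gold in dataset) / len(dataset)


trainset = [("1+2", "3"), ("2+2", "4"), ("3+5", "8")]
devset = [("4+5", "9"), ("6+1", "7"), ("2+6", "8"), ("10+3", "13")]

zero_shot = ToyModule()
compiled = bootstrap_compile(zero_shot, trainset, exact_match)

print("zero-shot accuracy:", accuracy(zero_shot, devset, exact_match))
print("compiled accuracy: ", accuracy(compiled, devset, exact_match))
```

On this toy data the zero-shot module scores 0.25 on the dev set and the compiled module scores 1.0, purely because of how the fake model was constructed. The numbers say nothing about real LMs; the point is the control flow of run, filter by metric, and reuse as demonstrations.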

If this is right

  • Succinct DSPy programs can express and optimize complex pipelines for reasoning, retrieval, and control tasks.
  • Open models as small as 770M-parameter T5 become competitive with expert prompt chains written for proprietary GPT-3.5.
  • The same declarative program can be recompiled for different metrics or models without rewriting prompts.
  • Models can self-bootstrap training data and improve their own performance on the target task within minutes.
  • Pipeline development shifts from hand-crafted strings to declarative code plus automatic optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could lower the expertise barrier for building reliable LM applications by automating much of the prompt engineering.
  • Compiled pipelines might adapt more readily to new domains if the compiler is given additional unlabeled data or metrics.
  • Extending the same declarative graph structure to multimodal or tool-using agents would be a natural next step.
  • Combining the compiler with lightweight fine-tuning on the collected demonstrations could further improve small-model performance.

Load-bearing premise

Automatic search over module configurations driven by collected demonstrations will reliably locate high-performing pipelines without overfitting to the validation metric or demanding prohibitive compute.

What would settle it

On a new task, the DSPy compiler produces a pipeline whose accuracy is no higher than that of a standard few-shot prompt baseline using the same underlying language model.
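
One way to make such a comparison concrete is a paired bootstrap over per-example outcomes on the same test questions. The sketch below is an assumption about how a reader might run the check, not a procedure taken from the paper; the function name `paired_bootstrap_ci` and the outcome vectors are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(compiled_correct: Sequence[bool],
                        baseline_correct: Sequence[bool],
                        resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed accuracy difference, CI lower, CI upper) for
    compiled minus baseline, resampling test examples with replacement."""
    if len(compiled_correct) != len(baseline_correct):
        raise ValueError("both systems must be scored on the same examples")
    n = len(compiled_correct)
    # Per-example difference: +1 compiled-only win, -1 baseline-only win, 0 tie.
    diffs: List[int] = [int(c) - int(b)
                        for c, b in zip(compiled_correct, baseline_correct)]
    rng = random.Random(seed)
    means = []
    for _ in range(resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * resamples)]
    upper = means[int((1 - alpha / 2) * resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical outcomes on 100 shared test questions (illustration only).
compiled_outcomes = [True] * 70 + [False] * 30
baseline_outcomes = [True] * 50 + [False] * 50

observed, lower, upper = paired_bootstrap_ci(compiled_outcomes, baseline_outcomes)
print(f"accuracy difference: {observed:.2f}, 95% CI: [{lower:.2f}, {upper:.2f}]")
```

Under this reading, an interval whose upper bound sits at or below zero on a new task would count against the claim, while an interval that excludes zero from below would support it for that task.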

Original abstract

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded "prompt templates", i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric. We conduct two case studies, showing that succinct DSPy programs can express and optimize sophisticated LM pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, a few lines of DSPy allow GPT-3.5 and llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively). On top of that, DSPy programs compiled to open and relatively small LMs like 770M-parameter T5 and llama2-13b-chat are competitive with approaches that rely on expert-written prompt chains for proprietary GPT-3.5. DSPy is available at https://github.com/stanfordnlp/dspy

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DSPy, a programming model that represents LM pipelines as imperative computational graphs of declarative, parameterized modules. These modules learn by collecting demonstrations to compose prompting, reasoning, and other techniques. A compiler optimizes any DSPy program for a given metric via bootstrap search over module configurations and auto-generated demonstrations. Two case studies demonstrate that short DSPy programs enable GPT-3.5 and Llama-2-13B-chat to self-improve pipelines for math word problems, multi-hop QA, and agent control, outperforming standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively). Compiled DSPy programs on smaller open models are competitive with expert-written GPT-3.5 prompt chains.

Significance. If the reported gains are robust to validation-set selection bias, the work offers a valuable systematic alternative to manual prompt engineering by turning pipeline design into a programmable, optimizable artifact. The public GitHub release of the DSPy library supports reproducibility and further experimentation.

major comments (1)
  1. [§4] §4 (Bootstrap Optimizer): The optimizer repeatedly samples LM-generated demonstrations, scores candidate pipelines on a validation metric, and selects the best configuration. No separate held-out selection set, Bonferroni-style correction, or post-selection evaluation on untouched data is described. When the base LM is weak (e.g., Llama-2-13B-chat), noisy or metric-correlated demonstrations can amplify selection bias. This directly affects the central claim that the compiler reliably discovers high-performing pipelines, because the 25-65% gains over few-shot baselines and the 5-46% gains over expert prompts could partly reflect overfitting rather than genuine improvement.
minor comments (2)
  1. [Abstract and §5] The abstract and experimental sections provide no details on the compiler's search algorithm (e.g., beam size, number of rounds), hyperparameter choices, or statistical significance testing of the reported deltas.
  2. [Figures/Tables in §5] Figure and table captions could more explicitly state the exact validation metric used for each task and whether the same split was used for both optimization and final reporting.
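
The selection-bias concern in the major comment can be illustrated with a small simulation. This is a generic sketch of the statistical effect, not a reproduction of the paper's optimizer: every candidate pipeline is assumed to have the same true accuracy, and the candidate counts, set sizes, and function names are hypothetical choices made for the example.

```python
import random
from typing import Tuple


def noisy_score(true_acc: float, n: int, rng: random.Random) -> float:
    """Observed accuracy of a pipeline with the given true accuracy
    on n independently sampled examples."""
    return sum(rng.random() < true_acc for _ in range(n)) / n


def simulate(num_candidates: int = 20, n_val: int = 50, n_test: int = 50,
             true_acc: float = 0.5, trials: int = 500,
             seed: int = 0) -> Tuple[float, float]:
    """Average validation score of the selected candidate versus its
    average score on untouched test data, over many repetitions."""
    rng = random.Random(seed)
    val_total = 0.0
    test_total = 0.0
    for _ in range(trials):
        # Score every candidate on the validation set and keep the best.
        val_scores = [noisy_score(true_acc, n_val, rng)
                      for _ in range(num_candidates)]
        val_total += max(val_scores)
        # All candidates share the same true accuracy, so the winner's
        # score on fresh data is simply another independent draw.
        test_total += noisy_score(true_acc, n_test, rng)
    return val_total / trials, test_total / trials


mean_val, mean_test = simulate()
print(f"mean validation score of selected candidate: {mean_val:.3f}")
print(f"mean held-out score of selected candidate:   {mean_test:.3f}")
```

Even though no candidate is actually better than another, the selected candidate's validation score is inflated relative to its held-out score. This is why the question of which split was used for final reporting matters for interpreting the reported deltas.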

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of DSPy's significance and for the detailed feedback on the bootstrap optimizer. We address the major comment below.

Point-by-point responses
  1. Referee: [§4] §4 (Bootstrap Optimizer): The optimizer repeatedly samples LM-generated demonstrations, scores candidate pipelines on a validation metric, and selects the best configuration. No separate held-out selection set, Bonferroni-style correction, or post-selection evaluation on untouched data is described. When the base LM is weak (e.g., Llama-2-13B-chat), noisy or metric-correlated demonstrations can amplify selection bias. This directly affects the central claim that the compiler reliably discovers high-performing pipelines, because the 25-65% gains over few-shot baselines and the 5-46% gains over expert prompts could partly reflect overfitting rather than genuine improvement.

    Authors: We agree that the bootstrap optimizer, as currently described in §4, uses the validation set both to generate demonstrations and to select the best pipeline configuration, without a separate held-out selection set or post-selection evaluation on untouched data. This design is intentional for practical settings with limited labeled data, but we acknowledge the referee's point that it can introduce selection bias, particularly with weaker base models. The reported gains are measured on fully held-out test sets, yet the optimization step itself may overfit to the validation metric. We will revise the manuscript to (1) explicitly discuss this limitation in §4, (2) add experiments that reserve a portion of the validation data solely for post-selection evaluation, and (3) report results with Bonferroni-style corrections where multiple configurations are compared. These changes will provide stronger evidence that the observed improvements reflect genuine pipeline optimization rather than overfitting.

    revision: yes
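
The two remedies named in the rebuttal, a reserved post-selection split and a Bonferroni-style correction, can be sketched as follows. This is a minimal illustration with hypothetical helper names and split fractions; it does not describe how the authors would implement the revision.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(examples: Sequence[T],
                    fractions: Tuple[float, float] = (0.5, 0.25),
                    seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split labeled data into three disjoint parts:
    one for bootstrapping demonstrations, one for selecting among
    candidate configurations, and one held out for post-selection
    evaluation only."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_boot = int(len(items) * fractions[0])
    n_select = int(len(items) * fractions[1])
    bootstrap = items[:n_boot]
    selection = items[n_boot:n_boot + n_select]
    held_out = items[n_boot + n_select:]
    return bootstrap, selection, held_out


def bonferroni_alpha(alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance level when num_comparisons
    configurations are each tested against the baseline."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


labeled = list(range(100))  # stand-in for 100 labeled examples
bootstrap_set, selection_set, held_out_set = three_way_split(labeled)
per_test_alpha = bonferroni_alpha(0.05, num_comparisons=10)

print(len(bootstrap_set), len(selection_set), len(held_out_set))
print(per_test_alpha)
```

The design choice worth noting is that the held-out part is never seen by either the demonstration collection or the configuration selection, so a score measured on it is not subject to the selection effect the referee describes.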

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external test sets against fixed baselines

Full rationale

The paper introduces DSPy as a declarative programming model and compiler for LM pipelines, with optimizers (including bootstrap) that collect demonstrations and search configurations to maximize a user-specified metric. The central claims consist of empirical results: compiled pipelines outperform standard few-shot prompting and expert demonstrations on held-out test sets for tasks like math word problems and multi-hop QA. These comparisons use fixed external baselines rather than quantities defined inside the DSPy system. No equations, uniqueness theorems, or first-principles derivations appear that reduce a reported prediction to a fitted parameter or self-citation by construction. The bootstrap process is described as an optimization procedure whose outputs are evaluated externally, rendering the reported performance self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that language-model behavior can be usefully abstracted as learnable declarative modules whose configurations can be searched by a compiler; no numerical constants are fitted in the reported results.

axioms (1)
  • domain assumption: Language models respond usefully to compositions of prompting, finetuning, and reasoning techniques when those techniques are expressed through parameterized declarative modules.
    This is the foundational modeling choice stated in the abstract.
invented entities (2)
  • DSPy module (no independent evidence)
    purpose: Parameterized unit that invokes an LM and can learn from collected demonstrations
    New abstraction introduced by the paper; no independent evidence outside the framework itself.
  • DSPy compiler (no independent evidence)
    purpose: Optimizer that searches module configurations to maximize a metric
    New component introduced by the paper; no independent evidence outside the framework itself.

pith-pipeline@v0.9.0 · 5656 in / 1322 out tokens · 38086 ms · 2026-05-11T18:52:19.364355+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    cs.CL 2026-05 unverdicted novelty 8.0

    FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

  2. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 conditional novelty 8.0

    LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.

  3. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  4. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    MOSS performs source-level self-rewriting in agent systems and raised OpenClaw four-task mean score from 0.25 to 0.61 in one cycle.

  5. TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

    cs.CL 2026-05 unverdicted novelty 7.0

    TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.

  6. Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    DDS decomposes agentic data-system composition into bounded sub-searches via intent, operator DAG, per-system skills, and runtime attribution contracts, turning runtime failures into cited skill patches.

  7. A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

  8. S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

    cs.LG 2026-05 unverdicted novelty 7.0 partial

    S-Bus uses a DeliveryLog to reconstruct read sets from HTTP traffic and enforce Observable-Read Isolation, preventing structural race conditions in multi-agent LLM coordination.

  9. S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

    cs.LG 2026-05 unverdicted novelty 7.0 partial

    S-Bus reconstructs read sets from HTTP traffic for multi-agent LLM state coordination, delivering Observable-Read Isolation with formal proofs and empirical safety matching traditional databases.

  10. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  11. Efficient Ensemble Selection from Binary and Pairwise Feedback

    cs.GT 2026-05 unverdicted novelty 7.0

    The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta...

  12. TRACE: Tourism Recommendation with Accountable Citation Evidence

    cs.IR 2026-05 unverdicted novelty 7.0

    TRACE is a new benchmark dataset and evaluation suite for conversational tourism recommenders that requires systems to suggest POIs, cite verifiable review spans, and recover from rejections, revealing a Three-Compete...

  13. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.

  14. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.

  15. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.

  16. More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding

    cs.AI 2026-05 conditional novelty 7.0

    Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- ...

  17. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  18. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

    cs.CL 2026-04 unverdicted novelty 7.0

    AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.

  19. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

    cs.CL 2026-04 unverdicted novelty 7.0

    AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with cross-benchmark and cross-model transfer.

  20. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  21. RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

    cs.LG 2026-04 unverdicted novelty 7.0

    RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than...

  22. Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

  23. Meta-Harness: End-to-End Optimization of Model Harnesses

    cs.AI 2026-03 unverdicted novelty 7.0

    Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...

  24. GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

    cs.AI 2026-03 unverdicted novelty 7.0

    GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.

  25. Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

    cs.HC 2024-09 unverdicted novelty 7.0

    Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity suppor...

  26. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

    cs.CL 2024-06 accept novelty 7.0

    This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

  27. Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

    cs.AI 2026-05 conditional novelty 6.0

    Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.

  28. optimize_anything: A Universal API for Optimizing any Text Parameter

    cs.CL 2026-05 unverdicted novelty 6.0

    A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.

  29. Harnesses for Inference-Time Alignment over Execution Trajectories

    cs.LG 2026-05 unverdicted novelty 6.0

    Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal be...

  30. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  31. Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.

  32. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  33. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.

  34. Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

    stat.ML 2026-05 unverdicted novelty 6.0

    SIREN corrects winner's curse bias in adaptive LLM benchmarking via selection-aware repeated splits and bootstrap for valid procedure-level confidence intervals.

  35. Trace-Level Analysis of Information Contamination in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

  36. The Two Boundaries: Why Behavioral AI Governance Fails Structurally

    cs.AI 2026-04 conditional novelty 6.0 partial

    Behavioral governance of AI effects is undecidable for Turing-complete architectures, making coterminous boundaries via computation-effect separation the only structural solution rather than post-hoc layers.

  37. Probabilistic Programs of Thought

    cs.CL 2026-04 unverdicted novelty 6.0

    Probabilistic programs of thought let LLMs produce many program variants from one generation by building a compact probabilistic representation of the token distribution.

  38. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

  39. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.

  40. Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

    cs.AI 2026-04 unverdicted novelty 6.0

    POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

  41. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  42. Behavior Latticing: Inferring User Motivations from Unstructured Interactions

    cs.HC 2026-04 unverdicted novelty 6.0

    Behavior latticing synthesizes connections across unstructured user interactions to generate insights into underlying motivations, yielding deeper and more accurate user understanding than task-only models.

  43. Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

    cs.CL 2026-04 unverdicted novelty 6.0

    A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.

  44. Bridging the Intention-Expression Gap: Aligning Multi-Dimensional Preferences via Hierarchical Relevance Feedback in Text-to-Image Diffusion

    cs.CV 2026-03 unverdicted novelty 6.0

    HRFD aligns multi-dimensional preferences in text-to-image diffusion via hierarchical relevance feedback and statistical distribution divergence measurement between liked and disliked image sets, remaining training-fr...

  45. Benchmarking Compound AI Applications for Hardware-Software Co-Design

    cs.DC 2026-03 unverdicted novelty 6.0

    Introduces a benchmarking suite for compound AI applications to support cross-stack performance, cost, and resource analysis for hardware-software co-design.

  46. EditFlow: Benchmarking and Optimizing Code Edit Recommendation Systems via Reconstruction of Developer Flows

    cs.SE 2026-02 unverdicted novelty 6.0

    EditFlow reconstructs temporal developer editing flows from code changes to benchmark and optimize AI code edit recommenders so they align with natural incremental reasoning rather than static snapshots.

  47. Learning to Configure Agentic AI Systems

    cs.AI 2026-02 unverdicted novelty 6.0

    ARC learns per-query configurations for LLM agent systems via a lightweight hierarchical policy in an SMDP formulation, delivering 31% higher reasoning accuracy and doubled success on an agent benchmark over budget-ma...

  48. Learning to Configure Agentic AI Systems

    cs.AI 2026-02 unverdicted novelty 6.0

    ARC learns per-query agent configurations via a lightweight hierarchical SMDP policy, delivering 31.3% higher reasoning accuracy, 13.95% higher tool-use accuracy, and doubled success on an agent benchmark compared to ...

  49. Why is "Chicago" Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues

    cs.CL 2025-11 unverdicted novelty 6.0

    A conjecture-then-validate method lets LLMs convert opaque lexical cues from deceptive-review classifiers into interpretable language phenomena that are empirically grounded and more predictive than direct LLM outputs.

  50. In-depth Analysis of Graph-based RAG in a Unified Framework

    cs.IR 2025-03 unverdicted novelty 6.0

    A unified framework and large-scale comparison of graph-based RAG methods on QA tasks yields new high-performing variants obtained by recombining existing components.

  51. Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search

    cs.CL 2025-02 unverdicted novelty 6.0

    DITS replaces Q-value guidance in MCTS with influence scores for synthetic data synthesis in multi-agent LLM training, claiming better efficiency and performance on eight datasets.

  52. Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms

    cs.CL 2024-07 conditional novelty 6.0

    LLMs achieve 64% accuracy detecting Wikipedia bias and remove 79% of words removed by editors when correcting, but produce high-recall low-precision edits rated more neutral by crowds than human versions.

  53. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  54. SkillOpt: Executive Strategy for Self-Evolving Agent Skills

    cs.AI 2026-05 unverdicted novelty 5.0

    SkillOpt introduces a validation-gated text-space optimizer for agent skills that outperforms human, one-shot, and prior optimization baselines across 52 model-benchmark-harness combinations.

  55. Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education

    cs.HC 2026-05 unverdicted novelty 5.0

    Compares LLMs against semantic similarity for binary classification of student self-explanations in programming education.

  56. Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection

    cs.CL 2026-05 unverdicted novelty 5.0

    DiSP stratifies queries by difficulty using random trial estimates, trains a router and level-specific judges, then applies budgeted stop-on-acceptance selection to improve ICL accuracy and speed on classification tasks.

  57. Runtime-Structured Task Decomposition for Agentic Coding Systems

    cs.SE 2026-05 unverdicted novelty 5.0

    Runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only failed subtasks on two software engineering workloads.

  58. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  59. A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis

    cs.LG 2026-05 unverdicted novelty 5.0

    The paper introduces a reproducible optimization protocol for prompt-based LLM workflows in evidence synthesis that separates task definitions from prompt harnesses, optimizes the harness against metrics and examples,...

  60. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 72 Pith papers · 26 internal anchors

  1. [1]

    Optuna: A next-generation hyperparameter optimization framework

    Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2623-2631, 2019

  2. [2]

    Theano: A Python framework for fast computation of mathematical expressions

    Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, pp. arXiv-1605, 2016

  3. [3]

    Theano: A CPU and GPU math compiler in Python

    James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th python in science conf, volume 1, pp. 3-10, 2010

  4. [4]

    Theano: Deep learning on gpus with Python

    James Bergstra, Frédéric Bastien, Olivier Breuleux, Pascal Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde-Farley, Ian Goodfellow, Arnaud Bergeron, et al. Theano: Deep learning on gpus with Python. In NIPS 2011, BigLearning Workshop, Granada, Spain, volume 3. Citeseer, 2011

  5. [5]

    Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures

    James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pp. 115-123. PMLR, 2013

  6. [6]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901, 2020

  8. [8]

    Hwchase17/langchain

    Harrison Chase. Hwchase17/langchain. 2022. URL https://github.com/hwchase17/langchain

  9. [9]

    Reading Wikipedia to answer open-domain questions

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1870-1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1171. URL https://acl...

  10. [10]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

  11. [11]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

  12. [12]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

  13. [13]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  14. [14]

    Torch: a modular machine learning software library

    Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. Torch: a modular machine learning software library. Technical report, Idiap, 2002

  15. [15]

    Language model cascades

    David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022

  16. [16]

    Rarr: Researching and revising what language models say, using language models

    Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16477-16...

  17. [17]

    Pal: Program-aided language models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764-10799. PMLR, 2023b

  18. [18]

    EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023

  19. [19]

    REALM: Retrieval-Augmented Language Model Pre-Training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020. URL https://arxiv.org/abs/2002.08909

  20. [20]

    Training classifiers with natural language explanations

    Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher Ré. Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1884-1895. Association for Computational Linguistics, 2018. URL http://aclweb...

  21. [21]

    Enabling intelligent interactions between an agent and an LLM: A reinforcement learning approach

    Bin Hu, Chenyang Zhao, Pu Zhang, Zihao Zhou, Yuanhang Yang, Zenglin Xu, and Bin Liu. Enabling intelligent interactions between an agent and an LLM: A reinforcement learning approach. arXiv preprint arXiv:2306.03604, 2023. URL https://arxiv.org/abs/2306.03604

  22. [22]

    Large Language Models Can Self-Improve

    Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022

  23. [23]

    Atlas: Few-shot Learning with Retrieval Augmented Language Models

    Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022

  24. [24]

    MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

    Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, et al. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445, 2022

  25. [25]

    Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval

    Omar Khattab, Christopher Potts, and Matei Zaharia. Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021a

  26. [26]

    Relevance-guided supervision for openqa with ColBERT

    Omar Khattab, Christopher Potts, and Matei Zaharia. Relevance-guided supervision for openqa with ColBERT. Transactions of the Association for Computational Linguistics, 9:929-944, 2021b

  27. [27]

    Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp

    Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024, 2022

  28. [28]

    Decomposed Prompting: A Modular Approach for Solving Complex Tasks

    Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022

  29. [29]

    Large Language Models are Zero-Shot Reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022

  30. [30]

    Internet-augmented language models through few-shot prompting for open-domain question answering

    Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022

  31. [31]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural ...

  32. [32]

    LlamaIndex

    Jerry Liu. LlamaIndex, November 2022. URL https://github.com/jerryjliu/llama_index

  33. [33]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023

  34. [34]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730, 2018. URL https://arxiv.org/abs/1806.08730

  35. [35]

    Semantic kernel

    Microsoft. Semantic kernel. 2023. URL https://learn.microsoft.com/semantic-kernel/

  36. [36]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback, 2021. URL https://...

  37. [37]

    GPT-4 technical report

    OpenAI. GPT-4 technical report, 2023

  38. [38]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

  39. [39]

    PyTorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-perf...

  40. [40]

    DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction

    Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. arXiv preprint arXiv:2304.11015, 2023

  41. [41]

    Measuring and Narrowing the Compositionality Gap in Language Models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

  42. [42]

    Automatic prompt optimization with "gradient descent" and beam search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

  43. [43]

    Answering complex open-domain questions through iterative query generation

    Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2590-2602, Hong Kong, ...

  44. [44]

    Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text

    Peng Qi, Haejun Lee, Oghenetegiri Sido, Christopher D Manning, et al. Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text. arXiv preprint arXiv:2010.12527, 2020. URL https://arxiv.org/abs/2010.12527

  45. [45]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Ms, OpenAI, 2018. URL https://openai.com/blog/language-unsupervised/

  46. [46]

    Data programming: Creating large training sets, quickly

    Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 3567-3575. Curran Associates, Inc., 2016. URL https://papers.nips.cc/paper/65...

  47. [47]

    ColBERTv2: Effective and efficient retrieval via lightweight late interaction

    Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. arXiv preprint arXiv:2112.01488, 2021

  48. [48]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  49. [49]

    Synthetic prompting: Generating chain-of-thought demonstrations for large language models

    Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618, 2023

  50. [50]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

  51. [51]

    Prompting gpt-3 to be reliable

    Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150, 2022

  52. [52]

    Recitation-augmented language models

    Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. arXiv preprint arXiv:2210.01296, 2022

  53. [53]

    Chainer: a next-generation open source framework for deep learning

    Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), volume 5, pp. 1-6, 2015

  54. [54]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  55. [55]

    Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022

  56. [56]

    Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming

    Fei Wang, James Decker, Xilun Wu, Gregory Essertel, and Tiark Rompf. Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. U...

  57. [57]

    Rationale-augmented ensembles in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747, 2022a

  58. [58]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022b

  59. [59]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  60. [60]

    Transformers: State-of-the-Art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...

  61. [61]

    Large Language Models as Optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023

  62. [62]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  63. [63]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

  64. [64]

    Answering questions by meta-reasoning over multiple chains of thought

    Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007, 2023

  65. [65]

    STaR: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, and Noah D Goodman. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465, 2022

  66. [66]

    Automatic Chain of Thought Prompting in Large Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  67. [67]

    ExpeL: LLM agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. arXiv preprint arXiv:2308.10144, 2023a. URL https://arxiv.org/pdf/2308.10144

  68. [68]

    Automatic model selection with large language models for reasoning

    Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, and Qizhe Xie. Automatic model selection with large language models for reasoning. arXiv preprint arXiv:2305.14333, 2023b