DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Arnav Singhvi; Ashutosh Sharma; Christopher Potts; Hanna Moazam; Heather Miller; Keshav Santhanam; Matei Zaharia; Omar Khattab; Paridhi Maheshwari; Saiful Haq

arXiv: 2310.03714 · v1 · submitted 2023-10-05 · cs.CL · cs.AI · cs.IR · cs.LG

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts

Pith reviewed 2026-05-11 18:52 UTC · model grok-4.3

keywords: DSPy · language model pipelines · declarative modules · prompt optimization · self-bootstrapping · compiler · few-shot prompting · performance improvement

The pith

DSPy turns a few lines of declarative code into language model pipelines that self-optimize and outperform few-shot and expert prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DSPy represents LM pipelines as graphs of declarative modules that invoke language models and can learn parameters by collecting their own demonstrations. A compiler then searches over possible module configurations to maximize a user-specified metric. This structure lets short programs build and improve sophisticated pipelines for math word problems, multi-hop retrieval, complex question answering, and agent loops. A sympathetic reader would care because the method replaces manual trial-and-error prompt writing with systematic, automatic optimization. Per the abstract, compiled pipelines using GPT-3.5 and llama2-13b-chat exceed standard few-shot baselines (generally by over 25% and 65%, respectively) and beat pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively).

Core claim

DSPy abstracts LM pipelines as text transformation graphs in which LMs are called through declarative, parameterized modules. The compiler optimizes any such pipeline for a given metric by automatically generating demonstrations and searching over module configurations and compositions of prompting, reasoning, and augmentation techniques. Succinct DSPy programs thereby produce pipelines that, after compilation, outperform standard few-shot prompting and expert-created demonstrations on tasks including math reasoning and multi-hop QA.

What carries the argument

Parameterized DSPy modules inside computational graphs, together with a compiler that collects demonstrations and searches configurations to maximize a target metric.
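
As a rough illustration of this mechanism (a module whose learnable parameter is a set of demonstrations, plus a compiler that searches configurations against a metric), the Python sketch below uses a deterministic stub in place of a real LM. Everything in it is hypothetical: `toy_lm`, `ToyModule`, `bootstrap_demos`, and `compile_module` are names invented for this sketch and are not the DSPy API, and the search (try the first k bootstrapped demonstrations) is far simpler than what the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (question, answer)
Metric = Callable[[str, str], bool]  # (prediction, gold) -> pass/fail


def toy_lm(question: str, demos: List[Example]) -> str:
    """Deterministic stand-in for an LM call (an assumption, not a real model).

    Without demonstrations it uses the bare answer format only for sums below 10;
    with at least one demonstration it always uses the bare format.
    """
    a, b = (int(part) for part in question.split("+"))
    total = a + b
    if demos or total < 10:
        return str(total)
    return f"The answer is {total}."


@dataclass
class ToyModule:
    """Hypothetical module whose only learnable parameter is its demo list."""
    demos: List[Example] = field(default_factory=list)

    def __call__(self, question: str) -> str:
        return toy_lm(question, self.demos)


def exact_match(prediction: str, gold: str) -> bool:
    return prediction.strip() == gold.strip()


def evaluate(module: ToyModule, dataset: List[Example], metric: Metric) -> float:
    """Fraction of examples on which the module's output passes the metric."""
    hits = sum(metric(module(q), gold) for q, gold in dataset)
    return hits / len(dataset)


def bootstrap_demos(module: ToyModule, trainset: List[Example], metric: Metric) -> List[Example]:
    """Run the module on training inputs and keep outputs that pass the metric."""
    demos = []
    for question, gold in trainset:
        prediction = module(question)
        if metric(prediction, gold):
            demos.append((question, prediction))
    return demos


def compile_module(trainset, valset, metric, max_demos=2):
    """Toy search: try the first k bootstrapped demos and keep the best scorer."""
    pool = bootstrap_demos(ToyModule(), trainset, metric)
    best = ToyModule()
    best_score = evaluate(best, valset, metric)
    for k in range(1, min(max_demos, len(pool)) + 1):
        candidate = ToyModule(demos=pool[:k])
        score = evaluate(candidate, valset, metric)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Made-up toy data, not from the paper.
trainset = [("2+3", "5"), ("7+8", "15"), ("4+4", "8")]
valset = [("9+9", "18"), ("1+2", "3"), ("6+7", "13"), ("5+5", "10")]

zero_shot_score = evaluate(ToyModule(), valset, exact_match)
compiled, compiled_score = compile_module(trainset, valset, exact_match)
print(zero_shot_score, compiled_score, compiled.demos)
```

In this toy, the stub succeeds zero-shot on some training inputs, and those successes become the demonstrations that lift the validation score. Note that the sketch reports the same validation score it used for selection, which a real evaluation would replace with a score on untouched data.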

If this is right

Succinct DSPy programs can express and optimize complex pipelines for reasoning, retrieval, and control tasks.
Open models as small as 770M-parameter T5 become competitive with expert prompt chains written for proprietary GPT-3.5.
The same declarative program can be recompiled for different metrics or models without rewriting prompts.
Models can self-bootstrap training data and improve their own performance on the target task within minutes.
Pipeline development shifts from hand-crafted strings to declarative code plus automatic optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could lower the expertise barrier for building reliable LM applications by automating much of the prompt engineering.
Compiled pipelines might adapt more readily to new domains if the compiler is given additional unlabeled data or metrics.
Extending the same declarative graph structure to multimodal or tool-using agents would be a natural next step.
Combining the compiler with lightweight fine-tuning on the collected demonstrations could further improve small-model performance.

Load-bearing premise

Automatic search over module configurations driven by collected demonstrations will reliably locate high-performing pipelines without overfitting to the validation metric or demanding prohibitive compute.

What would settle it

On a new task, the DSPy compiler produces a pipeline whose accuracy is no higher than that of a standard few-shot prompt baseline using the same underlying language model.

Original abstract

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded "prompt templates", i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric. We conduct two case studies, showing that succinct DSPy programs can express and optimize sophisticated LM pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, a few lines of DSPy allow GPT-3.5 and llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively). On top of that, DSPy programs compiled to open and relatively small LMs like 770M-parameter T5 and llama2-13b-chat are competitive with approaches that rely on expert-written prompt chains for proprietary GPT-3.5. DSPy is available at https://github.com/stanfordnlp/dspy

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DSPy gives a declarative module system and compiler for LM pipelines that can auto-bootstrap better performance than hand prompts, but the optimizer's selection process needs tighter controls to rule out overfitting.

The main point is that DSPy turns LM pipeline design into a programming task: you declare modules for prompting, reasoning, or retrieval, then let a compiler optimize the whole pipeline by generating and selecting demonstrations to maximize a metric. A few lines of code end up beating both basic few-shot prompting and expert-written chains on the reported tasks, with larger lifts on weaker open models such as llama2-13b-chat.

Referee Report

1 major / 2 minor

Summary. The paper introduces DSPy, a programming model that represents LM pipelines as imperative computational graphs of declarative, parameterized modules. These modules learn by collecting demonstrations to compose prompting, reasoning, and other techniques. A compiler optimizes any DSPy program for a given metric via bootstrap search over module configurations and auto-generated demonstrations. Two case studies demonstrate that short DSPy programs enable GPT-3.5 and Llama-2-13B-chat to self-improve pipelines for math word problems, multi-hop QA, and agent control, outperforming standard few-shot prompting (by >25% and >65%) and expert demonstrations (by up to 5-46% and 16-40%). Compiled DSPy programs on smaller open models are competitive with expert GPT-3.5 chains.

Significance. If the reported gains are robust to validation-set selection bias, the work offers a valuable systematic alternative to manual prompt engineering by turning pipeline design into a programmable, optimizable artifact. The public GitHub release of the DSPy library supports reproducibility and further experimentation.

major comments (1)

[§4] §4 (Bootstrap Optimizer): The optimizer repeatedly samples LM-generated demonstrations, scores candidate pipelines on a validation metric, and selects the best configuration. No separate held-out selection set, Bonferroni-style correction, or post-selection evaluation on untouched data is described. When the base LM is weak (e.g., Llama-2-13B-chat), noisy or metric-correlated demonstrations can amplify selection bias. This directly affects the central claim that the compiler reliably discovers high-performing pipelines, because the 25-65% gains over few-shot baselines and the 5-46% gains over expert prompts could partly reflect overfitting rather than genuine improvement.
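
The selection-bias concern can be made concrete with a small simulation, sketched here under loudly stated assumptions: every candidate configuration has the same true accuracy (0.5), each validation outcome is an independent Bernoulli draw, and the candidate count and set sizes are arbitrary. None of these numbers come from the paper, and `simulate_selection_bias` is a hypothetical helper, not part of DSPy.

```python
import random


def simulate_selection_bias(num_candidates=200, val_size=50,
                            heldout_size=20000, true_accuracy=0.5, seed=0):
    """Pick the best of many equally good candidates on a small validation set.

    Returns (best validation score, mean validation score, held-out score
    of the selected candidate). All candidates share one true accuracy, so any
    gap between the first and last value is pure selection noise.
    """
    rng = random.Random(seed)

    def observed_accuracy(n):
        # Each example is answered correctly with probability true_accuracy.
        return sum(rng.random() < true_accuracy for _ in range(n)) / n

    val_scores = [observed_accuracy(val_size) for _ in range(num_candidates)]
    best_val = max(val_scores)
    mean_val = sum(val_scores) / len(val_scores)
    # Re-score the selected candidate on fresh, untouched data.
    heldout = observed_accuracy(heldout_size)
    return best_val, mean_val, heldout


best_val, mean_val, heldout = simulate_selection_bias()
print(f"best on validation: {best_val:.3f}")
print(f"mean on validation: {mean_val:.3f}")
print(f"selected candidate on held-out data: {heldout:.3f}")
```

Because the winner is chosen by its validation score, that score overstates its true accuracy even though no candidate is actually better than another. The effect grows with more candidates and smaller validation sets, which is why a post-selection evaluation on untouched data matters.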

minor comments (2)

[Abstract and §5] The abstract and experimental sections provide no details on the compiler's search algorithm (e.g., beam size, number of rounds), hyperparameter choices, or statistical significance testing of the reported deltas.
[Figures/Tables in §5] Figure and table captions could more explicitly state the exact validation metric used for each task and whether the same split was used for both optimization and final reporting.
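
One common way to attach uncertainty to a reported accuracy delta is a paired percentile bootstrap over test examples. The sketch below is an illustration only: the function name `paired_bootstrap_ci` is hypothetical, the per-example correctness vectors are synthetic, and nothing here is taken from the paper's evaluation.

```python
import random


def paired_bootstrap_ci(baseline_correct, compiled_correct,
                        num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy difference (compiled - baseline).

    Inputs are equal-length lists of 0/1 correctness on the same test examples.
    """
    if len(baseline_correct) != len(compiled_correct):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(baseline_correct)
    diffs = [c - b for b, c in zip(baseline_correct, compiled_correct)]
    observed = sum(diffs) / n

    resampled = []
    for _ in range(num_resamples):
        # Resample examples with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()

    lo = resampled[int(round((alpha / 2) * num_resamples))]
    hi = resampled[int(round((1 - alpha / 2) * num_resamples)) - 1]
    return observed, lo, hi


# Synthetic data: baseline correct on 100 of 200 examples, compiled on 150 of 200.
baseline_correct = [1] * 100 + [0] * 100
compiled_correct = [1] * 150 + [0] * 50

observed, lo, hi = paired_bootstrap_ci(baseline_correct, compiled_correct)
print(f"delta = {observed:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Pairing matters because both systems are scored on the same examples. Resampling the per-example differences keeps that correlation, which usually gives a tighter interval than treating the two accuracies as independent.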

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of DSPy's significance and for the detailed feedback on the bootstrap optimizer. We address the major comment below.

Referee: [§4] §4 (Bootstrap Optimizer): The optimizer repeatedly samples LM-generated demonstrations, scores candidate pipelines on a validation metric, and selects the best configuration. No separate held-out selection set, Bonferroni-style correction, or post-selection evaluation on untouched data is described. When the base LM is weak (e.g., Llama-2-13B-chat), noisy or metric-correlated demonstrations can amplify selection bias. This directly affects the central claim that the compiler reliably discovers high-performing pipelines, because the 25-65% gains over few-shot baselines and the 5-46% gains over expert prompts could partly reflect overfitting rather than genuine improvement.

Authors: We agree that the bootstrap optimizer, as currently described in §4, uses the validation set both to generate demonstrations and to select the best pipeline configuration, without a separate held-out selection set or post-selection evaluation on untouched data. This design is intentional for practical settings with limited labeled data, but we acknowledge the referee's point that it can introduce selection bias, particularly with weaker base models. The reported gains are measured on fully held-out test sets, yet the optimization step itself may overfit to the validation metric. We will revise the manuscript to (1) explicitly discuss this limitation in §4, (2) add experiments that reserve a portion of the validation data solely for post-selection evaluation, and (3) report results with Bonferroni-style corrections where multiple configurations are compared. These changes will provide stronger evidence that the observed improvements reflect genuine pipeline optimization rather than overfitting. revision: yes
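
Two of the remedies named in this response, reserving part of the validation data for post-selection evaluation and applying a Bonferroni correction when several configurations are compared, can be sketched as follows. This is a minimal illustration under assumptions: `split_validation` and `bonferroni_reject` are hypothetical helper names, the 30% holdout fraction is arbitrary, and the p-values are made up for the example rather than taken from any experiment.

```python
import random
from typing import List, Sequence, Tuple


def split_validation(examples: Sequence, holdout_fraction: float = 0.3,
                     seed: int = 0) -> Tuple[list, list]:
    """Shuffle once, then split into (selection set, post-selection set)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def bonferroni_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject a hypothesis only if its p-value is at most alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Example ids stand in for validation examples.
selection_set, post_selection_set = split_validation(range(20))

# Made-up p-values, one per candidate configuration compared to the baseline.
illustrative_p_values = [0.001, 0.02, 0.04, 0.30]
decisions = bonferroni_reject(illustrative_p_values)

print(len(selection_set), len(post_selection_set))
print(decisions)
```

With four comparisons the per-test threshold drops from 0.05 to 0.0125, so only the first made-up p-value survives. Bonferroni is conservative; it is shown here because the response names it, not because it is necessarily the best choice.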

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external test sets against fixed baselines

The paper introduces DSPy as a declarative programming model and compiler for LM pipelines, with optimizers (including bootstrap) that collect demonstrations and search configurations to maximize a user-specified metric. The central claims consist of empirical results: compiled pipelines outperform standard few-shot prompting and expert demonstrations on held-out test sets for tasks like math word problems and multi-hop QA. These comparisons use fixed external baselines rather than quantities defined inside the DSPy system. No equations, uniqueness theorems, or first-principles derivations appear that reduce a reported prediction to a fitted parameter or self-citation by construction. The bootstrap process is described as an optimization procedure whose outputs are evaluated externally, rendering the reported performance self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that language-model behavior can be usefully abstracted as learnable declarative modules whose configurations can be searched by a compiler; no numerical constants are fitted in the reported results.

axioms (1)

domain assumption: Language models respond usefully to compositions of prompting, finetuning, and reasoning techniques when those techniques are expressed through parameterized declarative modules.
This is the foundational modeling choice stated in the abstract.

invented entities (2)

DSPy module · no independent evidence
purpose: Parameterized unit that invokes an LM and can learn from collected demonstrations
New abstraction introduced by the paper; no independent evidence outside the framework itself.
DSPy compiler · no independent evidence
purpose: Optimizer that searches module configurations to maximize a metric
New component introduced by the paper; no independent evidence outside the framework itself.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows
cs.CL 2026-05 unverdicted novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 conditional novelty 8.0

LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.
Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
cs.AI 2026-05 unverdicted novelty 7.0

MOSS performs source-level self-rewriting in agent systems and raised OpenClaw four-task mean score from 0.25 to 0.61 in one cycle.
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization
cs.CL 2026-05 unverdicted novelty 7.0

TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
cs.AI 2026-05 unverdicted novelty 7.0

DDS decomposes agentic data-system composition into bounded sub-searches via intent, operator DAG, per-system skills, and runtime attribution contracts, turning runtime failures into cited skill patches.
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination
cs.LG 2026-05 unverdicted novelty 7.0 partial

S-Bus uses a DeliveryLog to reconstruct read sets from HTTP traffic and enforce Observable-Read Isolation, preventing structural race conditions in multi-agent LLM coordination.
S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination
cs.LG 2026-05 unverdicted novelty 7.0 partial

S-Bus reconstructs read sets from HTTP traffic for multi-agent LLM state coordination, delivering Observable-Read Isolation with formal proofs and empirical safety matching traditional databases.
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 7.0

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
Efficient Ensemble Selection from Binary and Pairwise Feedback
cs.GT 2026-05 unverdicted novelty 7.0

The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta...
TRACE: Tourism Recommendation with Accountable Citation Evidence
cs.IR 2026-05 unverdicted novelty 7.0

TRACE is a new benchmark dataset and evaluation suite for conversational tourism recommenders that requires systems to suggest POIs, cite verifiable review spans, and recover from rejections, revealing a Three-Compete...
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 7.0

LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 7.0

LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 7.0

LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.
More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding
cs.AI 2026-05 conditional novelty 7.0

Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- ...
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
cs.SE 2026-05 unverdicted novelty 7.0

TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
cs.CL 2026-04 unverdicted novelty 7.0

AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
cs.CL 2026-04 unverdicted novelty 7.0

AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with cross-benchmark and cross-model transfer.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design
cs.LG 2026-04 unverdicted novelty 7.0

RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than...
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
cs.SE 2026-04 unverdicted novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
Meta-Harness: End-to-End Optimization of Model Harnesses
cs.AI 2026-03 unverdicted novelty 7.0

Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration
cs.AI 2026-03 unverdicted novelty 7.0

GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.
Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation
cs.HC 2024-09 unverdicted novelty 7.0

Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity suppor...
The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
cs.CL 2024-06 accept novelty 7.0

This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
cs.AI 2026-05 conditional novelty 6.0

Ratchet provides a minimal hygiene recipe for self-managing skill libraries in frozen LLM agents, delivering +0.328 rolling-mean pass@1 gain on MBPP+ hard-100 and +0.22 peak lift on SWE-bench Verified.
optimize_anything: A Universal API for Optimizing any Text Parameter
cs.CL 2026-05 unverdicted novelty 6.0

A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.
Harnesses for Inference-Time Alignment over Execution Trajectories
cs.LG 2026-05 unverdicted novelty 6.0

Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal be...
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 6.0

Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
cs.CL 2026-05 unverdicted novelty 6.0

LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 6.0

EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 6.0

LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.
Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
stat.ML 2026-05 unverdicted novelty 6.0

SIREN corrects winner's curse bias in adaptive LLM benchmarking via selection-aware repeated splits and bootstrap for valid procedure-level confidence intervals.
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
cs.AI 2026-04 unverdicted novelty 6.0

Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
The Two Boundaries: Why Behavioral AI Governance Fails Structurally
cs.AI 2026-04 conditional novelty 6.0 partial

Behavioral governance of AI effects is undecidable for Turing-complete architectures, making coterminous boundaries via computation-effect separation the only structural solution rather than post-hoc layers.
Probabilistic Programs of Thought
cs.CL 2026-04 unverdicted novelty 6.0

Probabilistic programs of thought let LLMs produce many program variants from one generation by building a compact probabilistic representation of the token distribution.
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
cs.AI 2026-04 conditional novelty 6.0

The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
cs.AI 2026-04 unverdicted novelty 6.0

Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.
Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
cs.AI 2026-04 unverdicted novelty 6.0

POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.
Pioneer Agent: Continual Improvement of Small Language Models in Production
cs.AI 2026-04 unverdicted novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
Behavior Latticing: Inferring User Motivations from Unstructured Interactions
cs.HC 2026-04 unverdicted novelty 6.0

Behavior latticing synthesizes connections across unstructured user interactions to generate insights into underlying motivations, yielding deeper and more accurate user understanding than task-only models.
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
cs.CL 2026-04 unverdicted novelty 6.0

A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
Bridging the Intention-Expression Gap: Aligning Multi-Dimensional Preferences via Hierarchical Relevance Feedback in Text-to-Image Diffusion
cs.CV 2026-03 unverdicted novelty 6.0

HRFD aligns multi-dimensional preferences in text-to-image diffusion via hierarchical relevance feedback and statistical distribution divergence measurement between liked and disliked image sets, remaining training-fr...
Benchmarking Compound AI Applications for Hardware-Software Co-Design
cs.DC 2026-03 unverdicted novelty 6.0

Introduces a benchmarking suite for compound AI applications to support cross-stack performance, cost, and resource analysis for hardware-software co-design.
EditFlow: Benchmarking and Optimizing Code Edit Recommendation Systems via Reconstruction of Developer Flows
cs.SE 2026-02 unverdicted novelty 6.0

EditFlow reconstructs temporal developer editing flows from code changes to benchmark and optimize AI code edit recommenders so they align with natural incremental reasoning rather than static snapshots.
Learning to Configure Agentic AI Systems
cs.AI 2026-02 unverdicted novelty 6.0

ARC learns per-query configurations for LLM agent systems via a lightweight hierarchical policy in an SMDP formulation, delivering 31% higher reasoning accuracy and doubled success on an agent benchmark over budget-ma...
Learning to Configure Agentic AI Systems
cs.AI 2026-02 unverdicted novelty 6.0

ARC learns per-query agent configurations via a lightweight hierarchical SMDP policy, delivering 31.3% higher reasoning accuracy, 13.95% higher tool-use accuracy, and doubled success on an agent benchmark compared to ...
Why is "Chicago" Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues
cs.CL 2025-11 unverdicted novelty 6.0

A conjecture-then-validate method lets LLMs convert opaque lexical cues from deceptive-review classifiers into interpretable language phenomena that are empirically grounded and more predictive than direct LLM outputs.
In-depth Analysis of Graph-based RAG in a Unified Framework
cs.IR 2025-03 unverdicted novelty 6.0

A unified framework and large-scale comparison of graph-based RAG methods on QA tasks yields new high-performing variants obtained by recombining existing components.
Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search
cs.CL 2025-02 unverdicted novelty 6.0

DITS replaces Q-value guidance in MCTS with influence scores for synthetic data synthesis in multi-agent LLM training, claiming better efficiency and performance on eight datasets.
Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms
cs.CL 2024-07 conditional novelty 6.0

LLMs achieve 64% accuracy detecting Wikipedia bias and remove 79% of words removed by editors when correcting, but produce high-recall low-precision edits rated more neutral by crowds than human versions.
SGLang: Efficient Execution of Structured Language Model Programs
cs.AI 2023-12 conditional novelty 6.0

SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
cs.AI 2026-05 unverdicted novelty 5.0

SkillOpt introduces a validation-gated text-space optimizer for agent skills that outperforms human, one-shot, and prior optimization baselines across 52 model-benchmark-harness combinations.
Exploring the Effectiveness of Using LLMs for Automated Assessment of Student Self Explanations in Programming Education
cs.HC 2026-05 unverdicted novelty 5.0

Compares LLMs against semantic similarity for binary classification of student self-explanations in programming education.
Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection
cs.CL 2026-05 unverdicted novelty 5.0

DiSP stratifies queries by difficulty using random trial estimates, trains a router and level-specific judges, then applies budgeted stop-on-acceptance selection to improve ICL accuracy and speed on classification tasks.
Runtime-Structured Task Decomposition for Agentic Coding Systems
cs.SE 2026-05 unverdicted novelty 5.0

Runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only failed subtasks on two software engineering workloads.
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
cs.AI 2026-05 unverdicted novelty 5.0 partial

Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis
cs.LG 2026-05 unverdicted novelty 5.0

The paper introduces a reproducible optimization protocol for prompt-based LLM workflows in evidence synthesis that separates task definitions from prompt harnesses, optimizes the harness against metrics and examples,...
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI 2026-05 conditional novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 72 Pith papers · 26 internal anchors

[1]

Optuna: A next-generation hyperparameter optimization framework

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2623--2631, 2019

work page 2019
[2]

Theano: A Python framework for fast computation of mathematical expressions

Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, pp. arXiv--1605, 2016

work page 2016
[3]

Theano: A CPU and GPU math compiler in Python

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th python in science conf, volume 1, pp. 3--10, 2010

work page 2010
[4]

Theano: Deep learning on gpus with Python

James Bergstra, Frédéric Bastien, Olivier Breuleux, Pascal Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde-Farley, Ian Goodfellow, Arnaud Bergeron, et al. Theano: Deep learning on gpus with Python. In NIPS 2011, BigLearning Workshop, Granada, Spain, volume 3. Citeseer, 2011

work page 2011
[5]

Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures

James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pp. 115--123. PMLR, 2013

work page 2013
[6]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901, 2020

work page 2020
[8]

Hwchase17/langchain

Harrison Chase. Hwchase17/langchain. 2022. URL https://github.com/hwchase17/langchain

work page 2022
[9]

Reading Wikipedia to answer open-domain questions

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1870--1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1171. URL https://acl...

work page doi:10.18653/v1/p17-1171 2017
[10]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

work page internal anchor Pith review arXiv 2023
[11]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

work page internal anchor Pith review arXiv 2022
[12]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Torch: a modular machine learning software library

Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. Torch: a modular machine learning software library. Technical report, Idiap, 2002

work page 2002
[15]

Language model cascades

David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022

work page arXiv 2022
[16]

Rarr: Researching and revising what language models say, using language models

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16477--16...

work page 2023
[17]

Pal: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764--10799. PMLR, 2023b

work page 2023
[18]

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023

work page internal anchor Pith review arXiv 2023
[19]

REALM: Retrieval-Augmented Language Model Pre-Training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020. URL https://arxiv.org/abs/2002.08909

work page internal anchor Pith review arXiv 2020
[20]

Training classifiers with natural language explanations

Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher Ré. Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1884--1895. Association for Computational Linguistics, 2018. URL http://aclweb...

work page 2018
[21]

Enabling intelligent interactions between an agent and an LLM: A reinforcement learning approach

Bin Hu, Chenyang Zhao, Pu Zhang, Zihao Zhou, Yuanhang Yang, Zenglin Xu, and Bin Liu. Enabling intelligent interactions between an agent and an LLM: A reinforcement learning approach. arXiv preprint arXiv:2306.03604, 2023. URL https://arxiv.org/abs/2306.03604

work page arXiv 2023
[22]

Large Language Models Can Self-Improve

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022

work page internal anchor Pith review arXiv 2022
[23]

Atlas: Few-shot Learning with Retrieval Augmented Language Models

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022

work page internal anchor Pith review arXiv 2022
[24]

MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, et al. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445, 2022

work page internal anchor Pith review arXiv 2022
[25]

Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval

Omar Khattab, Christopher Potts, and Matei Zaharia. Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021a

work page 2021
[26]

Relevance-guided supervision for openqa with ColBERT

Omar Khattab, Christopher Potts, and Matei Zaharia. Relevance-guided supervision for openqa with ColBERT. Transactions of the Association for Computational Linguistics, 9:929--944, 2021b

work page 2021
[27]

Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024, 2022

work page arXiv 2022
[28]

Decomposed Prompting: A Modular Approach for Solving Complex Tasks

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022

work page internal anchor Pith review arXiv 2022
[29]

Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022

work page internal anchor Pith review arXiv 2022
[30]

Internet-augmented language models through few-shot prompting for open-domain question answering

Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022

work page arXiv 2022
[31]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural ...

work page 2020
[32]

LlamaIndex

Jerry Liu. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index

work page 2022
[33]

Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

The Natural Language Decathlon: Multitask Learning as Question Answering

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730, 2018. URL https://arxiv.org/abs/1806.08730

work page Pith review arXiv 2018
[35]

Semantic kernel

Microsoft. Semantic kernel. 2023. URL https://learn.microsoft.com/semantic-kernel/

work page 2023
[36]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback, 2021. URL https://...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[37]

GPT-4 technical report

OpenAI. GPT-4 technical report, 2023

work page 2023
[38]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

PyTorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-perf...

work page 2019
[40]

Din-sql: Decomposed in-context learning of text-to-sql with self-correction

Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. arXiv preprint arXiv:2304.11015, 2023

work page arXiv 2023
[41]

Measuring and Narrowing the Compositionality Gap in Language Models

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

work page internal anchor Pith review arXiv 2022
[42]

Automatic prompt optimization with "gradient descent" and beam search

Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

work page arXiv 2023
[43]

Answering complex open-domain questions through iterative query generation

Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2590--2602, Hong Kong, ...

work page doi:10.18653/v1/d19-1261 2019
[44]

Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text

Peng Qi, Haejun Lee, Oghenetegiri Sido, Christopher D Manning, et al. Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text. arXiv preprint arXiv:2010.12527, 2020. URL https://arxiv.org/abs/2010.12527

work page arXiv 2020
[45]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Ms, OpenAI, 2018. URL https://openai.com/blog/language-unsupervised/

work page 2018
[46]

Data programming: Creating large training sets, quickly

Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 3567--3575. Curran Associates, Inc., 2016. URL https://papers.nips.cc/paper/65...

work page 2016
[47]

ColBERTv2: Effective and efficient retrieval via lightweight late interaction

Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. arXiv preprint arXiv:2112.01488, 2021

work page arXiv 2021
[48]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Synthetic prompting: Generating chain-of-thought demonstrations for large language models

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618, 2023

work page arXiv 2023
[50]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Prompting gpt-3 to be reliable

Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150, 2022

work page arXiv 2022
[52]

Recitation-augmented language models

Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. arXiv preprint arXiv:2210.01296, 2022

work page arXiv 2022
[53]

Chainer: a next-generation open source framework for deep learning

Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), volume 5, pp. 1--6, 2015

work page 2015
[54]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[55]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022

work page internal anchor Pith review arXiv 2022
[56]

Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming

Fei Wang, James Decker, Xilun Wu, Gregory Essertel, and Tiark Rompf. Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. U...

work page 2018
[57]

Rationale-augmented ensembles in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747, 2022a

work page arXiv 2022
[58]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022b

work page internal anchor Pith review Pith/arXiv arXiv 2022
[59]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[60]

Transformers: State-of-the-Art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020
[61]

Large Language Models as Optimizers

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023

work page internal anchor Pith review arXiv 2023
[62]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

work page internal anchor Pith review arXiv 2018
[63]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[64]

Answering questions by meta-reasoning over multiple chains of thought

Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007, 2023

work page arXiv 2023
[65]

Star: Bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, and Noah D Goodman. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465, 2022

work page arXiv 2022
[66]

Automatic Chain of Thought Prompting in Large Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

work page internal anchor Pith review arXiv 2022
[67]

ExpeL: LLM agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. arXiv preprint arXiv:2308.10144, 2023a. URL https://arxiv.org/pdf/2308.10144

work page arXiv 2023
[68]

Automatic model selection with large language models for reasoning

Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, and Qizhe Xie. Automatic model selection with large language models for reasoning. arXiv preprint arXiv:2305.14333, 2023b

work page arXiv 2023

[1] [1]

Optuna: A next-generation hyperparameter optimization framework

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp.\ 2623--2631, 2019

work page 2019

[2] [2]

Theano: A Python framework for fast computation of mathematical expressions

Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Fr \'e d \'e ric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, pp.\ arXiv--1605, 2016

work page 2016

[3] [3]

Theano: A CPU and GPU math compiler in Python

James Bergstra, Olivier Breuleux, Fr \'e d \'e ric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python . In Proc. 9th python in science conf, volume 1, pp.\ 3--10, 2010

work page 2010

[4] [4]

Theano: Deep learning on gpus with Python

James Bergstra, Fr \'e d \'e ric Bastien, Olivier Breuleux, Pascal Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde-Farley, Ian Goodfellow, Arnaud Bergeron, et al. Theano: Deep learning on gpus with Python . In NIPS 2011, BigLearning Workshop, Granada, Spain, volume 3. Citeseer, 2011

work page 2011

[5] [5]

Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures

James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pp.\ 115--123. PMLR, 2013

work page 2013

[6] [6]

On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901

[8] [8]

Hwchase17/langchain

Harrison Chase. Hwchase17/langchain. 2022. URL https://github.com/hwchase17/langchain

work page 2022

[9] [9]

Reading Wikipedia to answer open-domain questions

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading W ikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1870--1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1171. URL https://acl...

work page doi:10.18653/v1/p17-1171 2017

[10] [10]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

work page internal anchor Pith review arXiv 2023

[11] [11]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

work page internal anchor Pith review arXiv 2022

[12] [12]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

Torch: a modular machine learning software library

Ronan Collobert, Samy Bengio, and Johnny Mari \'e thoz. Torch: a modular machine learning software library. Technical report, Idiap, 2002

work page 2002

[15] [15]

Dohan, W

David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022

work page arXiv 2022

[16] [16]

Rarr: Researching and revising what language models say, using language models

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 16477--16...

work page 2023

[17] [17]

Pal: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp.\ 10764--10799. PMLR, 2023 b

work page 2023

[18] [18]

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023

work page internal anchor Pith review arXiv 2023

[19] [19]

REALM: Retrieval-Augmented Language Model Pre-Training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020. URL https://arxiv.org/abs/2002.08909

work page internal anchor Pith review arXiv 2002

[20] [20]

Training classifiers with natural language explanations

Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher R \'e . Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1884--1895. Association for Computational Linguistics, 2018. URL http://aclweb...

work page 2018

[21] [21]

(2023, August 31)

Bin Hu, Chenyang Zhao, Pu Zhang, Zihao Zhou, Yuanhang Yang, Zenglin Xu, and Bin Liu. Enabling intelligent interactions between an agent and an LLM : A reinforcement learning approach. arXiv preprint arXiv:2306.03604, 2023. URL https://arxiv.org/abs/2306.03604

work page arXiv 2023

[22] [22]

Large Language Models Can Self-Improve

Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022

work page internal anchor Pith review arXiv 2022

[23] [23]

Atlas: Few-shot Learning with Retrieval Augmented Language Models

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022

work page internal anchor Pith review arXiv 2022

[24] [24]

MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning

Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, et al. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445, 2022

work page internal anchor Pith review arXiv 2022

[25] [25]

B aleen: R obust M ulti- H op R easoning at S cale via C ondensed R etrieval

Omar Khattab, Christopher Potts, and Matei Zaharia. B aleen: R obust M ulti- H op R easoning at S cale via C ondensed R etrieval. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021 a

work page 2021

[26] [26]

Relevance-guided supervision for openqa with ColBERT

Omar Khattab, Christopher Potts, and Matei Zaharia. Relevance-guided supervision for openqa with ColBERT . Transactions of the Association for Computational Linguistics, 9: 0 929--944, 2021 b

work page 2021

[27] [27]

Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp,

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024, 2022

work page arXiv 2022

[28] [28]

Decomposed Prompting: A Modular Approach for Solving Complex Tasks

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022

[29]

Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022

[30]

Internet-augmented language models through few-shot prompting for open-domain question answering

Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022

[31]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural ...

[32]

LlamaIndex

Jerry Liu. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index

[33]

Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023

[34]

The Natural Language Decathlon: Multitask Learning as Question Answering

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730, 2018. URL https://arxiv.org/abs/1806.08730

[35]

Semantic kernel

Microsoft. Semantic kernel. 2023. URL https://learn.microsoft.com/semantic-kernel/

[36]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback, 2021. URL https://...

[37]

GPT-4 technical report

OpenAI. GPT-4 technical report, 2023

[38]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

[39]

PyTorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-perf...

[40]

Din-sql: Decomposed in-context learning of text-to-sql with self-correction

Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. arXiv preprint arXiv:2304.11015, 2023

[41]

Measuring and Narrowing the Compositionality Gap in Language Models

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

[42]

Automatic prompt optimization with "gradient descent" and beam search

Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

[43]

Answering complex open-domain questions through iterative query generation

Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2590--2602, Hong Kong, ...

[44]

Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text

Peng Qi, Haejun Lee, Oghenetegiri Sido, Christopher D Manning, et al. Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text. arXiv preprint arXiv:2010.12527, 2020. URL https://arxiv.org/abs/2010.12527

[45]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Ms, OpenAI, 2018. URL https://openai.com/blog/language-unsupervised/

[46]

Data programming: Creating large training sets, quickly

Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 3567--3575. Curran Associates, Inc., 2016. URL https://papers.nips.cc/paper/65...

[47]

ColBERTv2: Effective and efficient retrieval via lightweight late interaction

Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. arXiv preprint arXiv:2112.01488, 2021

[48]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

[49]

Synthetic prompting: Generating chain-of-thought demonstrations for large language models

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618, 2023

[50]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

[51]

Prompting gpt-3 to be reliable

Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150, 2022

[52]

Recitation-augmented language models

Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. arXiv preprint arXiv:2210.01296, 2022

[53]

Chainer: a next-generation open source framework for deep learning

Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), volume 5, pp. 1--6, 2015

[54]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[55]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022

[56]

Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming

Fei Wang, James Decker, Xilun Wu, Gregory Essertel, and Tiark Rompf. Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. U...

[57]

Rationale-augmented ensembles in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747, 2022a

[58]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022b

[59]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

[60]

Transformers: State-of-the-Art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...

[61]

Large Language Models as Optimizers

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023

[62]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

[63]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[64]

Answering questions by meta-reasoning over multiple chains of thought

Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007, 2023

[65]

Star: Bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, and Noah D Goodman. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465, 2022

[66]

Automatic Chain of Thought Prompting in Large Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

[67]

ExpeL: LLM agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. arXiv preprint arXiv:2308.10144, 2023a. URL https://arxiv.org/pdf/2308.10144

[68]

Automatic model selection with large language models for reasoning

Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, and Qizhe Xie. Automatic model selection with large language models for reasoning. arXiv preprint arXiv:2305.14333, 2023b
