DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Pith reviewed 2026-05-11 18:52 UTC · model grok-4.3
The pith
DSPy turns a few lines of declarative code into language model pipelines that self-optimize and outperform few-shot and expert prompting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DSPy abstracts LM pipelines as text transformation graphs in which LMs are called through declarative, parameterized modules. The compiler optimizes any such pipeline for a given metric by automatically generating demonstrations and searching over module configurations and compositions of prompting, reasoning, and augmentation techniques. Succinct DSPy programs thereby produce pipelines that, after compilation, outperform standard few-shot prompting and expert-created demonstrations on tasks including math reasoning and multi-hop QA.
What carries the argument
Parameterized DSPy modules inside computational graphs, together with a compiler that collects demonstrations and searches configurations to maximize a target metric.
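To make the mechanism concrete, here is a minimal sketch in plain Python. It is not DSPy's actual API: `ToyModule`, `toy_lm`, `exact_match`, and `bootstrap_compile` are hypothetical names invented for illustration, and the stub `toy_lm` stands in for a real language model. The sketch shows only the core idea described above, namely that a module carries learnable demonstrations and a compile step fills them with the module's own outputs that pass a metric.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input text, output text)


@dataclass
class ToyModule:
    """Hypothetical declarative step: an instruction plus learnable demonstrations."""
    instruction: str
    demos: List[Demo] = field(default_factory=list)

    def prompt(self, x: str) -> str:
        # Render the instruction, the collected demonstrations, then the new input.
        shots = "".join(f"Q: {q}\nA: {a}\n" for q, a in self.demos)
        return f"{self.instruction}\n{shots}Q: {x}\nA:"

    def __call__(self, lm: Callable[[str], str], x: str) -> str:
        return lm(self.prompt(x))


def toy_lm(prompt: str) -> str:
    """Stand-in for an LM (assumption): adds two integers, fails on operands >= 100."""
    question = prompt.rsplit("Q: ", 1)[1].split("\nA:")[0]
    a, b = (int(part) for part in question.split("+"))
    if a >= 100 or b >= 100:
        return "too hard"
    return str(a + b)


def exact_match(pred: str, gold: str) -> bool:
    return pred.strip() == gold.strip()


def bootstrap_compile(module: ToyModule, lm, trainset, metric, max_demos: int = 2) -> ToyModule:
    """Run the module on training inputs; keep outputs that pass the metric as demos."""
    compiled = ToyModule(module.instruction, list(module.demos))
    for x, gold in trainset:
        if len(compiled.demos) >= max_demos:
            break
        pred = module(lm, x)
        if metric(pred, gold):
            compiled.demos.append((x, pred))
    return compiled


trainset = [("1+2", "3"), ("200+5", "205"), ("10+5", "15")]
compiled = bootstrap_compile(ToyModule("Add the numbers."), toy_lm, trainset, exact_match)
print(compiled.prompt("4+4"))
```

The failed trace for `200+5` is filtered out by the metric, so only passing traces become demonstrations. A real compiler would additionally search over which demonstrations and module configurations to keep; this sketch omits that search.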
If this is right
- Succinct DSPy programs can express and optimize complex pipelines for reasoning, retrieval, and control tasks.
- Open models as small as 770M-parameter T5 become competitive with expert prompt chains written for proprietary GPT-3.5.
- The same declarative program can be recompiled for different metrics or models without rewriting prompts.
- Models can self-bootstrap training data and improve their own performance on the target task within minutes.
- Pipeline development shifts from hand-crafted strings to declarative code plus automatic optimization.
Where Pith is reading between the lines
- The approach could lower the expertise barrier for building reliable LM applications by automating much of the prompt engineering.
- Compiled pipelines might adapt more readily to new domains if the compiler is given additional unlabeled data or metrics.
- Extending the same declarative graph structure to multimodal or tool-using agents would be a natural next step.
- Combining the compiler with lightweight fine-tuning on the collected demonstrations could further improve small-model performance.
Load-bearing premise
Automatic search over module configurations driven by collected demonstrations will reliably locate high-performing pipelines without overfitting to the validation metric or demanding prohibitive compute.
What would settle it
On a new task, the DSPy compiler produces a pipeline whose accuracy is no higher than that of a standard few-shot prompt baseline using the same underlying language model.
Original abstract
The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded "prompt templates", i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric. We conduct two case studies, showing that succinct DSPy programs can express and optimize sophisticated LM pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, a few lines of DSPy allow GPT-3.5 and llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively). On top of that, DSPy programs compiled to open and relatively small LMs like 770M-parameter T5 and llama2-13b-chat are competitive with approaches that rely on expert-written prompt chains for proprietary GPT-3.5. DSPy is available at https://github.com/stanfordnlp/dspy
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DSPy, a programming model that represents LM pipelines as imperative computational graphs of declarative, parameterized modules. These modules learn by collecting demonstrations to compose prompting, reasoning, and other techniques. A compiler optimizes any DSPy program for a given metric via bootstrap search over module configurations and auto-generated demonstrations. Two case studies demonstrate that short DSPy programs enable GPT-3.5 and Llama-2-13B-chat to self-improve pipelines for math word problems, multi-hop QA, and agent control, outperforming standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5-46% and 16-40%, respectively). Compiled DSPy programs on smaller open models are competitive with expert-written prompt chains for GPT-3.5.
Significance. If the reported gains are robust to validation-set selection bias, the work offers a valuable systematic alternative to manual prompt engineering by turning pipeline design into a programmable, optimizable artifact. The public GitHub release of the DSPy library supports reproducibility and further experimentation.
major comments (1)
- [§4] §4 (Bootstrap Optimizer): The optimizer repeatedly samples LM-generated demonstrations, scores candidate pipelines on a validation metric, and selects the best configuration. No separate held-out selection set, Bonferroni-style correction, or post-selection evaluation on untouched data is described. When the base LM is weak (e.g., Llama-2-13B-chat), noisy or metric-correlated demonstrations can amplify selection bias. This directly affects the central claim that the compiler reliably discovers high-performing pipelines, because the 25-65% gains over few-shot baselines and the 5-46% gains over expert prompts could partly reflect overfitting rather than genuine improvement.
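The selection-bias concern in this comment can be quantified with a small calculation. The sketch below rests on simplifying assumptions that are not from the paper: every candidate pipeline has the same true accuracy, and candidates' validation scores are independent. Under those assumptions it computes the exact expected validation accuracy of the best-scoring candidate. The function names are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def expected_best_val_accuracy(n_val: int, true_acc: float, n_candidates: int) -> float:
    """Expected validation accuracy of the best of `n_candidates` equally good pipelines.

    Uses E[max] = sum over k >= 1 of P(max >= k), where max is the largest
    number of correct validation answers among the independent candidates.
    """
    total = 0.0
    for k in range(1, n_val + 1):
        p_all_below = binom_cdf(k - 1, n_val, true_acc) ** n_candidates
        total += 1.0 - p_all_below
    return total / n_val


single = expected_best_val_accuracy(n_val=50, true_acc=0.5, n_candidates=1)
best_of_50 = expected_best_val_accuracy(n_val=50, true_acc=0.5, n_candidates=50)
print(f"one candidate: {single:.3f}, best of 50: {best_of_50:.3f}")
```

With one candidate the expected score equals the true accuracy. With many candidates the winner's validation score is inflated even though no candidate is actually better, which is why the winner should be re-scored on data the search never touched.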
minor comments (2)
- [Abstract and §5] The abstract and experimental sections provide no details on the compiler's search algorithm (e.g., beam size, number of rounds), hyperparameter choices, or statistical significance testing of the reported deltas.
- [Figures/Tables in §5] Figure and table captions could more explicitly state the exact validation metric used for each task and whether the same split was used for both optimization and final reporting.
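The first minor comment asks for significance testing of the reported deltas. One common option is a paired percentile bootstrap over per-example correctness, sketched here as an assumption rather than anything the paper or this report prescribes. The function name and its parameters are hypothetical; the inputs are 0/1 correctness flags for two systems on the same test examples.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    a_correct: List[int],
    b_correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed delta, CI low, CI high) for accuracy(A) - accuracy(B)."""
    if len(a_correct) != len(b_correct) or not a_correct:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a_correct)
    # Per-example differences keep the pairing between the two systems.
    diffs = [a - b for a, b in zip(a_correct, b_correct)]
    observed = sum(diffs) / n
    rng = random.Random(seed)
    # Resample examples with replacement and recompute the accuracy delta.
    deltas = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lo, hi


compiled_flags = [1, 1, 1, 0] * 25   # illustrative data, not results from the paper
baseline_flags = [1, 0, 1, 0] * 25
print(paired_bootstrap_ci(compiled_flags, baseline_flags))
```

If the interval excludes zero, the delta is unlikely to be resampling noise on that test set. This says nothing about bias introduced earlier by selecting the pipeline on the validation set.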
Simulated Author's Rebuttal
We thank the referee for their positive assessment of DSPy's significance and for the detailed feedback on the bootstrap optimizer. We address the major comment below.
Point-by-point responses
Referee: [§4] §4 (Bootstrap Optimizer): The optimizer repeatedly samples LM-generated demonstrations, scores candidate pipelines on a validation metric, and selects the best configuration. No separate held-out selection set, Bonferroni-style correction, or post-selection evaluation on untouched data is described. When the base LM is weak (e.g., Llama-2-13B-chat), noisy or metric-correlated demonstrations can amplify selection bias. This directly affects the central claim that the compiler reliably discovers high-performing pipelines, because the 25-65% gains over few-shot baselines and the 5-46% gains over expert prompts could partly reflect overfitting rather than genuine improvement.
Authors: We agree that the bootstrap optimizer, as currently described in §4, uses the validation set both to generate demonstrations and to select the best pipeline configuration, without a separate held-out selection set or post-selection evaluation on untouched data. This design is intentional for practical settings with limited labeled data, but we acknowledge the referee's point that it can introduce selection bias, particularly with weaker base models. The reported gains are measured on fully held-out test sets, yet the optimization step itself may overfit to the validation metric. We will revise the manuscript to (1) explicitly discuss this limitation in §4, (2) add experiments that reserve a portion of the validation data solely for post-selection evaluation, and (3) report results with Bonferroni-style corrections where multiple configurations are compared. These changes will provide stronger evidence that the observed improvements reflect genuine pipeline optimization rather than overfitting.
Revision: yes
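For readers unfamiliar with the Bonferroni-style correction mentioned in the response, a minimal sketch follows. It assumes each compared configuration yields one p-value against a baseline; the helper name is hypothetical and the p-values are illustrative, not taken from the paper.

```python
from typing import List


def bonferroni_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag which hypotheses are rejected at family-wise error level `alpha`.

    With m comparisons, each p-value is tested against alpha / m.
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Three candidate configurations compared against one baseline (illustrative values).
print(bonferroni_reject([0.001, 0.02, 0.04], alpha=0.05))
```

With three comparisons the per-test threshold drops to 0.05 / 3, so only the smallest of these illustrative p-values survives the correction.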
Circularity Check
No circularity: empirical gains measured on external test sets against fixed baselines
The paper introduces DSPy as a declarative programming model and compiler for LM pipelines, with optimizers (including bootstrap) that collect demonstrations and search configurations to maximize a user-specified metric. The central claims consist of empirical results: compiled pipelines outperform standard few-shot prompting and expert demonstrations on held-out test sets for tasks like math word problems and multi-hop QA. These comparisons use fixed external baselines rather than quantities defined inside the DSPy system. No equations, uniqueness theorems, or first-principles derivations appear that reduce a reported prediction to a fitted parameter or self-citation by construction. The bootstrap process is described as an optimization procedure whose outputs are evaluated externally, so the reported performance does not depend on quantities the system defines for itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Language models respond usefully to compositions of prompting, finetuning, and reasoning techniques when those techniques are expressed through parameterized declarative modules.
invented entities (2)
- DSPy module: no independent evidence
- DSPy compiler: no independent evidence
Forward citations
Cited by 49 Pith papers
-
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
-
Efficient Ensemble Selection from Binary and Pairwise Feedback
The paper develops efficient algorithms for ensemble selection from binary and pairwise feedback, achieving (1-1/e) guarantees with query savings for coverage and PTAS-style results via submodular relaxation for theta...
-
TRACE: Tourism Recommendation with Accountable Citation Evidence
TRACE is a new benchmark dataset and evaluation suite for conversational tourism recommenders that requires systems to suggest POIs, cite verifiable review spans, and recover from rejections, revealing a Three-Compete...
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.
-
More Is Not Always Better: Cross-Component Interference in LLM Agent Scaffolding
Full factorial testing of five LLM agent components reveals that the complete 'All-In' combination is consistently outperformed by smaller subsets due to cross-component interference, with optimal subsets being task- ...
-
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
-
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.
-
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
-
RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design
RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than...
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
Meta-Harness: End-to-End Optimization of Model Harnesses
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
-
The Prompt Report: A Systematic Survey of Prompt Engineering Techniques
This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.
-
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
-
Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
SIREN corrects winner's curse bias in adaptive LLM benchmarking via selection-aware repeated splits and bootstrap for valid procedure-level confidence intervals.
-
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
-
The Two Boundaries: Why Behavioral AI Governance Fails Structurally
Behavioral governance of AI effects is undecidable for Turing-complete architectures, making coterminous boundaries via computation-effect separation the only structural solution rather than post-hoc layers.
-
Probabilistic Programs of Thought
Probabilistic programs of thought let LLMs produce many program variants from one generation by building a compact probabilistic representation of the token distribution.
-
Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.
-
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.
-
Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
Behavior Latticing: Inferring User Motivations from Unstructured Interactions
Behavior latticing synthesizes connections across unstructured user interactions to generate insights into underlying motivations, yielding deeper and more accurate user understanding than task-only models.
-
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
-
SGLang: Efficient Execution of Structured Language Model Programs
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis
The paper introduces a reproducible optimization protocol for prompt-based LLM workflows in evidence synthesis that separates task definitions from prompt harnesses, optimizes the harness against metrics and examples,...
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
AgenticPosesRanker: An Agentic AI Framework for Physically Grounded Ranking of Protein-Ligand Docking Poses
AgenticPosesRanker ranks docking poses using six deterministic physical tools and LLM reasoning, achieving 50% best-pose accuracy that matches the Smina baseline on a balanced 10-system, 162-pose benchmark.
-
LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation
Two-stage Schema-Guided Reasoning with LLM condensation and deterministic compilation achieves macro-F1 of 0.63 on dyspnea CRF filling task and is language-agnostic.
-
Auditing and Controlling AI Agent Actions in Spreadsheets
Pista decomposes AI agent actions in spreadsheets into auditable steps, enabling real-time user intervention that improves task outcomes, user comprehension, agent perception, and sense of co-ownership over baseline agents.
-
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...
-
When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies
LLM features optimized for high information coefficient with returns do not reliably improve PPO trading policies under distribution shifts, where price-only or macro baselines remain more robust.
-
Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM
AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.
-
LLM Agents Predict Social Media Reactions but Do Not Outperform Text Classifiers: Benchmarking Simulation Accuracy Using 120K+ Personas of 1511 Humans
Zero-shot LLM agents with human personas predict individual social media reactions better than chance (MCC 0.29) but worse than conventional text classifiers (MCC 0.36).
-
100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
Lightweight proxy models deliver over 100x cost and latency savings for semantic AI queries in databases with accuracy preserved or improved on benchmarks up to 10M rows.
-
TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature
TCMIIES is a zero-install browser platform with schema-guided LLM prompting that achieves over 94% structured output compliance for academic information extraction, including support for Chinese databases.
-
Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation
Execution feedback in refinement loops improves 1-3B code generation performance far more than complex pipeline topologies discovered via evolutionary search on HumanEval and sanitized MBPP.
-
Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
-
Statistical Software Engineering with Tuned Variables
AI system maintenance requires treating configuration choices as versioned governed tuned variables promoted via statistical evidence from sampled evaluations.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO
Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.
-
Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study
A deployed modular inference architecture for compound AI systems cut tail latency over 50%, boosted throughput up to 3.9x, and reduced costs 30-40% while handling multi-model agent workloads.
Reference graph
Works this paper leans on
-
[1]
Optuna: A next-generation hyperparameter optimization framework
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp.\ 2623--2631, 2019
work page 2019
-
[2]
Theano: A Python framework for fast computation of mathematical expressions
Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Fr \'e d \'e ric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, pp.\ arXiv--1605, 2016
work page 2016
-
[3]
Theano: A CPU and GPU math compiler in Python
James Bergstra, Olivier Breuleux, Fr \'e d \'e ric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math compiler in Python . In Proc. 9th python in science conf, volume 1, pp.\ 3--10, 2010
work page 2010
-
[4]
Theano: Deep learning on gpus with Python
James Bergstra, Fr \'e d \'e ric Bastien, Olivier Breuleux, Pascal Lamblin, Razvan Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde-Farley, Ian Goodfellow, Arnaud Bergeron, et al. Theano: Deep learning on gpus with Python . In NIPS 2011, BigLearning Workshop, Granada, Spain, volume 3. Citeseer, 2011
work page 2011
-
[5]
James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pp.\ 115--123. PMLR, 2013
work page 2013
-
[6]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[7]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020
work page 1901
-
[8]
Harrison Chase. Hwchase17/langchain. 2022. URL https://github.com/hwchase17/langchain
work page 2022
-
[9]
Reading W ikipedia to answer open-domain questions
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading W ikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1870--1879, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1171. URL https://acl...
-
[10]
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023
work page internal anchor Pith review arXiv 2023
-
[11]
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022
work page internal anchor Pith review arXiv 2022
-
[12]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[13]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[14]
Torch: a modular machine learning software library
Ronan Collobert, Samy Bengio, and Johnny Mari \'e thoz. Torch: a modular machine learning software library. Technical report, Idiap, 2002
work page 2002
-
[15]
arXiv preprint arXiv:2207.10342 , year=
David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022
-
[16]
Rarr: Researching and revising what language models say, using language models
Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 16477--16...
work page 2023
-
[17]
Pal: Program-aided language models
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp.\ 10764--10799. PMLR, 2023 b
work page 2023
-
[18]
Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023
-
[19]
REALM: Retrieval-Augmented Language Model Pre-Training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909, 2020. URL https://arxiv.org/abs/2002.08909
work page internal anchor Pith review arXiv 2002
-
[20]
Training classifiers with natural language explanations
Braden Hancock, Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, and Christopher R \'e . Training classifiers with natural language explanations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1884--1895. Association for Computational Linguistics, 2018. URL http://aclweb...
work page 2018
-
[21]
Enabling intelligent interactions between an agent and an llm: A reinforcement learning approach
Bin Hu, Chenyang Zhao, Pu Zhang, Zihao Zhou, Yuanhang Yang, Zenglin Xu, and Bin Liu. Enabling intelligent interactions between an agent and an LLM : A reinforcement learning approach. arXiv preprint arXiv:2306.03604, 2023. URL https://arxiv.org/abs/2306.03604
- [22]
-
[23]
Few-shot learning with retrieval augmented language models,
Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299, 2022
-
[24]
Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, et al. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445, 2022
work page internal anchor Pith review arXiv 2022
-
[25]
B aleen: R obust M ulti- H op R easoning at S cale via C ondensed R etrieval
Omar Khattab, Christopher Potts, and Matei Zaharia. B aleen: R obust M ulti- H op R easoning at S cale via C ondensed R etrieval. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021 a
work page 2021
-
[26]
Relevance-guided supervision for openqa with ColBERT
Omar Khattab, Christopher Potts, and Matei Zaharia. Relevance-guided supervision for openqa with ColBERT . Transactions of the Association for Computational Linguistics, 9: 0 929--944, 2021 b
work page 2021
-
[27]
arXiv preprint arXiv:2212.14024 (2022)
Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024, 2022
-
[28]
Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022
-
[29]
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022
-
[30]
Internet-augmented language models through few-shot prompting for open-domain question answering
Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022
-
[31]
Retrieval-augmented generation for knowledge-intensive nlp tasks
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural ...
-
[32]
Jerry Liu. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index
-
[33]
Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023
-
[34]
The Natural Language Decathlon: Multitask Learning as Question Answering
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv:1806.08730, 2018. URL https://arxiv.org/abs/1806.08730
-
[35]
Microsoft. Semantic kernel. 2023. URL https://learn.microsoft.com/semantic-kernel/
-
[36]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback, 2021. URL https://...
- [37]
-
[38]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022
-
[39]
PyTorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-perf...
-
[40]
Din-sql: Decomposed in-context learning of text-to-sql with self-correction
Mohammadreza Pourreza and Davood Rafiei. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. arXiv preprint arXiv:2304.11015, 2023
-
[41]
Measuring and narrowing the compositionality gap in language models
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022
-
[42]
Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023
-
[43]
Peng Qi, Xiaowen Lin, Leo Mehr, Zijian Wang, and Christopher D. Manning. Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2590-2602, Hong Kong, ...
-
[44]
Peng Qi, Haejun Lee, Oghenetegiri Sido, Christopher D Manning, et al. Retrieve, rerank, read, then iterate: Answering open-domain questions of arbitrary complexity from text. arXiv preprint arXiv:2010.12527, 2020. URL https://arxiv.org/abs/2010.12527
-
[45]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Ms, OpenAI, 2018. URL https://openai.com/blog/language-unsupervised/
-
[46]
Data programming: Creating large training sets, quickly
Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 3567-3575. Curran Associates, Inc., 2016. URL https://papers.nips.cc/paper/65...
-
[47]
ColBERTv2: Effective and efficient retrieval via lightweight late interaction
Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. arXiv preprint arXiv:2112.01488, 2021
-
[48]
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023
-
[49]
Synthetic prompting: Generating chain-of-thought demonstrations for large language models
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic prompting: Generating chain-of-thought demonstrations for large language models. arXiv preprint arXiv:2302.00618, 2023
-
[50]
Reflexion: Language Agents with Verbal Reinforcement Learning
Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023
-
[51]
Prompting gpt-3 to be reliable
Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150, 2022
-
[52]
Recitation-augmented language models
Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. arXiv preprint arXiv:2210.01296, 2022
-
[53]
Chainer: a next-generation open source framework for deep learning
Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS), volume 5, pp. 1-6, 2015
-
[54]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
-
[55]
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022
-
[56]
Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming
Fei Wang, James Decker, Xilun Wu, Gregory Essertel, and Tiark Rompf. Backpropagation with callbacks: Foundations for efficient and expressive differentiable programming. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. U...
-
[57]
Rationale-Augmented Ensembles in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Rationale-augmented ensembles in language models. arXiv preprint arXiv:2207.00747, 2022a
-
[58]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022b
-
[59]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022
-
[60]
Transformers: State-of-the-Art Natural Language Processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...
-
[61]
Large Language Models as Optimizers
Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023
-
[62]
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018
-
[63]
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022
-
[64]
Answering questions by meta-reasoning over multiple chains of thought
Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. Answering questions by meta-reasoning over multiple chains of thought. arXiv preprint arXiv:2304.13007, 2023
-
[65]
Eric Zelikman, Yuhuai Wu, and Noah D Goodman. Star: Bootstrapping reasoning with reasoning. arXiv preprint arXiv:2203.14465, 2022
-
[66]
Automatic chain of thought prompting in large language models
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022
-
[67]
Expel: Llm agents are experiential learners
Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. arXiv preprint arXiv:2308.10144, 2023a. URL https://arxiv.org/pdf/2308.10144
-
[68]
Automatic model selection with large language models for reasoning
Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, and Qizhe Xie. Automatic model selection with large language models for reasoning. arXiv preprint arXiv:2305.14333, 2023b