Making Prompts First-Class Citizens for Adaptive LLM Pipelines

Alexander W. Lee; Andrew Crotty; Deepti Raghavan; Duo Lu; Shu Chen; Ugur Cetintemel

arXiv: 2508.05012 · v2 · submitted 2025-08-07 · 💻 cs.DB · cs.AI · cs.CL

Pith reviewed 2026-05-19 00:54 UTC · model grok-4.3

classification 💻 cs.DB · cs.AI · cs.CL

keywords prompt management · LLM pipelines · adaptive refinement · versioned views · policy-driven control · runtime feedback · provenance · first-class citizens

The pith

SPEAR elevates prompts to first-class status in LLM pipelines with versioned views, runtime refinement, and when-then policy rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that prompts currently function as brittle, opaque strings disconnected from the logic of complex LLM pipelines that retrieve data, correct errors, and coordinate agents. SPEAR addresses this by organizing prompts into versioned views that enable introspection and provenance tracking. It further allows prompts to refine themselves dynamically during execution in response to runtime feedback. Refinement logic is expressed declaratively through when-then policy rules. This design complements existing prompt optimization tools by focusing on runtime management and automatic evolution within data-centric LLM applications.

Core claim

SPEAR treats prompts as first-class citizens in the execution model. It provides structured prompt management through versioned views for introspection and provenance reasoning, adaptive prompt refinement that evolves prompts based on runtime feedback during execution, and policy-driven control that specifies automatic refinement as when-then rules.
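
To make the versioned-view idea concrete, here is a minimal Python sketch of an append-only prompt history with provenance notes. The names `PromptVersion`, `PromptView`, `commit`, and `provenance` are hypothetical and do not come from the paper. The abstract does not specify SPEAR's actual data model, so this only illustrates one plausible shape under that assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    """One immutable snapshot of a prompt (hypothetical structure)."""
    version: int
    text: str
    parent: Optional[int]  # version this snapshot was derived from
    reason: str            # provenance note: why the edit was made


@dataclass
class PromptView:
    """Append-only history of versions for a single named prompt."""
    name: str
    versions: list = field(default_factory=list)

    def commit(self, text: str, reason: str) -> PromptVersion:
        # New versions always point back at the latest committed one.
        parent = self.versions[-1].version if self.versions else None
        snapshot = PromptVersion(len(self.versions) + 1, text, parent, reason)
        self.versions.append(snapshot)
        return snapshot

    def current(self) -> PromptVersion:
        return self.versions[-1]

    def provenance(self) -> list:
        """Return (version, parent, reason) triples from oldest to newest."""
        return [(v.version, v.parent, v.reason) for v in self.versions]


view = PromptView("summarize")
view.commit("Summarize the document.", "initial prompt")
view.commit("Summarize the document in three bullet points.",
            "feedback: outputs too long")
```

Because each snapshot is a frozen dataclass, an earlier version cannot be edited in place; a change has to be recorded as a new commit, which keeps the provenance chain intact.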

What carries the argument

SPEAR (Structured Prompt Execution and Adaptive Refinement), which maintains versioned prompt views and applies when-then rules to drive adaptive refinement from runtime feedback.

If this is right

Prompts gain introspection capabilities and traceable provenance across pipeline runs.
Prompts evolve automatically during execution without separate manual tuning steps.
When-then rules allow declarative specification of refinement behavior under varying conditions.
The model unlocks reuse and optimization opportunities in error correction and agent coordination.
It plays a complementary role to existing prompt optimization frameworks and semantic query engines.
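
As a rough illustration of declarative when-then rules, the sketch below pairs a condition over runtime feedback with a prompt edit. Everything here is an assumption made for illustration: the `Rule` class, the `apply_rules` helper, the feedback keys `error_rate` and `avg_tokens`, and the thresholds are hypothetical and are not SPEAR's actual rule language.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A when-then policy rule (hypothetical representation)."""
    name: str
    when: Callable[[dict], bool]  # predicate over runtime feedback
    then: Callable[[str], str]    # edit applied to the prompt text


def apply_rules(prompt: str, feedback: dict, rules: list) -> tuple:
    """Apply each rule whose condition holds; return (new prompt, fired rule names)."""
    fired = []
    for rule in rules:
        if rule.when(feedback):
            prompt = rule.then(prompt)
            fired.append(rule.name)
    return prompt, fired


# Illustrative thresholds only; real values would come from the pipeline owner.
rules = [
    Rule("tighten_on_errors",
         when=lambda fb: fb.get("error_rate", 0.0) > 0.2,
         then=lambda p: p + " Answer only from the retrieved context."),
    Rule("shorten_on_token_overrun",
         when=lambda fb: fb.get("avg_tokens", 0) > 500,
         then=lambda p: p + " Keep the answer under 100 words."),
]

new_prompt, fired = apply_rules(
    "Answer the question.",
    {"error_rate": 0.35, "avg_tokens": 120},
    rules,
)
```

With this feedback only the first rule fires, so the prompt gains a single grounding instruction. Keeping the condition and the edit as separate fields is what makes the refinement logic inspectable rather than buried in pipeline code.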

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Versioned prompt views could support querying and analysis of prompt changes over time in a manner similar to database views.
Combining this approach with semantic query processing engines could enable joint optimization of data access and prompt logic.
Longer-term pipelines might maintain quality with reduced human intervention if refinement policies prove stable.

Load-bearing premise

Runtime feedback signals can be mapped reliably to prompt edits that improve downstream performance without causing instability, high latency, or needing extensive human oversight.

What would settle it

An experiment on a multi-step LLM pipeline in which feedback-triggered prompt changes either reduce task accuracy, increase latency beyond acceptable bounds, or require repeated manual corrections to stabilize.

Figures

Figures reproduced from arXiv: 2508.05012 by Alexander W. Lee, Andrew Crotty, Deepti Raghavan, Duo Lu, Shu Chen, Ugur Cetintemel.

Original abstract

Modern LLM pipelines increasingly resemble complex data-centric applications: they retrieve data, correct errors, call external tools, and coordinate interactions between agents. Yet, the central element controlling this entire process -- the prompt -- remains a brittle, opaque string that is entirely disconnected from the surrounding program logic. This disconnect fundamentally limits opportunities for reuse, optimization, and runtime adaptivity. In this paper, we describe our vision and an initial design of SPEAR (Structured Prompt Execution and Adaptive Refinement), a new approach to prompt management that treats prompts as first-class citizens in the execution model. Specifically, SPEAR enables: (1) structured prompt management, with prompts organized into versioned views to support introspection and reasoning about provenance; (2) adaptive prompt refinement, whereby prompts can evolve dynamically during execution based on runtime feedback; and (3) policy-driven control, a mechanism for the specification of automatic prompt refinement logic as when-then rules. By tackling the problem of runtime prompt refinement, SPEAR plays a complementary role in the vast ecosystem of existing prompt optimization frameworks and semantic query processing engines. We describe a number of related optimization opportunities unlocked by the SPEAR model, and our preliminary results demonstrate the strong potential of this approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a vision and initial design for SPEAR (Structured Prompt Execution and Adaptive Refinement), which treats prompts as first-class citizens in LLM pipelines. It enables (1) structured prompt management with versioned views supporting introspection and provenance, (2) adaptive prompt refinement that evolves prompts dynamically based on runtime feedback, and (3) policy-driven control via when-then rules for automatic prompt evolution. The work positions SPEAR as complementary to existing prompt optimization frameworks and semantic query processing engines, noting preliminary results that demonstrate strong potential.

Significance. If the adaptive mechanisms can be realized with appropriate stability properties, SPEAR would address a genuine gap by integrating prompt management into the execution model of data-centric LLM applications, enabling better reuse, provenance tracking, and runtime optimization in complex pipelines that combine retrieval, tool use, and agent coordination.

major comments (2)

Abstract (adaptive prompt refinement paragraph): The central claim that runtime feedback can be mapped to prompt edits via when-then rules to support automatic evolution lacks any formal semantics for the feedback language, boundedness guarantees on the edit operators, or analysis of convergence/oscillation under noisy signals such as error rates or token usage. This directly bears on the asserted support for stable, automatic prompt evolution without excessive latency or human oversight.
Abstract (SPEAR execution model description): The proposal introduces an execution model with versioned views and policy-driven control but provides no concrete definitions, examples, or invariants showing how these mechanisms integrate with surrounding program logic, leaving the feasibility of introspection and provenance claims ungrounded.

minor comments (1)

Abstract: The reference to 'preliminary results' demonstrating strong potential is stated without any quantitative details, error analysis, or baseline comparisons; adding a pointer to the relevant section or table would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of SPEAR to address a gap in prompt management for data-centric LLM applications. The major comments correctly highlight the need for stronger grounding of the vision in concrete mechanisms and stability considerations. We address each point below and will revise the manuscript to incorporate additional detail and discussion while preserving the paper's focus as a vision and initial design.

Point-by-point responses

Referee: Abstract (adaptive prompt refinement paragraph): The central claim that runtime feedback can be mapped to prompt edits via when-then rules to support automatic evolution lacks any formal semantics for the feedback language, boundedness guarantees on the edit operators, or analysis of convergence/oscillation under noisy signals such as error rates or token usage. This directly bears on the asserted support for stable, automatic prompt evolution without excessive latency or human oversight.

Authors: We agree that formal semantics, boundedness guarantees, and convergence analysis are essential for credible claims about stable automatic evolution. The current manuscript presents a high-level vision and initial design without these elements, as the work is positioned as complementary to existing optimization frameworks rather than a complete formal system. In revision we will add a new subsection discussing preliminary approaches to bounded edit operators (e.g., restricting changes to template slots or token budgets) and referencing techniques from adaptive query processing for handling noisy feedback signals. We will also explicitly state that full formal semantics and convergence proofs remain future work. This provides the requested grounding without overstating current results.

revision: partial

Referee: Abstract (SPEAR execution model description): The proposal introduces an execution model with versioned views and policy-driven control but provides no concrete definitions, examples, or invariants showing how these mechanisms integrate with surrounding program logic, leaving the feasibility of introspection and provenance claims ungrounded.

Authors: We acknowledge that the absence of concrete definitions and examples leaves the integration claims insufficiently grounded. In the revised version we will insert a dedicated subsection containing (1) pseudocode definitions for versioned prompt views and when-then policy rules, (2) a worked example showing integration with a retrieval-augmented generation pipeline, and (3) a short list of invariants (e.g., view immutability after commit and provenance chain preservation) that ensure introspection remains feasible. These additions will directly support the feasibility of the provenance and introspection features.

revision: yes
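
The bounded edit operators mentioned in the first response (restricting changes to template slots or token budgets) could look roughly like the following. This is a sketch under stated assumptions, not the authors' design: `TEMPLATE`, `EDITABLE_SLOTS`, `MAX_WORDS`, and `bounded_edit` are hypothetical names, and a whitespace word count stands in for a real token budget.

```python
from string import Template

# Hypothetical prompt template; only the slots listed below may be refined.
TEMPLATE = Template("You are a $role. $instruction Context: $context")
EDITABLE_SLOTS = {"instruction"}
MAX_WORDS = 20  # crude stand-in for a token budget


def bounded_edit(slots: dict, slot: str, new_value: str) -> dict:
    """Return a copy of slots with one slot replaced, or raise if out of bounds."""
    if slot not in EDITABLE_SLOTS:
        raise ValueError(f"slot {slot!r} is not editable")
    if len(new_value.split()) > MAX_WORDS:
        raise ValueError("edit exceeds the word budget")
    updated = dict(slots)  # leave the caller's mapping untouched
    updated[slot] = new_value
    return updated


slots = {
    "role": "support agent",
    "instruction": "Answer briefly.",
    "context": "{docs}",
}
refined = bounded_edit(slots, "instruction",
                       "Answer briefly and cite the source document.")
prompt = TEMPLATE.substitute(refined)
```

Rejecting edits outside the allowed slots or over budget bounds how far any single refinement step can move the prompt, which is one simple way to limit drift when feedback is noisy.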

Circularity Check

0 steps flagged

No circularity: vision paper proposes system architecture without derivations or self-referential reductions

full rationale

The paper is a forward-looking system design proposal for SPEAR that outlines structured prompt management, adaptive refinement via runtime feedback, and when-then policy rules. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. Central claims describe enabling features in an execution model and do not reduce to quantities defined by the authors' own prior results or self-citations. The design sketch is self-contained as an architectural vision whose correctness depends on future implementation and evaluation rather than tautological mappings or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The design rests on domain assumptions about the feasibility of mapping runtime signals to useful prompt edits and the value of treating prompts like database objects; no free parameters or new physical entities are introduced.

axioms (2)

domain assumption: Runtime feedback can be translated into prompt edits that measurably improve task outcomes without excessive instability or human intervention.
Invoked in the description of adaptive prompt refinement and policy-driven control.
domain assumption: Prompts can be organized into versioned views while preserving their original expressiveness and utility for LLM calls.
Central to the structured prompt management feature.

invented entities (1)

SPEAR execution model · no independent evidence
purpose: To elevate prompts to first-class, versioned, and policy-governed objects inside LLM pipelines.
New proposed abstraction whose benefits are asserted but not yet independently verified beyond preliminary results.

pith-pipeline@v0.9.0 · 5760 in / 1479 out tokens · 33649 ms · 2026-05-19T00:54:54.401359+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "SPEAR defines a prompt algebra that governs how prompts are constructed and adapted... core operators RET, GEN, REF, CHECK, MERGE, DELEGATE"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and recovery · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "policy-driven control... when-then rules... automatic prompt refinement logic"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems 35 (2022), 16344–16359

[2] In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems 6 (2024), 325–338

[3] Alec Go, Richa Bhayani, and Lei Huang. 2009. Sentiment140 dataset. https://www.kaggle.com/datasets/kazanova/sentiment140

[4] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2024. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. The Twelfth International Conference on Learning Representations

[5] LangGraph. 2023. LangGraph. https://www.langchain.com/langgraph

[6] Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Rana Shahout, and Gerardo Vitagliano. 2025. Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing. In CIDR

[7] Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, and Eric Horvitz. 2023. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in ...

[8] Liana Patel, Siddharth Jha, Melissa Pan, Harshit Gupta, Parth Asawa, Carlos Guestrin, and Matei Zaharia. 2025. Semantic Operators: A Declarative Model for Rich, AI-based Data Processing. arXiv:2407.11418 [cs.DB]

[9] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently scaling transformer inference. Proceedings of machine learning and systems (2023)

[10] Matthew Russo, Sivaprasad Sudhir, Gerardo Vitagliano, Chunwei Liu, Tim Kraska, Samuel Madden, and Michael Cafarella. 2025. Abacus: A Cost-Based Optimizer for Semantic Operator Systems. arXiv:2505.14661 [cs.DB]

[11] Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G. Parameswaran, and Eugene Wu. 2024. DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing. arXiv:2410.12189 [cs.DB]

[12] Shreya Shankar, Haotian Li, Parth Asawa, Madelon Hulsebos, Yiming Lin, J. D. Zamfirescu-Pereira, Harrison Chase, Will Fu-Hinthorn, Aditya G. Parameswaran, and Eugene Wu. 2024. spade: Synthesizing Data Quality Assertions for Large Language Model Pipelines. Proc. VLDB Endow. 17, 12 (Aug. 2024), 4173–4186. doi:10.14778/3685800.3685835

[13] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36 (2023), 8634–8652

[14] Arnav Singhvi, Manish Shetty, Shangyin Tan, Christopher Potts, Koushik Sen, Matei Zaharia, and Omar Khattab. 2024. DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines. arXiv:2312.13382 [cs.CL]

[15] Sonish Sivarajkumar, Mark Kelley, Alyssa Samolyk-Mazzanti, Shyam Visweswaran, and Yanshan Wang. 2024. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study. JMIR Medical Informatics 12 (2024), e55318. doi:10.2196/55318

[16] vLLM. 2024. Automatic Prefix Caching. https://docs.vllm.ai/en/v0.9.2/design/automatic_prefix_caching.html

[17] Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, and Jian Li. 2024. Prompt Engineering in Consistency and Reliability with the Evidence-Based Guideline for LLMs. NPJ Digital Medicine 7, 1 (2024), 41. doi:10.1038/s41746-024-01046-y
