Reflective Prompt Tuning through Language Model Function-Calling

Estevam Hruschka; Farima Fatahi Bayat; Moin Aminnaseri; Pouya Pezeshkpour

arxiv: 2605.21781 · v1 · pith:SHD2RO5Vnew · submitted 2026-05-20 · 💻 cs.CL

Farima Fatahi Bayat, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka

Pith reviewed 2026-05-22 08:39 UTC · model grok-4.3

classification 💻 cs.CL

keywords prompt optimization · large language models · function calling · reflective tuning · reasoning tasks · failure diagnosis · confidence calibration · iterative refinement

The pith

Reflective Prompt Tuning uses an LLM optimizer with function calls to diagnose recurring failure modes across a full dataset and iteratively revise prompts for better reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reflective Prompt Tuning (RPT) as a way to automate prompt design for large language models on reasoning tasks. Instead of manual tweaks or small-batch critiques, an LLM optimizer calls a diagnostic function that evaluates the target model on the entire optimization set, summarizes systematic errors into a structured report, and draws on memory of prior reports to generate targeted prompt revisions. The method also incorporates calibration signals to improve both performance and confidence estimates. A reader would care because prompt engineering is time-consuming and sensitive, and this approach aims to reduce that effort while keeping inference flexible and preserving gains on complex tasks like multi-hop and math reasoning.

Core claim

Reflective Prompt Tuning enables an LLM optimizer to simulate human prompt engineering through function calling: it invokes a diagnostic function that runs the target model over the full optimization set, produces a report of recurring failure modes, and then uses that report plus accumulated prior reports to make iterative prompt revisions, yielding up to 12.9-point gains over initial prompts on three reasoning tasks while remaining competitive with prior methods and improving calibration.

What carries the argument

The diagnostic function that evaluates the target model over the entire optimization set and returns a structured report of recurring failure modes, which the optimizer combines with memory of earlier reports to drive targeted prompt revisions.
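The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Report` fields, the function names, and the toy stand-ins for the target model, critic, and optimizer are all hypothetical, and a real system would replace each stand-in with an LLM call.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Report:
    """Structured diagnostic report for one prompt (hypothetical format)."""
    prompt: str
    accuracy: float
    failure_modes: dict[str, int]  # failure label -> number of failed examples


def diagnose(prompt, dataset, target_model, critic):
    """Evaluate the prompt on the full optimization set and cluster failures by label."""
    counts = Counter()
    correct = 0
    for question, gold in dataset:
        pred = target_model(prompt, question)
        if pred == gold:
            correct += 1
        else:
            counts[critic(question, gold, pred)] += 1
    return Report(prompt, correct / len(dataset), dict(counts))


def rpt_loop(initial_prompt, dataset, target_model, critic, revise, iterations=3):
    """Diagnose, store the report in memory, then revise using report plus memory."""
    memory = []
    prompt = initial_prompt
    for _ in range(iterations):
        report = diagnose(prompt, dataset, target_model, critic)
        memory.append(report)
        if not report.failure_modes:  # nothing left to fix
            break
        prompt = revise(prompt, report, memory)
    return memory


# Toy stand-ins (assumptions for illustration only).
DATASET = [("2+2", "4"), ("3 km in m", "3000 m"), ("5 km in m", "5000 m")]
ANSWERS = {"2+2": "4", "3 km in m": "3000", "5 km in m": "5000"}


def toy_target(prompt, question):
    answer = ANSWERS[question]
    if question.endswith("in m") and "include units" in prompt:
        answer += " m"
    return answer


def toy_critic(question, gold, pred):
    return "missing unit" if gold.startswith(pred) else "wrong value"


def toy_revise(prompt, report, memory):
    top = max(report.failure_modes, key=report.failure_modes.get)
    if top == "missing unit":
        return prompt + " Always include units."
    return prompt


memory = rpt_loop("Answer the question.", DATASET, toy_target, toy_critic, toy_revise)
for r in memory:
    print(round(r.accuracy, 2), r.failure_modes)
```

In this toy run the first report clusters both failures under one label, and the revision that targets that label clears them in the next iteration. The sketch passes the whole memory to `revise` so that a real optimizer could avoid repeating edits that earlier reports showed to be ineffective.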

If this is right

RPT produces prompt revisions that align with the diagnosed failure patterns from the reports.
Performance improves by up to 12.9 points over initial prompts on reasoning benchmarks while staying competitive with state-of-the-art optimizers.
Calibration improves when calibration signals are included in the diagnostic feedback and final selection.
The approach shows particular strength on multi-hop and mathematical reasoning tasks.
Iterative use of accumulated memory of prior reports enables more systematic edits than single-example or small-batch methods.
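To make the calibration point concrete, here is a small sketch of two signals a diagnostic report could carry. The paper's optimizer prompt names the Brier score and overconfidence as targets; the exact formulas the authors use are not given on this page, so the standard Brier score and a simple confidence-minus-accuracy gap are assumptions, as are the function names.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome (lower is better)."""
    return sum((c - float(y)) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def overconfidence_gap(confidences, correct):
    """Mean confidence minus accuracy; positive values indicate overconfidence."""
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy


confidences = [0.9, 0.8, 0.6, 0.3]
correct = [True, False, True, False]
print(brier_score(confidences, correct))
print(overconfidence_gap(confidences, correct))
```

Both quantities need only the model's stated confidence and a correctness flag per example, so they can be computed in the same full-set pass that produces the failure report.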

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The full-set evaluation could make RPT more robust than batch-based methods when datasets contain noisy or outlier examples.
Similar reflective diagnostic loops could be applied to other LLM adaptation settings such as tool-use or agent design.
The structured report format might allow human engineers to inspect and override the optimizer's suggested changes.
Combining RPT with parameter-efficient fine-tuning could compound gains if prompt-level and weight-level updates are compatible.

Load-bearing premise

The diagnostic function can reliably summarize recurring failure modes from full-set evaluations in a way that gives the optimizer actionable guidance for producing effective prompt revisions.

What would settle it

Running RPT on a held-out reasoning task and finding that the revised prompts show no performance gain over the initial prompt or fail to address the specific failure modes listed in the diagnostic reports.

Figures

Figures reproduced from arXiv: 2605.21781 by Estevam Hruschka, Farima Fatahi Bayat, Moin Aminnaseri, Pouya Pezeshkpour.

**Figure 1.** Overview of Reflective Prompt Tuning (RPT). At each iteration, the optimizer calls a diagnostic function that evaluates the current prompt on Dtrain, critiques failures, clusters recurring failure modes, and returns a structured report. The optimizer uses this report and prior reports to generate the next prompt. … not necessarily yield better development performance, motivating development-set selection. I…

**Figure 2.** Failure-to-patch alignment across datasets. Each heatmap reports …

**Figure 3.** Patch topics and next-iteration metric changes. Each cell reports the average …

**Figure 4.** Average persistence of failure topics across optimization iterations for HotPotQA, LiveBench-Math, and …

**Figure 5.** Average next-iteration metric changes associated with each diagnosed failure topic for HotPotQA, …

**Figure 6.** Prompt length and development-set task score across …

Original abstract

Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration.
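The abstract says calibration signals feed into final prompt selection but does not spell out the rule on this page. One plausible reading, sketched below purely as an assumption, is to score each candidate prompt on a development set and penalize poor calibration. The `Candidate` fields, the linear penalty, and the `brier_weight` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One prompt from the optimization trajectory with its development-set metrics."""
    prompt: str
    dev_score: float  # task metric on the development set, higher is better
    dev_brier: float  # Brier score on the development set, lower is better


def select_prompt(candidates, brier_weight=0.5):
    """Pick the candidate with the best task score after a calibration penalty."""
    return max(candidates, key=lambda c: c.dev_score - brier_weight * c.dev_brier)


candidates = [
    Candidate("prompt A", dev_score=0.70, dev_brier=0.30),
    Candidate("prompt B", dev_score=0.72, dev_brier=0.40),
    Candidate("prompt C", dev_score=0.65, dev_brier=0.10),
]
print(select_prompt(candidates).prompt)                    # calibration-aware choice
print(select_prompt(candidates, brier_weight=0.0).prompt)  # task score only
```

Setting `brier_weight` to zero recovers plain development-set selection, which makes it easy to compare the two policies on the same trajectory.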

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RPT uses LLM function calls for full-dataset diagnostic reports plus memory of past ones to drive prompt revisions, and it reports up to 12.9-point gains on reasoning tasks with better calibration.

RPT is a prompt-tuning loop where an LLM optimizer calls a diagnostic function that scores the target model on the entire optimization set, summarizes recurring failure modes into a structured report, and then revises the prompt using that report plus accumulated memory from earlier rounds. It also folds in calibration signals for both the optimization steps and final selection. On three reasoning tasks the method lifts performance over the starting prompt by as much as 12.9 points and stays competitive with current approaches while improving calibration, with the biggest lifts on multi-hop and mathematical reasoning.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Reflective Prompt Tuning (RPT), a framework that leverages LLM function-calling to automate prompt optimization. An optimizer LLM iteratively invokes a diagnostic function that evaluates the target model across the full optimization set, generates structured summaries of recurring failure modes, and incorporates accumulated memory of prior reports to produce revised prompts. The method also integrates confidence calibration signals for both optimization and final prompt selection. Experiments across three reasoning tasks report performance gains of up to 12.9 points over initial prompts, competitiveness with state-of-the-art prompt optimization approaches, and improved calibration, with particular effectiveness noted on multi-hop and mathematical reasoning tasks.

Significance. If the reflective mechanism proves reliable, RPT could meaningfully advance automated prompt engineering by moving beyond per-example or fixed-pipeline critique methods toward history-aware, full-set diagnostics that target systematic error patterns. The reported gains on complex reasoning tasks and the added calibration benefits would be of practical interest for applications requiring both accuracy and reliable uncertainty estimates.

major comments (2)

[§3] RPT framework description: The central claim that the diagnostic function reliably extracts recurring failure modes from full-set evaluations and supplies actionable guidance for targeted prompt revisions is load-bearing, yet the manuscript provides no independent verification such as human fidelity ratings of the summaries or an ablation that isolates the diagnostic step from plain iterative prompting.
[§4] Experimental analyses (near end of §4): The statement that RPT produces 'targeted prompt revisions that align with diagnosed failure patterns' is asserted without supporting evidence, such as paired examples of specific diagnosed errors and the corresponding prompt edits or any quantitative alignment metric between reports and revisions.

minor comments (2)

[Abstract] The abstract reports a maximum gain of 12.9 points but does not identify the specific task, baseline, or iteration at which this occurs; adding this detail would improve precision.
Reproducibility would benefit from explicit documentation of the exact diagnostic function prompt template, the memory accumulation format, the number of optimization iterations, and the full optimization set sizes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our work. We provide point-by-point responses to the major comments and describe the revisions we will implement to address the concerns raised.

Referee: [§3] RPT framework description: The central claim that the diagnostic function reliably extracts recurring failure modes from full-set evaluations and supplies actionable guidance for targeted prompt revisions is load-bearing, yet the manuscript provides no independent verification such as human fidelity ratings of the summaries or an ablation that isolates the diagnostic step from plain iterative prompting.

Authors: We recognize the importance of verifying the diagnostic function's ability to extract recurring failure modes. Although our experiments show substantial performance improvements attributable to the full RPT framework, we agree that an explicit ablation and human validation would be valuable. In the revised manuscript, we will add an ablation removing the diagnostic reports (replacing them with generic iterative feedback) and report the results. We will also perform and report a human study assessing the accuracy of the generated diagnostic summaries against the model's error patterns on a subset of cases. revision: yes

Referee: [§4] Experimental analyses (near end of §4): The statement that RPT produces 'targeted prompt revisions that align with diagnosed failure patterns' is asserted without supporting evidence, such as paired examples of specific diagnosed errors and the corresponding prompt edits or any quantitative alignment metric between reports and revisions.

Authors: The claim regarding targeted revisions is based on our examination of the optimization process. To make this evidence explicit, we will append to the paper several concrete examples pairing diagnostic reports with the prompt changes they inspired. We will further introduce a simple quantitative metric measuring how many of the key failure modes mentioned in reports are addressed in the subsequent prompt version, based on our annotations. revision: yes
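The alignment metric proposed in this response could take a form like the following. This is a hedged sketch, not the metric the authors will report: it assumes annotators supply, for each iteration, the set of failure-mode labels diagnosed in the report and the set of labels the next prompt revision addresses. The function names and example labels are hypothetical.

```python
def patch_coverage(diagnosed, addressed):
    """Fraction of diagnosed failure modes that the next prompt revision addresses."""
    diagnosed = set(diagnosed)
    if not diagnosed:
        return None  # undefined when the report lists no failure modes
    return len(diagnosed & set(addressed)) / len(diagnosed)


def mean_patch_coverage(iterations):
    """Average coverage over (diagnosed, addressed) pairs, skipping empty reports."""
    scores = [patch_coverage(d, a) for d, a in iterations]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None


iterations = [
    ({"wrong bridge entity", "missing unit", "format error"},
     {"wrong bridge entity", "format error", "verbosity"}),
    ({"missing unit"}, {"missing unit"}),
    (set(), {"verbosity"}),
]
print(mean_patch_coverage(iterations))
```

Edits that address labels absent from the report (here, "verbosity") do not raise coverage, so the score reflects only revisions that respond to diagnosed failures.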

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivations or self-referential reductions

The paper describes an empirical prompt optimization method using LLM function-calling for diagnostics and revisions, with performance evaluated on three reasoning tasks. No equations, derivations, or mathematical claims appear in the provided text. Claims of improvement (up to 12.9 points) and alignment with failure patterns rest on experimental results rather than any reduction to fitted parameters, self-definitions, or self-citation chains. The central mechanism is presented as a novel workflow without load-bearing reliance on prior author work that would create circularity. This is a standard empirical proposal and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical framework proposal with no mathematical derivations, free parameters, or new postulated entities.

pith-pipeline@v0.9.0 · 5788 in / 1040 out tokens · 23206 ms · 2026-05-22T08:39:11.378880+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 2 internal anchors

[1]

A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

A systematic survey of prompt engineering in large language models: Techniques and applications. Preprint, arXiv:2402.07927. Timo Schick, Jane Dwivedi-Yu, Roberto Dessí, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: language models can teach themselves to use tools. In Proceedings of...

[2]

Large Language Models Are Human-Level Prompt Engineers

Agentic context engineering: Evolving contexts for self-improving language models. Wenqi Zhang, Ke Tang, Hai Wu, Mengna Wang, Yongliang Shen, Guiyang Hou, Zeqi Tan, Peng Li, Yueting Zhuang, and Weiming Lu. 2024. Agent-pro: Learning to evolve via policy-level reflection and optimization. In Proceedings of the 62nd Annual Meeting of the Association for...

[3]

The question asked for the city’s location relative to Rome, but the model returned the city name instead

Produce 1-3 failure_modes with: • label: 2-6 words, consistent across similar errors • definition: Comprehensive explanation of the failure mode • why: brief, self-contained explanation for THIS example, e.g. “The question asked for the city’s location relative to Rome, but the model returned the city name instead.” • basis: cite what in trace/reasoning shows this

[4]

Focus on actionable failure modes

[6]

HotpotQA Critic Prompt You are a strict evaluation critic for QA failures

Output ONLY valid JSON matching the schema. HotpotQA Critic Prompt You are a strict evaluation critic for QA failures. You are given ONE QA trace: • question • context (titles + snippets) • gold answer • predicted answer • model confidence • model reasoning Your goal is to diagnose WHY the target model produced the wrong answer. Instructions:

[7]

Produce 1-3 failure_modes with: • label: 2-6 words, consistent across similar errors • definition: Comprehensive explanation of the failure mode • why: brief explanation for THIS example • basis: cite what in the trace/reasoning shows this

[8]

wrong bridge entity

Make labels concrete and clusterable: • Prefer labels like “wrong bridge entity” over long sentences. • Do not include entity names, dates, or example-specific details in labels

[10]

LiveBench-Math Critic Prompt You are a strict evaluation critic for math failures

Output ONLY valid JSON matching the schema (no extra text). LiveBench-Math Critic Prompt You are a strict evaluation critic for math failures. You are given one failed model attempt with: • task metadata • question • gold answer • predicted answer • model confidence • model reasoning Your goal is to diagnose why the model failed. Instructions:

[11]

Produce 1-3 failure_modes with: • label: 2-6 words, consistent across similar errors • definition: Comprehensive explanation of the failure mode • why: brief, self-contained explanation for THIS example • basis: cite what in the trace/reasoning shows this

[12]

Make labels concrete and clusterable

[13]

If you cannot identify a clear failure mode, return an empty list

[14]

7.3 Shared Optimizer Prompt Optimizer Prompt You are the Reflective Prompt Tuning (RPT) controller

Output only valid JSON matching the schema. 7.3 Shared Optimizer Prompt Optimizer Prompt You are the Reflective Prompt Tuning (RPT) controller. Your goal is to iteratively improve a PromptProgram for the target task. At each iteration you must:

[15]

Call evaluate_prompt exactly once on the CURRENT PromptProgram

[16]

Read the returned evaluation report with insights

[17]

Optimization target: • Primary: improve task performance on the training split

Output either a PATCH or STOP. Optimization target: • Primary: improve task performance on the training split. • Secondary: improve calibration (lower Brier / reduce overconfidence) without hurting task performance. Decision guidance: • When current_summary is provided, use it as the primary decision signal, especially current_summary.metrics and any deltas ...

[18]

about/circa/approximately 100

into train, development, and test sets. Formula. Formula (Wang et al., 2025) is a financial reasoning benchmark built around the eXtensible Business Reporting Language (XBRL). Following ACE (Zhang et al., 2026), we use Formula as a domain-specific numerical reasoning task. It requires models to apply financial concepts and perform computations over...
