Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Aayush Sheth; Alejandro Breen Herrera; Charles Pollnow V; Charles Wright; Danny Nightingale; Hongtai Wei; Marcus Yearwood; Meg Watson; Steven G. Xu; Sudeep Das

arxiv: 2603.03565 · v2 · submitted 2026-03-03 · 💻 cs.AI · cs.CL · cs.LG

Alejandro Breen Herrera, Aayush Sheth, Steven G. Xu, Zhucheng Zhan, Charles Wright, Marcus Yearwood, Hongtai Wei, Sudeep Das, Danny Nightingale, Meg Watson, Charles Pollnow V

Pith reviewed 2026-05-15 16:20 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords: multi-agent systems, prompt optimization, conversational AI, LLM evaluation, shopping assistants, trajectory scoring, evaluation rubric

The pith

A rubric, calibrated LLM judge, and joint multi-agent prompt optimization improve conversational shopping assistants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a blueprint for evaluating and optimizing multi-agent conversational shopping assistants by breaking down end-to-end quality into structured dimensions with a new rubric. It pairs this with a calibrated LLM-as-judge that matches human ratings and introduces MAMuT GEPA, a system-level method that jointly tunes prompts across all agents through multi-turn simulations scored at the full trajectory level. Demonstrated on a production grocery assistant, the approach targets the challenges of underspecified requests, preference sensitivity, and tight agent coupling. If the claims hold, teams can iterate on complex agent systems at scale with less reliance on constant human review.

Core claim

The authors claim that a multi-faceted evaluation rubric decomposes shopping quality into measurable dimensions, a calibrated LLM-as-judge pipeline aligns with human annotations, and MAMuT GEPA jointly optimizes prompts for all agents by running multi-turn simulations and applying trajectory-level scoring, outperforming independent sub-agent optimization in a production-scale grocery assistant.

What carries the argument

MAMuT (Multi-Agent Multi-Turn) GEPA, a system-level prompt optimizer that jointly tunes prompts across agents using multi-turn simulation and trajectory-level scoring.
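To make the contrast between joint and per-agent tuning concrete, here is a minimal toy sketch. It is not the authors' method and does not reproduce GEPA: exhaustive search stands in for the optimizer, and the agent names, prompt labels, scenarios, and scoring rule are all hypothetical. It only illustrates how scoring whole trajectories lets a search find prompt combinations that per-agent tuning from a fixed baseline can miss.

```python
from itertools import product
from statistics import mean

# Hypothetical candidate prompts per agent (labels are illustrative only).
CANDIDATES = {
    "planner": ["ask_first", "assume_defaults"],
    "retriever": ["strict_match", "broad_match"],
}
# Hypothetical starting prompts used when agents are tuned one at a time.
BASELINE = {"planner": "assume_defaults", "retriever": "broad_match"}
# Hypothetical simulated scenarios: is the user's request underspecified?
SCENARIOS = [{"underspecified": True}, {"underspecified": True}, {"underspecified": False}]


def simulate_and_score(prompts, scenario):
    """Toy stand-in for a multi-turn simulation plus a trajectory-level judge.

    Returns one score in [0, 1] for the whole conversation. The coupling is
    deliberate: strict retrieval only pays off when preferences are known.
    """
    knows_prefs = prompts["planner"] == "ask_first" or not scenario["underspecified"]
    if prompts["retriever"] == "strict_match":
        return 1.0 if knows_prefs else 0.2
    return 0.6  # broad matching is mediocre but safe


def system_score(prompts):
    """Mean trajectory score of one full prompt assignment over all scenarios."""
    return mean(simulate_and_score(prompts, s) for s in SCENARIOS)


def joint_search(candidates):
    """Score every combination of prompts across agents on full trajectories."""
    agents = list(candidates)
    combos = (dict(zip(agents, choice)) for choice in product(*candidates.values()))
    return max(combos, key=system_score)


def independent_search(candidates, baseline):
    """Tune each agent alone while the other agents stay at the baseline."""
    tuned = {}
    for agent, options in candidates.items():
        tuned[agent] = max(options, key=lambda p: system_score({**baseline, agent: p}))
    return tuned


best = joint_search(CANDIDATES)
independent = independent_search(CANDIDATES, BASELINE)
print(best, system_score(best))
print(independent, system_score(independent))
```

In this toy setup the per-agent search never selects strict retrieval, because strict retrieval looks bad while the planner is held at its baseline. The joint search evaluates the pair together and finds the better combination. Exhaustive search is only feasible here because the candidate space is tiny.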

If this is right

Decomposing quality into dimensions enables targeted diagnosis of failures in multi-turn flows.
Trajectory-level scoring aligns individual agent behavior with overall conversation outcomes.
Calibrated LLM judges scale evaluation beyond what human raters can handle in production.
Joint optimization reduces coordination failures that arise when agents are tuned in isolation.
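The first two points can be sketched in a few lines. The dimension names, weights, and the choice to average per-turn scores are assumptions made for illustration; the paper's actual rubric is not reproduced here, and a judge could equally score the whole trajectory in one pass.

```python
from dataclasses import dataclass

# Hypothetical rubric dimensions and weights (not taken from the paper).
WEIGHTS = {"relevance": 0.4, "preference_fit": 0.3, "constraint_adherence": 0.3}


@dataclass
class TurnJudgement:
    turn: int
    scores: dict  # dimension name -> judge score in [0, 1]


def trajectory_score(judgements, weights=WEIGHTS):
    """Weighted sum over dimensions per turn, averaged over the conversation."""
    per_turn = [sum(weights[d] * j.scores[d] for d in weights) for j in judgements]
    return sum(per_turn) / len(per_turn)


def weakest_dimension(judgements, weights=WEIGHTS):
    """Dimension with the lowest average score: a starting point for diagnosis."""
    avg = {d: sum(j.scores[d] for j in judgements) / len(judgements) for d in weights}
    return min(avg, key=avg.get)


# Hypothetical judged conversation with two turns.
conversation = [
    TurnJudgement(1, {"relevance": 1.0, "preference_fit": 0.5, "constraint_adherence": 1.0}),
    TurnJudgement(2, {"relevance": 1.0, "preference_fit": 1.0, "constraint_adherence": 0.0}),
]
print(trajectory_score(conversation), weakest_dimension(conversation))
```

Keeping the per-dimension scores alongside the aggregate is what makes targeted diagnosis possible: the single trajectory score drives optimization, while the weakest dimension points at what to fix.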

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The released rubric templates could become a starting point for standardized evaluation in other multi-turn agent domains such as travel or support.
Teams might combine the method with online A/B testing to close the simulation-to-production gap.
The approach suggests that prompt optimization should treat the entire agent graph as a single tunable object rather than separate modules.

Load-bearing premise

The LLM-as-judge pipeline remains reliably calibrated to human judgments across diverse real-world interactions, and multi-turn simulations accurately predict production performance without overfitting to the simulated user model.
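One way to probe the calibration half of this premise is to compare judge scores with human ratings on the same conversations. The sketch below computes Cohen's kappa and Spearman rank correlation in plain Python. The ratings are invented for illustration, and nothing here reflects the paper's actual calibration procedure or results.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters over categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = sqrt(sum((p - mx) ** 2 for p in x))
    sy = sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented 1-5 ratings for eight conversations (illustration only).
human = [1, 2, 3, 4, 5, 3, 2, 4]
judge = [1, 2, 3, 4, 4, 3, 3, 4]
print(cohens_kappa(human, judge), spearman(human, judge))
```

Unweighted kappa treats a 4-versus-5 disagreement the same as a 1-versus-5 one, so for ordinal ratings a weighted variant may be the better choice. Checking the statistics separately on budget- or inventory-constrained conversations would address the out-of-distribution concern.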

What would settle it

A controlled user study in which real shoppers rate full shopping sessions with MAMuT-optimized agents no higher than sessions with independently optimized sub-agents.

Original abstract

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference-sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production-scale AI grocery assistant. We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary prompt-optimization strategies based on a SOTA prompt-optimizer called GEPA (Shao et al., 2025): (1) Sub-agent GEPA, which optimizes individual agent nodes against localized rubrics, and (2) MAMuT (Multi-Agent Multi-Turn) GEPA (Herrera et al., 2026), a novel system-level approach that jointly optimizes prompts across agents using multi-turn simulation and trajectory-level scoring. We release rubric templates and evaluation design guidance to support practitioners building production CSAs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A practical blueprint for multi-agent grocery assistants with a new joint optimization method, but no results to show it works.

This paper lays out a concrete engineering blueprint for evaluating and optimizing multi-agent conversational shopping assistants, with the main new piece being MAMuT GEPA, a system-level prompt optimizer that jointly tunes multiple agents over multi-turn trajectories using simulation and trajectory scoring. It builds directly on the earlier GEPA work by moving from per-agent tweaks to coordinated optimization across the whole setup. The multi-faceted rubric that splits end-to-end shopping quality into structured dimensions is a clear step forward, and releasing the templates plus design guidance is genuinely useful for anyone trying to ship something similar in production. The two variants, localized Sub-agent GEPA and the joint MAMuT version, are presented as complementary, which tracks with how these systems actually behave in practice.

The soft spot is the complete absence of empirical results or validation. The claims rest on the LLM judge staying aligned with humans outside the calibration set and on simulations predicting live performance, yet there are no cross-validation numbers, inter-annotator stats, or ablation checks on simulation fidelity. Without that evidence, it's impossible to tell whether the optimization loop is improving real shopping quality or just chasing a noisy proxy.

This is aimed at practitioners building production consumer assistants rather than academic theorists. A reader working on similar systems would get usable ideas from the rubric and release, even if they have to add their own experiments. I would send it to peer review so the method can get feedback on how to close the validation gap.

Referee Report

2 major / 1 minor

Summary. The paper presents a blueprint for evaluating and optimizing multi-agent conversational shopping assistants (CSAs) for grocery shopping. It introduces a multi-faceted rubric decomposing end-to-end shopping quality into structured dimensions, a calibrated LLM-as-judge pipeline aligned with human annotations, and two GEPA-based prompt optimization strategies: localized Sub-agent GEPA and the novel MAMuT (Multi-Agent Multi-Turn) GEPA for joint system-level optimization via multi-turn simulation and trajectory scoring. Rubric templates are released for practitioners.

Significance. If the LLM-judge calibration remains robust outside the annotation set and multi-turn simulations accurately predict live performance without overfitting to the user model, the framework could provide a practical, reusable methodology for production CSAs, with the released templates supporting adoption and reproducibility in agentic AI applications.

major comments (2)

[Abstract] Abstract and evaluation pipeline description: The central claim that the LLM-as-judge is 'calibrated' and 'aligned with human annotations' is load-bearing for all downstream optimization results, yet no quantitative metrics (e.g., correlation coefficients, inter-annotator agreement, or out-of-distribution validation on budget/inventory-constrained trajectories) are reported.
[MAMuT GEPA] MAMuT GEPA description: The novel joint optimization approach relies on multi-turn simulation and trajectory-level scoring to optimize across agents, but provides no ablations or fidelity checks on how preference distributions, inventory dynamics, or underspecified requests are modeled in the simulator, risking optimization for an unreliable proxy.

minor comments (1)

[Abstract] The acronym MAMuT is introduced in the abstract before its expansion; spelling out 'Multi-Agent Multi-Turn' on first use would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We address each major comment below.

Referee: [Abstract] Abstract and evaluation pipeline description: The central claim that the LLM-as-judge is 'calibrated' and 'aligned with human annotations' is load-bearing for all downstream optimization results, yet no quantitative metrics (e.g., correlation coefficients, inter-annotator agreement, or out-of-distribution validation on budget/inventory-constrained trajectories) are reported.

Authors: We agree that quantitative metrics are essential to substantiate the calibration claims. The initial manuscript describes the alignment process but does not report specific numerical results. In the revised version, we will add Pearson and Spearman correlation coefficients, inter-annotator agreement (Cohen's kappa), and out-of-distribution validation on budget- and inventory-constrained trajectories, presented in a dedicated subsection of the evaluation pipeline.
revision: yes

Referee: [MAMuT GEPA] MAMuT GEPA description: The novel joint optimization approach relies on multi-turn simulation and trajectory-level scoring to optimize across agents, but provides no ablations or fidelity checks on how preference distributions, inventory dynamics, or underspecified requests are modeled in the simulator, risking optimization for an unreliable proxy.

Authors: We acknowledge that the manuscript does not include ablations or fidelity checks for the simulator. In the revision, we will add a new subsection detailing the modeling of preference distributions, inventory dynamics, and underspecified requests, along with ablations that vary these components and report their impact on optimization performance. We will also include fidelity checks comparing simulated trajectories to held-out real-user data.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces an evaluation rubric decomposed into structured dimensions and a calibrated LLM-as-judge pipeline explicitly aligned with external human annotations. It then describes two prompt-optimization strategies (Sub-agent GEPA and MAMuT GEPA) built on a cited SOTA optimizer. No equations, first-principles derivations, or predictions are present that reduce by construction to fitted parameters or self-defined inputs. The calibration step depends on independent human data rather than internal fits, and the self-citation to Herrera et al. 2026 for the novel MAMuT component is not load-bearing for any circular reduction—the current work presents the blueprint with external validation elements. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central contributions rest on assumptions about LLM judge calibration and the effectiveness of the new MAMuT optimization strategy; no free parameters are mentioned in the abstract.

axioms (2)

domain assumption: LLM-as-a-judge pipelines can be calibrated to align with human annotations for multi-turn shopping interactions.
Central to the described evaluation foundation.
ad hoc to paper: Prompt optimization methods like GEPA can be extended to joint system-level optimization across agents.
Basis for introducing the MAMuT method.

invented entities (1)

MAMuT GEPA (no independent evidence)
purpose: Jointly optimize prompts across multiple agents using multi-turn simulation and trajectory-level scoring
New method introduced as a novel system-level approach.

pith-pipeline@v0.9.0 · 5566 in / 1300 out tokens · 64650 ms · 2026-05-15T16:20:42.961870+00:00 · methodology
