Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants
Pith reviewed 2026-05-15 16:20 UTC · model grok-4.3
The pith
A rubric, calibrated LLM judge, and joint multi-agent prompt optimization improve conversational shopping assistants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a multi-faceted evaluation rubric decomposes shopping quality into measurable dimensions, a calibrated LLM-as-judge pipeline aligns with human annotations, and MAMuT GEPA jointly optimizes prompts for all agents by running multi-turn simulations and applying trajectory-level scoring, outperforming independent sub-agent optimization in a production-scale grocery assistant.
What carries the argument
MAMuT (Multi-Agent Multi-Turn) GEPA, a system-level prompt optimizer that jointly tunes prompts across agents using multi-turn simulation and trajectory-level scoring.
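To make the contrast with independent sub-agent tuning concrete, here is a minimal toy sketch. It is not GEPA or the authors' implementation: the agent names, prompt labels, and score tables are all hypothetical, and an exhaustive search stands in for the optimizer's proposal step. It only illustrates how prompts that each look best under a local rubric can be beaten by a combination chosen on a trajectory-level score.

```python
from itertools import product

# Hypothetical candidate prompts per agent (labels only, for illustration).
CANDIDATES = {
    "clarifier": ["ask_many", "ask_few"],
    "recommender": ["assume_defaults", "use_answers"],
}

# Toy per-agent rubric scores, each agent judged in isolation (assumed values).
LOCAL_SCORES = {
    ("clarifier", "ask_many"): 0.6,
    ("clarifier", "ask_few"): 0.8,
    ("recommender", "assume_defaults"): 0.9,
    ("recommender", "use_answers"): 0.7,
}

# Toy scores for a whole simulated conversation under one prompt
# configuration (assumed values, standing in for a judge's trajectory score).
TRAJECTORY_SCORES = {
    ("ask_many", "assume_defaults"): 0.5,
    ("ask_many", "use_answers"): 0.9,
    ("ask_few", "assume_defaults"): 0.6,
    ("ask_few", "use_answers"): 0.7,
}


def trajectory_score(config: dict) -> float:
    """Score the full conversation produced by a joint prompt configuration."""
    return TRAJECTORY_SCORES[(config["clarifier"], config["recommender"])]


def optimize_independently() -> dict:
    """Pick each agent's prompt by its own local score, ignoring the others."""
    return {
        agent: max(prompts, key=lambda p: LOCAL_SCORES[(agent, p)])
        for agent, prompts in CANDIDATES.items()
    }


def optimize_jointly() -> dict:
    """Pick the prompt combination with the best trajectory-level score."""
    agents = list(CANDIDATES)
    configs = [dict(zip(agents, combo)) for combo in product(*CANDIDATES.values())]
    return max(configs, key=trajectory_score)


independent = optimize_independently()
joint = optimize_jointly()
print(independent, trajectory_score(independent))
print(joint, trajectory_score(joint))
```

In this toy setup the locally best prompts coordinate poorly, so the jointly chosen configuration scores higher on the whole conversation. Whether the same holds in production depends on the fidelity of the simulator and judge, which the review questions below.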
If this is right
- Decomposing quality into dimensions enables targeted diagnosis of failures in multi-turn flows.
- Trajectory-level scoring aligns individual agent behavior with overall conversation outcomes.
- Calibrated LLM judges scale evaluation beyond what human raters can handle in production.
- Joint optimization reduces coordination failures that arise when agents are tuned in isolation.
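As an illustration of the first point, a rubric that scores each turn along separate dimensions lets a team find the weakest dimension directly. The sketch below is an assumption-laden example: the dimension names and scores are hypothetical and are not taken from the paper's released rubric templates.

```python
from statistics import mean

# Hypothetical rubric dimensions; the paper's actual dimensions are not listed here.
DIMENSIONS = ("relevance", "budget_adherence", "clarification")


def dimension_means(turn_scores: list) -> dict:
    """Average each rubric dimension over all turns of one conversation."""
    return {d: mean(turn[d] for turn in turn_scores) for d in DIMENSIONS}


def weakest_dimension(turn_scores: list) -> str:
    """Return the dimension with the lowest mean score, as a diagnosis target."""
    means = dimension_means(turn_scores)
    return min(means, key=means.get)


# Made-up per-turn judge scores in [0, 1] for a two-turn conversation.
conversation = [
    {"relevance": 1.0, "budget_adherence": 0.5, "clarification": 1.0},
    {"relevance": 0.5, "budget_adherence": 0.0, "clarification": 1.0},
]

print(dimension_means(conversation))
print(weakest_dimension(conversation))
```

A single aggregate score would hide which dimension is failing; keeping the dimensions separate until the last step is what makes the diagnosis targeted.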
Where Pith is reading between the lines
- The released rubric templates could become a starting point for standardized evaluation in other multi-turn agent domains such as travel or support.
- Teams might combine the method with online A/B testing to close the simulation-to-production gap.
- The approach suggests that prompt optimization should treat the entire agent graph as a single tunable object rather than separate modules.
Load-bearing premise
The LLM-as-judge pipeline remains reliably calibrated to human judgments across diverse real-world interactions and multi-turn simulations accurately predict production performance without overfitting.
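One common way to check this premise is to compare judge outputs with human labels on a held-out set, using chance-corrected agreement for categorical verdicts and a correlation coefficient for numeric scores. The sketch below assumes that setup; the labels and scores are made up, and nothing here reflects numbers reported by the paper.

```python
from collections import Counter
from math import sqrt


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two raters on categorical labels.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


def pearson(x: list, y: list) -> float:
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up verdicts from a human annotator and an LLM judge on four sessions.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human, judge)

# Made-up numeric rubric scores from the same two sources.
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
print(kappa, r)
```

Running the same comparison separately on budget- or inventory-constrained sessions would show whether agreement holds outside the annotation set, which is the part of the premise most likely to break.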
What would settle it
A controlled user study in which real shoppers rate full shopping sessions with MAMuT-optimized agents no higher than sessions with independently optimized sub-agents.
original abstract
Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference-sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production-scale AI grocery assistant. We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary prompt-optimization strategies based on a SOTA prompt-optimizer called GEPA (Shao et al., 2025): (1) Sub-agent GEPA, which optimizes individual agent nodes against localized rubrics, and (2) MAMuT (Multi-Agent Multi-Turn) GEPA (Herrera et al., 2026), a novel system-level approach that jointly optimizes prompts across agents using multi-turn simulation and trajectory-level scoring. We release rubric templates and evaluation design guidance to support practitioners building production CSAs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a blueprint for evaluating and optimizing multi-agent conversational shopping assistants (CSAs) for grocery shopping. It introduces a multi-faceted rubric decomposing end-to-end shopping quality into structured dimensions, a calibrated LLM-as-judge pipeline aligned with human annotations, and two GEPA-based prompt optimization strategies: localized Sub-agent GEPA and the novel MAMuT (Multi-Agent Multi-Turn) GEPA for joint system-level optimization via multi-turn simulation and trajectory scoring. Rubric templates are released for practitioners.
Significance. If the LLM-judge calibration remains robust outside the annotation set and multi-turn simulations accurately predict live performance without overfitting to the user model, the framework could provide a practical, reusable methodology for production CSAs, with the released templates supporting adoption and reproducibility in agentic AI applications.
major comments (2)
- [Abstract] Abstract and evaluation pipeline description: The central claim that the LLM-as-judge is 'calibrated' and 'aligned with human annotations' is load-bearing for all downstream optimization results, yet no quantitative metrics (e.g., correlation coefficients, inter-annotator agreement, or out-of-distribution validation on budget/inventory-constrained trajectories) are reported.
- [MAMuT GEPA] MAMuT GEPA description: The novel joint optimization approach relies on multi-turn simulation and trajectory-level scoring to optimize across agents, but provides no ablations or fidelity checks on how preference distributions, inventory dynamics, or underspecified requests are modeled in the simulator, risking optimization for an unreliable proxy.
minor comments (1)
- [Abstract] The acronym MAMuT is introduced in the abstract before its expansion; spelling out 'Multi-Agent Multi-Turn' on first use would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We address each major comment below.
point-by-point responses
- Referee: [Abstract] Abstract and evaluation pipeline description: The central claim that the LLM-as-judge is 'calibrated' and 'aligned with human annotations' is load-bearing for all downstream optimization results, yet no quantitative metrics (e.g., correlation coefficients, inter-annotator agreement, or out-of-distribution validation on budget/inventory-constrained trajectories) are reported.
  Authors: We agree that quantitative metrics are essential to substantiate the calibration claims. The initial manuscript describes the alignment process but does not report specific numerical results. In the revised version, we will add Pearson and Spearman correlation coefficients, inter-annotator agreement (Cohen's kappa), and out-of-distribution validation on budget- and inventory-constrained trajectories, presented in a dedicated subsection of the evaluation pipeline.
  Revision: yes
- Referee: [MAMuT GEPA] MAMuT GEPA description: The novel joint optimization approach relies on multi-turn simulation and trajectory-level scoring to optimize across agents, but provides no ablations or fidelity checks on how preference distributions, inventory dynamics, or underspecified requests are modeled in the simulator, risking optimization for an unreliable proxy.
  Authors: We acknowledge that the manuscript does not include ablations or fidelity checks for the simulator. In the revision, we will add a new subsection detailing the modeling of preference distributions, inventory dynamics, and underspecified requests, along with ablations that vary these components and report their impact on optimization performance. We will also include fidelity checks comparing simulated trajectories to held-out real-user data.
  Revision: yes
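The rebuttal does not say how the promised fidelity checks would be run. One simple possibility, assumed here purely for illustration, is to compare the distribution of a trajectory-level statistic (such as turns per session) between simulated and held-out real conversations using total variation distance. The statistic and the data below are hypothetical.

```python
from collections import Counter


def total_variation(simulated: list, real: list) -> float:
    """Total variation distance between two empirical discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    sim_counts, real_counts = Counter(simulated), Counter(real)
    support = set(sim_counts) | set(real_counts)
    return 0.5 * sum(
        abs(sim_counts[k] / len(simulated) - real_counts[k] / len(real))
        for k in support
    )


# Made-up turns-per-session counts for simulated and real conversations.
simulated_turns = [2, 2, 3, 3]
real_turns = [2, 3, 3, 4]

distance = total_variation(simulated_turns, real_turns)
print(distance)
```

A small distance on one statistic does not establish that the simulator is faithful; it is only a cheap screen that can flag obvious mismatches before more expensive comparisons.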
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces an evaluation rubric decomposed into structured dimensions and a calibrated LLM-as-judge pipeline explicitly aligned with external human annotations. It then describes two prompt-optimization strategies (Sub-agent GEPA and MAMuT GEPA) built on a cited SOTA optimizer. No equations, first-principles derivations, or predictions are present that reduce by construction to fitted parameters or self-defined inputs. The calibration step depends on independent human data rather than internal fits, and the self-citation to Herrera et al. 2026 for the novel MAMuT component is not load-bearing for any circular reduction—the current work presents the blueprint with external validation elements. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM-as-a-judge pipelines can be calibrated to align with human annotations for multi-turn shopping interactions
- ad hoc to paper: Prompt optimization methods like GEPA can be extended to joint system-level optimization across agents
invented entities (1)
- MAMuT GEPA: no independent evidence