arxiv: 2603.03565 · v2 · submitted 2026-03-03 · 💻 cs.AI · cs.CL · cs.LG

Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Pith reviewed 2026-05-15 16:20 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords multi-agent systems · prompt optimization · conversational AI · LLM evaluation · shopping assistants · trajectory scoring · evaluation rubric

The pith

A rubric, calibrated LLM judge, and joint multi-agent prompt optimization improve conversational shopping assistants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a blueprint for evaluating and optimizing multi-agent conversational shopping assistants by breaking down end-to-end quality into structured dimensions with a new rubric. It pairs this with a calibrated LLM-as-judge that matches human ratings and introduces MAMuT GEPA, a system-level method that jointly tunes prompts across all agents through multi-turn simulations scored at the full trajectory level. Demonstrated on a production grocery assistant, the approach targets the challenges of underspecified requests, preference sensitivity, and tight agent coupling. If the claims hold, teams can iterate on complex agent systems at scale with less reliance on constant human review.

Core claim

The authors claim that a multi-faceted evaluation rubric decomposes shopping quality into measurable dimensions, a calibrated LLM-as-judge pipeline aligns with human annotations, and MAMuT GEPA jointly optimizes prompts for all agents by running multi-turn simulations and applying trajectory-level scoring, outperforming independent sub-agent optimization in a production-scale grocery assistant.

What carries the argument

MAMuT (Multi-Agent Multi-Turn) GEPA, a system-level prompt optimizer that jointly tunes prompts across agents using multi-turn simulation and trajectory-level scoring.
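
To make the contrast with per-agent tuning concrete, here is a minimal toy sketch. It is not the authors' implementation and not the GEPA algorithm: exhaustive search over prompt pairs stands in for the optimizer. The two agents (`planner`, `searcher`), their prompt variants, the three-item catalog, the simulated users with a hidden preference, and both scoring rules are illustrative assumptions. The sketch shows how choosing each agent's prompt against a local rubric can produce a worse full-conversation score than choosing the pair jointly against a trajectory-level score.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean

# Everything below is a hypothetical toy, not the paper's system.
CATALOG = ["whole milk", "oat milk", "soy milk"]
CANDIDATES = {
    "planner": ["ask_clarifying", "assume_default"],
    "searcher": ["strict_match", "broad_match"],
}

@dataclass(frozen=True)
class SimulatedUser:
    request: str      # underspecified request, e.g. "milk"
    preference: str   # hidden preference, revealed only if the planner asks

USERS = [SimulatedUser("milk", p) for p in ("oat", "whole", "soy")]

def planner(prompt, user):
    """Return (search query, number of turns used)."""
    if prompt == "ask_clarifying":
        return f"{user.preference} {user.request}", 2  # costs one extra turn
    return f"whole {user.request}", 1  # guesses a default variant

def searcher(prompt, query):
    """Return all catalog hits for the query under the given prompt."""
    if prompt == "strict_match":
        return [p for p in CATALOG if p == query]
    category = query.split()[-1]
    return [p for p in CATALOG if p.split()[-1] == category]

def trajectory_score(prompts, user):
    """Score the whole conversation: preference met, minus a turn penalty."""
    query, turns = planner(prompts["planner"], user)
    hits = searcher(prompts["searcher"], query)
    satisfied = 1.0 if hits and user.preference in hits[0] else 0.0
    return satisfied - 0.1 * (turns - 1)

def system_score(prompts):
    """Mean trajectory score over all simulated users."""
    return mean(trajectory_score(prompts, u) for u in USERS)

def local_score(agent, prompt):
    """Node-level rubrics: planner rewarded for brevity, searcher for recall."""
    if agent == "planner":
        return -mean(planner(prompt, u)[1] for u in USERS)
    return mean(len(searcher(prompt, f"{u.preference} {u.request}")) for u in USERS)

# Independent tuning: each agent picks its best prompt in isolation.
independent = {
    agent: max(options, key=lambda p, a=agent: local_score(a, p))
    for agent, options in CANDIDATES.items()
}

# Joint tuning: score every prompt combination on full trajectories.
joint = max(
    (dict(zip(CANDIDATES, combo)) for combo in product(*CANDIDATES.values())),
    key=system_score,
)

print("independent:", independent, round(system_score(independent), 3))
print("joint:      ", joint, round(system_score(joint), 3))
```

In this toy the locally preferred prompts (a brief planner, a high-recall searcher) combine badly, while the joint search accepts a longer conversation because the trajectory score rewards the satisfied preference. A real optimizer would replace the exhaustive search, and an LLM judge would replace the hand-written scoring rule.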

If this is right

  • Decomposing quality into dimensions enables targeted diagnosis of failures in multi-turn flows.
  • Trajectory-level scoring aligns individual agent behavior with overall conversation outcomes.
  • Calibrated LLM judges scale evaluation beyond what human raters can handle in production.
  • Joint optimization reduces coordination failures that arise when agents are tuned in isolation.
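
The first point above, that per-dimension scores support targeted diagnosis, can be sketched in a few lines. The dimension names, the 1 to 5 scale, and the session scores are made up for illustration; the paper's actual rubric is not reproduced here.

```python
from statistics import mean

# Hypothetical dimension names and scale; not the paper's rubric.
DIMENSIONS = ("relevance", "preference_fit", "constraint_adherence")

# Made-up judge scores for three sessions, one score per dimension.
SESSIONS = [
    {"relevance": 5, "preference_fit": 4, "constraint_adherence": 2},
    {"relevance": 4, "preference_fit": 4, "constraint_adherence": 3},
    {"relevance": 5, "preference_fit": 3, "constraint_adherence": 2},
]

def dimension_means(sessions):
    """Average each rubric dimension across sessions."""
    return {d: mean(s[d] for s in sessions) for d in DIMENSIONS}

def weakest_dimension(sessions):
    """Name the dimension with the lowest average score."""
    means = dimension_means(sessions)
    return min(means, key=means.get)

def overall(sessions):
    """Single end-to-end number: unweighted mean over dimensions and sessions."""
    return mean(mean(s[d] for d in DIMENSIONS) for s in sessions)

print("overall:", round(overall(SESSIONS), 2))
print("weakest:", weakest_dimension(SESSIONS))
```

The single overall number hides where the sessions fall short; the per-dimension view points at one dimension to investigate.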

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The released rubric templates could become a starting point for standardized evaluation in other multi-turn agent domains such as travel or support.
  • Teams might combine the method with online A/B testing to close the simulation-to-production gap.
  • The approach suggests that prompt optimization should treat the entire agent graph as a single tunable object rather than separate modules.

Load-bearing premise

The blueprint rests on two assumptions: that the LLM-as-judge pipeline stays reliably calibrated to human judgments across diverse real-world interactions, and that multi-turn simulations accurately predict production performance without overfitting to the simulated user model.

What would settle it

A controlled user study in which real shoppers rate full shopping sessions with MAMuT-optimized agents no higher than sessions with independently optimized sub-agents.
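
As a rough illustration of how such a study could be analyzed, the sketch below runs an exact one-sided permutation test on session ratings. The choice of test is an assumption made here, not something the paper specifies, and the ratings are invented. With fixed group sizes, the sum of the treated group's ratings is equivalent to the difference in means as a test statistic.

```python
from itertools import combinations

def exact_one_sided_p(treated, control):
    """Share of all relabelings whose treated-group sum is at least the observed one."""
    pooled = list(treated) + list(control)
    k = len(treated)
    observed = sum(treated)
    splits = list(combinations(range(len(pooled)), k))
    extreme = sum(1 for idx in splits if sum(pooled[i] for i in idx) >= observed)
    return extreme / len(splits)

# Invented 1-5 session ratings, for illustration only.
jointly_optimized = [5, 4, 5, 4]
independently_optimized = [3, 4, 3, 2]

p_value = exact_one_sided_p(jointly_optimized, independently_optimized)
print(f"one-sided p-value: {p_value:.3f}")
```

A large p-value would be consistent with the falsifying outcome described above (jointly optimized sessions rated no higher); a small one would count against it. Enumerating every split is only practical for small samples.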

Original abstract

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference-sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production-scale AI grocery assistant. We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary prompt-optimization strategies based on a SOTA prompt-optimizer called GEPA (Shao et al., 2025): (1) Sub-agent GEPA, which optimizes individual agent nodes against localized rubrics, and (2) MAMuT (Multi-Agent Multi-Turn) GEPA (Herrera et al., 2026), a novel system-level approach that jointly optimizes prompts across agents using multi-turn simulation and trajectory-level scoring. We release rubric templates and evaluation design guidance to support practitioners building production CSAs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a blueprint for evaluating and optimizing multi-agent conversational shopping assistants (CSAs) for grocery shopping. It introduces a multi-faceted rubric decomposing end-to-end shopping quality into structured dimensions, a calibrated LLM-as-judge pipeline aligned with human annotations, and two GEPA-based prompt optimization strategies: localized Sub-agent GEPA and the novel MAMuT (Multi-Agent Multi-Turn) GEPA for joint system-level optimization via multi-turn simulation and trajectory scoring. Rubric templates are released for practitioners.

Significance. If the LLM-judge calibration remains robust outside the annotation set and multi-turn simulations accurately predict live performance without overfitting to the user model, the framework could provide a practical, reusable methodology for production CSAs, with the released templates supporting adoption and reproducibility in agentic AI applications.

major comments (2)
  1. [Abstract] Abstract and evaluation pipeline description: The central claim that the LLM-as-judge is 'calibrated' and 'aligned with human annotations' is load-bearing for all downstream optimization results, yet no quantitative metrics (e.g., correlation coefficients, inter-annotator agreement, or out-of-distribution validation on budget/inventory-constrained trajectories) are reported.
  2. [MAMuT GEPA] MAMuT GEPA description: The novel joint optimization approach relies on multi-turn simulation and trajectory-level scoring to optimize across agents, but provides no ablations or fidelity checks on how preference distributions, inventory dynamics, or underspecified requests are modeled in the simulator, risking optimization for an unreliable proxy.
minor comments (1)
  1. [Abstract] The acronym MAMuT is introduced in the abstract before its expansion; spelling out 'Multi-Agent Multi-Turn' on first use would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation pipeline description: The central claim that the LLM-as-judge is 'calibrated' and 'aligned with human annotations' is load-bearing for all downstream optimization results, yet no quantitative metrics (e.g., correlation coefficients, inter-annotator agreement, or out-of-distribution validation on budget/inventory-constrained trajectories) are reported.

    Authors: We agree that quantitative metrics are essential to substantiate the calibration claims. The initial manuscript describes the alignment process but does not report specific numerical results. In the revised version, we will add Pearson and Spearman correlation coefficients, inter-annotator agreement (Cohen's kappa), and out-of-distribution validation on budget- and inventory-constrained trajectories, presented in a dedicated subsection of the evaluation pipeline. revision: yes

  2. Referee: [MAMuT GEPA] MAMuT GEPA description: The novel joint optimization approach relies on multi-turn simulation and trajectory-level scoring to optimize across agents, but provides no ablations or fidelity checks on how preference distributions, inventory dynamics, or underspecified requests are modeled in the simulator, risking optimization for an unreliable proxy.

    Authors: We acknowledge that the manuscript does not include ablations or fidelity checks for the simulator. In the revision, we will add a new subsection detailing the modeling of preference distributions, inventory dynamics, and underspecified requests, along with ablations that vary these components and report their impact on optimization performance. We will also include fidelity checks comparing simulated trajectories to held-out real-user data. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces an evaluation rubric decomposed into structured dimensions and a calibrated LLM-as-judge pipeline explicitly aligned with external human annotations. It then describes two prompt-optimization strategies (Sub-agent GEPA and MAMuT GEPA) built on a cited SOTA optimizer. No equations, first-principles derivations, or predictions are present that reduce by construction to fitted parameters or self-defined inputs. The calibration step depends on independent human data rather than internal fits, and the self-citation to Herrera et al. 2026 for the novel MAMuT component is not load-bearing for any circular reduction—the current work presents the blueprint with external validation elements. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central contributions rest on assumptions about LLM judge calibration and the effectiveness of the new MAMuT optimization strategy; no free parameters are mentioned in the abstract.

axioms (2)
  • [domain assumption] LLM-as-a-judge pipelines can be calibrated to align with human annotations for multi-turn shopping interactions
    Central to the described evaluation foundation.
  • [ad hoc to paper] Prompt optimization methods like GEPA can be extended to joint system-level optimization across agents
    Basis for introducing the MAMuT method.
invented entities (1)
  • MAMuT GEPA [no independent evidence]
    purpose: Jointly optimize prompts across multiple agents using multi-turn simulation and trajectory-level scoring
    New method introduced as a novel system-level approach.

pith-pipeline@v0.9.0 · 5566 in / 1300 out tokens · 64650 ms · 2026-05-15T16:20:42.961870+00:00 · methodology
