AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

Gang Wu; Kun Wan; Peilin Wu; Wentian Zhao; Xinlu Zhang; Xinya Du; Zhiyu Chen

arXiv: 2605.18592 · v2 · pith:24VFHJXWnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 12:35 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords rubric-based reward shaping · reinforcement learning · memory augmentation · LLM fine-tuning · adaptive rubrics · persistent memory retrieval · evaluation history

The pith

AMARIS improves LLM reinforcement learning by storing and retrieving long-term evaluation history to update rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that rubric-based reward shaping works better when it draws on accumulated training history instead of resetting each step. Current approaches discard rollout diagnostics after one use, so they keep rediscovering the same problems; AMARIS keeps those diagnostics in persistent memory and retrieves them both by recency and by semantic match before revising the rubric. A reader would care because this change turns reward signals from short-term guesses into an accumulating record that can spot and correct repeated mistakes. If the claim holds, training runs would need fewer total steps to reach stronger policies while adding almost no extra wall-clock time.

Core claim

AMARIS analyzes each rollout, condenses the findings into a step-level summary, pulls relevant past summaries from a persistent memory store through both static recent-step lookup and dynamic semantic search, then revises the rubric with that combined evidence; the whole process runs asynchronously and adds roughly five percent overhead while producing higher final performance than stateless baselines in both closed and open-ended tasks.

What carries the argument

A persistent evaluation memory that stores step-level summaries and supplies them through static recent-step retrieval plus dynamic semantic retrieval to guide each rubric update.
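The two retrieval paths described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `EvaluationMemory` class, its method names, and the bag-of-words cosine similarity (a toy stand-in for whatever embedding model the paper uses) are all assumptions made for the example, as are the sample summaries.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def _bow(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class EvaluationMemory:
    """Hypothetical store of step-level summaries (one per training step)."""
    entries: list = field(default_factory=list)  # (step, summary), insertion order

    def add(self, step: int, summary: str) -> None:
        self.entries.append((step, summary))

    def static_recent(self, n: int) -> list:
        """Static retrieval: the n most recently stored summaries."""
        return self.entries[-n:] if n > 0 else []

    def dynamic_semantic(self, query: str, k: int) -> list:
        """Dynamic retrieval: the k summaries most similar to the query."""
        q = _bow(query)
        ranked = sorted(self.entries, key=lambda e: _cosine(q, _bow(e[1])), reverse=True)
        return ranked[:k]

    def retrieve(self, query: str, n_recent: int, k_semantic: int) -> list:
        """Union of both paths, de-duplicated by step and returned in step order."""
        merged = dict(self.static_recent(n_recent))
        for step, summary in self.dynamic_semantic(query, k_semantic):
            merged.setdefault(step, summary)
        return sorted(merged.items())


# Illustrative summaries, invented for this example.
memory = EvaluationMemory()
memory.add(1, "answers omit units in physics calculations")
memory.add(2, "responses ignore the requested word limit")
memory.add(3, "citations are missing for medical claims")
memory.add(4, "formatting of bullet lists is inconsistent")

context = memory.retrieve("model drops units in calculations", n_recent=1, k_semantic=1)
retrieved_steps = [step for step, _ in context]
```

Here `n_recent` and `k_semantic` together play the role of the retrieval budget: the static path contributes the latest step regardless of content, while the semantic path can reach back to an older summary about the same failure.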

If this is right

Rubric revisions become driven by evidence accumulated across many steps rather than by local signals alone.
Combining recent-step and semantically matched history yields stronger gains than either retrieval method by itself.
Moderate retrieval budgets capture most of the benefit, so memory size need not grow without limit.
The added work can stay under five percent overhead when memory operations run outside the main RL loop.
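One way to picture memory operations running outside the main RL loop is a background worker fed by a queue. The sketch below is an assumption about how such a design could look, not the paper's code: `run_training`, the `revise_rubric` callback, and the string rubric versions are hypothetical names chosen for illustration.

```python
import queue
import threading


def run_training(num_steps: int, revise_rubric) -> tuple:
    """Toy RL loop: rubric revision runs on a worker thread, never inline."""
    pending = queue.Queue()  # step diagnostics waiting for analysis
    lock = threading.Lock()
    state = {"rubric": "v0", "revisions": 0}

    def worker() -> None:
        while True:
            item = pending.get()
            if item is None:  # sentinel: training has finished
                break
            # Slow work (analysis, retrieval, revision) happens off the main loop.
            new_rubric = revise_rubric(state["rubric"], item)
            with lock:
                state["rubric"] = new_rubric
                state["revisions"] += 1

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    rubrics_seen = []
    for step in range(num_steps):
        with lock:
            rubric = state["rubric"]  # use whichever rubric is ready right now
        rubrics_seen.append(rubric)
        # ... policy rollout and gradient update would happen here ...
        pending.put({"step": step})  # hand off diagnostics without waiting

    pending.put(None)
    thread.join()
    return rubrics_seen, state


# Hypothetical reviser: just bumps a version label per processed step.
rubrics_seen, final_state = run_training(5, lambda rubric, diag: f"v{diag['step'] + 1}")
```

The trade-off in this sketch is staleness: the training loop may score a step with a rubric that is one or more revisions behind, in exchange for never blocking on the revision itself.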

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same memory structure could be attached to other adaptive reward or feedback systems that currently operate step-by-step.
After many steps the memory might begin to encode training-phase patterns that allow the system to adjust rubric strictness proactively.
One could test whether the stored summaries allow transfer of diagnostic knowledge to entirely new tasks or domains.

Load-bearing premise

Aggregated rollout summaries plus static and semantic retrieval will reliably surface recurring problems without injecting noise or stale data into the rubric updates.

What would settle it

An experiment that disables memory retrieval entirely and still records the same performance gains as the full AMARIS system would falsify the claim that long-term history is required.

Figures

Figures reproduced from arXiv: 2605.18592 by Gang Wu, Kun Wan, Peilin Wu, Wentian Zhao, Xinlu Zhang, Xinya Du, Zhiyu Chen.

**Figure 2.** Prompt template for the reward scoring (1/2). The LLM scoring evaluates each …

**Figure 3.** Prompt template for the reward scoring (2/2). The LLM scoring evaluates each …

**Figure 4.** Prompt template for individual rollout analysis (1/4) described in Section …

**Figure 5.** Prompt template for individual rollout analysis (2/4) described in Section …

**Figure 6.** Prompt template for individual rollout analysis (3/4) described in Section …

**Figure 7.** Prompt template for individual rollout analysis (4/4) described in Section …

**Figure 8.** Prompt template for batch-level summarization (1/4) described in Section …

**Figure 9.** Prompt template for batch-level summarization (2/4) described in Section …

**Figure 10.** Prompt template for batch-level summarization (3/4) described in Section …

**Figure 11.** Prompt template for batch-level summarization (4/4) described in Section …

**Figure 12.** Prompt template for meta-batch-level summarization (1/4) described in Sec …

**Figure 13.** Prompt template for meta-batch-level summarization (2/4) described in Sec …

**Figure 14.** Prompt template for meta-batch-level summarization (3/4) described in Sec …

**Figure 15.** Prompt template for meta-batch-level summarization (4/4) described in Sec …

**Figure 16.** Prompt template for memory query generation (1/3) described in Section …

**Figure 17.** Prompt template for memory query generation (2/3) described in Section …

**Figure 18.** Prompt template for memory query generation (3/3) described in Section …

**Figure 19.** Prompt template for rubric update (1/4) described in Section …

**Figure 20.** Prompt template for rubric update (2/4) described in Section …

**Figure 21.** Prompt template for rubric update (3/4) described in Section …

**Figure 22.** Prompt template for rubric update (4/4) described in Section …

Original abstract

Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring failures, evaluate previous rubric edits, or raise standards once earlier criteria become saturated. We introduce AMARIS, A Memory-Augmented Rubric Improvement System that grounds rubric updates in longitudinal training evidence. AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, then retrieves recent and semantically relevant history to revise rubrics. We evaluate AMARIS across science, medicine, instruction following, and creative writing under both global and instance-specific rubric settings. AMARIS improves over static, local-adaptive, and memory-ablated baselines, such as +2.8 points on GPQA-Diamond and +2.2 points on IFBench over the strongest baselines, while analysis shows that memory reduces oscillatory rubric edits and supports a progression from early failure correction to later curriculum advancement. AMARIS runs asynchronously alongside the normal RL loop, reducing blocking latency relative to synchronous rubric updates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AMARIS adds persistent memory and dual retrieval to rubric RL but needs stronger evidence to back its performance claims.

The main thing to know about this paper is that AMARIS introduces a persistent memory system to accumulate evaluation knowledge across training steps for rubric-based RL on LLMs.

What the work does well is identify the limitation in prior adaptive rubric approaches that throw away diagnostics after each use. It then proposes a practical fix: aggregate rollout findings into summaries, retrieve via static recent history and dynamic semantic search from memory, and update rubrics asynchronously. The low overhead and the ablation results showing benefit from combining retrieval methods are positive points. This turns the process into more of an ongoing evidence loop rather than repeated restarts.

The soft spots center on the lack of detailed quantitative support. The abstract claims consistent outperformance and positive ablations but gives no specific metrics, dataset info, or significance tests. This makes it hard to assess the real impact. The stress-test concern about dynamic retrieval possibly introducing noise from non-causal matches has merit here, since the paper does not appear to include direct tests of retrieval relevance or precision. If the full experiments address this, it would strengthen the case.

This kind of paper is for people already working on LLM alignment and RL reward shaping. Readers looking for incremental improvements to existing rubric methods will find the architecture straightforward to understand. It deserves a serious referee because the core idea is well-motivated and the method is described in enough detail to review properly. I recommend sending it to peer review but with a note to expand on the experimental results and retrieval quality checks.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AMARIS, a system that augments rubric-based reinforcement learning for LLMs with a persistent evaluation memory. At each training step, rollouts are analyzed to produce step-level summaries; relevant historical context is retrieved via static (recent-step) and dynamic (semantic) mechanisms from the memory; rubrics are then updated on the basis of this accumulated evidence. The pipeline executes asynchronously alongside the RL loop and adds only ~5% time overhead. Experiments across closed and open-ended domains are reported to show consistent outperformance over baselines, with ablations indicating that the combination of static and dynamic retrieval contributes to the gains and that moderate retrieval budgets suffice.

Significance. If the performance claims are substantiated with detailed metrics and controls, AMARIS would represent a practical advance in adaptive reward shaping by converting per-step rubric heuristics into a long-term, evidence-driven process. The asynchronous design and low overhead are attractive for real training pipelines, and the explicit use of persistent memory to detect recurring suboptimal behaviors could support more stable curriculum-like progression in LLM fine-tuning.

major comments (2)

[Abstract and §4 (Experiments)] The central empirical claim—that AMARIS consistently outperforms baselines—rests on high-level assertions without accompanying quantitative metrics, dataset sizes, statistical significance tests, or effect sizes. This is load-bearing for the paper’s contribution because the abstract and results summary assert reliable gains from the memory-augmented loop; absent concrete numbers it is impossible to judge whether the improvements are meaningful or reproducible.
[§3.2 (Dynamic Retrieval)] Semantic matching over persistent memory is assumed to surface causally relevant historical summaries of suboptimal behaviors. However, embedding similarity may retrieve superficially similar but non-causal or stale entries, injecting noise into rubric updates. The ablations demonstrate benefit from combining retrieval types yet provide no retrieval-precision metric, no relevance-filtered baseline, and no analysis of failure cases where mismatched context harms performance; this directly threatens the “evidence-driven loop” assumption.
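A retrieval-precision metric of the kind requested here could be as simple as precision@k over manually labeled samples. The following is a sketch under that assumption; the function names and the example entry IDs are hypothetical and do not come from the paper.

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved entries that annotators marked relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for entry_id in retrieved_ids[:k] if entry_id in relevant_ids)
    return hits / k


def mean_precision_at_k(queries: list, k: int) -> float:
    """Average precision@k over (retrieved_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(precision_at_k(r, rel, k) for r, rel in queries) / len(queries)


# Made-up memory entry IDs, for illustration only.
single = precision_at_k([12, 7, 3, 40], {7, 40, 99}, k=3)
average = mean_precision_at_k(
    [([12, 7, 3, 40], {7, 40, 99}), ([5, 6], {5, 6})],
    k=2,
)
```

Dividing by `k` rather than by the number of entries actually returned penalizes queries that retrieve fewer than `k` summaries; whether that is the right convention depends on how the retrieval budget is enforced.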

minor comments (2)

[§3.1] Clarify the exact definition and update frequency of the “persistent evaluation memory” and whether it is reset between runs or shared across experiments.
[§4.3] The ~5% overhead figure should be accompanied by a breakdown (e.g., retrieval latency vs. summary generation) and measured on the same hardware used for the RL loop.
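The breakdown asked for in the last comment amounts to expressing each added blocking cost as a fraction of the baseline step time. A minimal sketch follows; the function name, the component names, and all timings are placeholders chosen so the total comes to 5%, not measurements from the paper.

```python
def overhead_breakdown(baseline_s: float, component_s: dict) -> dict:
    """Express each added blocking cost as a fraction of baseline step time."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    shares = {name: secs / baseline_s for name, secs in component_s.items()}
    # Note: a component literally named "total" would be overwritten here.
    shares["total"] = sum(component_s.values()) / baseline_s
    return shares


# Placeholder timings in seconds (hypothetical, not reported results).
shares = overhead_breakdown(
    baseline_s=100.0,
    component_s={"retrieval": 1.0, "summary_generation": 3.0, "rubric_update": 1.0},
)
```

Only time that blocks the RL loop should be counted in `component_s`; work that overlaps with training on a separate thread or device would not add to wall-clock overhead.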

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the manuscript to improve the clarity and substantiation of our empirical claims and retrieval analysis.

Referee: [Abstract and §4 (Experiments)] The central empirical claim—that AMARIS consistently outperforms baselines—rests on high-level assertions without accompanying quantitative metrics, dataset sizes, statistical significance tests, or effect sizes. This is load-bearing for the paper’s contribution because the abstract and results summary assert reliable gains from the memory-augmented loop; absent concrete numbers it is impossible to judge whether the improvements are meaningful or reproducible.

Authors: We appreciate the referee drawing attention to the presentation of results. The experiments in Section 4 report specific performance metrics across domains in tables comparing AMARIS to baselines, along with ablation results and the ~5% overhead measurement. Dataset sizes and task descriptions are provided in the experimental setup. However, we agree that the abstract and high-level summary would benefit from more explicit quantitative highlights to allow readers to assess the gains immediately. In the revised manuscript, we have updated the abstract to include key performance deltas and have added a consolidated results summary table in Section 4 that reports effect sizes and notes statistical significance from repeated runs.
revision: yes

Referee: [§3.2 (Dynamic Retrieval)] Semantic matching over persistent memory is assumed to surface causally relevant historical summaries of suboptimal behaviors. However, embedding similarity may retrieve superficially similar but non-causal or stale entries, injecting noise into rubric updates. The ablations demonstrate benefit from combining retrieval types yet provide no retrieval-precision metric, no relevance-filtered baseline, and no analysis of failure cases where mismatched context harms performance; this directly threatens the “evidence-driven loop” assumption.

Authors: We acknowledge that the current ablations focus primarily on end-to-end performance impact rather than directly measuring retrieval quality. To strengthen the justification for the evidence-driven loop, we have added a new analysis subsection that evaluates retrieval precision via sampled manual annotations of retrieved summaries, introduces a relevance-filtered retrieval baseline, and discusses observed failure modes (including cases of stale or superficial matches) along with their measured effect on rubric updates. These additions confirm that the combined static-dynamic approach limits noise while preserving the observed gains.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in AMARIS derivation chain

The paper introduces AMARIS as an additive memory-augmented pipeline on top of existing rubric-based RL: it aggregates rollout summaries, performs static/dynamic retrieval from persistent memory, and updates rubrics asynchronously. These mechanisms are described procedurally without reducing the claimed performance gains to quantities defined by fitted parameters or self-referential definitions from the same experimental data. Ablations attribute gains to the memory components as independent additions, and the central claim rests on empirical outperformance rather than any tautological reduction. No load-bearing step equates a prediction to its own inputs by construction, and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that historical evaluation context improves rubric quality over stateless per-step updates, plus a small number of implementation choices such as retrieval budget size.

free parameters (1)

retrieval budget
Abstract states that moderate retrieval budgets suffice for most performance gains.

axioms (1)

domain assumption: Aggregating rollout diagnostics into step-level summaries and retrieving from persistent memory will surface recurring suboptimal behaviors more effectively than local signals alone.
This premise underpins the entire memory-augmented update loop described in the abstract.

invented entities (1)

persistent evaluation memory (no independent evidence)
purpose: Store and retrieve long-term training history to inform rubric modifications.
New component introduced by AMARIS; no independent evidence of its effectiveness outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5834 in / 1201 out tokens · 38168 ms · 2026-05-20T12:35:19.167790+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, and uses both static and dynamic retrieval to ground rubric changes in training history."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Ablation studies show that static and dynamic memory retrieval contributes to the performance gain"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.