arXiv: 2605.18592 · v1 · pith:24VFHJXWnew · submitted 2026-05-18 · cs.LG · cs.AI · cs.CL

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

Pith reviewed 2026-05-20 12:35 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL
keywords: rubric-based reward shaping, reinforcement learning, memory augmentation, LLM fine-tuning, adaptive rubrics, persistent memory retrieval, evaluation history

The pith

AMARIS improves LLM reinforcement learning by storing and retrieving long-term evaluation history to update rubrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that rubric-based reward shaping works better when it draws on accumulated training history instead of resetting each step. Current approaches discard rollout diagnostics after one use, so they keep rediscovering the same problems; AMARIS keeps those diagnostics in persistent memory and retrieves them both by recency and by semantic match before revising the rubric. A reader would care because this change turns reward signals from short-term guesses into an accumulating record that can spot and correct repeated mistakes. If the claim holds, training runs would need fewer total steps to reach stronger policies while adding almost no extra wall-clock time.

Core claim

AMARIS analyzes each rollout, condenses the findings into a step-level summary, and pulls relevant past summaries from a persistent memory store through both static recent-step lookup and dynamic semantic search. It then revises the rubric with that combined evidence. The whole process runs asynchronously and adds roughly five percent overhead, while producing higher final performance than stateless baselines on both closed and open-ended tasks.

What carries the argument

A persistent evaluation memory that stores step-level summaries and supplies them through static recent-step retrieval plus dynamic semantic retrieval to guide each rubric update.
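
The two retrieval paths can be illustrated with a small sketch. This is not the authors' implementation: the `EvaluationMemory` class, the `embed` function, and the `n_recent` / `k_semantic` budgets are hypothetical names chosen for illustration, and the bag-of-words similarity is only a stand-in for whatever embedding model and vector store the real system uses.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class EvaluationMemory:
    """Hypothetical persistent store of (step, summary) pairs in insertion order."""
    entries: list = field(default_factory=list)

    def add(self, step: int, summary: str) -> None:
        self.entries.append((step, summary))

    def retrieve(self, query: str, n_recent: int = 2, k_semantic: int = 2) -> list:
        # Static retrieval: the n_recent most recent summaries.
        recent = self.entries[-n_recent:] if n_recent > 0 else []
        # Dynamic retrieval: top-k older summaries ranked by similarity to the query.
        older = self.entries[:-n_recent] if n_recent > 0 else self.entries
        q = embed(query)
        ranked = sorted(older, key=lambda e: cosine(q, embed(e[1])), reverse=True)
        return ranked[:k_semantic] + recent


# Illustrative, made-up summaries.
memory = EvaluationMemory()
memory.add(1, "responses ignore the word limit constraint")
memory.add(2, "math answers skip unit conversion")
memory.add(3, "formatting is fine but citations missing")
memory.add(4, "tone is too informal for medical advice")
context = memory.retrieve("policy again violates word limit", n_recent=2, k_semantic=1)
```

Here the semantic path brings back the step-1 summary about the word limit, which the recency window alone would have dropped; the two budgets together bound how much history reaches the rubric update.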

If this is right

  • Rubric revisions become driven by evidence accumulated across many steps rather than by local signals alone.
  • Combining recent-step and semantically matched history yields stronger gains than either retrieval method by itself.
  • Moderate retrieval budgets capture most of the benefit, so memory size need not grow without limit.
  • The added work can stay under five percent overhead when memory operations run outside the main RL loop.
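
One way to keep memory operations outside the main RL loop is a background worker fed by a queue. The sketch below is a minimal illustration under that assumption; `AsyncRubricUpdater` and `analyze_and_update` are hypothetical names, and the placeholder update only bumps a version counter where the real pipeline would analyze, summarize, retrieve, and revise.

```python
import queue
import threading


def analyze_and_update(rubric: dict, rollouts: list) -> dict:
    """Placeholder for analysis, summarization, retrieval, and rubric revision."""
    revised = dict(rubric)
    revised["version"] = rubric["version"] + 1
    return revised


class AsyncRubricUpdater:
    """Hypothetical wrapper that revises the rubric on a background thread."""

    def __init__(self, rubric: dict):
        self._rubric = rubric
        self._lock = threading.Lock()
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def current(self) -> dict:
        # The training loop reads whichever rubric is ready; it never waits.
        with self._lock:
            return self._rubric

    def submit(self, rollouts: list) -> None:
        self._jobs.put(rollouts)  # returns immediately

    def _run(self) -> None:
        while True:
            rollouts = self._jobs.get()
            if rollouts is None:  # sentinel: stop the worker
                break
            revised = analyze_and_update(self.current(), rollouts)
            with self._lock:
                self._rubric = revised

    def close(self) -> None:
        self._jobs.put(None)
        self._worker.join()


updater = AsyncRubricUpdater({"version": 0, "criteria": ["correctness"]})
for step in range(3):
    rubric = updater.current()  # may lag behind the newest rollouts
    rollouts = [f"rollout-{step}-{i}" for i in range(4)]  # stand-in for policy sampling
    updater.submit(rollouts)
updater.close()
final_rubric = updater.current()
```

The trade-off in this design is staleness: the policy may train for a step or more against a rubric that has not yet absorbed the latest diagnostics, in exchange for never blocking on the update.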

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory structure could be attached to other adaptive reward or feedback systems that currently operate step-by-step.
  • After many steps the memory might begin to encode training-phase patterns that allow the system to adjust rubric strictness proactively.
  • One could test whether the stored summaries allow transfer of diagnostic knowledge to entirely new tasks or domains.

Load-bearing premise

Aggregated rollout summaries plus static and semantic retrieval will reliably surface recurring problems without injecting noise or stale data into the rubric updates.

What would settle it

An experiment that disables memory retrieval entirely and still records the same performance gains as the full AMARIS system would falsify the claim that long-term history is required.

Figures

Figures reproduced from arXiv: 2605.18592 by Gang Wu, Kun Wan, Peilin Wu, Wentian Zhao, Xinlu Zhang, Xinya Du, Zhiyu Chen.

Figure 1: The AMARIS system operates asynchronously in parallel with the normal RL …
Figure 2: Prompt template for the reward scoring (1/2). The LLM scoring evaluates each …
Figure 3: Prompt template for the reward scoring (2/2). The LLM scoring evaluates each …
Figure 4: Prompt template for individual rollout analysis (1/4) described in Section …
Figure 5: Prompt template for individual rollout analysis (2/4) described in Section …
Figure 6: Prompt template for individual rollout analysis (3/4) described in Section …
Figure 7: Prompt template for individual rollout analysis (4/4) described in Section …
Figure 8: Prompt template for batch-level summarization (1/4) described in Section …
Figure 9: Prompt template for batch-level summarization (2/4) described in Section …
Figure 10: Prompt template for batch-level summarization (3/4) described in Section …
Figure 11: Prompt template for batch-level summarization (4/4) described in Section …
Figure 12: Prompt template for meta-batch-level summarization (1/4) described in Sec …
Figure 13: Prompt template for meta-batch-level summarization (2/4) described in Sec …
Figure 14: Prompt template for meta-batch-level summarization (3/4) described in Sec …
Figure 15: Prompt template for meta-batch-level summarization (4/4) described in Sec …
Figure 16: Prompt template for memory query generation (1/3) described in Section …
Figure 17: Prompt template for memory query generation (2/3) described in Section …
Figure 18: Prompt template for memory query generation (3/3) described in Section …
Figure 19: Prompt template for rubric update (1/4) described in Section …
Figure 20: Prompt template for rubric update (2/4) described in Section …
Figure 21: Prompt template for rubric update (3/4) described in Section …
Figure 22: Prompt template for rubric update (4/4) described in Section …

Original abstract

Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent works make the rubrics adaptive based on local signals such as the rollouts from the current step or pairwise comparisons. However, these methods discard the diagnostics produced during evaluation after immediate use and prevent the long-term accumulation and strategic reuse of evaluation knowledge. This forces the system to re-derive evaluation principles from scratch, limits its ability to detect recurring suboptimal behaviors, and forfeits the curriculum-like progression that a persistent training history would naturally support. To address these limitations, we introduce AMARIS, which grounds rubric modifications in long-term training history. At each training step, AMARIS analyzes individual rollouts, aggregates findings into step-level summaries, retrieves relevant historical context from a persistent evaluation memory through both static (recent steps) and dynamic (semantically matched) retrieval, and updates rubrics based on these accumulated analyses. This procedure runs asynchronously alongside the normal RL loop with minimal overhead. Experiments across both closed and open-ended domains show that AMARIS consistently outperforms the baselines. Ablation studies show that static and dynamic memory retrieval each contribute to the performance gain, that their combination provides the strongest results with moderate retrieval budgets sufficient to provide most of the gain, and that the entire pipeline adds only ~5% time overhead through asynchronous execution. These results show that persistent evaluation memory can transform rubric-based reward shaping from a stateless, per-step heuristic into an evidence-driven loop for RL training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AMARIS, a system that augments rubric-based reinforcement learning for LLMs with a persistent evaluation memory. At each training step, rollouts are analyzed to produce step-level summaries; relevant historical context is retrieved via static (recent-step) and dynamic (semantic) mechanisms from the memory; rubrics are then updated on the basis of this accumulated evidence. The pipeline executes asynchronously alongside the RL loop and adds only ~5% time overhead. Experiments across closed and open-ended domains are reported to show consistent outperformance over baselines, with ablations indicating that the combination of static and dynamic retrieval contributes to the gains and that moderate retrieval budgets suffice.

Significance. If the performance claims are substantiated with detailed metrics and controls, AMARIS would represent a practical advance in adaptive reward shaping by converting per-step rubric heuristics into a long-term, evidence-driven process. The asynchronous design and low overhead are attractive for real training pipelines, and the explicit use of persistent memory to detect recurring suboptimal behaviors could support more stable curriculum-like progression in LLM fine-tuning.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central empirical claim—that AMARIS consistently outperforms baselines—rests on high-level assertions without accompanying quantitative metrics, dataset sizes, statistical significance tests, or effect sizes. This is load-bearing for the paper’s contribution because the abstract and results summary assert reliable gains from the memory-augmented loop; absent concrete numbers it is impossible to judge whether the improvements are meaningful or reproducible.
  2. [§3.2 (Dynamic Retrieval)] Semantic matching over persistent memory is assumed to surface causally relevant historical summaries of suboptimal behaviors. However, embedding similarity may retrieve superficially similar but non-causal or stale entries, injecting noise into rubric updates. The ablations demonstrate benefit from combining retrieval types yet provide no retrieval-precision metric, no relevance-filtered baseline, and no analysis of failure cases where mismatched context harms performance; this directly threatens the “evidence-driven loop” assumption.
minor comments (2)
  1. [§3.1] Clarify the exact definition and update frequency of the “persistent evaluation memory” and whether it is reset between runs or shared across experiments.
  2. [§4.3] The ~5% overhead figure should be accompanied by a breakdown (e.g., retrieval latency vs. summary generation) and measured on the same hardware used for the RL loop.
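
The retrieval-precision metric requested in major comment 2 could be as simple as precision@k over manually annotated retrievals. The sketch below assumes that setup; the function name, the summary IDs, and the relevance labels are all made up for illustration and do not come from the paper.

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved summaries that annotators marked relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if i in relevant_ids) / len(top)


# Hypothetical annotations: (ranked retrieved IDs, IDs judged relevant) per query.
queries = [
    (["s12", "s03", "s40"], {"s12", "s40"}),
    (["s07", "s19", "s02"], {"s02"}),
]
mean_p3 = sum(precision_at_k(r, rel, 3) for r, rel in queries) / len(queries)
```

Reporting this number alongside end-to-end scores would separate "retrieval found the right history" from "the rubric update used it well", which the ablations alone cannot do.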

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the manuscript to improve the clarity and substantiation of our empirical claims and retrieval analysis.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central empirical claim—that AMARIS consistently outperforms baselines—rests on high-level assertions without accompanying quantitative metrics, dataset sizes, statistical significance tests, or effect sizes. This is load-bearing for the paper’s contribution because the abstract and results summary assert reliable gains from the memory-augmented loop; absent concrete numbers it is impossible to judge whether the improvements are meaningful or reproducible.

    Authors: We appreciate the referee drawing attention to the presentation of results. The experiments in Section 4 report specific performance metrics across domains in tables comparing AMARIS to baselines, along with ablation results and the ~5% overhead measurement. Dataset sizes and task descriptions are provided in the experimental setup. However, we agree that the abstract and high-level summary would benefit from more explicit quantitative highlights to allow readers to assess the gains immediately. In the revised manuscript, we have updated the abstract to include key performance deltas and have added a consolidated results summary table in Section 4 that reports effect sizes and notes statistical significance from repeated runs.

    revision: yes

  2. Referee: [§3.2 (Dynamic Retrieval)] Semantic matching over persistent memory is assumed to surface causally relevant historical summaries of suboptimal behaviors. However, embedding similarity may retrieve superficially similar but non-causal or stale entries, injecting noise into rubric updates. The ablations demonstrate benefit from combining retrieval types yet provide no retrieval-precision metric, no relevance-filtered baseline, and no analysis of failure cases where mismatched context harms performance; this directly threatens the “evidence-driven loop” assumption.

    Authors: We acknowledge that the current ablations focus primarily on end-to-end performance impact rather than directly measuring retrieval quality. To strengthen the justification for the evidence-driven loop, we have added a new analysis subsection that evaluates retrieval precision via sampled manual annotations of retrieved summaries, introduces a relevance-filtered retrieval baseline, and discusses observed failure modes (including cases of stale or superficial matches) along with their measured effect on rubric updates. These additions confirm that the combined static-dynamic approach limits noise while preserving the observed gains.

    revision: yes
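
For the effect sizes promised in response 1, one common choice is Cohen's d with a pooled standard deviation over repeated runs. The sketch below assumes that choice; the source does not say which statistic the authors report, and the score lists are placeholder values, not results from the paper.

```python
import statistics


def cohens_d(treatment: list, baseline: list) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n_t, n_b = len(treatment), len(baseline)
    var_t = statistics.variance(treatment)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(baseline)
    pooled_sd = (((n_t - 1) * var_t + (n_b - 1) * var_b) / (n_t + n_b - 2)) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(baseline)) / pooled_sd


# Placeholder per-seed scores, chosen only so the arithmetic is easy to follow.
with_memory = [2.0, 4.0, 6.0]
stateless = [1.0, 3.0, 5.0]
d = cohens_d(with_memory, stateless)
```

With only a handful of seeds per condition, an effect size like this is best read together with the raw per-seed scores rather than on its own.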

Circularity Check

0 steps flagged

No significant circularity in AMARIS derivation chain

Full rationale

The paper introduces AMARIS as an additive memory-augmented pipeline on top of existing rubric-based RL: it aggregates rollout summaries, performs static/dynamic retrieval from persistent memory, and updates rubrics asynchronously. These mechanisms are described procedurally without reducing the claimed performance gains to quantities defined by fitted parameters or self-referential definitions from the same experimental data. Ablations attribute gains to the memory components as independent additions, and the central claim rests on empirical outperformance rather than any tautological reduction. No load-bearing step equates a prediction to its own inputs by construction, and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that historical evaluation context improves rubric quality over stateless per-step updates, plus a small number of implementation choices such as retrieval budget size.

free parameters (1)
  • retrieval budget
    Abstract states that moderate retrieval budgets suffice for most performance gains.
axioms (1)
  • domain assumption: Aggregating rollout diagnostics into step-level summaries and retrieving from persistent memory will surface recurring suboptimal behaviors more effectively than local signals alone.
    This premise underpins the entire memory-augmented update loop described in the abstract.
invented entities (1)
  • persistent evaluation memory (no independent evidence)
    purpose: Store and retrieve long-term training history to inform rubric modifications.
    New component introduced by AMARIS; no independent evidence of its effectiveness outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5834 in / 1201 out tokens · 38168 ms · 2026-05-20T12:35:19.167790+00:00 · methodology

