AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
Pith reviewed 2026-05-20 12:35 UTC · model grok-4.3
The pith
AMARIS improves LLM reinforcement learning by storing and retrieving long-term evaluation history to update rubrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AMARIS analyzes each rollout, condenses the findings into a step-level summary, and pulls relevant past summaries from a persistent memory store through both static recent-step lookup and dynamic semantic search. It then revises the rubric using that combined evidence. The whole process runs asynchronously, adds roughly five percent time overhead, and yields higher final performance than stateless baselines on both closed and open-ended tasks.
What carries the argument
A persistent evaluation memory that stores step-level summaries and supplies them through static recent-step retrieval plus dynamic semantic retrieval to guide each rubric update.
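To make the two retrieval paths concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical `EvaluationMemory` class, uses a bag-of-words cosine similarity as a stand-in for a real embedding model, and treats `k_static` and `k_dynamic` as illustrative retrieval budgets. None of these names come from the paper.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StepSummary:
    step: int
    text: str


def _bow(text: str) -> Counter:
    # Assumption: a lowercase bag-of-words vector stands in for an embedding.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class EvaluationMemory:
    """Hypothetical in-process store of step-level summaries."""
    entries: list = field(default_factory=list)

    def add(self, summary: StepSummary) -> None:
        self.entries.append(summary)

    def retrieve_static(self, k: int) -> list:
        # Static retrieval: the k most recent step summaries.
        return self.entries[-k:] if k > 0 else []

    def retrieve_dynamic(self, query: str, k: int) -> list:
        # Dynamic retrieval: the k summaries most similar to the query.
        q = _bow(query)
        ranked = sorted(self.entries, key=lambda e: _cosine(q, _bow(e.text)), reverse=True)
        return ranked[:k]

    def retrieve(self, query: str, k_static: int, k_dynamic: int) -> list:
        # Combine both sources, dropping duplicates while keeping order.
        seen, combined = set(), []
        for e in self.retrieve_static(k_static) + self.retrieve_dynamic(query, k_dynamic):
            if e.step not in seen:
                seen.add(e.step)
                combined.append(e)
        return combined


memory = EvaluationMemory()
memory.add(StepSummary(1, "responses skip unit conversion in final answer"))
memory.add(StepSummary(2, "citations missing for medical claims"))
memory.add(StepSummary(3, "answers too verbose and repeat the question"))
memory.add(StepSummary(4, "formatting of lists is inconsistent"))

context = memory.retrieve("final answer has wrong unit conversion", k_static=2, k_dynamic=1)
print([e.step for e in context])
```

In this toy run the static path contributes the two latest steps and the dynamic path recovers the older step that matches the query, which is the kind of recurring issue a recent-only window would miss.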
If this is right
- Rubric revisions become driven by evidence accumulated across many steps rather than by local signals alone.
- Combining recent-step and semantically matched history yields stronger gains than either retrieval method by itself.
- Moderate retrieval budgets capture most of the benefit, so memory size need not grow without limit.
- The added work can stay under five percent overhead when memory operations run outside the main RL loop.
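The last point depends on keeping memory operations off the critical path. The sketch below shows one way that could look, assuming a hypothetical `update_rubric` function and a single background worker. The sleeps are placeholders for real work, and the paper's actual scheduling may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def update_rubric(step: int, rubric: dict) -> dict:
    # Hypothetical slow work: rollout analysis, retrieval, rubric revision.
    time.sleep(0.01)
    revised = dict(rubric)
    revised["version"] = step
    return revised


def train(num_steps: int) -> tuple:
    rubric = {"version": 0}
    versions_used = []
    pending = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step in range(1, num_steps + 1):
            # Swap in a revised rubric only if the background job has finished,
            # so the training loop never blocks on memory operations.
            if pending is not None and pending.done():
                rubric = pending.result()
                pending = None
            versions_used.append(rubric["version"])
            time.sleep(0.02)  # stand-in for the RL update of this step
            if pending is None:
                pending = pool.submit(update_rubric, step, rubric)
    return rubric, versions_used


final_rubric, versions = train(5)
print(versions)
```

The trade-off in this design is staleness: each step is scored with a rubric derived from an earlier step, in exchange for the main loop never waiting on the update.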
Where Pith is reading between the lines
- The same memory structure could be attached to other adaptive reward or feedback systems that currently operate step-by-step.
- After many steps the memory might begin to encode training-phase patterns that allow the system to adjust rubric strictness proactively.
- One could test whether the stored summaries allow transfer of diagnostic knowledge to entirely new tasks or domains.
Load-bearing premise
Aggregated rollout summaries plus static and semantic retrieval will reliably surface recurring problems without injecting noise or stale data into the rubric updates.
What would settle it
An experiment that disables memory retrieval entirely and still records the same performance gains as the full AMARIS system would falsify the claim that long-term history is required.
Original abstract
Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent works make the rubrics adaptive based on local signals such as the rollouts from the current step or pairwise comparisons. However, these methods discard the diagnostics produced during evaluation after immediate use and prevent the long-term accumulation and strategic reuse of evaluation knowledge. This forces the system to re-derive evaluation principles from scratch, limits its ability to detect recurring suboptimal behaviors, and forfeits the curriculum-like progression that a persistent training history would naturally support. To address these limitations, we introduce AMARIS, which grounds rubric modifications in long-term training history. At each training step, AMARIS analyzes individual rollouts, aggregates findings into step-level summaries, retrieves relevant historical context from a persistent evaluation memory through both static (recent steps) and dynamic (semantically matched) retrieval, and updates rubrics based on these accumulated analyses. This procedure runs asynchronously alongside the normal RL loop with minimal overhead. Experiments across both closed and open-ended domains show that AMARIS consistently outperforms the baselines. Ablation studies show that static and dynamic memory retrieval both contribute to the performance gain, that their combination provides the strongest results, with moderate retrieval budgets sufficient to provide most of the gain, and that the entire pipeline adds only ~5% time overhead through asynchronous execution. These results show that persistent evaluation memory can transform rubric-based reward shaping from a stateless, per-step heuristic into an evidence-driven loop for RL training.
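As an illustration of what "decomposing an outcome reward into multiple dimensions" can mean, the sketch below computes a weighted average over per-criterion scores. The criterion names, weights, and the `rubric_reward` helper are assumptions made for this example; the abstract does not specify the aggregation rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float


def rubric_reward(scores: dict, rubric: list) -> float:
    """Weighted average of per-criterion scores, each assumed to lie in [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    # A criterion with no score counts as 0.0 (an assumption of this sketch).
    return sum(c.weight * scores.get(c.name, 0.0) for c in rubric) / total_weight


# Hypothetical rubric and judge scores for one rollout.
rubric = [Criterion("correctness", 2.0), Criterion("clarity", 1.0), Criterion("safety", 1.0)]
scores = {"correctness": 1.0, "clarity": 0.5, "safety": 1.0}
reward = rubric_reward(scores, rubric)
print(reward)  # 0.875
```

Under this framing, a rubric update amounts to adding, removing, or reweighting `Criterion` entries between training steps.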
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AMARIS, a system that augments rubric-based reinforcement learning for LLMs with a persistent evaluation memory. At each training step, rollouts are analyzed to produce step-level summaries; relevant historical context is retrieved via static (recent-step) and dynamic (semantic) mechanisms from the memory; rubrics are then updated on the basis of this accumulated evidence. The pipeline executes asynchronously alongside the RL loop and adds only ~5% time overhead. Experiments across closed and open-ended domains are reported to show consistent outperformance over baselines, with ablations indicating that the combination of static and dynamic retrieval contributes to the gains and that moderate retrieval budgets suffice.
Significance. If the performance claims are substantiated with detailed metrics and controls, AMARIS would represent a practical advance in adaptive reward shaping by converting per-step rubric heuristics into a long-term, evidence-driven process. The asynchronous design and low overhead are attractive for real training pipelines, and the explicit use of persistent memory to detect recurring suboptimal behaviors could support more stable curriculum-like progression in LLM fine-tuning.
major comments (2)
- [Abstract and §4 (Experiments)] The central empirical claim—that AMARIS consistently outperforms baselines—rests on high-level assertions without accompanying quantitative metrics, dataset sizes, statistical significance tests, or effect sizes. This is load-bearing for the paper’s contribution because the abstract and results summary assert reliable gains from the memory-augmented loop; absent concrete numbers it is impossible to judge whether the improvements are meaningful or reproducible.
- [§3.2 (Dynamic Retrieval)] Semantic matching over persistent memory is assumed to surface causally relevant historical summaries of suboptimal behaviors. However, embedding similarity may retrieve superficially similar but non-causal or stale entries, injecting noise into rubric updates. The ablations demonstrate benefit from combining retrieval types yet provide no retrieval-precision metric, no relevance-filtered baseline, and no analysis of failure cases where mismatched context harms performance; this directly threatens the “evidence-driven loop” assumption.
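The retrieval-precision metric requested in the second comment could be computed along the following lines. This is a sketch of standard precision@k, assuming that human annotators supply a set of relevant summary IDs per query; the IDs below are placeholders, not data from the paper.

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved summaries that annotators marked relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    # Dividing by k (not len(top_k)) penalizes retrievers that return too few items.
    return sum(1 for i in top_k if i in relevant_ids) / k


def mean_precision_at_k(queries: list, k: int) -> float:
    """Average precision@k over (retrieved_ids, relevant_ids) pairs."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(precision_at_k(r, rel, k) for r, rel in queries) / len(queries)


# Placeholder annotations: (ranked retrieved IDs, IDs judged relevant).
annotated = [
    ([12, 7, 3], {12, 3}),
    ([5, 9, 1], {9}),
]
score = mean_precision_at_k(annotated, k=2)
print(score)  # 0.5
```

Reporting this number separately for the static and dynamic paths would show whether semantic matching contributes relevant history or mostly noise.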
minor comments (2)
- [§3.1] Clarify the exact definition and update frequency of the “persistent evaluation memory” and whether it is reset between runs or shared across experiments.
- [§4.3] The ~5% overhead figure should be accompanied by a breakdown (e.g., retrieval latency vs. summary generation) and measured on the same hardware used for the RL loop.
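A per-stage breakdown of the kind requested for §4.3 could be collected with a small timing helper such as the one below. `StageTimer` and the stage names are hypothetical, and the sleeps stand in for real work. Note that with asynchronous execution, summed stage time is not the same as added wall-clock time, so both should be reported.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates time spent in named stages (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def overhead_fraction(self, baseline: str) -> dict:
        # Express every other stage as a fraction of the baseline stage's time.
        base = self.totals.get(baseline, 0.0)
        if base <= 0:
            raise ValueError(f"no time recorded for baseline stage {baseline!r}")
        return {n: t / base for n, t in self.totals.items() if n != baseline}


timer = StageTimer()
with timer.stage("rl_step"):
    time.sleep(0.02)  # stand-in for the RL update
with timer.stage("summary_generation"):
    time.sleep(0.002)  # stand-in for summarizing rollouts
with timer.stage("retrieval"):
    time.sleep(0.001)  # stand-in for memory lookup

breakdown = timer.overhead_fraction("rl_step")
print(breakdown)
```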
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and have revised the manuscript to improve the clarity and substantiation of our empirical claims and retrieval analysis.
- Referee: [Abstract and §4 (Experiments)] The central empirical claim, that AMARIS consistently outperforms baselines, rests on high-level assertions without accompanying quantitative metrics, dataset sizes, statistical significance tests, or effect sizes. This is load-bearing for the paper’s contribution because the abstract and results summary assert reliable gains from the memory-augmented loop; absent concrete numbers it is impossible to judge whether the improvements are meaningful or reproducible.
  Authors: We appreciate the referee drawing attention to the presentation of results. The experiments in Section 4 report specific performance metrics across domains in tables comparing AMARIS to baselines, along with ablation results and the ~5% overhead measurement. Dataset sizes and task descriptions are provided in the experimental setup. However, we agree that the abstract and high-level summary would benefit from more explicit quantitative highlights to allow readers to assess the gains immediately. In the revised manuscript, we have updated the abstract to include key performance deltas and have added a consolidated results summary table in Section 4 that reports effect sizes and notes statistical significance from repeated runs.
  Revision: yes
- Referee: [§3.2 (Dynamic Retrieval)] Semantic matching over persistent memory is assumed to surface causally relevant historical summaries of suboptimal behaviors. However, embedding similarity may retrieve superficially similar but non-causal or stale entries, injecting noise into rubric updates. The ablations demonstrate benefit from combining retrieval types yet provide no retrieval-precision metric, no relevance-filtered baseline, and no analysis of failure cases where mismatched context harms performance; this directly threatens the “evidence-driven loop” assumption.
  Authors: We acknowledge that the current ablations focus primarily on end-to-end performance impact rather than directly measuring retrieval quality. To strengthen the justification for the evidence-driven loop, we have added a new analysis subsection that evaluates retrieval precision via sampled manual annotations of retrieved summaries, introduces a relevance-filtered retrieval baseline, and discusses observed failure modes (including cases of stale or superficial matches) along with their measured effect on rubric updates. These additions confirm that the combined static-dynamic approach limits noise while preserving the observed gains.
  Revision: yes
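For readers who want to check the promised effect sizes and significance from repeated runs, the sketch below computes Cohen's d and Welch's t statistic with the standard library. The scores are placeholder values chosen for easy arithmetic, not results from the paper, and a full analysis would also convert the t statistic to a p-value.

```python
import math
import statistics


def cohens_d(a: list, b: list) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def welch_t(a: list, b: list) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Placeholder final scores from three seeds per method (not from the paper).
method_runs = [0.62, 0.64, 0.66]
baseline_runs = [0.58, 0.60, 0.62]

d = cohens_d(method_runs, baseline_runs)
t = welch_t(method_runs, baseline_runs)
print(round(d, 3), round(t, 3))
```

With only a handful of seeds the estimates are noisy, so reporting the per-seed scores next to the summary statistics would help readers judge reproducibility.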
Circularity Check
No significant circularity in AMARIS derivation chain
The paper introduces AMARIS as an additive memory-augmented pipeline on top of existing rubric-based RL: it aggregates rollout summaries, performs static/dynamic retrieval from persistent memory, and updates rubrics asynchronously. These mechanisms are described procedurally without reducing the claimed performance gains to quantities defined by fitted parameters or self-referential definitions from the same experimental data. Ablations attribute gains to the memory components as independent additions, and the central claim rests on empirical outperformance rather than any tautological reduction. No load-bearing step equates a prediction to its own inputs by construction, and the derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- retrieval budget
axioms (1)
- domain assumption: Aggregating rollout diagnostics into step-level summaries and retrieving from persistent memory will surface recurring suboptimal behaviors more effectively than local signals alone.
invented entities (1)
- persistent evaluation memory: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, and uses both static and dynamic retrieval to ground rubric changes in training history.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · tag: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: Ablation studies show that static and dynamic memory retrieval both contribute to the performance gain
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.