TAME: A Trustworthy Test-Time Evolution of Agent Memory with Systematic Benchmarking

Huichi Zhou; Jiuan Zhou; Kun Shao; Mingang Chen; Yihang Chen; Yongkang Hu; Yuan Xie; Yu Cheng; Yushuo Zhang; Zhaoxia Yin

arXiv:2602.03224v2 · pith:JQKA46UQ · submitted 2026-02-03 · cs.AI, cs.LG

Yu Cheng, Yongkang Hu, Jiuan Zhou, Yushuo Zhang, Yihang Chen, Huichi Zhou, Mingang Chen, Zhizhong Zhang, Kun Shao, Yuan Xie, Zhaoxia Yin

keywords memory, evolution, agent, tame, task, benchmark, benign, during

abstract

Test-time evolution of agent memory represents a pivotal paradigm for advancing AGI, as it strengthens complex reasoning through experience accumulation without requiring parameter updates. However, even during benign task evolution, agent safety alignment remains vulnerable, a phenomenon known as Agent Memory Misevolution. To evaluate this phenomenon, we construct the Trust-Memevo benchmark and find that agents exhibit an overall decline in trustworthiness across multiple tasks during benign task evolution. To address this issue, we propose TAME, a trust-aware memory evolution framework in which a shared memory bank is jointly governed by an Executor and an Evaluator. The Executor retrieves and applies transferable experiences to support task solving, while the Evaluator assesses the contribution of each utilized experience to the outcome and produces trust-aware feedback to guide subsequent memory use. This executor-evaluator loop enables memory to be selectively reinforced, cautiously reused, and continuously expanded over time. Experiments show that TAME mitigates memory misevolution while achieving strong task performance. In particular, on the GPT-5.2 AIME benchmark, TAME improves accuracy by 14.6 percentage points over the strongest existing method and maintains competitive trustworthiness.
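The executor-evaluator loop described in the abstract can be pictured with a small Python sketch. This is not the authors' implementation: the names `Experience`, `MemoryBank`, `evaluator_feedback`, and `step`, the word-overlap retrieval, the `min_trust` gate, and the learning rate `lr` are all hypothetical choices made for illustration. The abstract also says the Evaluator assesses each experience's individual contribution; this sketch simplifies that to one shared signal per task.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Experience:
    text: str
    trust: float = 0.5  # hypothetical trust score in [0, 1]
    uses: int = 0


@dataclass
class MemoryBank:
    items: list[Experience] = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2, min_trust: float = 0.2) -> list[Experience]:
        """Executor side: rank by word overlap weighted by trust.

        Entries below min_trust are skipped ("cautious reuse").
        """
        q = set(query.lower().split())
        scored = []
        for exp in self.items:
            if exp.trust < min_trust:
                continue
            overlap = len(q & set(exp.text.lower().split()))
            if overlap:
                scored.append((overlap * exp.trust, exp))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [exp for _, exp in scored[:k]]


def evaluator_feedback(used: list[Experience], task_ok: bool, safe: bool, lr: float = 0.3) -> None:
    """Evaluator side: move the trust of each used experience toward 1.0
    when the outcome was both correct and safe, and toward 0.0 otherwise."""
    target = 1.0 if (task_ok and safe) else 0.0
    for exp in used:
        exp.trust += lr * (target - exp.trust)
        exp.uses += 1


def step(bank: MemoryBank, task: str, solve, judge) -> str:
    """One loop iteration: retrieve, solve, evaluate, then expand memory."""
    used = bank.retrieve(task)
    answer = solve(task, used)
    task_ok, safe = judge(task, answer)
    evaluator_feedback(used, task_ok, safe)
    if task_ok and safe:
        # Only outcomes judged correct and safe are written back.
        bank.items.append(Experience(text=f"{task} -> {answer}"))
    return answer
```

In this sketch `solve` and `judge` are caller-supplied callables standing in for the agent and for whatever task and safety checks are available. Weighting retrieval by trust means an experience that repeatedly co-occurs with unsafe outcomes fades from use without being deleted outright.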

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness
cs.AI 2026-06 unverdicted novelty 6.0

Xcientist externalizes research synthesis and validation in AI scientists via contract-governed artifacts to maintain traceable trajectories and avoid claim drift across three domains.

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness
cs.AI 2026-06 conditional novelty 6.0

Xcientist is a research harness that externalizes an AI scientist's literature grounding, idea evolution, experiments, and repairs into auditable artifacts, demonstrated on memory, traffic forecasting, and PDE-solving tasks.

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
cs.SE 2026-04 conditional novelty 5.5

Compact control-oriented strategy genes outperform documentation-heavy skill packages for test-time guidance and iterative experience evolution on scientific coding tasks.

The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems
cs.CR 2026-05 unverdicted novelty 5.0

Memory poisoning via lost-provenance documents in agent memory stores creates agent misconduct that safety systems misattribute to model failure; the paper defines Semantic Norm Drift, releases a benchmark, and propos...

From Procedural Skills to Strategy Genes: Towards Experience-Driven Test-Time Evolution
cs.SE 2026-04 unverdicted novelty 5.0

Compact Gene representations of experience outperform documentation-oriented Skill packages for test-time control and iterative evolution in code-solving tasks, with measured gains on CritPt from 9.1% to 18.57% and 17...