Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Haomin Zhuang; Kehan Guo; Nitesh V. Chawla; Nuno Moniz; Pin-Yu Chen; Tian Gao; Xiangliang Zhang; Xiangqi Wang; Yue Huang; Yujun Zhou

arxiv: 2606.13174 · v1 · pith:VBXCPJ5Jnew · submitted 2026-06-11 · 💻 cs.LG · cs.CL

Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang

Pith reviewed 2026-06-27 07:03 UTC · model grok-4.3

keywords: LLM agents · runtime enforcement · user corrections · preference compliance · coding agents · TRACE · memory baselines

The pith

TRACE mines user corrections and compiles them into runtime checks that coding agents must satisfy before finishing tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Interactive LLM coding agents often violate the same user preferences across sessions even after explicit corrections. The paper introduces TRACE, a pipeline that extracts corrections from chat, rewrites them as atomic rules, and inserts them as mandatory runtime checks. On ClawArena tasks this drops held-out preference violations from 100% to 37.6% in-distribution and to 2% out-of-distribution. On MemoryArena-derived tasks it reduces in-distribution violations to 60.5% while matching or beating memory baselines on task success. The approach therefore supplies a mechanism for agents to accumulate and enforce user-specific preferences without repeated restatement.

Core claim

TRACE acquires rules from user chat corrections at test time, rewrites them as atomic enforceable statements, and compiles them into runtime checks that must pass before an agent completes future coding tasks, producing large measured reductions in preference violations on both in- and out-of-distribution benchmarks.

What carries the argument

The TRACE pipeline that mines corrections, rewrites them as atomic rules, and compiles them into pre-completion runtime enforcement checks.
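
To make the enforcement idea concrete, here is a minimal Python sketch of a pre-completion gate. It is an illustration under assumptions, not the authors' implementation: the names `Rule`, `compile_rule`, and `try_finish`, the dictionary-shaped task output, and the example rule itself are all hypothetical, and the mining and rewriting steps are skipped entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical shape of an agent's proposed result: file path -> file contents.
TaskOutput = Dict[str, str]


@dataclass(frozen=True)
class Rule:
    """One atomic rule plus the check compiled from it (names are assumptions)."""
    text: str                            # the rewritten, atomic statement
    check: Callable[[TaskOutput], bool]  # returns True when the rule is satisfied


def compile_rule(text: str, check: Callable[[TaskOutput], bool]) -> Rule:
    # In this sketch "compilation" only pairs the rule text with a predicate.
    return Rule(text=text, check=check)


def try_finish(output: TaskOutput, rules: List[Rule]) -> List[str]:
    """Return the texts of violated rules; an empty list means the task may finish."""
    return [rule.text for rule in rules if not rule.check(output)]


# Illustrative rule, invented for this example: Python files must not call print().
no_print = compile_rule(
    "Do not use print() in Python files",
    lambda out: all(
        "print(" not in src for path, src in out.items() if path.endswith(".py")
    ),
)

draft = {"app.py": "print('debug')\n"}
fixed = {"app.py": "import logging\nlogging.info('debug')\n"}

blocked = try_finish(draft, [no_print])  # rule violated, completion is blocked
allowed = try_finish(fixed, [no_print])  # no violations, completion may proceed
```

The gate returns the violated rules instead of raising, so a runtime could hand them back to the agent for another attempt before the task is allowed to complete.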

If this is right

Agents can enforce user preferences via compiled checks rather than memory retrieval alone.
Out-of-distribution violation rates can fall from 100% to 2% on ClawArena-style tasks.
In-distribution violation rates can fall from 100% to 60.5% on memory-intensive tasks while preserving task-pass performance.
Users can avoid restating the same correction in every new session.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mining-and-compilation pattern could extend to non-coding agents if corrections can be expressed as checkable conditions.
A growing per-user rule set might eventually require conflict detection when new corrections contradict earlier ones.
If rule extraction accuracy is high, the method could reduce the total number of user interventions needed over long interaction histories.

Load-bearing premise

User corrections can be automatically mined and rewritten as atomic rules that stay faithful to original intent and generalize to future tasks without false positives or missed violations.

What would settle it

A held-out test set in which mined rules are compiled and applied yet preference-violation rates remain near 100% on both in- and out-of-distribution tasks.

Original abstract

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

TRACE cuts simulated violations by compiling mined corrections into runtime checks, but the gains rest on unvalidated rule extraction from simulated users.

The main point for you is that this paper shows a practical way to turn chat corrections into enforceable runtime rules for coding agents, with reported drops in preference violations on held-out tasks that beat memory baselines. The TRACE pipeline mines corrections, rewrites them as atomic rules, and compiles them into checks that run before task completion.

What is new is the end-to-end skill layer that goes beyond storing preferences in memory. Mem0 still left 57.5% of checks violated in their tests, while TRACE brings in-distribution violations down to 37.6% and out-of-distribution to 2% on ClawArena, and to 60.5% on the memory-derived tasks, while holding or improving task pass rates. Public code links and the deployable skill are a plus for anyone who wants to try it.

The soft spots are around the rule mining step. The results come from simulated user-in-the-loop runs, with no numbers on extraction accuracy, how often the rewritten rules match the original intent, or false-positive rates on held-out checks. The central assumption that automatic rewriting produces faithful, generalizing rules without blocking good actions is not directly tested, so the violation reductions could partly reflect simulation choices rather than robust enforcement.

This paper is for people building or evaluating LLM coding agents who deal with repeated user friction. A reader focused on agent reliability or interactive tools will find usable ideas and clear baselines. It deserves a serious referee because the problem is real, the experiments are on held-out data, and the code is available, even though real-user validation and rule-quality metrics would strengthen it.

I would send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces TRACE, a pipeline that mines user corrections from coding-agent interactions, rewrites them as atomic rules, and compiles them into runtime enforcement checks. It claims this addresses repeated preference violations that memory systems like Mem0 fail to prevent. Simulated user-in-the-loop experiments on ClawArena show violation reductions from 100% to 37.6% (in-distribution) and from 100% to 2.0% (out-of-distribution). On MemoryArena-derived tasks the in-distribution reduction is smaller, from 100% to 60.5%, while task pass matches or exceeds the strongest memory baseline.

Significance. If the automatic mining and compilation process proves reliable, the approach could meaningfully reduce repeated user friction in interactive LLM agents by converting one-off corrections into persistent, enforceable checks. The public experiment code and deployable skill links are a clear strength for reproducibility.

major comments (2)

[Evaluation] Evaluation (simulated experiments on ClawArena and MemoryArena): the reported violation reductions (e.g., 100.0% to 2.0% OOD) are presented without any metrics on rule-extraction accuracy, fidelity of the mined rules to the original user corrections, or false-positive rates on held-out checks. This is load-bearing for the central claim that automatic compilation, rather than simulation choices or oracle rules, drives the gains.
[§4] §4 and abstract: no human validation or comparison to manually authored rules is reported, leaving open whether the automatic rewriting step preserves intent or introduces blocking false positives that would undermine real deployment.

minor comments (1)

[Methods] Clarify in the methods how 'preference checks' and 'violations' are operationalized on held-out data, including any inter-annotator or automated judgment protocol.
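
One plausible operationalization, consistent with the abstract's phrase "applicable preference checks" but not confirmed by the paper, is to score each held-out task against each preference, skip preferences that do not apply, and report the share of applicable checks that fail. The Python sketch below assumes that reading. `CheckResult` and `violation_rate` are hypothetical names, and the records are toy values for illustration, not data from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CheckResult:
    applicable: bool  # does the preference apply to this task at all?
    satisfied: bool   # did the agent's output comply? (ignored if not applicable)


def violation_rate(results: Iterable[CheckResult]) -> float:
    """Percentage of applicable checks that were violated (assumed definition)."""
    applicable = [r for r in results if r.applicable]
    if not applicable:
        raise ValueError("no applicable checks to score")
    violated = sum(1 for r in applicable if not r.satisfied)
    return 100.0 * violated / len(applicable)


# Toy records for illustration only.
toy = [
    CheckResult(applicable=True, satisfied=False),
    CheckResult(applicable=True, satisfied=True),
    CheckResult(applicable=True, satisfied=True),
    CheckResult(applicable=True, satisfied=True),
    CheckResult(applicable=False, satisfied=False),  # excluded from the denominator
]
rate = violation_rate(toy)  # one violated check out of four applicable ones
```

Under this definition, a baseline that violates every applicable check scores 100%, which is how the 100.0% starting points in the reported results would read. Whether the judgment of "satisfied" is automated or human is exactly what the comment above asks the authors to state.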

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation and validation approach. We address each point below and will revise the manuscript accordingly where feasible.

Referee: [Evaluation] Evaluation (simulated experiments on ClawArena and MemoryArena): the reported violation reductions (e.g., 100.0% to 2.0% OOD) are presented without any metrics on rule-extraction accuracy, fidelity of the mined rules to the original user corrections, or false-positive rates on held-out checks. This is load-bearing for the central claim that automatic compilation, rather than simulation choices or oracle rules, drives the gains.

Authors: We agree that explicit metrics on rule-extraction accuracy and fidelity would strengthen the central claim. The current evaluation measures end-to-end violation reduction on held-out tasks after mining and compilation, with the simulation designed to isolate the effect of the compiled rules. In revision we will add a quantitative fidelity analysis (e.g., semantic similarity scores between mined rules and original corrections on a sampled subset) and report false-positive rates by measuring unnecessary blocks on tasks where no applicable correction exists. These additions will clarify that gains derive from the automatic pipeline rather than simulation artifacts.

revision: yes

Referee: [§4] §4 and abstract: no human validation or comparison to manually authored rules is reported, leaving open whether the automatic rewriting step preserves intent or introduces blocking false positives that would undermine real deployment.

Authors: The experiments use simulated user-in-the-loop interactions on tasks derived from anonymized real-user friction cases, as stated in the paper. We acknowledge that direct human validation and side-by-side comparison to manually authored rules would better support deployment claims. In the revision we will add a discussion of potential false positives in the rewriting step together with a small-scale manual inspection of mined-rule fidelity on a subset of examples. Full human-in-the-loop studies remain future work.

revision: partial
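
The false-positive measurement promised in the rebuttal, "unnecessary blocks on tasks where no applicable correction exists", can be sketched in a few lines of Python. This is only one way to read that sentence: `TaskRun` and `false_positive_block_rate` are hypothetical names, the labels for which tasks have an applicable correction are assumed to be given, and the runs are toy values rather than results.

```python
from typing import List, NamedTuple


class TaskRun(NamedTuple):
    has_applicable_correction: bool  # does any mined correction apply to the task?
    blocked: bool                    # did a compiled check stop completion?


def false_positive_block_rate(runs: List[TaskRun]) -> float:
    """Share of tasks with no applicable correction that were blocked anyway."""
    negatives = [r for r in runs if not r.has_applicable_correction]
    if not negatives:
        raise ValueError("need at least one task with no applicable correction")
    return sum(r.blocked for r in negatives) / len(negatives)


# Toy runs for illustration only.
toy_runs = [
    TaskRun(has_applicable_correction=False, blocked=False),
    TaskRun(has_applicable_correction=False, blocked=True),   # an unnecessary block
    TaskRun(has_applicable_correction=False, blocked=False),
    TaskRun(has_applicable_correction=False, blocked=False),
    TaskRun(has_applicable_correction=True, blocked=True),    # not counted here
]
fp_rate = false_positive_block_rate(toy_runs)  # one of four negative tasks blocked
```

Reporting this number next to the violation rate would show whether the compiled checks reduce violations by enforcing the right rules or simply by blocking more often.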

Circularity Check

0 steps flagged

No circularity: empirical results on held-out tasks with no equations or self-referential reductions

The paper describes an empirical pipeline (TRACE) for mining corrections into rules and evaluates it via simulated user-in-the-loop experiments on ClawArena and MemoryArena-derived tasks. Reported metrics (e.g., violation reductions from 100% to 37.6%/2.0%) are direct measurements on held-out preference checks, not quantities derived from the same data by construction. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim rests on external benchmarks rather than reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that corrections yield reliable atomic rules; no explicit free parameters or new physical entities are described in the abstract.

axioms (1)

domain assumption: User corrections contain extractable atomic preferences that can be formalized as runtime checks without loss of intent.
This premise enables the mining and compilation steps of TRACE.

invented entities (1)

TRACE skills · no independent evidence
purpose: Compiled runtime enforcement mechanisms derived from user corrections
New construct introduced to provide the enforcement layer on top of existing agent runtimes.

