Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Bowen Jin; Chengwei Qin; Haodong Yue; Haozhen Zhang; Jianzhu Bao; Jiaxuan You; Quanyu Long; Tao Feng; Weizhi Zhang; Wenya Wang; Xiao Li

arXiv: 2602.06025 · v2 · pith:AGHHOCHFnew · submitted 2026-02-05 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-21 13:41 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords LLM agents · runtime memory · budget-tier routing · reinforcement learning · performance-cost trade-off · query-aware · memory modules

The pith

BudgetMem uses a query-aware RL router to assign low, mid, or high budget tiers to memory modules for explicit performance-cost control in LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BudgetMem to manage memory for LLM agents that need information beyond a single context window. It organizes memory processing into modules, each available in low, mid, or high budget tiers, and trains a compact neural policy with reinforcement learning to select a tier for each module given the query. This gives direct control over the trade-off between accuracy and memory-construction cost, instead of relying on fixed offline memory that wastes resources or drops query-critical details. Experiments on LoCoMo, LongMemEval, and HotpotQA show that the routed system outperforms strong baselines when performance is prioritized (the high-budget setting) and traces better accuracy-cost frontiers when budgets are tight. The analysis also separates the strengths of the three ways of creating tiers: method complexity, inference behavior, and model capacity.

Core claim

By structuring memory as a set of modules each offered in Low/Mid/High budget tiers and using a lightweight router implemented as a compact neural policy trained with reinforcement learning to perform query-aware budget-tier routing, BudgetMem achieves explicit control over the performance-memory cost trade-off and surpasses strong baselines in high-budget settings while delivering better accuracy-cost frontiers under tighter budgets.

What carries the argument

The lightweight router, a compact neural policy trained with reinforcement learning, that assigns budget tiers to memory modules based on the input query.
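
The routing step can be pictured with a small sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the module names follow the pipeline in the Figure 1 caption, while the hashed bag-of-words featurizer, the linear softmax policy, the `TierRouter` class, and every parameter value are hypothetical. The weights are random here, so the selected tiers carry no meaning until the policy is trained.

```python
import math
import random

TIERS = ("low", "mid", "high")
# Module names taken from the Figure 1 caption; everything else is assumed.
MODULES = ("filter", "entity", "temporal", "topic", "summary")


def featurize(query, dim=16):
    """Hypothetical featurizer: hash each token into one of `dim` buckets."""
    vec = [0.0] * dim
    for token in query.lower().split():
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    return vec


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TierRouter:
    """Assumed policy shape: one linear softmax head per module."""

    def __init__(self, dim=16, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.weights = {
            module: [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in TIERS]
            for module in MODULES
        }

    def tier_probs(self, module, features):
        logits = [
            sum(w * x for w, x in zip(row, features))
            for row in self.weights[module]
        ]
        return softmax(logits)

    def route(self, query, rng=None):
        """Return a {module: tier} plan; greedy unless an rng is given."""
        features = featurize(query, self.dim)
        plan = {}
        for module in MODULES:
            probs = self.tier_probs(module, features)
            if rng is None:
                idx = max(range(len(TIERS)), key=probs.__getitem__)
            else:
                idx = rng.choices(range(len(TIERS)), weights=probs)[0]
            plan[module] = TIERS[idx]
        return plan


router = TierRouter()
plan = router.route("When did Alice move to Berlin?")
```

Passing a `random.Random` instance to `route` samples tiers from the policy instead of taking the most likely one, which is the mode a reinforcement-learning trainer would typically use for exploration.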

If this is right

Surpasses strong baselines when performance is prioritized in high-budget settings.
Delivers better accuracy-cost frontiers under tighter budgets.
Disentangles the strengths and weaknesses of implementation, reasoning, and capacity strategies for realizing budget tiers under varying budget regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The routing mechanism could extend to dynamic allocation of other resources such as compute or tool calls in agent systems.
Hybrid tiering that combines multiple realization axes might produce further improved trade-offs.
Deployment on agent tasks with highly variable query complexity would test whether the learned policy generalizes beyond the evaluated benchmarks.

Load-bearing premise

The compact neural policy trained with reinforcement learning can learn budget-tier assignments that improve the performance-cost trade-off without adding substantial overhead.
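
To make the premise concrete, the sketch below shows how a cost-weighted reward can steer a tier policy. The page confirms only that the router is trained with reinforcement learning and that a cost weight λ is varied (Figure 2 caption); the reward shape `score - lam * cost`, the per-tier scores and costs, and the exact-gradient update (a noise-free stand-in for a sampled policy-gradient method) are all assumptions for illustration.

```python
import math

TIERS = ("low", "mid", "high")

# Hypothetical per-tier task scores and normalized costs for one module.
# Illustrative numbers only; they are not taken from the paper.
SCORE = {"low": 0.50, "mid": 0.62, "high": 0.70}
COST = {"low": 0.10, "mid": 0.40, "high": 1.00}


def reward(tier, lam):
    """Assumed reward shape: task score minus lambda-weighted cost."""
    return SCORE[tier] - lam * COST[tier]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(lam, steps=500, lr=0.5):
    """Gradient ascent on expected reward for a 3-way softmax policy."""
    logits = [0.0, 0.0, 0.0]
    rewards = [reward(tier, lam) for tier in TIERS]
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # d E[r] / d logit_k = p_k * (r_k - E[r])
        logits = [
            z + lr * p * (r - baseline)
            for z, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


def preferred_tier(lam):
    probs = train_policy(lam)
    return TIERS[max(range(len(TIERS)), key=probs.__getitem__)]
```

With these made-up numbers, `preferred_tier(0.0)` favors the high tier, `preferred_tier(0.2)` the mid tier, and `preferred_tier(1.0)` the low tier, which mirrors the idea that raising λ pushes the policy toward cheaper tiers.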

What would settle it

If the routed BudgetMem system produces accuracy-cost curves that fail to dominate fixed-tier or non-routed baselines on LoCoMo, LongMemEval, or HotpotQA, the benefit of query-aware routing would be falsified.
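
One way to operationalize this test is a pointwise dominance check over (cost, accuracy) pairs. The sketch below is an assumed formalization, not a procedure described in the paper: the function names are hypothetical, and any points passed in would have to come from actual measurements.

```python
def dominates(a, b):
    """True if a=(cost, accuracy) is no worse than b on both axes
    and strictly better on at least one."""
    cost_a, acc_a = a
    cost_b, acc_b = b
    no_worse = cost_a <= cost_b and acc_a >= acc_b
    strictly_better = cost_a < cost_b or acc_a > acc_b
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates, sorted by cost."""
    return sorted(
        p for p in points if not any(dominates(q, p) for q in points)
    )


def undominated_baselines(routed, baselines):
    """Baseline points that no routed point dominates.
    A non-empty result would count against the routing claim."""
    return [
        b for b in baselines if not any(dominates(r, b) for r in routed)
    ]
```

A stricter variant could compare interpolated frontiers at matched cost levels instead of individual points; the pointwise version is shown because it needs no assumption about how accuracy behaves between measured budgets.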

Figures

Figures reproduced from arXiv: 2602.06025 by Bowen Jin, Chengwei Qin, Haodong Yue, Haozhen Zhang, Jianzhu Bao, Jiaxuan You, Quanyu Long, Tao Feng, Weizhi Zhang, Wenya Wang, Xiao Li.

**Figure 1.** BudgetMem overview. Given a user query q, we retrieve raw chunks Cq from a chunked history (without offline memory preprocessing) and process them with a modular pipeline (filter → entity/temporal/topic → summary). Each module exposes LOW/MID/HIGH budget tiers instantiated by one of three strategies (implementation, reasoning, capacity). A shared lightweight router selects tiers module-wise based on the q…

**Figure 2.** Performance–cost trade-offs across tiering strategies on LoCoMo. By varying the cost weight λ, BudgetMem traces smooth, controllable frontiers that shift toward higher performance as budget increases, and envelop baselines in both low- and high-cost regimes.

**Figure 3.** Ablation of reward-scale alignment under capacity tiering strategy on LoCoMo.

**Figure 4.** Budget-tier selection ratios. Module-wise LOW/MID/HIGH routing ratios on LongMemEval under varying cost weights λ using the capacity tiering strategy.

**Figure 5.** Retrieval-size sensitivity on LoCoMo. Cost and Judge versus the number of retrieved raw chunks, evaluated under all three tiering strategies.

Original abstract

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces BudgetMem, a runtime agent memory framework that decomposes memory processing into modules each available in Low/Mid/High budget tiers. A compact neural policy router, trained via reinforcement learning, performs query-aware tier assignment to explicitly trade off task performance against memory-construction cost. The framework is used as a testbed to compare three complementary tier-realization strategies (implementation complexity, inference-time reasoning behavior, and module model capacity). Experiments on LoCoMo, LongMemEval, and HotpotQA report that BudgetMem outperforms strong baselines in high-budget regimes and yields improved accuracy-cost frontiers under tighter budgets.

Significance. If the reported frontiers remain superior after router overhead is properly internalized, the work supplies a practical, controllable mechanism for runtime memory budgeting in long-context LLM agents and a reusable testbed for dissecting tiering axes. The disentanglement of implementation, reasoning, and capacity strategies under varying budget regimes could inform future system design.

major comments (1)

[Experimental Setup and Results] The central accuracy-cost frontier claims rest on the router producing net-positive tier assignments, yet the evaluation does not isolate or subtract the per-query inference cost of the lightweight router (nor the amortized RL training cost) from the reported memory-construction budgets. Under tight budgets this overhead may constitute a non-negligible fraction of the allowed cost; if the metrics treat router cost as external, the attributed gains cannot be unambiguously credited to query-aware routing rather than to the underlying tier implementations themselves.

minor comments (2)

[Abstract] The abstract states improved frontiers and benchmark results but supplies no quantitative numbers, error bars, or ablation details; a short table of headline deltas would improve readability.
[Method] Notation for the three tiering strategies (implementation, reasoning, capacity) is introduced without an explicit mapping to the concrete module variants used in each experiment; a small summary table would clarify the correspondence.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below, providing clarifications and indicating revisions where appropriate to strengthen the evaluation of the accuracy-cost frontiers.

Referee: [Experimental Setup and Results] The central accuracy-cost frontier claims rest on the router producing net-positive tier assignments, yet the evaluation does not isolate or subtract the per-query inference cost of the lightweight router (nor the amortized RL training cost) from the reported memory-construction budgets. Under tight budgets this overhead may constitute a non-negligible fraction of the allowed cost; if the metrics treat router cost as external, the attributed gains cannot be unambiguously credited to query-aware routing rather than to the underlying tier implementations themselves.

Authors: We agree that the router's per-query inference cost should be explicitly internalized for a rigorous assessment of net gains from query-aware routing. In the original experiments, the reported budgets centered on memory module construction costs, with the router (a compact neural policy) treated as a fixed lightweight overhead incurred uniformly per query. However, we acknowledge the referee's point that this could affect tight-budget regimes. To address it, we have added new measurements of the router's inference cost (approximately 0.5-2% of total budget depending on the setting, based on FLOPs and latency benchmarks) and recomputed the accuracy-cost frontiers with this overhead subtracted from the allowed budgets. The revised results confirm that BudgetMem retains superior frontiers over baselines. We have also clarified in the text that amortized RL training costs are offline and not part of runtime per-query budgets. These updates appear in a new paragraph in Section 4.3 and updated Figures 3-5.

revision: yes
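
The accounting described in this response can be sketched in a few lines. The helper names and the choice to model router overhead as a fraction of the total budget are assumptions for illustration; the 0.5-2% range is the figure quoted in the simulated rebuttal, not an independently verified measurement.

```python
def module_budget(total_budget, router_fraction):
    """Budget left for memory modules after reserving the router's share.

    `router_fraction` is the assumed share of the total budget consumed by
    router inference, for example 0.005 to 0.02 for the quoted 0.5-2% range.
    """
    if not 0.0 <= router_fraction < 1.0:
        raise ValueError("router_fraction must be in [0, 1)")
    return total_budget * (1.0 - router_fraction)


def internalize_router_cost(frontier, router_cost):
    """Shift every (cost, score) point right by a fixed per-query router cost."""
    return [(cost + router_cost, score) for cost, score in frontier]
```

Shifting the routed frontier by the router cost while leaving non-routed baselines unchanged gives the comparison the referee asks for: routing gains count only if the shifted frontier still sits above the baselines.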

Circularity Check

0 steps flagged

No circularity: empirical testbed with RL router validated on external benchmarks

The paper presents BudgetMem as an empirical framework for runtime memory with a compact neural policy router trained via reinforcement learning to assign budget tiers. Claims of superior accuracy-cost frontiers rest on experimental comparisons against baselines across LoCoMo, LongMemEval, and HotpotQA rather than any derivation chain. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The router training and tier strategies are positioned as independent design choices whose effectiveness is measured externally, keeping the contribution self-contained without reducing outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the RL policy and three tiering strategies are described at a high level without implementation specifics or independent evidence.

pith-pipeline@v0.9.0 · 5813 in / 1172 out tokens · 58011 ms · 2026-05-21T13:41:17.616570+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
cs.SE · 2026-04 · accept · novelty 5.0

LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
