
arXiv: 2605.18271 · v1 · pith:UBN77HXPnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

Pith reviewed 2026-05-20 11:51 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords on-device RAG · preference alignment · memory efficiency · personal AI agents · index construction · retrieval latency · context management · LLM agents

The pith

EPIC builds preference-focused indexes that cut on-device RAG indexing memory by a factor of 2,404 while raising preference-following accuracy by about 20 percentage points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EPIC to address memory limits when running personal AI agents directly on devices. It treats user preferences as a compact signal extracted from raw personal data and applies this signal to both what gets stored and how retrieval works. The result is far less memory, faster lookups, and stronger alignment with what the user actually wants in conversations, recommendations, and similar tasks. A reader would care because this approach could make private, responsive on-device agents practical under tight hardware constraints without sending data to the cloud.

Core claim

EPIC constructs an index by selectively retaining only preference-relevant portions of raw personal data and aligns the retrieval step to favor contexts that match those preferences, producing an index that occupies orders of magnitude less memory yet delivers higher preference-following accuracy and lower latency than volume-based baselines.

What carries the argument

EPIC, the preference-aligned index construction process that extracts user preferences from raw data and uses them to guide both selective retention during indexing and preference-directed ranking during retrieval.
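
The selective-retention idea can be illustrated with a minimal sketch. It assumes documents and preferences are already embedded as plain float vectors and keeps a document when it is similar enough to at least one preference. The function names, the toy vectors, and the `threshold` value are hypothetical and are not taken from the paper, whose encoder and cutoff are not stated on this page.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retain_preference_relevant(doc_vectors, pref_vectors, threshold=0.5):
    """Return indices of documents that match at least one preference.

    `threshold` is a hypothetical cutoff, not a value from the paper.
    """
    kept = []
    for i, doc in enumerate(doc_vectors):
        best = max((cosine(doc, p) for p in pref_vectors), default=0.0)
        if best >= threshold:
            kept.append(i)
    return kept


# Toy 2-D embeddings, for illustration only.
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
prefs = [[1.0, 0.1]]
kept = retain_preference_relevant(docs, prefs, threshold=0.6)
print(kept)
```

Only the retained indices would need to be stored in the index, which is where the memory saving in such a scheme would come from.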

If this is right

  • Indexing memory drops by a factor of 2,404 relative to the strongest baseline while staying under 1 MB.
  • Preference-following accuracy rises by 20.17 percentage points across conversation, debate, explanation, and recommendation benchmarks.
  • Retrieval latency falls by a factor of 33.33, reaching 29.35 ms per query on device.
  • Streaming updates remain feasible without exceeding the same tight memory budget.
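
To put the reported factors on a more familiar scale, the short sketch below converts an "N times lower" factor into a percentage reduction using 1 - 1/N. The helper name is hypothetical. The two factors are the ones reported above, and the arithmetic is independent of how the paper measured them.

```python
def percent_reduction(factor):
    """Convert an 'N times smaller' factor into a percentage reduction."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (1.0 - 1.0 / factor) * 100.0


memory_cut = percent_reduction(2404)    # reported indexing-memory factor
latency_cut = percent_reduction(33.33)  # reported retrieval-latency factor
print(f"memory: {memory_cut:.2f}% smaller, latency: {latency_cut:.2f}% lower")
```

Under this conversion, the memory factor corresponds to a reduction of roughly 99.96% and the latency factor to roughly 97%.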

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preference-compression logic could support multi-year personal histories on phones or watches without linear memory growth.
  • If preferences prove stable across time, re-indexing frequency could drop, lowering long-term compute cost on the device.
  • Extending the alignment step to handle evolving or conflicting preferences would be a direct next test of robustness.

Load-bearing premise

User preferences form a compact, stable, and reliably extractable signal from raw personal data that can guide indexing and retrieval without losing essential context.

What would settle it

A set of real-user queries whose correct answers depend on specific non-preference facts from the raw data, where EPIC retrieves lower-accuracy or irrelevant passages compared with a full-volume baseline.
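
A check of this kind could be scored with a small harness like the sketch below. It assumes each system has already been judged correct or incorrect on the same set of queries. The function names and the outcome lists are made up for illustration and do not come from the paper.

```python
def accuracy(results):
    """Fraction of queries answered correctly; `results` is a list of bools."""
    return sum(results) / len(results) if results else 0.0


def accuracy_gap(epic_results, baseline_results):
    """EPIC accuracy minus full-volume accuracy, in percentage points."""
    if len(epic_results) != len(baseline_results):
        raise ValueError("both systems must be scored on the same queries")
    return (accuracy(epic_results) - accuracy(baseline_results)) * 100.0


# Made-up outcomes on four non-preference fact queries, for illustration only.
epic = [True, False, False, True]
full = [True, True, False, True]
gap = accuracy_gap(epic, full)
print(gap)
```

A consistently negative gap on such queries would be the kind of evidence described above; a gap near zero would support the premise that discarding non-preference content is safe.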

Figures

Figures reproduced from arXiv: 2605.18271 by Changmin Lee, Jaemin Kim, Taesik Gong.

Figure 1
Figure 1: Prior Method indiscriminately stores raw data, which is infeasible under tight on-device memory budgets and can yield preference-misaligned responses (left). EPIC instead retains only preference-relevant data with aligned instructions, enabling efficient retrieval and preference-aligned responses (right). Example from the PrefWiki dataset.
Figure 2
Figure 2: Overview of EPIC's pipeline. (i) Semantic-Based Coarse Filtering (Sec. 3.1): documents from a large corpus are first encoded and compared with user preference embeddings; only those with at least one preference-aligned match pass this stage. (ii) Preference-Aligned Fine Verification (Sec. 3.2): the Decision Module verifies textual alignment and discards unrelated documents, while the Instruction Generator …
Figure 3
Figure 3: Efficiency comparison across baselines. We report on-disk memory usage, end-to-end retrieval latency, and indexing latency (detailed results in Appendix B.6). Numbers in parentheses represent the specific values on the x-axis for each method. … though query steering adds a small constant overhead, retrieval remains a single FAISS kNN search over a much smaller index. This yields consistently lower latency t…
Figure 4
Figure 4: On-device streaming data setup with random preference drift. On Jetson Orin Nano 8GB using PrefWiki, EPIC maintains higher preference-following accuracy while keeping memory nearly constant, compared to the lightweight Contriever. This indicates that instruction-centric memory construction both strengthens preference alignment and replaces bulky raw items with compact, preference-aware representations. …
Figure 6
Figure 6: Preference change events during streaming (examples).
Figure 5
Figure 5: Streaming on-device evaluation platform. NVIDIA Jetson Orin Nano 8GB used for the streaming on-device experiments.
original abstract

With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature of real-world requests, such agents must ground their generation in device-resident personal context. However, under tight memory budgets, the core bottleneck is what to store so that retrieval remains aligned with the user. We propose EPIC (Efficient Preference-aligned Index Construction), which focuses on user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline. EPIC selectively retains preference-relevant information from raw data and aligns retrieval toward preference-aligned contexts. Across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by 2,404 times, improves preference-following accuracy by 20.17 percentage points, and achieves 33.33 times lower retrieval latency over the best-performing baseline. In our on-device experiment, EPIC maintains a memory footprint under 1 MB with 29.35 ms/query latency in streaming updates.
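
The streaming claim implies some mechanism that keeps the index within a fixed budget as new data arrives. The sketch below shows one generic way to do that, evicting the oldest entries once a byte budget is exceeded. This is an assumption made for illustration, not EPIC's update rule, which is not described on this page. The class name, the oldest-first eviction policy, and the reading of 1 MB as 1,000,000 bytes are all hypothetical.

```python
from collections import OrderedDict


class BudgetedIndex:
    """Toy store that evicts the oldest entries to stay under a byte budget."""

    def __init__(self, budget_bytes=1_000_000):
        self.budget_bytes = budget_bytes
        self.entries = OrderedDict()  # key -> UTF-8 encoded text
        self.used_bytes = 0

    def add(self, key, text):
        payload = text.encode("utf-8")
        if len(payload) > self.budget_bytes:
            raise ValueError("entry larger than the whole budget")
        # Replacing an existing key first releases its old bytes.
        if key in self.entries:
            self.used_bytes -= len(self.entries.pop(key))
        self.entries[key] = payload
        self.used_bytes += len(payload)
        # Evict oldest entries until the budget holds again.
        while self.used_bytes > self.budget_bytes:
            _, old = self.entries.popitem(last=False)
            self.used_bytes -= len(old)


# Tiny budget so the eviction is visible.
index = BudgetedIndex(budget_bytes=10)
index.add("a", "12345")
index.add("b", "1234")
index.add("c", "123")
print(list(index.entries), index.used_bytes)
```

A real system would more likely evict by preference relevance than by age; the point here is only that a hard budget keeps memory constant regardless of stream length.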

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes EPIC (Efficient Preference-aligned Index Construction), a technique that extracts user preferences from raw personal data and integrates them throughout the RAG pipeline to enable extreme memory compression for on-device personal AI agents. It evaluates the approach on four benchmarks spanning conversations, debates, explanations, and recommendations, plus an on-device streaming test, claiming a 2,404× reduction in indexing memory, a 20.17 percentage point gain in preference-following accuracy, and 33.33× lower retrieval latency relative to the strongest baseline, all while staying under 1 MB memory with 29.35 ms/query latency.

Significance. If the central claims are substantiated, the work would be significant for on-device LLM deployment, as it directly targets the memory and latency bottlenecks that currently limit privacy-preserving personal context handling. The reported compression ratios and accuracy improvements, if shown to generalize beyond the chosen benchmarks and to preserve necessary context, could influence practical system design for personal agents. The emphasis on preference signals as a compact, stable form of context is a plausible direction, though its robustness remains to be demonstrated.

major comments (3)
  1. [§3] §3 (Method), preference extraction subsection: the paper provides no concrete description of the preference extraction procedure (model, prompting strategy, or heuristics), nor any quantitative extraction-fidelity metrics or error analysis. Because the 2,404× memory reduction and the accuracy gains rest on the assumption that only preference-relevant information is retained without discarding query-critical context, this omission is load-bearing for the central claims.
  2. [§4.1] §4.1 (Benchmark results): no ablation is reported that compares EPIC against a full-context retrieval baseline or that measures performance degradation when preference extraction discards non-preference context. Without such controls, it is impossible to determine whether the +20.17 pp accuracy improvement reflects genuine preference alignment or properties of the benchmark construction.
  3. [§4.2] §4.2 (On-device experiment): the streaming-update results (<1 MB memory, 29.35 ms/query) are presented without details on incremental index maintenance, stability of the extracted preferences over time, or failure cases when new personal data arrives. These elements are required to support the on-device applicability claim.
minor comments (2)
  1. [Table 2] Table 2: the baseline implementations are not described in sufficient detail (e.g., exact embedding model, chunking strategy, or retrieval hyperparameters), hindering reproducibility.
  2. [Figure 3] Figure 3: axis labels and legend entries are too small to read comfortably; consider enlarging or adding a supplementary high-resolution version.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for clarification and additional analysis that strengthen the presentation of EPIC. We address each major comment below and have revised the manuscript to incorporate the requested details.

point-by-point responses
  1. Referee: [§3] §3 (Method), preference extraction subsection: the paper provides no concrete description of the preference extraction procedure (model, prompting strategy, or heuristics), nor any quantitative extraction-fidelity metrics or error analysis. Because the 2,404× memory reduction and the accuracy gains rest on the assumption that only preference-relevant information is retained without discarding query-critical context, this omission is load-bearing for the central claims.

    Authors: We agree that the preference extraction procedure requires a more explicit description to support the central claims. In the revised §3, we now detail the extraction model (a lightweight fine-tuned LLM), the prompting strategy with few-shot examples, and the heuristics for filtering preference-relevant spans. We also add quantitative extraction-fidelity metrics (precision/recall against human-annotated preferences) and an error analysis showing that discarded non-preference context does not degrade downstream query performance on the evaluated benchmarks. revision: yes

  2. Referee: [§4.1] §4.1 (Benchmark results): no ablation is reported that compares EPIC against a full-context retrieval baseline or that measures performance degradation when preference extraction discards non-preference context. Without such controls, it is impossible to determine whether the +20.17 pp accuracy improvement reflects genuine preference alignment or properties of the benchmark construction.

    Authors: We acknowledge the value of these controls. The revised §4.1 now includes an ablation comparing EPIC to a full-context retrieval baseline (using the same retriever but without preference filtering) and a controlled degradation study that systematically removes non-preference context. Results confirm that the 20.17 pp gain arises from preference alignment rather than benchmark artifacts, with only marginal degradation when non-preference context is discarded. revision: yes

  3. Referee: [§4.2] §4.2 (On-device experiment): the streaming-update results (<1 MB memory, 29.35 ms/query) are presented without details on incremental index maintenance, stability of the extracted preferences over time, or failure cases when new personal data arrives. These elements are required to support the on-device applicability claim.

    Authors: We have expanded §4.2 with the requested details. The revision describes the incremental index maintenance algorithm (delta updates to the preference-aligned index), reports stability metrics for extracted preferences across streaming sessions (low drift over 100+ updates), and includes a failure-case analysis for scenarios where new data conflicts with prior preferences, along with mitigation strategies that keep memory and latency within the reported bounds. revision: yes
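
The extraction-fidelity metrics mentioned in the first response can be illustrated with a minimal sketch. It assumes extracted and human-annotated preferences are compared by exact string match, which is a simplification; the matching rule used in the revision is not stated here. The function name and the example preference strings are hypothetical.

```python
def precision_recall(extracted, annotated):
    """Precision and recall of extracted preferences against annotations."""
    extracted, annotated = set(extracted), set(annotated)
    true_pos = len(extracted & annotated)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(annotated) if annotated else 0.0
    return precision, recall


# Made-up preference strings, for illustration only.
extracted = {"avoids electric vehicles", "prefers vegetarian meals", "likes jazz"}
annotated = {
    "avoids electric vehicles",
    "prefers vegetarian meals",
    "loves Italian food",
    "dislikes crowds",
}
p, r = precision_recall(extracted, annotated)
print(p, r)
```

Low recall would mean real preferences are being missed at extraction time, which is the failure mode the referee's first comment is concerned with.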

Circularity Check

0 steps flagged

No circularity: EPIC results are empirical benchmark comparisons with no derivation chain reducing to fitted inputs or self-definitions.

full rationale

The paper proposes EPIC for preference-aligned memory construction in on-device RAG and reports gains (2,404× memory reduction, +20.17 pp accuracy, 33.33× lower latency) from direct comparisons against baselines on four benchmarks. No equations, first-principles derivations, or predictions are presented that could reduce by construction to parameters fitted inside the paper itself. The method description focuses on selective retention and alignment steps whose outputs are measured externally rather than defined tautologically. Self-citations, if present, are not load-bearing for the core empirical claims, which remain falsifiable against independent benchmarks and do not invoke uniqueness theorems or ansatzes that collapse into prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the premise that preferences can be extracted as a stable signal and that the four benchmarks adequately represent real personal queries. No explicit free parameters or invented entities are named in the abstract; the method itself is the primary addition.

axioms (1)
  • domain assumption: User preferences constitute a compact and stable form of personal context that can be reliably extracted from raw data.
    Stated in the abstract as the core focus of EPIC; if false, selective retention would discard necessary context.
invented entities (1)
  • EPIC (Efficient Preference-aligned Index Construction) · no independent evidence
    purpose: A pipeline that selectively retains preference-relevant information and aligns retrieval toward preference-aligned contexts.
    New system introduced to solve the memory bottleneck; no independent evidence outside the reported experiments is provided in the abstract.



