Recognition: 2 Lean theorem links
Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization
Pith reviewed 2026-05-16 12:58 UTC · model grok-4.3
The pith
PURPLE uses contextual bandits with a Plackett-Luce ranking model to select the user records that most improve LLM generation quality, rather than those that are merely semantically relevant.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PURPLE frames user-profile construction as a contextual bandit problem solved with a Plackett-Luce ranking model; the model is trained on the likelihood of generating reference responses so that retrieval selection is aligned directly with downstream generation quality rather than semantic similarity.
What carries the argument
PURPLE's contextual bandit with a Plackett-Luce ranking model, which treats profile construction as order-sensitive selection and uses reference-response likelihood as the reward signal.
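To make the order-sensitive selection concrete, here is a minimal sketch of the Plackett-Luce model: the probability of an ordering factorizes into sequential softmax choices over the records not yet picked. The scores, record count, and function name are illustrative assumptions, not taken from the paper.

```python
# Minimal Plackett-Luce sketch: the probability of a full ordering is a
# product of sequential softmax choices over the remaining candidates.
import torch

def plackett_luce_log_prob(scores: torch.Tensor, ordering: list[int]) -> torch.Tensor:
    """Log-probability of `ordering` under a Plackett-Luce model with `scores`."""
    remaining = list(range(len(scores)))
    log_prob = torch.tensor(0.0)
    for record in ordering:
        probs = torch.softmax(scores[remaining], dim=0)
        log_prob = log_prob + torch.log(probs[remaining.index(record)])
        remaining.remove(record)  # chosen records leave the candidate pool
    return log_prob

scores = torch.tensor([1.2, -0.3, 0.7, 0.1])  # hypothetical utilities for 4 records
print(plackett_luce_log_prob(scores, [2, 0, 3, 1]).item())
```

The sequential factorization is what makes selection order-sensitive: the probability of picking a record depends on which records were already chosen, which is how inter-record dependencies enter the model.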
If this is right
- Profile construction shifts from greedy top-k relevance to order-aware selection that accounts for inter-record dependencies.
- Retrieval decisions become directly optimized for generation quality through the reference-likelihood reward.
- The same bandit formulation scales across multiple personalization tasks without requiring LLM fine-tuning.
Where Pith is reading between the lines
- The approach could be tested on tasks without reference responses by substituting other cheap feedback signals such as user click-through or self-consistency scores.
- Similar utility-versus-similarity gaps may appear in non-personalization retrieval settings such as document summarization or multi-hop QA.
- If reference responses are narrow, the learned policy may overfit to particular answer styles and require periodic retraining on fresh data.
Load-bearing premise
The likelihood of generating the reference response gives an unbiased and sufficiently rich reward that aligns record selection with actual generation quality.
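As a rough illustration of this premise, the reward can be read as the summed token log-probability of the reference response under a causal LM conditioned on the selected profile and query. A minimal sketch, assuming a Hugging Face causal LM and an illustrative prompt format (neither is specified here by the paper):

```python
# Sketch of the reference-likelihood reward: log p(reference | profile, query).
# The model choice ("gpt2") and prompt concatenation are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def reference_log_likelihood(profile: str, query: str, reference: str) -> float:
    """Return log p(reference | profile, query), summed over reference tokens."""
    prompt_ids = tok(f"{profile}\n{query}\n", return_tensors="pt").input_ids
    ref_ids = tok(reference, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, ref_ids], dim=1)
    with torch.no_grad():
        logits = lm(input_ids).logits
    # Logits at position t predict token t+1; score only the reference span.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, -ref_ids.shape[1]:].sum().item()
```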
What would settle it
A controlled experiment on held-out queries: if profiles chosen by PURPLE produce lower reference-response likelihood than profiles chosen by the strongest semantic-relevance baseline, the core claim fails.
read the original abstract
Large language models (LLMs) excel at general-purpose tasks, yet adapting their responses to individual users remains challenging. Retrieval augmentation provides a lightweight alternative to fine-tuning by conditioning LLMs on user history records, and existing approaches typically select these records based on semantic relevance. We argue that relevance serves as an unreliable proxy for utility: a record may be semantically similar to a query yet fail to improve generation quality or even degrade it due to redundancy or conflicting information. To bridge this gap, we propose PURPLE, a contextual bandit framework that oPtimizes UseR Profiles for LLM pErsonalization. In contrast to a greedy selection of the most relevant records, PURPLE treats profile construction as an order-sensitive generation process and utilizes a Plackett-Luce ranking model to capture complex inter-record dependencies. By training with semantically rich feedback provided by the likelihood of the reference response, our method aligns retrieval directly with generation quality. Extensive experiments on nine personalization tasks demonstrate that PURPLE consistently outperforms strong heuristic and retrieval-augmented baselines in both effectiveness and efficiency, establishing a principled and scalable solution for optimizing user profiles.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PURPLE, a contextual bandit framework for optimizing user profiles in retrieval-augmented LLM personalization. It models profile construction as an order-sensitive ranking process via a Plackett-Luce model and trains the policy using the log-likelihood of generating a fixed reference response as the reward signal. Experiments on nine personalization tasks are reported to show consistent outperformance over heuristic and retrieval-augmented baselines in both effectiveness and efficiency.
Significance. If the reported gains prove robust to the concerns below, the work supplies a direct optimization route from retrieval selection to generation quality that goes beyond semantic similarity. The Plackett-Luce treatment of inter-record dependencies and the emphasis on efficiency constitute concrete technical strengths that could influence future retrieval-augmented personalization systems.
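For readers unfamiliar with how such a policy is trained, the toy sketch below shows one standard way to couple a Plackett-Luce sampler to a scalar reward via a REINFORCE-style update. The running baseline, optimizer, and toy reward are assumptions for illustration, not the paper's exact algorithm.

```python
# REINFORCE-style training sketch: sample an ordering from the Plackett-Luce
# policy, score it with a scalar reward, and reinforce the ordering's log-prob.
import torch

scores = torch.zeros(4, requires_grad=True)  # learnable record utilities
opt = torch.optim.Adam([scores], lr=0.05)
baseline = 0.0

def toy_reward(order):
    # Stand-in for log p(reference | profile, query); prefers record 2 first.
    return 1.0 if order[0] == 2 else 0.0

for step in range(200):
    # Sample an ordering and accumulate its Plackett-Luce log-probability.
    remaining, order, log_prob = list(range(4)), [], torch.tensor(0.0)
    while remaining:
        probs = torch.softmax(scores[remaining], dim=0)
        i = torch.multinomial(probs, 1).item()
        log_prob = log_prob + torch.log(probs[i])
        order.append(remaining.pop(i))
    r = toy_reward(order)
    baseline = 0.9 * baseline + 0.1 * r   # variance-reducing running baseline
    loss = -(r - baseline) * log_prob     # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```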
major comments (2)
- [§4 (Experiments)]: The abstract claims consistent gains across nine tasks, yet the manuscript supplies no exact baseline specifications, no statistical significance tests, and no ablation isolating the Plackett-Luce component from simpler ranking or greedy selection. Without these elements the central empirical claim cannot be evaluated at the required level of rigor.
- [§3.2 (Reward definition)]: The training objective uses log P(reference response | query, selected profile) as the scalar reward. This proxy can be gamed by profiles that increase probability mass on the exact reference tokens (lexical overlap or stylistic mimicry) without improving robustness on paraphrases or novel queries; no correlation analysis with independent metrics (human preference, zero-shot new references, or adversarial paraphrases) is provided to support the alignment assumption.
minor comments (2)
- [§3.1] The Plackett-Luce parameterization is introduced without a worked numerical example or pseudocode, making the order-sensitive update rule harder to follow on first reading.
- [§2] A small number of recent bandit-for-retrieval papers are omitted from the related-work discussion; adding them would sharpen the novelty statement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to a revised version that incorporates the suggested improvements for greater rigor and validation.
read point-by-point responses
- Referee [§4 (Experiments)]: The abstract claims consistent gains across nine tasks, yet the manuscript supplies no exact baseline specifications, no statistical significance tests, and no ablation isolating the Plackett-Luce component from simpler ranking or greedy selection. Without these elements the central empirical claim cannot be evaluated at the required level of rigor.
Authors: We agree that the experimental section requires additional detail for full reproducibility and rigor. In the revision we will: (1) provide exact hyperparameter settings, dataset splits, and implementation details for all baselines (including the specific retrieval-augmented and heuristic methods); (2) report statistical significance via paired t-tests or Wilcoxon signed-rank tests with p-values across the nine tasks (see the analysis sketch after these responses); and (3) add an ablation study that isolates the Plackett-Luce ranking model against simpler greedy selection and non-order-sensitive ranking variants. These additions will directly support the central empirical claims. [revision: yes]
- Referee [§3.2 (Reward definition)]: The training objective uses log P(reference response | query, selected profile) as the scalar reward. This proxy can be gamed by profiles that increase probability mass on the exact reference tokens (lexical overlap or stylistic mimicry) without improving robustness on paraphrases or novel queries; no correlation analysis with independent metrics (human preference, zero-shot new references, or adversarial paraphrases) is provided to support the alignment assumption.
Authors: We acknowledge that the reference-likelihood reward could in principle be influenced by surface-level overlap. However, because the reward is computed from the full generative likelihood under the target LLM (rather than token-level matching), it already encodes semantic and contextual utility. To strengthen the alignment claim we will add, in the revision, a correlation analysis between the learned reward and (a) human preference ratings on a held-out subset and (b) performance on paraphrased and zero-shot novel queries. This will provide empirical support for the proxy while preserving the direct optimization objective. [revision: partial]
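The analyses committed to in these responses are standard. A minimal sketch of the paired significance tests and the reward-versus-preference correlation, using placeholder data rather than any results from the paper:

```python
# Sketch of the rebuttal's proposed analyses; all arrays are synthetic placeholders.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon, spearmanr

rng = np.random.default_rng(0)
purple = rng.normal(0.62, 0.05, size=9)    # per-task scores (placeholder)
baseline = rng.normal(0.58, 0.05, size=9)

# Paired significance tests across the nine tasks.
t_stat, t_p = ttest_rel(purple, baseline)
w_stat, w_p = wilcoxon(purple, baseline)

# Correlation between the likelihood reward and an independent metric,
# e.g. human preference ratings on a held-out subset.
reward = rng.normal(size=50)
human = reward + rng.normal(scale=0.5, size=50)  # placeholder ratings
rho, rho_p = spearmanr(reward, human)

print(f"paired t-test p={t_p:.3f}, Wilcoxon p={w_p:.3f}, Spearman rho={rho:.2f}")
```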
Circularity Check
No significant circularity; derivation uses external reward signal
full rationale
The paper trains a Plackett-Luce contextual bandit policy by optimizing against an external scalar reward defined as the log-likelihood of a fixed reference response under the LLM given the query and selected profile. This reward is computed independently from the bandit parameters and is not a function of the policy itself, nor is any fitted parameter renamed as a prediction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided derivation; the central claim rests on the alignment between retrieval ordering and generation quality via this external feedback rather than reducing to a tautology or self-definition by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the Plackett-Luce model captures complex inter-record dependencies in profile construction
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel [unclear]
unclear: relation between the paper passage and the cited Recognition theorem.
PURPLE treats profile construction as an order-sensitive generation process and utilizes a Plackett-Luce ranking model to capture complex inter-record dependencies. By training with ... the likelihood of the reference response
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction [unclear]
unclear: relation between the paper passage and the cited Recognition theorem.
We formulate retrieval-augmented LLM personalization as a contextual bandit problem ... reward R(LLM(P ∥ x), y) = log p_ϕ(y | P, x)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.