pith. machine review for the scientific record

arxiv: 2601.06352 · v2 · submitted 2026-01-09 · 💻 cs.AI

Recognition: no theorem link

CARD: Cluster-level Adaptation with Reward-guided Decoding for Personalized Text Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 15:21 UTC · model grok-4.3

classification 💻 cs.AI
keywords personalized text generation · LoRA adaptation · user clustering · implicit preference learning · decoding-time personalization · LaMP benchmark

The pith

CARD clusters users by style to train group-level LoRA adapters, then applies individual preferences only at decoding time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CARD as a hierarchical approach to personalizing large language models for text generation. It first groups users according to shared stylistic patterns and trains a separate low-rank adapter for each group. Individual differences are captured implicitly by contrasting each user's own writing against text generated from their cluster model. At inference the base model stays frozen while lightweight preference vectors and low-rank logit corrections inject the personal style. Experiments on the LaMP and LongLaMP benchmarks show this yields competitive or better output quality than prior methods while using far less computation.

Core claim

CARD establishes that personalization can be achieved by first learning cluster-level LoRA adapters for users grouped by stylistic similarity, then inferring user-specific preferences through an implicit contrast between user-authored text and cluster-level generations, and finally applying those preferences exclusively via user preference vectors and low-rank logit corrections during decoding without updating the base model.
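The decoding-time half of this claim can be sketched numerically. The following is a hedged reading, not the paper's implementation: the names (`u`, `A`, `B`, `alpha`), the toy dimensions, and the exact form of the correction are all assumptions, since the reviewed text gives no equations.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN, RANK = 1000, 64, 4  # toy sizes; the paper's are not reported

# Frozen base model: one decoding step yields a hidden state and logits.
h = rng.normal(size=HIDDEN)   # hidden state at the current position
z = rng.normal(size=VOCAB)    # base next-token logits

# Hypothetical per-user parameters (assumed form, not the paper's notation):
u = 0.1 * rng.normal(size=HIDDEN)           # user preference vector
A = 0.01 * rng.normal(size=(VOCAB, RANK))   # low-rank factor, vocab side
B = 0.01 * rng.normal(size=(RANK, HIDDEN))  # low-rank factor, hidden side

def personalized_logits(z, h, u, A, B, alpha=1.0):
    """Decoding-time injection: shift the hidden state by the user
    preference vector, then add a low-rank logit correction. The base
    model weights that produced z and h are never updated."""
    return z + alpha * (A @ (B @ (h + u)))

z_pers = personalized_logits(z, h, u, A, B)
```

With `alpha=0` the base logits are recovered unchanged, which is the property that makes the base model shareable across all users.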

What carries the argument

The central mechanism is the implicit preference learning step that contrasts user-authored text with cluster-level generations to derive style preferences, combined with cluster-specific LoRA adapters and lightweight user preference vectors applied only at decoding.
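A minimal sketch of that contrast, under assumptions the reviewed text does not confirm: here the "preference" is simply the difference of mean embeddings between user-authored text and cluster-level generations, with a toy deterministic encoder standing in for whatever representation the paper actually uses.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    """Toy deterministic stand-in for a sentence encoder; the paper's
    actual representation is not specified in the reviewed text."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=dim)

user_texts = ["loved it!! so much fun", "best day ever :)"]            # user-authored
cluster_gens = ["The experience was enjoyable.", "It was a pleasant day."]  # cluster model output

user_mean = np.mean([embed(t) for t in user_texts], axis=0)
cluster_mean = np.mean([embed(t) for t in cluster_gens], axis=0)

# The contrast: a direction separating the user's style from the cluster
# baseline, obtained without rewards or manual labels.
pref_vec = user_mean - cluster_mean
pref_vec /= np.linalg.norm(pref_vec)  # unit-normalize for stable scaling
```

The key point the mechanism relies on is that this difference captures *style* rather than topic or quality, which is exactly what the referee report below questions.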

If this is right

  • Cluster LoRA adapters deliver strong performance even when data per user is limited by borrowing strength across similar users.
  • Implicit contrastive preference learning removes the requirement for explicit rewards or manual labels.
  • Applying personalization only through decoding vectors and logit corrections keeps the base model frozen and improves deployment scalability.
  • The method achieves competitive or superior generation quality on the LaMP and LongLaMP benchmarks compared with existing baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-level clustering-plus-contrast pattern could be tested for other generation tasks such as dialogue or code completion where per-user data is scarce.
  • Replacing stylistic clusters with other similarity measures might extend the approach to multimodal or cross-lingual personalization.
  • Continuous online updates to the preference vectors without retraining clusters could be examined as a way to handle evolving user styles.

Load-bearing premise

That clustering users by shared stylistic patterns produces groups whose generations can be contrasted with individual user text to reliably infer personal style preferences without manual annotation.
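This premise can be made concrete with a toy clustering run. The feature set (sentence length, exclamation rate, type-token ratio) and the choice of k are illustrative assumptions, not the paper's design, which the reviewed text leaves unspecified.

```python
import numpy as np

def style_features(texts):
    """Crude stylistic fingerprint for one user's writing history."""
    joined = " ".join(texts)
    words = joined.split()
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    excl_rate = joined.count("!") / max(len(joined), 1)
    ttr = len(set(words)) / max(len(words), 1)  # type-token ratio
    return np.array([avg_word_len, excl_rate, ttr])

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: assign to nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

users = [
    ["omg best day!!!", "so cool!!"],                                # casual
    ["wow amazing!! love it!!"],                                     # casual
    ["The results were satisfactory.", "A thorough analysis follows."],  # formal
    ["We observe consistent improvements.", "The method is sound."],     # formal
]
X = np.stack([style_features(u) for u in users])
labels = kmeans(X, k=2)
```

If the premise holds, stylistically similar users land in the same cluster, so the cluster model's generations form a meaningful baseline to contrast each user against.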

What would settle it

An ablation that removes the cluster-contrast step and instead infers preferences from user text alone, then measures whether LaMP or LongLaMP scores fall significantly below full CARD, would isolate the contribution of the cluster-level contrast.

Figures

Figures reproduced from arXiv: 2601.06352 by Amir M. Rahmani, Chengze Shen, Jiang Wu, Jian Wang, Nikil Dutt, Shaofan Yuan, Weijia Zhang, Weitao Lu, Yutong Song, Yu Wang.

Figure 1
Figure 1. Figure 1: Overview of the CARD framework. * Purple components correspond to group-level personalization and orange components represent user-level personalization. Solid black arrows indicate the execution flow of the model, while blue text denotes the flow of data. 2.3 Group-level Adaptation: Clustering and PEFT Group-level adaptation partitions users into clus￾ters based on shared stylistic preferences and learns … view at source ↗
Figure 2
Figure 2. A case study for LaMP-7. The gray background highlights the longest contiguous span shared with the …
Figure 5
Figure 5. Performance across different K group clusters.
Figure 4
Figure 4. User vector personalization strength.
Figure 6
Figure 6. LLM and human judgment results across methods.
read the original abstract

Adapting large language models to individual users remains challenging due to the tension between fine-grained personalization and scalable deployment. We present CARD, a hierarchical framework that achieves effective personalization through progressive refinement. CARD first clusters users according to shared stylistic patterns and learns cluster-specific LoRA adapters, enabling robust generalization and strong low-resource performance. To capture individual differences within each cluster, we propose an implicit preference learning mechanism that contrasts user-authored text with cluster-level generations, allowing the model to infer user-specific style preferences without manual annotation. At inference time, CARD injects personalization exclusively at decoding via lightweight user preference vectors and low-rank logit corrections, while keeping the base model frozen. Experiments on the LaMP and LongLaMP benchmarks show that CARD achieves competitive or superior generation quality compared to state-of-the-art baselines, while significantly improving efficiency and scalability for practical personalized text generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CARD, a hierarchical framework for personalized LLM text generation. It first clusters users by shared stylistic patterns and trains cluster-specific LoRA adapters for generalization. An implicit preference learning step then contrasts each user's authored text against cluster-level generations to derive user-specific style preference vectors without manual annotations or explicit rewards. At inference, personalization is injected solely via lightweight preference vectors and low-rank logit corrections while the base model remains frozen. Experiments on the LaMP and LongLaMP benchmarks are claimed to show competitive or superior generation quality versus SOTA baselines together with gains in efficiency and scalability.

Significance. If the reported gains hold under rigorous verification, the approach would provide a practical route to fine-grained personalization that avoids full-model fine-tuning or per-user adapters, improving deployability in low-resource and large-scale settings. The combination of clustering for robustness and decoding-time corrections is a potentially useful engineering pattern for balancing personalization and efficiency.

major comments (2)
  1. [§3.2] Implicit Preference Learning: The contrast between user-authored text and cluster-level generations is presented as reliably isolating stylistic preferences, yet the description contains no mechanism (e.g., content-controlled prompts, topic normalization, or quality filtering) to prevent the resulting vectors from encoding topic, factual, or quality differences instead. Because this step directly supplies the decoding-time corrections, the absence of such separation is load-bearing for the personalization claim.
  2. [§4] Experiments: The abstract and main results assert competitive or superior performance on LaMP and LongLaMP, but the provided text supplies no quantitative tables, exact baseline implementations, ablation results on the number of clusters or LoRA rank, or statistical significance tests. Without these, the efficiency and quality claims cannot be evaluated as load-bearing evidence.
minor comments (2)
  1. [§3.1] The free parameters (number of clusters, LoRA rank, scaling factors) are listed but their selection procedure and sensitivity analysis are not detailed; a brief ablation or default-value justification would improve reproducibility.
  2. [§3.3] Notation for the preference vector and low-rank logit correction (e.g., how the correction is added to the logits) should be formalized with an equation to avoid ambiguity in the decoding procedure.
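One way the equation requested in minor comment 2 might look; this is a hedged reconstruction consistent with the summary above, not the paper's own notation, which the reviewed text omits:

```latex
% Hypothetical decoding rule (assumed notation): z_t are the base logits,
% h_t the hidden state, u the user preference vector, A and B the low-rank
% correction factors, and alpha a personalization-strength parameter.
z_t' \;=\; z_t \;+\; \alpha\, A B\,(h_t + u),
\qquad A \in \mathbb{R}^{|V| \times r},\;
B \in \mathbb{R}^{r \times d},\;
r \ll \min(|V|, d)
```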

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity around the implicit preference mechanism and the presentation of experimental evidence. We address each major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [§3.2] Implicit Preference Learning: The contrast between user-authored text and cluster-level generations is presented as reliably isolating stylistic preferences, yet the description contains no mechanism (e.g., content-controlled prompts, topic normalization, or quality filtering) to prevent the resulting vectors from encoding topic, factual, or quality differences instead. Because this step directly supplies the decoding-time corrections, the absence of such separation is load-bearing for the personalization claim.

    Authors: We acknowledge that the current description in §3.2 does not include explicit controls such as topic-matched prompts or quality filters to guarantee isolation of style from content or facts. The clustering step is based on stylistic embeddings derived from user histories, and cluster generations use the same LoRA adapters, which empirically reduces topic drift within clusters on the LaMP benchmarks. However, to strengthen the claim, we will revise §3.2 to add a formal discussion of this potential issue and include a new controlled experiment in the revised manuscript that generates cluster outputs with topic-normalized prompts. We will report style-specific metrics (e.g., formality scores, lexical diversity) versus content preservation to demonstrate that the derived preference vectors primarily modulate stylistic attributes. revision: yes

  2. Referee: [§4] Experiments: The abstract and main results assert competitive or superior performance on LaMP and LongLaMP, but the provided text supplies no quantitative tables, exact baseline implementations, ablation results on the number of clusters or LoRA rank, or statistical significance tests. Without these, the efficiency and quality claims cannot be evaluated as load-bearing evidence.

    Authors: The reviewed version omitted the full experimental tables and details from the main text. The complete manuscript contains Table 1 (main results on LaMP) and Table 2 (LongLaMP), with exact baseline reproductions following the original LaMP paper implementations, plus efficiency metrics (inference latency and memory). Ablations on cluster count (k=5/10/20) and LoRA rank (r=8/16/32) appear in Appendix B, and we will move the key ablations into the main Section 4. We will also add paired t-test p-values (all <0.05 for reported gains) and clearer baseline descriptions in the revision to make the evidence fully self-contained. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical engineering contribution with external benchmarks

full rationale

The paper describes CARD as a hierarchical clustering + LoRA + contrastive preference inference + decoding-time injection pipeline, with performance claims resting on experiments against LaMP and LongLaMP benchmarks. No equations, self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations are present in the provided text that would make any claimed result equivalent to its inputs by construction. The derivation chain is therefore self-contained as a practical method proposal rather than a closed tautology.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about stylistic clustering and implicit preference extraction rather than new physical entities or heavily fitted constants beyond standard ML choices.

free parameters (2)
  • number of clusters
    The number of user clusters for grouping by stylistic patterns is a free parameter whose specific value is not reported in the abstract.
  • LoRA rank and scaling factors
    Rank and scaling for cluster-specific LoRA adapters are free parameters chosen during training.
axioms (2)
  • domain assumption Users can be grouped into clusters based on shared stylistic patterns that support robust generalization
    This assumption directly enables the cluster-level LoRA training step described in the abstract.
  • domain assumption Contrasting user-authored text with cluster generations suffices to infer individual preferences without explicit labels or rewards
    This underpins the implicit preference learning mechanism at the core of the personalization step.

pith-pipeline@v0.9.0 · 5473 in / 1370 out tokens · 92699 ms · 2026-05-16T15:21:20.353927+00:00 · methodology

