hub
LaMP: When Large Language Models Meet Personalization
13 Pith papers cite this work. Polarity classification is still indexing.
years
2026: 13
verdicts
UNVERDICTED: 13
citing papers explorer
-
Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
Test-time scaling for personalized LLMs follows a logarithmic utility curve under oracle selection but standard reward models suffer user-level collapse and query-level hacking; a probabilistic reward model with learned variance enables consistent scaling.
-
HorizonBench: Long-Horizon Personalization with Evolving Preferences
HorizonBench generates 6-month conversation histories from structured mental state graphs to test AI models on tracking evolving user preferences, finding that frontier models mostly fail at belief updates and perform near or below chance.
-
PERCEIVE: A Benchmark for Personalized Emotion and Communication Behavior Understanding on Social Media
PERCEIVE is the first bilingual benchmark integrating author content, reader emotions from comments, communication behavior, user attributes, and social graphs for personalized social media emotion understanding.
-
Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization
Psy-CoT decomposes reasoning into Interaction Perception, Psychological Empathy, and Logical Construction while RAPO asymmetrically weights role-specific tokens during policy optimization, outperforming prior CoT and GRPO baselines on role-playing benchmarks.
-
ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation
ClusterRAG applies density-based clustering to user profiles for collaborative retrieval in personalized RAG and reports best performance on LaMP tasks by combining target and similar-user profiles.
-
TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing token use.
-
PeReGrINE: Evaluating Personalized Review Fidelity with User Item Graph Context
PeReGrINE is a graph-based benchmark that restructures Amazon Reviews 2023 with temporal cutoffs and introduces dissonance analysis to measure how well retrieval-conditioned models match user style and product consensus.
-
Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning
A hybrid fine-tuning objective using KL divergence for token calibration and Kahneman-Tversky optimization for semantic binding enables LLMs to produce outputs that match desired attribute distributions across repeated prompts.
-
Always-On Agents: A Survey of Persistent Memory, State, and Governance in LLM Agents
Survey mapping persistent state in LLM agents along six axes and proposing the AOEP-v0 protocol to evaluate governance and recovery obligations.
-
Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
An agentic harness letting the LLM self-manage flat text-file storage via tool calls outperforms eight prior memory systems on cross-scenario generality across QA, chat, trajectory, stress-test, and long-horizon tasks.
-
Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering
IAP uses RL to train LLMs to explicitly infer and apply implicit user intent in single-turn personalized QA, achieving ~7.5% average macro-score gains over baselines on LaMP-QA.
-
Quantifying and Predicting Disagreement in Graded Human Ratings
Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.
-
Modeling Human Perspectives with Socio-Demographic Representations
Socio-Contrastive Learning jointly learns socio-demographic representations and textual features via contrastive objectives to predict annotator perspectives more accurately than concatenation baselines.