A Survey of Personalized Large Language Models: Progress and Future Directions
Canonical reference. 80% of citing Pith papers cite this work as background.
roles
background 5
representative citing papers
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.
citing papers explorer
- AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.
- VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
- Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures
Test-time scaling for personalized LLMs follows a logarithmic utility curve under oracle selection but standard reward models suffer user-level collapse and query-level hacking; a probabilistic reward model with learned variance enables consistent scaling.
- ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
- Response-Aware User Memory Selection for LLM Personalization
RUMS selects LLM user memory via mutual information with model outputs to reduce response uncertainty, outperforming similarity-based methods in human alignment and response quality with up to 95% lower cost.
- Retrieval-Augmented Personalization with Foundation Models for Wearable Stress Detection
Retrieval from out-of-domain foundation models enables personalization of a lightweight transformer for stress detection, yielding +3.92% accuracy and +4.76% F1 gains on WESAD without user labels.
- When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents
RBI-Eval shows LLMs integrate sensitive memory under benign prompts at rates 8.9-82.9% higher than no-memory baselines, with retrieval systems reducing but not eliminating the effect.
- Beyond Similarity: Trustworthy Memory Search for Personal AI Agents
MemGate is a 9M-parameter neural gate inserted between vector memory and LLM that converts similarity search into task-conditioned admission, reducing memory-induced threats across agent frameworks while preserving utility.
- A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives
The work develops a reflective LLM-based storytelling agent for older adults that integrates argumentation schemes and argument mining with knowledge graphs and user modeling to generate and inspect personalized health narratives, evaluated through expert design and user studies showing recognition,
- A Survey on LLM-based Conversational User Simulation
A survey that introduces a taxonomy for LLM-based conversational user simulation, analyzes core techniques and evaluation methods, and identifies open challenges in the field.
- Alignment has a Fantasia Problem
AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.
- Enhancing Zero-shot Personalized Image Aesthetics Assessment with Profile-aware Multimodal LLM
P-MLLM augments a frozen LLM with selective fusion modules to incorporate visual information in a profile-conditioned manner for competitive zero-shot PIAA performance.
- Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.
- TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
TSUBASA improves long-horizon personalization in LLMs via dynamic memory evolution for writing and context-distillation self-learning for reading, outperforming Mem0 and Memory-R1 on Qwen-3 benchmarks while reducing token use.
- PersonaVLM: Long-Term Personalized Multimodal LLMs
PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.
- AI and My Values: User Perceptions of LLMs' Ability to Extract, Embody, and Explain Human Values from Casual Conversations
13 participants became convinced AI understands human values after chatbot interactions evaluated with the VAPT toolkit.
- Beyond Retrieval: Learning Compact User Representations for Scalable LLM Personalization
TAP-PER encodes user preferences as lightweight learnable prefix embeddings that outperform prompt-based and adapter-based baselines on LaMP tasks with 130x fewer per-user parameters.
- The Personalization Paradox: Semantic Loss vs. Reasoning Gains in Agentic AI Q&A
Personalization in an agentic RAG advising system boosts reasoning quality and grounding while reducing semantic metric scores due to the inability of current metrics to accommodate user-specific responses.
- Fair Agents: Balancing Multistakeholder Alignment in Multi-Agent Personalization Systems
The authors propose a conceptual framework integrating stakeholder-LLM alignment methods, social choice-based aggregation for collective decisions, and stakeholder-centric evaluations to achieve fair multi-agent personalization.
- Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security
A survey that maps risks along the agent workflow and consolidates metrics and benchmarks for safety, robustness, privacy, and security in agentic AI.