Personalized Multimodal Large Language Models: A Survey

Dang Nguyen; Franck Dernoncourt; Hanieh Deilamsalehy; Hanjia Lyu; Hongjie Chen; Huanrui Yang; Ishita Kumar; Jiebo Luo; Jiuxiang Gu; Joe Barrow

arxiv: 2412.02142 · v1 · pith:XDFWNYP3new · submitted 2024-12-03 · cs.CV · cs.AI · cs.CL · cs.IR

Junda Wu, Hanjia Lyu, Yu Xia, Zhehao Zhang, Joe Barrow, Ishita Kumar, Mehrnoosh Mirtaheri, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Namyong Park, Sungchul Kim, Huanrui Yang, Subrata Mitra, Zhengmian Hu, Nedim Lipka, Dang Nguyen, Yue Zhao, Jiebo Luo, Julian McAuley

classification cs.CV cs.AI cs.CL cs.IR

keywords language, large, models, multimodal, personalized, mllms, survey, techniques

Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities, such as text, images, and audio, to perform complex tasks with high accuracy. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications. We propose an intuitive taxonomy for categorizing the techniques used to personalize MLLMs to individual users, and discuss the techniques accordingly. Furthermore, we discuss how such techniques can be combined or adapted when appropriate, highlighting their advantages and underlying rationale. We also provide a succinct summary of personalization tasks investigated in existing research, along with the evaluation metrics commonly used. Additionally, we summarize the datasets that are useful for benchmarking personalized MLLMs. Finally, we outline critical open challenges. This survey aims to serve as a valuable resource for researchers and practitioners seeking to understand and advance the development of personalized multimodal large language models.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
cs.CL 2026-03 unverdicted novelty 8.0

AlpsBench supplies 2500 real-dialogue sequences with verified memories to benchmark LLM extraction, updating, retrieval, and utilization of personalized information.

Personal Visual Memory from Explicit and Implicit Evidence
cs.CV 2026-05 unverdicted novelty 7.0

VisualMem augments text memory with a visual module that resolves identity and durable user facts from images, outperforming prior systems on a new benchmark for explicit and implicit personal visual evidence.

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking
cs.LG 2026-05 unverdicted novelty 7.0

F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked perfo...

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents
cs.AI 2026-05 unverdicted novelty 7.0

OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on f...

Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
cs.LG 2026-05 unverdicted novelty 7.0

CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

PersonaVLM: Long-Term Personalized Multimodal LLMs
cs.CL 2026-03 unverdicted novelty 6.0

PersonaVLM adds memory extraction, multi-turn retrieval-based reasoning, and personality inference to multimodal LLMs, yielding 22.4% gains on a new long-term personalization benchmark and outperforming GPT-4o.

GRADE: Graph Representation of LLM Agent Dependency and Execution
cs.LG 2026-06 unverdicted novelty 5.0

GRADE models any LLM agent run as a graph with execution and graded dependency edge layers to enable failure prediction and fault localization across tool, coding, and web agent corpora.