Contextualized Visual Personalization in Vision-Language Models

Han Cheol Moon; Jisoo Mok; Junsung Park; Sangwon Yu; Sungroh Yoon; Yeongtak Oh

arxiv: 2602.03454 · v3 · pith:W6SUU52Jnew · submitted 2026-02-03 · 💻 cs.CV

Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, Sungroh Yoon

Pith reviewed 2026-05-21 14:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords contextualized visual personalization · vision-language models · personalized image captioning · reinforcement learning post-training · diagnostic evaluations · visual context retrieval

The pith

Vision-language models can be post-trained to associate new images with a user's accumulated visual experiences instead of relying on text patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines contextualized visual personalization as the requirement for VLMs to recognize and retrieve a user's past visual-textual experiences when processing new images. It introduces CoViP, a framework that centers on personalized image captioning and strengthens this skill via reinforcement-learning post-training combined with caption-augmented generation. Diagnostic tests are added to check whether models actually use visual context rather than textual shortcuts. Experiments indicate that current models fall short on this capability while CoViP delivers improvements in captioning and broader personalization tasks. Readers should care because the work targets the gap between generic model responses and systems that can draw on an individual's own visual history.

Core claim

We formalize contextualized visual personalization and propose CoViP, a unified framework that treats personalized image captioning as the core task. CoViP applies reinforcement-learning-based post-training and caption-augmented generation so that VLMs perform visual recognition and textual retrieval of personalized experiences from new images. Diagnostic evaluations explicitly rule out textual shortcut solutions and verify genuine use of visual context, with results showing that existing models have substantial limitations while CoViP yields gains across downstream personalization tasks.

What carries the argument

CoViP, a unified framework that centers personalized image captioning as the core task and applies reinforcement-learning post-training plus caption-augmented generation to enable visual recognition and retrieval of user-specific experiences.

If this is right

Existing open-source and proprietary VLMs exhibit substantial limitations in contextualized visual personalization.
CoViP improves performance on the core task of personalized image captioning.
CoViP produces holistic gains across multiple downstream personalization tasks.
Diagnostic evaluations can verify whether models truly leverage visual context rather than textual shortcuts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the method scales, personal AI systems could naturally reference a user's own photo collection during everyday image descriptions or queries.
The same post-training pattern might be tested on other user-specific multimodal tasks such as personalized question answering or story generation from photos.
Long-term use could be examined by measuring whether gains hold as the volume and variety of a single user's visual history increase over time.

Load-bearing premise

The diagnostic evaluations successfully isolate genuine visual-context usage from textual shortcuts or other non-visual solutions.

What would settle it

A controlled test in which CoViP-trained models maintain high performance on personalization tasks even when visual history inputs are removed, replaced with unrelated images, or scrambled, while caption quality remains unchanged.
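
One way to picture this test is as a small ablation harness that scores the same examples under each history condition. The sketch below is illustrative only and is not the paper's protocol: the `(image, text)` history layout, the `score_fn` callable, the distractor `pool`, and the reading of "scrambled" as re-pairing images with the wrong text are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical layout: each history item pairs an image identifier with its text.
HistoryItem = Tuple[str, str]
MODES = ("intact", "removed", "unrelated", "scrambled")


def ablate_history(history: Sequence[HistoryItem], mode: str,
                   pool: Sequence[str], rng: random.Random) -> List[HistoryItem]:
    """Return an altered copy of the visual history for one ablation condition."""
    if mode == "intact":
        return list(history)
    if mode == "removed":
        return []
    if mode == "unrelated":
        # Keep the text, swap every image for a distractor drawn from the pool.
        return [(rng.choice(pool), text) for _, text in history]
    if mode == "scrambled":
        # Assumption: "scrambled" means images are re-paired with other entries' text.
        images = [image for image, _ in history]
        rng.shuffle(images)
        return [(image, text) for image, (_, text) in zip(images, history)]
    raise ValueError(f"unknown ablation mode: {mode}")


def score_by_condition(examples: Sequence[dict],
                       score_fn: Callable[[dict], float],
                       pool: Sequence[str], seed: int = 0) -> Dict[str, float]:
    """Mean score per condition; `examples` must be non-empty."""
    rng = random.Random(seed)
    results = {}
    for mode in MODES:
        scores = []
        for example in examples:
            ablated = {**example,
                       "history": ablate_history(example["history"], mode, pool, rng)}
            scores.append(score_fn(ablated))
        results[mode] = sum(scores) / len(scores)
    return results
```

If the scores under `removed`, `unrelated`, and `scrambled` stay close to the `intact` score, the model is probably not relying on the visual history, which is the outcome that would undercut the paper's claim.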

Figures

Figures reproduced from arXiv: 2602.03454 by Han Cheol Moon, Jisoo Mok, Junsung Park, Sangwon Yu, Sungroh Yoon, Yeongtak Oh.

**Figure 1.** Qualitative example of the use case for contextual visual personalization in VLMs. Note that our CoViP effectively responds to the question while integrating the mentioned personal details from the given multimodal contexts.

**Figure 2.** Illustration of the proposed personalized image captioning benchmark construction.

**Figure 3.** Visualization of diagnostic personalization tasks.

**Figure 4.** Results of the human preference evaluation. Here, Win denotes the win rates of CoViP compared to the baseline.

**Figure 6.** Figure 6 analyzes the relationship between recognition and retrieval. As shown in the figure, the average F1 score exhibits a moderate increase across models, whereas MCQA accuracy improves by a substantially larger margin at comparable F1 levels. This indicates that baseline models already achieve reasonable recognition capability, but their low performance under our benchmark probing stems from retrieval as th…

read the original abstract

Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user's accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper names a practical gap in VLMs around user-specific visual memory and gives a concrete RL-based recipe plus diagnostics to address it, but the abstract supplies no numbers or setup details to judge whether the gains are real.

The main takeaway is that this work formalizes contextualized visual personalization as the ability of VLMs to retrieve and apply a user's past visual experiences when processing new images. CoViP implements this by treating personalized image captioning as the central task, then applying reinforcement-learning post-training combined with caption-augmented generation. The authors also describe diagnostic evaluations meant to block textual shortcuts and confirm actual visual-context use. Existing models are said to fall short, while CoViP improves captioning and transfers to other personalization tasks. That framing and the training outline are the clearest new elements. The diagnostics stand out as a useful addition because they target a common failure mode in multimodal work. The overall direction fits real needs in consumer and educational applications where models should remember individual photo histories.

The soft spots are straightforward. The abstract contains no equations, dataset descriptions, training hyperparameters, or quantitative results, so the magnitude of any improvement and the actual robustness of the diagnostics cannot be checked. The stress-test concern about textual proxies is reasonable on the given text: if the test contexts still let models pull information from associated captions without needing the image pixels, or if the RL reward leans heavily on fluency rather than context retrieval, the claim of true visual leverage would weaken. Nothing in the description suggests circular fitting or invented entities, but the lack of visible evidence keeps the soundness assessment provisional.

This paper is aimed at researchers working on VLM fine-tuning and personalization. Anyone building or evaluating context-aware multimodal systems would find the formalization and the evaluation ideas worth reading. It deserves peer review because the problem is well-motivated and the proposed framework is specific enough for referees to examine the experiments and diagnostics in detail.

Referee Report

1 major / 1 minor

Summary. The manuscript formalizes contextualized visual personalization as the task of enabling VLMs to recognize and retrieve user-specific visual experiences when interpreting new images. It proposes CoViP, a unified framework that centers on personalized image captioning as the core task, applies reinforcement-learning-based post-training, and uses caption-augmented generation. Diagnostic evaluations are introduced to rule out textual shortcut solutions and confirm that VLMs leverage visual context. Experiments show limitations in existing open-source and proprietary VLMs, with CoViP delivering gains in captioning and holistic improvements across downstream personalization tasks.

Significance. If the diagnostic evaluations successfully isolate visual-context usage, CoViP could represent a useful post-training stage for building more robust personalized VLMs. The RL-based approach and explicit focus on ruling out shortcuts are potential strengths, provided the evaluations include rigorous controls that ablate visual input independently of text.

major comments (1)

[Abstract / Diagnostic Evaluations] Abstract and diagnostic evaluations section: the claim that the introduced diagnostics 'explicitly rule out textual shortcut solutions' and verify that VLMs 'truly leverage visual context' is load-bearing for the central thesis that CoViP gains arise from contextualized visual personalization rather than fluency or text retrieval. The manuscript should detail concrete controls (e.g., swapping or corrupting visual pixels while preserving associated text descriptions) to demonstrate that non-visual solutions are blocked; without such ablations the diagnostics risk permitting textual proxies, as raised in the stress-test note.

minor comments (1)

[Abstract] The abstract refers to 'extensive experiments' and 'holistic gains across downstream personalization tasks' without naming the specific datasets, baselines, or metrics; adding these details would improve clarity even in the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive report. We address the single major comment below and have revised the manuscript to strengthen the description of our diagnostic controls.

Referee: [Abstract / Diagnostic Evaluations] Abstract and diagnostic evaluations section: the claim that the introduced diagnostics 'explicitly rule out textual shortcut solutions' and verify that VLMs 'truly leverage visual context' is load-bearing for the central thesis that CoViP gains arise from contextualized visual personalization rather than fluency or text retrieval. The manuscript should detail concrete controls (e.g., swapping or corrupting visual pixels while preserving associated text descriptions) to demonstrate that non-visual solutions are blocked; without such ablations the diagnostics risk permitting textual proxies, as raised in the stress-test note.

Authors: We thank the referee for highlighting this critical point. While our original manuscript introduced diagnostic evaluations intended to rule out textual shortcuts, we agree that the description of the concrete controls was insufficiently explicit. In the revised manuscript we have added a new subsection (Section 4.3) titled 'Controls for Isolating Visual Context Usage'. This subsection now details three specific ablations performed on both baseline VLMs and CoViP: (1) pixel-level corruption of the visual input while preserving all associated personalized text descriptions, (2) complete removal of visual pixels (text-only context), and (3) cross-user visual swapping where the visual context from one user is paired with the query image of another. In all three cases we report substantial performance degradation when visual information is disrupted or mismatched, while CoViP retains higher robustness than baselines. These results are presented in a new Table 3 and Figure 4, with full experimental details moved to the appendix. We believe these additions directly address the concern that textual proxies could explain the observed gains.

revision: yes

Circularity Check

0 steps flagged

No circularity: new framework and diagnostics are independently defined and validated.

The paper formalizes a new task of contextualized visual personalization and introduces CoViP as a post-training framework that applies reinforcement learning and caption-augmented generation to existing VLMs. Diagnostic evaluations are presented as separate verification steps to check for visual context usage versus textual shortcuts. No equations, fitted parameters, or derivations are shown that reduce any claimed prediction or result to the same inputs by construction. Self-citations, if present, are not load-bearing for the central claims, which rest on experimental outcomes rather than tautological redefinitions. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unverified effectiveness of the proposed RL objective and diagnostic tests.

pith-pipeline@v0.9.0 · 5716 in / 1216 out tokens · 49902 ms · 2026-05-21T14:12:27.857205+00:00 · methodology

Review history (2 revisions) →

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We propose CoViP, a unified framework that treats personalized image captioning as a core task... through reinforcement-learning-based post-training and caption-augmented generation.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
cs.CV 2026-05 unverdicted novelty 7.0

Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Object Hallucination in Image Captioning

Springer, 2024. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021. Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

I still remember the day I saw John at Lake Francesborough on 2025-09-09

Last-Seen Detection (LSD). In the LSD task, each dialogue contains an explicit reference to when and where the user encountered the individual, e.g., “I still remember the day I saw John at Lake Francesborough on 2025-09-09.” Given a new query image, the user asks: “Where did I last see the person in this image?” To answer correctly, the model must identify ...

work page 2025
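
The retrieval-plus-temporal-reasoning half of LSD can be sketched as a lookup over dated sightings. This is a minimal illustration, not the benchmark's implementation: the `Sighting` record and `last_seen_place` helper are hypothetical, and the sketch assumes the person in the query image has already been recognized, which is the visual step the model itself must perform.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Sighting(NamedTuple):
    """Hypothetical record extracted from one context dialogue."""
    person: str
    place: str
    seen_on: date


def last_seen_place(context: List[Sighting], person: str) -> Optional[str]:
    """Collect every sighting of `person` and return the place of the latest one."""
    matches = [s for s in context if s.person == person]
    if not matches:
        return None
    # Temporal reasoning: the most recent date wins, regardless of dialogue order.
    return max(matches, key=lambda s: s.seen_on).place
```

Because several dialogues can mention the same individual, returning the first match rather than the latest one would be the kind of partial-match shortcut the task is designed to penalize.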
[3]

Oh wait, I need to go to the post office to return this package

Last-Action Recall (LAR). LAR extends LSD by requiring recall of a finer-grained personal action rather than a location. For each context, we append an additional user utterance to the last-seen dialogue of the query individual, describing a specific action, e.g., “Oh wait, I need to go to the post office to return this package.” The action is randomly sample...

work page
[4]

If this person ever shows up again, remind me by saying the keyword SKS

Instruction-Triggered Recall (ITR). ITR evaluates a more proactive form of personalization. In this task, the last-seen dialogue includes an instruction of the form: “If this person ever shows up again, remind me by saying the keyword SKS.” At inference time, the user asks a generic question such as: “Where did I last see the person in this image?” without ex...

work page 2024
[5]

Conversely, optimizing only rcaps also yields consistently weaker results, suggesting retrieval signals without fine-grained visual supervision are inadequate

Necessity of joint supervision. Training with only rvis (without rcaps) degrades performance, in some cases falling below the Qwen3-VL-8B baseline, indicating that visual supervision alone is insufficient for personalized image captioning. Conversely, optimizing only rcaps also yields consistently weaker results, suggesting retrieval signals without fine-g...

work page
[6]

F1-based vision VR vs. binary consistency VR. Replacing RePIC’s object-consistency VR (OCT), which provides binary correctness feedback, with our set-based F1 VR rvis consistently improves performance on positive accuracy. This indicates that set-level supervision provides a denser and more robust learning signal for multi-concept perception

work page
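
The difference between a binary and a set-level F1 reward can be illustrated with a small sketch. The excerpt does not give the exact definitions of rvis or of RePIC's OCT, so both functions below are generic stand-ins: the binary reward is assumed to require an exact set match, and the F1 reward is the standard set-overlap F1 between predicted and reference concept names.

```python
from typing import Iterable


def binary_consistency_reward(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Assumed binary signal: 1.0 only if the predicted concept set is exactly right."""
    return 1.0 if set(predicted) == set(reference) else 0.0


def set_f1_reward(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """Standard set-level F1 between predicted and reference concepts."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With two of three reference concepts recognized, the binary reward gives no credit while the F1 reward gives partial credit, which is one reading of why the set-level signal is described as denser for multi-concept perception.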
[7]

Effect of increasing the number of positive concepts. Varying the number of positive concepts included during ... Figure S.10. Results on MMIU (11K MCQs across 52 tasks; panels: (a) Category Accuracy (%), (b) Per-Task Accuracy (%)), which evaluates multi-image relational understanding spanning temporal...

work page 2025
[8]

Any concept image is occluded or not fully visible in the query image

work page
[9]

Any concept does not appear in the query image at all

work page
[10]

yes" only when every concept appears in the query image. Carefully examine the images and output the final result only as

Any object or person in the query image is significantly different from the corresponding concept image. [Answering rule] Output "yes" only when every concept appears in the query image. Carefully examine the images and output the final result only as "yes" or "no". Table S.11. Showcase of a u...

work page
[11]

This looks like Pino again, perhaps older than in the park photo from Busan Station

Recall and reuse details from the previous dialogues (object names, appearances, places, times, and relationships). – Treat the previous dialogues as long-term memory. – If an object in the new image appears similar to one mentioned in the past, refer to it using the same name and contextual background. 2. Ground your description in the new image’s visual c...

work page
[12]

Keep your tone natural and human-like, as if describing something familiar to the same user

work page
[13]

Do not restate previous dialogues verbatim; instead, synthesize and extend them with new image-grounded observations

work page
[14]

6. Use only relevant memories

Write in paragraph form, not in a dialogue format. 6. Use only relevant memories. – If an object or scene from the previous dialogues does not appear in the new image, ignore it completely. – Include contextual information only for objects that actually appear. – Avoid unrelated names, locations, or events. Contextualized Visual Personalization in Vision...

work page
[15]

• Do not invent or alter the name

The main object’s name ({name}) must be used consistently throughout the dialogue. • Do not invent or alter the name

work page
[16]

last summer at the riverside,

The user should describe a personal experience related to {name}. • The experience must include at least one objective contextual element, such as a specific place, time, event, or situation (e.g., “last summer at the riverside,” “during my first year in college,” “in my grandmother’s backyard”)

work page
[17]

The model should respond naturally and empathetically — acknowledging, asking gentle questions, or adding brief reflections

work page
[18]

Keep the tone human-like, calm, and realistic — not overly emotional or robotic

work page
[19]

The conversation should have 6 turns total (User→Model→User→Model→User→Model)

work page
[20]

Focus on the personal connection and shared observation of the object

Avoid encyclopedic or factual world knowledge. Focus on the personal connection and shared observation of the object. [Output Format] Dialogue: User: ... Model: ... User: ... Model: ... User: ... Model: ... Table S.13. Visualization of a prompt used to generate MCQA pairs from the dialogue. M...

work page
[21]

Each question must target an objective detail present in the conversation (e.g., name, place, time, habit/action)

work page
[22]

Avoid emotions, opinions, or meta-dialogue

work page
[23]

Each question must have exactly 3 options: A, B, C

work page
[24]

Exactly one option is correct among A, B, C

work page
[25]

Make the wrong options (A/B/C except the correct one) plausible but clearly incorrect

work page
[26]

Do NOT require external/world knowledge; answers must come from the conversation content

work page
[27]

qa": [ {

Output must be valid JSON only: no additional text and no trailing commas. [JSON Output Schema] { "qa": [ { "id": "Q1", "question": "<string>", "options": { "A": "<string>", "B": "<string>", "C": "<string>" }, "correct_answer": "A" | "B" | "C" }, { "id": "Q2", "question": "<string>", "options": { ... }, "correct_answer": "A" | "B" | "C" }, { "id": "Q3", "...

work page
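
A consumer of this prompt's output would typically validate the JSON before using it. The sketch below is a hypothetical validator covering only the fields visible in the excerpt (`qa`, `id`, `question`, `options` with keys A/B/C, `correct_answer`); the number of questions is elided in the source, so no count is enforced. Python's `json.loads` already rejects trailing commas and surrounding free text, which matches the "valid JSON only" requirement.

```python
import json
from typing import List

VALID_ANSWERS = {"A", "B", "C"}
REQUIRED_KEYS = {"id", "question", "options", "correct_answer"}


def validate_mcqa(raw: str) -> List[dict]:
    """Parse model output and check it against the visible part of the schema.

    Raises ValueError on any violation (json.JSONDecodeError is a ValueError).
    """
    data = json.loads(raw)  # fails on trailing commas or extra non-JSON text
    if not isinstance(data, dict) or not isinstance(data.get("qa"), list):
        raise ValueError('top-level object must contain a "qa" list')
    for item in data["qa"]:
        if not isinstance(item, dict):
            raise ValueError("each qa entry must be an object")
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        options = item["options"]
        if not isinstance(options, dict) or set(options) != VALID_ANSWERS:
            raise ValueError("options must have exactly the keys A, B, C")
        if item["correct_answer"] not in VALID_ANSWERS:
            raise ValueError("correct_answer must be one of A, B, C")
    return data["qa"]
```

Rejecting malformed output outright, rather than trying to repair it, keeps question generation auditable: a failed parse can simply trigger a regeneration.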
[28]

Read the description carefully

work page
[29]

• If none of A/B/C can be confirmed from the description, choose D

For each question, choose the single best option: • If one of A/B/C is explicitly or clearly supported by the description, choose that option. • If none of A/B/C can be confirmed from the description, choose D

work page
[30]

You must ignore any information that is not in the description

work page
[31]

• Then, on a separate line, output the final choice in the exact format: [Required output format] Answer: \boxed{X} where X is one of A, B, C, or D

For each question: • You may briefly explain your reasoning in natural language. • Then, on a separate line, output the final choice in the exact format: [Required output format] Answer: \boxed{X} where X is one of A, B, C, or D. Inside \boxed{} there must be exactly one letter, with no extra text. [Given] • [Description] {Generated caption} • [Question] {Pre-define...

work page
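
Scoring responses in this format requires extracting the boxed letter. The helper below is a minimal sketch under stated assumptions: `extract_choice` is a hypothetical name, whitespace after `Answer:` is tolerated, and when a response contains several boxed answers the last one is taken as final. None of these choices are confirmed by the source.

```python
import re
from typing import Optional

# Matches "Answer:\boxed{X}" with exactly one letter A-D inside the braces.
_BOXED = re.compile(r"Answer:\s*\\boxed\{([ABCD])\}")


def extract_choice(response: str) -> Optional[str]:
    """Return the final boxed letter, or None if the format is not followed."""
    matches = _BOXED.findall(response)
    return matches[-1] if matches else None
```

Returning `None` for malformed output lets the evaluator count format violations separately from wrong answers.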
[32]

• Do not invent, alter, or omit the name

The person’s name ({name}) must be used consistently throughout the dialogue. • Do not invent, alter, or omit the name

work page
[33]

• The experience must include at least one concrete event or situation (e.g., bumping into them, having a short conversation, noticing what they were doing)

The user must describe a personal experience related to {name}. • The experience must include at least one concrete event or situation (e.g., bumping into them, having a short conversation, noticing what they were doing). • It should include at least one sensory or situational detail that makes the memory feel realistic

work page
[34]

I saw them on {seen date} at {seen place}

The user must explicitly mention both the date and the place: • Date: {seen date} • Place: {seen place} • Preferably within a single user turn (e.g., “I saw them on {seen date} at {seen place}...”)

work page
[35]

• Do not introduce new factual information beyond what the user provides

The model should respond naturally and empathetically, acknowledging the user’s experience or asking gentle follow-up questions. • Do not introduce new factual information beyond what the user provides

work page
[36]

Avoid encyclopedic or factual descriptions

Keep the tone calm, realistic, and human-like. Avoid encyclopedic or factual descriptions

work page
[37]

[Output Format] Dialogue: User:

The conversation must have exactly 6 turns in total: User→Model→User→Model→User→Model. [Output Format] Dialogue: User: ... Model: ... User: ... Model: ... User: ... Model: ... Table S.17. Prompt visualization used for diagnostic downstream tasks. Personalized Image Understanding Prompt: You a...

work page
[38]

This looks like Pino again, now indoors instead of the park near Busan Station

Recall and reuse details from the previous dialogues (object names, appearances, places, times, and relationships). • Treat the previous dialogues as your long-term memory. • If an object in the new image appears similar to one mentioned in the past, refer to it with the same name and contextual background. 2. Ground your understanding in the new image’s vi...

work page
[39]

Keep your tone natural and human-like — as if you are interpreting something familiar to the same user

work page
[40]

Instead, synthesize memory with the current image content

Do not restate previous dialogues verbatim. Instead, synthesize memory with the current image content

work page
[41]

Oh wait, I think I left my wallet at the Guess store, so I’m going back to check

Write in paragraph form, not in a dialogue format. 6. Use only relevant memories. • If an object or context from past dialogues does not appear in the new image, ignore it completely. • Add contextual information only when it helps understanding of what is visible. • Avoid mentioning unrelated names, locations, or experiences. Contextualized Visual Persona...

work page
[42]

A question asked to the model

work page
[43]

A ground-truth reference answer (GT)

work page
[44]

Evaluation Criteria: • The generated response is Correct if it semantically includes the core information conveyed by the ground-truth reference

A generated response from the model. Your task is to decide whether the generated response is Correct or Wrong. Evaluation Criteria: • The generated response is Correct if it semantically includes the core information conveyed by the ground-truth reference. • The wording does NOT need to match exactly. Paraphrases, rephrasings, or additional details are allowed...

work page

[1] [1]

Object Hallucination in Image Captioning

Springer, 2024. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pp. 8748–8763. PmLR, 2021. Rohrbach, A., Hendricks, L. A., Burns, K., Darrell, T., and Saenko, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

I still remember the day I saw John at Lake Francesborough on 2025-09-09

Last-Seen Detection (LSD)In the LSD task, each dialogue contains an explicit reference to when and where the user encountered the individual, e.g.,“I still remember the day I saw John at Lake Francesborough on 2025-09-09. ” Given a new query image, the user asks:“Where did I last see the person in this image?” To answer correctly, the model must identify ...

work page 2025

[3] [3]

Oh wait, I need to go to the post office to return this package

Last-Action Recall (LAR)LAR extends LSD by requiring recall of a finer-grained personal action rather than a location. For each context, we append an additional user utterance to thelast-seen dialogueof the query individual, describing a specific action, e.g.,“Oh wait, I need to go to the post office to return this package. ” The action is randomly sample...

work page

[4] [4]

If this person ever shows up again, remind me by saying the keyword SKS

Instruction-Triggered Recall (ITR)ITR evaluates a more proactive form of personalization. In this task, the last-seen dialogue includes an instruction of the form:“If this person ever shows up again, remind me by saying the keyword SKS. ” At inference time, the user asks a generic question such as:“Where did I last see the person in this image?”without ex...

work page 2024

[5] [5]

Conversely, optimizing only rcaps also yields consistently weaker results, suggesting retrieval signals without fine-grained visual supervision are inadequate

Necessity of joint supervision.Training with only rvis (without rcaps) degrades performance, in some cases falling below the Qwen3-VL-8B baseline, indicating that visual supervision alone is insufficient for personalized image captioning. Conversely, optimizing only rcaps also yields consistently weaker results, suggesting retrieval signals without fine-g...

work page

[6] [6]

F1-based vision VR vs. binary consistency VR.Replacing RePIC’s object-consistency VR (OCT), which provides binary correctness feedback, with our set-based F1 VR rvis consistently improves performance on positive accuracy. This indicates that set-level supervision provides a denser and more robust learning signal for multi-concept perception

work page

[7] [7]

Effect of increasing the number of positive concepts.Varying the number of positive concepts included during 23 Contextualized Visual Personalization in Vision-Language Models (a) Category Accuracy (%)(b) Per TaskAccuracy (%) MMIU Figure S.10.Results on MMIU (11K MCQs across 52 tasks), which evaluates multi-image relational understanding spanning temporal...

work page 2025

[8]

Any concept image is occluded or not fully visible in the query image.


[9]

Any concept does not appear in the query image at all.


[10]

Any object or person in the query image is significantly different from the corresponding concept image.

[Answering rule] Output "yes" only when every concept appears in the query image. Carefully examine the images and output the final result only as "yes" or "no".

Table S.11. Showcase of a u...


[11]

This looks like Pino again, perhaps older than in the park photo from Busan Station

Recall and reuse details from the previous dialogues (object names, appearances, places, times, and relationships).
– Treat the previous dialogues as long-term memory.
– If an object in the new image appears similar to one mentioned in the past, refer to it using the same name and contextual background.
2. Ground your description in the new image’s visual c...


[12]

Keep your tone natural and human-like, as if describing something familiar to the same user.


[13]

Do not restate previous dialogues verbatim; instead, synthesize and extend them with new image-grounded observations.


[14]

Write in paragraph form, not in a dialogue format.
6. Use only relevant memories.
– If an object or scene from the previous dialogues does not appear in the new image, ignore it completely.
– Include contextual information only for objects that actually appear.
– Avoid unrelated names, locations, or events.
...


[15]

The main object’s name ({name}) must be used consistently throughout the dialogue.
• Do not invent or alter the name.


[16]

The user should describe a personal experience related to {name}.
• The experience must include at least one objective contextual element, such as a specific place, time, event, or situation (e.g., “last summer at the riverside,” “during my first year in college,” “in my grandmother’s backyard”).


[17]

The model should respond naturally and empathetically — acknowledging, asking gentle questions, or adding brief reflections


[18]

Keep the tone human-like, calm, and realistic — not overly emotional or robotic


[19]

The conversation should have 6 turns total (User→Model→User→Model→User→Model).


[20]

Avoid encyclopedic or factual world knowledge. Focus on the personal connection and shared observation of the object.

[Output Format]
Dialogue:
User: ...
Model: ...
User: ...
Model: ...
User: ...
Model: ...

Table S.13. Visualization of a prompt used to generate MCQA pairs from the dialogue. M...


[21]

Each question must target an objective detail present in the conversation (e.g., name, place, time, habit/action).


[22]

Avoid emotions, opinions, or meta-dialogue.


[23]

Each question must have exactly 3 options: A, B, C.


[24]

Exactly one option is correct among A, B, C.


[25]

Make the wrong options (A/B/C except the correct one) plausible but clearly incorrect.


[26]

Do NOT require external/world knowledge; answers must come from the conversation content.


[27]

Output must be valid JSON only: no additional text and no trailing commas.

[JSON Output Schema]
{
  "qa": [
    { "id": "Q1", "question": "<string>", "options": { "A": "<string>", "B": "<string>", "C": "<string>" }, "correct_answer": "A" | "B" | "C" },
    { "id": "Q2", "question": "<string>", "options": { ... }, "correct_answer": "A" | "B" | "C" },
    { "id": "Q3", "...
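A generator that is asked for JSON-only output can still emit malformed responses, so a check along the following lines may be useful before the MCQA pairs are used. This is a sketch under assumptions: the function name `validate_mcqa` and the sample question are hypothetical, and only the fields visible in the schema above are checked.

```python
import json

ALLOWED_LETTERS = {"A", "B", "C"}


def validate_mcqa(raw: str) -> list:
    """Parse generator output and check it against the fields shown in the schema."""
    # json.loads rejects trailing commas and any text outside the JSON value.
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("qa"), list):
        raise ValueError('top level must be an object with a "qa" list')
    for item in data["qa"]:
        if not isinstance(item, dict):
            raise ValueError("each entry must be an object")
        if not isinstance(item.get("id"), str) or not isinstance(item.get("question"), str):
            raise ValueError("each entry needs string 'id' and 'question' fields")
        options = item.get("options")
        if not isinstance(options, dict) or set(options) != ALLOWED_LETTERS:
            raise ValueError("options must be exactly A, B, C")
        answer = item.get("correct_answer")
        if not isinstance(answer, str) or answer not in ALLOWED_LETTERS:
            raise ValueError("correct_answer must be one of A, B, C")
    return data["qa"]


# Hypothetical sample output with a single question.
sample = json.dumps({
    "qa": [{
        "id": "Q1",
        "question": "Where did the user first meet Pino?",
        "options": {"A": "At the riverside", "B": "At Busan Station", "C": "At a cafe"},
        "correct_answer": "B",
    }]
})
questions = validate_mcqa(sample)
print(len(questions))
```

Since `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` alone to handle both syntax and schema failures.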


[28]

Read the description carefully.


[29]

For each question, choose the single best option:
• If one of A/B/C is explicitly or clearly supported by the description, choose that option.
• If none of A/B/C can be confirmed from the description, choose D.


[30]

You must ignore any information that is not in the description.


[31]

For each question:
• You may briefly explain your reasoning in natural language.
• Then, on a separate line, output the final choice in the exact format:

[Required output format]
Answer:\boxed{X}
where X is one of A, B, C, or D. Inside \boxed{} there must be exactly one letter, with no extra text.

[Given]
• [Description] {Generated caption}
• [Question] {Pre-define...
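Because the judge may explain its reasoning before the final line, the choice has to be extracted from free text. The following is a minimal sketch of such an extractor; the function name `extract_choice` is hypothetical, and tolerating optional whitespace after `Answer:` is an assumption, since the spacing in the required format may have been lost in extraction.

```python
import re
from typing import Optional

# Exactly one letter A-D inside \boxed{}, preceded by "Answer:".
BOXED_ANSWER = re.compile(r"Answer:\s*\\boxed\{([ABCD])\}")


def extract_choice(response: str) -> Optional[str]:
    """Return the last well-formed boxed choice, or None if there is none."""
    matches = BOXED_ANSWER.findall(response)
    return matches[-1] if matches else None


# Hypothetical judge response: brief reasoning, then the final line.
response = "The description mentions the riverside.\nAnswer:\\boxed{B}"
choice = extract_choice(response)
print(choice)
```

Taking the last match is a design choice for responses that restate the format while reasoning; a response with extra text inside the braces yields `None` and can be treated as unanswered.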


[32]

The person’s name ({name}) must be used consistently throughout the dialogue.
• Do not invent, alter, or omit the name.


[33]

The user must describe a personal experience related to {name}.
• The experience must include at least one concrete event or situation (e.g., bumping into them, having a short conversation, noticing what they were doing).
• It should include at least one sensory or situational detail that makes the memory feel realistic.


[34]

The user must explicitly mention both the date and the place:
• Date: {seen date}
• Place: {seen place}
• Preferably within a single user turn (e.g., “I saw them on {seen date} at {seen place}...”).


[35]

The model should respond naturally and empathetically, acknowledging the user’s experience or asking gentle follow-up questions.
• Do not introduce new factual information beyond what the user provides.


[36]

Keep the tone calm, realistic, and human-like. Avoid encyclopedic or factual descriptions.


[37]

The conversation must have exactly 6 turns in total: User→Model→User→Model→User→Model.

[Output Format]
Dialogue:
User: ...
Model: ...
User: ...
Model: ...
User: ...
Model: ...

Table S.17. Prompt visualization used for diagnostic downstream tasks.

Personalized Image Understanding Prompt: You a...


[38]

This looks like Pino again, now indoors instead of the park near Busan Station

Recall and reuse details from the previous dialogues (object names, appearances, places, times, and relationships).
• Treat the previous dialogues as your long-term memory.
• If an object in the new image appears similar to one mentioned in the past, refer to it with the same name and contextual background.
2. Ground your understanding in the new image’s vi...


[39]

Keep your tone natural and human-like — as if you are interpreting something familiar to the same user


[40]

Do not restate previous dialogues verbatim. Instead, synthesize memory with the current image content.


[41]

Oh wait, I think I left my wallet at the Guess store, so I’m going back to check

Write in paragraph form, not in a dialogue format.
6. Use only relevant memories.
• If an object or context from past dialogues does not appear in the new image, ignore it completely.
• Add contextual information only when it helps understanding of what is visible.
• Avoid mentioning unrelated names, locations, or experiences.
...


[42]

A question asked to the model.


[43]

A ground-truth reference answer (GT).


[44]

A generated response from the model.

Your task is to decide whether the generated response is Correct or Wrong.

Evaluation Criteria:
• The generated response is Correct if it semantically includes the core information conveyed by the ground-truth reference.
• The wording does NOT need to match exactly. Paraphrases, rephrasings, or additional details are allowed...
