Personalization Toolkit: Training Free Personalization of Large Vision Language Models

Daniel Olmeda Reino; Fabien Despinoy; Matteo Cassinelli; Rahaf Aljundi; Soroush Seifi; Vaggelis Dorovatas

arxiv: 2502.02452 · v4 · submitted 2025-02-04 · 💻 cs.CV

Soroush Seifi, Vaggelis Dorovatas, Matteo Cassinelli, Fabien Despinoy, Daniel Olmeda Reino, Rahaf Aljundi

Pith reviewed 2026-05-23 03:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords LVLM personalization · training-free methods · multi-concept personalization · vision foundation models · retrieval-augmented generation · visual prompting · image and video personalization · personalization benchmark

The pith

Training-free toolkit personalizes large vision-language models for multiple concepts in images and videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to customize large vision-language models to recognize specific objects or users without any training step for each new item. It extracts features using existing vision foundation models, retrieves matching instances from the input via retrieval-augmented generation, and steers outputs with visual prompts. The toolkit handles several concepts at once and processes both still images and video sequences. The authors also release a new benchmark focused on realistic multi-concept cases. If the approach works, personalization moves from a slow per-item training process to an immediate, reusable capability.

Core claim

The paper claims that its model-agnostic vision toolkit enables efficient and flexible multi-concept personalization of LVLMs across images and videos without additional training. It achieves this by using pre-trained vision foundation models to extract distinctive features, retrieval-augmented generation to identify instances within visual inputs, and visual prompting strategies to guide model outputs, while also introducing a comprehensive real-world benchmark that evaluates these aspects beyond single-concept object-centric tests, and reports state-of-the-art results that surpass existing training-based methods.

What carries the argument

The model-agnostic vision toolkit that extracts distinctive features from pre-trained vision foundation models, identifies instances via retrieval-augmented generation, and guides outputs with visual prompting.
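The retrieve-then-prompt pattern described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine`, `retrieve`, `build_prompt`), the 0.8 threshold, the toy three-dimensional vectors and the prompt wording are all assumptions chosen for illustration. In practice the vectors would come from a vision foundation model, and the marking would be drawn on the image rather than only named in text.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    region_feats: Dict[str, Sequence[float]],
    memory: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Match each candidate region to its most similar stored concept.

    Regions whose best score falls below the threshold are rejected,
    so an image containing no known concept yields an empty list.
    """
    if not memory:
        return []
    matches = []
    for region_id, feat in region_feats.items():
        name, score = max(
            ((n, cosine(feat, ref)) for n, ref in memory.items()),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            matches.append((region_id, name, score))
    return matches


def build_prompt(matches: List[Tuple[str, str, float]], question: str) -> str:
    """Prepend the retrieved names to the user question (hypothetical wording)."""
    hints = "; ".join(f"the object marked {rid} is {name}" for rid, name, _ in matches)
    prefix = f"In this image, {hints}. " if hints else ""
    return prefix + question


# Toy data: two stored concepts, three candidate regions.
memory = {"Alex's bag": [1.0, 0.0, 0.0], "Reynard's chair": [0.0, 1.0, 0.0]}
regions = {"A": [0.9, 0.1, 0.0], "B": [0.1, 0.95, 0.05], "C": [0.3, 0.3, 0.9]}
matches = retrieve(regions, memory)
prompt = build_prompt(matches, "What is next to Alex's bag?")
```

Because every region is compared against every stored concept in a single loop, several concepts can be recognized in the same image, and region C is rejected because it resembles neither stored concept.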

If this is right

Multi-concept personalization becomes possible in one forward pass without per-item training.
The same toolkit applies to both image and video inputs.
The method remains compatible with different underlying large vision-language models.
Performance exceeds that of prior approaches that require training.
A new benchmark now exists for testing personalization under realistic multi-concept conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-time consumer applications such as personal photo or video assistants could adopt personalization at scale because no retraining is needed.
The retrieval-plus-prompting pattern might transfer to other input types like audio clips for cross-modal personalization.
Testing on inputs with heavy occlusion or rapid motion would reveal whether the current feature extraction step remains stable outside the reported benchmark.

Load-bearing premise

Pre-trained vision foundation models can reliably extract features distinctive enough to identify and retrieve specific instances accurately in complex real-world scenes with multiple concepts.

What would settle it

Running the toolkit on a dataset of crowded scenes containing many visually similar objects and checking whether instance identification accuracy drops below usable levels would test the feature extraction premise.
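A check of this kind could be scored roughly as follows. The sketch assumes each test image yields a pair (predicted present, actually present) for the queried concept, and that images are grouped by the number of visually similar distractors they contain. The grouping scheme, the function names and the toy outcomes are illustrative assumptions, not data or code from the paper.

```python
from typing import Dict, Iterable, List, Tuple

Outcome = Tuple[bool, bool]  # (predicted_present, actually_present)


def precision_recall(outcomes: Iterable[Outcome]) -> Tuple[float, float]:
    """Precision and recall of instance identification over a set of images."""
    tp = fp = fn = 0
    for predicted, actual in outcomes:
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def report_by_clutter(
    buckets: Dict[int, List[Outcome]],
) -> Dict[int, Tuple[float, float]]:
    """Score each bucket, keyed by the number of similar distractors per scene."""
    return {k: precision_recall(v) for k, v in sorted(buckets.items())}


# Made-up outcomes for illustration only (not results from the paper).
toy = {
    0: [(True, True), (False, False), (True, True), (False, False)],
    8: [(True, True), (True, False), (False, True), (False, False)],
}
report = report_by_clutter(toy)
```

A sharp fall in precision as the distractor count grows would point to the feature extraction premise failing, while stable scores would support it.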

Figures

Figures reproduced from arXiv: 2502.02452 by Daniel Olmeda Reino, Fabien Despinoy, Matteo Cassinelli, Rahaf Aljundi, Soroush Seifi, Vaggelis Dorovatas.

**Figure 1.** Illustration of the personalization task and our PeKit. A reference image is introduced to the LVLM with information and possible context. The LVLM should later be able to answer questions about the introduced object using only the name of the object in the query. Our approach, PeKit, extracts patch-level features from the reference image and stores them in a memory module, M. During personalized inference…
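The memory module M mentioned in the Figure 1 caption can be pictured with a small sketch. The caption only says that patch-level features are stored; pooling the masked patches into a single mean vector, the class names `Memory` and `MemoryEntry`, and the example context string are assumptions made here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class MemoryEntry:
    name: str              # identifier used later in queries
    context: str           # optional free-text information about the concept
    embedding: List[float]


@dataclass
class Memory:
    entries: Dict[str, MemoryEntry] = field(default_factory=dict)

    def add(
        self,
        name: str,
        context: str,
        patch_feats: Sequence[Sequence[float]],
        mask: Sequence[bool],
    ) -> None:
        """Store the mean of the patch features that fall inside the object mask.

        Mean pooling is an assumption; the source does not state how the
        patch features are aggregated.
        """
        selected = [f for f, keep in zip(patch_feats, mask) if keep]
        if not selected:
            raise ValueError("mask selects no patches")
        dim = len(selected[0])
        mean = [sum(f[i] for f in selected) / len(selected) for i in range(dim)]
        self.entries[name] = MemoryEntry(name, context, mean)


# Toy reference view: three patches, the last one lies outside the mask.
m = Memory()
m.add(
    "Reynard's Work Chair",
    "hypothetical context text supplied by the user",
    patch_feats=[[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]],
    mask=[True, True, False],
)
```

Adding a new concept is a dictionary write, which is what makes this style of personalization training-free: no model weights change when an object is introduced.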

**Figure 2.** Our proposed evaluation set This-Is-My-Img, built on the This-Is-My dataset [30]: Example reference views and validation samples from the single-concept category Reynard’s Work Chair and the multi-concept category Nikki-Nikki’s Car. Faces are blurred to ensure compliance with GDPR. MyVLM [2] dataset consists of 29 object categories and Yo’LLaVA [20] dataset includes 40 categories of objects, buildings, an…

**Figure 3.** Qualitative Results: PeKit handles a range of personalization tasks, encompassing both single- and multi-concept personalization in images and videos. For video personalization, the VLM can reliably track the target object across frames using only a few confidently annotated instances. One representative frame is shown per scene. Faces are blurred to ensure compliance with GDPR.

**Figure 4.** Ablation on N: Average weighted visual recognition accuracy as a function of the number of reference images. Increasing the number of reference images improves performance, but PeKit is robust with just one reference image. [Plot labels: x-axis Number of Reference Images (1 to 4); y-axis Weighted Accuracy; series MyVLM dataset and Yo'LLaVA dataset.]

**Figure 5.** Performance of PeKit on the first 10 objects from the Yo’LLaVA dataset as the number of personalized objects increases incrementally from 10 to all 40 categories. While there is a slight performance drop at high…
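The excerpt does not define the "weighted accuracy" plotted in the Figure 4 ablation. A common convention in recognition benchmarks with unbalanced positive and negative sets is to average the accuracy on each set with fixed weights, and the sketch below assumes that convention with equal weights. The function name, the 0.5 default and the example counts are assumptions, not values from the paper.

```python
def weighted_accuracy(
    pos_correct: int,
    pos_total: int,
    neg_correct: int,
    neg_total: int,
    pos_weight: float = 0.5,  # assumed equal weighting of the two sets
) -> float:
    """Weighted mean of accuracy on positive and negative validation images."""
    if pos_total <= 0 or neg_total <= 0:
        raise ValueError("both validation sets must be non-empty")
    if not 0.0 <= pos_weight <= 1.0:
        raise ValueError("pos_weight must lie in [0, 1]")
    pos_acc = pos_correct / pos_total
    neg_acc = neg_correct / neg_total
    return pos_weight * pos_acc + (1.0 - pos_weight) * neg_acc


# Made-up counts: 9 of 10 positives and 80 of 100 negatives classified correctly.
score = weighted_accuracy(9, 10, 80, 100)
```

With these made-up counts the plain accuracy would be 89/110, dominated by the larger negative set, whereas the weighted form gives each set equal say.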

**Figure 6.** This-Is-My-Img Single-concept Benchmark. Our benchmark includes a wide range of concepts presented in realistic indoor and outdoor environments. Reference views can occasionally be sub-optimal, which increases the difficulty of the task. The positive validation set may contain false positives from within the same semantic category, allowing us to assess a model’s robustness to contextual similarities. The …

**Figure 7.** Prompt Format. Personalized VQA and captioning on Yo’LLaVA (Left) and MyVLM (Right) datasets. The context used for the ‘red chicken’ is imaginary and generated by ChatGPT. [Extracted prompt text: Correct Answer: ANSWER Predicted Answer: PREDICTION Provide your evaluation only as a yes/no answer. Please generate the response in the form of a Python dictionary string with key ‘pred’, where value of ‘pred’ is a string of ‘yes’ or ‘n…]
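The prompt text in Figure 7 asks the judging model to reply with a Python dictionary string whose key `pred` holds a yes/no verdict. A reply in that shape could be parsed as in the sketch below. The function name `parse_judge_reply` is hypothetical, and since the caption is cut off after "'yes' or 'n", treating the second allowed value as `'no'` is an assumption based on the "yes/no answer" wording.

```python
import ast


def parse_judge_reply(reply: str) -> bool:
    """Return True for {'pred': 'yes'} and False for {'pred': 'no'}.

    Raises ValueError for anything that is not a dictionary literal
    with a 'pred' key holding 'yes' or 'no' (case-insensitive).
    """
    try:
        # literal_eval only accepts Python literals, so it does not execute code.
        parsed = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"reply is not a Python literal: {reply!r}") from exc
    if not isinstance(parsed, dict) or "pred" not in parsed:
        raise ValueError(f"reply has no 'pred' key: {reply!r}")
    verdict = str(parsed["pred"]).strip().lower()
    if verdict not in {"yes", "no"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict == "yes"


# Example replies in the requested format.
ok = parse_judge_reply("{'pred': 'yes'}")
not_ok = parse_judge_reply(" {'pred': 'No'} ")
```

Using `ast.literal_eval` rather than `eval` keeps a malformed or adversarial model reply from running arbitrary code during evaluation.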

**Figure 8.** Noisy Reference Views: Poor segmentation masks may affect the visual prompting stage and degrade PeKit’s performance. [Panel labels: Reference Views; Blippi's shoes; 14 Patches; 25 Patches; Validation (Fake); False Positive Detection.]

**Figure 9.** Small Reference Objects: The native image resolution (518 × 518) and stride factor (14) of DinoV2 might result in embeddings of small personalized objects, such as Blippi’s shoes, capturing only general attributes, which can increase the likelihood of false positive detections. The incorrect detections are depicted on our proposed Fake validation set. Faces are blurred to ensure compliance with GDPR. PeKit…
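The numbers in the Figure 9 caption make the small-object limitation easy to quantify: a 518 × 518 input split with a stride of 14 gives 518 / 14 = 37 patches per side, or 37 × 37 = 1369 patches in total. The sketch below counts how many of those patches a bounding box touches. It assumes the image is resized to a 518 × 518 square without preserving aspect ratio, which is an assumption about preprocessing rather than something the source states, and the function names are hypothetical.

```python
import math
from typing import Tuple


def patch_grid(image_size: int = 518, patch_size: int = 14) -> int:
    """Number of patches along one side of the square input."""
    return image_size // patch_size


def patches_covered(
    box: Tuple[float, float, float, float],  # (x0, y0, x1, y1) in original pixels
    orig_w: int,
    orig_h: int,
    image_size: int = 518,
    patch_size: int = 14,
) -> int:
    """Count grid patches overlapped by a box after resizing to image_size."""
    sx, sy = image_size / orig_w, image_size / orig_h
    x0, y0, x1, y1 = box
    grid = patch_grid(image_size, patch_size)
    # Convert box edges to patch indices, clamped to the grid.
    c0 = max(0, math.floor(x0 * sx / patch_size))
    c1 = min(grid, math.ceil(x1 * sx / patch_size))
    r0 = max(0, math.floor(y0 * sy / patch_size))
    r1 = min(grid, math.ceil(y1 * sy / patch_size))
    return max(0, c1 - c0) * max(0, r1 - r0)


side = patch_grid()            # 37
total = side * side            # 1369
# A 140 x 140 pixel object in a 1036 x 1036 photo shrinks to 70 x 70 pixels.
small = patches_covered((0, 0, 140, 140), 1036, 1036)
```

In this toy case the object occupies 25 of 1369 patches, under 2% of the grid, which illustrates why an embedding pooled from so few patches may capture only general attributes.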

**Figure 10.** Comparison to LLaVA: Right: Our method detects personalized objects and integrates provided context (for qualitative comparison) in caption generation. Left: While the original model struggles with specific questions about named objects, our method easily identifies the referred object. Faces are blurred to ensure compliance with GDPR.

**Figure 11.** Comparison to MyVLM: MyVLM often misidentifies personalized objects because of its low precision. In the leftmost figure, when prompted to caption an image containing a ‘Cat Statue’ (which is actually absent), MyVLM incorrectly labels the ‘Asian doll’ and the headset as the ‘Cat Statue’ instead of rejecting the query. Additionally, MyVLM training interferes with the original captioning capabilities of the LV…

**Figure 12.** Qualitative Comparison to Yo’LLaVA: Yo’LLaVA’s prompt template requires specifying the personalized object’s identifier in the query (first row), limiting generalization since users must already know which objects are in the image. Using image-level embeddings can also cause confusion between similar objects (e.g., Alex vs. Alex’s bag). Adjusting the LLM’s head weights further harms captioning quality. Pe…

Original abstract

Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users or object instances and to generate contextually tailored responses. Existing approaches rely on time-consuming training for each item, making them impractical for real-world deployment, as reflected in current personalization benchmarks limited to object-centric single-concept evaluations. In this paper, we present a novel training-free approach to LVLM personalization called PeKit. We introduce a comprehensive, real-world benchmark designed to rigorously evaluate various aspects of the personalization task. PeKit leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. Our model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos, without any additional training. We achieve state-of-the-art results, surpassing existing training-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The training-free multi-concept claim is the main draw but the abstract gives zero numbers or protocol details to support the SOTA assertion over trained methods.

The paper's central idea is a training-free toolkit that pulls features from off-the-shelf vision models, uses RAG to spot specific instances in images or videos, and applies visual prompts to steer an LVLM. It also ships a new benchmark meant to cover real-world multi-concept cases instead of the usual single-object setups. That combination addresses a clear deployment friction: no per-item fine-tuning required. The model-agnostic framing and video extension are practical angles worth noting if the retrieval step actually works at instance level. Introducing the benchmark itself is a concrete step forward, since prior ones were too narrow.

The soft spots sit right at the claims. The abstract states SOTA results that beat training-based baselines, yet supplies no metrics, no dataset sizes, no baselines listed, and no evaluation protocol. Without those, the performance assertion cannot be checked. The core assumption, that generic vision encoders already yield features distinctive enough for reliable RAG retrieval amid pose, lighting, and distractor variation, receives no supporting evidence here either. If that assumption fails on complex scenes, both identification and the downstream personalization collapse.

This work is aimed at groups building deployable personalized vision-language systems who need training-free options. A reader who wants to test whether the RAG-plus-prompting pipeline delivers on multi-concept cases would find it relevant once the numbers appear.

The paper deserves a serious referee to examine the full experiments, benchmark construction, and direct comparisons; the idea is grounded enough in a real problem to warrant that step even if revisions are needed on the evaluation side. I would send it to review rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces PeKit, a training-free personalization toolkit for Large Vision-Language Models (LVLMs). It leverages pre-trained vision foundation models to extract features, applies retrieval-augmented generation (RAG) to identify specific instances in visual inputs, and uses visual prompting strategies to guide the model's outputs. The method is presented as model-agnostic and capable of handling multi-concept personalization across both images and videos without any additional training. A new comprehensive real-world benchmark is introduced to evaluate personalization beyond limited single-concept object-centric settings. The authors claim state-of-the-art results that surpass existing training-based methods.

Significance. If the empirical claims hold, the work would be significant by offering an efficient alternative to training-based personalization, addressing practical deployment barriers for LVLMs in multi-concept and video scenarios. The new benchmark for rigorous multi-concept evaluation is a constructive addition to the field. Credit is due for focusing on training-free operation and extending beyond single-concept limits. However, the significance is tempered by the need for strong evidence supporting the core assumption that off-the-shelf vision models suffice for instance-level tasks.

major comments (2)

[Abstract] Abstract: The assertion of achieving state-of-the-art results surpassing training-based methods is presented without any metrics, baselines, dataset details, or evaluation protocol. This is load-bearing for the central claim, as the soundness of the SOTA assertion cannot be assessed from the provided information.
[Method] Method section: The approach relies entirely on pre-trained vision foundation models for feature extraction and RAG-based instance identification without any instance-specific adaptation or fine-tuning. No ablation or analysis demonstrates that these features remain sufficiently distinctive for accurate retrieval in complex multi-concept scenes under pose, lighting, or occlusion variations, which directly underpins the training-free claim and its superiority over adapted methods.

minor comments (1)

[Method] The paper could clarify the exact visual prompting strategies and how they integrate with RAG outputs for multi-concept cases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the presentation can be strengthened to better support the central claims. We will revise the manuscript to address both points.

Referee: [Abstract] Abstract: The assertion of achieving state-of-the-art results surpassing training-based methods is presented without any metrics, baselines, dataset details, or evaluation protocol. This is load-bearing for the central claim, as the soundness of the SOTA assertion cannot be assessed from the provided information.

Authors: We agree that the abstract would benefit from greater specificity to allow immediate assessment of the SOTA claim. The full paper contains quantitative results on the new benchmark with explicit baselines and protocols. In the revision we will expand the abstract to include key performance metrics and a brief description of the evaluation setting. revision: yes

Referee: [Method] Method section: The approach relies entirely on pre-trained vision foundation models for feature extraction and RAG-based instance identification without any instance-specific adaptation or fine-tuning. No ablation or analysis demonstrates that these features remain sufficiently distinctive for accurate retrieval in complex multi-concept scenes under pose, lighting, or occlusion variations, which directly underpins the training-free claim and its superiority over adapted methods.

Authors: The new benchmark explicitly incorporates real-world multi-concept scenes that include pose, lighting, and occlusion variations, and the reported results demonstrate effective instance retrieval under these conditions. We acknowledge that an explicit ablation isolating feature robustness would strengthen the argument. We will add such an ablation study in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: method relies on external pre-trained models and RAG without self-referential derivations or fits

The paper describes a training-free pipeline that extracts features from off-the-shelf vision foundation models, applies standard RAG for instance retrieval, and uses visual prompting. No equations, parameter fitting, or derivations appear in the provided text. The central claim (SOTA multi-concept personalization) is an empirical assertion about the toolkit's performance on a new benchmark, not a mathematical reduction to its own inputs. Self-citations are not load-bearing for any uniqueness theorem or ansatz. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit parameters, axioms, or new entities; all components are described as leveraging pre-existing models and techniques.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Personal Visual Context Learning in Large Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1] Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Kartik Khandelwal, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Mar...

[2] Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, and Daniel Cohen-Or. MyVLM: Personalizing VLMs for user-specific queries. arXiv preprint arXiv:2403.14599, 2024.

[3] Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. ViP-LLaVA: Making large multimodal models understand arbitrary visual prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12914–12923, 2024.

[4] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.

[5] Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. YOLO-World: Real-time open-vocabulary object detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2024.

[6] Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleš Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021.

[7] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, 2024.

[8] Yikun Han, Chunjiang Liu, and Pengfei Wang. A comprehensive survey on vector database: Storage and retrieval technique, challenge. arXiv preprint arXiv:2310.11703, 2023.

[9] Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, and Xiangyu Yue. Remember, retrieve and generate: Understanding infinite visual concepts as your personalized assistant, 2024.

[10] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.

[11] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.

[12] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

[13] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-SAM: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023.

[14] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask DINO: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3041–3050, 2023.

[15] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.

[16] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2023.

[17] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2023.

[18] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.

[19] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.

[20] Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, and Yong Jae Lee. Yo’LLaVA: Your personalized language and vision assistant. arXiv preprint arXiv:2406.09400, 2024.

[21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[22] Chau Pham, Hoang Phan, David Doermann, and Yunjie Tian. Personalized large vision-language models, 2024.

[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[24] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.

[25] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510.

[26] Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does CLIP know about a red circle? Visual prompt engineering for VLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997.

[27] David Wan, Jaemin Cho, Elias Stengel-Eskin, and Mohit Bansal. Contrastive region guidance: Improving grounding in vision-language models without training. In ECCV, 2024.

[28] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[29] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441.

[30] Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, and Simon Jenni. Meta-personalizing vision-language models to find named instances in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19123–19132, 2023.

[31] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

[32] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. Advances in Neural Information Processing Systems, 36, 2024.

A. Ablation. In this section we ablate various aspects of our personalization method primarily using the Yo’LLaVA dataset as it se...

[1] [1]

Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Moni- cault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Am´elie H´eliou, Paul Jacob, Albert Q. Jiang, Kartik Khandelwal, Timoth ´ee Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Mar...

work page 2024

[2] [2]

Myvlm: Per- sonalizing vlms for user-specific queries

Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, and Daniel Cohen-Or. Myvlm: Per- sonalizing vlms for user-specific queries. arXiv preprint arXiv:2403.14599, 2024. 1, 2, 3, 4, 5, 6, 7, 18

work page arXiv 2024

[3] [3]

Vip-llava: Making large multi- modal models understand arbitrary visual prompts

Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. Vip-llava: Making large multi- modal models understand arbitrary visual prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 12914–12923, 2024. 3

work page 2024

[4] [4]

Internvl: Scal- ing up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scal- ing up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 1, 6

work page 2024

[5] [5]

Yolo-world: Real- time open-vocabulary object detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo-world: Real- time open-vocabulary object detection. In Proc. IEEE Conf. Computer Vision and Pattern Recog- nition (CVPR), 2024. 12

work page 2024

[6] [6]

A continual learn- ing survey: Defying forgetting in classification tasks

Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ale ˇs Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learn- ing survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021. 2

[7] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1107–1128, 2024. 1

[8] Yikun Han, Chunjiang Liu, and Pengfei Wang. A comprehensive survey on vector database: Storage and retrieval technique, challenge. arXiv preprint arXiv:2310.11703, 2023. 3

[9] Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, and Xiangyu Yue. Remember, retrieve and generate: Understanding infinite visual concepts as your personalized assistant, 2024. 2

[10] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021. 7

[11] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. 1, 3, 12

[12] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020. 1

[13] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023. 3

[14] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3041–3050, 2023. 3

[15] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023. 1

[16] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2023. 6

[17] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2023. 1, 3, 16

[18] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 1, 4, 6, 12

[19] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 6, 15

[20] Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, and Yong Jae Lee. Yo’llava: Your personalized language and vision assistant. arXiv preprint arXiv:2406.09400, 2024. 1, 2, 3, 4, 5, 6, 7, 11, 18

[21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 2, 4

[22] Chau Pham, Hoang Phan, David Doermann, and Yunjie Tian. Personalized large vision-language models, 2024. 2

[23] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 4

[24] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024. 4, 6

[25] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510,

[26] Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11987–11997,

[27] David Wan, Jaemin Cho, Elias Stengel-Eskin, and Mohit Bansal. Contrastive region guidance: Improving grounding in vision-language models without training. In ECCV, 2024. 3

[28] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1

[29] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441,

[30] Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, and Simon Jenni. Meta-personalizing vision-language models to find named instances in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19123–19132, 2023. 5, 13

[31] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023. 1

[32] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. Advances in Neural Information Processing Systems, 36, 2024. 3

A. Ablation

In this section we ablate various aspects of our personalization method primarily using the Yo’LLaVA dataset as it se...
