MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation

Bartosz Kotrys; Dominik Michels; Lennart Petersen; Patryk Bartkowiak; Soren Pirk; Wojtek Palubicki

arXiv: 2605.22469 · v1 · pith:5MQ2KVDO · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 06:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords: concept preservation · evaluation metric · text-to-image generation · masked similarity · personalization · diffusion models · image embeddings · subject-background separation

The pith

MaSC evaluates concept-driven image generation by using foreground masks to separate the subject from the background, so that identity preservation and prompt following are scored separately.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MaSC to fix a core flaw in how we score text-to-image models that personalize a single concept. Global metrics such as CLIP-I or DINO look at the whole picture and therefore mix subject identity with background content, producing scores that match human judgments poorly. MaSC instead requires foreground masks for both the reference and the generated image, then splits the evaluation: concept preservation is scored by matching masked patches from the reference subject to patches in the output, while prompt following is scored by comparing a background-only image embedding against a prompt with the subject removed. On human-rated data this yields higher agreement than prior non-LLM baselines and even GPT-4V, and on a real-photo identity benchmark it separates same-subject from cross-subject pairs almost perfectly.

Core claim

MaSC is a masked similarity metric that decomposes evaluation into subject-specific concept preservation measured by masked max-cosine matching between foreground reference patches and generated-image patches, and prompt following measured by comparing a background-only pooled image embedding to a subject-stripped prompt embedding, both computed from frozen SigLIP2 SO400M-NaFlex features.
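
A minimal sketch of the concept-preservation branch, assuming patch features have already been extracted and are held as plain lists of floats. The function name `masked_maxcos`, the mean aggregation over foreground reference patches, and the optional `gen_keep` restriction are illustrative assumptions; the source states only that foreground reference patches are matched to generated-image patches by max-cosine.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def masked_maxcos(ref_patches, gen_patches, ref_fg, gen_keep=None):
    """Average, over foreground reference patches, of the best cosine match
    among the candidate generated-image patches.

    ref_patches, gen_patches: lists of D-dimensional patch features.
    ref_fg: list of bools, True where the reference patch is foreground.
    gen_keep: optional list of bools restricting the candidate generated
              patches (assumption: the source does not say whether the
              candidates are limited to the generated foreground mask).
    """
    candidates = [g for j, g in enumerate(gen_patches)
                  if gen_keep is None or gen_keep[j]]
    queries = [r for r, is_fg in zip(ref_patches, ref_fg) if is_fg]
    if not queries or not candidates:
        return 0.0
    # Mean aggregation is an assumption made for this sketch.
    best = [max(cosine(r, g) for g in candidates) for r in queries]
    return sum(best) / len(best)

# Toy 2-D features (made up for illustration, not real SigLIP2 outputs).
ref_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_fg = [True, True, False]          # third reference patch is background
gen_patches = [[2.0, 0.0], [1.0, 1.0]]
score_all = masked_maxcos(ref_patches, gen_patches, ref_fg)
score_restricted = masked_maxcos(ref_patches, gen_patches, ref_fg,
                                 gen_keep=[True, False])
```

Because each reference patch takes its maximum over candidates, the score does not depend on where the subject appears in the generated image, only on whether a similar patch exists somewhere among the candidates.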

What carries the argument

Externally supplied foreground concept masks that enable separate masked patch matching for concept preservation and background-only embedding comparison for prompt following.
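
The background-only branch can be sketched in the same spirit. The version below assumes that patch features have already been projected into the joint image-text space and that the subject-stripped prompt embedding is supplied by the text tower; both are assumptions, as is the use of simple mean pooling. The names `background_pool` and `prompt_following` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def background_pool(patches, fg_mask):
    """Mean-pool patch features over background (non-foreground) patches.

    Returns None when the mask leaves no background patch to pool.
    Mean pooling is an assumption; the source only says "pooled".
    """
    bg = [p for p, is_fg in zip(patches, fg_mask) if not is_fg]
    if not bg:
        return None
    dim = len(bg[0])
    return [sum(p[d] for p in bg) / len(bg) for d in range(dim)]

def prompt_following(gen_patches, gen_fg, stripped_prompt_emb):
    """Cosine between the background-only image embedding and the embedding
    of the prompt with the subject phrase removed (supplied by the caller)."""
    pooled = background_pool(gen_patches, gen_fg)
    if pooled is None:
        return 0.0
    return cosine(pooled, stripped_prompt_emb)

# Toy example: the first patch is the subject and is excluded from pooling.
gen_patches = [[1.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
gen_fg = [True, False, False]
pf_match = prompt_following(gen_patches, gen_fg, [0.0, 1.0])
pf_mismatch = prompt_following(gen_patches, gen_fg, [1.0, 0.0])
```

Excluding the subject patch is what keeps the score from rewarding or penalizing subject appearance: in the toy example the subject feature points along the first axis, yet it has no effect on either result.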

If this is right

Evaluation pipelines for personalization models can replace whole-image embeddings with spatially decomposed scores.
Prompt-following scores become independent of subject appearance, allowing cleaner ablation of generation components.
Identity benchmarks that provide masks can achieve near-perfect discrimination between same-subject and cross-subject pairs.
Development cycles for concept-driven generators can rely on an automatic metric that tracks human perception more closely than existing baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If masks can be generated reliably by an auxiliary model, MaSC could be applied to new domains without manual annotation.
The same decomposition principle might improve evaluation of multi-subject or video personalization tasks.
Training objectives could be designed to maximize the same masked similarity signals that MaSC uses at test time.

Load-bearing premise

Accurate foreground concept masks must be supplied for every reference and generated image to isolate the subject from the background.

What would settle it

On a held-out set of human ratings for concept preservation, MaSC correlation with humans falls below that of a simple global CLIP-I baseline.

Figures

Figures reproduced from arXiv: 2605.22469 by Bartosz Kotrys, Dominik Michels, Lennart Petersen, Patryk Bartkowiak, Soren Pirk, Wojtek Palubicki.

**Figure 1.** MaSC pipeline overview. Reference and generated images are encoded once each by a frozen SigLIP2 SO400M-NaFlex vision tower, producing patch-token grids $R, G \in \mathbb{R}^{N \times D}$. MaSC consumes provided foreground concept masks, denoted $M^F_R, M^F_G \in \{0, 1\}^{H \times W}$. Two branches share the single forward pass. Branch 1, Concept Preservation (masked-maxcos): for each patch covered by the reference foreground mask ($M^F_R$), …
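
The caption gives masks at pixel resolution (H×W) while features live on a grid of N patch tokens, so some reduction from pixels to patches is needed. The source does not describe it; the sketch below shows one plausible choice, marking a patch as foreground when at least half of its pixels are. The function name, the row-major layout, and the 0.5 threshold are all assumptions.

```python
def pixel_mask_to_patch_mask(mask, grid_h, grid_w, threshold=0.5):
    """Reduce an H x W binary pixel mask to grid_h * grid_w patch flags.

    mask: nested list of 0/1 values with H rows and W columns.
    Returns a flat row-major list of bools, True where the fraction of
    foreground pixels inside the patch cell is >= threshold (assumed rule).
    """
    height, width = len(mask), len(mask[0])
    if grid_h > height or grid_w > width:
        raise ValueError("patch grid is finer than the pixel mask")
    flags = []
    for gy in range(grid_h):
        # Integer cell boundaries also cover sizes not divisible by the grid.
        y0, y1 = gy * height // grid_h, (gy + 1) * height // grid_h
        for gx in range(grid_w):
            x0, x1 = gx * width // grid_w, (gx + 1) * width // grid_w
            cell = [mask[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            flags.append(sum(cell) / len(cell) >= threshold)
    return flags

pixel_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
patch_fg = pixel_mask_to_patch_mask(pixel_mask, 2, 2)
# patch_fg == [True, False, False, False]
```

A lower threshold keeps more boundary patches in the foreground set; which setting suits the metric best is not something the source addresses.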

**Figure 2.** ORIDa CP score distributions per metric: within-subject pairs vs. cross-subject pairs; short horizontal lines mark the mean. Our proposed metric (MaSC) is highlighted with a shaded background. This figure visualizes the tail-overlap behavior that drives the AUC rankings in …
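
For readers unfamiliar with how overlapping score distributions turn into an AUC, the pairwise (Mann-Whitney) definition is sketched below: AUC is the probability that a randomly chosen within-subject pair scores higher than a randomly chosen cross-subject pair. The scores are made up for illustration and are not taken from the paper.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties contribute 0.5. This is the O(n * m) pairwise definition, which is
    fine for small illustrations; rank-based formulas scale better.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical CP scores: one cross-subject pair (0.75) overlaps the
# within-subject range, which is what pulls the AUC below 1.0.
within_subject = [0.9, 0.8, 0.7]
cross_subject = [0.75, 0.4, 0.3]
example_auc = auc(within_subject, cross_subject)
```

Only the overlapping tails cost anything: every cross-subject score that sits above some within-subject score removes one winning pair from the numerator.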

Original abstract

Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals using global image or text-image embeddings, such as CLIP-I, DINO, and CLIP-T. We show that such metrics correlate poorly with human perception because they attend to the image as a whole instead of separating the concept subject from the background. We introduce MaSC, a masked similarity metric that uses externally provided foreground concept masks to decompose evaluation into subject-specific concept preservation and background-based prompt following. MaSC computes both scores from frozen SigLIP2 SO400M-NaFlex features: concept preservation is measured by masked max-cosine matching between foreground reference patches and generated-image patches, while prompt following is measured by comparing a background-only pooled image embedding to a subject-stripped prompt embedding. On DreamBench++ human ratings, MaSC achieves Krippendorff alpha = 0.471 for concept preservation, outperforming all tested non-LLM baselines and GPT-4V, and approaching GPT-4o. On ORIDa, a real-photo identity-preservation benchmark across physical environments, MaSC achieves AUC = 0.992, nearly perfectly distinguishing same-subject from cross-subject pairs. Its prompt-following score also outperforms the CLIP-T baseline shipped with DreamBench++. These results show that spatially decomposed aggregation is a strong design principle for evaluating concept-driven generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MaSC separates subject and background with masked SigLIP matching and shows stronger human correlation than global baselines, but the gains rest on untested mask accuracy for generated images.

The main point is that MaSC decomposes evaluation into foreground concept preservation via masked max-cosine on SigLIP2 patches and background-only pooling for prompt following. This is a direct response to the weakness of CLIP-I, DINO, and CLIP-T, which mix subject and scene in one embedding. The paper reports Krippendorff alpha of 0.471 on DreamBench++ human ratings, beating the non-LLM baselines and GPT-4V while nearing GPT-4o, plus an AUC of 0.992 on ORIDa for same-subject versus cross-subject distinction. Those numbers are the concrete evidence it offers.

Referee Report

2 major / 2 minor

Summary. The paper introduces MaSC, a masked similarity metric for evaluating single-concept personalization in text-to-image diffusion models. MaSC decomposes evaluation using externally provided foreground concept masks: concept preservation is computed via masked max-cosine matching on foreground patches from frozen SigLIP2 SO400M-NaFlex features, while prompt following uses background-only pooled embeddings compared against subject-stripped prompt embeddings. On DreamBench++ it reports Krippendorff alpha = 0.471 for concept preservation (outperforming non-LLM baselines and GPT-4V, approaching GPT-4o) and on ORIDa it reports AUC = 0.992 for distinguishing same-subject from cross-subject pairs, with prompt-following scores also exceeding the CLIP-T baseline.

Significance. If the empirical results hold under scrutiny, MaSC offers a principled improvement over global embedding metrics (CLIP-I, DINO, CLIP-T) by explicitly separating subject and background signals. The reliance on frozen external features is a strength for reproducibility, and the reported correlations with human ratings plus near-perfect AUC on a real-photo benchmark suggest the spatial decomposition principle could become a useful design choice for future evaluation protocols in concept-driven generation.

major comments (2)

[Abstract] Paragraph on the MaSC definition: the central claim, that spatially decomposed aggregation yields superior human correlation because global embeddings attend to the whole image, rests on the assumption that externally supplied foreground masks for generated images are both accurate and concept-specific. No quantification of mask error rates or re-evaluation under controlled mask noise is described, so it remains unclear whether the reported advantages over CLIP-I/DINO (Krippendorff alpha = 0.471, AUC = 0.992) stem from the spatial decomposition itself or are artifacts of mask quality.
[DreamBench++ evaluation] The Krippendorff alpha = 0.471 for concept preservation is presented as outperforming all tested non-LLM baselines, yet without an ablation that perturbs the provided masks (e.g., boundary dilation or random dropout of foreground patches) it is impossible to confirm that the masked max-cosine operator, rather than a noisy global comparison, is the operative factor.

minor comments (2)

The description of background-only pooling for the prompt-following score would benefit from an explicit equation showing how the subject-stripped prompt embedding is formed and how the background mask is applied to the image embedding.
Consider adding a short reproducibility note on whether the externally provided masks for generated images are released alongside the evaluation code.
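
To make the first minor comment concrete, one plausible form of the requested equation is given below. It is a reading of the abstract, not the authors' formula: $g_j$ denotes the $j$-th generated patch token, $m^{G}_j \in \{0, 1\}$ a patch-level foreground indicator derived from $M^F_G$, $\operatorname{pool}$ an unspecified pooling operator, $E_{\mathrm{text}}$ the text tower, and $p_{\setminus s}$ the prompt $p$ with the subject phrase $s$ removed. All of these symbols except $G$ and $M^F_G$ are assumptions introduced here.

```latex
\[
\mathrm{PF}(G, p) \;=\;
\cos\!\Big(
  \operatorname{pool}\big(\{\, g_j : m^{G}_j = 0 \,\}\big),\;
  E_{\mathrm{text}}\big(p_{\setminus s}\big)
\Big)
\]
```

An explicit statement of this kind would also settle whether pooling happens before or after projection into the joint image-text space, which the abstract leaves open.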

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address the two major comments point by point below, agreeing that additional analysis on mask robustness would strengthen the paper. We will incorporate the suggested ablations in the revised version.

Referee: [Abstract] Paragraph on the MaSC definition: the central claim, that spatially decomposed aggregation yields superior human correlation because global embeddings attend to the whole image, rests on the assumption that externally supplied foreground masks for generated images are both accurate and concept-specific. No quantification of mask error rates or re-evaluation under controlled mask noise is described, so it remains unclear whether the reported advantages over CLIP-I/DINO (Krippendorff alpha = 0.471, AUC = 0.992) stem from the spatial decomposition itself or are artifacts of mask quality.

Authors: We agree that MaSC's advantages depend on reasonable mask quality and that the manuscript would benefit from explicit quantification of this dependence. The masks used are those externally supplied with the DreamBench++ and ORIDa benchmarks, which are constructed to isolate the target concept. To directly address the concern, we will add a new ablation subsection that applies controlled perturbations to the provided masks (boundary dilation of 5 and 10 pixels, plus random dropout of 10% and 20% of foreground patches) and re-reports both Krippendorff alpha on DreamBench++ and AUC on ORIDa under these conditions. This will demonstrate that the masked max-cosine matching continues to outperform global baselines under moderate noise while degrading predictably as mask fidelity drops, thereby isolating the contribution of spatial decomposition. revision: yes
Referee: [DreamBench++ evaluation] The Krippendorff alpha = 0.471 for concept preservation is presented as outperforming all tested non-LLM baselines, yet without an ablation that perturbs the provided masks (e.g., boundary dilation or random dropout of foreground patches) it is impossible to confirm that the masked max-cosine operator, rather than a noisy global comparison, is the operative factor.

Authors: We concur that an explicit perturbation ablation on DreamBench++ would more convincingly isolate the masked max-cosine operator from potential mask artifacts. As described in the response to the abstract comment, the revised manuscript will include this analysis: we will re-evaluate MaSC after applying the suggested mask perturbations and show that its human correlation advantage over CLIP-I and DINO persists for moderate noise levels. We will also report how performance changes as a function of increasing mask error, providing quantitative support that the spatial decomposition itself drives the reported gains. revision: yes
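
The two perturbations named in the exchange above, boundary dilation by a fixed number of pixels and random dropout of a fraction of foreground patches, can be sketched as follows. This is an illustration of the kind of ablation discussed, not the authors' code: the square dilation window, the function names, and the fixed seed are assumptions.

```python
import random

def dilate(mask, radius):
    """Binary dilation of an H x W 0/1 mask with a square window.

    Every foreground pixel turns on all pixels within `radius` rows and
    columns of it (Chebyshev distance), clipped at the image border.
    """
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[yy][xx] = 1
    return out

def drop_foreground(patch_mask, fraction, seed=0):
    """Flip a random fraction of foreground patches to background.

    patch_mask: flat list of truthy/falsy patch flags.
    The seed is fixed so that repeated runs perturb the same patches.
    """
    rng = random.Random(seed)
    fg = [i for i, m in enumerate(patch_mask) if m]
    dropped = set(rng.sample(fg, round(fraction * len(fg))))
    return [bool(m) and i not in dropped for i, m in enumerate(patch_mask)]

# A single foreground pixel in the middle of a 5 x 5 mask.
center = [[0] * 5 for _ in range(5)]
center[2][2] = 1
dilated = dilate(center, 1)                   # grows to a 3 x 3 block
thinned = drop_foreground([True] * 10, 0.2)   # 8 of 10 patches survive
```

Re-scoring with `dilated` or `thinned` masks in place of the originals and re-computing agreement with human ratings would give the degradation curve the referee asks for.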

Circularity Check

0 steps flagged

MaSC metric is a direct design choice using external masks and frozen features; no derivation reduces to inputs by construction

The paper defines MaSC explicitly via masked max-cosine on foreground patches from frozen SigLIP2 features and background-only pooling for prompt following, using externally supplied masks. Reported results (Krippendorff α=0.471, AUC=0.992) are empirical measurements against human ratings and ORIDa pairs, not predictions derived from the metric itself. No self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described derivation. The construction is independent of the evaluation outcomes, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The metric depends on accurate external masks and the representational quality of SigLIP2 features; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: externally provided foreground masks accurately isolate the concept subject from background in both reference and generated images. Invoked to enable the masked decomposition described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

MaSC computes both scores from frozen SigLIP2 SO400M-NaFlex features: concept preservation is measured by masked max-cosine matching between foreground reference patches and generated-image patches, while prompt following is measured by comparing a background-only pooled image embedding to a subject-stripped prompt embedding.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Sam 3: Segment anything with concepts, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

work page 2025

[2] [2]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021

work page 2021

[3] [3]

Superpoint: Self-supervised interest point detection and description, 2018

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description, 2018

work page 2018

[4] [4]

Roma: Robust dense feature matching

Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. Roma: Robust dense feature matching. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 19790–19800. IEEE, 2024

work page 2024

[5] [5]

Dreamsim: Learning new dimensions of human visual similarity using synthetic data

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. InAdvances in Neural Information Processing Systems, volume 36, pages 50742–50768, 2023

work page 2023

[6] [6]

Bermano, Gal Chechik, and Daniel Cohen-Or

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022

work page 2022

[7] [7]

Orida: Object-centric real-world image composition dataset

Jinwoo Kim, Sangmin Han, Jinho Jeong, Jiwoo Choi, Dongyeoung Kim, and Seon Joo Kim. Orida: Object-centric real-world image composition dataset. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3051–3060, 2025

work page 2025

[8] [8]

Computing krippendorff’s alpha-reliability

klaus krippendorff. Computing krippendorff’s alpha-reliability. 01 2011

work page 2011

[9] [9]

Grounding image matching in 3d with mast3r, 2024

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r, 2024

work page 2024

[10] [10]

Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.arXiv preprint arXiv:2305.14720, 2023

Dongxu Li, Junnan Li, and Steven CH Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing.arXiv preprint arXiv:2305.14720, 2023

work page arXiv 2023

[11] [11]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022

work page 2022

[12] [12]

Evaluating text-to-visual generation with image-to-text models.preprint arXiv:2404.01291, 2024

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation.arXiv preprint arXiv:2404.01291, 2024. 10

work page arXiv 2024

[13] [13]

Lightglue: Local feature matching at light speed, 2023

Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed, 2023

work page 2023

[14] [14]

Image segmentation using text and image prompts

Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7086–7096, June 2022

work page 2022

[15] [15]

Hpsv3: Towards wide-spectrum human preference score, 2025

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score, 2025

work page 2025

[16] [16]

Scaling open-vocabulary object detection

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zirui Shen, et al. Scaling open-vocabulary object detection. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023

[17] OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, A...

[18] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...

[19] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. DreamBench++: A human-aligned benchmark for personalized image generation. In The Thirteenth International Conference on Learning Representations, 2025.

[20] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.

[21] Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Agglomerative vision foundation model reduce all domains into one. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12490–12500, June 2024.

[22] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.

[23] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling open-world models for diverse visual tasks, 2024.

[24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.

[25] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. 2022.

[26] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

[27] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. CVPR, 2021.

[28] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. 2023.

[29] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[30] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features....

[31] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution, 2024.

[32] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 15903–15935, 2023.

[33] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. 2023.

6.1 Sensitivity to Segmentation Source

To verify that MaSC’s performance is not artificially tied to the specific capabilities of SAM3, we conducted a sensitivity analysis across four distinct zero-shot segmentatio...

[34]

Therefore, IRB or equivalent approval is not applicable

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...
