{"total":13,"items":[{"citing_arxiv_id":"2606.03715","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Text-to-Image Models Need Less from Text Encoders Than You Think","primary_cat":"cs.CV","submitted_at":"2026-06-02T14:37:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A bag-of-position-tagged-words embedding guides text-to-image diffusion models as effectively as full contextual text embeddings from standard encoders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01079","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing","primary_cat":"cs.CV","submitted_at":"2026-05-31T07:54:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chameleon proposes the first large-scale cross-domain compositing dataset and a disentangled encoder plus gated diffusion transformer that outperforms prior in-domain and cross-domain methods on plausibility and fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00931","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences","primary_cat":"cs.CV","submitted_at":"2026-05-30T23:37:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CV-Arena is a new 12K-pair benchmark for instruction-guided real-image editing with 16 task types, CogRetriever curation, and Active Elo mixed human-AI evaluation that finds gaps in 21 models and presents 
CV-Agent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00351","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniVerse: A Unified Modulation Framework for Segmentation-Free,Disentangled Multi-Concept Personalization","primary_cat":"cs.CV","submitted_at":"2026-05-29T20:45:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"UniVerse proposes a unified modulation framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers, claiming superior localization and fidelity over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28806","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Personal Visual Memory from Explicit and Implicit Evidence","primary_cat":"cs.CV","submitted_at":"2026-05-27T17:56:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VisualMem augments text memory with a visual module that resolves identity and durable user facts from images, outperforming prior systems on a new benchmark for explicit and implicit personal visual evidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12305","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:54:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text 
alignment as complexity grows.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Kosmos-g: Generating images in context with multimodal large language models. arXiv preprint arXiv:2310.02992, 2023. [26] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation. arXiv preprint arXiv:2406.16855, 2024. [27] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023. 10 [28] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis."},{"citing_arxiv_id":"2604.05039","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ID-Sim: An Identity-Focused Similarity Metric","primary_cat":"cs.CV","submitted_at":"2026-04-06T18:00:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ID-Sim is a new similarity metric that aims to capture human selective sensitivity to identities by training on curated real and generative synthetic data and validating against human annotations on recognition, retrieval, and generative tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"imal re-id using a large community-curated dataset. arXiv preprint arXiv:2412.05602, 2024. 2 [50] Thomas J Palmeri and Celina Blalock. The role of background knowledge in speeded perceptual categorization. Cognition, 77(2):B45-B57, 2000. 1 [51] Thomas J Palmeri and Isabel Gauthier. Visual object understanding. Nature Reviews Neuroscience, 5(4):291-303, 2004. 
1 [52] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation. arXiv preprint arXiv:2406.16855, 2024. 3 [53] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu"},{"citing_arxiv_id":"2512.12675","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling","primary_cat":"cs.CV","submitted_at":"2025-12-14T12:58:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10955","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization","primary_cat":"cs.CV","submitted_at":"2025-12-11T18:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.23951","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HunyuanImage 3.0 Technical 
Report","primary_cat":"cs.CV","submitted_at":"2025-09-28T16:14:10+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.04447","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","primary_cat":"cs.CV","submitted_at":"2025-07-06T16:14:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"URLhttps://openai.com/research/ o3-o4-mini-system-card . 3 [79] Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. DreamLLM: Synergistic multimodal comprehension and creation. In Int. Conf. Learn. Represent. (ICLR), 2024. 3, 4 [80] Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. Dreambench++: A human-aligned benchmark for personalized image generation. CoRR, abs/2406.16855, 2024. 
3 [81] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine."},{"citing_arxiv_id":"2506.15742","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space","primary_cat":"cs.GR","submitted_at":"2025-06-17T20:18:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FLUX.1 Kontext unifies image generation and editing via flow matching and sequence concatenation, delivering improved multi-turn consistency and speed on the new KontextBench benchmark.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"InstructPix2Pix [5] relies on synthetic Stable Diffusion samples and GPT-generated instructions, creating inherent bias. MagicBrush [60], while using authentic MS-COCO images, is constrained by DALLE-2's [41] capabilities during data collection. Other benchmarks like Emu-Edit [51] use lower-resolution images with unrealistic distributions and focus solely on editing tasks, while DreamBench [39] lacks broad coverage and GEdit-bench [31] does not represent the full scope of modern multimodal models. IntelligentBench [9] remains unavailable with only 300 examples of uncertain task coverage. To address these gaps, we compile KontextBench from crowd-sourced real-world use cases. 
The benchmark comprises 1026 unique image-prompt pairs derived from 108 base images including"},{"citing_arxiv_id":"2412.04300","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts","primary_cat":"cs.CV","submitted_at":"2024-12-05T16:21:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}