Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Pith reviewed 2026-05-16 22:36 UTC · model grok-4.3
The pith
Scone uses an understanding expert as a semantic bridge to let generation models handle both multi-subject composition and correct distinction without interference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Scone is a unified understanding-generation model for subject-driven image generation. The understanding expert functions as a semantic bridge that conveys information to the generation expert, enabling it to preserve subject identity while minimizing interference among multiple subjects. Training proceeds in two stages: the first stage learns composition, and the second stage strengthens distinction via semantic alignment and attention-based masking. The approach is evaluated on the new SconeEval benchmark and on existing benchmarks, where it surpasses prior open-source models in both composition and distinction metrics.
What carries the argument
The understanding expert acting as a semantic bridge inside a unified understanding-generation architecture, trained with a two-stage schedule of composition learning followed by semantic alignment and attention-based masking.
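The reviewed text does not say how the attention-based masking is computed, so the following is only a generic masked-softmax sketch of the idea: tokens belonging to a distractor subject are excluded before normalisation and therefore receive zero attention weight. The function name, the token layout, and the scores are hypothetical and are not taken from the Scone code.

```python
import math


def masked_attention_weights(scores, keep):
    """Softmax over one query's attention scores, ignoring masked keys.

    scores: raw attention logits, one per key token.
    keep:   booleans, True where the key may be attended to.
    """
    if not any(keep):
        raise ValueError("at least one key must remain unmasked")
    # Masked keys are set to -inf so they end up with exactly zero weight.
    masked = [s if k else -math.inf for s, k in zip(scores, keep)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical layout: 2 text tokens, 2 tokens of the target subject,
# and 2 tokens of a distractor subject from the same reference image.
scores = [0.5, 0.2, 1.0, 0.8, 1.1, 0.9]
keep = [True, True, True, True, False, False]
weights = masked_attention_weights(scores, keep)
```

In this toy case the distractor tokens have the highest raw scores, which is the situation where an unmasked model would be most likely to leak the wrong subject into the output.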
If this is right
- Multi-subject prompts become reliable for realistic scene generation without manual subject isolation.
- Subject identity is preserved across varying contexts while avoiding leakage from other reference images.
- The SconeEval benchmark provides a standardized way to measure both composition accuracy and distinction correctness.
- Open-source models can now be fine-tuned with the same two-stage recipe to close the gap with closed models on complex subject tasks.
Where Pith is reading between the lines
- The semantic-bridge pattern could transfer to video or 3D generation where temporal or spatial consistency across subjects is required.
- Attention masking during the distinction stage may generalize to other conditional generation settings that need selective focus on reference signals.
- Combining the method with larger pretrained understanding models could further reduce the data needed for the distinction stage.
Load-bearing premise
The two-stage training with semantic alignment and attention-based masking can strengthen distinction without lowering composition quality or creating new interference between subjects.
What would settle it
A controlled test set of prompts containing two visually similar subjects where the model after distinction training either swaps identities or produces lower composition fidelity than the composition-only checkpoint.
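One way such a test could be scored is a subject-swap rate: the fraction of outputs that sit closer to the distractor reference than to the intended one. The sketch below assumes each image has already been reduced to an embedding vector by some encoder; the embeddings, the cosine criterion, and the function names are illustrative assumptions, not the SconeEval protocol.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def swap_rate(cases):
    """Fraction of cases whose output is closer to the distractor than the target.

    cases: iterable of (output, target, distractor) embedding triples.
    """
    cases = list(cases)
    swaps = sum(
        1
        for output, target, distractor in cases
        if cosine(output, distractor) > cosine(output, target)
    )
    return swaps / len(cases)


# Toy 2-D embeddings (made up for illustration, not measured data).
target, distractor = [1.0, 0.0], [0.0, 1.0]
cases = [
    ([1.0, 0.1], target, distractor),  # resembles the target
    ([0.2, 1.0], target, distractor),  # resembles the distractor: a swap
    ([0.9, 0.3], target, distractor),  # resembles the target
    ([0.8, 0.6], target, distractor),  # resembles the target
]
rate = swap_rate(cases)
```

Running the same scoring on the composition-only checkpoint and on the checkpoint after distinction training would give the before/after comparison this test calls for.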
Original abstract
Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to distinguish and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Scone, a unified understanding-generation architecture for subject-driven image generation that jointly addresses composition (integrating multiple subjects) and distinction (correctly identifying and rendering specific subjects amid candidates to minimize interference). The core mechanism positions the understanding expert as a semantic bridge that guides the generation expert. Training proceeds in two stages: an initial composition-focused stage followed by distinction enhancement via semantic alignment and attention-based masking. The authors release a new benchmark, SconeEval, and report that Scone outperforms existing open-source models on composition and distinction tasks across two benchmarks. Model weights, benchmark, and training data are made publicly available.
Significance. If the performance gains are reproducible and attributable to the proposed mechanisms rather than training volume, the work would meaningfully advance multi-subject generation by explicitly modeling distinction, a previously under-addressed capability needed for realistic scenes. The conceptual framing of understanding as a semantic bridge and the public release of code, benchmark, and data constitute clear strengths that support reproducibility and follow-on research.
Major comments (2)
- §4 (Experiments) and abstract: Performance claims state that Scone outperforms open-source models, yet no details are supplied on experimental controls, statistical significance, variance across runs, or precise metric values and baseline implementations. This absence prevents verification of the central empirical result.
- Training section (likely §3.2): The claim that stage-2 semantic alignment plus attention-based masking specifically improves distinction while preserving composition rests on the two-stage scheme. No component-wise ablations, before/after interference metrics (e.g., subject-swap error rates), or controls for additional data volume are reported, leaving open the possibility that observed gains arise simply from continued training on multi-subject examples rather than the proposed mechanisms.
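The ablation design requested in the second comment can be written down as a small run grid. This is a sketch under assumptions: the flag names and the step budget are hypothetical and do not correspond to options in the released Scone code. The point it illustrates is that every run, including the control with both mechanisms off, gets the same stage-2 budget on the same data, so any gain cannot be explained by extra training alone.

```python
from itertools import product


def ablation_grid(stage2_steps):
    """Enumerate stage-2 runs over the two distinction mechanisms.

    Every run uses the same step budget so that continued training
    is held constant across the comparison.
    """
    runs = []
    for align, mask in product([False, True], repeat=2):
        runs.append(
            {
                "name": f"align={'on' if align else 'off'},mask={'on' if mask else 'off'}",
                "semantic_alignment": align,  # hypothetical flag name
                "attention_masking": mask,    # hypothetical flag name
                "stage2_steps": stage2_steps,
            }
        )
    return runs


# 10_000 is a placeholder budget, not a value from the paper.
runs = ablation_grid(stage2_steps=10_000)
control = runs[0]  # both mechanisms off: continued training only
```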
Minor comments (2)
- Abstract: The statement 'outperforms ... on two benchmarks' should explicitly name the second benchmark in addition to SconeEval for immediate clarity.
- Throughout (notation and terminology): Ensure consistent capitalization and definition of 'understanding expert' and 'generation expert' on first use and in all figure captions.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We appreciate the referee's emphasis on reproducibility and the need to isolate the contributions of our proposed mechanisms. We will revise the manuscript to provide the requested details and ablations.
- Referee: §4 (Experiments) and abstract: Performance claims state that Scone outperforms open-source models, yet no details are supplied on experimental controls, statistical significance, variance across runs, or precise metric values and baseline implementations. This absence prevents verification of the central empirical result.
  Authors: We agree that the current presentation lacks sufficient detail for independent verification. In the revised manuscript, we will expand §4 and the abstract to report: exact numerical metric values with standard deviations across three random seeds, baseline implementation details (including code references and hyperparameter settings), statistical significance via paired t-tests, and full experimental controls such as evaluation protocols and data splits. These additions will directly address the verification concern.
  Revision: yes
- Referee: Training section (likely §3.2): The claim that stage-2 semantic alignment plus attention-based masking specifically improves distinction while preserving composition rests on the two-stage scheme. No component-wise ablations, before/after interference metrics (e.g., subject-swap error rates), or controls for additional data volume are reported, leaving open the possibility that observed gains arise simply from continued training on multi-subject examples rather than the proposed mechanisms.
  Authors: We acknowledge that the manuscript does not currently include the requested ablations or controls. In the revision, we will add a dedicated ablation subsection in §3.2 and §4 showing: (i) component-wise results with and without semantic alignment and attention-based masking, (ii) before/after subject-swap error rates and other distinction-specific interference metrics, and (iii) a control experiment training for equivalent total steps on the same multi-subject data but omitting the distinction-specific losses. These results will demonstrate that the observed gains are attributable to the proposed mechanisms rather than data volume or continued training alone.
  Revision: yes
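The reporting promised in the first response (standard deviations across three random seeds and paired t-tests) could look roughly like the sketch below. It uses only the Python standard library; the scores are made-up placeholders, not results from the paper, and pairing by seed assumes the model and the baseline were run with matched seeds. With only three pairs the test has two degrees of freedom, so it has little power and should be read with caution.

```python
import math
import statistics


def mean_and_std(values):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a, b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder scores for three seeds (illustrative only).
model_scores = [0.70, 0.72, 0.74]
baseline_scores = [0.65, 0.66, 0.69]

model_mean, model_std = mean_and_std(model_scores)
t_stat, dof = paired_t_statistic(model_scores, baseline_scores)
```

Turning `t_stat` into a p-value requires the Student t distribution with `dof` degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.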
Circularity Check
No significant circularity; empirical claims rest on benchmarks
The paper proposes a two-stage training scheme (composition learning followed by semantic alignment and attention-based masking) for a unified model and supports its claims via experimental results on SconeEval and other benchmarks. No equations, derivations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The semantic-bridge role is asserted through the training procedure and performance comparisons rather than reducing to its own inputs by construction. This is a standard empirical ML paper with independent experimental validation.
Forward citations
Cited by 6 Pith papers
- Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
  Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
- UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
  UniCustom fuses ViT and VAE features before VLM encoding and uses two-stage training plus slot-wise regularization to improve subject consistency in multi-reference diffusion-based image generation.
- UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
  A unified visual conditioning approach fuses semantic and appearance features before VLM processing, with two-stage training and slot-wise regularization, to improve consistency in multi-reference image generation.
- HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
  A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
- Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
  A reinforcement learning approach adapts general generative models to produce synthetic data that boosts identity recognition accuracy and generalization under privacy constraints.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
  OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.