pith. machine review for the scientific record.

arxiv: 2604.11789 · v2 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords Large Multimodal Models · Object-Centric Vision · Visual Understanding · Referring Segmentation · Visual Editing · Visual Generation · Multimodal Systems

The pith

Object-centric vision supplies a framework that extends LMMs to precise object-level understanding, segmentation, editing, and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This review organizes recent work on large multimodal models around the idea that explicit object representations solve core limitations in identifying specific instances, maintaining identity during edits, and localizing changes with high precision. The authors group the literature into four themes and extract common modeling choices, training approaches, and evaluation methods across them. A reader gains a map for moving multimodal systems from coarse scene descriptions to controllable, entity-focused interactions. The paper closes by listing open problems such as instance permanence and consistent multi-step control.

Core claim

The paper claims that object-centric vision supplies a principled framework for addressing LMM limitations in instance identification, identity preservation, and precise localization by promoting explicit representations and operations over visual entities. It organizes the literature into object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation; summarizes the key modeling paradigms, learning strategies, and evaluation protocols supporting these capabilities; and outlines open challenges including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift.
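To make this framing concrete, the sketch below shows one way explicit object-level representations and the four operation families could be expressed as an interface. It is an editorial illustration, not code from the paper; every name here (ObjectEntity, ObjectCentricLMM, the method signatures) is a hypothetical stand-in for the design space the survey describes.

```python
# Illustrative sketch only: hypothetical names, not the paper's implementation.
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class ObjectEntity:
    """One visual entity, carried explicitly rather than implied by pixels."""
    instance_id: int                # stable identity across turns and edits
    category: str                   # e.g. "dog", "mug"
    mask: List[List[bool]]          # binary segmentation mask (H x W)
    description: str = ""           # language grounding for this instance
    attributes: dict = field(default_factory=dict)


class ObjectCentricLMM(Protocol):
    """The four capability families the survey organizes the literature around."""

    def understand(self, image, query: str) -> List[ObjectEntity]:
        """Object-centric understanding: answer at the instance level."""
        ...

    def refer_segment(self, image, expression: str) -> ObjectEntity:
        """Referring segmentation: resolve an expression to one instance's mask."""
        ...

    def edit(self, image, target: ObjectEntity, instruction: str):
        """Editing: modify only the target's region, preserving other identities."""
        ...

    def generate(self, prompt: str, entities: List[ObjectEntity]):
        """Generation: synthesize a scene conditioned on explicit entities."""
        ...
```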

What carries the argument

The four-theme organization of object-centric visual understanding, referring segmentation, visual editing, and visual generation that structures the surveyed advances.

If this is right

  • LMMs gain the ability to identify and track specific object instances across image sequences and edits.
  • Systems can modify only designated regions while preserving the identity and appearance of untouched objects.
  • Evaluation protocols shift from global scene metrics to instance-level precision and consistency measures (a sketch of one such measure follows this list).
  • Development efforts converge on shared modeling paradigms that support all four tasks under a single architecture.
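As a purely illustrative example of the instance-level measures the third point anticipates, the sketch below scores whether objects an edit did not target keep their masks. The function names and the 0.9 IoU threshold are assumptions made for this example, not protocols taken from the paper.

```python
# Hedged sketch of an instance-level consistency metric; names and the
# 0.9 threshold are illustrative assumptions, not from the paper.
from typing import Dict, List

Mask = List[List[bool]]  # H x W binary mask


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of the same size."""
    inter = sum(x and y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 1.0


def untouched_instance_consistency(
    before: Dict[int, Mask],   # instance_id -> mask before the edit
    after: Dict[int, Mask],    # instance_id -> mask after the edit
    edited_ids: set,           # instances the instruction actually targeted
    iou_threshold: float = 0.9,
) -> float:
    """Fraction of non-edited instances whose masks survive the edit intact."""
    untouched = [i for i in before if i not in edited_ids]
    kept = [
        i for i in untouched
        if i in after and mask_iou(before[i], after[i]) >= iou_threshold
    ]
    return len(kept) / len(untouched) if untouched else 1.0
```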

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The review's structure could serve as a template for new benchmarks that test cross-task consistency rather than isolated capabilities.
  • Future work might test whether the same object-centric priors transfer to video or 3D domains without additional supervision.
  • If the four themes prove incomplete, the field would need a fifth category for object-centric reasoning over temporal or causal relations.

Load-bearing premise

That the reviewed papers sufficiently represent the full intersection of LMMs and object-centric vision and that this intersection indeed forms a coherent, extensible framework.

What would settle it

A systematic audit that finds a large fraction of high-impact LMM papers on object-level tasks either omit explicit object representations or achieve comparable gains without them.

read the original abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper presents a comprehensive review of recent advances at the convergence of Large Multimodal Models (LMMs) and object-centric vision. It organizes the existing literature into four major themes—object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation—while summarizing key modeling paradigms, learning strategies, and evaluation protocols. The review concludes by discussing open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift.

Significance. If the curation and summaries are accurate and reasonably complete, the survey would provide a useful structured perspective for researchers working on extending LMMs from global scene understanding to precise object-level capabilities. Its primary contribution is organizational synthesis rather than new derivations or experiments, which is appropriate for a review paper in a fast-moving area; explicit credit is due for framing the four themes as a coherent lens and for highlighting actionable future directions without introducing unverified claims.

minor comments (2)
  1. [Abstract] The phrasing that object-centric vision 'provides a principled framework' is presented in the abstract and introduction as established motivation; a brief paragraph contrasting it with alternative (e.g., pixel- or region-based) approaches would clarify why the four-theme organization follows naturally rather than appearing as one possible taxonomy.
  2. The four-theme structure is clear, but boundary papers that span multiple themes (e.g., a method that performs both referring segmentation and editing) should be explicitly noted so readers understand how overlaps are handled.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our survey, as well as for recommending minor revision. The referee's assessment correctly captures the paper's organizational structure around the four themes, its synthesis of paradigms and challenges, and its focus on future directions without overclaiming novelty. Since no specific major comments were raised in the report, we have no points requiring rebuttal or clarification at this time. We will incorporate any minor suggestions during the revision process to further strengthen the manuscript.

Circularity Check

0 steps flagged

No significant circularity; survey paper with no derivations or fitted quantities

full rationale

This paper is a literature review that organizes external work into four themes (object-centric understanding, referring segmentation, editing, and generation) and summarizes paradigms, strategies, and protocols. No equations, predictions, parameters, or derivation chains appear in the abstract or described structure. The claim that object-centric vision supplies a 'principled framework' is motivational framing rather than a testable proposition whose validity depends on internal reductions. All cited results are external to the present manuscript, so no self-citation load-bearing, self-definitional, or fitted-input patterns exist. The contribution is curation and perspective, which remains valid independently of any single assumption or result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a literature review, the paper introduces no free parameters, axioms, or invented entities; all content draws from cited prior work.

pith-pipeline@v0.9.0 · 5548 in / 1112 out tokens · 70869 ms · 2026-05-10T15:31:22.509475+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

250 extracted references · 103 canonical work pages · 35 internal anchors
