WOW-Seg: A Word-free Open World Segmentation Model

arxiv: 2605.16903 · v1 · pith:XJXDZDJInew · submitted 2026-05-16 · 💻 cs.CV

WOW-Seg: A Word-free Open World Segmentation Model

Danyang Li , Tianhao Wu , Bin Li , Zhenyuan Chen , Yang Zhang , Yuxuan Li , Ming-Ming Cheng , Xiang Li This is my paper

Pith reviewed 2026-05-19 21:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords open world segmentationword-free modelvisual promptmask to token alignmentopen-set recognitionLVIS benchmarksemantic segmentationparameter efficient

0 comments p. Extension

pith:XJXDZDJI Add to your LaTeX paper

What is a Pith Number?

\usepackage{pith}
\pithnumber{XJXDZDJI}

Prints a linked pith:XJXDZDJI badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

The pith

A word-free model segments and recognizes open-world objects by aligning visual masks directly to vision-language features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to close the gap between strong visual segmentation and weak semantic understanding in foundation models when faced with the unlimited object categories of real scenes. Closed-set methods cannot scale to novel objects, while models like SAM produce masks but struggle to name or relate what they see. WOW-Seg converts each mask into a token that sits in the same feature space as a vision-language model, then uses a cascade attention mechanism to keep instances from interfering with one another. The result is a system that delivers strong semantic similarity and intersection-over-union scores on the LVIS benchmark while using far fewer parameters than prior approaches. A new test set with more than seven thousand categories is also released to measure how well such models generalize.

Core claim

WOW-Seg performs open-world segmentation and semantic recognition without text supervision by introducing the Mask2Token module that turns image masks into visual tokens aligned with VLLM feature space and the Cascade Attention Mask that decouples information across instances, yielding a semantic similarity of 89.7 and semantic IoU of 82.4 on LVIS while using only one-eighth the parameters of the previous state of the art.

What carries the argument

The Mask2Token visual prompt module that transforms segmentation masks into tokens and aligns them with the feature space of a vision-language large model.

If this is right

Segmentation systems can label objects from categories never encountered during training.
Large labeled datasets that enumerate every possible object type become unnecessary.
Foundation segmentation models gain semantic capability at low additional parameter cost.
Instance decoupling improves performance in scenes with many overlapping or adjacent objects.
The RR-7K benchmark provides a standardized way to compare open-world recognition across thousands of classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar mask-to-token alignment could be applied to improve zero-shot detection or tracking without language prompts.
Purely visual interfaces may eventually replace language prompts for many semantic tasks in resource-constrained settings.
Extending the approach to video would test whether the instance separation remains stable across frames.
Combining the method with other lightweight vision backbones could further lower the compute needed for open-set performance.

Load-bearing premise

Visual mask tokens can be aligned with the vision-language model feature space in a way that supports accurate semantic recognition of entirely unseen object categories without any text or category labels.

What would settle it

Running the model on a held-out collection of images whose object categories are guaranteed to be absent from all training data and checking whether semantic similarity and IoU scores remain high while mask quality stays accurate.

Figures

Figures reproduced from arXiv: 2605.16903 by Bin Li, Danyang Li, Ming-Ming Cheng, Tianhao Wu, Xiang Li, Yang Zhang, Yuxuan Li, Zhenyuan Chen.

**Figure 2.** Figure 2: The overall framework of WOW-Seg. WOW-Seg will perform corresponding tokenisation [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Detailed illustration of our Mask2Token. Only the processing procedure of a mask is [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Causal Attention Mask and Cascade Atten [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Data annotation pipeline. The RR-7K we proposed mainly goes through three stages: [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Statistical distribution of RR-7K. The relative mask size denotes the square root of the [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Comparing forward pass efficiency and GFLOPs on a 3090 GPU. A potential concern regarding the Mask2Token module is that encoding individual mask regions may introduce significant computational overhead. We analyse the training forward runtime and computational complexity of WOWSeg versus PAM, as illustrated in [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: The process of mapping mask to tokens in different methods. [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison of the loss decline trends of Mask2Token, Fore2Token, and Blur2Token. For [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Cascade Attention Mask and its variants. [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Visualisation of category distributions on ImageNet-21K and RR-7K using t-SNE. [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: Hierarchical distribution visualisation. The top panel displays the distribution of primary [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗

**Figure 13.** Figure 13: Visualization of some rare category objects in RR-7K. [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗

**Figure 14.** Figure 14: Visualisation results of WOW-Seg on the ADE-20K dataset. [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Visualization of the performance of WOW-Seg-1B and PAM-3B. [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗

**Figure 16.** Figure 16: Visualization of the performance of WOW-Seg-1B and PAM-3B. [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗

read the original abstract

Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM exhibit notable discrepancies between their strong segmentation capabilities and relatively weaker semantic understanding. To bridge these discrepancies, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information across different instances. This approach mitigates inter-instance interference, leading to a significant improvement in model performance. We further construct an open world region recognition test benchmark: the Region Recognition Dataset (RR-7K). With 7,662 classes, it represents the most extensive category-rich region recognition dataset to date. WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4. This performance surpasses the previous SOTA while using only one-eighth the parameter count. These results underscore the strong open world generalization capabilities of WOW-Seg. The code and related resources are available at https://github.com/AAwcAA/WOW-Seg-Meta.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WOW-Seg adds Mask2Token and Cascade Attention Mask to push word-free open-world segmentation, plus a large new benchmark, but the LVIS gains rest on thin experimental detail.

read the letter

The paper's core move is to replace text prompts with a visual prompt module called Mask2Token that maps image masks into tokens aligned to a VLLM feature space, paired with a Cascade Attention Mask that reduces interference between instances. It also releases RR-7K, a region recognition benchmark covering 7,662 classes. These pieces are presented as a way to close the gap between strong segmentation (like SAM) and semantic understanding in open-world settings without relying on category names or text supervision during training or inference.

Referee Report

3 major / 1 minor

Summary. The paper proposes WOW-Seg, a word-free open-world segmentation model. It introduces the Mask2Token module to transform image masks into visual tokens aligned with VLLM feature space and the Cascade Attention Mask to reduce inter-instance interference. The work constructs the RR-7K benchmark with 7,662 classes for open-world region recognition and reports strong empirical results on LVIS (semantic similarity 89.7, semantic IoU 82.4), outperforming prior SOTA at 1/8 the parameter count while claiming improved open-world generalization.

Significance. If the reported gains and generalization hold under detailed scrutiny, the approach could meaningfully advance open-world segmentation by reducing dependence on text supervision and category labels while lowering parameter count. The RR-7K benchmark and public code release are concrete strengths that would aid reproducibility and future comparisons in the field.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: The headline LVIS results (semantic similarity 89.7, semantic IoU 82.4) are presented without error bars, standard deviations, number of runs, or full evaluation protocol details, which directly affects the reliability of the claim that these numbers surpass prior SOTA.
[Method] Method section (Mask2Token description): The alignment of visual mask tokens to VLLM space is central to the word-free open-set claim, yet the loss function, training objective, and safeguards against pre-training leakage or dataset bias are not specified in sufficient detail to verify that category-level semantics are extracted from masks alone.
[Experiments] Experiments on RR-7K: The paper asserts strong open-world generalization on the 7,662-class benchmark, but provides no ablation studies, category-split protocol, or comparison against baselines that would confirm the Mask2Token and Cascade Attention Mask are responsible for the reported gains rather than implicit cues.

minor comments (1)

[Abstract] The claim of using 'only one-eighth the parameter count' requires explicit identification of the compared baseline model and its parameter count for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract / Experiments] Abstract and Experiments section: The headline LVIS results (semantic similarity 89.7, semantic IoU 82.4) are presented without error bars, standard deviations, number of runs, or full evaluation protocol details, which directly affects the reliability of the claim that these numbers surpass prior SOTA.

Authors: We agree that additional statistical details are necessary to support the reliability of the reported LVIS numbers. In the revised manuscript we will include results from at least three independent runs with different random seeds, report mean values with standard deviations and error bars, and provide a complete evaluation protocol subsection that specifies all hyperparameters, data splits, and inference settings used for the semantic similarity and semantic IoU metrics. revision: yes
Referee: [Method] Method section (Mask2Token description): The alignment of visual mask tokens to VLLM space is central to the word-free open-set claim, yet the loss function, training objective, and safeguards against pre-training leakage or dataset bias are not specified in sufficient detail to verify that category-level semantics are extracted from masks alone.

Authors: We appreciate the request for greater technical detail. The Mask2Token module is trained with a contrastive alignment objective that minimizes the distance between mask-derived visual tokens and frozen VLLM region embeddings. Training uses a held-out alignment set that has no overlap with LVIS or RR-7K evaluation images, and the VLLM backbone remains frozen to avoid introducing dataset-specific bias. We will add the exact loss formulation, the alignment training procedure, and these leakage-prevention steps to the Method section in the revision. revision: yes
Referee: [Experiments] Experiments on RR-7K: The paper asserts strong open-world generalization on the 7,662-class benchmark, but provides no ablation studies, category-split protocol, or comparison against baselines that would confirm the Mask2Token and Cascade Attention Mask are responsible for the reported gains rather than implicit cues.

Authors: We acknowledge that the current RR-7K experiments would benefit from explicit ablations and protocol details. In the revised version we will add (i) ablation tables that isolate the contribution of Mask2Token and Cascade Attention Mask on RR-7K, (ii) a clear description of the category-split protocol (including how the 7,662 classes are partitioned), and (iii) comparisons against suitably adapted baselines. These additions will help attribute performance gains to the proposed modules. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on LVIS and RR-7K with no derivations reducing to inputs by construction

full rationale

The paper introduces the Mask2Token module and Cascade Attention Mask as architectural components for aligning visual tokens with VLLM space and decoupling instances, then reports empirical performance metrics (semantic similarity 89.7, semantic IoU 82.4 on LVIS) and introduces the RR-7K benchmark. No equations, predictions, or first-principles claims are present that reduce by construction to fitted parameters, self-citations, or renamed inputs. All load-bearing claims rest on experimental comparisons rather than analytical steps that could be tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the effectiveness of two newly introduced modules whose alignment and decoupling properties are asserted rather than derived from prior independent evidence.

axioms (1)

domain assumption Visual masks can be transformed into tokens that align with VLLM feature space without text supervision
Invoked in the design of the Mask2Token module described in the abstract.

invented entities (2)

Mask2Token no independent evidence
purpose: Transforms image masks into visual tokens aligned with VLLM feature space
New module introduced to bridge segmentation and semantic understanding.
Cascade Attention Mask no independent evidence
purpose: Decouples information across different instances to reduce interference
New mechanism proposed to improve multi-instance handling.

pith-pipeline@v0.9.0 · 5814 in / 1290 out tokens · 25569 ms · 2026-05-19T21:20:00.129681+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Mask2Token module transforms image masks into visual tokens and ensures their alignment with the VLLM feature space... Cascade Attention Mask to decouple information across different instances.
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4... using only one-eighth the parameter count.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 12 internal anchors

[1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shen- glong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024a. Zhenyuan Chen, Lingfeng Yang, Shuo Chen, Zhaowei Chen, Jiajun Liang, and Xiang Li...

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Imagenet: A large-scale hi- erarchical image database

11 Published as a conference paper at ICLR 2026 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hi- erarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee,

work page 2026
[5]

Tag: Guidance-free open-vocabulary semantic segmenta- tion.arXiv preprint arXiv:2403.11197,

Yasufumi Kawano and Yoshimitsu Aoki. Tag: Guidance-free open-vocabulary semantic segmenta- tion.arXiv preprint arXiv:2403.11197,

work page arXiv
[6]

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren´e Ranftl. Language-driven semantic segmentation.arXiv preprint arXiv:2201.03546,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Visual instruction pretraining for domain-specific foundation models.arXiv preprint arXiv:2509.17562,

Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, and Jian Yang. Visual instruction pretraining for domain-specific foundation models.arXiv preprint arXiv:2509.17562,

work page arXiv
[8]

arXiv preprint arXiv:2504.16072 , year=

Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, et al. Describe anything: Detailed localized image and video caption- ing.arXiv preprint arXiv:2504.16072,

work page arXiv
[9]

Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want.arXiv preprint arXiv:2403.20271,

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shang- hang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want.arXiv preprint arXiv:2403.20271,

work page arXiv
[10]

Perceive anything: Recognize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302,

Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wen- tao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302,

work page arXiv
[11]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Ultraled: Learning to see everything in ultra-high dynamic range scenes.arXiv preprint arXiv:2510.07741,

Yuang Meng, Xin Jin, Lina Lei, Chun-Le Guo, and Chongyi Li. Ultraled: Learning to see everything in ultra-high dynamic range scenes.arXiv preprint arXiv:2510.07741,

work page arXiv
[13]

12 Published as a conference paper at ICLR 2026 Khan Muhammad, Tanveer Hussain, Hayat Ullah, Javier Del Ser, Mahdi Rezaei, Neeraj Kumar, Mohammad Hijji, Paolo Bellavista, and Victor Hugo C De Albuquerque. Vision-based semantic segmentation in scene understanding for autonomous driving: Recent achievements, challenges, and outlooks.IEEE Transactions on Int...

work page 2026
[14]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks.arXiv preprint arXiv:1908.10084,

work page internal anchor Pith review Pith/arXiv arXiv 1908
[17]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024a. Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with ...

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925,

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925,

work page arXiv
[19]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale.arXiv preprint arXiv:2303.15389,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Auto-vocabulary semantic segmentation

13 Published as a conference paper at ICLR 2026 Osman ¨Ulger, Maksymilian Kulicki, Yuki Asano, and Martin R Oswald. Auto-vocabulary semantic segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 24266–24275,

work page 2026
[21]

Clipself: Vision transformer distills itself for open-vocabulary dense prediction.arXiv preprint arXiv:2310.01403,

Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. Clipself: Vision transformer distills itself for open-vocabulary dense prediction.arXiv preprint arXiv:2310.01403,

work page arXiv
[22]

Lisa++: An improved baseline for reasoning segmentation with large language model, 2024

Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. Lisa++: An improved baseline for reasoning segmentation with large language model.arXiv preprint arXiv:2312.17240,

work page arXiv
[23]

Ferret: Refer and Ground Anything Anywhere at Any Granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Unichange: Unifying change detection with multimodal large language model.arXiv preprint arXiv:2511.02607,

Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, and Xiang Li. Unichange: Unifying change detection with multimodal large language model.arXiv preprint arXiv:2511.02607,

work page arXiv
[25]

Crystal: Spontaneous emergence of visual latents in mllms.arXiv preprint arXiv:2602.20980,

Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie, Mingming Cheng, and Xiang Li. Crystal: Spontaneous emergence of visual latents in mllms.arXiv preprint arXiv:2602.20980,

work page arXiv
[26]

Naipv2: Debiased pairwise learning for efficient paper quality estimation.arXiv preprint arXiv:2509.25179, 2025a

Penghai Zhao, Jinyu Tian, Qinghua Xing, Xin Zhang, Zheng Li, Jianjun Qian, Ming-Ming Cheng, and Xiang Li. Naipv2: Debiased pairwise learning for efficient paper quality estimation.arXiv preprint arXiv:2509.25179, 2025a. Penghai Zhao, Qinghua Xing, Kairan Dou, Jinyu Tian, Ying Tai, Jian Yang, Ming-Ming Cheng, and Xiang Li. From words to worth: Newborn arti...

work page arXiv
[27]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

14 Published as a conference paper at ICLR 2026 Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479,

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

The authors have conducted further manual screening and desensitization processing on the data

15 Published as a conference paper at ICLR 2026 A ETHICAL STATEMENT The RR-7K dataset contributed in this work is derived from images in the SA-1B dataset, with annotations generated automatically via large model inference. The authors have conducted further manual screening and desensitization processing on the data. Our contribution is intended solely f...

work page 2026
[29]

Similarly, the mask processed by Blur2Token is still encoded into 256 tokens. In Fig. 8(c), we use the mask as a prior to extract visual tokens at specific positions. In this way, the number of mask tokens is al- lowed to change dynamically. Taking the LVIS dataset as an example, each mask is represented by an average of 27 tokens. Compared with the previ...

work page 2026
[30]

As shown in Table 8, WOW-Seg demonstrates exceptional scalability and architectural superiority

D.4 THEIMPACT OFVISUALLANGUAGEMODELFOUNDATIONS To investigate the influence of pre-trained foundation models on WOW-Seg, we evaluated WOW- Seg across different iterations of the InternVL series (InternVL2, InternVL2.5, and InternVL3). As shown in Table 8, WOW-Seg demonstrates exceptional scalability and architectural superiority. Specifically, when equipp...

work page 2024
[31]

Table 9: A preliminary coarse grained division of RR-7K Coarse category Fine-grained category Animals Mammals lion, bull, tiger, buffalo, elephant, cow, kangaroo, sheep, horse, dog, elk, boxer, antelope, cattle, lioness, orangutan, pony, lemur, boar, hippopotamus, yak, polar bear, chimpanzee, gazelle, llama, hedgehog, wildebeest, ... Birds bird, eagle, du...

work page 2026
[32]

Figure 12: Hierarchical distribution visualisation

21 Published as a conference paper at ICLR 2026 Figure 11: Visualisation of category distributions on ImageNet-21K and RR-7K using t-SNE. Figure 12: Hierarchical distribution visualisation. The top panel displays the distribution of primary categories. The remaining six sub-panels each show the distribution of secondary categories. Statistical analysis in...

work page 2026

[1] [1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shen- glong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024a. Zhenyuan Chen, Lingfeng Yang, Shuo Chen, Zhaowei Chen, Jiajun Liang, and Xiang Li...

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Imagenet: A large-scale hi- erarchical image database

11 Published as a conference paper at ICLR 2026 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hi- erarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee,

work page 2026

[5] [5]

Tag: Guidance-free open-vocabulary semantic segmenta- tion.arXiv preprint arXiv:2403.11197,

Yasufumi Kawano and Yoshimitsu Aoki. Tag: Guidance-free open-vocabulary semantic segmenta- tion.arXiv preprint arXiv:2403.11197,

work page arXiv

[6] [6]

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren´e Ranftl. Language-driven semantic segmentation.arXiv preprint arXiv:2201.03546,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Visual instruction pretraining for domain-specific foundation models.arXiv preprint arXiv:2509.17562,

Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, and Jian Yang. Visual instruction pretraining for domain-specific foundation models.arXiv preprint arXiv:2509.17562,

work page arXiv

[8] [8]

arXiv preprint arXiv:2504.16072 , year=

Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, et al. Describe anything: Detailed localized image and video caption- ing.arXiv preprint arXiv:2504.16072,

work page arXiv

[9] [9]

Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want.arXiv preprint arXiv:2403.20271,

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shang- hang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable mllms to comprehend what you want.arXiv preprint arXiv:2403.20271,

work page arXiv

[10] [10]

Perceive anything: Recognize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302,

Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wen- tao Zhang, Lei Zhang, and Hongsheng Li. Perceive anything: Recognize, explain, caption, and segment anything in images and videos.arXiv preprint arXiv:2506.05302,

work page arXiv

[11] [11]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Ultraled: Learning to see everything in ultra-high dynamic range scenes.arXiv preprint arXiv:2510.07741,

Yuang Meng, Xin Jin, Lina Lei, Chun-Le Guo, and Chongyi Li. Ultraled: Learning to see everything in ultra-high dynamic range scenes.arXiv preprint arXiv:2510.07741,

work page arXiv

[13] [13]

12 Published as a conference paper at ICLR 2026 Khan Muhammad, Tanveer Hussain, Hayat Ullah, Javier Del Ser, Mahdi Rezaei, Neeraj Kumar, Mohammad Hijji, Paolo Bellavista, and Victor Hugo C De Albuquerque. Vision-based semantic segmentation in scene understanding for autonomous driving: Recent achievements, challenges, and outlooks.IEEE Transactions on Int...

work page 2026

[14] [14]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world.arXiv preprint arXiv:2306.14824,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks.arXiv preprint arXiv:1908.10084,

work page internal anchor Pith review Pith/arXiv arXiv 1908

[17] [17]

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks.arXiv preprint arXiv:2401.14159, 2024a. Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with ...

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925,

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925,

work page arXiv

[19] [19]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale.arXiv preprint arXiv:2303.15389,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Auto-vocabulary semantic segmentation

13 Published as a conference paper at ICLR 2026 Osman ¨Ulger, Maksymilian Kulicki, Yuki Asano, and Martin R Oswald. Auto-vocabulary semantic segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 24266–24275,

work page 2026

[21] [21]

Clipself: Vision transformer distills itself for open-vocabulary dense prediction.arXiv preprint arXiv:2310.01403,

Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, and Chen Change Loy. Clipself: Vision transformer distills itself for open-vocabulary dense prediction.arXiv preprint arXiv:2310.01403,

work page arXiv

[22] [22]

Lisa++: An improved baseline for reasoning segmentation with large language model, 2024

Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. Lisa++: An improved baseline for reasoning segmentation with large language model.arXiv preprint arXiv:2312.17240,

work page arXiv

[23] [23]

Ferret: Refer and Ground Anything Anywhere at Any Granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Unichange: Unifying change detection with multimodal large language model.arXiv preprint arXiv:2511.02607,

Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, and Xiang Li. Unichange: Unifying change detection with multimodal large language model.arXiv preprint arXiv:2511.02607,

work page arXiv

[25] [25]

Crystal: Spontaneous emergence of visual latents in mllms.arXiv preprint arXiv:2602.20980,

Yang Zhang, Danyang Li, Yuxuan Li, Xin Zhang, Tianyu Xie, Mingming Cheng, and Xiang Li. Crystal: Spontaneous emergence of visual latents in mllms.arXiv preprint arXiv:2602.20980,

work page arXiv

[26] [26]

Naipv2: Debiased pairwise learning for efficient paper quality estimation.arXiv preprint arXiv:2509.25179, 2025a

Penghai Zhao, Jinyu Tian, Qinghua Xing, Xin Zhang, Zheng Li, Jianjun Qian, Ming-Ming Cheng, and Xiang Li. Naipv2: Debiased pairwise learning for efficient paper quality estimation.arXiv preprint arXiv:2509.25179, 2025a. Penghai Zhao, Qinghua Xing, Kairan Dou, Jinyu Tian, Ying Tai, Jian Yang, Ming-Ming Cheng, and Xiang Li. From words to worth: Newborn arti...

work page arXiv

[27] [27]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

14 Published as a conference paper at ICLR 2026 Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479,

work page internal anchor Pith review Pith/arXiv arXiv 2026

[28] [28]

The authors have conducted further manual screening and desensitization processing on the data

15 Published as a conference paper at ICLR 2026 A ETHICAL STATEMENT The RR-7K dataset contributed in this work is derived from images in the SA-1B dataset, with annotations generated automatically via large model inference. The authors have conducted further manual screening and desensitization processing on the data. Our contribution is intended solely f...

work page 2026

[29] [29]

Similarly, the mask processed by Blur2Token is still encoded into 256 tokens. In Fig. 8(c), we use the mask as a prior to extract visual tokens at specific positions. In this way, the number of mask tokens is al- lowed to change dynamically. Taking the LVIS dataset as an example, each mask is represented by an average of 27 tokens. Compared with the previ...

work page 2026

[30] [30]

As shown in Table 8, WOW-Seg demonstrates exceptional scalability and architectural superiority

D.4 THEIMPACT OFVISUALLANGUAGEMODELFOUNDATIONS To investigate the influence of pre-trained foundation models on WOW-Seg, we evaluated WOW- Seg across different iterations of the InternVL series (InternVL2, InternVL2.5, and InternVL3). As shown in Table 8, WOW-Seg demonstrates exceptional scalability and architectural superiority. Specifically, when equipp...

work page 2024

[31] [31]

Table 9: A preliminary coarse grained division of RR-7K Coarse category Fine-grained category Animals Mammals lion, bull, tiger, buffalo, elephant, cow, kangaroo, sheep, horse, dog, elk, boxer, antelope, cattle, lioness, orangutan, pony, lemur, boar, hippopotamus, yak, polar bear, chimpanzee, gazelle, llama, hedgehog, wildebeest, ... Birds bird, eagle, du...

work page 2026

[32] [32]

Figure 12: Hierarchical distribution visualisation

21 Published as a conference paper at ICLR 2026 Figure 11: Visualisation of category distributions on ImageNet-21K and RR-7K using t-SNE. Figure 12: Hierarchical distribution visualisation. The top panel displays the distribution of primary categories. The remaining six sub-panels each show the distribution of secondary categories. Statistical analysis in...

work page 2026