An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Bingyu Li; Da Zhang; Junyu Gao; Tao Huo; Xuelong Li; Zhiyuan Zhao

arxiv: 2606.00987 · v1 · pith:57IXEE6Onew · submitted 2026-05-31 · 💻 cs.CV · cs.AI

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Bingyu Li , Da Zhang , Tao Huo , Zhiyuan Zhao , Junyu Gao , Xuelong Li This is my paper

Pith reviewed 2026-06-28 17:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multi-temporal referring segmentationlarge vision-language modelschange detectionreferring segmentationbenchmark constructiontemporal reasoningLVLM fine-tuning

0 comments

The pith

MTRefSeg-R1 outperforms LVLM baselines on multi-temporal referring segmentation by pre-training on vision-only changes then fine-tuning on language-guided masks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Multi-temporal Referring Segmentation as a task that combines temporal correspondence reasoning, language grounding, and pixel-level mask prediction across multiple images of the same scene. It constructs the MTRefSeg-21K benchmark of 21K triplets via the CRAFT-Agent pipeline with human auditing and proposes MTRefSeg-R1, a change-aware LVLM that first learns general temporal-change perception from 20K vision-only bi-temporal samples before fine-tuning on the benchmark. This two-stage approach addresses the failure of direct inference in existing models and produces stronger results by explicitly modeling cross-temporal visual differences and aligning instructions with temporal variations.

Core claim

MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. It achieves this after first learning general temporal-change perception from 20K vision-only bi-temporal samples and then fine-tuning on the MTRefSeg-21K benchmark, yielding strong and often superior performance compared with existing LVLM baselines on the new task.

What carries the argument

MTRefSeg-R1's two-stage training strategy that pre-trains general temporal-change perception on vision-only bi-temporal data before language-guided fine-tuning for mask prediction.

If this is right

Direct inference performs poorly on MTRS while task-specific fine-tuning alone remains limited.
Pre-training on vision-only bi-temporal samples improves subsequent language-guided temporal localization.
The benchmark exposes the joint difficulty of temporal correspondence, language grounding, and mask prediction.
Explicit cross-temporal difference modeling enables referred change mask prediction where baselines fall short.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The CRAFT-Agent construction method could generate data for longer-sequence or multi-view change tasks.
Improved temporal reasoning in LVLMs may support applications such as monitoring land-use changes from satellite pairs.
Success on this task suggests that staged training separating perception from language alignment could apply to other dynamic visual reasoning problems.

Load-bearing premise

The CRAFT-Agent pipeline with human auditing produces multi-temporal triplets whose language descriptions match genuine visual changes rather than generation artifacts.

What would settle it

Human audit of a random sample from MTRefSeg-21K reveals that more than 10 percent of language descriptions fail to correspond to actual visual differences, or ablation removing the 20K vision-only pre-training stage eliminates MTRefSeg-R1's performance advantage.

Figures

Figures reproduced from arXiv: 2606.00987 by Bingyu Li, Da Zhang, Junyu Gao, Tao Huo, Xuelong Li, Zhiyuan Zhao.

**Figure 1.** Figure 1: Task motivation of Multi-temporal Referring Segmentation (MTRS). MTRS addresses this gap by taking temporally related images and a natural-language expression as input, and segmenting the region corresponding to the described temporal change. including robotics, surveillance, autonomous driving, and remote sensing interpretation [7], [8], [9]. Recent AI Flow studies emphasize interactive and scenariodriv… view at source ↗

**Figure 2.** Figure 2: From single-time language-guided segmentation to multi-temporal referring segmentation. Complementary to the task motivation in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of CRAFT-Agent. CRAFT-Agent constructs bi-image–text–mask triplets for MTRS by integrating grid-aware change perception, expression generation, mask refinement, expression beautification, and human auditing. relying solely on explicit object attributes, the model must infer the intended target from more abstract or relational descriptions, making the task substantially more challenging. Different … view at source ↗

**Figure 4.** Figure 4: Data design for MTRS. MTRefSeg-21K provides fine-grained bi-image–text–mask triplets for language-guided multi-temporal referring fine-tuning across RS and NS domains. stage pipeline. First, a grid-aware multimodal large language model examines bi-temporal image pairs and identifies differences across spatial regions, producing referring expressions with explicit spatial cues and temporal variation catego… view at source ↗

**Figure 5.** Figure 5: Statistical analysis of the MTRefSeg-21K dataset. Left: Distribution of mask areas across the Train, Val, NS, and RS splits. Middle: Distribution of referring expression lengths for the four splits, measured by the number of words per expression. Right: Word cloud visualization of referring expressions, highlighting the most frequent words in the dataset. TABLE I: Dataset statistics and multi-domain compar… view at source ↗

**Figure 6.** Figure 6: Adapting VLM-based segmentation models to MTRS. Single-time VLM frameworks are extended with paired image inputs and temporal feature interaction to localize languagedescribed changes. SharedParam. Vision Backbone Multimodal LLM T1 embedding T2 embedding Fusion Fused embedding Sure, the segmentation result is [SEG]. [SEG] Hidden Embedding Mask Decoder “<image_t1> is the earlier image, and <image_t2> is t… view at source ↗

**Figure 7.** Figure 7: Adapting LVLM-based segmentation models to MTRS. A segmentation-oriented LVLM is modified to jointly process multi-temporal images and generate masks conditioned on temporal-change descriptions. in [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Overview of the proposed MTRefSeg-R1 framework. MTRefSeg-R1 adopts a two-stage training strategy. Stage 1 performs multi-temporal vision pretraining on diverse view types to learn generic change-aware visual representations. Stage 2 conducts referring multi-temporal fine-tuning, where the LVLM understands temporally ordered image pairs and language instructions, and predicts the mask of the referred change… view at source ↗

**Figure 9.** Figure 9: Qualitative comparisons on the NS domain. We compare our method with representative LVLM baselines under normal-scene multi-temporal referring segmentation. The examples cover object disappearance, appearance, and state changes. Compared with existing LVLMs, our method produces more complete and spatially accurate masks that better match the language-specified temporal changes. shows that low-rank adaptati… view at source ↗

**Figure 10.** Figure 10: Qualitative comparisons on the RS domain. We compare our method with representative LVLM baselines on remote-sensing multi-temporal referring segmentation. The results show that our method can better capture building-level and region-level temporal changes under aerial-view scenes, producing masks that are more consistent with the ground truth. Expression: the white car situated along the central road tha… view at source ↗

**Figure 11.** Figure 11: Visualization of language-guided temporal attention and mask decoding. The decoder [SEG] attention maps concentrate on the regions described by the referring expressions, and the intermediate query masks progressively localize the target changed objects before producing the final prediction. We further visualize the internal decoding behavior in [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

read the original abstract

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce \textbf{Multi-temporal Referring Segmentation (MTRS)}, a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose \textbf{CRAFT-Agent}, an automated data construction pipeline with human auditing, and build \textbf{MTRefSeg-21K}, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose \textbf{MTRefSeg-R1}, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a new task and ships the first benchmark for multi-temporal referring segmentation, but the reported gains rest on a self-generated dataset whose construction details need checking.

read the letter

The core new thing here is the MTRS task itself—jointly doing language grounding, temporal change reasoning, and pixel masks on bi-temporal images—plus the MTRefSeg-21K benchmark of 21K triplets built with CRAFT-Agent and human audit. They also release MTRefSeg-R1, a two-stage LVLM that first trains on 20K vision-only change pairs then fine-tunes on the language task. That combination is not in the prior referring-segmentation or change-detection literature they cite, and open-sourcing the benchmark is the part that actually moves the field forward.

What works is the framing: direct LVLM inference is weak on this, and a dedicated pipeline helps. The two-stage approach is a reasonable way to bootstrap temporal perception before adding language. If the triplets really pair accurate change descriptions with real pixel differences, the benchmark will be useful for anyone testing temporal grounding.

The soft spot is exactly the one the stress-test flags. All performance numbers come from models run on MTRefSeg-21K, yet the abstract gives no quantitative metrics, no error bars, no ablation on the auditing step, and no comparison of how often the generated language matches independent human judgment of the images. If the CRAFT-Agent outputs descriptions that fit LLM priors more than the actual visual changes, then both the task difficulty and the ranking of MTRefSeg-R1 become circular. The separate 20K vision-only set is fine for pre-training, but it does not fix the evaluation set. Without those controls shown, the superiority claim is hard to trust at face value.

This is for CV groups already working on LVLMs or temporal reasoning who want a new testbed. It is not yet ready for broad citation until the data quality checks are in the paper. A serious referee should see it, mainly to verify the benchmark construction and demand the missing numbers.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Multi-temporal Referring Segmentation (MTRS) task, which requires jointly performing temporal correspondence reasoning, language grounding, and pixel-level mask prediction on multi-temporal images. It presents CRAFT-Agent, an automated data construction pipeline with human auditing, to build the MTRefSeg-21K benchmark of 21K image-text-mask triplets. It further proposes MTRefSeg-R1, a change-aware LVLM trained in two stages (first on 20K vision-only bi-temporal samples for general temporal-change perception, then fine-tuned on MTRefSeg-21K), and claims that MTRefSeg-R1 achieves strong and often superior performance relative to existing LVLM baselines.

Significance. If the benchmark triplets are shown to contain language descriptions that accurately and independently reflect genuine visual changes (rather than pipeline artifacts), the work would be significant for establishing the first dedicated benchmark and open-source baseline for MTRS. The explicit cross-temporal difference modeling and two-stage training strategy represent concrete technical contributions that could be built upon. The open-source release of the benchmark and code is a clear strength that supports reproducibility and further research in multi-temporal visual reasoning.

major comments (2)

[Benchmark construction (abstract and §3)] Benchmark construction (abstract and §3): The validity of all performance claims for MTRefSeg-R1 on MTRefSeg-21K depends on the triplets accurately reflecting genuine temporal differences. The description of human auditing provides no quantitative details such as rejection rates, inter-auditor agreement, or any post-audit verification that language descriptions match pixel-level evidence independently of CRAFT-Agent generation biases. This is load-bearing for the central claim.
[§5 (Experiments)] §5 (Experiments): The abstract asserts that direct inference performs poorly while MTRefSeg-R1 is superior, yet supplies no quantitative metrics, error bars, specific baseline implementations, or ablation results on the contribution of the two-stage training. Without these, the superiority claim cannot be evaluated for robustness.

minor comments (1)

[Abstract] Abstract: Including one or two key quantitative results (e.g., mIoU deltas versus the strongest baseline) would strengthen the summary of findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the manuscript's claims regarding benchmark validity and experimental robustness.

read point-by-point responses

Referee: [Benchmark construction (abstract and §3)] Benchmark construction (abstract and §3): The validity of all performance claims for MTRefSeg-R1 on MTRefSeg-21K depends on the triplets accurately reflecting genuine temporal differences. The description of human auditing provides no quantitative details such as rejection rates, inter-auditor agreement, or any post-audit verification that language descriptions match pixel-level evidence independently of CRAFT-Agent generation biases. This is load-bearing for the central claim.

Authors: We agree that quantitative auditing statistics are essential to substantiate the benchmark's quality and independence from pipeline artifacts. In the revised manuscript, we will add rejection rates from the human auditing stage, inter-auditor agreement metrics (e.g., Cohen's kappa), and post-audit verification procedures confirming that language descriptions align with pixel-level changes. These additions will directly address the load-bearing concern for the central claims. revision: yes
Referee: [§5 (Experiments)] §5 (Experiments): The abstract asserts that direct inference performs poorly while MTRefSeg-R1 is superior, yet supplies no quantitative metrics, error bars, specific baseline implementations, or ablation results on the contribution of the two-stage training. Without these, the superiority claim cannot be evaluated for robustness.

Authors: Section 5 of the manuscript already reports quantitative performance metrics across multiple LVLM baselines on MTRefSeg-21K, showing MTRefSeg-R1's advantages. However, we acknowledge the absence of error bars, explicit baseline implementation details, and ablations isolating the two-stage training. In the revision, we will incorporate error bars from repeated runs, clarify baseline setups, and add an ablation study on the two-stage strategy to enable rigorous evaluation of the superiority claims. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction and model evaluation are independent of self-referential fits or definitions.

full rationale

The paper introduces a new task and benchmark (MTRefSeg-21K) via the CRAFT-Agent pipeline plus human auditing, then trains MTRefSeg-R1 in two stages on a separate 20K vision-only set before fine-tuning and comparing performance to external LVLM baselines. No equations, predictions, or central claims reduce by construction to author-defined inputs or self-citations; all reported results are empirical comparisons on the new data against independent models. This is a standard benchmark-plus-baseline paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard deep-learning training assumptions and the unverified quality of the automated data pipeline; no new physical entities or ad-hoc constants are introduced.

axioms (2)

standard math Standard assumptions of deep learning optimization (gradient descent reaches useful minima on non-convex loss surfaces) and i.i.d. sampling of training examples
Implicit in any claim that two-stage fine-tuning produces a generalizable model
domain assumption Human auditing of CRAFT-Agent outputs removes generation artifacts and yields faithful language-to-change correspondences
Invoked in the benchmark construction paragraph of the abstract

pith-pipeline@v0.9.1-grok · 5816 in / 1458 out tokens · 27404 ms · 2026-06-28T17:38:27.295965+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 26 canonical work pages · 5 internal anchors

[1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Qwen3 Technical Report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Qwen Technical Report

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

F- lmm: Grounding frozen large multimodal models,

S. Wu, S. Jin, W. Zhang, L. Xu, W. Liu, W. Li, and C. C. Loy, “F- lmm: Grounding frozen large multimodal models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 24 710–24 721

2025
[5]

Uwbench: A comprehensive vision-language benchmark for underwater understanding,

D. Zhang, C. Rong, B. Li, F. Wang, Z. Zhao, J. Gao, and X. Li, “Uwbench: A comprehensive vision-language benchmark for underwater understanding,” arXiv preprint arXiv:2510.18262, 2025

work page arXiv 2025
[6]

Lisa: Reasoning segmentation via large language model,

X. Lai, Z. Tian, Y . Chen, Y . Li, Y . Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 9579–9589

2024
[7]

Remotesam: Towards segment anything for earth observation,

L. Yao, F. Liu, D. Chen, C. Zhang, Y . Wang, Z. Chen, W. Xu, S. Di, and Y . Zheng, “Remotesam: Towards segment anything for earth observation,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 3027–3036

2025
[8]

Videoglamm: A large multimodal model for pixel-level visual grounding in videos,

S. Munasinghe, H. Gani, W. Zhu, J. Cao, E. Xing, F. S. Khan, and S. Khan, “Videoglamm: A large multimodal model for pixel-level visual grounding in videos,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19 036–19 046

2025
[9]

Glamm: Pixel grounding large multimodal model,

H. Rasheed, M. Maaz, S. Shaji, A. Shaker, S. Khan, H. Cholakkal, R. M. Anwer, E. Xing, M.-H. Yang, and F. S. Khan, “Glamm: Pixel grounding large multimodal model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 009–13 018

2024
[10]

Ai flow: Perspectives, scenarios, and approaches,

H. An, W. Hu, S. Huang, S. Huang, R. Li, Y . Liang, J. Shao, Y . Song, Z. Wang, C. Yuanet al., “Ai flow: Perspectives, scenarios, and approaches,” Vicinagearth, vol. 3, no. 1, p. 1, 2026

2026
[11]

arXiv preprint arXiv:2512.05107 , year=

F. Xu, G. Zhai, X. Kong, T. Fu, D. F. Gordon, X. An, and B. Busam, “Stare-vla: Progressive stage-aware reinforcement for fine-tuning vision- language-action models,” arXiv preprint arXiv:2512.05107, 2025

work page arXiv 2025
[12]

Fine-grained preference optimization improves spatial reasoning in vlms,

Y . Shen, Y . Liu, J. Zhu, X. Cao, X. Zhang, Y . He, W. Ye, J. M. Rehg, and I. Lourentzou, “Fine-grained preference optimization improves spatial reasoning in vlms,” arXiv preprint arXiv:2506.21656, 2025

work page arXiv 2025
[13]

Toward cognitive supersensing in multimodal large language model,

B. Li, Y . Shen, Y . Liu, Y . Xu, J. Liu, X. Li, Z. Li, J. Zhu, Y . Zhong, F. Lan et al., “Toward cognitive supersensing in multimodal large language model,” arXiv preprint arXiv:2602.01541, 2026

work page arXiv 2026
[14]

Egoforge: Goal-directed egocentric world simulator,

Y . Shen, J. Liu, X. Li, Y . Liu, B. Li, H. Yang, W. Jia, Y . Li, T. Yu, J. M. Rehg et al., “Egoforge: Goal-directed egocentric world simulator,” arXiv preprint arXiv:2603.20169, 2026

work page arXiv 2026
[15]

Reasoning in computer vision: Taxonomy, models, tasks, and methodologies.arXiv preprint arXiv:2508.10523, 2025

A. Sarkar, M. Y . I. Idris, and Z. Yu, “Reasoning in computer vi- sion: Taxonomy, models, tasks, and methodologies,” arXiv preprint arXiv:2508.10523, 2025

work page arXiv 2025
[16]

Towards transparent ai: A survey on explainable large language models,

A. Palikhe, Z. Yu, Z. Wang, and W. Zhang, “Towards transparent ai: A survey on explainable large language models,” arXiv preprint arXiv:2506.21812, 2025

work page arXiv 2025
[17]

Yielding unblemished aesthetics through a unified network for visual imperfections removal in generated images,

Z. Yu and C. S. Chan, “Yielding unblemished aesthetics through a unified network for visual imperfections removal in generated images,” AAAI 2025, vol. 39, no. 9, pp. 9716–9724, 2025

2025
[18]

Cotextor: Training- free modular multilingual text editing via layered disentanglement and depth-aware fusion,

Z. Yu, M. Y . I. IDRIS, P. Wang, and R. Qureshi, “Cotextor: Training- free modular multilingual text editing via layered disentanglement and depth-aware fusion,” in NeurIPS 2025, 2025

2025
[19]

Forgetme: Benchmarking the selective forgetting capabilities of generative models,

Z. Yu, M. Y . I. Idris, P. Wang, Y . Xia, and Y . Xiang, “Forgetme: Benchmarking the selective forgetting capabilities of generative models,” EAAI, vol. 161, p. 112087, 2025

2025
[20]

Tri- subspaces disentanglement for multimodal sentiment analysis,

C. Meng, J. Luo, Z. Yan, Z. Yu, R. Fu, Z. Gan, and C. Ouyang, “Tri- subspaces disentanglement for multimodal sentiment analysis,” CVPR 2026, 2026

2026
[21]

Generative video compression: towards 0.01% compression rate for video transmission,

X. Chen, J. Luo, J. Xu, F. Yi, C. Zhang, and X. Li, “Generative video compression: towards 0.01% compression rate for video transmission,” Vicinagearth, vol. 3, no. 1, p. 7, 2026

2026
[22]

Geobench-vlm: Benchmarking vision-language models for geospatial tasks,

M. Danish, M. A. Munir, S. R. A. Shah, K. Kuckreja, F. S. Khan, P. Fraccaro, A. Lacoste, and S. Khan, “Geobench-vlm: Benchmarking vision-language models for geospatial tasks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 7132–7142

2025
[23]

Geomag: A vision-language model for pixel-level fine-grained remote sensing image parsing,

X. Ma, J. Li, C. Pei, and H. Liu, “Geomag: A vision-language model for pixel-level fine-grained remote sensing image parsing,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 5441–5450

2025
[24]

Geopixel: Pixel grounding large multimodal model in remote sensing,

A. Shabbir, M. Zumri, M. Bennamoun, F. S. Khan, and S. Khan, “Geopixel: Pixel grounding large multimodal model in remote sensing,” arXiv preprint arXiv:2501.13925, 2025

work page arXiv 2025
[25]

Dinov3-powered multi- task foundation model for quantitative remote sensing estimation,

Z. Yu, M. Y . I. Idris, P. Wang, and R. Qureshi, “Dinov3-powered multi- task foundation model for quantitative remote sensing estimation,” AAAI 2026, vol. 40, no. 48, pp. 41 455–41 456, 2026

2026
[26]

Visualizing our changing earth: A creative ai framework for democratizing environmental storytelling through satellite imagery,

Z. Yu, M. Y . I. Idris, and P. Wang, “Visualizing our changing earth: A creative ai framework for democratizing environmental storytelling through satellite imagery,” in NeurIPS 2025, 2025

2025
[27]

Spatiotemporal alignment for remote sensing image recovery via terrain-aware diffusion,

Z. Yu, H. Jiang, P. Wang, Z. Lin, and Y . Xiang, “Spatiotemporal alignment for remote sensing image recovery via terrain-aware diffusion,” ICASSP 2026, 2026

2026
[28]

Maris: Marine open-vocabulary instance segmentation with geometric enhancement and semantic alignment,

B. Li, F. Wang, D. Zhang, Z. Zhao, J. Gao, and X. Li, “Maris: Marine open-vocabulary instance segmentation with geometric enhancement and semantic alignment,” arXiv preprint arXiv:2510.15398, 2025

work page arXiv 2025
[29]

Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,

Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” Advances in Neural Information Processing Systems, vol. 36, pp. 32 215–32 234, 2023

2023
[30]

Diffusion models for open-vocabulary segmentation,

L. Karazija, I. Laina, A. Vedaldi, and C. Rupprecht, “Diffusion models for open-vocabulary segmentation,” in European Conference on Computer Vision. Springer, 2024, pp. 299–317

2024
[31]

Cat- seg: Cost aggregation for open-vocabulary semantic segmentation,

S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim, “Cat- seg: Cost aggregation for open-vocabulary semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4113–4123

2024
[32]

Exploring the underwater world segmentation without extra training,

B. Li, T. Huo, D. Zhang, Z. Zhao, J. Gao, and X. Li, “Exploring the underwater world segmentation without extra training,” arXiv preprint arXiv:2511.07923, 2025

work page arXiv 2025
[33]

Exploring efficient open-vocabulary segmentation in the remote sensing,

B. Li, H. Dong, D. Zhang, Z. Zhao, J. Gao, and X. Li, “Exploring efficient open-vocabulary segmentation in the remote sensing,” arXiv preprint arXiv:2509.12040, 2025

work page arXiv 2025
[34]

A simple framework for open-vocabulary segmentation and detection,

H. Zhang, F. Li, X. Zou, S. Liu, C. Li, J. Yang, and L. Zhang, “A simple framework for open-vocabulary segmentation and detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1020–1031

2023
[35]

Open-vocabulary universal image segmentation with maskclip,

Z. Ding, J. Wang, and Z. Tu, “Open-vocabulary universal image segmentation with maskclip,” arXiv preprint arXiv:2208.08984, 2022

work page arXiv 2022
[36]

Polyformer: Referring image segmentation as sequential polygon generation,

J. Liu, H. Ding, Z. Cai, Y . Zhang, R. K. Satzoda, V . Mahadevan, and R. Manmatha, “Polyformer: Referring image segmentation as sequential polygon generation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18 653–18 663

2023
[37]

Toward robust referring image segmentation,

J. Wu, X. Li, X. Li, H. Ding, Y . Tong, and D. Tao, “Toward robust referring image segmentation,” IEEE Transactions on Image Processing, vol. 33, pp. 1782–1794, 2024

2024
[38]

Rotated multi-scale interaction network for referring remote sensing image seg- mentation,

S. Liu, Y . Ma, X. Zhang, H. Wang, J. Ji, X. Sun, and R. Ji, “Rotated multi-scale interaction network for referring remote sensing image seg- mentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 658–26 668

2024
[39]

Lqmformer: Language-aware query mask transformer for referring image segmentation,

N. A. Shah, V . VS, and V . M. Patel, “Lqmformer: Language-aware query mask transformer for referring image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12 903–12 913

2024
[40]

A survey of language-guided video object segmentation: from referring to reasoning,

Y . Shen and D. Zhang, “A survey of language-guided video object segmentation: from referring to reasoning,” Vicinagearth, vol. 2, no. 1, pp. 1–20, 2025. 16

2025
[41]

Adaptive selection based referring image segmentation,

P. Yue, J. Lin, S. Zhang, J. Hu, Y . Lu, H. Niu, H. Ding, Y . Zhang, G. Jiang, L. Cao et al., “Adaptive selection based referring image segmentation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1101–1110

2024
[42]

Rsrefseg: Refer- ring remote sensing image segmentation with foundation models,

K. Chen, J. Zhang, C. Liu, Z. Zou, and Z. Shi, “Rsrefseg: Refer- ring remote sensing image segmentation with foundation models,” in IGARSS 2025-2025 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2025, pp. 1070–1074

2025
[43]

Referring remote sensing image segmentation with cross-view semantics interaction network,

J. Yang, L. Zhang, and H. Lu, “Referring remote sensing image segmentation with cross-view semantics interaction network,” arXiv preprint arXiv:2508.01331, 2025

work page arXiv 2025
[44]

Deris: Decoupling perception and cognition for enhanced referring image segmentation through loopback synergy,

M. Dai, W. Cheng, J.-j. Liu, S. Yang, W. Cai, Y . Sun, and W. Yang, “Deris: Decoupling perception and cognition for enhanced referring image segmentation through loopback synergy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19 936–19 946

2025
[45]

Lavt: Language-aware vision transformer for referring image segmentation,

Z. Yang, J. Wang, Y . Tang, K. Chen, H. Zhao, and P. H. Torr, “Lavt: Language-aware vision transformer for referring image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 155–18 165

2022
[46]

Gres: Generalized referring expression segmentation,

C. Liu, H. Ding, and X. Jiang, “Gres: Generalized referring expression segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 23 592–23 601

2023
[47]

Mask grounding for referring image segmentation,

Y . X. Chng, H. Zheng, Y . Han, X. Qiu, and G. Huang, “Mask grounding for referring image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 573–26 583

2024
[48]

Segllm: Multi-round reasoning segmentation,

X. Wang, S. Zhang, S. Li, K. Kallidromitis, K. Li, Y . Kato, K. Kozuka, and T. Darrell, “Segllm: Multi-round reasoning segmentation,” arXiv preprint arXiv:2410.18923, 2024

work page arXiv 2024
[49]

Reasoning segmentation for images and videos: A survey,

Y . Shen, C. Li, F. Xiong, J.-O. Jeong, T. Wang, M. Latman, and M. Unberath, “Reasoning segmentation for images and videos: A survey,” arXiv preprint arXiv:2505.18816, 2025

work page arXiv 2025
[50]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Y . Liu, B. Peng, Z. Zhong, Z. Yue, F. Lu, B. Yu, and J. Jia, “Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement,” arXiv preprint arXiv:2503.06520, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Lisa++: An improved baseline for reasoning segmentation with large language model,

S. Yang, T. Qu, X. Lai, Z. Tian, B. Peng, S. Liu, and J. Jia, “Lisa++: An improved baseline for reasoning segmentation with large language model,” arXiv preprint arXiv:2312.17240, 2023

work page arXiv 2023
[52]

Dataset on underwater change detection,

M. Radolko, F. Farhadifard, and U. F. von Lukas, “Dataset on underwater change detection,” in OCEANS 2016 MTS/IEEE Monterey. IEEE, 2016, pp. 1–8

2016
[53]

Mds- net: An image-text enhanced multimodal dual-branch siamese network for remote sensing change detection,

T. Wang, T. Bai, C. Xu, E. Zhang, B. Liu, X. Zhao, and H. Zhang, “Mds- net: An image-text enhanced multimodal dual-branch siamese network for remote sensing change detection,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025

2025
[54]

Qrs-trs: Style transfer-based image-to-image translation for carbon stock estimation in quantitative remote sensing,

Z. Yu, J. Wang, H. Chen, and M. Y . I. Idris, “Qrs-trs: Style transfer-based image-to-image translation for carbon stock estimation in quantitative remote sensing,” IEEE Access, 2025

2025
[55]

Dynamicearth: How far are we from open-vocabulary change detection?

K. Li, X. Cao, Y . Deng, C. Pang, Z. Xin, D. Meng, and Z. Wang, “Dynamicearth: How far are we from open-vocabulary change detection?” arXiv preprint arXiv:2501.12931, 2025

work page arXiv 2025
[56]

Semantic-cd: Remote sensing image semantic change detection towards open-vocabulary setting,

Y . Zhu, L. Li, K. Chen, C. Liu, F. Zhou, and Z. Shi, “Semantic-cd: Remote sensing image semantic change detection towards open-vocabulary setting,” in IGARSS 2025-2025 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2025, pp. 6388–6392

2025
[57]

Unichange: Unifying change detection with multimodal large language model,

X. Zhang, D. Li, X. Dong, T. Wu, H. Yu, J. Wang, Q. Li, and X. Li, “Unichange: Unifying change detection with multimodal large language model,” arXiv preprint arXiv:2511.02607, 2025

work page arXiv 2025
[58]

Referring change detection in remote sensing imagery,

Y . Korkmaz, J. N. Paranjape, C. M. de Melo, and V . M. Patel, “Referring change detection in remote sensing imagery,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026, pp. 106–116

2026
[59]

Changechat: An interactive model for remote sensing change analysis via multimodal instruction tuning,

P. Deng, W. Zhou, and H. Wu, “Changechat: An interactive model for remote sensing change analysis via multimodal instruction tuning,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025
[60]

Segchange-r1: Llm-augmented remote sensing change detec- tion,

F. Zhou, “Segchange-r1: Llm-augmented remote sensing change detec- tion,” arXiv preprint arXiv:2506.17944, 2025

work page arXiv 2025
[61]

Viewpoint integration and registration with vision language foundation model for image change understanding,

X. Lu, J. Yuan, R. Niu, Y . Hu, and F. Wang, “Viewpoint integration and registration with vision language foundation model for image change understanding,” arXiv preprint arXiv:2309.08585, 2023

work page arXiv 2023
[62]

Modeling context in referring expressions,

L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling context in referring expressions,” in European conference on computer vision. Springer, 2016, pp. 69–85

2016
[63]

Rrsis: Referring remote sensing image segmentation,

Z. Yuan, L. Mou, Y . Hua, and X. X. Zhu, “Rrsis: Referring remote sensing image segmentation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–12, 2024

2024
[64]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

2019
[65]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

2021
[66]

Cris: Clip- driven referring image segmentation,

Z. Wang, Y . Lu, Q. Li, X. Tao, Y . Guo, M. Gong, and T. Liu, “Cris: Clip- driven referring image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 686– 11 695

2022
[67]

Exploring fine-grained image-text alignment for referring remote sensing image segmentation,

S. Lei, X. Xiao, T. Zhang, H.-C. Li, Z. Shi, and Q. Zhu, “Exploring fine-grained image-text alignment for referring remote sensing image segmentation,” IEEE Transactions on Geoscience and Remote Sensing, 2024

2024
[68]

Gsva: Generalized segmentation via multimodal large language models,

Z. Xia, D. Han, Y . Han, X. Pan, S. Song, and G. Huang, “Gsva: Generalized segmentation via multimodal large language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3858–3869

2024
[69]

UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

S. Ni, D. Wang, H. Chen, H. Guo, N. Zhang, and J. Zhang, “Unigeoseg: Towards unified open-world segmentation for geospatial scenes,” arXiv preprint arXiv:2511.23332, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[70]

Segearth-r1: Geospatial pixel reasoning via large language model,

K. Li, Z. Xin, L. Pang, C. Pang, Y . Deng, J. Yao, G. Xia, D. Meng, Z. Wang, and X. Cao, “Segearth-r1: Geospatial pixel reasoning via large language model,” arXiv preprint arXiv:2504.09644, 2025

work page arXiv 2025

[1] [1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Qwen3 Technical Report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Qwen Technical Report

J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

F- lmm: Grounding frozen large multimodal models,

S. Wu, S. Jin, W. Zhang, L. Xu, W. Liu, W. Li, and C. C. Loy, “F- lmm: Grounding frozen large multimodal models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 24 710–24 721

2025

[5] [5]

Uwbench: A comprehensive vision-language benchmark for underwater understanding,

D. Zhang, C. Rong, B. Li, F. Wang, Z. Zhao, J. Gao, and X. Li, “Uwbench: A comprehensive vision-language benchmark for underwater understanding,” arXiv preprint arXiv:2510.18262, 2025

work page arXiv 2025

[6] [6]

Lisa: Reasoning segmentation via large language model,

X. Lai, Z. Tian, Y . Chen, Y . Li, Y . Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 9579–9589

2024

[7] [7]

Remotesam: Towards segment anything for earth observation,

L. Yao, F. Liu, D. Chen, C. Zhang, Y . Wang, Z. Chen, W. Xu, S. Di, and Y . Zheng, “Remotesam: Towards segment anything for earth observation,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 3027–3036

2025

[8] [8]

Videoglamm: A large multimodal model for pixel-level visual grounding in videos,

S. Munasinghe, H. Gani, W. Zhu, J. Cao, E. Xing, F. S. Khan, and S. Khan, “Videoglamm: A large multimodal model for pixel-level visual grounding in videos,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19 036–19 046

2025

[9] [9]

Glamm: Pixel grounding large multimodal model,

H. Rasheed, M. Maaz, S. Shaji, A. Shaker, S. Khan, H. Cholakkal, R. M. Anwer, E. Xing, M.-H. Yang, and F. S. Khan, “Glamm: Pixel grounding large multimodal model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 009–13 018

2024

[10] [10]

Ai flow: Perspectives, scenarios, and approaches,

H. An, W. Hu, S. Huang, S. Huang, R. Li, Y . Liang, J. Shao, Y . Song, Z. Wang, C. Yuanet al., “Ai flow: Perspectives, scenarios, and approaches,” Vicinagearth, vol. 3, no. 1, p. 1, 2026

2026

[11] [11]

arXiv preprint arXiv:2512.05107 , year=

F. Xu, G. Zhai, X. Kong, T. Fu, D. F. Gordon, X. An, and B. Busam, “Stare-vla: Progressive stage-aware reinforcement for fine-tuning vision- language-action models,” arXiv preprint arXiv:2512.05107, 2025

work page arXiv 2025

[12] [12]

Fine-grained preference optimization improves spatial reasoning in vlms,

Y . Shen, Y . Liu, J. Zhu, X. Cao, X. Zhang, Y . He, W. Ye, J. M. Rehg, and I. Lourentzou, “Fine-grained preference optimization improves spatial reasoning in vlms,” arXiv preprint arXiv:2506.21656, 2025

work page arXiv 2025

[13] [13]

Toward cognitive supersensing in multimodal large language model,

B. Li, Y . Shen, Y . Liu, Y . Xu, J. Liu, X. Li, Z. Li, J. Zhu, Y . Zhong, F. Lan et al., “Toward cognitive supersensing in multimodal large language model,” arXiv preprint arXiv:2602.01541, 2026

work page arXiv 2026

[14] [14]

Egoforge: Goal-directed egocentric world simulator,

Y . Shen, J. Liu, X. Li, Y . Liu, B. Li, H. Yang, W. Jia, Y . Li, T. Yu, J. M. Rehg et al., “Egoforge: Goal-directed egocentric world simulator,” arXiv preprint arXiv:2603.20169, 2026

work page arXiv 2026

[15] [15]

Reasoning in computer vision: Taxonomy, models, tasks, and methodologies.arXiv preprint arXiv:2508.10523, 2025

A. Sarkar, M. Y . I. Idris, and Z. Yu, “Reasoning in computer vi- sion: Taxonomy, models, tasks, and methodologies,” arXiv preprint arXiv:2508.10523, 2025

work page arXiv 2025

[16] [16]

Towards transparent ai: A survey on explainable large language models,

A. Palikhe, Z. Yu, Z. Wang, and W. Zhang, “Towards transparent ai: A survey on explainable large language models,” arXiv preprint arXiv:2506.21812, 2025

work page arXiv 2025

[17] [17]

Yielding unblemished aesthetics through a unified network for visual imperfections removal in generated images,

Z. Yu and C. S. Chan, “Yielding unblemished aesthetics through a unified network for visual imperfections removal in generated images,” AAAI 2025, vol. 39, no. 9, pp. 9716–9724, 2025

2025

[18] [18]

Cotextor: Training- free modular multilingual text editing via layered disentanglement and depth-aware fusion,

Z. Yu, M. Y . I. IDRIS, P. Wang, and R. Qureshi, “Cotextor: Training- free modular multilingual text editing via layered disentanglement and depth-aware fusion,” in NeurIPS 2025, 2025

2025

[19] [19]

Forgetme: Benchmarking the selective forgetting capabilities of generative models,

Z. Yu, M. Y . I. Idris, P. Wang, Y . Xia, and Y . Xiang, “Forgetme: Benchmarking the selective forgetting capabilities of generative models,” EAAI, vol. 161, p. 112087, 2025

2025

[20] [20]

Tri- subspaces disentanglement for multimodal sentiment analysis,

C. Meng, J. Luo, Z. Yan, Z. Yu, R. Fu, Z. Gan, and C. Ouyang, “Tri- subspaces disentanglement for multimodal sentiment analysis,” CVPR 2026, 2026

2026

[21] [21]

Generative video compression: towards 0.01% compression rate for video transmission,

X. Chen, J. Luo, J. Xu, F. Yi, C. Zhang, and X. Li, “Generative video compression: towards 0.01% compression rate for video transmission,” Vicinagearth, vol. 3, no. 1, p. 7, 2026

2026

[22] [22]

Geobench-vlm: Benchmarking vision-language models for geospatial tasks,

M. Danish, M. A. Munir, S. R. A. Shah, K. Kuckreja, F. S. Khan, P. Fraccaro, A. Lacoste, and S. Khan, “Geobench-vlm: Benchmarking vision-language models for geospatial tasks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 7132–7142

2025

[23] [23]

Geomag: A vision-language model for pixel-level fine-grained remote sensing image parsing,

X. Ma, J. Li, C. Pei, and H. Liu, “Geomag: A vision-language model for pixel-level fine-grained remote sensing image parsing,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 5441–5450

2025

[24] [24]

Geopixel: Pixel grounding large multimodal model in remote sensing,

A. Shabbir, M. Zumri, M. Bennamoun, F. S. Khan, and S. Khan, “Geopixel: Pixel grounding large multimodal model in remote sensing,” arXiv preprint arXiv:2501.13925, 2025

work page arXiv 2025

[25] [25]

Dinov3-powered multi- task foundation model for quantitative remote sensing estimation,

Z. Yu, M. Y . I. Idris, P. Wang, and R. Qureshi, “Dinov3-powered multi- task foundation model for quantitative remote sensing estimation,” AAAI 2026, vol. 40, no. 48, pp. 41 455–41 456, 2026

2026

[26] [26]

Visualizing our changing earth: A creative ai framework for democratizing environmental storytelling through satellite imagery,

Z. Yu, M. Y . I. Idris, and P. Wang, “Visualizing our changing earth: A creative ai framework for democratizing environmental storytelling through satellite imagery,” in NeurIPS 2025, 2025

2025

[27] [27]

Spatiotemporal alignment for remote sensing image recovery via terrain-aware diffusion,

Z. Yu, H. Jiang, P. Wang, Z. Lin, and Y . Xiang, “Spatiotemporal alignment for remote sensing image recovery via terrain-aware diffusion,” ICASSP 2026, 2026

2026

[28] [28]

Maris: Marine open-vocabulary instance segmentation with geometric enhancement and semantic alignment,

B. Li, F. Wang, D. Zhang, Z. Zhao, J. Gao, and X. Li, “Maris: Marine open-vocabulary instance segmentation with geometric enhancement and semantic alignment,” arXiv preprint arXiv:2510.15398, 2025

work page arXiv 2025

[29] [29]

Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,

Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” Advances in Neural Information Processing Systems, vol. 36, pp. 32 215–32 234, 2023

2023

[30] [30]

Diffusion models for open-vocabulary segmentation,

L. Karazija, I. Laina, A. Vedaldi, and C. Rupprecht, “Diffusion models for open-vocabulary segmentation,” in European Conference on Computer Vision. Springer, 2024, pp. 299–317

2024

[31] [31]

Cat- seg: Cost aggregation for open-vocabulary semantic segmentation,

S. Cho, H. Shin, S. Hong, A. Arnab, P. H. Seo, and S. Kim, “Cat- seg: Cost aggregation for open-vocabulary semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4113–4123

2024

[32] [32]

Exploring the underwater world segmentation without extra training,

B. Li, T. Huo, D. Zhang, Z. Zhao, J. Gao, and X. Li, “Exploring the underwater world segmentation without extra training,” arXiv preprint arXiv:2511.07923, 2025

work page arXiv 2025

[33] [33]

Exploring efficient open-vocabulary segmentation in the remote sensing,

B. Li, H. Dong, D. Zhang, Z. Zhao, J. Gao, and X. Li, “Exploring efficient open-vocabulary segmentation in the remote sensing,” arXiv preprint arXiv:2509.12040, 2025

work page arXiv 2025

[34] [34]

A simple framework for open-vocabulary segmentation and detection,

H. Zhang, F. Li, X. Zou, S. Liu, C. Li, J. Yang, and L. Zhang, “A simple framework for open-vocabulary segmentation and detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 1020–1031

2023

[35] [35]

Open-vocabulary universal image segmentation with maskclip,

Z. Ding, J. Wang, and Z. Tu, “Open-vocabulary universal image segmentation with maskclip,” arXiv preprint arXiv:2208.08984, 2022

work page arXiv 2022

[36] [36]

Polyformer: Referring image segmentation as sequential polygon generation,

J. Liu, H. Ding, Z. Cai, Y . Zhang, R. K. Satzoda, V . Mahadevan, and R. Manmatha, “Polyformer: Referring image segmentation as sequential polygon generation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 18 653–18 663

2023

[37] [37]

Toward robust referring image segmentation,

J. Wu, X. Li, X. Li, H. Ding, Y . Tong, and D. Tao, “Toward robust referring image segmentation,” IEEE Transactions on Image Processing, vol. 33, pp. 1782–1794, 2024

2024

[38] [38]

Rotated multi-scale interaction network for referring remote sensing image seg- mentation,

S. Liu, Y . Ma, X. Zhang, H. Wang, J. Ji, X. Sun, and R. Ji, “Rotated multi-scale interaction network for referring remote sensing image seg- mentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 658–26 668

2024

[39] [39]

Lqmformer: Language-aware query mask transformer for referring image segmentation,

N. A. Shah, V . VS, and V . M. Patel, “Lqmformer: Language-aware query mask transformer for referring image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12 903–12 913

2024

[40] [40]

A survey of language-guided video object segmentation: from referring to reasoning,

Y . Shen and D. Zhang, “A survey of language-guided video object segmentation: from referring to reasoning,” Vicinagearth, vol. 2, no. 1, pp. 1–20, 2025. 16

2025

[41] [41]

Adaptive selection based referring image segmentation,

P. Yue, J. Lin, S. Zhang, J. Hu, Y . Lu, H. Niu, H. Ding, Y . Zhang, G. Jiang, L. Cao et al., “Adaptive selection based referring image segmentation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 1101–1110

2024

[42] [42]

Rsrefseg: Refer- ring remote sensing image segmentation with foundation models,

K. Chen, J. Zhang, C. Liu, Z. Zou, and Z. Shi, “Rsrefseg: Refer- ring remote sensing image segmentation with foundation models,” in IGARSS 2025-2025 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2025, pp. 1070–1074

2025

[43] [43]

Referring remote sensing image segmentation with cross-view semantics interaction network,

J. Yang, L. Zhang, and H. Lu, “Referring remote sensing image segmentation with cross-view semantics interaction network,” arXiv preprint arXiv:2508.01331, 2025

work page arXiv 2025

[44] [44]

Deris: Decoupling perception and cognition for enhanced referring image segmentation through loopback synergy,

M. Dai, W. Cheng, J.-j. Liu, S. Yang, W. Cai, Y . Sun, and W. Yang, “Deris: Decoupling perception and cognition for enhanced referring image segmentation through loopback synergy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19 936–19 946

2025

[45] [45]

Lavt: Language-aware vision transformer for referring image segmentation,

Z. Yang, J. Wang, Y . Tang, K. Chen, H. Zhao, and P. H. Torr, “Lavt: Language-aware vision transformer for referring image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18 155–18 165

2022

[46] [46]

Gres: Generalized referring expression segmentation,

C. Liu, H. Ding, and X. Jiang, “Gres: Generalized referring expression segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 23 592–23 601

2023

[47] [47]

Mask grounding for referring image segmentation,

Y . X. Chng, H. Zheng, Y . Han, X. Qiu, and G. Huang, “Mask grounding for referring image segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 573–26 583

2024

[48] [48]

Segllm: Multi-round reasoning segmentation,

X. Wang, S. Zhang, S. Li, K. Kallidromitis, K. Li, Y . Kato, K. Kozuka, and T. Darrell, “Segllm: Multi-round reasoning segmentation,” arXiv preprint arXiv:2410.18923, 2024

work page arXiv 2024

[49] [49]

Reasoning segmentation for images and videos: A survey,

Y . Shen, C. Li, F. Xiong, J.-O. Jeong, T. Wang, M. Latman, and M. Unberath, “Reasoning segmentation for images and videos: A survey,” arXiv preprint arXiv:2505.18816, 2025

work page arXiv 2025

[50] [50]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Y . Liu, B. Peng, Z. Zhong, Z. Yue, F. Lu, B. Yu, and J. Jia, “Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement,” arXiv preprint arXiv:2503.06520, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Lisa++: An improved baseline for reasoning segmentation with large language model,

S. Yang, T. Qu, X. Lai, Z. Tian, B. Peng, S. Liu, and J. Jia, “Lisa++: An improved baseline for reasoning segmentation with large language model,” arXiv preprint arXiv:2312.17240, 2023

work page arXiv 2023

[52] [52]

Dataset on underwater change detection,

M. Radolko, F. Farhadifard, and U. F. von Lukas, “Dataset on underwater change detection,” in OCEANS 2016 MTS/IEEE Monterey. IEEE, 2016, pp. 1–8

2016

[53] [53]

Mds- net: An image-text enhanced multimodal dual-branch siamese network for remote sensing change detection,

T. Wang, T. Bai, C. Xu, E. Zhang, B. Liu, X. Zhao, and H. Zhang, “Mds- net: An image-text enhanced multimodal dual-branch siamese network for remote sensing change detection,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025

2025

[54] [54]

Qrs-trs: Style transfer-based image-to-image translation for carbon stock estimation in quantitative remote sensing,

Z. Yu, J. Wang, H. Chen, and M. Y . I. Idris, “Qrs-trs: Style transfer-based image-to-image translation for carbon stock estimation in quantitative remote sensing,” IEEE Access, 2025

2025

[55] [55]

Dynamicearth: How far are we from open-vocabulary change detection?

K. Li, X. Cao, Y . Deng, C. Pang, Z. Xin, D. Meng, and Z. Wang, “Dynamicearth: How far are we from open-vocabulary change detection?” arXiv preprint arXiv:2501.12931, 2025

work page arXiv 2025

[56] [56]

Semantic-cd: Remote sensing image semantic change detection towards open-vocabulary setting,

Y . Zhu, L. Li, K. Chen, C. Liu, F. Zhou, and Z. Shi, “Semantic-cd: Remote sensing image semantic change detection towards open-vocabulary setting,” in IGARSS 2025-2025 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2025, pp. 6388–6392

2025

[57] [57]

Unichange: Unifying change detection with multimodal large language model,

X. Zhang, D. Li, X. Dong, T. Wu, H. Yu, J. Wang, Q. Li, and X. Li, “Unichange: Unifying change detection with multimodal large language model,” arXiv preprint arXiv:2511.02607, 2025

work page arXiv 2025

[58] [58]

Referring change detection in remote sensing imagery,

Y . Korkmaz, J. N. Paranjape, C. M. de Melo, and V . M. Patel, “Referring change detection in remote sensing imagery,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2026, pp. 106–116

2026

[59] [59]

Changechat: An interactive model for remote sensing change analysis via multimodal instruction tuning,

P. Deng, W. Zhou, and H. Wu, “Changechat: An interactive model for remote sensing change analysis via multimodal instruction tuning,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025

[60] [60]

Segchange-r1: Llm-augmented remote sensing change detec- tion,

F. Zhou, “Segchange-r1: Llm-augmented remote sensing change detec- tion,” arXiv preprint arXiv:2506.17944, 2025

work page arXiv 2025

[61] [61]

Viewpoint integration and registration with vision language foundation model for image change understanding,

X. Lu, J. Yuan, R. Niu, Y . Hu, and F. Wang, “Viewpoint integration and registration with vision language foundation model for image change understanding,” arXiv preprint arXiv:2309.08585, 2023

work page arXiv 2023

[62] [62]

Modeling context in referring expressions,

L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling context in referring expressions,” in European conference on computer vision. Springer, 2016, pp. 69–85

2016

[63] [63]

Rrsis: Referring remote sensing image segmentation,

Z. Yuan, L. Mou, Y . Hua, and X. X. Zhu, “Rrsis: Referring remote sensing image segmentation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–12, 2024

2024

[64] [64]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

2019

[65] [65]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

2021

[66] [66]

Cris: Clip- driven referring image segmentation,

Z. Wang, Y . Lu, Q. Li, X. Tao, Y . Guo, M. Gong, and T. Liu, “Cris: Clip- driven referring image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 686– 11 695

2022

[67] [67]

Exploring fine-grained image-text alignment for referring remote sensing image segmentation,

S. Lei, X. Xiao, T. Zhang, H.-C. Li, Z. Shi, and Q. Zhu, “Exploring fine-grained image-text alignment for referring remote sensing image segmentation,” IEEE Transactions on Geoscience and Remote Sensing, 2024

2024

[68] [68]

Gsva: Generalized segmentation via multimodal large language models,

Z. Xia, D. Han, Y . Han, X. Pan, S. Song, and G. Huang, “Gsva: Generalized segmentation via multimodal large language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3858–3869

2024

[69] [69]

UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

S. Ni, D. Wang, H. Chen, H. Guo, N. Zhang, and J. Zhang, “Unigeoseg: Towards unified open-world segmentation for geospatial scenes,” arXiv preprint arXiv:2511.23332, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[70] [70]

Segearth-r1: Geospatial pixel reasoning via large language model,

K. Li, Z. Xin, L. Pang, C. Pang, Y . Deng, J. Yao, G. Xia, D. Meng, Z. Wang, and X. Cao, “Segearth-r1: Geospatial pixel reasoning via large language model,” arXiv preprint arXiv:2504.09644, 2025

work page arXiv 2025