SFR-Net: Learning Scale-Frustum Representations for Ultra-Wide Area Remote Sensing Image Segmentation

Bowen Chen; Chuyu Zhong; Keyan Chen; Qinzhe Yang; Zhengxia Zou; Zhenwei Shi

arxiv: 2605.25737 · v1 · pith:JXJO2XYGnew · submitted 2026-05-25 · 💻 cs.CV


Chuyu Zhong, Keyan Chen, Qinzhe Yang, Bowen Chen, Zhengxia Zou, Zhenwei Shi

Pith reviewed 2026-06-29 23:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords ultra-wide area segmentation · remote sensing · scale-frustum representations · multi-scale fusion · semantic continuity · GID dataset · FBPS dataset


The pith

SFR-Net builds scale-frustum representations to segment ultra-wide area remote sensing images by unifying multi-scale objects with long-range context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines ultra-wide area remote sensing images as those that combine very large pixel counts with extremely broad geographical coverage. It argues that standard segmentation networks fail on these images because ground objects appear at dramatically different scales while semantic relations must remain coherent across long distances. SFR-Net constructs scale-frustum representations modeled on the geometry of viewing frustums observed from different altitudes so that objects and surrounding context are described uniformly across scales. A cascaded cross-scale fusion step then merges the representations to keep both local detail and global continuity. The resulting network reports higher mIoU than prior methods on the GID and FBPS benchmarks and can be added to other segmentation architectures to improve accuracy and convergence speed.

Core claim

Scale-frustum representations, constructed by emulating viewing frustums from different altitudes, enable unified modeling of ground objects and contextual features at multiple scales; when combined with cascaded cross-scale fusion, they simultaneously address scale variation and long-range semantic continuity in ultra-wide area images.

What carries the argument

scale-frustum representations: feature sets that model objects and context at different scales in one structure by following the geometry of remote-sensing viewing frustums captured from varying altitudes.
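
One way to picture the idea, offered as a rough sketch rather than the authors' implementation: for a given ground location, take nested windows of growing footprint (as if viewed from rising altitudes) and resample each to the same grid, so every level trades detail for coverage. The function name `frustum_levels`, the doubling of the footprint per level, and the average-pool resampling are all assumptions made for illustration; the paper's actual construction may differ.

```python
def frustum_levels(image, cy, cx, base=4, levels=3):
    """Return `levels` views centred near (cy, cx), each resampled to base x base.

    Hypothetical illustration: level k covers a square window of side
    base * 2**k pixels, so higher levels see more ground at lower detail.
    `image` is a list of rows of numbers (single channel).
    """
    height, width = len(image), len(image[0])
    views = []
    for k in range(levels):
        side = base * 2 ** k  # footprint doubles per level (assumption)
        if side > height or side > width:
            raise ValueError("image too small for the requested level")
        # Shift the window so that it stays inside the image.
        top = min(max(cy - side // 2, 0), height - side)
        left = min(max(cx - side // 2, 0), width - side)
        step = side // base  # pixels averaged per output cell along each axis
        view = []
        for i in range(base):
            row = []
            for j in range(base):
                block = [
                    image[top + i * step + di][left + j * step + dj]
                    for di in range(step)
                    for dj in range(step)
                ]
                row.append(sum(block) / len(block))
            view.append(row)
        views.append(view)
    return views
```

Every returned view has the same shape, which is what would let a single network branch consume all levels; how the paper actually aligns and encodes the levels is not described on this page.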

If this is right

SFR-Net raises mean intersection-over-union by 1.72 percent on the GID dataset and 4.29 percent on the FBPS dataset relative to the strongest prior methods.
The same scale-frustum representations can be inserted into generic segmentation networks to increase accuracy and reduce the number of training steps needed.
Local semantic detail and long-range contextual continuity are maintained together without separate pyramid or attention modules.
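
For readers who want the metric behind these numbers spelled out, mIoU averages the per-class intersection-over-union. The sketch below is a generic version on flat label lists, not the paper's evaluation code; skipping classes that are absent from both prediction and ground truth is an assumption, and benchmark toolkits may treat that case differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur at least once.

    `pred` and `target` are equal-length sequences of integer class labels.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

With `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12.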

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The frustum construction may transfer to other imaging domains that also combine extreme scale ranges with wide spatial extent, such as aerial video or medical whole-slide imaging.
If the geometric inspiration from the capture process proves general, it could replace hand-designed multi-scale modules in additional remote-sensing tasks.
Larger-coverage test sets would show whether the continuity benefit continues to grow with image area.

Load-bearing premise

Representations built from altitude-based viewing frustums can solve both extreme scale differences and long-range semantic continuity in one mechanism.

What would settle it

Run SFR-Net on a controlled ultra-wide dataset in which either all objects are forced to one scale or long-range context is deliberately severed while keeping everything else fixed, and check whether the reported accuracy gains disappear.
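
One hedged way to realise the "severed long-range context" half of that test is to cut each image into tiles and permute them, which keeps local content and object scale but destroys the arrangement across tiles. The helper `shuffle_tiles` and the tile size are hypothetical, and this is a suggested probe, not an experiment the paper reports.

```python
import random


def shuffle_tiles(image, tile, seed=0):
    """Permute the tile x tile blocks of `image` (a list of rows).

    Returns the shuffled image and the permutation used, where
    permutation[i] is the (row, col) source tile placed at the i-th
    destination tile in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % tile or width % tile:
        raise ValueError("image sides must be multiples of the tile size")
    rows, cols = height // tile, width // tile
    order = [(r, c) for r in range(rows) for c in range(cols)]
    shuffled = order[:]
    random.Random(seed).shuffle(shuffled)  # seeded, so the result is repeatable
    out = [[None] * width for _ in range(height)]
    for (dr, dc), (sr, sc) in zip(order, shuffled):
        for i in range(tile):
            for j in range(tile):
                out[dr * tile + i][dc * tile + j] = image[sr * tile + i][sc * tile + j]
    return out, shuffled
```

Calling the function on the label map with the same `seed` applies the identical permutation, so images and ground truth stay aligned.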

Figures

Figures reproduced from arXiv: 2605.25737 by Bowen Chen, Chuyu Zhong, Keyan Chen, Qinzhe Yang, Zhengxia Zou, Zhenwei Shi.

**Figure 2.** Visual examples of the two core challenges in UWA …

**Figure 3.** The proposed SFR-Net pipeline, designed to solve the dual UWA challenges. (a) First, Scale-Frustum Representations …

**Figure 4.** Illustration of a single fusion unit in the proposed …

**Figure 5.** Although these methods can better handle semantic …

**Figure 5.** Qualitative comparison of segmentation results on the GID dataset. The first column displays the full UWA image and …

**Figure 6.** Qualitative comparison of segmentation results on the FBPS dataset. The first column displays the full UWA image …

**Figure 7.** Impact of Scale-Frustum Representations (SFR) on …

**Figure 8.** Qualitative comparison demonstrating the e…

**Figure 9.** Visualization of failure cases. Most methods fail to perfectly distinguish semantically similar water bodies, such as “river” …

Original abstract

Pixel count and geographical coverage are two key characteristics of remote sensing images. Existing remote sensing image segmentation methods typically focus on images with either a small pixel count or a large pixel count but limited geographical coverage. In this paper, we introduce a novel segmentation task targeting ultra-wide area (UWA) remote sensing images, characterized by both a large pixel count and extremely wide geographical coverage. The core challenges of UWA segmentation lie in simultaneously handling ground objects with significantly varying scales and maintaining long-range contextual semantic continuity. To address these challenges, we propose the Scale-Frustum Representation Network (SFR-Net). Inspired by the viewing frustums of remote sensing images captured from different altitudes, we construct scale-frustum representations, enabling unified modeling of ground objects and contextual features at different scales. Furthermore, we design a cascaded cross-scale fusion mechanism to effectively integrate these representations, enhancing local semantic understanding while ensuring long-range contextual continuity. Experimental results on GID and FBPS demonstrate that SFR-Net achieves state-of-the-art performance, improving mIoU by 1.72% and 4.29%, respectively, over the strongest competing methods. In addition, the proposed scale-frustum representations can be integrated into generic segmentation networks to improve both segmentation accuracy and convergence speed. The implementation code will be publicly available at https://github.com/ChuyuZhong/SFR-Net.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SFR-Net defines a new UWA segmentation task and reports modest SOTA gains from scale-frustum representations on two datasets.


The main point is that this paper carves out ultra-wide area remote sensing segmentation as a distinct task and introduces scale-frustum representations to tackle both scale variation and long-range context in one go. The reported mIoU lifts are small but consistent enough to claim SOTA.

The task definition itself is the clearest new element. Prior work splits between images with a small pixel count and images with a large pixel count but limited geographical coverage; UWA targets images that have both a large pixel count and extremely wide coverage. The frustum construction draws from altitude-based viewing geometry to build unified multi-scale features, and the cascaded fusion step integrates them. They also show the representations can be dropped into other networks to improve accuracy and convergence speed.

Those pieces are straightforward and the motivation lines up with real remote-sensing pain points. Releasing code at the stated GitHub link will let others test the plug-in claim directly.

The evidence is thinner. The gains are 1.72% on GID and 4.29% on FBPS. In segmentation benchmarks those deltas can shift with baseline reimplementation or hyperparameter choices, so the full experiments need checking for fair capacity-matched controls and ablations. The abstract ties the frustum design to both scale and continuity, but without the architecture diagrams or loss details it is not obvious how the construction enforces long-range semantic continuity beyond standard multi-scale fusion.

This is for people already working on remote-sensing segmentation or large-area mapping. A reader focused on multi-scale feature design would get usable ideas from the representation and the new task framing. The work is coherent on its own terms and the claims are testable, so it deserves a serious referee even though the improvements stay incremental.

Referee Report

0 major / 2 minor

Summary. The paper introduces the ultra-wide area (UWA) remote sensing image segmentation task, characterized by both high pixel counts and extremely wide geographical coverage. It proposes SFR-Net, which constructs scale-frustum representations inspired by viewing frustums from different altitudes to enable unified modeling of multi-scale ground objects and contextual features. A cascaded cross-scale fusion mechanism integrates these representations to improve local semantics and long-range continuity. On the GID and FBPS datasets, SFR-Net reports state-of-the-art mIoU gains of 1.72% and 4.29% over the strongest baselines; the scale-frustum representations are also shown to boost accuracy and convergence when plugged into generic segmentation networks. Code will be released publicly.

Significance. If the reported gains prove robust, the work could advance remote sensing segmentation by jointly addressing scale variation and long-range context in large-coverage imagery. The empirical improvements and the plug-in compatibility with existing networks constitute the main potential contribution. Public code availability is a clear strength for reproducibility.

minor comments (2)

Abstract: the mIoU improvements are stated relative to 'the strongest competing methods' without naming the baselines or reporting absolute mIoU values; adding these details would make the performance claim easier to assess.
Abstract: the claim that scale-frustum representations 'can be integrated into generic segmentation networks' is presented without quantitative evidence in the abstract; a brief supporting result or reference to the relevant experiment would strengthen the statement.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The positive assessment of the significance of SFR-Net, the scale-frustum representations, and the reported gains on GID and FBPS is appreciated. As the report contains no specific major comments, we have no points requiring response or revision.

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces a new segmentation task for ultra-wide area images and proposes SFR-Net with scale-frustum representations and cascaded fusion as a novel architectural design motivated by viewing frustums. No equations, fitted parameters, or predictions are presented that reduce by construction to prior inputs or self-citations. The central claims rest on empirical mIoU improvements on GID and FBPS datasets, with the method described as integrable into generic networks. The derivation chain is self-contained as an empirical proposal without load-bearing reductions to definitions, fits, or author-prior uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review based solely on abstract; full architecture and training details unavailable.

axioms (1)

domain assumption: Viewing frustums from different altitudes provide a useful analogy for constructing multi-scale representations of ground objects.
Explicitly invoked as the inspiration for scale-frustum representations.

invented entities (2)

scale-frustum representations (no independent evidence)
purpose: Unified modeling of ground objects and contextual features at different scales.
Newly introduced concept in the paper.
cascaded cross-scale fusion mechanism (no independent evidence)
purpose: Integration of multi-scale representations to preserve local semantics and long-range continuity.
Newly proposed network component.
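
To make the cascaded cross-scale fusion entry above less abstract, here is a toy coarse-to-fine cascade: start from the widest-coverage feature map and, at each step, upsample the running result and blend it with the next finer map. Everything in it is assumed for illustration: the function names, nearest-neighbour upsampling, a fixed blend weight `alpha`, and single-channel maps that cover the same footprint at doubling resolution. The fusion unit in the paper is presumably a learned module, and its details are not given on this page.

```python
def upsample2x(grid):
    """Nearest-neighbour upsampling: each cell becomes a 2x2 block."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def cascade_fuse(maps_coarse_to_fine, alpha=0.5):
    """Fuse a list of maps ordered from coarsest to finest.

    At each stage the running result is upsampled and mixed with the next
    finer map: fused = alpha * upsampled + (1 - alpha) * finer.
    The fixed weight stands in for whatever learned fusion a real model uses.
    """
    fused = maps_coarse_to_fine[0]
    for finer in maps_coarse_to_fine[1:]:
        up = upsample2x(fused)
        if len(up) != len(finer) or len(up[0]) != len(finer[0]):
            raise ValueError("each map must be exactly 2x the previous one")
        fused = [
            [alpha * u + (1 - alpha) * f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, finer)
        ]
    return fused
```

Because each stage feeds the next, context from the coarsest map reaches every fine-resolution cell, which is the intuition behind fusing in a cascade rather than in a single step.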

pith-pipeline@v0.9.1-grok · 5789 in / 1315 out tokens · 48965 ms · 2026-06-29T23:03:16.504941+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 4 canonical work pages · 2 internal anchors

[1]

Land-cover classification with high-resolution remote sensing images using transferable deep models,

X.-Y . Tong, G.-S. Xia, Q. Lu, H. Shen, S. Li, S. You, and L. Zhang, “Land-cover classification with high-resolution remote sensing images using transferable deep models,”Remote Sensing of Environment, vol. 237, p. 111322, 2020

2020
[2]

Enabling country-scale land cover mapping with meter-resolution satellite imagery,

X.-Y . Tong, G.-S. Xia, and X. X. Zhu, “Enabling country-scale land cover mapping with meter-resolution satellite imagery,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 196, pp. 178–196, 2023

2023
[3]

Fully convolutional networks for semantic segmentation,

J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440

2015
[4]

U-net: Convolutional networks for biomedical image segmentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” inInternational Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241

2015
[5]

Segnet: A deep con- volutional encoder-decoder architecture for image segmentation,

V . Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep con- volutional encoder-decoder architecture for image segmentation,”IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017

2017
[6]

Per-pixel classification is not all you need for semantic segmentation,

B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,”Advances in neural information processing systems, vol. 34, pp. 17 864–17 875, 2021

2021
[7]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

2023
[8]

Rsprompter: Learning to prompt for remote sensing instance seg- mentation based on visual foundation model,

K. Chen, C. Liu, H. Chen, H. Zhang, W. Li, Z. Zou, and Z. Shi, “Rsprompter: Learning to prompt for remote sensing instance seg- mentation based on visual foundation model,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–17, 2024

2024
[9]

From contexts to locality: Ultra-high resolution image segmentation via locality-aware contextual correlation,

Q. Li, W. Yang, W. Liu, Y . Yu, and S. He, “From contexts to locality: Ultra-high resolution image segmentation via locality-aware contextual correlation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7252–7261

2021
[10]

Rest: Holistic learning for end-to-end semantic segmentation of whole-scene remote sensing imagery,

W. Chen, L. Bruzzone, B. Dang, Y . Gao, Y . Deng, J.-G. Yu, L. Yuan, and Y . Li, “Rest: Holistic learning for end-to-end semantic segmentation of whole-scene remote sensing imagery,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 1, pp. 693–710, 2026

2026
[11]

Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images,

W. Chen, Z. Jiang, Z. Wang, K. Cui, and X. Qian, “Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8924–8933

2019
[12]

Patch proposal network for fast semantic segmentation of high-resolution images,

T. Wu, Z. Lei, B. Lin, C. Li, Y . Qu, and Y . Xie, “Patch proposal network for fast semantic segmentation of high-resolution images,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 402–12 409

2020
[13]

Uhrsnet: A semantic segmentation network specifically for ultra-high- resolution images,

L. Shan, M. Li, X. Li, Y . Bai, K. Lv, B. Luo, S.-B. Chen, and W. Wang, “Uhrsnet: A semantic segmentation network specifically for ultra-high- resolution images,” in2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 1460–1466. 13

2021
[14]

Isdnet: Integrating shallow and deep net- works for efficient ultra-high resolution segmentation,

S. Guo, L. Liu, Z. Gan, Y . Wang, W. Zhang, C. Wang, G. Jiang, W. Zhang, R. Yi, L. Maet al., “Isdnet: Integrating shallow and deep net- works for efficient ultra-high resolution segmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4361–4370

2022
[15]

Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation,

D. Ji, F. Zhao, and H. Lu, “Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23), 2023, pp. 920–928

2023
[16]

Pyramid scene parsing network,

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890

2017
[17]

Icnet for real-time semantic segmentation on high-resolution images,

H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time semantic segmentation on high-resolution images,” inProceedings of the European conference on computer vision (ECCV), 2018, pp. 405– 420

2018
[18]

Encoder- decoder with atrous separable convolution for semantic image segmen- tation,

L.-C. Chen, Y . Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder- decoder with atrous separable convolution for semantic image segmen- tation,” inProceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818

2018
[19]

D-linknet: Linknet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction,

L. Zhou, C. Zhang, and M. Wu, “D-linknet: Linknet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction,” inProceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 182–186

2018
[20]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

2016
[21]

Multi-scale context ag- gregation for semantic segmentation of remote sensing images,

J. Zhang, S. Lin, L. Ding, and L. Bruzzone, “Multi-scale context ag- gregation for semantic segmentation of remote sensing images,”Remote Sensing, vol. 12, no. 4, p. 701, 2020

2020
[22]

Bisenet: Bilateral segmentation network for real-time semantic segmentation,

C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 325–341

2018
[23]

Re- thinking bisenet for real-time semantic segmentation,

M. Fan, S. Lai, J. Huang, X. Wei, Z. Chai, J. Luo, and X. Wei, “Re- thinking bisenet for real-time semantic segmentation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 9716–9725

2021
[24]

An image is worth 16x16 words: transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: transformers for image recognition at scale,” inInternational Conference on Learning Representations (ICLR), 2021, pp. 611–631

2021
[25]

Segformer: Simple and efficient design for semantic segmentation with transformers,

E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,”Advances in neural information processing systems, vol. 34, pp. 12 077–12 090, 2021

2021
[26]

Swin transformer: Hierarchical vision transformer using shifted windows,

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022

2021
[27]

Masked-attention mask transformer for universal image segmentation,

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299

2022
[28]

Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,

S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y . Wang, Y . Fu, J. Feng, T. Xiang, P. H. Torret al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 6881–6890

2021
[29]

Dual attention network for scene segmentation,

J. Fu, J. Liu, H. Tian, Y . Li, Y . Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3146– 3154

2019
[30]

Scattnet: Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images,

H. Li, K. Qiu, L. Chen, X. Mei, L. Hong, and C. Tao, “Scattnet: Semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images,”IEEE Geoscience and Remote Sensing Letters, vol. 18, no. 5, pp. 905–909, 2020

2020
[31]

Lanet: Local attention embedding to improve the semantic segmentation of remote sensing images,

L. Ding, H. Tang, and L. Bruzzone, “Lanet: Local attention embedding to improve the semantic segmentation of remote sensing images,”IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 1, pp. 426–435, 2020

2020
[32]

Rsrefseg 2: Decoupling referring remote sensing image segmentation with founda- tion models,

K. Chen, C. Liu, B. Chen, J. Zhang, Z. Zou, and Z. Shi, “Rsrefseg 2: Decoupling referring remote sensing image segmentation with founda- tion models,”IEEE Transactions on Geoscience and Remote Sensing, vol. 64, pp. 1–20, 2025

2025
[33]

Taco: Cap- turing spatio-temporal semantic consistency in remote sensing change detection,

H. Guo, C. Liu, H. Zhang, B. Chen, Z. Zou, and Z. Shi, “Taco: Cap- turing spatio-temporal semantic consistency in remote sensing change detection,”arXiv preprint arXiv:2511.20306, 2025

work page arXiv 2025
[34]

Mamba: Linear-time sequence modeling with selective state spaces,

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” inFirst conference on language modeling, 2024

2024
[35]

State-space models,

J. D. Hamilton, “State-space models,”Handbook of econometrics, vol. 4, pp. 3039–3080, 1994

1994
[36]

Dynamicvis: An efficient and general visual foundation model for remote sensing image understanding,

K. Chen, C. Liu, B. Chen, W. Li, Z. Zou, and Z. Shi, “Dynamicvis: An efficient and general visual foundation model for remote sensing image understanding,”arXiv preprint arXiv:2503.16426, 2025

work page arXiv 2025
[37]

Rs-mamba for large remote sensing image dense prediction,

S. Zhao, H. Chen, X. Zhang, P. Xiao, L. Bai, and W. Ouyang, “Rs-mamba for large remote sensing image dense prediction,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

2024
[38]

Rs 3 mamba: Visual state space model for remote sensing image semantic segmentation,

X. Ma, X. Zhang, and M.-O. Pun, “Rs 3 mamba: Visual state space model for remote sensing image semantic segmentation,”IEEE Geo- science and Remote Sensing Letters, vol. 21, pp. 1–5, 2024

2024
[39]

Unet- mamba: An efficient unet-like mamba for semantic segmentation of high-resolution remote sensing images,

E. Zhu, Z. Chen, D. Wang, H. Shi, X. Liu, and L. Wang, “Unet- mamba: An efficient unet-like mamba for semantic segmentation of high-resolution remote sensing images,”IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1–5, 2024

2024
[40]

Cascadepsp: Toward class-agnostic and very high-resolution segmentation via global and local refinement,

H. K. Cheng, J. Chung, Y .-W. Tai, and C.-K. Tang, “Cascadepsp: Toward class-agnostic and very high-resolution segmentation via global and local refinement,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8890–8899

2020
[41]

Pointrend: Image seg- mentation as rendering,

A. Kirillov, Y . Wu, K. He, and R. Girshick, “Pointrend: Image seg- mentation as rendering,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9799–9808

2020
[42]

Memory- constrained semantic segmentation for ultra-high resolution uav im- agery,

Q. Li, J. Cai, J. Luo, Y . Yu, J. Gu, J. Pan, and W. Liu, “Memory- constrained semantic segmentation for ultra-high resolution uav im- agery,”IEEE Robotics and Automation Letters, vol. 9, no. 2, pp. 1708– 1715, 2024

2024
[43]

Ultra-high resolution segmen- tation with ultra-rich context: A novel benchmark,

D. Ji, F. Zhao, H. Lu, M. Tao, and J. Ye, “Ultra-high resolution segmen- tation with ultra-rich context: A novel benchmark,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23 621–23 630

2023
[44]

Wave-vit: Unifying wavelet and transformers for visual representation learning,

T. Yao, Y . Pan, Y . Li, C.-W. Ngo, and T. Mei, “Wave-vit: Unifying wavelet and transformers for visual representation learning,” inEuropean conference on computer vision. Springer, 2022, pp. 328–345

2022
[45]

Boosting the dual-stream architecture in ultra-high resolution segmentation with resolution-biased uncertainty estimation,

R. Qin, X. Liu, J. Shi, L. Lin, and J. Yang, “Boosting the dual-stream architecture in ultra-high resolution segmentation with resolution-biased uncertainty estimation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 25 960–25 970

2025
[46]

Ultra-high resolution segmentation via boundary-enhanced patch-merging transformer,

H. Sun, Y . Zhang, L. Xu, S. Jin, and Y . Chen, “Ultra-high resolution segmentation via boundary-enhanced patch-merging transformer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 7087–7095

2025
[47]

Ultra-high resolution image segmentation via locality-aware context fusion and alternating local enhancement,

W. Liu, Q. Li, X. Lin, W. Yang, S. He, and Y . Yu, “Ultra-high resolution image segmentation via locality-aware context fusion and alternating local enhancement,”International Journal of Computer Vision, vol. 132, no. 11, pp. 5030–5047, 2024

2024
[48]

Progressive semantic seg- mentation,

C. Huynh, A. T. Tran, K. Luu, and M. Hoai, “Progressive semantic seg- mentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16 755–16 764

2021
[49]

Looking outside the window: Wide-context transformer for the semantic segmentation of high-resolution remote sensing im- ages,

L. Ding, D. Lin, S. Lin, J. Zhang, X. Cui, Y . Wang, H. Tang, and L. Bruzzone, “Looking outside the window: Wide-context transformer for the semantic segmentation of high-resolution remote sensing im- ages,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022

2022
[50]

Deepglobe 2018: A challenge to parse the earth through satellite images,

I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, F. Hughes, D. Tuia, and R. Raskar, “Deepglobe 2018: A challenge to parse the earth through satellite images,” inProceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 172–181

2018
[51]

Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark,

E. Maggiori, Y . Tarabalka, G. Charpiat, and P. Alliez, “Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark,” in2017 IEEE International geoscience and remote sensing symposium (IGARSS). IEEE, 2017, pp. 3226–3229

2017
[52]

Deep high-resolution repre- sentation learning for human pose estimation,

K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution repre- sentation learning for human pose estimation,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5693–5703

2019
[53]

Object-contextual representations for semantic segmentation,

Y . Yuan, X. Chen, and J. Wang, “Object-contextual representations for semantic segmentation,” inComputer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds. Cham: Springer Interna- tional Publishing, 2020, pp. 173–190. 14

2020
[54]

Convnext v2: Co-designing and scaling convnets with masked autoencoders,

S. Woo, S. Debnath, R. Hu, X. Chen, Z. Liu, I. S. Kweon, and S. Xie, “Convnext v2: Co-designing and scaling convnets with masked autoencoders,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 16 133–16 142

2023
[55]

Very Deep Convolutional Networks for Large-Scale Image Recognition

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”arXiv preprint arXiv:1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[56]

Unified perceptual parsing for scene understanding,

T. Xiao, Y . Liu, B. Zhou, Y . Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” inProceedings of the European conference on computer vision (ECCV), 2018, pp. 418–434

2018
[57]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[58]

Pytorch: An imperative style, high-performance deep learning library,

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antigaet al., “Pytorch: An imperative style, high-performance deep learning library,”Advances in neural information processing systems, vol. 32, 2019

2019
[59]

MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark

MMSegmentation Contributors, “MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark,” 2020.

2020

Chuyu Zhong received the B.S. degree from the School of Astronautics, Beihang University, Beijing, China, in 2025. He is currently pursuing the Ph.D. degree with the Image Processing Center, School of Astronautics, Beihang University. His research interests incl...
