LightAVSeg: Lightweight Audio-Visual Segmentation

Angela Yao; Guodong Ding; Lingqiao Liu; Lin Yuanbo Wu; Qing Zhong; Zaiwen Feng

arxiv: 2605.08805 · v1 · submitted 2026-05-09 · 💻 cs.CV

Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords audio-visual segmentation · lightweight models · decoupled attention · semantic filtering · spatial grounding · cross-modal interaction · mobile inference · auxiliary alignment loss

The pith

LightAVSeg decouples semantic filtering from spatial grounding to achieve linear-cost cross-modal interaction in audio-visual segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that audio-visual segmentation, which locates sounding objects pixel-by-pixel in video, can be made practical for resource-limited hardware by avoiding the quadratic cost of standard cross-modal attention. It introduces a decoupled design that first handles semantic alignment and then performs spatial grounding, plus a training-only auxiliary loss to keep features consistent. This yields a model with 20.5 million parameters that reaches 50.4 mIoU on the MS3 benchmark and runs efficiently on mobile processors. A sympathetic reader would care because current AVS models are too heavy for real-time use in applications like robotics or video editing. The central bet is that the split design preserves enough cross-modal information to match heavier models.

Core claim

LightAVSeg replaces dense quadratic cross-modal attention with a decoupled mechanism of semantic filtering followed by spatial grounding, reducing interaction cost to linear scaling with spatial resolution. An auxiliary alignment loss enforces semantic consistency between audio and visual streams only during training and adds no overhead at inference. On the MS3 benchmark the resulting 20.5-million-parameter network reaches 50.4 mIoU while supporting efficient mobile inference, establishing new state-of-the-art results among lightweight AVS methods.
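
The linear-versus-quadratic contrast can be made concrete with a rough operation count. The sketch below is illustrative only: it assumes the quadratic term comes from all-pairs interaction over the H×W visual tokens and that the linear variant lets each token interact with a single global audio vector. The function names, the factor of 2, and the channel width are hypothetical and are not taken from the paper.

```python
def dense_attention_cost(height: int, width: int, channels: int) -> int:
    """Rough multiply-accumulate count for all-pairs attention over H*W tokens."""
    tokens = height * width
    # Both the similarity matrix and the weighted sum touch every token pair.
    return 2 * tokens * tokens * channels


def linear_interaction_cost(height: int, width: int, channels: int) -> int:
    """Rough multiply-accumulate count when each token meets one global audio vector."""
    tokens = height * width
    # One gating pass and one scoring pass per token (an assumed two-step design).
    return 2 * tokens * channels


if __name__ == "__main__":
    for side in (28, 56, 112):
        dense = dense_attention_cost(side, side, 64)
        linear = linear_interaction_cost(side, side, 64)
        print(f"{side}x{side}: dense={dense:,} linear={linear:,}")
```

Under this simplified model, doubling the side length of the feature map multiplies the token count by 4, so the dense count grows by 16 while the linear count grows by 4.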

What carries the argument

The decoupled design for semantic filtering and spatial grounding, which separates global modality alignment from localized pixel grounding to replace quadratic attention with linear-cost interaction.
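
The summary does not give the module's equations, so the following is only a hypothetical sketch of what a two-step "filter, then ground" interaction could look like. It assumes a single global audio vector with the same channel count as the visual features, channel gating for the semantic step, and a per-pixel dot product for the spatial step. All names (`semantic_filter`, `spatial_ground`, `decoupled_interaction`) are invented for illustration and should not be read as the authors' implementation.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def semantic_filter(pixels: List[Vector], audio: Vector) -> List[Vector]:
    """Gate every channel with one audio-derived weight shared by all pixels."""
    gates = [sigmoid(a) for a in audio]  # computed once, independent of H*W
    return [[v * g for v, g in zip(pixel, gates)] for pixel in pixels]


def spatial_ground(pixels: List[Vector], audio: Vector) -> List[float]:
    """Score each pixel against the audio vector: one dot product per pixel."""
    return [sigmoid(sum(v * a for v, a in zip(pixel, audio))) for pixel in pixels]


def decoupled_interaction(pixels: List[Vector], audio: Vector) -> List[float]:
    """Filter first, then ground; the work grows linearly with the pixel count."""
    return spatial_ground(semantic_filter(pixels, audio), audio)


if __name__ == "__main__":
    pixels = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # three pixels, two channels
    audio = [2.0, -2.0]
    print(decoupled_interaction(pixels, audio))
```

Neither step compares pixels with each other, which is where the assumed saving over all-pairs attention would come from.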

If this is right

Cross-modal interaction cost scales linearly with spatial resolution instead of quadratically.
The auxiliary alignment loss improves training consistency with no added inference cost.
The model supports real-time audio-visual segmentation on mobile processors.
LightAVSeg sets a new accuracy bar among lightweight AVS methods while using roughly one-seventh the parameters of AVSegFormer.
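
A training-only auxiliary loss of this kind can be sketched as follows. This is a minimal illustration, assuming the loss is one minus the cosine similarity between globally pooled visual features and an audio embedding, averaged over scales, and that both modalities have already been projected to a shared dimension. The loss form, the `weight` value, and all function names are assumptions; the paper's actual formulation is not given in this summary.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector, eps: float = 1e-8) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def global_average_pool(pixels: List[Vector]) -> Vector:
    """Average the per-pixel feature vectors into one vector per scale."""
    count = len(pixels)
    return [sum(channel) / count for channel in zip(*pixels)]


def alignment_loss(visual_scales: List[List[Vector]], audio: Vector) -> float:
    """Mean of (1 - cosine similarity) between pooled visual features and audio."""
    terms = [1.0 - cosine(global_average_pool(s), audio) for s in visual_scales]
    return sum(terms) / len(terms)


def total_loss(seg_loss: float, visual_scales: List[List[Vector]], audio: Vector,
               training: bool, weight: float = 0.1) -> float:
    # The auxiliary term exists only in the training objective, so nothing
    # extra is computed when the model is used for inference.
    if not training:
        return seg_loss
    return seg_loss + weight * alignment_loss(visual_scales, audio)
```

Because the term only shapes the gradients during training, dropping it at inference leaves the forward pass unchanged.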

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling pattern could be tested on other multimodal dense-prediction tasks such as audio-visual object detection to check whether linear scaling generalizes.
If the linear-cost property holds at higher resolutions, the approach might enable on-device processing of 1080p or 4K video streams that current quadratic models cannot handle.
The training-only loss suggests similar auxiliary objectives could be explored for other efficiency-focused multimodal architectures without runtime penalty.

Load-bearing premise

Separating semantic filtering from spatial grounding still captures enough cross-modal information to localize sounding objects at pixel level as accurately as full quadratic attention.

What would settle it

Replacing the decoupled interaction module with standard cross-attention under the same parameter budget and observing whether mIoU on MS3 drops below 50.4 or mobile inference speed falls significantly.
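
For such a comparison, both variants would be scored with the same mIoU routine. The sketch below assumes binary masks and a per-frame average, with an empty prediction on an empty target counted as a perfect match; the benchmark's exact convention may differ, and the reported 50.4 is presumably on a 0 to 100 scale.

```python
from typing import List, Sequence

Mask = List[List[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-frame IoU over all frames."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(100.0 * mean_iou([pred, target], [target, target]))
```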

Figures

Figures reproduced from arXiv: 2605.08805 by Angela Yao, Guodong Ding, Lingqiao Liu, Lin Yuanbo Wu, Qing Zhong, Zaiwen Feng.

**Figure 2.** Overview of LightAVSeg. Following the visual and audio streams, we introduce the Reciprocal Audio-Visual Encoder to iteratively refine the global audio state using visual context, the Cross-Modal Fusion Decoder to inject these auditory cues back into the visual stream for segmentation, and the Multi-Scale Audio-Visual Alignment Loss (Lmsa) to enforce progressive cross-modal consistency.

**Figure 3.** Visual comparison of feature activation maps on the MS3 benchmark. For each method, we visualize the features from the last three stages S (from left to right). AVSBench is consistently distracted by background noise across stages. AVSegFormer focuses on the target but lacks boundary definition. Ours demonstrates a coarse-to-fine evolution, progressively suppressing background context to achieve precise al…

**Figure 4.** The inference latency of components. Section 4.6 (Latency Statistics) analyzes the component-wise latency on a Snapdragon 8 Elite.

**Figure 5.** Failure cases of LightAVSeg. Due to the reduced capacity of the lightweight backbone, our model may struggle in challenging scenarios: (Top) Semantic inconsistency in crowded scenes with multiple similar objects; (Middle) Incomplete segmentation of large objects with complex textures; (Bottom) Missed detection when visual cues are subtle or occluded. These cases highlight the inherent trade-off between inf…

**Figure 6.** MS3 (Zhou et al., 2022) Visualisation Results.

**Figure 7.** S4 (Zhou et al., 2022) Visualisation Results.

**Figure 8.** AVSS (Zhou et al., 2025) Visualisation Results.

Original abstract

Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state-of-the-art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LightAVSeg swaps quadratic attention for a decoupled semantic-spatial module to linearize costs in audio-visual segmentation, with a training-only alignment loss, but the accuracy claims rest on thin visible evidence.

LightAVSeg replaces the quadratic cross-modal attention in audio-visual segmentation with a decoupled design that separates semantic filtering from spatial grounding. This change, along with a training-only auxiliary alignment loss, is the core of the paper.

The new part is the targeted fix for the interaction bottleneck. Most prior efficiency work cuts the backbone size, but here they keep more of the model and linearize the expensive part. That makes sense for keeping performance while dropping compute. The auxiliary loss enforces consistency without adding anything at test time, which is a clean addition. The results claim 20.5M parameters and 50.4 mIoU on MS3, beating or matching other lightweight methods while running efficiently on mobile hardware.

The main weakness is the lack of visible support for those claims in the provided abstract. There are no ablations showing what the decoupled module loses or gains compared to full attention, and no error analysis or detailed experimental setup. This makes it hard to assess if the design truly preserves enough cross-modal information or if the numbers come from favorable conditions. The full paper likely includes more, but the current presentation leaves the soundness of the accuracy claims open to question.

This paper is for computer vision researchers focused on efficient multimodal video understanding, particularly those targeting deployment on limited hardware. A reader working on similar segmentation tasks could pick up the architectural idea and the loss trick. It deserves a serious referee because the efficiency angle is timely and the proposal is specific enough to evaluate properly. I recommend putting it through peer review, but with instructions to the reviewers to check the experimental details and ablations closely.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes LightAVSeg, a lightweight audio-visual segmentation framework that replaces dense quadratic cross-modal attention with a decoupled semantic-filtering plus spatial-grounding module to achieve linear interaction cost with spatial resolution. It further introduces a training-only auxiliary alignment loss with zero inference overhead. The central claim is that this design yields new state-of-the-art accuracy among lightweight AVS models: 20.5 M parameters (approximately 1/7 of AVSegFormer) and 50.4 mIoU on the MS3 benchmark, while supporting efficient mobile-processor inference.

Significance. If the performance and efficiency claims are substantiated with proper ablations and baselines, the work would meaningfully advance resource-efficient multimodal segmentation by directly addressing the cross-modal interaction bottleneck rather than only shrinking the backbone. This could enable practical deployment of pixel-level sounding-object localization on edge devices.

major comments (2)

Abstract: the central performance claims (new SOTA among lightweight methods, 50.4 mIoU, linear scaling, mobile inference) are asserted without any reference to experimental details, ablation studies, error analysis, or baseline comparisons, leaving the primary empirical contribution unsupported by visible evidence in the provided text.
Method (decoupled design description): the claim that semantic filtering plus spatial grounding preserves sufficient cross-modal information to match or exceed full quadratic attention is load-bearing for the accuracy claim, yet no quantitative analysis, information-flow argument, or comparison to attention baselines is supplied to substantiate that the linear-cost approximation does not discard critical audio-visual correspondences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better present our contributions. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: Abstract: the central performance claims (new SOTA among lightweight methods, 50.4 mIoU, linear scaling, mobile inference) are asserted without any reference to experimental details, ablation studies, error analysis, or baseline comparisons, leaving the primary empirical contribution unsupported by visible evidence in the provided text.

Authors: The abstract is intended as a concise summary of the key claims and results. The full manuscript supplies the requested details: Section 4 presents baseline comparisons (including AVSegFormer), ablation studies on the decoupled modules (Section 4.3), efficiency measurements on mobile processors (Section 4.4), and overall performance on MS3. We agree that explicit pointers would improve readability. We will revise the abstract to add brief references such as “as shown through extensive experiments and ablations in Section 4” while preserving its length and clarity.

Revision: yes

Referee: Method (decoupled design description): the claim that semantic filtering plus spatial grounding preserves sufficient cross-modal information to match or exceed full quadratic attention is load-bearing for the accuracy claim, yet no quantitative analysis, information-flow argument, or comparison to attention baselines is supplied to substantiate that the linear-cost approximation does not discard critical audio-visual correspondences.

Authors: The manuscript provides empirical support via direct comparisons: LightAVSeg exceeds the accuracy of the full quadratic-attention baseline AVSegFormer while using roughly 1/7 the parameters, and Section 4.3 ablates the individual contributions of semantic filtering and spatial grounding. These results indicate that critical correspondences are retained. We nevertheless agree that an explicit information-preservation analysis would strengthen the argument. We will add a short quantitative comparison (e.g., feature similarity metrics between the decoupled modules and full attention) together with a brief information-flow discussion, placed either in Section 3 or as an additional ablation in Section 4.

Revision: yes
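
One simple candidate for the feature similarity metric mentioned in this response is the mean token-wise cosine similarity between the output of the decoupled module and that of a full-attention reference. The sketch below assumes both feature maps are flattened to the same N×C shape. The metric choice and the function names are hypothetical; the response does not say which metric the authors would use.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector, eps: float = 1e-8) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)


def mean_token_similarity(features_a: List[Vector], features_b: List[Vector]) -> float:
    """Average cosine similarity between matching tokens of two feature maps."""
    if len(features_a) != len(features_b):
        raise ValueError("feature maps must contain the same number of tokens")
    sims = [cosine(a, b) for a, b in zip(features_a, features_b)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    decoupled = [[1.0, 0.0], [0.0, 2.0]]   # stand-in for decoupled-module output
    reference = [[2.0, 0.0], [0.0, 1.0]]   # stand-in for full-attention output
    print(mean_token_similarity(decoupled, reference))
```

A value near 1 would suggest the two modules produce similarly oriented features, though it would not by itself show that segmentation-relevant information is preserved.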

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper introduces a novel decoupled architecture for semantic filtering and spatial grounding to achieve linear-cost cross-modal interaction, plus a training-only auxiliary alignment loss. These are presented as explicit design choices motivated by the quadratic attention bottleneck in prior AVS models, with performance validated through experiments on MS3 and other benchmarks. No equations or claims reduce a prediction to a fitted input by construction, no uniqueness theorems are imported from self-citations, and no ansatz or renaming of known results is used to derive the central efficiency or accuracy results. The derivation chain is self-contained and externally falsifiable via the reported mIoU and parameter counts.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the design is described at a high level without mathematical details or assumptions listed.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: auxiliary Multi-Scale Audio-Visual Alignment Loss (Lmsa) to enforce semantic consistency during training with zero inference overhead

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages

[1] PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
[2] The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 2010.
[3] SelM: Selective mechanism based audio-visual segmentation. Proceedings of the 32nd ACM International Conference on Multimedia.
[4] Multi-scale Spatial Representation Learning via Recursive Polynomial Networks. IJCAI.
[5] Consistent Training for Online Video Instance Segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[6] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.
    A Comedy of Estimators: On KL Regularization in RL Training of LLMs. arXiv preprint arXiv:2512.21852.
[7] SeaFormer: Squeeze-enhanced axial transformer for mobile semantic segmentation. The Eleventh International Conference on Learning Representations.
[8] AVSegFormer: Audio-visual segmentation with transformer. Proceedings of the AAAI Conference on Artificial Intelligence.
[9] Audio-visual segmentation. European Conference on Computer Vision, 2022.
[10] AVS-Mamba: Exploring temporal and multi-modal Mamba for audio-visual segmentation. IEEE Transactions on Multimedia.
[11] Complementary and contrastive learning for audio-visual segmentation. IEEE Transactions on Multimedia.
[12] Audio-visual segmentation with semantics. International Journal of Computer Vision, 2025.
[13] Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[14] DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[15] SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems.
[16] Cross-image pixel contrasting for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[17] LLMFormer: Large language model for open-vocabulary semantic segmentation. International Journal of Computer Vision, 2025.
[18] Rethinking BiSeNet: A lightweight network for urban water extraction. IEEE Transactions on Geoscience and Remote Sensing, 2023.
[19] Fast-SCNN: Fast semantic segmentation network. arXiv preprint arXiv:1902.04502.
[20] TopFormer: Token pyramid transformer for mobile semantic segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[21] MobileInst: Video instance segmentation on the mobile. Proceedings of the AAAI Conference on Artificial Intelligence.
[22] Sound event detection via dilated convolutional recurrent neural networks. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[23] Improving audio-visual segmentation with bidirectional generation. Proceedings of the AAAI Conference on Artificial Intelligence.
[24] Audio-aware query-enhanced transformer for audio-visual segmentation. arXiv preprint arXiv:2307.13236.
[25] Audio-visual segmentation by exploring cross-modal mutual semantics. Proceedings of the 31st ACM International Conference on Multimedia.
[26] CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation. Proceedings of the 31st ACM International Conference on Multimedia.
[27] Multimodal variational auto-encoder based audio-visual segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[28] Identity mappings in deep residual networks. European Conference on Computer Vision, 2016.
[29] Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[30] PVT v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 2022.
[31] Look, listen and learn. Proceedings of the IEEE International Conference on Computer Vision.
[32] Objects that sound. Proceedings of the European Conference on Computer Vision (ECCV).
[33] Localizing visual sounds the hard way. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[34] Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning. Proceedings of the 28th ACM International Conference on Multimedia.
[35] Audio-visual scene analysis with self-supervised multisensory features. Proceedings of the European Conference on Computer Vision (ECCV).
[36] Learning to localize sound source in visual scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[37] ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[38] Audio Set: An ontology and human-labeled dataset for audio events. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[39] Holistically-nested edge detection. Proceedings of the IEEE International Conference on Computer Vision.
[40] Deeply-supervised nets. Artificial Intelligence and Statistics, 2015.
