Rethinking Atrous Convolution for Semantic Image Segmentation
Pith reviewed 2026-05-12 00:20 UTC · model grok-4.3
The pith
Atrous convolutions in cascaded or parallel modules plus global context enable accurate semantic segmentation without DenseCRF.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By employing atrous convolution in cascaded or parallel arrangements with several rates and by augmenting the Atrous Spatial Pyramid Pooling module with image-level global-context features, the DeepLabv3 architecture captures multi-scale information directly inside the network, yielding significant accuracy gains over prior DeepLab models on PASCAL VOC 2012 without requiring DenseCRF post-processing and matching the performance of contemporary state-of-the-art segmentation systems.
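To make the central operation concrete, here is a minimal sketch of atrous (dilated) convolution in one dimension. It is an illustration only, not the paper's implementation: the function name `atrous_conv1d`, the toy signal, and the difference kernel are all assumptions chosen for readability, and the real network applies the same idea to 2D feature maps with learned filters.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """Valid-mode 1D atrous correlation: kernel taps sit `rate` samples apart."""
    # Number of input samples spanned by one output value.
    span = (len(kernel) - 1) * rate + 1
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + k * rate] for k, w in enumerate(kernel)))
    return out


signal = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical toy input
kernel = [1, 0, -1]                        # hypothetical difference filter

dense = atrous_conv1d(signal, kernel, rate=1)    # taps 2 samples apart end to end
dilated = atrous_conv1d(signal, kernel, rate=3)  # same 3 weights, taps 6 samples apart
```

With `rate=1` the filter behaves like an ordinary convolution. With `rate=3` the same three weights cover seven input samples, so the field of view grows without adding parameters.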
What carries the argument
Atrous Spatial Pyramid Pooling module augmented with image-level global features, together with cascaded or parallel atrous convolution blocks that apply multiple dilation rates.
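The parallel arrangement with an image-level branch can be sketched as follows. This is a simplified, single-channel, 1D stand-in under stated assumptions: the helper names `atrous_same` and `aspp_with_image_level` are hypothetical, the kernel is fixed rather than learned, and the learned projection and normalization layers of a real module are omitted. It only shows the data flow: several atrous branches at different rates plus one globally pooled value broadcast to every position, concatenated per position.

```python
def atrous_same(feature, kernel, rate):
    """1D atrous correlation, zero-padded so the output length equals the input length.

    Assumes an odd-length kernel so the padding is symmetric.
    """
    half = (len(kernel) // 2) * rate
    padded = [0.0] * half + list(feature) + [0.0] * half
    return [
        sum(w * padded[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(feature))
    ]


def aspp_with_image_level(feature, kernel, rates):
    """Stack parallel atrous branches plus one broadcast global-average branch."""
    branches = [atrous_same(feature, kernel, r) for r in rates]
    global_mean = sum(feature) / len(feature)        # image-level pooling
    branches.append([global_mean] * len(feature))    # broadcast to every position
    # Concatenate along the channel axis: one small vector per position.
    return [list(channels) for channels in zip(*branches)]


feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]             # hypothetical feature row
fused = aspp_with_image_level(feature, kernel=[1.0, 1.0, 1.0], rates=(1, 2))
```

Each entry of `fused` holds one value per atrous rate followed by the shared global value, which is the sense in which local multi-scale responses and global context are combined.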
If this is right
- Multi-scale context can be extracted inside a single forward pass by probing features at several atrous rates in parallel or cascade.
- Global image-level features can be fused with local convolutional features to improve scene layout understanding.
- DenseCRF post-processing is no longer required to reach high segmentation accuracy on PASCAL VOC 2012.
- The same modular atrous design can be inserted into other DCNN backbones to adapt their receptive fields without changing network depth.
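The claim that atrous rates change the receptive field without changing depth can be checked with simple arithmetic. The sketch below assumes stride-1 layers and measures sizes in feature-map positions. The function names are hypothetical and the rate sets are illustrative choices, not a statement of the exact configuration evaluated in the paper.

```python
def effective_kernel_size(kernel_size, rate):
    """Extent covered by a kernel whose taps are `rate` positions apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def cascaded_receptive_field(kernel_size, rates):
    """Receptive field of stride-1 atrous layers applied one after another."""
    rf = 1
    for rate in rates:
        # Each layer widens the field by (effective kernel size - 1).
        rf += effective_kernel_size(kernel_size, rate) - 1
    return rf


# Parallel branches: each 3-tap kernel sees a different extent of the same input.
parallel = {r: effective_kernel_size(3, r) for r in (6, 12, 18)}

# Cascade: three 3-tap layers with growing rates.
cascade = cascaded_receptive_field(3, (2, 4, 8))
```

In both arrangements the number of weights per kernel stays at three per dimension; only the spacing between taps changes.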
Where Pith is reading between the lines
- The removal of the DenseCRF stage reduces both inference latency and memory use, which could support deployment on resource-limited devices.
- Because the modules operate at different scales inside the network, similar rate-scheduling ideas may transfer to other dense-prediction tasks such as depth estimation or instance segmentation.
- Sharing the exact training recipe allows direct ablation studies that isolate the effect of each atrous configuration on new datasets.
Load-bearing premise
The observed accuracy improvements are produced by the new atrous modules and global-context features rather than by any unstated changes in training schedule, data augmentation, or hyper-parameter choices.
What would settle it
Re-train the previous DeepLabv2 model with exactly the same training schedule, augmentation, and hyper-parameters used for DeepLabv3; if the mean intersection-over-union gap on the PASCAL VOC 2012 validation set closes or reverses, the contribution of the atrous modules is not established.
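For readers unfamiliar with the metric in this test, a minimal sketch of mean intersection-over-union on flat label lists follows. The function name `mean_iou` and the toy labels are assumptions. A benchmark evaluation would accumulate counts over the whole validation set and handle ignored (void) pixels, which this sketch leaves out.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatch adds one pixel to the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


pred = [0, 0, 1, 1, 2, 2]     # hypothetical predicted labels
target = [0, 1, 1, 1, 2, 0]   # hypothetical ground-truth labels
score = mean_iou(pred, target, num_classes=3)
```

Comparing two models under this test means computing the same quantity for each on the same validation labels and looking at the difference.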
Original abstract
In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed 'DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper revisits atrous convolution for semantic image segmentation, proposing modules that apply atrous convolutions either in cascade or in parallel to capture multi-scale context, and augments the Atrous Spatial Pyramid Pooling (ASPP) module with image-level features to encode global context. It further details training practices and claims that the resulting DeepLabv3 system significantly improves over prior DeepLab versions (without DenseCRF) while attaining performance comparable to other state-of-the-art models on the PASCAL VOC 2012 benchmark.
Significance. If the reported gains are shown to stem from the architectural innovations rather than uncontrolled training variables, the work offers a practical and incremental advance in multi-scale context modeling for dense prediction tasks, building directly on prior ASPP designs with clear implementation guidance that could influence follow-on architectures.
major comments (2)
- [Experiments and abstract] Experiments section (and comparisons to prior DeepLab versions): the central claim that performance gains arise from the cascade/parallel atrous modules and image-level feature augmentation requires explicit confirmation that training schedules, data augmentation, and hyper-parameters are identical to those used in the authors' previous DeepLab papers; the abstract's reference to sharing 'implementation details and experience on training our system' does not substitute for matched baselines, leaving open the possibility that gains are confounded by optimization differences.
- [§3.2–3.3] §3.2–3.3 (atrous modules and ASPP augmentation): the manuscript should provide controlled ablations that isolate the incremental benefit of the new cascade/parallel atrous rates and the image-level feature branch while holding all training variables fixed, as the current presentation does not fully rule out that observed mIoU lifts are driven by the global context addition alone or by unstated hyper-parameter changes.
minor comments (1)
- [Abstract] Abstract: the claims of 'significant improvement' and 'comparable performance' would be more informative if accompanied by the specific mIoU numbers and table references that appear later in the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will update the paper to provide the requested clarifications and additional controlled experiments.
- Referee: [Experiments and abstract] Experiments section (and comparisons to prior DeepLab versions): the central claim that performance gains arise from the cascade/parallel atrous modules and image-level feature augmentation requires explicit confirmation that training schedules, data augmentation, and hyper-parameters are identical to those used in the authors' previous DeepLab papers; the abstract's reference to sharing 'implementation details and experience on training our system' does not substitute for matched baselines, leaving open the possibility that gains are confounded by optimization differences.
  Authors: We agree that explicit confirmation of matched training variables is necessary to attribute gains to the architectural changes. In the revised manuscript we will add a dedicated paragraph and table in the Experiments section that directly compares the training schedule (poly learning-rate policy, iteration count, crop size, batch size), data augmentation (random scaling, horizontal flipping, color jitter), and hyper-parameters used in DeepLabv3 to those reported in our prior DeepLabv2 work, noting only the architecture-specific modifications. This will make clear that the core optimization settings remain identical.
  Revision: yes
- Referee: [§3.2–3.3] §3.2–3.3 (atrous modules and ASPP augmentation): the manuscript should provide controlled ablations that isolate the incremental benefit of the new cascade/parallel atrous rates and the image-level feature branch while holding all training variables fixed, as the current presentation does not fully rule out that observed mIoU lifts are driven by the global context addition alone or by unstated hyper-parameter changes.
  Authors: We appreciate the request for stricter isolation of each component. While the current manuscript already reports ablation results for different atrous rates in cascade and parallel settings as well as the addition of the image-level feature branch, we acknowledge that these experiments could be presented more explicitly as incremental, fixed-training-protocol studies. In the revision we will insert a new table that starts from a common baseline and successively adds the cascade module, the parallel module, and the image-level features, reporting mIoU on the PASCAL VOC 2012 validation set under identical training settings for each step.
  Revision: yes
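The "poly" learning-rate policy named in the first response is commonly written as the base rate multiplied by (1 - step / max_steps) raised to a power. A minimal sketch, assuming that form: the function name `poly_learning_rate` is hypothetical, and the base rate, step count, and power of 0.9 are illustrative defaults rather than values confirmed by this page.

```python
def poly_learning_rate(base_lr, step, max_steps, power=0.9):
    """Polynomial decay from base_lr at step 0 down to 0 at max_steps."""
    return base_lr * (1.0 - step / max_steps) ** power


# Illustrative values only: sample the schedule at the start, middle, and end.
schedule = [poly_learning_rate(0.007, s, 30000) for s in (0, 15000, 30000)]
```

Matching baselines in the sense the referee asks for would mean holding `base_lr`, `max_steps`, and `power` fixed across the models being compared.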
Circularity Check
No circularity: empirical benchmark evaluation with independent results
The paper proposes architectural changes (cascaded/parallel atrous convolutions and augmented ASPP) and reports empirical mIoU improvements on PASCAL VOC 2012. There is no derivation chain, there are no equations, and no 'predictions' reduce by construction to fitted parameters or self-defined inputs. Benchmark numbers are externally falsifiable and not derived from the authors' prior fits. Self-references to earlier DeepLab versions are normal citations of prior empirical work and are not load-bearing for any mathematical reduction.
Axiom & Free-Parameter Ledger
free parameters (2)
- atrous rates
- training hyper-parameters
axioms (1)
- domain assumption: Deep convolutional networks can be trained end-to-end on labeled segmentation data to produce per-pixel predictions.
Forward citations
Cited by 60 Pith papers
-
RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis
Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.
-
Interference-Aware Multi-Task Unlearning
Introduces interference-aware multi-task unlearning with task-aware gradient projection and instance-level gradient orthogonalization, reducing interference scores by 30.3% and 52.9% on vision benchmarks.
-
CineMatte: Background Matting for Virtual Production and Beyond
CineMatte uses a cross-attention design on a Siamese DINOv3 ViT plus a pretrained upsampler to produce robust mattes for virtual production, backed by a new non-synthetic 4K VP dataset that supports camera motion.
-
Functionalization via Structure Completion and Motion Rectification
Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture wi...
-
Noise2Map: End-to-End Diffusion Model for Semantic Segmentation and Change Detection
Noise2Map repurposes diffusion model denoising into a direct predictor for semantic segmentation and change detection tasks in remote sensing, achieving top average ranks on benchmark datasets.
-
VitaminP: cross-modal learning enables whole-cell segmentation from routine histology
VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.
-
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
-
Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline
OVRSISBenchV2 is a realistic benchmark expanding scene and category coverage for open-vocabulary remote sensing segmentation, with Pi-Seg baseline showing strong transfer via positive-incentive noise perturbations.
-
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
VGGT-Segmentor achieves new state-of-the-art cross-view segmentation on Ego-Exo4D with 67.7% and 68.0% average IoU using a geometry-enhanced model and correspondence-free pretraining that beats most supervised baselines.
-
Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
A hierarchical prompt tree with self-reflection graph propagation enables positive forward and backward knowledge transfer in incremental surgical instrument segmentation, improving over baselines by more than 5% and ...
-
Contour Refinement using Discrete Diffusion in Low Data Regime
A CNN-based discrete diffusion method refines sparse contours from segmentation masks using simplified denoising steps and minimal post-processing, outperforming baselines on small medical and environmental datasets w...
-
CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition
CONSIGN applies conformal prediction to segmentation by incorporating spatial structure through decomposition, producing tighter and more interpretable uncertainty estimates with error guarantees.
-
EmbodiTTA: Resource-Efficient Test-Time Adaptation for Embodied Visual Systems
OD-TTA enables resource-efficient test-time adaptation on edge devices by triggering updates only on detected domain shifts, achieving comparable accuracy with lower energy and computation costs for embodied visual systems.
-
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...
-
Adaptive Camera Sensor for Vision Models
Lens adapts camera sensors in real time via the VisiT confidence-based quality indicator to improve vision model accuracy on domain-shifted images, shown on ImageNet-ES and a new diverse benchmark.
-
FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
FEATHER integrates data reordering into its reduction network via a new spatial array (Nest) and multi-stage network (BIRRD) to enable low-overhead dataflow switching in ML accelerators, delivering 1.27-2.89x latency ...
-
R$^{2}$Net: 2D Deep Residual Learning with Height Embedding for 3D Radio Map Estimation
R²Net applies 2D deep residual learning with height embedding to estimate 3D radio maps, offering separate indoor and outdoor variants plus a new 3D indoor dataset.
-
SENSE: Satellite-based ENergy Synthesis for Sustainable Environment
SENSE is a controllable diffusion model that jointly generates realistic urban satellite imagery and aligned building energy consumption and height maps from road networks and density inputs, improving downstream task...
-
SEED: Targeted Data Selection by Weighted Independent Set
SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods o...
-
AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection
AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
-
Nano-U: Efficient Terrain Segmentation for Tiny Robot Navigation
A compact network called Nano-U trained with quantization-aware distillation enables accurate binary terrain segmentation and runs efficiently on ESP32-S3 microcontrollers for tiny robots.
-
UnGAP: Uncertainty-Guided Affine Prompting for Real-Time Crack Segmentation
UnGAP turns aleatoric uncertainty into an active calibration signal via affine feature modulation to fix gradient suppression in heteroscedastic crack segmentation while maintaining real-time performance.
-
Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy
RGSUD achieves SOTA unsupervised deraining by using IQA-based reward recycling and self-reinforcement to constrain optimization and improve pseudo-paired data quality.
-
DOT-Sim: Differentiable Optical Tactile Simulation with Precise Real-to-Sim Physical Calibration
DOT-Sim uses MPM physics plus learned residual optics to simulate deformable tactile sensors, supporting zero-shot sim-to-real transfer for classification and control tasks.
-
Diffusion Model as a Generalist Segmentation Learner
DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
-
FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment
FryNet combines RGB and thermal imaging with adversarial regularization to segment oil areas, classify usability, and predict oxidation levels like PV and Totox with high accuracy on video data.
-
Lorentz Framework for Semantic Segmentation
A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3...
-
From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation
Petro-SAM adapts SAM via a Merge Block for polarized views plus multi-scale fusion and color-entropy priors to jointly achieve grain-edge and lithology segmentation in petrographic images.
-
AIBuildAI: An AI Agent for Automatically Building AI Models
AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
-
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
VGGT-Segmentor achieves new SOTA cross-view segmentation on Ego-Exo4D (67.7% Ego-to-Exo, 68.0% Exo-to-Ego IoU) via geometry-enhanced features, a three-stage segmentation head, and correspondence-free pretraining.
-
GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality
GTPBD-MM is the first multimodal benchmark for global terraced parcel extraction, integrating image, text, and DEM data with experiments showing that textual and terrain cues improve delineation accuracy over image-on...
-
Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization
A large pool of diverse artistic styles for style-transfer augmentation improves domain generalization in driving vision models more than repeated use of few styles or domain-matched styles, yielding the lightweight S...
-
CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation
CrossWeaver introduces MIB and SAF modules to enable flexible, reliability-aware cross-modal interaction and fusion, achieving SOTA multimodal semantic segmentation with minimal parameters and generalization to unseen...
-
Flux4D: Flow-based Unsupervised 4D Reconstruction
Flux4D reconstructs large-scale dynamic 4D scenes unsupervised by predicting moving 3D Gaussians from photometric losses and static regularization when trained across multiple scenes.
-
Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation
Learn2Synth optimizes data synthesis parameters with hypergradients to train segmentation networks solely on synthetic brain images that generalize to real scans.
-
Causal Unsupervised Semantic Segmentation
CAUSE uses frontdoor adjustment with a discretized concept clusterbook mediator to perform unsupervised semantic segmentation and reports state-of-the-art results.
-
Language-driven Semantic Segmentation
LSeg achieves competitive zero-shot semantic segmentation by contrastively aligning dense pixel embeddings from a transformer with text embeddings of class labels.
-
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
MobileViT is a lightweight vision transformer that reports 78.4% top-1 accuracy on ImageNet-1k with ~6M parameters, outperforming MobileNetv3 by 3.2% and DeIT by 6.2% at similar size, plus gains on MS-COCO detection.
-
Segmenting Objects in Day and Night: Edge-Conditioned CNN for Thermal Image Semantic Segmentation
EC-CNN uses a gated feature-wise transform to incorporate edge priors for thermal semantic segmentation and introduces the SODA dataset of over 7,000 labeled thermal images.
-
Gated-SCNN: Gated Shape CNNs for Semantic Segmentation
Gated-SCNN adds a gated shape stream to standard CNNs for semantic segmentation, achieving improved boundary quality and SOTA results on Cityscapes.
-
Deep Saliency Models : The Quest For The Loss Function
Varying and combining loss functions in deep visual saliency prediction models produces significant performance gains on fixed architectures that hold across datasets and networks.
-
WoundFormer: Multi-Scale Spatial Feature Fusion for Multi-Class Wound Tissue Segmentation
WoundFormer modifies SegFormer with a spatially-preserving multi-scale aggregation head for multi-class wound tissue segmentation, reporting 81.9% Dice on the WoundTissueSeg dataset with gains over baselines.
-
Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging
A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classificati...
-
Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation
DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.
-
AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos
AutoAWG generates controllable adverse weather automotive videos via semantics-guided adaptive multi-control fusion and vanishing-point-anchored temporal synthesis from static images, reducing FID by 50% and FVD by 16...
-
DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation
DeltaSeg, a tiered-attention U-Net variant with a novel Deep Delta Attention module, outperforms 12 prior models on two multi-class structural defect segmentation benchmarks.
-
WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms
WILD-SAM is a fine-tuned SAM variant using phase-aware MoE adapters and wavelet subband enhancement that achieves state-of-the-art landslide detection on wrapped InSAR data.
-
HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation
HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.
-
SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation
SCASeg proposes a strip cross-attention decoder with lateral connections and a cross-layer block to efficiently capture global-local context, reporting competitive or superior results on ADE20K, Cityscapes, COCO-Stuff...
-
Cross Attention Network for Semantic Segmentation
Cross Attention Network fuses spatial and contextual features via a cross attention module to improve semantic segmentation performance and speed on Cityscapes and CamVid.
-
An Efficient 3D CNN for Action/Object Segmentation in Video
End-to-end 3D CNN with separable convolutions for efficient simultaneous spatial-temporal video object and action segmentation.
-
Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox
Stabilized SegFormer-B5 reaches 0.4572 mIoU SOTA on original Apple DMS split; 80/10/10 split reaches 0.5276 mIoU but degrades real-world OOD performance per qualitative review.
-
SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation
SpineContextResUNet achieves Dice scores of 88.17% on VerSe2020 and 88.13% on CTSpine1K while using ~1.7M parameters and running inference on commodity hardware with 8GB RAM.
-
FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation
FoR-Net improves efficiency in semantic segmentation by focusing on hard regions with a learned selector and multi-scale convolutions, achieving competitive results on Cityscapes.
-
An End-to-End Decision-Aware Multi-Scale Attention-Based Model for Explainable Autonomous Driving
A decision-aware multi-scale attention network generates tailored explanations for autonomous driving choices and outperforms prior models on F1 and a new Joint F1 metric across two datasets.
-
A Benchmark Study of Segmentation Models and Adaptation Strategies for Landslide Detection from Satellite Imagery
Transformer-based models deliver strong landslide segmentation on satellite images, and parameter-efficient fine-tuning matches full fine-tuning accuracy while cutting trainable parameters by up to 95%.
-
UA-Net: Uncertainty-Aware Network for TRISO Image Semantic Segmentation
UA-Net segments TRISO fuel micrographs into five regions with 95.5% mIoU and 97.3% mP on 102 test images, while its meta-model detects misclassifications at 91.8% specificity and 93.5% sensitivity.
-
EDFNet: Early Fusion of Edge and Depth for Thin-Obstacle Segmentation in UAV Navigation
Early RGB-Depth-Edge fusion in EDFNet provides a competitive baseline for thin-obstacle segmentation on the DDOS dataset, with the best pretrained U-Net model reaching 0.244 Thin-Structure Evaluation Score.
-
Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets
Benchmarking ten segmentation models on a nine-image histology dataset and a 153-image generalization set reveals unstable rankings, overlapping confidence intervals, and dataset-specific performance hierarchies, advo...
-
Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes
GSAM applies random cropping to enable variable input sizes for efficient SAM fine-tuning, claiming lower compute with comparable or higher accuracy on varied datasets.
Reference graph
Works this paper leans on
-
[1]
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
M. Abadi, A. Agarwal, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016
work page Pith review arXiv 2016
-
[2]
A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. In Eurographics, 2010. Method Coarse mIOU DeepLabv2-CRF [11] 70.4 Deep Layer Cascade [52] 71.1 ML-CRNN [21] 71.2 Adelaide context [55] 71.6 FRRN [70] 71.8 LRR-4x [25] ✓ 71.8 RefineNet [54] 73.6 FoveaNet [51] 74.1 Ladder DenseNet [46] 74.3 PEARL [42] 75.4 Glo...
work page 2010
-
[3]
SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
V . Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015
work page Pith review arXiv 2015
-
[4]
A. Brandt. Multi-level adaptive solutions to boundary-value problems. Mathematics of computation, 31(138):333–390, 1977
work page 1977
-
[5]
W. L. Briggs, V . E. Henson, and S. F. McCormick.A multigrid tutorial. SIAM, 2000
work page 2000
- [6]
- [7]
-
[8]
S. Chandra and I. Kokkinos. Fast, exact and multi-scale in- ference for semantic image segmentation with deep Gaussian CRFs. arXiv:1603.08358, 2016
-
[9]
L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In CVPR, 2016
work page 2016
-
[10]
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015
work page 2015
-
[11]
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915, 2016
work page Pith review arXiv 2016
-
[12]
L.-C. Chen, Y . Yang, J. Wang, W. Xu, and A. L. Yuille. At- tention to scale: Scale-aware semantic image segmentation. In CVPR, 2016
work page 2016
-
[13]
F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv:1610.02357, 2016. Figure 8. Visualization results on Cityscapes val set when training with only train fine set
work page Pith review arXiv 2016
- [14]
- [15]
-
[16]
J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmenta- tion. In ICCV, 2015
work page 2015
- [17]
-
[18]
J. Dai, H. Qi, Y . Xiong, Y . Li, G. Zhang, H. Hu, and Y . Wei. Deformable convolutional networks. arXiv:1703.06211, 2017
work page Pith review arXiv 2017
-
[19]
D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. arXiv:1411.4734, 2014
-
[20]
M. Everingham, S. M. A. Eslami, L. V . Gool, C. K. I. Williams, J. Winn, and A. Zisserma. The pascal visual object classes challenge a retrospective. IJCV, 2014
work page 2014
- [21]
-
[22]
C. Farabet, C. Couprie, L. Najman, and Y . LeCun. Learning hierarchical features for scene labeling. PAMI, 2013
work page 2013
- [23]
- [24]
-
[25]
G. Ghiasi and C. C. Fowlkes. Laplacian reconstruction and refinement for semantic segmentation. arXiv:1605.02264, 2016
- [26]
- [27]
-
[28]
K. Grauman and T. Darrell. The pyramid match kernel: Dis- criminative classification with sets of image features. InICCV, 2005
work page 2005
-
[29]
B. Hariharan, P. Arbel´aez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011
work page 2011
-
[30]
B. Hariharan, P. Arbel´aez, R. Girshick, and J. Malik. Hyper- columns for object segmentation and fine-grained localization. In CVPR, 2015
work page 2015
-
[31]
K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014
work page 2014
-
[32]
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[33]
X. He, R. S. Zemel, and M. Carreira-Perpindn. Multiscale conditional random fields for image labeling. In CVPR, 2004
work page 2004
- [34]
-
[35]
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997
work page 1997
-
[36]
M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time- Frequency Methods and Phase Space, pages 289–297. 1989
work page 1989
- [37]
-
[38]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015
work page internal anchor Pith review arXiv 2015
-
[39]
M. A. Islam, M. Rochan, N. D. Bruce, and Y . Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017
work page 2017
-
[40]
S. D. Jain, B. Xiong, and K. Grauman. Fusionseg: Learn- ing to combine motion and appearance for fully automatic segmention of generic objects in videos. In CVPR, 2017
work page 2017
-
[41]
V . Jampani, M. Kiefel, and P. V . Gehler. Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In CVPR, 2016
work page 2016
-
[42]
X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y . Chen, J. Dong, L. Liu, Z. Jie, J. Feng, and S. Yan. Video scene parsing with predictive feature learning. In ICCV, 2017
work page 2017
- [43]
-
[44]
S. Kong and C. Fowlkes. Recurrent scene parsing with per- spective understanding in the loop. arXiv:1705.07238, 2017
-
[45]
P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011
- [46]
-
[47]
A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012
-
[48]
L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical crfs for object class image segmentation. In ICCV, 2009
-
[49]
S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006
- [50]
- [51]
- [52]
- [53]
-
[54]
G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. arXiv:1611.06612, 2016
- [55]
-
[56]
T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv:1612.03144, 2016
-
[57]
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014
-
[58]
W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv:1506.04579, 2015
-
[59]
Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015
-
[60]
J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015
-
[61]
P. Luo, G. Wang, L. Lin, and X. Wang. Deep dual learning for semantic image segmentation. In ICCV, 2017
-
[62]
M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015
-
[63]
R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014
-
[64]
H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015
-
[65]
G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. In ICCV, 2015
-
[66]
G. Papandreou, I. Kokkinos, and P.-A. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015
-
[67]
G. Papandreou and P. Maragos. Multigrid geometric active contour models. TIP, 16(1):229–240, 2007
-
[68]
C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters–improve semantic segmentation by global convolutional network. arXiv:1703.02719, 2017
-
[69]
P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014
- [70]
-
[71]
O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015
-
[72]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015
-
[73]
A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351, 2015
-
[74]
P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013
-
[75]
F. Shen, R. Gan, S. Yan, and G. Zeng. Semantic segmentation via structured patch prediction, context crf and guidance crf. In CVPR, 2017
-
[76]
J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 2009
-
[77]
A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016
-
[78]
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015
-
[79]
C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017
- [80]