Rethinking Atrous Convolution for Semantic Image Segmentation

Florian Schroff; George Papandreou; Hartwig Adam; Liang-Chieh Chen

arxiv: 1706.05587 · v3 · submitted 2017-06-17 · 💻 cs.CV

Pith reviewed 2026-05-12 00:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords: semantic image segmentation, atrous convolution, multi-scale context, DeepLabv3, ASPP, global context, PASCAL VOC 2012, convolutional networks

The pith

Atrous convolutions in cascaded or parallel modules plus global context enable accurate semantic segmentation without DenseCRF.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper revisits atrous convolution to let deep networks explicitly control their field of view and feature resolution for semantic image segmentation. It introduces modules that apply atrous convolution at multiple rates either in sequence or side-by-side to gather context across object scales. The authors further enlarge their earlier Atrous Spatial Pyramid Pooling module by adding image-level features that supply global scene context. On the PASCAL VOC 2012 benchmark the resulting DeepLabv3 system raises accuracy over earlier DeepLab versions that still needed DenseCRF post-processing and reaches performance comparable to other leading models. Readers care because pixel-accurate labeling is essential for scene understanding yet the new design removes a separate, computationally heavy post-processing stage.
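As a rough illustration of what "controlling the field of view" means, the following sketch implements a one-dimensional atrous convolution in plain Python. The function names, the toy signal, and the toy kernel are assumptions made for this example and are not taken from the paper; the point is only that the same three weights cover a wider span of the input as the rate grows.

```python
def atrous_conv1d(signal, kernel, rate):
    """Valid-mode 1-D atrous (dilated) convolution, written as cross-correlation.

    Output position i sums signal[i + rate * k] * kernel[k] over the taps k.
    """
    span = (len(kernel) - 1) * rate  # distance between first and last tap
    return [
        sum(signal[i + rate * k] * w for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def effective_kernel_size(kernel_size, rate):
    """Field of view covered by a kernel whose taps are spaced `rate` apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


signal = list(range(10))  # hypothetical toy input: 0, 1, ..., 9
kernel = [1, 1, 1]        # hypothetical 3-tap filter that sums its inputs

dense = atrous_conv1d(signal, kernel, rate=1)    # ordinary convolution
dilated = atrous_conv1d(signal, kernel, rate=2)  # same weights, wider view

print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))
```

With rate 1 the filter reads three adjacent samples; with rate 2 it reads every other sample and spans five positions without adding any weights.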

Core claim

By employing atrous convolution in cascaded or parallel arrangements with several rates and by augmenting the Atrous Spatial Pyramid Pooling module with image-level global-context features, the DeepLabv3 architecture captures multi-scale information directly inside the network, yielding significant accuracy gains over prior DeepLab models on PASCAL VOC 2012 without requiring DenseCRF post-processing and matching the performance of contemporary state-of-the-art segmentation systems.

What carries the argument

Atrous Spatial Pyramid Pooling module augmented with image-level global features, together with cascaded or parallel atrous convolution blocks that apply multiple dilation rates.
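The structure of that module can be sketched in one dimension: several atrous branches run in parallel over the same input, an image-level average is broadcast to every position, and the results are concatenated per position. This is a toy under stated assumptions, not the authors' implementation: the names `atrous_conv1d_same` and `toy_aspp` are hypothetical, the kernel must have odd length, and the learned projections and normalization of a real network are omitted.

```python
def atrous_conv1d_same(signal, kernel, rate):
    """1-D atrous convolution, zero-padded so the output length equals the input length.

    Assumes an odd-length kernel so the padding is symmetric.
    """
    half = (len(kernel) // 2) * rate
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(padded[i + rate * k] * w for k, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def toy_aspp(signal, kernel, rates):
    """Return one feature vector per position: one value per rate, then the global mean."""
    branches = [atrous_conv1d_same(signal, kernel, r) for r in rates]
    global_mean = sum(signal) / len(signal)  # stand-in for the image-level feature
    # Broadcast the pooled value to every position and concatenate per position.
    return [
        [branch[i] for branch in branches] + [global_mean]
        for i in range(len(signal))
    ]


signal = [1.0, 2.0, 3.0, 4.0]  # hypothetical toy input
features = toy_aspp(signal, kernel=[1.0, 1.0, 1.0], rates=[1, 2])
print(features[0])
```

Each position ends up with local evidence at two scales plus one value summarizing the whole input, which is the role the image-level features play in the description above.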

If this is right

Multi-scale context can be extracted inside a single forward pass by probing features at several atrous rates in parallel or cascade.
Global image-level features can be fused with local convolutional features to improve scene layout understanding.
DenseCRF post-processing is no longer required to reach high segmentation accuracy on PASCAL VOC 2012.
The same modular atrous design can be inserted into other DCNN backbones to adapt their receptive fields without changing network depth.
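The last point, adapting the receptive field without changing depth, can be checked with simple arithmetic. The sketch below assumes stride-1 layers that all share one kernel size; the function name and the example rates are hypothetical and chosen only for illustration.

```python
def cascade_receptive_field(kernel_size, rates):
    """Receptive field of stacked stride-1 atrous convolutions with the given rates."""
    field = 1
    for rate in rates:
        # Each layer extends the field by (kernel_size - 1) * rate input positions.
        field += (kernel_size - 1) * rate
    return field


# Three 3-tap layers in both cases: same depth, different field of view.
print(cascade_receptive_field(3, [1, 1, 1]))
print(cascade_receptive_field(3, [2, 4, 8]))
```

Under these assumptions, three standard layers see 7 input positions, while the same three layers with rates 2, 4, and 8 see 29.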

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The removal of the DenseCRF stage reduces both inference latency and memory use, which could support deployment on resource-limited devices.
Because the modules operate at different scales inside the network, similar rate-scheduling ideas may transfer to other dense-prediction tasks such as depth estimation or instance segmentation.
Sharing the exact training recipe allows direct ablation studies that isolate the effect of each atrous configuration on new datasets.

Load-bearing premise

The observed accuracy improvements are produced by the new atrous modules and global-context features rather than by any unstated changes in training schedule, data augmentation, or hyper-parameter choices.

What would settle it

Re-train the previous DeepLabv2 model with exactly the same training schedule, augmentation, and hyper-parameters used for DeepLabv3; if the mean intersection-over-union gap on the PASCAL VOC 2012 validation set closes or reverses, the contribution of the atrous modules is not established.
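For reference, the metric in that test, mean intersection-over-union, can be computed from a confusion matrix as sketched below. The function name and the toy matrix are assumptions for this example; an actual benchmark evaluation would follow the benchmark's own protocol, including any ignored labels.

```python
def mean_iou(confusion):
    """Mean intersection-over-union from a square confusion matrix.

    confusion[t][p] counts pixels whose true class is t and predicted class is p.
    Classes absent from both ground truth and prediction are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        true_pos = confusion[c][c]
        actual = sum(confusion[c])                          # row sum
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        union = actual + predicted - true_pos
        if union > 0:
            ious.append(true_pos / union)
    return sum(ious) / len(ious)


# Hypothetical two-class example: rows are true classes, columns are predictions.
toy_confusion = [[3, 1],
                 [1, 5]]
print(mean_iou(toy_confusion))
```

Comparing two models on this number is only meaningful when both are evaluated on the same images with the same label set, which is the matched-conditions requirement stated above.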

Original abstract

In this work, we revisit atrous convolution, a powerful tool to explicitly adjust filter's field-of-view as well as control the resolution of feature responses computed by Deep Convolutional Neural Networks, in the application of semantic image segmentation. To handle the problem of segmenting objects at multiple scales, we design modules which employ atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple atrous rates. Furthermore, we propose to augment our previously proposed Atrous Spatial Pyramid Pooling module, which probes convolutional features at multiple scales, with image-level features encoding global context and further boost performance. We also elaborate on implementation details and share our experience on training our system. The proposed `DeepLabv3' system significantly improves over our previous DeepLab versions without DenseCRF post-processing and attains comparable performance with other state-of-art models on the PASCAL VOC 2012 semantic image segmentation benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DeepLabv3 refines atrous convolutions with cascade and parallel modules plus global image features to capture multi-scale context more effectively in semantic segmentation.

DeepLabv3 is mainly about rethinking how to use atrous convolutions for handling different object scales in semantic image segmentation. The authors add cascade or parallel atrous modules and an image-level global feature to their ASPP setup, which lets the model get multi-scale context more effectively and without needing DenseCRF post-processing.

What is new is the explicit cascade and parallel configurations for atrous rates, along with the global context augmentation. These are not trivial additions and represent a step forward from their prior versions. The paper does well by including details on implementation and training, which is helpful for anyone wanting to replicate or extend the work. The benchmark results on PASCAL VOC 2012 are presented as comparable to other top models, which is a practical outcome.

The soft spots are not major. The stress-test concern about attributing gains to the modules versus training variables is reasonable in principle for empirical papers. However, since the authors share their training experience and the work is from the same team, the comparisons are likely fair. If the full paper has ablation tables showing the contribution of each part, that would make the central claim solid. Without those, the evidence would be weaker, but I expect they are included.

This paper is for readers in computer vision who work on semantic segmentation or related dense prediction tasks. People looking for architectural ideas to improve context modeling will get value from it. It deserves a serious referee because it offers concrete, reproducible extensions to an established framework with reported results on a standard benchmark. I recommend that a serious editor send this to peer review.

Referee Report

2 major / 1 minor

Summary. The paper revisits atrous convolution for semantic image segmentation, proposing modules that apply atrous convolutions either in cascade or in parallel to capture multi-scale context, and augments the Atrous Spatial Pyramid Pooling (ASPP) module with image-level features to encode global context. It further details training practices and claims that the resulting DeepLabv3 system significantly improves over prior DeepLab versions (without DenseCRF) while attaining performance comparable to other state-of-the-art models on the PASCAL VOC 2012 benchmark.

Significance. If the reported gains are shown to stem from the architectural innovations rather than uncontrolled training variables, the work offers a practical and incremental advance in multi-scale context modeling for dense prediction tasks, building directly on prior ASPP designs with clear implementation guidance that could influence follow-on architectures.

major comments (2)

[Experiments and abstract] Experiments section (and comparisons to prior DeepLab versions): the central claim that performance gains arise from the cascade/parallel atrous modules and image-level feature augmentation requires explicit confirmation that training schedules, data augmentation, and hyper-parameters are identical to those used in the authors' previous DeepLab papers; the abstract's reference to sharing 'implementation details and experience on training our system' does not substitute for matched baselines, leaving open the possibility that gains are confounded by optimization differences.
[§3.2–3.3] §3.2–3.3 (atrous modules and ASPP augmentation): the manuscript should provide controlled ablations that isolate the incremental benefit of the new cascade/parallel atrous rates and the image-level feature branch while holding all training variables fixed, as the current presentation does not fully rule out that observed mIoU lifts are driven by the global context addition alone or by unstated hyper-parameter changes.

minor comments (1)

[Abstract] Abstract: the claims of 'significant improvement' and 'comparable performance' would be more informative if accompanied by the specific mIoU numbers and table references that appear later in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will update the paper to provide the requested clarifications and additional controlled experiments.

Referee: [Experiments and abstract] Experiments section (and comparisons to prior DeepLab versions): the central claim that performance gains arise from the cascade/parallel atrous modules and image-level feature augmentation requires explicit confirmation that training schedules, data augmentation, and hyper-parameters are identical to those used in the authors' previous DeepLab papers; the abstract's reference to sharing 'implementation details and experience on training our system' does not substitute for matched baselines, leaving open the possibility that gains are confounded by optimization differences.

Authors: We agree that explicit confirmation of matched training variables is necessary to attribute gains to the architectural changes. In the revised manuscript we will add a dedicated paragraph and table in the Experiments section that directly compares the training schedule (poly learning-rate policy, iteration count, crop size, batch size), data augmentation (random scaling, horizontal flipping, color jitter), and hyper-parameters used in DeepLabv3 to those reported in our prior DeepLabv2 work, noting only the architecture-specific modifications. This will make clear that the core optimization settings remain identical.
revision: yes

Referee: [§3.2–3.3] §3.2–3.3 (atrous modules and ASPP augmentation): the manuscript should provide controlled ablations that isolate the incremental benefit of the new cascade/parallel atrous rates and the image-level feature branch while holding all training variables fixed, as the current presentation does not fully rule out that observed mIoU lifts are driven by the global context addition alone or by unstated hyper-parameter changes.

Authors: We appreciate the request for stricter isolation of each component. While the current manuscript already reports ablation results for different atrous rates in cascade and parallel settings as well as the addition of the image-level feature branch, we acknowledge that these experiments could be presented more explicitly as incremental, fixed-training-protocol studies. In the revision we will insert a new table that starts from a common baseline and successively adds the cascade module, the parallel module, and the image-level features, reporting mIoU on the PASCAL VOC 2012 validation set under identical training settings for each step.
revision: yes
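
The "poly" learning-rate policy named in the first response is commonly written as the base rate multiplied by (1 - step / max_steps) raised to a power. The sketch below assumes that form; the default power of 0.9 and the example base rate and step counts are assumptions for illustration, since this page does not state the values used.

```python
def poly_learning_rate(base_lr, step, max_steps, power=0.9):
    """Polynomial ("poly") decay: base_lr * (1 - step / max_steps) ** power.

    The default power is an assumed, commonly used value, not one given here.
    """
    return base_lr * (1.0 - step / max_steps) ** power


# Hypothetical schedule: base rate 0.01 decayed over 100 steps.
schedule = [poly_learning_rate(0.01, s, 100) for s in (0, 25, 50, 75, 100)]
print(schedule)
```

Because the rate depends on both the base value and the total step count, two runs only share a schedule if both are matched, which is why the referee asks for these settings to be tabulated.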

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with independent results

The paper proposes architectural changes (cascaded/parallel atrous convolutions and augmented ASPP) and reports empirical mIoU improvements on PASCAL VOC 2012. There is no derivation chain, no equations, and no 'predictions' that reduce by construction to fitted parameters or self-defined inputs. Benchmark numbers are externally falsifiable and not derived from the authors' prior fits. Self-references to earlier DeepLab versions are normal citations of prior empirical work and do not load-bear any mathematical reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard convolutional network assumptions plus the empirical hypothesis that multi-rate atrous modules plus global pooling capture scale variation better than prior designs. No new physical entities or mathematical axioms are introduced.

free parameters (2)

atrous rates
Multiple dilation rates are chosen for the cascade and parallel modules; exact values are not stated in the abstract but are free parameters tuned for the task.
training hyper-parameters
Learning rate schedule, batch size, and data augmentation choices are not detailed in the abstract yet directly affect the reported benchmark numbers.

axioms (1)

domain assumption: Deep convolutional networks can be trained end-to-end on labeled segmentation data to produce per-pixel predictions.
Invoked implicitly when the authors state that the proposed modules improve segmentation performance.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis
eess.IV 2025-07 unverdicted novelty 8.0

Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.
Interference-Aware Multi-Task Unlearning
cs.AI 2026-05 unverdicted novelty 7.0

Introduces interference-aware multi-task unlearning with task-aware gradient projection and instance-level gradient orthogonalization, reducing interference scores by 30.3% and 52.9% on vision benchmarks.
CineMatte: Background Matting for Virtual Production and Beyond
cs.CV 2026-05 unverdicted novelty 7.0

CineMatte uses a cross-attention design on a Siamese DINOv3 ViT plus a pretrained upsampler to produce robust mattes for virtual production, backed by a new non-synthetic 4K VP dataset that supports camera motion.
Functionalization via Structure Completion and Motion Rectification
cs.CV 2026-05 unverdicted novelty 7.0

Object functionalization is cast as neural graph completion over a functional graph of parts, contacts, and motions, followed by geometry realization that also rectifies erroneous motions, demonstrated on furniture wi...
Noise2Map: End-to-End Diffusion Model for Semantic Segmentation and Change Detection
cs.CV 2026-04 unverdicted novelty 7.0

Noise2Map repurposes diffusion model denoising into a direct predictor for semantic segmentation and change detection tasks in remote sensing, achieving top average ranks on benchmark datasets.
VitaminP: cross-modal learning enables whole-cell segmentation from routine histology
cs.CV 2026-04 unverdicted novelty 7.0

VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
cs.CR 2026-04 unverdicted novelty 7.0

Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
Towards Realistic Open-Vocabulary Remote Sensing Segmentation: Benchmark and Baseline
cs.CV 2026-04 unverdicted novelty 7.0

OVRSISBenchV2 is a realistic benchmark expanding scene and category coverage for open-vocabulary remote sensing segmentation, with Pi-Seg baseline showing strong transfer via positive-incentive noise perturbations.
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

VGGT-Segmentor achieves new state-of-the-art cross-view segmentation on Ego-Exo4D with 67.7% and 68.0% average IoU using a geometry-enhanced model and correspondence-free pretraining that beats most supervised baselines.
Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework
cs.CV 2026-04 conditional novelty 7.0

A hierarchical prompt tree with self-reflection graph propagation enables positive forward and backward knowledge transfer in incremental surgical instrument segmentation, improving over baselines by more than 5% and ...
Contour Refinement using Discrete Diffusion in Low Data Regime
cs.CV 2026-02 unverdicted novelty 7.0

A CNN-based discrete diffusion method refines sparse contours from segmentation masks using simplified denoising steps and minimal post-processing, outperforming baselines on small medical and environmental datasets w...
CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition
cs.CV 2025-05 unverdicted novelty 7.0

CONSIGN applies conformal prediction to segmentation by incorporating spatial structure through decomposition, producing tighter and more interpretable uncertainty estimates with error guarantees.
EmbodiTTA: Resource-Efficient Test-Time Adaptation for Embodied Visual Systems
cs.LG 2025-05 unverdicted novelty 7.0

OD-TTA enables resource-efficient test-time adaptation on edge devices by triggering updates only on detected domain shifts, achieving comparable accuracy with lower energy and computation costs for embodied visual systems.
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
cs.CV 2025-03 unverdicted novelty 7.0

Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...
Adaptive Camera Sensor for Vision Models
cs.CV 2025-03 unverdicted novelty 7.0

Lens adapts camera sensors in real time via the VisiT confidence-based quality indicator to improve vision model accuracy on domain-shifted images, shown on ImageNet-ES and a new diverse benchmark.
FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
cs.AR 2024-05 unverdicted novelty 7.0

FEATHER integrates data reordering into its reduction network via a new spatial array (Nest) and multi-stage network (BIRRD) to enable low-overhead dataflow switching in ML accelerators, delivering 1.27-2.89x latency ...
R$^{2}$Net: 2D Deep Residual Learning with Height Embedding for 3D Radio Map Estimation
eess.SP 2026-05 unverdicted novelty 6.0

R²Net applies 2D deep residual learning with height embedding to estimate 3D radio maps, offering separate indoor and outdoor variants plus a new 3D indoor dataset.
SENSE: Satellite-based ENergy Synthesis for Sustainable Environment
cs.CV 2026-05 unverdicted novelty 6.0

SENSE is a controllable diffusion model that jointly generates realistic urban satellite imagery and aligned building energy consumption and height maps from road networks and density inputs, improving downstream task...
SEED: Targeted Data Selection by Weighted Independent Set
cs.LG 2026-05 unverdicted novelty 6.0

SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods o...
AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection
cs.CV 2026-05 unverdicted novelty 6.0

AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.
Nano-U: Efficient Terrain Segmentation for Tiny Robot Navigation
cs.RO 2026-05 unverdicted novelty 6.0

A compact network called Nano-U trained with quantization-aware distillation enables accurate binary terrain segmentation and runs efficiently on ESP32-S3 microcontrollers for tiny robots.
UnGAP: Uncertainty-Guided Affine Prompting for Real-Time Crack Segmentation
cs.CV 2026-05 unverdicted novelty 6.0

UnGAP turns aleatoric uncertainty into an active calibration signal via affine feature modulation to fix gradient suppression in heteroscedastic crack segmentation while maintaining real-time performance.
Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy
cs.CV 2026-05 unverdicted novelty 6.0

RGSUD achieves SOTA unsupervised deraining by using IQA-based reward recycling and self-reinforcement to constrain optimization and improve pseudo-paired data quality.
DOT-Sim: Differentiable Optical Tactile Simulation with Precise Real-to-Sim Physical Calibration
cs.RO 2026-04 unverdicted novelty 6.0

DOT-Sim uses MPM physics plus learned residual optics to simulate deformable tactile sensors, supporting zero-shot sim-to-real transfer for classification and control tasks.
Diffusion Model as a Generalist Segmentation Learner
cs.CV 2026-04 unverdicted novelty 6.0

DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment
cs.CV 2026-04 unverdicted novelty 6.0

FryNet combines RGB and thermal imaging with adversarial regularization to segment oil areas, classify usability, and predict oxidation levels like PV and Totox with high accuracy on video data.
Lorentz Framework for Semantic Segmentation
cs.CV 2026-04 unverdicted novelty 6.0

A Lorentz-model hyperbolic framework for semantic segmentation that integrates with Euclidean networks, provides free uncertainty maps, and is validated on ADE20K, COCO-Stuff, Pascal-VOC and Cityscapes using DeepLabV3...
From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation
cs.CV 2026-04 unverdicted novelty 6.0

Petro-SAM adapts SAM via a Merge Block for polarized views plus multi-scale fusion and color-entropy priors to jointly achieve grain-edge and lithology segmentation in petrographic images.
AIBuildAI: An AI Agent for Automatically Building AI Models
cs.AI 2026-04 unverdicted novelty 6.0

AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
cs.CV 2026-04 unverdicted novelty 6.0

VGGT-Segmentor achieves new SOTA cross-view segmentation on Ego-Exo4D (67.7% Ego-to-Exo, 68.0% Exo-to-Ego IoU) via geometry-enhanced features, a three-stage segmentation head, and correspondence-free pretraining.
GTPBD-MM: A Global Terraced Parcel and Boundary Dataset with Multi-Modality
cs.CV 2026-04 unverdicted novelty 6.0

GTPBD-MM is the first multimodal benchmark for global terraced parcel extraction, integrating image, text, and DEM data with experiments showing that textual and terrain cues improve delineation accuracy over image-on...
Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization
cs.CV 2026-04 unverdicted novelty 6.0

A large pool of diverse artistic styles for style-transfer augmentation improves domain generalization in driving vision models more than repeated use of few styles or domain-matched styles, yielding the lightweight S...
CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation
cs.CV 2026-04 unverdicted novelty 6.0

CrossWeaver introduces MIB and SAF modules to enable flexible, reliability-aware cross-modal interaction and fusion, achieving SOTA multimodal semantic segmentation with minimal parameters and generalization to unseen...
Flux4D: Flow-based Unsupervised 4D Reconstruction
cs.CV 2025-12 unverdicted novelty 6.0

Flux4D reconstructs large-scale dynamic 4D scenes unsupervised by predicting moving 3D Gaussians from photometric losses and static regularization when trained across multiple scenes.
Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation
cs.CV 2024-11 unverdicted novelty 6.0

Learn2Synth optimizes data synthesis parameters with hypergradients to train segmentation networks solely on synthetic brain images that generalize to real scans.
Causal Unsupervised Semantic Segmentation
cs.CV 2023-10 unverdicted novelty 6.0

CAUSE uses frontdoor adjustment with a discretized concept clusterbook mediator to perform unsupervised semantic segmentation and reports state-of-the-art results.
Language-driven Semantic Segmentation
cs.CV 2022-01 unverdicted novelty 6.0

LSeg achieves competitive zero-shot semantic segmentation by contrastively aligning dense pixel embeddings from a transformer with text embeddings of class labels.
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
cs.CV 2021-10 unverdicted novelty 6.0

MobileViT is a lightweight vision transformer that reports 78.4% top-1 accuracy on ImageNet-1k with ~6M parameters, outperforming MobileNetv3 by 3.2% and DeIT by 6.2% at similar size, plus gains on MS-COCO detection.
Segmenting Objects in Day and Night: Edge-Conditioned CNN for Thermal Image Semantic Segmentation
cs.CV 2019-07 unverdicted novelty 6.0

EC-CNN uses a gated feature-wise transform to incorporate edge priors for thermal semantic segmentation and introduces the SODA dataset of over 7,000 labeled thermal images.
Gated-SCNN: Gated Shape CNNs for Semantic Segmentation
cs.CV 2019-07 unverdicted novelty 6.0

Gated-SCNN adds a gated shape stream to standard CNNs for semantic segmentation, achieving improved boundary quality and SOTA results on Cityscapes.
Deep Saliency Models : The Quest For The Loss Function
cs.CV 2019-07 conditional novelty 6.0

Varying and combining loss functions in deep visual saliency prediction models produces significant performance gains on fixed architectures that hold across datasets and networks.
WoundFormer: Multi-Scale Spatial Feature Fusion for Multi-Class Wound Tissue Segmentation
cs.CV 2026-05 unverdicted novelty 5.0

WoundFormer modifies SegFormer with a spatially-preserving multi-scale aggregation head for multi-class wound tissue segmentation, reporting 81.9% Dice on the WoundTissueSeg dataset with gains over baselines.
Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging
cs.CV 2026-05 unverdicted novelty 5.0

A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classificati...
Breaking the Resource Wall: Geometry-Guided Sequence Modeling for Efficient Semantic Segmentation
cs.CV 2026-04 unverdicted novelty 5.0

DGM-Net reaches 82.3% mIoU on Cityscapes and 45.24% on ADE20K using directional geometric guidance inside a linear-complexity Mamba backbone, without heavy pretraining or large models.
AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos
cs.CV 2026-04 unverdicted novelty 5.0

AutoAWG generates controllable adverse weather automotive videos via semantics-guided adaptive multi-control fusion and vanishing-point-anchored temporal synthesis from static images, reducing FID by 50% and FVD by 16...
DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation
cs.CV 2026-04 unverdicted novelty 5.0

DeltaSeg, a tiered-attention U-Net variant with a novel Deep Delta Attention module, outperforms 12 prior models on two multi-class structural defect segmentation benchmarks.
WILD-SAM: Phase-Aware Expert Adaptation of SAM for Landslide Detection in Wrapped InSAR Interferograms
cs.CV 2026-04 unverdicted novelty 5.0

WILD-SAM is a fine-tuned SAM variant using phase-aware MoE adapters and wavelet subband enhancement that achieves state-of-the-art landslide detection on wrapped InSAR data.
HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation
cs.CV 2026-04 unverdicted novelty 5.0

HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.
SCASeg: Strip Cross-Attention for Efficient Semantic Segmentation
cs.CV 2024-11 unverdicted novelty 5.0

SCASeg proposes a strip cross-attention decoder with lateral connections and a cross-layer block to efficiently capture global-local context, reporting competitive or superior results on ADE20K, Cityscapes, COCO-Stuff...
Cross Attention Network for Semantic Segmentation
cs.CV 2019-07 unverdicted novelty 5.0

Cross Attention Network fuses spatial and contextual features via a cross attention module to improve semantic segmentation performance and speed on Cityscapes and CamVid.
An Efficient 3D CNN for Action/Object Segmentation in Video
cs.CV 2019-07 unverdicted novelty 5.0

End-to-end 3D CNN with separable convolutions for efficient simultaneous spatial-temporal video object and action segmentation.
Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox
cs.CV 2026-05 unverdicted novelty 4.0

Stabilized SegFormer-B5 reaches 0.4572 mIoU SOTA on original Apple DMS split; 80/10/10 split reaches 0.5276 mIoU but degrades real-world OOD performance per qualitative review.
SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation
cs.CV 2026-05 unverdicted novelty 4.0

SpineContextResUNet achieves Dice scores of 88.17% on VerSe2020 and 88.13% on CTSpine1K while using ~1.7M parameters and running inference on commodity hardware with 8GB RAM.
FoR-Net: Learning to Focus on Hard Regions for Efficient Semantic Segmentation
cs.CV 2026-05 unverdicted novelty 4.0

FoR-Net improves efficiency in semantic segmentation by focusing on hard regions with a learned selector and multi-scale convolutions, achieving competitive results on Cityscapes.
An End-to-End Decision-Aware Multi-Scale Attention-Based Model for Explainable Autonomous Driving
cs.CV 2026-04 unverdicted novelty 4.0

A decision-aware multi-scale attention network generates tailored explanations for autonomous driving choices and outperforms prior models on F1 and a new Joint F1 metric across two datasets.
A Benchmark Study of Segmentation Models and Adaptation Strategies for Landslide Detection from Satellite Imagery
cs.CV 2026-04 unverdicted novelty 4.0

Transformer-based models deliver strong landslide segmentation on satellite images, and parameter-efficient fine-tuning matches full fine-tuning accuracy while cutting trainable parameters by up to 95%.
UA-Net: Uncertainty-Aware Network for TRISO Image Semantic Segmentation
cs.CV 2026-04 unverdicted novelty 4.0

UA-Net segments TRISO fuel micrographs into five regions with 95.5% mIoU and 97.3% mP on 102 test images, while its meta-model detects misclassifications at 91.8% specificity and 93.5% sensitivity.
EDFNet: Early Fusion of Edge and Depth for Thin-Obstacle Segmentation in UAV Navigation
cs.CV 2026-04 unverdicted novelty 4.0

Early RGB-Depth-Edge fusion in EDFNet provides a competitive baseline for thin-obstacle segmentation on the DDOS dataset, with the best pretrained U-Net model reaching 0.244 Thin-Structure Evaluation Score.
Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets
cs.CV 2025-09 unverdicted novelty 4.0

Benchmarking ten segmentation models on a nine-image histology dataset and a 153-image generalization set reveals unstable rankings, overlapping confidence intervals, and dataset-specific performance hierarchies, advo...
Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes
cs.CV 2024-08 unverdicted novelty 4.0

GSAM applies random cropping to enable variable input sizes for efficient SAM fine-tuning, claiming lower compute with comparable or higher accuracy on varied datasets.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · cited by 70 Pith papers · 3 internal anchors

[1]

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

M. Abadi, A. Agarwal, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016

work page Pith review arXiv 2016
[2]

A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. In Eurographics, 2010

Method                   Coarse  mIOU
DeepLabv2-CRF [11]               70.4
Deep Layer Cascade [52]          71.1
ML-CRNN [21]                     71.2
Adelaide context [55]            71.6
FRRN [70]                        71.8
LRR-4x [25]              ✓       71.8
RefineNet [54]                   73.6
FoveaNet [51]                    74.1
Ladder DenseNet [46]             74.3
PEARL [42]                       75.4
Glo...

work page 2010
[3]

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015

work page Pith review arXiv 2015
[4]

A. Brandt. Multi-level adaptive solutions to boundary-value problems. Mathematics of computation, 31(138):333–390, 1977

work page 1977
[5]

W. L. Briggs, V. E. Henson, and S. F. McCormick. A multigrid tutorial. SIAM, 2000

work page 2000
[6]

W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. Scene labeling with lstm recurrent neural networks. In CVPR, 2015

work page 2015
[7]

H. Caesar, J. Uijlings, and V. Ferrari. COCO-Stuff: Thing and stuff classes in context. arXiv:1612.03716, 2016

work page arXiv 2016
[8]

S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. arXiv:1603.08358, 2016

work page arXiv 2016
[9]

L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In CVPR, 2016

work page 2016
[10]

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015

work page 2015
[11]

L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915, 2016

work page Pith review arXiv 2016
[12]

L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016

work page 2016
[13]

F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv:1610.02357, 2016

Figure 8. Visualization results on Cityscapes val set when training with only train fine set.

work page Pith review arXiv 2016
[14]

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016

work page 2016
[15]

J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. arXiv:1412.1283, 2014

work page arXiv 2014
[16]

J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015

work page 2015
[17]

J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. arXiv:1605.06409, 2016

work page arXiv 2016
[18]

J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv:1703.06211, 2017

work page Pith review arXiv 2017
[19]

D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. arXiv:1411.4734, 2014

work page arXiv 2014
[20]

M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 2014

work page 2014
[21]

H. Fan, X. Mei, D. Prokhorov, and H. Ling. Multi-level contextual rnns with attention model for scene labeling. arXiv:1607.02537, 2016

work page arXiv 2016
[22]

C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013

work page 2013
[23]

J. Fu, J. Liu, Y. Wang, and H. Lu. Stacked deconvolutional network for semantic segmentation. arXiv:1708.04943, 2017

work page arXiv 2017
[24]

R. Gadde, V. Jampani, and P. V. Gehler. Semantic video cnns through representation warping. In ICCV, 2017

work page 2017
[25]

G. Ghiasi and C. C. Fowlkes. Laplacian reconstruction and refinement for semantic segmentation. arXiv:1605.02264, 2016

work page arXiv 2016
[26]

A. Giusti, D. Ciresan, J. Masci, L. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, 2013

work page 2013
[27]

S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV. IEEE, 2009

work page 2009
[28]

K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, 2005

work page 2005
[29]

B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011

work page 2011
[30]

B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015

work page 2015
[31]

K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014

work page 2014
[32]

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[33]

X. He, R. S. Zemel, and M. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004

work page 2004
[34]

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS, 2014

work page 2014
[35]

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

work page 1997
[36]

M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, pages 289–297. 1989

work page 1989
[37]

J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017

work page 2017
[38]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015

work page internal anchor Pith review arXiv 2015
[39]

M. A. Islam, M. Rochan, N. D. Bruce, and Y. Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017

work page 2017
[40]

S. D. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. In CVPR, 2017

work page 2017
[41]

V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense crfs and bilateral neural networks. In CVPR, 2016

work page 2016
[42]

X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong, L. Liu, Z. Jie, J. Feng, and S. Yan. Video scene parsing with predictive feature learning. In ICCV, 2017

work page 2017
[43]

P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009

work page 2009
[44]

S. Kong and C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. arXiv:1705.07238, 2017

work page arXiv 2017
[45]

P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011

work page 2011
[46]

I. Krešo, S. Šegvić, and J. Krapac. Ladder-style densenets for semantic segmentation of large natural images. In ICCV CVRSUAD workshop, 2017

work page 2017
[47]

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012

work page 2012
[48]

L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical crfs for object class image segmentation. In ICCV, 2009

work page 2009
[49]

S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006

work page 2006
[50]

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989

work page 1989
[51]

X. Li, Z. Jie, W. Wang, C. Liu, J. Yang, X. Shen, Z. Lin, Q. Chen, S. Yan, and J. Feng. Foveanet: Perspective-aware urban scene parsing. arXiv:1708.02421, 2017

work page arXiv 2017
[52]

X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. arXiv:1704.01344, 2017

work page arXiv 2017
[53]

X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan. Semantic object parsing with local-global long short-term memory. arXiv:1511.04510, 2015

work page arXiv 2015
[54]

G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. arXiv:1611.06612, 2016

work page Pith review arXiv 2016
[55]

G. Lin, C. Shen, I. Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv:1504.01013, 2015

work page arXiv 2015
[56]

T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv:1612.03144, 2016

work page Pith review arXiv 2016
[57]

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014

work page 2014
[58]

W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv:1506.04579, 2015

work page Pith review arXiv 2015
[59]

Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015

work page 2015
[60]

J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015

work page 2015
[61]

P. Luo, G. Wang, L. Lin, and X. Wang. Deep dual learning for semantic image segmentation. In ICCV, 2017

work page 2017
[62]

M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015

work page 2015
[63]

R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014

work page 2014
[64]

H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015

work page 2015
[65]

G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. In ICCV, 2015

work page 2015
[66]

G. Papandreou, I. Kokkinos, and P.-A. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015

work page 2015
[67]

G. Papandreou and P. Maragos. Multigrid geometric active contour models. TIP, 16(1):229–240, 2007

work page 2007
[68]

C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters–improve semantic segmentation by global convolutional network. arXiv:1703.02719, 2017

work page Pith review arXiv 2017
[69]

P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014

work page 2014
[70]

T. Pohlen, A. Hermans, M. Mathias, and B. Leibe. Full-resolution residual networks for semantic segmentation in street scenes. arXiv:1611.08323, 2016

work page arXiv 2016
[71]

O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015

work page 2015
[72]

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015

work page 2015
[73]

A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351, 2015

work page Pith review arXiv 2015
[74]

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013

work page Pith review arXiv 2013
[75]

F. Shen, R. Gan, S. Yan, and G. Zeng. Semantic segmentation via structured patch prediction, context crf and guidance crf. In CVPR, 2017

work page 2017
[76]

J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 2009

work page 2009
[77]

Beyond Skip Connections: Top-Down Modulation for Object Detection

A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016

work page Pith review arXiv 2016
[78]

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015

work page 2015
[79]

C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017

work page 2017
[80]

H. Sun, D. Xie, and S. Pu. Mixed context networks for semantic segmentation. arXiv:1610.05854, 2016

work page arXiv 2016

Showing first 80 references.
