Deformable DETR: Deformable Transformers for End-to-End Object Detection
Pith reviewed 2026-05-11 09:39 UTC · model grok-4.3
The pith
Deformable DETR improves object detection over standard DETR by restricting transformer attention to a small set of learned sampling points around each reference point, yielding better results, especially on small objects, while converging in roughly one-tenth the training epochs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Deformable DETR replaces the full attention modules of DETR with deformable attention modules that attend only to a small set of key sampling points around each reference point. On the COCO benchmark this yields better overall performance than DETR, with the largest gains on small objects, while converging after roughly one-tenth the training epochs.
What carries the argument
The deformable attention module, which learns a small fixed number of sampling offsets around each reference location and computes attention weights and aggregated features only at those points.
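As a hedged sketch of this module, the single-scale form is commonly written as below. The symbols are notation assumed here for illustration and are not defined on this page: z_q is the query feature, p_q its reference point, x the feature map, M the number of heads, K the number of sampling points, \Delta p_{mqk} the learned offsets, A_{mqk} the attention weights, and W_m, W'_m learned projections.

```latex
\mathrm{DeformAttn}(z_q, p_q, x)
  = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m \, x(p_q + \Delta p_{mqk}) \Big],
\qquad \sum_{k=1}^{K} A_{mqk} = 1 .
```

Under this reading, each query touches only K locations per head, and because p_q + \Delta p_{mqk} is generally fractional, x is evaluated there by bilinear interpolation.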
Load-bearing premise
That a small learned set of sampling points around each reference location supplies enough spatial detail for accurate bounding-box regression and object classification.
What would settle it
Training both DETR and Deformable DETR for the same number of epochs on COCO and finding no faster convergence or no higher small-object average precision for the deformable version would falsify the central claim.
Original abstract
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we proposed Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Deformable DETR, which augments the DETR detector by replacing its standard multi-head self-attention with a deformable attention module. Each attention head samples a small fixed number of learned offset points (typically K=4) around reference points on multi-scale feature maps, using bilinear interpolation. This change is claimed to resolve DETR's slow convergence (reducing required epochs by a factor of 10) and limited spatial resolution, yielding higher COCO AP overall and especially on small objects (AP_s), while preserving the end-to-end set-prediction framework. Code is released.
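The sampling step described in the summary can be illustrated with a minimal NumPy sketch. It assumes a single head, a single feature map, and offsets and attention logits supplied as plain inputs (in the real model they are predicted from the query). The function names `bilinear_sample` and `deformable_attention_single_head` are hypothetical and do not come from the released code.

```python
import numpy as np


def bilinear_sample(feat, x, y):
    """Sample feat (H, W, C) at continuous pixel coords (x, y).

    Corners that fall outside the map contribute zero (an assumption
    made for this sketch).
    """
    H, W, C = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    out = np.zeros(C, dtype=float)
    # Visit the four surrounding grid cells with their bilinear weights.
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < W and 0 <= yi < H:
                out += wx * wy * feat[yi, xi]
    return out


def deformable_attention_single_head(feat, ref_xy, offsets, logits):
    """Aggregate K sampled features around one reference point.

    feat:    (H, W, C) value features
    ref_xy:  (2,) reference point (x, y) in pixel coordinates
    offsets: (K, 2) sampling offsets (dx, dy); inputs here, learned in practice
    logits:  (K,) unnormalized attention scores
    """
    # Softmax over the K sampling points only, not over all H*W locations.
    weights = np.exp(logits - np.max(logits))
    weights = weights / weights.sum()
    samples = np.stack(
        [bilinear_sample(feat, ref_xy[0] + dx, ref_xy[1] + dy) for dx, dy in offsets]
    )
    return weights @ samples  # shape (C,)


if __name__ == "__main__":
    feat = np.arange(16, dtype=float).reshape(4, 4, 1)
    offsets = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # K = 4
    print(deformable_attention_single_head(feat, np.array([1.0, 1.0]), offsets, np.zeros(4)))
```

The cost per query depends on K rather than on the feature-map size, which is the property the referee report and rebuttal argue about.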
Significance. If the performance claims hold under the reported controls, the result is significant: it renders transformer-based end-to-end detection practical by cutting training time dramatically while improving accuracy on small objects, a known weakness of DETR. The explicit code release and COCO benchmark numbers support reproducibility and allow direct comparison. The work also illustrates a clean transfer of deformable-convolution ideas into attention, which may generalize to other vision transformers.
major comments (2)
- [§3.2] §3.2, Eq. (2)–(3): the central claim that restricting attention to K learned sampling points recovers sufficient spatial detail for small-object localization rests on the offset predictor and bilinear sampling; however, when reference points fall on coarse feature-map cells (common for small objects), the limited K and lack of explicit multi-scale fusion within each head leave open whether boundary precision is preserved or whether the reported AP_s gains are driven primarily by the multi-scale backbone rather than the deformable mechanism.
- [§4.3] §4.3, Table 3 (ablation rows): the comparison isolating deformable attention from multi-scale features is incomplete; a baseline that applies standard attention to the same multi-scale pyramid is not reported, making it difficult to attribute the 10× epoch reduction and AP_s improvement specifically to the sparse sampling rather than the feature pyramid itself.
minor comments (3)
- [Abstract] Abstract: 'we proposed' should read 'we propose' for consistency with present-tense technical writing.
- [Figure 2] Figure 2: the visualization of sampling points would benefit from an overlay of ground-truth boxes on the same feature map to illustrate coverage for small objects.
- [§4.1] §4.1: the training schedule and optimizer settings for the 50-epoch Deformable DETR run should be stated explicitly in the main text (currently only in the supplement) to allow immediate replication.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation. We address the two major comments point-by-point below, offering clarifications and minor textual revisions where appropriate to strengthen the manuscript.
-
Referee: [§3.2] §3.2, Eq. (2)–(3): the central claim that restricting attention to K learned sampling points recovers sufficient spatial detail for small-object localization rests on the offset predictor and bilinear sampling; however, when reference points fall on coarse feature-map cells (common for small objects), the limited K and lack of explicit multi-scale fusion within each head leave open whether boundary precision is preserved or whether the reported AP_s gains are driven primarily by the multi-scale backbone rather than the deformable mechanism.
Authors: We appreciate the referee raising this point about small-object localization. In the Deformable DETR attention module, reference points are first projected onto each level of the multi-scale feature pyramid (typically levels with strides 8, 16, 32, and 64). The offset predictor then generates sampling locations independently at each scale, allowing the same query to attend to both coarse semantic features and fine-grained details from higher-resolution maps. Bilinear interpolation is applied at the sampled points to achieve sub-pixel accuracy. This constitutes explicit cross-scale fusion within each attention head. Ablation studies (Table 3) show that performance drops when the deformable sampling is replaced by fixed-grid attention even while retaining the same multi-scale backbone, indicating that the learned sparse sampling contributes to the AP_s gains beyond the pyramid alone. We will revise the description in §3.2 to explicitly state the per-scale projection and sampling procedure. revision: yes
-
Referee: [§4.3] §4.3, Table 3 (ablation rows): the comparison isolating deformable attention from multi-scale features is incomplete; a baseline that applies standard attention to the same multi-scale pyramid is not reported, making it difficult to attribute the 10× epoch reduction and AP_s improvement specifically to the sparse sampling rather than the feature pyramid itself.
Authors: We agree that a standard (dense) attention baseline on the identical multi-scale pyramid would be informative. However, applying vanilla multi-head attention directly to the concatenated multi-scale feature maps incurs quadratic complexity in the total number of spatial locations (often >10^5 pixels), leading to prohibitive memory usage and training time—the exact limitation that motivated the deformable design. For this reason we did not include such a baseline. Table 3 instead isolates the effect of the sparse sampling mechanism (varying K, removing multi-scale, etc.) while keeping the backbone fixed, and the full model is compared against the original DETR (single-scale). The observed 10× convergence speedup and AP_s lift are therefore attributable to the combination of sparsity and multi-scale access enabled by deformable attention. We will add a short paragraph in §4.3 explaining the computational rationale for omitting the dense multi-scale baseline. revision: partial
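The computational argument in this response can be made concrete with a small counting sketch. It assumes the four pyramid strides named in the rebuttal (8, 16, 32, 64), ceiling division for feature-map sizes, and K sampling points per level per head. The helper names and the example input size are hypothetical, and the counts are query-key pairs per head, not measured memory or runtime.

```python
def level_sizes(height, width, strides=(8, 16, 32, 64)):
    """Feature-map (H_l, W_l) per pyramid level, using ceiling division."""
    return [(-(-height // s), -(-width // s)) for s in strides]


def attention_pair_counts(height, width, num_points=4, strides=(8, 16, 32, 64)):
    """Return (tokens, dense pairs, deformable pairs) per attention head.

    Dense attention compares every location with every other location.
    Deformable attention samples num_points locations on each level per query.
    """
    sizes = level_sizes(height, width, strides)
    num_tokens = sum(h * w for h, w in sizes)
    dense_pairs = num_tokens * num_tokens
    sparse_pairs = num_tokens * len(strides) * num_points
    return num_tokens, dense_pairs, sparse_pairs


if __name__ == "__main__":
    # Hypothetical input resolution chosen only for illustration.
    tokens, dense, sparse = attention_pair_counts(800, 1216)
    print(f"tokens={tokens} dense={dense} sparse={sparse} ratio={dense / sparse:.1f}")
```

The dense count grows with the square of the number of locations while the deformable count grows linearly, so the gap widens as input resolution increases. This is consistent with the rationale given for omitting the dense multi-scale baseline, though it does not by itself settle the attribution question the referee raised.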
Circularity Check
No circularity: empirical gains measured on independent COCO benchmark
The paper defines a new deformable attention module (sampling K learned offsets around reference points) and trains the full model end-to-end on COCO. All performance numbers (AP, convergence epochs, small-object gains) are computed on the standard held-out COCO test split, which is external to the fitted weights and not derived from any internal equation or self-citation. No derivation reduces by construction to its inputs; the architecture is a direct modification of DETR with independent empirical validation.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of sampling points per attention head
axioms (1)
- domain assumption: Deformable attention can approximate the information captured by full attention for object-detection features
Forward citations
Cited by 47 Pith papers
-
Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning
UniTopo unifies lane detection and topology reasoning into a single perception model, outperforming prior methods on OpenLane-V2 benchmarks with TOP_ll scores of 30.1% and 31.8%.
-
InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery
InterMesh explicitly incorporates human-object interaction semantics into multi-person mesh recovery via a detector and two lightweight modules, delivering up to 9.9% MPJPE reduction on interaction-heavy datasets.
-
Towards Open World Sound Event Detection
Introduces OW-SED paradigm and WOOT transformer framework to detect known sounds, identify unseen events, and incrementally learn in open audio environments.
-
ReLeaf: Benchmarking Leaf Segmentation across Domains and Species
A YOLO26 model trained on four leaf segmentation datasets reaches 83.9% mean mAP50-95 on their test sets but only 40.2% on a new 23-species benchmark, revealing substantial cross-domain generalization gaps.
-
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
HeroCrystal uses single-image diffusion synthesis, probabilistic federated Faster R-CNN with contrastive debiasing, and inconsistent-category integration to reach 33.4% mAP in privacy-preserving multi-camera object detection.
-
Control Your Queries: Heterogeneous Query Interaction for Camera-Radar Fusion
ConFusion reaches 59.1 mAP and 65.6 NDS on nuScenes validation by combining heterogeneous queries with QMix cross-attention and QSwap feature exchange.
-
URoPE: Universal Relative Position Embedding across Geometric Spaces
URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...
-
Chatting about Upper-Body Expressive Human Pose and Shape Estimation
CoEvoer is a new cross-dependency transformer framework for upper-body expressive human pose and shape estimation that achieves state-of-the-art performance by enabling mutual enhancement between body parts.
-
Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.
-
SynthPID: P&ID digitization from Topology-Preserving Synthetic Data
Topology-preserving synthetic P&IDs generated by seeding from real drawings enable models trained solely on synthetics to achieve 63.8% edge mAP on real P&ID benchmarks, closing most of the gap to real-data training.
-
Online Reasoning Video Object Segmentation
The work introduces the ORVOS task, the ORVOSB benchmark with causal annotations across 210 videos, and a baseline using updated prompts plus a temporal token reservoir.
-
YUV20K: A Complexity-Driven Benchmark and Trajectory-Aware Alignment Model for Video Camouflaged Object Detection
YUV20K is a complexity-driven VCOD benchmark with 24k annotated frames, paired with a model using Motion Feature Stabilization via semantic primitives and Trajectory-Aware Alignment via deformable sampling that outper...
-
DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather
DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset ...
-
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
-
WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
WUTDet is a 100K-image ship detection dataset with benchmarks indicating Transformer models outperform CNN and Mamba architectures in accuracy and small-object detection for complex maritime environments.
-
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
-
Deep Probabilistic Unfolding for Quantized Compressive Sensing
A probabilistic unfolding network with stable likelihood projection and dual-domain Mamba achieves state-of-the-art reconstruction in quantized compressive sensing.
-
Curvature-Aware Captioning:Leveraging Geodesic Attention for 3D Scene Understanding
A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
-
A Novel Graph-Regulated Disentangling Mamba Model with Sparse Tokens for Enhanced Tree Species Classification from MODIS Time Series
A graph-regulated disentangling Mamba model with sparse tokens achieves 93.94% accuracy classifying tree species from MODIS time series in Alberta and outperforms twelve prior models.
-
Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern
Non-overlapping RGB-T adversarial patterns on clothing, optimized with spatial discrete-continuous optimization, achieve high attack success rates against multiple RGB-T detector fusion architectures in both digital a...
-
InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery
InterMesh improves multi-person human mesh recovery accuracy by explicitly enriching DETR-style queries with structured interaction semantics from a human-object detector.
-
FUN: A Focal U-Net Combining Reconstruction and Object Detection for Snapshot Spectral Imaging
FUN is an end-to-end Focal U-Net that performs joint hyperspectral image reconstruction and object detection via multi-task learning with focal modulation, achieving SOTA results with 40% fewer parameters and a new 36...
-
ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection
ViCrop-Det uses spatial attention entropy from the decoder to dynamically crop and refine small-object regions in transformer detectors during inference.
-
GateMOT: Q-Gated Attention for Dense Object Tracking
GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
-
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
-
Weakly-Supervised Referring Video Object Segmentation through Text Supervision
WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.
-
HiProto: Hierarchical Prototype Learning for Interpretable Object Detection Under Low-quality Conditions
HiProto uses hierarchical prototypes with RPC-Loss, PR-Loss, and SPLGS to deliver competitive, interpretable object detection on low-quality datasets like ExDark and RTTS.
-
Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
VLM-based harmonization of inconsistent annotations across two document layout corpora raises detection F-score from 0.860 to 0.883 and table TEDS from 0.750 to 0.814 while tightening embedding clusters.
-
Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection
Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.
-
Geometrical Cross-Attention and Nonvoid Voxelization for Efficient 3D Medical Image Segmentation
GCNV-Net achieves state-of-the-art accuracy on multiple 3D medical segmentation benchmarks while cutting FLOPs by 56% and inference latency by 68% through dynamic nonvoid voxelization and geometric attention.
-
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training
PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.
-
YOLOv12: Attention-Centric Real-Time Object Detectors
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
-
Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation
HeroCrystal achieves 33.4% mAP on cross-domain multi-camera object detection by combining one-shot diffusion-based synthetic data generation, probabilistic federated Faster R-CNN, and inconsistent-category distillatio...
-
Investigation of cardinality classification for bacterial colony counting using explainable artificial intelligence
XAI analysis identifies high visual similarity across colony cardinality classes as the primary limit on MicrobiaNet performance in bacterial colony counting, revising prior model assessments.
-
Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention
Dynamic Focal Attention learns class-specific difficulty via per-class biases in attention logits, improving Dice and IoU on imbalanced histopathology segmentation benchmarks.
-
Hypergraph-State Collaborative Reasoning for Multi-Object Tracking
HyperSSM integrates hypergraphs and state space models to let correlated objects mutually refine motion estimates, stabilizing trajectories under noise and occlusion for state-of-the-art multi-object tracking.
-
Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization
ST-GD adapts Grounding DINO with about 10 million trainable parameters via adapters and a temporal decoder to achieve competitive performance on limited-data spatio-temporal video grounding benchmarks.
-
MapATM: Enhancing HD Map Construction through Actor Trajectory Modeling
MapATM improves lane divider AP by 4.6 and mAP by 2.6 on NuScenes by treating actor trajectories as structural priors for road geometry.
-
EviRCOD: Evidence-Guided Probabilistic Decoding for Referring Camouflaged Object Detection
EviRCOD integrates reference-guided deformable encoding, uncertainty-aware evidential decoding, and boundary refinement to achieve state-of-the-art performance on referring camouflaged object detection benchmarks with...
-
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units
SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.
-
Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching
GREATEN fuses surface normals with image features via gated contextual-geometric fusion and efficient sparse attentions to cut stereo matching errors by up to 30% on real datasets when trained solely on synthetic data.
-
Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer
The OG-ReG Transformer achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48 by combining global glance and local gaze processing paths.
-
HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation
HQF-Net reports mIoU gains on three remote-sensing benchmarks by adding quantum circuits to skip connections and a mixture-of-experts bottleneck inside a classical U-Net fused with a DINOv3 backbone.
-
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
VoxSAMNet introduces sparsity-aware deformable attention via a dummy node and foreground modulation with dropout plus text-guided filtering to reach new state-of-the-art mIoU of 18.2% on SemanticKITTI and 20.2% on SSC...
-
Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving
MMF-BEV fuses camera and radar branches with deformable self- and cross-attention, outperforming unimodal baselines on the VoD 4D radar dataset through a two-stage training process.
-
Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
MDDCNet combines Mamba blocks with deformable dilated convolutions, enhanced feed-forward networks, and an attention-aggregating feature pyramid to achieve better multi-scale traffic object detection than prior detectors.
-
Benchmarking ResNet Backbones in RT-DETR: Impact of Depth and Regularization under environmental conditions
Intermediate-depth ResNet backbones in RT-DETR maintain near-perfect accuracy for round objects under lighting or background shifts, with ResNet50 best for illumination changes and ResNet34 best for background changes.
Reference graph
Works this paper leans on
-
[1]
Etc: Encoding long and structured data in transformers
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Philip Pham, Anirudh Ravula, and Sumit Sanghai. Etc: Encoding long and structured data in transformers. arXiv preprint arXiv:2004.08483,
-
[2]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150,
-
[3]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,
-
[4]
Masked language modeling for proteins via linearly scalable long-context transformers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tamas Sarlos, David Belanger, Lucy Colwell, and Adrian Weller. Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555,
-
[5]
Deformable convolutional networks
9 Published as a conference paper at ICLR 2021 Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV,
-
[6]
Axial attention in multidimensional transformers
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180,
-
[7]
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. arXiv preprint arXiv:2006.16236,
-
[8]
Blockwise self-attention for long document understanding
Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972,
-
[9]
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. arXiv preprint arXiv:2003.05997,
-
[10]
Revisiting the sibling head in object detector
Guanglu Song, Yu Liu, and Xiaogang Wang. Revisiting the sibling head in object detector. In CVPR,
-
[11]
Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse sinkhorn attention. In ICML, 2020a. Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020b. Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In ECCV,
-
[12]
Axial-deeplab: Stand-alone axial-attention for panoptic segmentation
Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. arXiv preprint arXiv:2003.07853, 2020a. Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020b. Felix Wu...
-
[13]
Big bird: Transformers for longer sequences
Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062,
-
[14]
An empirical study of spatial attention mechanisms in deep networks
Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, and Jifeng Dai. An empirical study of spatial attention mechanisms in deep networks. In ICCV, 2019a. Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In CVPR, 2019b. A Appendix A.1 Complexity for Def...
-
[15]
The lowest resolution feature map x^L is obtained via a 3 × 3 stride 2 convolution on the final C5 stage
(transformed by a 1 × 1 convolution). The lowest resolution feature map x^L is obtained via a 3 × 3 stride 2 convolution on the final C5 stage. Note that FPN (Lin et al., 2017a) is not used, because our proposed multi-scale deformable attention in itself can exchange information among multi-scale feature maps. [Figure: multi-scale feature maps of shape H/8 × W/8 × 512, H/16 × W/16 × 1024, H/32 × W/32 × 2048, H/8 × W/8 × ...]
-
[16]
According to Taylor’s theorem, the gradient norm can reflect how much the output would be changed relative to the perturbation of the pixel, thus it could show us which pixels the model mainly relies on for predicting each item. The visualization indicates that Deformable DETR looks at extreme points of the object to determine its bounding box, which is s...
-
[17]
For readability, we combine the sampling points and attention weights from feature maps of different resolutions into one picture. Similar to DETR (Carion et al., 2020), the instances are already separated in the encoder of Deformable DETR. While in the decoder, our model is focused on the whole foreground instance instead of only extreme points as obse...