pith. sign in

arxiv: 2606.09495 · v1 · pith:BAARXBIOnew · submitted 2026-06-08 · 💻 cs.CV

ContextShift: A Controlled Benchmark for Context Dependence in Object Detection

Pith reviewed 2026-06-27 17:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords object detectioncontext dependencebenchmarkrobustnessfalse negativesCOCOprediction suppression
0
0 comments X

The pith

Object detectors miss more objects and produce fewer predictions when context is altered while objects stay identical.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ContextShift, a benchmark built on COCO that changes object backgrounds and spatial relations through controlled geometric transforms and substitutions while leaving the objects themselves unchanged. Testing across detector architectures reveals a consistent pattern where false negatives rise sharply and the total number of predictions falls, yet false positives stay flat or drop. Standard aggregate scores like average precision fail to reveal this suppression effect. The degradation appears tied to fewer valid detection candidates forming rather than simple drops in confidence, and performance along a statistical compatibility measure is non-monotonic.

Core claim

Across detectors, controlled context manipulations produce up to 227% more false negatives and up to 44% fewer predictions while false positives remain stable or decline; this suppression is driven by reduced formation of valid candidates, is masked by AP, shows non-monotonic dependence on normalized pointwise mutual information, and can be partially recovered by context-aware training augmentations.

What carries the argument

ContextShift benchmark that isolates context via geometric transformations and synthetic/natural background substitutions on COCO images, with a continuous compatibility axis based on normalized pointwise mutual information.

If this is right

  • Detectors form fewer valid candidates once object-context pairings are broken.
  • AP scores can remain stable while recall losses grow large under context change.
  • Exposing models to object-context decoupling during training raises performance on both original and altered test images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The non-monotonic pattern with statistical compatibility suggests detectors may over-rely on typical co-occurrences rather than learning robust visual features.
  • The same controlled manipulation approach could be extended to measure context sensitivity in segmentation or pose estimation.

Load-bearing premise

The manipulations keep object appearance exactly the same and change nothing else visually besides the intended context shift.

What would settle it

Run the same detectors on paired images that differ only by background replacement or object repositioning and check whether false-negative counts rise and total predictions fall without corresponding rises in false positives.

Figures

Figures reproduced from arXiv: 2606.09495 by Alex Lazarovich, Dan Zlotnikov, Ohad Ben-Shahar.

Figure 1
Figure 1. Figure 1: Representative implicit (rotation) and explicit (synthetic and natural) context manipulations. Additional full examples [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Detection performance as a function of object–background NPMI compatibility ( [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Detection performance as a function of NPMI compatibility on real images (focal object evaluation). [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Detection improvements from context-augmentation strategies over an unaugmented baseline. All five strategies [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Relationship between COCO-O degradation and context-dependent missed detections. Each point corresponds [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Full manipulation examples with all variants for representative images. Top: implicit rotation variants. Middle: [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Performance curves under focal-object geometric manipulations. Metrics are computed only on the manipulated [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance curves under geometric manipulations computed over the full image. [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Performance curves under synthetic background replacement. [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Detection performance as a function of object–background compatibility (NPMI) in the multi-object setting. [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prediction candidate existence rate across representative manipulations. [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Conditional score distributions given prediction candidate existence. Distributions remain largely stable across [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Distribution of maximum prediction score per ground-truth object. A score of 0 indicates no prediction sufficiently [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Recoverable recall as a function of confidence threshold. Shaded regions indicate the recall gap recoverable by [PITH_FULL_IMAGE:figures/full_fig_p030_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Performance as a function of image-frequency compatibility. Curves exhibit a saturation shape: structured [PITH_FULL_IMAGE:figures/full_fig_p031_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Performance as a function of character n-gram surface compatibility. Curves are flat and unordered across all metrics, confirming that string-level similarity carries no visual compatibility signal. Discussion. Two conclusions follow from these experiments. First, prediction suppression under background substitution is robust to the choice of compatibility formulation: both measures—one structured, one ra… view at source ↗
Figure 17
Figure 17. Figure 17: Performance curves under compatibility-driven natural background replacement across five composition settings, [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗
read the original abstract

Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual variation remains insufficiently understood. Prior evaluations largely rely on aggregate metrics such as AP on uncontrolled distribution shifts, which can obscure how performance degrades under context change. We introduce ContextShift, a controlled benchmark that systematically manipulates object--context relationships while preserving object appearance. Built on COCO 2017, it isolates context as an independent variable through geometric transformations and synthetic and natural background substitutions, including a continuous compatibility axis based on normalized pointwise mutual information (NPMI). Across diverse detector architectures, we observe a consistent degradation pattern: false negatives increase by up to 227% and prediction volume decreases by up to 44%, while false positives remain stable or decline. This suppression behavior is not captured by aggregate metrics such as AP, which can mask substantial recall loss and changes in prediction dynamics. Further analysis suggests that degradation is driven less by reduced confidence than by a reduced formation of valid detection candidates. Moreover, performance along the statistical compatibility axis is non-monotonic, peaking at intermediate NPMI and degrading toward both extremes, indicating that statistical co-occurrence does not correlate linearly with effective visual context. Finally, we show that context-aware augmentation improves robustness: every augmented variant outperforms the dataset-only baseline on both original and manipulated test images, partially recovering performance lost to prediction-suppression failures by exposing models to object--context decoupling during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ContextShift, a controlled benchmark on COCO 2017 that manipulates object-context relationships via geometric transformations and synthetic/natural background substitutions while claiming to preserve object appearance exactly. It defines a continuous compatibility axis using normalized pointwise mutual information (NPMI). Across multiple detector architectures, it reports consistent degradation: false negatives rise by up to 227%, prediction volume falls by up to 44%, false positives remain stable or decline; these effects are not captured by aggregate AP. Degradation is attributed to reduced candidate formation rather than confidence drop. NPMI performance is non-monotonic (peaks at intermediate values). Context-aware augmentation during training partially recovers performance on both original and manipulated images.

Significance. If the benchmark successfully isolates context, the work demonstrates that standard aggregate metrics like AP can mask substantial recall loss and prediction suppression under context shifts, that statistical co-occurrence (NPMI) does not linearly predict effective context, and that targeted augmentation can mitigate suppression failures. This would offer a reproducible empirical tool for studying context dependence and a concrete training intervention, strengthening robustness evaluation practices in object detection.

major comments (2)
  1. [Methods / Benchmark Construction] Benchmark construction (methods section): The central claim that detectors exhibit context-driven suppression (FN +227%, candidate volume -44%) requires that geometric transformations and background substitutions preserve object appearance exactly and introduce no confounding low-level artifacts. The manuscript provides no quantitative validation (e.g., mask-aligned pixel statistics, edge continuity metrics, or perceptual studies) confirming absence of illumination mismatches, boundary discontinuities, or resampling artifacts outside the object mask. This directly undermines causal attribution of the observed candidate-formation drop and non-monotonic NPMI curve to semantic context rather than synthesis artifacts.
  2. [Results] Results on candidate formation (results section): The claim that degradation is driven by reduced formation of valid detection candidates rather than reduced confidence is load-bearing for the interpretation that AP masks the effect. The manuscript does not report the precise operational definition of 'valid detection candidates' (e.g., IoU threshold, score threshold, or region-proposal count) nor provide per-image breakdowns or ablation on proposal generation stages, making it impossible to verify that the 44% volume reduction is not an artifact of post-processing or thresholding choices.
minor comments (2)
  1. [Abstract / Results] The abstract and results report 'up to 227%' and 'up to 44%' without specifying the exact detector, split, or manipulation condition achieving the maximum; add a table row or footnote linking each extreme value to its source experiment.
  2. [Figures] Figure captions for the NPMI plots should explicitly state the number of images per NPMI bin and whether error bars represent standard deviation across architectures or across images.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our benchmark construction and analysis. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Methods / Benchmark Construction] Benchmark construction (methods section): The central claim that detectors exhibit context-driven suppression (FN +227%, candidate volume -44%) requires that geometric transformations and background substitutions preserve object appearance exactly and introduce no confounding low-level artifacts. The manuscript provides no quantitative validation (e.g., mask-aligned pixel statistics, edge continuity metrics, or perceptual studies) confirming absence of illumination mismatches, boundary discontinuities, or resampling artifacts outside the object mask. This directly undermines causal attribution of the observed candidate-formation drop and non-monotonic NPMI curve to semantic context rather than synthesis artifacts.

    Authors: We agree that explicit quantitative validation of object preservation would strengthen causal attribution to semantic context. The transformations are constructed to leave object pixels unchanged (affine transforms applied only within instance masks, followed by seamless background compositing), but the manuscript does not report supporting statistics. In revision we will add mask-aligned L2 pixel difference statistics, gradient continuity metrics at mask boundaries, and a small perceptual study on a subset of images, all placed in an expanded Methods section and supplementary material. These additions directly address the concern while preserving the existing experimental results. revision: yes

  2. Referee: [Results] Results on candidate formation (results section): The claim that degradation is driven by reduced formation of valid detection candidates rather than reduced confidence is load-bearing for the interpretation that AP masks the effect. The manuscript does not report the precise operational definition of 'valid detection candidates' (e.g., IoU threshold, score threshold, or region-proposal count) nor provide per-image breakdowns or ablation on proposal generation stages, making it impossible to verify that the 44% volume reduction is not an artifact of post-processing or thresholding choices.

    Authors: We acknowledge the need for a precise operational definition. Valid detection candidates are defined as initial region proposals (or equivalent early-stage outputs) that achieve IoU > 0.5 with a ground-truth object before final classification and NMS thresholding. We will insert this definition, together with the exact score threshold used for counting, into the Results section. In addition, the revised manuscript will include per-image histograms of candidate counts and a brief ablation isolating the proposal-generation stage (e.g., RPN output before ROI head) in supplementary material. These clarifications will allow readers to verify that the reported 44% reduction is not an artifact of post-processing. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct measurements

full rationale

The paper introduces ContextShift as a data manipulation benchmark on COCO and reports observed detector behaviors (FN increases, prediction volume drops, non-monotonic NPMI effects) via direct comparisons. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All central claims reduce to experimental outputs on manipulated images rather than any self-referential construction. This is the expected outcome for an empirical benchmark study with no theoretical derivation component.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on standard assumptions about the COCO dataset and the validity of NPMI as a proxy for visual compatibility; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption COCO 2017 images can be manipulated to change context while exactly preserving object appearance
    Stated as the basis for isolating context as the independent variable.
  • domain assumption Normalized pointwise mutual information computed on the dataset provides a meaningful continuous axis for context compatibility
    Used to organize test cases and interpret non-monotonic performance.

pith-pipeline@v0.9.1-grok · 5784 in / 1265 out tokens · 24556 ms · 2026-06-27T17:04:10.199197+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Detecting out-of-context objects using graph contextual reasoning network

    Manoj Acharya, Anirban Roy, Kaushik Koneripalli, Susmit Jha, Christopher Kanan, and Ajay Divakaran. Detecting out-of-context objects using graph contextual reasoning network. In International Joint Conference on Artificial Intelligence, 2022

  2. [2]

    Impact of data duplication on deep neural network-based image classifiers: Robust vs

    Alireza Aghabagherloo, Aydin Abadi, Sumanta Sarkar, Vishnu Asutosh Dasu, and Bart Preneel. Impact of data duplication on deep neural network-based image classifiers: Robust vs. standard models, 2025

  3. [3]

    End to End Learning for Self-Driving Cars

    Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiawei Zhang, et al. End to end learning for self-driving cars.arXiv preprint arXiv:1604.07316, 2016

  4. [4]

    Normalized (pointwise) mutual information in collocation extraction

    Gerlof Bouma. Normalized (pointwise) mutual information in collocation extraction. In Proceedings of the Biennial GSCL Conference, pages 31–40, Tübingen, Germany, 2009

  5. [5]

    Yolo- world: Real-time open-vocabulary object detection

    Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, and Ying Shan. Yolo- world: Real-time open-vocabulary object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16901–16911, 2024

  6. [6]

    Modeling visual context is key to augmenting object detection datasets

    Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling visual context is key to augmenting object detection datasets. InProceedings of the European Conference on Computer Vision (ECCV), pages 375–391, 2018

  7. [7]

    A survey on bias in visual datasets.Computer Vision and Image Understanding, 223:103552, 2022

    Simone Fabbrizzi, Symeon Papadopoulos, Eirini Ntoutsi, and Ioannis Kompatsiaris. A survey on bias in visual datasets.Computer Vision and Image Understanding, 223:103552, 2022

  8. [8]

    Cubuk, Quoc V

    Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V . Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2664–2674, 2021

  9. [9]

    Deep residual learning for im- age recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

  10. [10]

    Ultralytics yolo

    Glenn Jocher, Ayush Chaurasia, Jing Qiu, et al. Ultralytics yolo. https://github.com/ ultralytics/ultralytics, 2023. Accessed: 2026-04-15

  11. [11]

    WILDS: A benchmark of in-the-wild distribution shifts

    Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton Earnshaw, Imran Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A be...

  12. [12]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

  13. [13]

    Spawrious: A benchmark for fine control of spurious correlation biases.arXiv preprint arXiv:2303.05470, 2023

    Aengus Lynch, Gbètondji J-S Dovonon, Jean Kaddour, and Ricardo Silva. Spawrious: A benchmark for fine control of spurious correlation biases.arXiv preprint arXiv:2303.05470, 2023

  14. [14]

    Coco- o: A benchmark for object detectors under natural distribution shifts

    Xiaofeng Mao, Yuefeng Chen, Yao Zhu, Da Chen, Hang Su, Rong Zhang, and Hui Xue. Coco- o: A benchmark for object detectors under natural distribution shifts. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6339–6350, 2023

  15. [15]

    Ecker, Matthias Bethge, and Wieland Brendel

    Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S. Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. InNeurIPS Workshop on Machine Learning for Autonomous Driving, 2019. 10

  16. [16]

    The role of context for object detection and semantic segmentation in the wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014

  17. [17]

    Decompose-and-compose: A compositional approach to mitigating spurious correlation

    Fahimeh Hosseini Noohdani, Parsa Hosseini, Aryan Yazdan Parast, Hamidreza Yaghoubi Araghi, and Mahdieh Soleymani Baghshah. Decompose-and-compose: A compositional approach to mitigating spurious correlation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  18. [18]

    arXiv preprint arXiv:2410.13842 (2024) 4, 10

    Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, and Feng Wu. D-fine: Redefine regression task in detrs as fine-grained distribution refinement.arXiv preprint arXiv:2410.13842, 2024

  19. [19]

    Black-box explanation of object detectors via saliency maps

    Vitali Petsiuk, Rajiv Jain, Bhavan Bhotika, and Kate Saenko. Black-box explanation of object detectors via saliency maps. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11443–11452, 2021

  20. [20]

    You only look once: Unified, real-time object detection

    Joseph Redmon et al. You only look once: Unified, real-time object detection. InCVPR, 2016

  21. [21]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2015

  22. [22]

    Rf-detr: Real-time detection transformer

    Roboflow. Rf-detr: Real-time detection transformer. https://github.com/roboflow/ rf-detr, 2023. Accessed: 2026-04-15

  23. [23]

    Role of spatial context in adversarial robustness for object detection

    Aniruddha Saha, Akshayvarun Subramanya, Koninika Patil, and Hamed Pirsiavash. Role of spatial context in adversarial robustness for object detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 784–785, 2020

  24. [24]

    Don’t judge an object by its context: Learning to overcome contextual bias

    Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, and Deepti Ghadiyaram. Don’t judge an object by its context: Learning to overcome contextual bias. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11070–11078, 2020

  25. [25]

    Benchmarking object detectors with COCO: A new path forward, 2024

    Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, and Karan Desai. Benchmarking object detectors with COCO: A new path forward, 2024

  26. [26]

    Quantifying context bias in domain adaptation for object detection.arXiv preprint arXiv:2409.14679, 2024

    Hojun Son, Asma Almutairi, and Arpan Kusari. Quantifying context bias in domain adaptation for object detection.arXiv preprint arXiv:2409.14679, 2024

  27. [27]

    Real-world anomaly detection in surveillance videos

    Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6479–6488, 2018

  28. [28]

    Resolution-robust large mask inpainting with fourier convolutions

    Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2149–2159, January 2022

  29. [29]

    Robust real-time face detection.International Journal of Computer Vision, 57(2):137–154, 2004

    Paul Viola and Michael J Jones. Robust real-time face detection.International Journal of Computer Vision, 57(2):137–154, 2004

  30. [30]

    Clad: A contrastive learning based approach for background debiasing.arXiv preprint arXiv:2210.02748, 2022

    Ke Wang, Harshitha Machiraju, Oh-Hyeon Choung, Michael Herzog, and Pascal Frossard. Clad: A contrastive learning based approach for background debiasing.arXiv preprint arXiv:2210.02748, 2022

  31. [31]

    Noise or signal: The role of image backgrounds in object recognition

    Kai Xiao, Logan Engstrom, Andrew Ilyas, and Aleksander Madry. Noise or signal: The role of image backgrounds in object recognition. InInternational Conference on Learning Representations (ICLR), 2021. 11

  32. [32]

    Wild-time: A benchmark of in-the-wild distribution shift over time

    Huaxiu Yao, Caroline Choi, Bochuan Cao, Yoonho Lee, Pang Wei Koh, and Chelsea Finn. Wild-time: A benchmark of in-the-wild distribution shift over time. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, volume 35, pages 10309–10324, 2022

  33. [33]

    Places: A 10 million image database for scene recognition

    Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. InIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017

  34. [34]

    a photo of aoin ac ′

    Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. InInternational Conference on Learning Representations (ICLR), 2021. 12 A Manipulation Framework A.1 Object-Background Compatibility Modeling A.1.1 Compatibility Metric We construct an |O| × |P| compatibility ma...

  35. [35]

    All data used consists of existing publicly available datasets; no new data collection involving human participants took place

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...