arxiv: 1907.07061 · v1 · pith:Y2V2766Cnew · submitted 2019-07-16 · 💻 cs.CV

How much real data do we actually need: Analyzing object detection performance using synthetic and real data

Pith reviewed 2026-05-24 20:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords synthetic data, object detection, domain similarity, real data, mixed training, deep learning, computer vision, data annotation

The pith

Domain similarity between synthetic and real datasets predicts how well mixed training sets perform on real test images for object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines what happens to object detection accuracy when real training images are replaced or supplemented by synthetic ones, including scenarios with only limited real data available. It evaluates several synthetic and real datasets, generates additional synthetic examples via simulation, and computes domain similarity scores across them. The authors derive insights into a methodological procedure for deciding training mixes that optimize results on real-world test images. A sympathetic reader would care because real data annotation is expensive and time-consuming, so knowing reliable substitution rules could lower the barrier to training effective detectors. If the procedure holds, it would let practitioners use cheaper synthetic data without large drops in real-test performance.

Core claim

The authors argue that by measuring domain similarity across multiple synthetic and real datasets and testing mixed training regimes for object detection, one can design a methodological procedure that indicates how much real data is actually needed to reach target performance on real test images.

What carries the argument

Domain similarity measurement between synthetic and real datasets, used to predict and guide the composition of mixed training sets for object detectors.

If this is right

  • Synthetic data can substitute for a substantial fraction of real annotated data while preserving detection performance on real tests when domain similarity is high.
  • Limited real data combined with domain-similar synthetic data yields better real-test results than either alone.
  • A repeatable procedure emerges for selecting dataset mixes based on similarity scores rather than trial-and-error.
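
As a concrete illustration of what a "mixed training set" means in the points above, here is a minimal Python sketch that samples a training list with a chosen fraction of real images. The function name `mix_training_set`, the file names, and the 10% ratio are all hypothetical and chosen for illustration; the paper's own sampling procedure is not described on this page.

```python
import random


def mix_training_set(real_items, synthetic_items, real_fraction, total_size, seed=0):
    """Sample `total_size` items, about `real_fraction` of them real.

    Hypothetical helper for illustration; not the authors' procedure.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    n_real = round(total_size * real_fraction)
    n_synth = total_size - n_real
    if n_real > len(real_items) or n_synth > len(synthetic_items):
        raise ValueError("not enough items to build the requested mix")
    # Sample without replacement from each pool, then shuffle the union.
    mixed = rng.sample(real_items, n_real) + rng.sample(synthetic_items, n_synth)
    rng.shuffle(mixed)
    return mixed


# Placeholder file names (assumed, not taken from any dataset).
real = [f"real_{i:04d}.jpg" for i in range(1000)]
synth = [f"synth_{i:04d}.jpg" for i in range(5000)]
train = mix_training_set(real, synth, real_fraction=0.1, total_size=2000)
```

Sweeping `real_fraction` over several values and training one detector per mix would give the kind of performance-versus-real-data curve the review refers to.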

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same similarity-guided mixing logic could be tested on other vision tasks such as segmentation or pose estimation.
  • If the procedure generalizes, it would reduce annotation budgets for new detection domains by prioritizing collection of only the most dissimilar real samples.
  • A follow-up test could apply the derived procedure to an unseen pair of synthetic and real datasets and check whether predicted performance matches observed performance.

Load-bearing premise

That measurable domain similarity between the chosen synthetic and real datasets will reliably predict how well mixed training sets will perform on real test images.

What would settle it

An experiment showing two pairs of synthetic-real datasets with nearly identical domain similarity scores but markedly different accuracy outcomes when the same proportion of real data is mixed in.
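
The check described above can be sketched in Python as a scan over per-pair results for cases where similarity scores nearly match but accuracy does not. Everything here is an assumption made for illustration: the record layout, the tolerance `sim_tol`, the gap `map_gap`, and the numbers are placeholders, not values from the paper.

```python
from itertools import combinations


def find_counterexamples(results, sim_tol=0.02, map_gap=0.05):
    """Return pairs of entries with close similarity but distant mAP.

    `results` is a list of dicts with keys 'pair', 'similarity' and 'map',
    all measured at the same real-data proportion. Thresholds are
    hypothetical and would need to be set from measurement noise.
    """
    flagged = []
    for a, b in combinations(results, 2):
        close_similarity = abs(a["similarity"] - b["similarity"]) <= sim_tol
        distant_accuracy = abs(a["map"] - b["map"]) >= map_gap
        if close_similarity and distant_accuracy:
            flagged.append((a["pair"], b["pair"]))
    return flagged


# Made-up numbers, only to show the shape of the input.
results = [
    {"pair": ("synth_A", "real_X"), "similarity": 0.71, "map": 0.30},
    {"pair": ("synth_B", "real_X"), "similarity": 0.72, "map": 0.42},
    {"pair": ("synth_C", "real_X"), "similarity": 0.40, "map": 0.25},
]
counterexamples = find_counterexamples(results)
```

A non-empty result on real measurements would count against the premise that similarity alone predicts mixed-training performance.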

Figures

Figures reproduced from arXiv: 1907.07061 by Dhanvin Kolhatkar, Fahed Al Hassanat, Farzan Erlik Nowruzi, Julien Rebut, Prince Kapoor, Robert Laganiere.

Figure 1: Sample images from real and synthetic datasets. (a) BDD (Yu et al., 2018), (b) KC (Geiger et al., 2012; Cordts et al., 2016), (c) NS (Caesar et al., 2019), (d) 7D (Wrenninge & Unger, 2018), (e) P4B (Richter et al., 2017), (f) CARLA (Dosovitskiy et al., 2017).

Figure 3: The model is trained on each of the datasets and is tested on all the other datasets. In the legend, the train-test dataset combinations are shown as a tuple.

Figure 5: Results of fine-tuning with real data. The model is trained on the synthetic dataset and is then fine-tuned on a real dataset. The test results are performed on the test split of the real dataset.

Figure 6: Results of using all the synthetic datasets together for training, and 10% of real dataset size (3% of train size) for fine-tuning. These results are shown as ASR10 in the figure.

Original abstract

In recent years, deep learning models have resulted in a huge amount of progress in various areas, including computer vision. By nature, the supervised training of deep models requires a large amount of data to be available. This ideal case is usually not tractable as the data annotation is a tremendously exhausting and costly task to perform. An alternative is to use synthetic data. In this paper, we take a comprehensive look into the effects of replacing real data with synthetic data. We further analyze the effects of having a limited amount of real data. We use multiple synthetic and real datasets along with a simulation tool to create large amounts of cheaply annotated synthetic data. We analyze the domain similarity of each of these datasets. We provide insights about designing a methodological procedure for training deep networks using these datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper examines the effects of replacing portions of real training data with synthetic data for object detection, analyzes domain similarity across multiple real and synthetic datasets (with additional synthetic data generated by a simulation tool), and derives methodological insights for training deep networks when real annotated data is limited.

Significance. If the domain-similarity analysis reliably predicts mixed-training performance gains, the work would supply practical guidance on data mixing strategies that could reduce annotation costs in computer vision pipelines.

major comments (1)
  1. [Results] Results section (performance curves and similarity analysis): the manuscript presents domain similarity values alongside separate mAP curves for mixed training sets but does not report any correlation, regression, or ranking test demonstrating that the chosen similarity measure predicts or orders the observed performance deltas when real data is partially replaced by synthetic data. This missing link is load-bearing for the central claim that the analysis yields a generalizable methodological procedure.
minor comments (1)
  1. [Abstract] Abstract: the description of experiments across datasets mentions no quantitative results, error bars, or dataset exclusion criteria, which reduces the standalone informativeness of the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [Results] Results section (performance curves and similarity analysis): the manuscript presents domain similarity values alongside separate mAP curves for mixed training sets but does not report any correlation, regression, or ranking test demonstrating that the chosen similarity measure predicts or orders the observed performance deltas when real data is partially replaced by synthetic data. This missing link is load-bearing for the central claim that the analysis yields a generalizable methodological procedure.

    Authors: We agree that the manuscript would be strengthened by an explicit statistical demonstration that the domain similarity measure predicts or ranks the observed mAP changes. The current text derives qualitative insights from the juxtaposition of similarity values and performance curves but does not include a correlation, regression, or ranking test. In the revision we will add this analysis to the Results section: we will compute and report Spearman rank correlations (and, where appropriate, linear regression R²) between the similarity scores and the mAP deltas for each real-to-synthetic replacement ratio across the evaluated datasets. These quantitative results will be presented alongside the existing figures to directly support the claim of a generalizable procedure.

    Revision: yes
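
The analysis proposed in this response can be sketched in plain Python. The block below computes a Spearman rank correlation (Pearson correlation of average ranks, so ties are handled) and the R² of a simple linear fit, which for one predictor equals the squared Pearson correlation. The function names and all input numbers are hypothetical placeholders; no values from the paper are used.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up similarity scores and mAP deltas at one replacement ratio.
similarity = [0.2, 0.5, 0.6, 0.9]
map_delta = [-0.10, -0.04, -0.06, -0.01]
rho = spearman(similarity, map_delta)
r_squared = pearson(similarity, map_delta) ** 2  # R² of a one-variable linear fit
```

With only a handful of dataset pairs, any such coefficient would have wide uncertainty, so a significance test or confidence interval would need to accompany it.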

Circularity Check

0 steps flagged

Empirical dataset comparison contains no circular derivations or self-referential predictions

Full rationale

The paper performs an empirical analysis by training object detectors on varying mixtures of existing synthetic and real datasets, measuring mAP, and computing domain similarity statistics between those fixed datasets. No equations, fitted parameters, or predictions are defined in terms of the target performance quantities; the reported insights follow directly from the experimental outcomes on held-out real test images. No self-citation chains or uniqueness theorems are invoked to justify the procedure, and the work does not rename known results or smuggle ansatzes. The central claim therefore remains independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced in the abstract; the study relies on pre-existing datasets and simulation tools.

pith-pipeline@v0.9.0 · 5684 in / 979 out tokens · 26025 ms · 2026-05-24T20:47:16.711625+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 14 internal anchors

  1. [1]

    nuScenes: A multimodal dataset for autonomous driving

    Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027,

  2. [2]

    Sensor Transfer: Learning Optimal Sensor Effect Image Augmentation for Sim-to-Real Domain Adaptation

    URL http://arxiv.org/abs/1809.06256. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, Apr 2018a. Chen, P.-R., Lo, S.-Y., Hang, H.-M., Chan, ...

  3. [3]

    Simulating LiDAR point cloud for autonomous driving using real-world scenes and traffic flows,

    Fang, J., Yan, F., Zhao, T., Zhang, F., Zhou, D., Yang, R., Ma, Y., and Wang, L. Simulating lidar point cloud for autonomous driving using real-world scenes and traffic flows. arXiv preprint arXiv:1811.07112,

  4. [4]

    FCNs in the Wild: Pixel-level Adversarial and Constraint-based Adaptation

    URL http://arxiv.org/abs/1612.02649. Hoffman, J., Tzeng, E., Park, T., Zhu, J.-Y., Isola, P., Saenko, K., Efros, A. A., and Darrell, T. CyCADA: Cycle-Consistent Adversarial Domain Adaptation

  5. [5]

    CyCADA: Cycle-Consistent Adversarial Domain Adaptation

    URL http://arxiv.org/abs/1711.03213. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861,

  6. [6]

    Speed/accuracy trade-offs for modern convolutional object detectors

    Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., et al. Speed/accuracy trade-offs for modern convolutional object detectors. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul

  7. [7]

    Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks - Counting, Detection, and Tracking

    URL http://arxiv.org/abs/1705.10118. Li, B., Zhang, T., and Xia, T. Vehicle detection from 3d lidar using fully convolutional network. Robotics: Science and Systems XII. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. Lecture Notes in Computer Science, pp. 740–755,

  8. [8]

    Few-shot image recognition by predicting parameters from activations

    Qiao, S., Liu, C., Shen, W., and Yuille, A. Few-shot image recognition by predicting parameters from activations. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,

  9. [9]

    Deep learning for self-driving cars: chances and challenges

    Rao, Q. and Frtunikj, J. Deep learning for self-driving cars: chances and challenges. In 2018 IEEE/ACM 1st International Workshop on Software Engineering for AI in Autonomous Systems (SEFAIAS), pp. 35–38. IEEE,

  10. [10]

    XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

    URL http://arxiv.org/abs/1603.05279. Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, Jun

  11. [11]

    Learning from Synthetic Data: Addressing Domain Shift for Semantic Segmentation

    URL http://arxiv.org/abs/1711.06969. Shelhamer, E., Long, J., and Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, Apr

  12. [12]

    Complex-YOLO: Real-time 3D Object Detection on Point Clouds

    URL https://arxiv.org/pdf/1803.06199.pdf. Soviany, P. and Ionescu, R. T. Continuous trade-off optimization between fast and accurate deep face detectors

  13. [13]

    Learning to Compare: Relation Network for Few-Shot Learning

    URL http://arxiv.org/abs/1711.06025. Teichmann, M., Weber, M., Zoellner, M., Cipolla, R., and Urtasun, R. Multinet: Real-time joint semantic reasoning for autonomous driving. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1013–1020. IEEE,

  14. [14]

    Dynamic Graph CNN for Learning on Point Clouds

    URL http://arxiv.org/abs/1801.07829. Wrenninge, M. and Unger, J. Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing

  15. [15]

    Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing

    URL http://arxiv.org/abs/1810.08705. Wu, B., Wan, A., Yue, X., and Keutzer, K. SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud

  16. [16]

    SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud

    URL http://arxiv.org/abs/1710.07368. Wu, Z., Han, X., Lin, Y. L., Uzunbas, M. G., Goldstein, T., Lim, S. N., and Davis, L. S. DCAN: Dual Channel-Wise Alignment Networks for Unsupervised Scene Adaptation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 535–552,

  17. [17]

    Learning cross-modal deep representations for robust pedestrian detection

    Xu, D., Ouyang, W., Ricci, E., Wang, X., and Sebe, N. Learning cross-modal deep representations for robust pedestrian detection. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul

  18. [18]

    Bdd100k: A diverse driving video database with scalable annotation tooling

    Yu, F., Xian, W., Chen, Y., Liu, F., Liao, M., Madhavan, V., and Darrell, T. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687,

  19. [19]

    Fully Convolutional Adaptation Networks for Semantic Segmentation

    URL http://arxiv.org/abs/1804.08286. Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. Pyramid scene parsing network

  20. [20]

    Pyramid Scene Parsing Network

    URL http://arxiv.org/abs/1612.01105