SyMTRS: Benchmark Multi-Task Synthetic Dataset for Depth, Domain Adaptation and Super-Resolution in Aerial Imagery
Pith reviewed 2026-05-09 22:43 UTC · model grok-4.3
The pith
SyMTRS supplies a single synthetic aerial dataset with pixel-perfect depth maps, night-time pairs, and multi-scale low-resolution images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SyMTRS is a multi-task synthetic benchmark that supplies high-resolution RGB aerial imagery, pixel-perfect depth maps, night-time domain-shift pairs, and aligned low-resolution variants at x2, x4, and x8 scales, all produced by a single high-fidelity urban simulation pipeline.
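The aligned low-resolution variants can be illustrated with a minimal sketch. The paper does not state which downsampling kernel it uses, so the example below assumes simple block averaging (bicubic is the other common choice in super-resolution benchmarks); the function name and API are hypothetical.

```python
import numpy as np

def make_lr_variants(hr, scales=(2, 4, 8)):
    """Produce aligned low-resolution variants of one HR image by block
    averaging (a stand-in for the paper's unspecified downsampling kernel).
    hr: (H, W, C) array with H and W divisible by max(scales)."""
    variants = {}
    h, w, c = hr.shape
    for s in scales:
        # Group pixels into s x s blocks and average each block.
        lr = hr.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))
        variants[s] = lr
    return variants

hr = np.zeros((2048, 2048, 3), dtype=np.float32)
for s, lr in make_lr_variants(hr).items():
    print(s, lr.shape)  # each LR variant is (2048 // s, 2048 // s, 3)
```

Because every LR pixel is computed from a fixed block of HR pixels, the variants stay pixel-aligned with the HR source, which is the property the super-resolution track depends on.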
What carries the argument
The high-fidelity urban simulation pipeline that generates geometrically consistent, multi-domain aerial imagery with perfect depth and scale annotations.
Load-bearing premise
Imagery produced by the simulation pipeline has statistical properties and variations close enough to real aerial remote-sensing data that models trained on it will transfer.
What would settle it
Train a monocular depth model on SyMTRS and evaluate it on a real-world aerial depth dataset; performance substantially below that of models trained on real data would falsify the transfer assumption.
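The proposed falsification test reduces to comparing standard monocular-depth error metrics for a SyMTRS-trained model against a real-data-trained model on the same real test set. A minimal sketch of the two usual metrics (absolute relative error and RMSE), with a hypothetical function name and a toy sanity check:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Absolute relative error and RMSE, the usual monocular-depth scores.
    pred, gt: depth maps of identical shape; pixels with gt <= eps are masked out."""
    mask = gt > eps
    p, g = pred[mask], gt[mask]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    rmse = float(np.sqrt(np.mean((p - g) ** 2)))
    return abs_rel, rmse

# Toy check: a prediction 10% above ground truth everywhere gives AbsRel ~0.1.
gt = np.full((4, 4), 50.0)
pred = gt * 1.1
abs_rel, rmse = depth_metrics(pred, gt)
print(abs_rel, rmse)
```

The transfer assumption would be falsified if the SyMTRS-trained model's AbsRel/RMSE on real aerial data were substantially worse than those of a model trained on real data.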
Original abstract
Recent advances in deep learning for remote sensing rely heavily on large annotated datasets, yet acquiring high-quality ground truth for geometric, radiometric, and multi-domain tasks remains costly and often infeasible. In particular, the lack of accurate depth annotations, controlled illumination variations, and multi-scale paired imagery limits progress in monocular depth estimation, domain adaptation, and super-resolution for aerial scenes. We present SyMTRS, a large-scale synthetic dataset generated using a high-fidelity urban simulation pipeline. The dataset provides high-resolution RGB aerial imagery (2048 x 2048), pixel-perfect depth maps, night-time counterparts for domain adaptation, and aligned low-resolution variants for super-resolution at x2, x4, and x8 scales. Unlike existing remote sensing datasets that focus on a single task or modality, SyMTRS is designed as a unified multi-task benchmark enabling joint research in geometric understanding, cross-domain robustness, and resolution enhancement. We describe the dataset generation process, its statistical properties, and its positioning relative to existing benchmarks. SyMTRS aims to bridge critical gaps in remote sensing research by enabling controlled experiments with perfect geometric ground truth and consistent multi-domain supervision. The results obtained in this work can be reproduced from this Github repository: https://github.com/safouaneelg/SyMTRS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SyMTRS, a large-scale synthetic multi-task dataset for aerial imagery generated via a high-fidelity urban simulation pipeline. It supplies 2048x2048 RGB images paired with pixel-perfect depth maps, night-time domain-shift counterparts, and aligned low-resolution variants at x2/x4/x8 scales for super-resolution, positioning the resource as a unified benchmark for joint work on monocular depth estimation, cross-domain adaptation, and resolution enhancement in remote sensing.
Significance. If the simulation produces imagery whose statistical properties and domain shifts are representative of real aerial data, the dataset would enable controlled multi-task experiments with perfect geometric ground truth that real remote-sensing collections rarely provide. The accompanying GitHub repository for reproduction is a clear strength that supports benchmark adoption.
Major comments (3)
- Abstract and dataset-generation description: the repeated claims of 'pixel-perfect depth maps' and a 'high-fidelity' urban simulation are not accompanied by any quantitative validation (e.g., depth-error histograms against known simulation parameters or comparisons to real LiDAR statistics), leaving the central realism assumption untested.
- Statistical-properties and positioning section: no tables or figures report concrete similarity metrics (FID, depth-distribution KL divergence, or day/night radiometric shift measures) between SyMTRS and existing real or synthetic aerial benchmarks, undermining the claim that the dataset bridges gaps for domain-adaptation and multi-scale research.
- Overall contribution: the manuscript contains no baseline experiments (e.g., depth-estimation or SR transfer results from SyMTRS to a real test set), so the assertion that the resource 'enables joint research' rests solely on the pipeline description rather than demonstrated utility.
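One of the similarity metrics the second major comment asks for can be sketched directly: a KL divergence between binned depth-value distributions of the synthetic and real datasets (FID would additionally require an image feature network). The function name, bin count, and depth range below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def depth_kl(synth_depths, real_depths, bins=64, rng=(0.0, 500.0), eps=1e-12):
    """KL(synthetic || real) between binned depth-value distributions.
    bins and rng (depth range in meters) are illustrative choices."""
    p = np.histogram(synth_depths, bins=bins, range=rng)[0].astype(float)
    q = np.histogram(real_depths, bins=bins, range=rng)[0].astype(float)
    # Smooth with eps so empty bins do not produce infinite divergence.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

synth = np.random.default_rng(0).uniform(0, 500, 100_000)
real = np.random.default_rng(1).uniform(0, 500, 100_000)
print(depth_kl(synth, real))
```

A divergence near zero would indicate that the synthetic depth statistics match the real ones; a large value would quantify exactly the domain gap the referee wants measured.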
Minor comments (2)
- Verify that all figure captions explicitly state image dimensions, scale factors, and whether night-time pairs are aligned at the pixel level.
- Add a short table summarizing key simulation parameters (camera intrinsics, lighting model, urban asset density) to improve reproducibility beyond the GitHub link.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, acknowledging where the manuscript can be strengthened through revision while providing honest clarification on the dataset's design and contribution.
Point-by-point responses
Referee: Abstract and dataset-generation description: the repeated claim of 'pixel-perfect depth maps' and 'high-fidelity' urban simulation is not accompanied by any quantitative validation (e.g., depth-error histograms against known simulation parameters or comparison to real LiDAR statistics), leaving the central realism assumption untested.
Authors: We clarify that 'pixel-perfect' depth is obtained directly from the simulation engine's 3D geometry, yielding exact per-pixel values by construction, without the reconstruction or sensor errors present in real LiDAR. We agree, however, that quantitative validation would strengthen the realism claims. In the revised manuscript we will add depth-error histograms derived from known simulation parameters, together with depth-distribution comparisons against publicly available real aerial LiDAR statistics. Revision: yes.
Referee: Statistical-properties and positioning section: no tables or figures report concrete similarity metrics (FID, depth-distribution KL divergence, or day/night radiometric shift measures) between SyMTRS and existing real or synthetic aerial benchmarks, undermining the claim that the dataset bridges gaps for domain-adaptation and multi-scale research.
Authors: The manuscript contains a dedicated section describing statistical properties and relative positioning, yet we acknowledge the absence of the specific quantitative metrics mentioned. We will incorporate FID scores, depth-distribution KL divergences, and day/night radiometric shift measures, along with the corresponding tables and figures, in the revised version to provide stronger empirical support for the dataset's utility in domain-adaptation and multi-scale tasks. Revision: yes.
Referee: Overall contribution: the manuscript contains no baseline experiments (e.g., depth-estimation or SR transfer results from SyMTRS to a real test set), so the assertion that the resource 'enables joint research' rests solely on the pipeline description rather than demonstrated utility.
Authors: As a dataset-introduction paper, the core contribution lies in the generation pipeline, perfect ground-truth annotations, and public release that together enable controlled multi-task experiments. We nevertheless agree that preliminary baseline results would better illustrate practical utility. In the revision we will add baseline experiments for monocular depth estimation and super-resolution, including limited transfer results from SyMTRS to a real aerial test set. Revision: yes.
Circularity Check
No circularity: dataset artifact with no derivations or fitted predictions
Full rationale
The paper introduces SyMTRS as a synthetic multi-task dataset generated from an urban simulation pipeline. No equations, parameter fitting, predictions, or derivation chains appear in the abstract or described content. The core contribution is the dataset artifact (RGB, depth, night-time, and multi-scale variants) rather than any computed result that could reduce to its own inputs by construction. Self-citations and uniqueness claims are absent from the provided text. This matches the default expectation for non-circular papers and the assigned circularity score of 0.0.