arXiv: 2606.25128 · v1 · pith:TMCV7ZPEnew · submitted 2026-06-23 · 📡 eess.IV · cs.AI · cs.CV · cs.LG

Benchmarking the Alignment of Data-Quality Metrics, Human Judgment and Land-Cover Segmentation Performance for Earth Observation

Pith reviewed 2026-06-25 21:43 UTC · model grok-4.3

classification 📡 eess.IV · cs.AI · cs.CV · cs.LG
keywords synthetic data · data quality metrics · earth observation · semantic segmentation · human evaluation · FID · geospatial data · misalignment

The pith

Automatic metrics for synthetic Earth observation images misalign with human judgment and segmentation performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether common automatic quality metrics such as FID, KID, IS, LPIPS and SSIM align with how humans judge synthetic Earth observation images and with how well those images help train land-cover segmentation models. It applies semantics-preserving changes such as rotation to real and synthetic images and finds that these changes shift the metric scores substantially even though humans still recognize the content the same way. Synthetic images that score badly on the metrics often appear to people as realistic as, or more realistic than, high-scoring ones, and adding them to real training data can raise segmentation accuracy. The authors conclude that metrics built on ImageNet features do not reliably indicate usefulness for geospatial tasks and that evaluation should instead rely on actual task results and human ratings.
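For readers who want the dependence on ImageNet features spelled out, the standard definition of FID is reproduced here as background; it is not quoted from the paper. The symbols are assumptions of this note: \mu_r, \Sigma_r and \mu_g, \Sigma_g denote the mean and covariance of Inception-v3 features (pretrained on ImageNet) computed over the real and generated image sets.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Because both terms are computed purely in that feature space, any transformation that moves the features (a rotation, for instance) can move the score, whether or not a human would notice a change in content.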

Core claim

Our results reveal a stark misalignment: semantics-preserving perturbations such as rotation drastically alter metric scores while leaving human recognition unaffected, and synthetic samples that score poorly on automatic metrics achieve comparable or higher perceived realism, and can improve downstream performance when combined with real data. By benchmarking semantic segmentation models trained on mixed real-synthetic datasets, we demonstrate that quality metrics rooted in ImageNet-pretrained feature spaces are unreliable indicators for geospatial data.

What carries the argument

Benchmarking semantic segmentation models on mixed real-synthetic Earth observation datasets while comparing automatic fidelity metrics to human perception ratings.

If this is right

  • Semantics-preserving perturbations such as rotation change automatic metric scores substantially while leaving human recognition unaffected.
  • Synthetic samples that score poorly on automatic metrics can be rated by humans as comparably or more realistic.
  • Synthetic data that scores poorly on metrics can still improve land-cover segmentation performance when combined with real data.
  • Quality metrics rooted in ImageNet-pretrained feature spaces are unreliable indicators for geospatial data utility.
  • Automatic quality evaluation of synthetic datasets should be grounded in downstream task performance and human evaluation.
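The rotation point in the list above can be illustrated with a toy calculation. The sketch below is an assumption-laden illustration, not the paper's protocol: it uses a simplified single-window (global) SSIM rather than the usual sliding-window version, and the 8x8 intensity ramp is a made-up stand-in for an image tile. The function names `global_ssim` and `rotate90` are hypothetical helpers.

```python
from statistics import fmean


def global_ssim(a, b, dynamic_range=255.0):
    """Simplified SSIM over one global window (not the sliding-window variant)."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    c1 = (0.01 * dynamic_range) ** 2  # usual SSIM stabilising constants
    c2 = (0.03 * dynamic_range) ** 2
    mx, my = fmean(xs), fmean(ys)
    vx = fmean((x - mx) ** 2 for x in xs)
    vy = fmean((y - my) ** 2 for y in ys)
    cov = fmean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


def rotate90(img):
    """Rotate a square grid of pixels by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


# Made-up 8x8 tile: a horizontal intensity ramp (hypothetical data).
size = 8
scene = [[col * 255.0 / (size - 1) for col in range(size)] for _ in range(size)]
rotated = rotate90(scene)

ssim_same = global_ssim(scene, scene)    # identical images
ssim_rot = global_ssim(scene, rotated)   # same pixels, rotated layout

print(f"SSIM(scene, scene)   = {ssim_same:.4f}")
print(f"SSIM(scene, rotated) = {ssim_rot:.4f}")
```

The rotated tile contains exactly the same pixel values, yet the pixel-aligned score collapses because the covariance term no longer lines up. This is only a toy analogue of the effect the paper reports for its own metrics and data.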

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The misalignment may extend to other remote sensing tasks such as object detection or change detection that also rely on spatial structure.
  • Task-specific quality metrics designed around geospatial semantics could replace or supplement general fidelity measures.
  • Data curation pipelines for Earth observation models may need to incorporate routine human evaluation alongside or instead of automatic scores.
  • Similar divergences could appear in other structured imaging domains that differ from the natural-image statistics underlying ImageNet features.

Load-bearing premise

The specific semantics-preserving perturbations tested and the land-cover segmentation task serve as valid proxies for general data utility in Earth observation applications.

What would settle it

If the ranking of mixed real-synthetic training sets by segmentation accuracy exactly matched their ranking by FID or KID scores across multiple perturbations, that would show the metrics are reliable and falsify the misalignment claim.
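One way to make this test concrete is a rank correlation between metric scores and segmentation accuracy. The sketch below is a minimal, hedged illustration: the FID and mIoU values are invented for the example and are not results from the paper, and `average_ranks` and `spearman` are hypothetical helper names. A coefficient of 1 would correspond to the exact ranking agreement described above.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Invented numbers for four mixed training sets (NOT from the paper).
fid = [12.0, 35.0, 48.0, 90.0]   # lower is better
miou = [0.61, 0.64, 0.60, 0.66]  # higher is better

# Negate FID so that "better" points in the same direction for both lists.
rho = spearman([-f for f in fid], miou)
print(f"Spearman rho between FID ranking and mIoU ranking: {rho:.2f}")
```

With these made-up values the coefficient is negative, meaning the metric ranking and the task ranking disagree; a value near 1 across perturbations would support the metrics instead.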

Figures

Figures reproduced from arXiv: 2606.25128 by Alptekin Temizel, Ümit Mert Çağlar.

Figure 1: Visual comparison of image transformations across real images.
Figure 2: Human perception study screens for (a) data augmentation alignment, (b) utility for downstream tasks, (c) conditional generation preference and (d) data realism scores. The initial phase establishes a baseline for human perception by evaluating data augmentation alignment (Fig. 2a), asking participants to determine whether a baseline image and its altered variant depict the identical scene.
Figure 3: Image generation comparison of conditional Stable Diffusion for BELDE-trained (BELDE-CSD), ARAS-trained (ARAS-CSD) and conditional U-Net GAN (ARAS-CUGAN) with the real and conditioning images (segmentation). The third phase investigates conditional generation preferences (Fig. 2c) using the generative samples illustrated in …
Original abstract

Volume and quality of datasets are crucial for deep learning model training, yet they are often constrained by availability and data acquisition costs. Synthetic data augmentation can extend existing datasets with realistic images, and the quality of these images is generally assessed through fidelity metrics such as FID, KID, IS, LPIPS and SSIM that measure structural or distributional similarity. However, such metrics, including the widely used FID, focus on visual fidelity without reflecting downstream utility, and can diverge from human perception under perturbations that are imperceptible to human observers. In this work, we systematically evaluate Earth observation datasets alongside synthetic counterparts generated by deep generative models, comparing automatic metrics against human perception and downstream tasks. Our results reveal a stark misalignment: semantics-preserving perturbations such as rotation drastically alter metric scores while leaving human recognition unaffected, and synthetic samples that score poorly on automatic metrics achieve comparable or higher perceived realism, and can improve downstream performance when combined with real data. By benchmarking semantic segmentation models trained on mixed real-synthetic datasets, we demonstrate that quality metrics rooted in ImageNet-pretrained feature spaces are unreliable indicators for geospatial data. Our findings underscore that automatic quality evaluation of synthetic datasets should be grounded in downstream task performance and human evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript benchmarks automatic data-quality metrics (FID, KID, IS, LPIPS, SSIM) against human perception and land-cover semantic segmentation performance on Earth observation datasets and their synthetic counterparts generated by deep generative models. It reports misalignment under semantics-preserving perturbations such as rotation (which alter metric scores but not human recognition), shows that synthetics with poor metric scores can achieve comparable or higher perceived realism and improve downstream performance when mixed with real data, and concludes that ImageNet-pretrained feature-space metrics are unreliable indicators for geospatial data, advocating evaluation grounded in downstream tasks and human judgment.

Significance. If the empirical results hold under broader conditions, the work is significant for the EO community because it supplies concrete benchmarking evidence that standard generative-model metrics, calibrated on natural images, diverge from utility in geospatial applications. The mixed real-synthetic segmentation experiments offer a practical template for task-aware evaluation and could shift assessment practices away from purely distributional metrics.

major comments (2)
  1. [Conclusion] Conclusion and abstract: the claim that 'quality metrics rooted in ImageNet-pretrained feature spaces are unreliable indicators for geospatial data' is load-bearing on the tested semantics-preserving perturbations and the single land-cover segmentation downstream task serving as representative proxies; the manuscript provides no ablation on other EO tasks (e.g., object detection or change detection) or data regimes to establish that the observed misalignment is not task- or perturbation-specific.
  2. [Results] Results on downstream performance: the assertion that synthetics scoring poorly on automatic metrics can improve segmentation performance when combined with real data lacks reported dataset sizes, number of training runs, statistical significance tests, or error bars, making it impossible to judge whether the reported gains are robust or merely within noise.
minor comments (1)
  1. [Abstract] Abstract: quantitative details (dataset cardinalities, number of human raters, exact metric values) are omitted, which reduces the abstract's utility as a standalone summary.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review. We address the major comments below by qualifying our claims and improving experimental reporting. We believe these revisions will strengthen the manuscript without requiring entirely new experiments.

Point-by-point responses
  1. Referee: [Conclusion] Conclusion and abstract: the claim that 'quality metrics rooted in ImageNet-pretrained feature spaces are unreliable indicators for geospatial data' is load-bearing on the tested semantics-preserving perturbations and the single land-cover segmentation downstream task serving as representative proxies; the manuscript provides no ablation on other EO tasks (e.g., object detection or change detection) or data regimes to establish that the observed misalignment is not task- or perturbation-specific.

    Authors: We agree that the claim as stated in the abstract and conclusion is broader than the specific evidence provided. In the revision we will qualify the language to indicate that the unreliability is shown for the evaluated semantics-preserving perturbations (e.g., rotation) and the land-cover semantic segmentation task. We will also insert a limitations paragraph noting that extension to other EO tasks such as object detection or change detection remains future work. This prevents overgeneralization while preserving the core finding that ImageNet-based metrics can diverge from human judgment and segmentation utility under the tested conditions. revision: yes

  2. Referee: [Results] Results on downstream performance: the assertion that synthetics scoring poorly on automatic metrics can improve segmentation performance when combined with real data lacks reported dataset sizes, number of training runs, statistical significance tests, or error bars, making it impossible to judge whether the reported gains are robust or merely within noise.

    Authors: We accept this criticism. The revised manuscript will report the precise sizes of the real and synthetic training sets used in the mixed-data experiments, the number of independent runs (five runs with distinct random seeds), standard-deviation error bars on all mIoU and accuracy figures, and the results of statistical significance tests (paired t-test or Wilcoxon signed-rank test) against the real-only baseline. These additions will allow readers to evaluate whether the observed improvements exceed experimental variability. revision: yes
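The reporting promised in this response can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: the per-seed mIoU values are invented, `paired_t_statistic` is a hypothetical helper, and only the paired t-test variant is shown. The critical value 2.776 is the two-sided 5% threshold of Student's t with 4 degrees of freedom (five paired runs).

```python
from math import sqrt
from statistics import mean, stdev

# Invented mIoU per random seed (NOT results from the paper).
real_only = [0.612, 0.605, 0.618, 0.609, 0.611]
mixed = [0.624, 0.619, 0.626, 0.617, 0.625]


def paired_t_statistic(baseline, treatment):
    """t statistic of the per-seed differences (treatment minus baseline)."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


t_stat = paired_t_statistic(real_only, mixed)

# Two-sided 5% critical value for Student's t with n - 1 = 4 degrees of freedom.
T_CRIT = 2.776
significant = abs(t_stat) > T_CRIT

print(f"real-only mIoU: {mean(real_only):.3f} ± {stdev(real_only):.3f}")
print(f"mixed mIoU:     {mean(mixed):.3f} ± {stdev(mixed):.3f}")
print(f"paired t = {t_stat:.2f}, significant at 5%: {significant}")
```

Pairing by seed matters here: it compares each mixed-data run against the real-only run with the same seed, so seed-to-seed variation does not mask a consistent gain. With only five runs, the Wilcoxon signed-rank alternative mentioned in the response has very limited resolution, which is one reason to report both the statistic and the raw per-seed values.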

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with no derivations or self-referential claims

Full rationale

The paper performs an empirical comparison of automatic fidelity metrics (FID, KID, etc.), human perception, and downstream land-cover segmentation performance on real and synthetic EO datasets. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The central claims rest on experimental outcomes rather than any reduction of results to inputs by construction. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical study without mathematical models or derivations. No free parameters, axioms, or invented entities are introduced in the abstract.
