arxiv: 2605.19624 · v2 · pith:AHJZUR2Ynew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Component-Aware Structure-Preserving Style Transfer for Satellite Visual Sim2Real Data Construction

Pith reviewed 2026-05-21 07:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords style transfer · Sim2Real · satellite imagery · image translation · component-aware · pose estimation · data generation · visual sensing

The pith

Component-aware style transfer produces satellite images that match real sensor appearance while retaining exact simulation geometry labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a style transfer framework to turn synthetic satellite renderings into images that look like they came from a real camera, for use in training vision systems. It creates weakly paired real and synthetic samples using camera calibration, ArUco markers for pose, CAD models, and component masks. Real-domain style is then extracted per component from unlabeled real photos and applied to matching regions in the synthetic images through mask-aligned modulation. Adversarial training plus local contrastive consistency, self-regularization, and edge-preserving terms keep the output usable for downstream tasks. Experiments on 5000 rendered images and 100 real captures show lower distribution discrepancy than baselines and higher accuracy for a pose estimator trained on the results.

Core claim

The method builds weakly paired real-synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. Adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints to keep the generated images usable for downstream supervision. On 5000 rendered images and 100 real images, it achieves an FID of 54.32 and a KID of 0.048, and raises the GDRNet ADD pass rate at the 0.02 m threshold to 0.260 and the AUC to 0.611, against 0.182 and 0.513 for the strongest baseline, CUT.

What carries the argument

mask-aligned modulation that injects part-wise real-domain style codes extracted from unlabeled real images into corresponding synthetic satellite regions
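
A minimal NumPy sketch of what mask-aligned modulation could look like, assuming a SEAN-like design in which each component gets one style vector and every pixel is scaled and shifted by the vector of its own component. The function names, the masked average pooling used for style extraction, and the projection matrices `w_gamma` and `w_beta` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def extract_style_codes(real_feat, real_mask, num_parts):
    """Masked average pooling: one style vector per component.

    real_feat: (H, W, C) feature map of a real image (assumed given).
    real_mask: (H, W) integer component labels in [0, num_parts).
    Returns a (num_parts, C) style matrix; absent parts stay at zero.
    """
    codes = np.zeros((num_parts, real_feat.shape[-1]), dtype=np.float64)
    for k in range(num_parts):
        region = real_mask == k
        if region.any():
            codes[k] = real_feat[region].mean(axis=0)
    return codes

def mask_aligned_modulation(syn_feat, syn_mask, codes, w_gamma, w_beta, eps=1e-5):
    """Normalize synthetic features, then scale and shift each pixel with
    parameters derived from the style code of that pixel's component."""
    mean = syn_feat.mean(axis=(0, 1), keepdims=True)
    std = syn_feat.std(axis=(0, 1), keepdims=True)
    normed = (syn_feat - mean) / (std + eps)
    # (num_parts, C) tables, looked up per pixel through the label map.
    gamma = (codes @ w_gamma)[syn_mask]
    beta = (codes @ w_beta)[syn_mask]
    return normed * (1.0 + gamma) + beta
```

Because the lookup goes through the synthetic mask, changing the style code of one component only changes pixels carrying that label, which is the property the component-level claim depends on.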

If this is right

  • The translated images achieve lower FID and KID scores than representative image-translation baselines.
  • Training GDRNet only on the translated synthetic images raises the ADD pass rate at the 0.02 m threshold to 0.260 and the AUC to 0.611 in the target domain, against 0.182 and 0.513 for the strongest baseline, CUT.
  • Component-level transfer preserves geometric annotations better than global image translation methods.
  • The added local contrastive consistency and edge-preserving constraints maintain structural fidelity needed for sensor-data supervision.
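
The two structure-preserving terms named in the last bullet can be sketched roughly as below. The paper's exact formulations are not given on this page, so both functions are assumptions: the edge term is taken to be an L1 distance between Sobel gradient magnitudes of the synthetic input and the translated output, and the local contrastive term is taken to be a PatchNCE-style InfoNCE loss over patch features.

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude of a 2-D image with 3x3 Sobel filters (valid region)."""
    g = np.asarray(gray, dtype=np.float64)
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[1:-1, :-2] - g[2:, :-2])
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]
          - g[:-2, :-2] - 2 * g[:-2, 1:-1] - g[:-2, 2:])
    return np.sqrt(gx ** 2 + gy ** 2)

def edge_consistency_loss(synthetic, translated):
    """L1 distance between edge maps; zero when edges coincide exactly."""
    diff = sobel_magnitude(synthetic) - sobel_magnitude(translated)
    return float(np.abs(diff).mean())

def patch_nce_loss(query, keys, tau=0.07):
    """InfoNCE over N patch features of shape (N, D).

    query[i] comes from the translated image, keys[i] from the same
    location in the input; the other N-1 keys act as negatives.
    """
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_prob).mean())
```

Under these assumptions the edge term ignores a uniform brightness change but penalizes a displaced boundary, so appearance can shift while silhouettes stay in place.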

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-component modulation could be applied to other rigid objects with distinct surface types, such as vehicles or aircraft, if accurate masks and weak pairing are available.
  • Removing the need for ArUco markers by substituting estimated poses would test whether the method still works outside a calibrated lab setup.
  • The component masks produced as a byproduct could support joint training of detection or segmentation models alongside pose estimation.

Load-bearing premise

The calibrated real acquisition, ArUco-based pose measurement, CAD rendering, and component masks produce sufficiently accurate weakly paired samples that allow mask-aligned modulation to transfer style without distorting the geometric annotations.

What would settle it

Running the downstream GDRNet pose estimator on the translated images and finding no improvement or a drop in ADD pass rate and AUC relative to training on raw synthetic images would show the component-level transfer does not help annotation-preserving Sim2Real generation.
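
For reference, the comparison described above rests on the ADD metric. The sketch below shows the standard definition (mean distance between model points under the ground-truth and predicted poses), a pass rate at a fixed threshold, and an area under the accuracy-threshold curve. The `max_threshold` of 0.1 m and the number of steps are assumptions; the page only states the 0.02 m pass-rate threshold.

```python
import numpy as np

def add_error(points, R_gt, t_gt, R_pred, t_pred):
    """Average distance between model points transformed by the two poses."""
    p_gt = points @ R_gt.T + t_gt
    p_pred = points @ R_pred.T + t_pred
    return float(np.linalg.norm(p_gt - p_pred, axis=1).mean())

def pass_rate(errors, threshold):
    """Fraction of test images whose ADD error is below the threshold."""
    return float((np.asarray(errors) < threshold).mean())

def add_auc(errors, max_threshold=0.1, steps=1000):
    """Normalized area under the pass-rate curve (max_threshold is assumed)."""
    thresholds = np.linspace(0.0, max_threshold, steps)
    acc = np.array([pass_rate(errors, t) for t in thresholds])
    area = np.sum((acc[1:] + acc[:-1]) * 0.5 * np.diff(thresholds))
    return float(area / max_threshold)
```

Running `pass_rate(errors, 0.02)` and `add_auc(errors)` on the same real test set for a model trained on raw synthetic images and one trained on translated images gives the two numbers the test compares.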

Figures

Figures reproduced from arXiv: 2605.19624 by Baoshi Cao, Yang Liu, Yifan Yang, Yonglong Zhang, Zongwu Xie.

Figure 1
Figure 1. Illustration of the proposed component-level Sim2Real style transfer framework. Weakly paired real–synthetic samples and component masks are used to train a component-level style transfer network, which translates synthetic satellite images into realistic images while preserving structural annotations.
… between simulation and acquisition, so a model trained only on synthetic data may learn appearance cues …
Figure 2
Figure 2. Overview of the proposed framework. (a) Real satellite images and component masks are acquired in a calibrated environment. (b) Weakly paired real–synthetic samples are generated using ArUco-based camera-pose measurement, CAD-based rendering, and view consistency filtering. (c) TransNet, the proposed component-level style transfer network, extracts part-wise real style codes and synthesizes structure-prese…
Figure 3
Figure 3. Detailed architecture of the proposed component-level structure-preserving style transfer network. (a) Generation pipeline, where real component styles are encoded into a part-wise style matrix and injected into the synthetic image through a SEAN-based generator. (b) Training objectives, including adversarial supervision and structure-preserving losses.
… target, valid samples should satisfy n_i = n̄, where n…
Figure 4
Figure 4. Component-wise style transfer visualization. Each intermediate result transfers the style of one selected satellite component while keeping the remaining components unchanged.
Figure 5
Figure 5. Foreground-only style transfer during inference. TransNet translates the satellite foreground conditioned on the synthetic image and component mask, while the original background is preserved through mask-based composition.
… PatchNCE, self-regularization, and edge-consistency losses during training, the translated images preserve the object silhouette, component layout, and structural boundaries. For evalu…
Figure 7
Figure 7. Qualitative comparison of different Sim2Real translation methods. CycleGAN, CUT, and DRIT perform full-image translation and may modify both the satellite and background. SEAN uses semantic masks for region-adaptive translation. Our method better preserves the original background, satellite geometry, and component layout while improving real-domain appearance.
… with the original synthetic background as desc…
Figure 8
Figure 8 shows the average distance threshold curves, and Tab. II reports the ADD pass rate at the 0.02 m threshold and the AUC score. Our method achieves the highest ADD pass rate of 0.260 and the highest AUC of 0.611. Compared with the strongest baseline CUT, our method improves the ADD pass rate from 0.182 to 0.260 and the AUC from 0.513 to 0.611. These results demonstrate that the proposed component-aware and s…
Figure 9
Figure 9. Visual ablation study of the proposed framework. Each column shows the result of removing one component from the full model. The full model better preserves satellite geometry, component layout, and structural boundaries while transferring realistic appearance.
… and reducing Mask IoU from 0.95 to 0.82. This confirms that component-level semantic guidance is important for both realistic appearance transfer …
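
The mask-based composition mentioned in the Figure 5 caption can be illustrated with a short NumPy sketch. It assumes the component mask is an integer label map in which one label marks the background; the choice of `0` for that label and the function name are hypothetical.

```python
import numpy as np

def compose_foreground(translated, synthetic, component_mask, background_label=0):
    """Keep translated pixels on the satellite, original pixels elsewhere.

    translated, synthetic: (H, W, 3) images of the same size.
    component_mask: (H, W) integer labels; background_label is assumed.
    """
    foreground = (component_mask != background_label)[..., None]
    return np.where(foreground, translated, synthetic)
```

With this composition the background of the output is identical to the synthetic render by construction, so any change a translation network makes outside the satellite is discarded.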
Original abstract

For camera-based satellite visual sensing, Sim2Real data construction requires images that approach real-domain sensor appearance while retaining the annotations inherited from simulation. Real sensor images of satellite targets with reliable pose labels and component-level masks are difficult to acquire at scale, whereas synthetic rendering provides exact geometric annotations but suffers from a visible appearance gap. This paper presents a component-aware structure-preserving style transfer framework for satellite visual synthetic-to-real data construction. The method builds weakly paired real–synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream sensor-data supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve annotation-preserving satellite visual Sim2Real data generation in the considered calibrated setup.
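
As background for the KID figure quoted in the abstract, the sketch below computes the usual unbiased squared-MMD estimate with the cubic polynomial kernel on two sets of feature vectors. It assumes the features (normally Inception activations) are already extracted, and it omits the averaging over random subsets that common KID implementations add; the function names are illustrative.

```python
import numpy as np

def poly_kernel(x, y):
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3 used by KID."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    """Unbiased MMD^2 estimate between two (N, D) feature sets."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = poly_kernel(real_feats, real_feats)
    k_ff = poly_kernel(fake_feats, fake_feats)
    k_rf = poly_kernel(real_feats, fake_feats)
    # Exclude the diagonal so the within-set terms are unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return float(term_rr + term_ff - 2.0 * k_rf.mean())
```

Because the estimator is unbiased, it can come out slightly negative when the two sets are drawn from the same distribution.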

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a component-aware structure-preserving style transfer method for satellite Sim2Real data construction. It builds weakly paired real-synthetic samples via calibrated acquisition, ArUco pose measurement, CAD rendering, and component masks; extracts part-wise real style codes; and injects them into synthetic regions using mask-aligned modulation. Adversarial training is augmented with local contrastive consistency, self-regularization, and edge-preserving constraints. On 5000 rendered images and 100 real images, the method reports the lowest FID (54.32) and KID (0.048) versus baselines, and when used to train GDRNet yields ADD pass rate 0.260 and AUC 0.611.

Significance. If the alignment and preservation claims hold, the work provides a targeted approach to annotation-preserving domain adaptation for satellite visual sensing, where real labeled data is scarce. The component-level modulation combined with multiple structure-preserving losses addresses a practical gap between synthetic geometric fidelity and real sensor appearance, with concrete downstream gains on pose estimation.

major comments (1)
  1. [Data construction pipeline and experimental setup] The central claim requires that the calibrated real acquisition, ArUco-based pose measurement, CAD rendering, and component masks produce sufficiently accurate weakly paired samples for mask-aligned modulation to transfer style without distorting geometric annotations. However, no reprojection error, mask-boundary IoU, or alignment statistics are reported for the 100 real images (see data construction pipeline and experimental setup). Because downstream GDRNet evaluation re-uses the same ArUco-derived poses for both training labels and test labels, systematic misalignment would remain invisible to the reported FID/KID and ADD/AUC metrics yet would undermine the annotation-preservation guarantee.
minor comments (2)
  1. [Experiments and results] The results section presents FID and KID values and GDRNet metrics without error bars, statistical significance tests, or details on baseline hyperparameter tuning and implementation, which would strengthen the comparative claims.
  2. [Abstract] The abstract refers to 'representative image-translation baselines' without naming them; explicit identification would improve reproducibility and context.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. The major comment raises a valid point about the need for quantitative alignment validation in the data construction pipeline. We address it point-by-point below and commit to revisions that strengthen the manuscript without altering the core claims or results.

Point-by-point responses
  1. Referee: The central claim requires that the calibrated real acquisition, ArUco-based pose measurement, CAD rendering, and component masks produce sufficiently accurate weakly paired samples for mask-aligned modulation to transfer style without distorting geometric annotations. However, no reprojection error, mask-boundary IoU, or alignment statistics are reported for the 100 real images (see data construction pipeline and experimental setup). Because downstream GDRNet evaluation re-uses the same ArUco-derived poses for both training labels and test labels, systematic misalignment would remain invisible to the reported FID/KID and ADD/AUC metrics yet would undermine the annotation-preservation guarantee.

    Authors: We agree that explicit quantitative alignment statistics were not provided in the original submission and that this omission weakens the support for the annotation-preservation guarantee. The pipeline uses a calibrated acquisition rig with ArUco markers for 6-DoF pose recovery and projects CAD-derived component masks onto the real images; these steps are standard for controlled satellite capture but benefit from reported error metrics. In the revised manuscript we will add a dedicated paragraph and table in the experimental setup section reporting: (i) mean and std reprojection error of ArUco corner detections on the 100 real images (expected <1 px given the calibration), (ii) average boundary IoU between projected CAD component masks and manually delineated real boundaries, and (iii) qualitative examples of mask overlay. These additions directly address the referee’s request. On the potential invisibility of misalignment: while training labels and test GT both originate from the same ArUco system, the test images are real captures whose poses are measured independently of the synthetic rendering; any residual systematic bias would affect absolute pose numbers equally but would not mask the relative benefit of improved appearance matching. The observed GDRNet gains (ADD 0.260, AUC 0.611) under our translated data versus baselines therefore provide indirect evidence that mask-aligned style injection preserved usable geometry. We will also clarify in the text that the evaluation protocol uses real-image ArUco poses as GT and does not recycle synthetic poses for testing. revision: yes
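
    The alignment statistics promised in this response could be computed along the following lines. This is a sketch under stated assumptions: an ideal pinhole camera with no lens distortion, known intrinsics `K` and marker pose `(R, t)`, and plain mask IoU in place of a boundary-restricted IoU. The function names are illustrative and do not come from the paper.

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project (N, 3) world points to pixels with a pinhole model."""
    cam = points_3d @ R.T + t      # world frame to camera frame
    uv = cam @ K.T                 # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]  # perspective division

def reprojection_error(points_3d, detected_2d, K, R, t):
    """Mean and standard deviation of the pixel distance between
    projected marker corners and their detected image positions."""
    residual = project(points_3d, K, R, t) - detected_2d
    dist = np.linalg.norm(residual, axis=1)
    return float(dist.mean()), float(dist.std())

def mask_iou(mask_a, mask_b):
    """Intersection over union of two boolean masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0
```

    Reporting these per image for the 100 real captures would show directly whether the sub-pixel expectation stated above holds.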

Circularity Check

0 steps flagged

No circularity: empirical method and results do not reduce to self-definition or fitted inputs.

full rationale

The paper describes a component-aware style transfer pipeline that builds weakly-paired samples via calibrated acquisition and ArUco poses, then applies mask-aligned modulation plus adversarial, contrastive, and edge-preserving losses. All reported outcomes (FID 54.32, KID 0.048, GDRNet ADD 0.260 / AUC 0.611) are obtained by running the trained model on held-out real images and measuring distribution distance plus downstream task metrics. No equation or claim is shown to be equivalent to its own inputs by construction, no parameter is fitted on a subset and then re-labeled as a prediction, and no load-bearing premise rests on a self-citation chain. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions from style transfer and adversarial training literature; the abstract does not introduce new free parameters, axioms, or invented entities beyond typical model hyperparameters.

axioms (1)
  • domain assumption: Weakly paired real-synthetic samples obtained via calibrated acquisition, ArUco markers, CAD rendering, and component masks are sufficiently aligned for part-wise style transfer.
    Stated in the method description as the basis for building samples and performing mask-aligned modulation.

pith-pipeline@v0.9.0 · 5838 in / 1306 out tokens · 52537 ms · 2026-05-21T07:41:52.031747+00:00 · methodology


