Bridging the Smart City Cybersecurity Data Gap Through AI-Driven Synthetic Dataset Generation
Pith reviewed 2026-06-27 09:16 UTC · model grok-4.3
The pith
A generative AI framework produces synthetic cybersecurity datasets that replicate smart city device behaviors, network interactions, and attack scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a synthetic data generation (SDG) framework built on generative AI models can produce high-fidelity synthetic cybersecurity datasets that replicate realistic device behaviors, network interactions, and cyber-attack scenarios in smart cities. The resulting data are to be evaluated on three axes: conformity to protocol standards, statistical similarity to the original datasets, and utility in common security tools. The stated goal is to advance threat modeling and defense evaluation.
What carries the argument
The AI-based synthetic data generation (SDG) framework, which uses generative AI models to create datasets replicating device behaviors, network interactions, and attack scenarios.
If this is right
- Researchers gain the ability to model smart city threats more effectively using accessible data.
- Defensive techniques can be evaluated more comprehensively across varied attack scenarios.
- Critical smart city infrastructures receive improved protection through better-tested cybersecurity methods.
- Synthetic datasets conform to protocol standards and maintain statistical similarity to real data.
Where Pith is reading between the lines
- If the synthetic data proves effective in tools, it could reduce dependence on privacy-restricted real collections for training detection systems.
- The same generation approach might apply to other data-scarce cyber-physical domains such as industrial control systems.
- A practical next step would be to measure how well models trained solely on the synthetic sets detect novel attack variants in live deployments.
Load-bearing premise
Generative AI models can be trained or prompted to output data that passes statistical similarity checks and proves useful in common security tools.
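The paper does not name a similarity metric, so the following is only a sketch of what one such check could look like. It computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) for a single numeric feature. The choice of metric, the function name `ks_statistic`, and the packet sizes are assumptions made for illustration.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # Both CDFs are step functions, so the gap only changes at observed values.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Hypothetical packet sizes in bytes (not taken from the paper).
real_sizes = [60, 64, 64, 128, 512, 1500]
synthetic_sizes = [60, 64, 70, 128, 500, 1500]

print(round(ks_statistic(real_sizes, synthetic_sizes), 3))  # 0.167
```

A value of 0 means the two empirical distributions coincide and 1 means they do not overlap at all. A per-feature statistic like this says nothing about joint structure across features, so passing it is a necessary check at best, not evidence of fidelity.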
What would settle it
A direct comparison in which a security tool trained on the synthetic data shows markedly lower performance on actual smart city network traces than on the synthetic traces themselves.
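The comparison described above can be sketched as a train-on-synthetic, test-on-real experiment. The example below is a toy under stated assumptions: a nearest-centroid classifier stands in for "a security tool", the two features (packets per second, distinct destination ports) are hypothetical, and all data values are invented for illustration rather than drawn from the paper.

```python
from statistics import mean


def fit_centroids(rows):
    """rows: list of (features, label). Returns {label: centroid vector}."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*group)]
        for label, group in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows):
    hits = sum(predict(centroids, features) == label for features, label in rows)
    return hits / len(rows)


def synthetic_to_real_gap(synth_train, synth_test, real_test):
    """Train on synthetic data only, then score on synthetic and on real traces."""
    model = fit_centroids(synth_train)
    acc_synth = accuracy(model, synth_test)
    acc_real = accuracy(model, real_test)
    return acc_synth, acc_real, acc_synth - acc_real


# Hypothetical (packets_per_second, distinct_dst_ports) rows.
synth_train = [((10, 2), "benign"), ((12, 3), "benign"),
               ((200, 50), "attack"), ((220, 60), "attack")]
synth_test = [((11, 2), "benign"), ((210, 55), "attack")]
real_test = [((15, 3), "benign"), ((150, 30), "benign"),  # busy but benign device
             ((180, 40), "attack"), ((190, 45), "attack")]

print(synthetic_to_real_gap(synth_train, synth_test, real_test))  # (1.0, 0.75, 0.25)
```

In this toy case the model scores perfectly on held-out synthetic rows and misclassifies one real benign device that the synthetic data never represented. A large positive gap of this kind, measured on real smart city traces with a real detection tool, is the outcome that would count against the paper's claim.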
Original abstract
Smart cities rely on interconnected cyber-physical systems that integrate sensors, IoT devices, cloud platforms, and AI-driven services and decision-making. While these systems enhance city services, they also introduce complex cybersecurity challenges due to their large attack surfaces, heterogeneous data flows, and evolving threat vectors. Developing and validating cybersecurity tools for smart cities requires high-quality datasets that accurately represent real operational conditions. However, real-world datasets are often incomplete, contain privacy-sensitive data, are difficult to access, or lack sufficient malicious activity to support tool development. This research addresses this critical gap by proposing an AI-based synthetic data generation (SDG) framework designed specifically for smart city cybersecurity research. The proposed framework leverages generative artificial intelligence models to produce high-fidelity synthetic cybersecurity datasets that replicate realistic device behaviors, network interactions, and cyber-attack scenarios. The synthetic datasets are evaluated for conformity to protocol standards, statistical similarity to original datasets, and utility in common security tools. The resulting synthetic data generation framework and evaluation metrics are expected to advance smart city cybersecurity by enabling researchers to model threats more effectively and evaluate defensive techniques more comprehensively to better protect critical smart city infrastructures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an AI-based synthetic data generation (SDG) framework for smart city cybersecurity research. It claims that generative AI models can be used to produce high-fidelity synthetic datasets replicating realistic device behaviors, network interactions, and cyber-attack scenarios. These datasets are to be evaluated for conformity to protocol standards, statistical similarity to original datasets, and utility in common security tools, with the expectation that the framework will advance threat modeling and defensive technique evaluation in smart city infrastructures.
Significance. If a concrete implementation of the proposed framework were developed and shown to meet the stated evaluation criteria, it could meaningfully address data scarcity and privacy issues in smart city cybersecurity research, enabling broader experimentation with defensive tools. The core idea of using generative models for this purpose aligns with existing needs in the field, though the manuscript provides no evidence that the approach is feasible or novel relative to prior synthetic data work in cybersecurity.
major comments (3)
- [Abstract] The framework is presented only as a high-level proposal with no specification of the generative models (e.g., GAN, VAE, transformer, or diffusion variants), conditioning inputs, architecture, or training procedure. This is load-bearing because the central claim that the SDG pipeline produces high-fidelity data replicating device behaviors and attacks rests entirely on the unstated assumption that such models can be configured to succeed.
- [Abstract] No training corpus, loss functions, or quantitative results are supplied for any of the three evaluation axes (protocol conformity, statistical similarity, downstream utility in security tools). The text uses only forward-looking language ('are expected to', 'will advance') rather than demonstrated outcomes, leaving the load-bearing assumption that the pipeline meets these criteria untested.
- [Abstract] The manuscript contains no derivation, pseudocode, or preliminary validation showing that synthetic data can pass the required checks for smart-city traffic and attacks; without these elements the proposal cannot be assessed for internal consistency or feasibility.
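To make the first evaluation axis concrete, the sketch below shows one possible form of a protocol conformity check on synthetic flow records. It is not the manuscript's method. The record schema (`protocol`, `src_port`, `dst_port`, `ip_length`, `tcp_flags`) and the rule set are assumptions chosen for illustration; the numeric limits follow the IPv4, TCP, and UDP header definitions (16-bit ports, 16-bit total length, minimum headers of 20, 20, and 8 bytes).

```python
VALID_TCP_FLAGS = {"SYN", "ACK", "FIN", "RST", "PSH", "URG", "ECE", "CWR"}
# Minimum IPv4 total length: 20-byte IP header plus TCP (20) or UDP (8) header.
MIN_IP_LENGTH = {"TCP": 40, "UDP": 28}


def violations(record):
    """Return a list of rule violations for one hypothetical flow record."""
    problems = []
    proto = record.get("protocol")
    if proto not in MIN_IP_LENGTH:
        return ["unknown protocol"]
    for field in ("src_port", "dst_port"):
        port = record.get(field)
        if not isinstance(port, int) or not 0 <= port <= 65535:
            problems.append(f"{field} out of range")
    length = record.get("ip_length", 0)
    if not MIN_IP_LENGTH[proto] <= length <= 65535:
        problems.append("ip_length outside valid range")
    flags = set(record.get("tcp_flags", []))
    if proto == "UDP" and flags:
        problems.append("UDP record carries TCP flags")
    if proto == "TCP" and not flags <= VALID_TCP_FLAGS:
        problems.append("unrecognised TCP flag")
    return problems


def conformity_rate(records):
    """Fraction of records with no violations."""
    conforming = sum(1 for record in records if not violations(record))
    return conforming / len(records)


# Hypothetical synthetic records; 1883 and 5683 are the default MQTT and CoAP ports.
records = [
    {"protocol": "TCP", "src_port": 51000, "dst_port": 1883, "ip_length": 60, "tcp_flags": ["SYN"]},
    {"protocol": "UDP", "src_port": 40000, "dst_port": 5683, "ip_length": 48},
    {"protocol": "UDP", "src_port": 40001, "dst_port": 5683, "ip_length": 48, "tcp_flags": ["SYN"]},
    {"protocol": "TCP", "src_port": 51001, "dst_port": 70000, "ip_length": 60, "tcp_flags": ["ACK"]},
]

print(conformity_rate(records))  # 0.5
```

Field-level rules like these catch only syntactic errors. Stateful properties, such as a valid TCP handshake ordering across packets, would need a separate check and are outside this sketch.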
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback on our manuscript. We agree that the work is presented as a high-level conceptual proposal and will use the comments to strengthen the description of the framework.
Point-by-point responses
- Referee: [Abstract] The framework is presented only as a high-level proposal with no specification of the generative models (e.g., GAN, VAE, transformer, or diffusion variants), conditioning inputs, architecture, or training procedure. This is load-bearing because the central claim that the SDG pipeline produces high-fidelity data replicating device behaviors and attacks rests entirely on the unstated assumption that such models can be configured to succeed.
  Authors: We agree that the manuscript describes the framework at a conceptual level without specifying particular generative models, conditioning inputs, or training procedures. The current version focuses on the overall pipeline and evaluation strategy rather than implementation details. We will revise to include example model selections (e.g., conditional GANs or diffusion models for traffic and attack generation), conditioning on device types and protocols, and high-level architecture and training considerations.
  Revision: yes
- Referee: [Abstract] No training corpus, loss functions, or quantitative results are supplied for any of the three evaluation axes (protocol conformity, statistical similarity, downstream utility in security tools). The text uses only forward-looking language ('are expected to', 'will advance') rather than demonstrated outcomes, leaving the load-bearing assumption that the pipeline meets these criteria untested.
  Authors: The referee is correct that no specific training corpora, loss functions, or quantitative results are provided. As this is a framework proposal rather than an empirical study, the manuscript does not include implemented results. In a revision we will specify example training sources (public IoT and smart-city datasets), suggest loss functions aligned with the three evaluation axes, and clarify that empirical outcomes are intended as future work.
  Revision: yes
- Referee: [Abstract] The manuscript contains no derivation, pseudocode, or preliminary validation showing that synthetic data can pass the required checks for smart-city traffic and attacks; without these elements the proposal cannot be assessed for internal consistency or feasibility.
  Authors: We acknowledge the absence of pseudocode, derivations, or preliminary validation. The manuscript is a high-level proposal, so these elements were not included. We will revise to add high-level pseudocode for the SDG pipeline and discuss feasibility based on related synthetic data literature for cybersecurity, while noting that concrete validation requires implementation.
  Revision: yes
- The manuscript contains no implementation or empirical results, so actual quantitative outcomes, trained models, or full validation data cannot be supplied in response to requests for demonstrated performance.
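The first response above mentions conditioning generation on device types and protocols. As a toy illustration of what that interface could look like, the sketch below resamples observed values per (device type, protocol) pair. This is a stand-in for a trained conditional generative model, not the authors' design: the class name `ConditionalResampler`, the row layout, and the sample values are all hypothetical.

```python
import random
from collections import defaultdict


class ConditionalResampler:
    """Toy conditional generator: resamples observed payload sizes per condition."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._pool = defaultdict(list)

    def fit(self, rows):
        """rows: iterable of (device_type, protocol, payload_bytes)."""
        for device_type, protocol, payload_bytes in rows:
            self._pool[(device_type, protocol)].append(payload_bytes)
        return self

    def sample(self, device_type, protocol, n):
        """Draw n values for the requested condition, with replacement."""
        pool = self._pool.get((device_type, protocol))
        if not pool:
            raise KeyError(f"no training rows for {(device_type, protocol)}")
        return [self._rng.choice(pool) for _ in range(n)]


# Hypothetical training rows (not taken from any dataset cited by the paper).
training_rows = [
    ("thermostat", "MQTT", 42),
    ("thermostat", "MQTT", 48),
    ("camera", "RTSP", 1200),
    ("camera", "RTSP", 1400),
]

generator = ConditionalResampler(seed=1).fit(training_rows)
print(generator.sample("camera", "RTSP", 3))
```

Resampling copies real values verbatim, so it offers no privacy protection and cannot produce unseen behavior. A learned conditional model (the response names conditional GANs and diffusion models as candidates) would replace the `sample` body while keeping the same condition-in, records-out interface.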
Circularity Check
No circularity: high-level proposal with no derivations or self-referential claims
Full rationale
The manuscript is a framework proposal that states an intent to use generative AI for synthetic datasets and lists evaluation criteria (protocol conformity, statistical similarity, utility) but supplies no equations, model architectures, training procedures, fitted parameters, or derivations. No self-citations appear in the provided text, and no step reduces a claimed result to its own inputs by construction. The central claim remains an unformalized expectation rather than a derived prediction, so no circularity patterns apply.