arxiv: 2602.24047 · v1 · submitted 2026-02-27 · 💻 cs.NI · cs.CR · cs.LG

Recognition: 2 theorem links · Lean Theorem

Unsupervised Baseline Clustering and Incremental Adaptation for IoT Device Traffic Profiling

Sean M. Alderman, John D. Hastings

Pith reviewed 2026-05-15 18:53 UTC · model grok-4.3

classification 💻 cs.NI · cs.CR · cs.LG

keywords IoT device profiling · unsupervised clustering · DBSCAN · BIRCH · traffic analysis · incremental learning · network security

The pith

Density-based clustering best matches ground-truth IoT device labels in unsupervised traffic profiling while incremental methods trade purity for adaptability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a two-stage approach to handle the challenge of identifying and tracking diverse IoT devices whose traffic patterns change over time. It first applies classical unsupervised clustering methods to long-duration network flow data from the Deakin IoT dataset to establish baseline device profiles without using labels. Among the methods tested, DBSCAN stands out by separating a large number of outliers and achieving the highest normalized mutual information score of 0.78 with actual device identities. The second stage then evaluates stream clustering techniques for updating these profiles as new traffic arrives, showing that BIRCH offers fast updates and good separation for a new device but at the cost of some accuracy on previously known devices.

Core claim

Density-based clustering (DBSCAN) isolates a substantial outlier portion of the data and produces the strongest alignment with ground-truth device labels among tested classical methods (NMI 0.78), outperforming centroid-based clustering on cluster purity. For incremental adaptation, BIRCH supports efficient updates (0.13 seconds per update) and forms comparatively coherent clusters for a held-out novel device (purity 0.87), but with limited capture of novel traffic (share 0.72) and a measurable trade-off in known-device accuracy after adaptation (0.71).
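To make the headline NMI figure concrete, here is a minimal sketch of how normalized mutual information can be computed from two label sequences. It assumes the arithmetic-mean normalization (the default in scikit-learn's `normalized_mutual_info_score`); the page does not say which normalization the paper uses, and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())

def nmi(true_labels, cluster_labels):
    """Mutual information normalised by the arithmetic mean of the two entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    n_true = Counter(true_labels)
    n_clus = Counter(cluster_labels)
    mi = sum(
        (k / n) * math.log(k * n / (n_true[t] * n_clus[c]))
        for (t, c), k in joint.items()
    )
    mean_h = (entropy(true_labels) + entropy(cluster_labels)) / 2
    # Both labelings constant: treat as perfect agreement (assumed convention).
    return mi / mean_h if mean_h > 0 else 1.0

# Hypothetical toy labels: two devices and three candidate clusterings.
devices = ["cam", "cam", "cam", "plug", "plug", "plug"]
perfect = nmi(devices, [0, 0, 0, 1, 1, 1])    # clusters match devices exactly
partial = nmi(devices, [0, 0, 1, 1, 1, 1])    # one camera flow is misplaced
unrelated = nmi(["cam", "cam", "plug", "plug"], [0, 1, 0, 1])  # no shared information
```

A score near 1 means cluster membership almost determines the device label, and a score near 0 means the clusters carry no information about it. Because NMI needs the ground-truth labels, it is an external metric, unlike the Silhouette Coefficient.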

What carries the argument

Two-stage pipeline: DBSCAN for baseline density-based clustering on flow features to profile devices, followed by BIRCH for incremental stream-oriented clustering to adapt to evolving traffic.
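The two-stage idea can be sketched with scikit-learn. That choice is an assumption, since the page does not say which library the authors used. The data are synthetic Gaussian blobs standing in for flow features, and every parameter value (`eps=0.5`, `min_samples=10`, `threshold=0.3`, `branching_factor=50`) is illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN, Birch
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for flow features: three "devices" as Gaussian blobs
# in a 4-dimensional feature space, plus uniformly scattered noise flows.
centers = np.array([[0, 0, 0, 0], [6, 6, 0, 0], [0, 6, 6, 0]], dtype=float)
known = np.vstack([c + 0.3 * rng.standard_normal((200, 4)) for c in centers])
noise = rng.uniform(-3, 9, size=(30, 4))
X_base = np.vstack([known, noise])

scaler = StandardScaler().fit(X_base)
Xs = scaler.transform(X_base)

# Stage 1: density-based baseline. Label -1 marks flows DBSCAN treats as outliers.
baseline = DBSCAN(eps=0.5, min_samples=10).fit(Xs)
outlier_fraction = float(np.mean(baseline.labels_ == -1))
n_baseline_clusters = len(set(baseline.labels_) - {-1})

# Stage 2: incremental profile. BIRCH is seeded with the non-outlier baseline
# flows, then updated batch by batch as traffic from an unseen device arrives.
stream = Birch(threshold=0.3, branching_factor=50, n_clusters=None)
stream.partial_fit(Xs[baseline.labels_ != -1])
n_before = len(stream.subcluster_centers_)

novel = np.array([6, 0, 6, 6], dtype=float) + 0.3 * rng.standard_normal((200, 4))
novel_scaled = scaler.transform(novel)  # reuse the baseline scaling
for batch in np.array_split(novel_scaled, 4):
    stream.partial_fit(batch)
n_after = len(stream.subcluster_centers_)
novel_assignments = stream.predict(novel_scaled)
```

scikit-learn's DBSCAN has no incremental update method, which is why the sketch hands the non-outlier flows to a separate BIRCH model. Whether the paper seeds BIRCH this way is not stated on this page.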

If this is right

Static profiling using DBSCAN can achieve high alignment with device identities in fixed datasets.
Incremental updates with BIRCH enable handling of new devices in under a second per update.
Adaptation to novel traffic comes with reduced accuracy on previously profiled devices.
Flow features alone can distinguish many devices but leave some traffic as outliers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such methods could reduce reliance on labeled data for IoT security monitoring in dynamic environments.
Testing on more varied datasets would reveal if the observed trade-offs generalize beyond the selected captures.
Combining density-based and stream clustering might balance purity and adaptability better than either alone.

Load-bearing premise

That the selected long-duration captures from the Deakin IoT dataset are representative of real-world evolving IoT traffic, and that flow features alone suffice to distinguish device identities across time.

What would settle it

Running the same pipeline on a different IoT dataset with ground-truth labels and measuring if DBSCAN still achieves NMI above 0.7 and BIRCH maintains similar purity and update times.

Figures

Figures reproduced from arXiv: 2602.24047 by John D. Hastings, Sean M. Alderman.

**Figure 1.** t-SNE 2D Cluster Visualizations. NMI and Silhouette Coefficient scores are measured, resulting in inconclusive results to determine the best model by the score alone. Further analysis indicates that NMI carries greater importance for our goals in RQ2 and should be the principal metric while also trying to maximize the Silhouette score. NMI represents an external metric measured with the …

**Figure 2.** t-SNE 2D Cluster Visualizations. A comparison of cluster visualizations for RQ1 and RQ2 is provided in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png]

read the original abstract

The growth and heterogeneity of IoT devices create security challenges where static identification models can degrade as traffic evolves. This paper presents a two-stage, flow-feature-based pipeline for unsupervised IoT device traffic profiling and incremental model updating, evaluated on selected long-duration captures from the Deakin IoT dataset. For baseline profiling, density-based clustering (DBSCAN) isolates a substantial outlier portion of the data and produces the strongest alignment with ground-truth device labels among tested classical methods (NMI 0.78), outperforming centroid-based clustering on cluster purity. For incremental adaptation, we evaluate stream-oriented clustering approaches and find that BIRCH supports efficient updates (0.13 seconds per update) and forms comparatively coherent clusters for a held-out novel device (purity 0.87), but with limited capture of novel traffic (share 0.72) and a measurable trade-off in known-device accuracy after adaptation (0.71). Overall, the results highlight a practical trade-off between high-purity static profiling and the flexibility of incremental clustering for evolving IoT environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DBSCAN hits NMI 0.78 on Deakin IoT flows with BIRCH handling fast updates, but the paper applies standard clustering without new methods or full feature details.

The main thing here is that DBSCAN on flow features from the selected Deakin captures aligns with ground-truth device labels at NMI 0.78 and outperforms centroid-based options on purity, while BIRCH supports quick incremental updates at 0.13 seconds with 0.87 purity on a novel device. The work shows a clear accuracy trade-off after adaptation, dropping known-device performance to 0.71. This gives a practical look at handling traffic evolution in heterogeneous IoT settings using public data and held-out evaluation.

Referee Report

3 major / 2 minor

Summary. The paper presents a two-stage unsupervised pipeline for IoT device traffic profiling using flow features from selected long-duration captures in the Deakin IoT dataset. Baseline profiling applies DBSCAN to isolate outliers and achieve the highest alignment with ground-truth labels (NMI 0.78, outperforming centroid-based methods on purity). Incremental adaptation evaluates stream clustering methods, with BIRCH providing efficient updates (0.13 s per update) and coherent clusters for a held-out novel device (purity 0.87), though at the cost of limited novel traffic capture (share 0.72) and reduced known-device accuracy (0.71) post-adaptation. The work emphasizes practical trade-offs between static high-purity profiling and flexible incremental updates for evolving IoT environments.

Significance. If the empirical results hold under full specification, the paper offers a concrete, reproducible demonstration of density-based clustering for static IoT profiling and BIRCH for low-latency adaptation, with explicit metrics (NMI, purity, update latency) that quantify the accuracy-flexibility trade-off. This could directly inform the design of label-free security systems for heterogeneous, time-varying IoT deployments where static models degrade.

major comments (3)

[Methodology] Methodology section: the exact composition of the flow feature vector (e.g., which packet-size, timing, protocol, or statistical aggregates are extracted) is not defined. Without this, the central claim that DBSCAN yields NMI 0.78 and isolates a substantial outlier portion cannot be verified or reproduced, as the result may depend on particular feature choices.
[Experimental Setup] Experimental Setup and Results sections: the values chosen for DBSCAN's eps and min_samples, and for BIRCH's threshold and branching factor, are not reported, nor is the procedure used to select them. These free parameters directly determine the reported NMI 0.78, purity 0.87, and 0.13 s update time; their omission makes the performance comparison to other classical methods non-reproducible.
[Evaluation] Evaluation section: the selection criteria and representativeness of the 'long-duration captures' from the Deakin dataset are not justified, nor is any statistical significance test provided for the NMI/purity differences. This weakens the claim that the observed trade-offs generalize to real-world evolving IoT traffic.

minor comments (2)

[Abstract] Abstract: the phrase 'selected long-duration captures' should be accompanied by the total number of flows or devices involved to give readers immediate scale.
[Results] Notation: the definitions of 'outlier portion', 'share', and 'known-device accuracy' are used without explicit formulas or pseudocode, reducing clarity when comparing the two stages.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects for improving reproducibility and the strength of our claims. We address each major comment below and will revise the manuscript to incorporate the necessary clarifications and additions.

Referee: [Methodology] Methodology section: the exact composition of the flow feature vector (e.g., which packet-size, timing, protocol, or statistical aggregates are extracted) is not defined. Without this, the central claim that DBSCAN yields NMI 0.78 and isolates a substantial outlier portion cannot be verified or reproduced, as the result may depend on particular feature choices.

Authors: We agree that a precise definition of the flow feature vector is required for reproducibility. In the revised manuscript, we will expand the Methodology section with a complete enumeration of all extracted features, including packet-size statistics (mean, variance, min/max), timing attributes (inter-arrival times, durations), protocol indicators, and statistical aggregates. This addition will directly support verification of the DBSCAN results, including the NMI of 0.78 and outlier isolation. revision: yes
Referee: [Experimental Setup] Experimental Setup and Results sections: the values chosen for DBSCAN's eps and min_samples, and for BIRCH's threshold and branching factor, are not reported, nor is the procedure used to select them. These free parameters directly determine the reported NMI 0.78, purity 0.87, and 0.13 s update time; their omission makes the performance comparison to other classical methods non-reproducible.

Authors: We acknowledge the need for full parameter transparency. The revised Experimental Setup section will report the exact values used for DBSCAN (eps and min_samples) and BIRCH (threshold and branching factor), along with the selection procedure (e.g., evaluation via internal validation metrics such as silhouette score on a validation subset). This will enable reproduction of the reported metrics and fair comparison to other methods. revision: yes
Referee: [Evaluation] Evaluation section: the selection criteria and representativeness of the 'long-duration captures' from the Deakin dataset are not justified, nor is any statistical significance test provided for the NMI/purity differences. This weakens the claim that the observed trade-offs generalize to real-world evolving IoT traffic.

Authors: We appreciate the point on generalizability. We will revise the Evaluation section to explicitly justify the selection of long-duration captures based on their extended temporal span and device diversity, which enable analysis of both static and incremental clustering. We will also add statistical significance tests (e.g., paired t-tests across multiple runs or bootstrap methods) for the NMI and purity differences to better support the observed trade-offs. revision: yes
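The bootstrap option mentioned in the last response could look roughly like the sketch below, which resamples flows with replacement and reports a 95% percentile interval for the purity difference between two clusterings. Purity is taken here in its common form (each cluster votes for its majority device, and purity is the fraction of flows matching that vote); the page does not give the paper's exact definition. The function names and toy labels are hypothetical.

```python
import random
from collections import Counter

def purity(true_labels, cluster_labels):
    """Fraction of flows that carry the majority device label of their cluster."""
    by_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(true_labels)

def bootstrap_purity_diff(true_labels, labels_a, labels_b, n_boot=2000, seed=0):
    """95% percentile interval for purity(A) - purity(B) under flow resampling."""
    rng = random.Random(seed)
    n = len(true_labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same resample for both methods
        t = [true_labels[i] for i in idx]
        a = [labels_a[i] for i in idx]
        b = [labels_b[i] for i in idx]
        diffs.append(purity(t, a) - purity(t, b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Hypothetical toy labels: method A separates the two devices, method B
# lumps every flow into a single cluster.
truth = ["cam"] * 50 + ["plug"] * 50
method_a = [0] * 50 + [1] * 50
method_b = [0] * 100
low, high = bootstrap_purity_diff(truth, method_a, method_b)
```

If the interval excludes zero, the purity gap is unlikely to be an artifact of which flows happened to be sampled. Resampling individual flows treats them as independent, which may be optimistic for traffic captured from the same device session.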

Circularity Check

0 steps flagged

No circularity: purely empirical clustering results on external dataset

The paper applies standard DBSCAN and other clustering algorithms to flow features from the public Deakin IoT dataset, then reports direct empirical metrics such as NMI 0.78 against ground-truth device labels. No equations, derivations, or fitted parameters are defined inside the paper that later appear as 'predictions' or results by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The evaluation remains a straightforward comparison of classical methods on held-out data, fully independent of any internal reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The pipeline rests on the domain assumption that flow-level statistics are stable enough to separate device identities and on standard clustering hyperparameters that are tuned to the data.

free parameters (2)

DBSCAN eps and min_samples
Core density parameters chosen to isolate outliers and produce the reported NMI; values not stated in abstract.
BIRCH threshold and branching factor
Control incremental cluster formation and update speed; values not stated in abstract.
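Since the values of these parameters are not reported, one plausible way to choose them is a small grid search scored by an internal metric, in the spirit of the silhouette-based procedure the rebuttal mentions. The sketch below does this for DBSCAN only, on synthetic data; the grid values, the 90% coverage floor and the use of scikit-learn are all assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Hypothetical validation subset: three well-separated blobs in 3-D.
centers = np.array([[0, 0, 0], [5, 5, 0], [0, 5, 5]], dtype=float)
X_val = np.vstack([c + 0.4 * rng.standard_normal((150, 3)) for c in centers])

results = []
for eps in (0.2, 0.5, 1.0, 2.0):
    for min_samples in (5, 10, 20):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_val)
        kept = labels != -1                      # drop points flagged as noise
        coverage = float(kept.mean())
        n_clusters = len(set(labels[kept]))
        # Silhouette needs at least two clusters; the coverage floor stops the
        # search from winning by discarding most of the data as outliers.
        if n_clusters < 2 or coverage < 0.9:
            continue
        score = silhouette_score(X_val[kept], labels[kept])
        results.append((score, coverage, eps, min_samples, n_clusters))

best_score, best_coverage, best_eps, best_min_samples, best_k = max(results)
```

The same loop shape would apply to BIRCH's `threshold` and `branching_factor`. Reporting the full `results` table alongside the chosen setting would let readers see how sensitive the reported scores are to these choices.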

axioms (2)

domain assumption Flow features (packet sizes, timings, protocols) are sufficient to distinguish device identities
Invoked by the entire flow-feature-based pipeline.
domain assumption Ground-truth device labels in the Deakin dataset are accurate and stable across the capture duration
Required for computing NMI and purity scores.
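As a rough illustration of the first assumption, a flow can be reduced to a handful of statistics of the kind the rebuttal lists (packet-size mean, variance and min/max, inter-arrival times, duration). The function name, the tuple layout and the feature subset below are hypothetical; the paper's actual feature vector is not reproduced on this page.

```python
from statistics import mean, pvariance

def flow_features(packets):
    """Summarise one flow given (timestamp_seconds, size_bytes, dst_port) tuples."""
    packets = sorted(packets)                      # order by timestamp
    times = [t for t, _, _ in packets]
    sizes = [s for _, s, _ in packets]
    ports = [p for _, _, p in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    return {
        "pkt_count": len(packets),
        "pkt_size_mean": mean(sizes),
        "pkt_size_var": pvariance(sizes),
        "pkt_size_min": min(sizes),
        "pkt_size_max": max(sizes),
        "iat_mean": mean(gaps) if gaps else 0.0,
        "duration": times[-1] - times[0],
        "top_dst_port": max(set(ports), key=ports.count),  # most frequent port
    }

# Hypothetical four-packet flow.
example = flow_features(
    [(0.0, 60, 443), (0.5, 1500, 443), (2.0, 60, 53), (2.5, 1500, 443)]
)
```

Features like these describe how a device talks rather than what it says, so they remain available for encrypted traffic. Whether they stay stable enough over time to keep devices separable is exactly what the assumption above leaves open.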

pith-pipeline@v0.9.0 · 5486 in / 1338 out tokens · 28144 ms · 2026-05-15T18:53:22.521466+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

density-based clustering (DBSCAN) isolates a substantial outlier portion... NMI 0.78... BIRCH supports efficient updates
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

25 numerical features... iat mean, pkt size bin, top dst port

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
