pith. machine review for the scientific record.

arXiv: 2602.24047 · v1 · submitted 2026-02-27 · 💻 cs.NI · cs.CR · cs.LG

Unsupervised Baseline Clustering and Incremental Adaptation for IoT Device Traffic Profiling

Pith reviewed 2026-05-15 18:53 UTC · model grok-4.3

classification 💻 cs.NI · cs.CR · cs.LG
keywords IoT device profiling · unsupervised clustering · DBSCAN · BIRCH · traffic analysis · incremental learning · network security

The pith

Density-based clustering best matches ground-truth IoT device labels in unsupervised traffic profiling while incremental methods trade purity for adaptability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a two-stage approach to handle the challenge of identifying and tracking diverse IoT devices whose traffic patterns change over time. It first applies classical unsupervised clustering methods to long-duration network flow data from the Deakin IoT dataset to establish baseline device profiles without using labels. Among the methods tested, DBSCAN stands out by separating a large number of outliers and achieving the highest normalized mutual information score of 0.78 with actual device identities. The second stage then evaluates stream clustering techniques for updating these profiles as new traffic arrives, showing that BIRCH offers fast updates and good separation for a new device but at the cost of some accuracy on previously known devices.

Core claim

Density-based clustering (DBSCAN) isolates a substantial outlier portion of the data and produces the strongest alignment with ground-truth device labels among tested classical methods (NMI 0.78), outperforming centroid-based clustering on cluster purity. For incremental adaptation, BIRCH supports efficient updates (0.13 seconds per update) and forms comparatively coherent clusters for a held-out novel device (purity 0.87), but with limited capture of novel traffic (share 0.72) and a measurable trade-off in known-device accuracy after adaptation (0.71).

What carries the argument

Two-stage pipeline: DBSCAN for baseline density-based clustering on flow features to profile devices, followed by BIRCH for incremental stream-oriented clustering to adapt to evolving traffic.
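
The two stages can be sketched as follows. This is a minimal illustration assuming scikit-learn and synthetic 2-D blobs standing in for flow features. The blob centers, `eps`, `min_samples`, `threshold`, `branching_factor`, and the simulated "novel device" are all assumptions chosen for the example; this page does not report the paper's feature set or hyperparameters.

```python
import time

import numpy as np
from sklearn.cluster import DBSCAN, Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

# Synthetic stand-in for flow features: three "known devices" (assumption).
known_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
X_known, y_known = make_blobs(
    n_samples=300, centers=known_centers, cluster_std=0.5, random_state=0
)

# Stage 1: baseline density-based clustering; label -1 marks outliers.
baseline = DBSCAN(eps=1.0, min_samples=5)  # illustrative values only
baseline_labels = baseline.fit_predict(X_known)
n_baseline_clusters = len(set(baseline_labels) - {-1})
outlier_fraction = float(np.mean(baseline_labels == -1))
baseline_nmi = normalized_mutual_info_score(y_known, baseline_labels)

# Stage 2: incremental clustering with BIRCH, updated as new traffic arrives.
stream_model = Birch(threshold=1.5, branching_factor=50, n_clusters=None)
stream_model.partial_fit(X_known)

# A held-out "novel device" appears later (assumption: a fourth blob).
X_novel, _ = make_blobs(
    n_samples=100, centers=[(10.0, 10.0)], cluster_std=0.5, random_state=1
)
start = time.perf_counter()
stream_model.partial_fit(X_novel)  # update without refitting from scratch
update_seconds = time.perf_counter() - start

# Subcluster ids assigned to known and novel traffic after the update.
known_after = set(stream_model.predict(X_known))
novel_after = set(stream_model.predict(X_novel))

print(f"DBSCAN clusters={n_baseline_clusters}, "
      f"outliers={outlier_fraction:.2f}, NMI={baseline_nmi:.2f}")
print(f"BIRCH update took {update_seconds:.4f}s; novel traffic shares a "
      f"subcluster with known traffic: {bool(known_after & novel_after)}")
```

On real traffic the blobs would overlap and drift, which is where the purity and known-device accuracy trade-offs described above would show up; the synthetic data here is deliberately easy.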

If this is right

  • Static profiling using DBSCAN can achieve high alignment with device identities in fixed datasets.
  • Incremental updates with BIRCH can absorb a new device at a reported cost of about 0.13 seconds per update.
  • Adaptation to novel traffic comes with reduced accuracy on previously profiled devices.
  • Flow features alone can distinguish many devices but leave some traffic as outliers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such methods could reduce reliance on labeled data for IoT security monitoring in dynamic environments.
  • Testing on more varied datasets would reveal whether the observed trade-offs generalize beyond the selected captures.
  • Combining density-based and stream clustering might balance purity and adaptability better than either alone.

Load-bearing premise

That the selected long-duration captures from the Deakin IoT dataset are representative of real-world evolving IoT traffic, and that flow features alone suffice to distinguish device identities across time.

What would settle it

Running the same pipeline on a different IoT dataset with ground-truth labels and checking whether DBSCAN still achieves NMI above 0.7 and whether BIRCH maintains similar purity and update times.

Figures

Figures reproduced from arXiv: 2602.24047 by John D. Hastings, Sean M. Alderman.

Figure 1: t-SNE 2D Cluster Visualizations. NMI and Silhouette Coefficient scores are measured, resulting in inconclusive results to determine the best model by the score alone. Further analysis indicates that NMI carries greater importance for our goals in RQ2 and should be the principal metric while also trying to maximize the Silhouette score. NMI represents an external metric measured with the …
Figure 2: t-SNE 2D Cluster Visualizations. A comparison of cluster visualizations for RQ1 and RQ2 are provided in …
Original abstract

The growth and heterogeneity of IoT devices create security challenges where static identification models can degrade as traffic evolves. This paper presents a two-stage, flow-feature-based pipeline for unsupervised IoT device traffic profiling and incremental model updating, evaluated on selected long-duration captures from the Deakin IoT dataset. For baseline profiling, density-based clustering (DBSCAN) isolates a substantial outlier portion of the data and produces the strongest alignment with ground-truth device labels among tested classical methods (NMI 0.78), outperforming centroid-based clustering on cluster purity. For incremental adaptation, we evaluate stream-oriented clustering approaches and find that BIRCH supports efficient updates (0.13 seconds per update) and forms comparatively coherent clusters for a held-out novel device (purity 0.87), but with limited capture of novel traffic (share 0.72) and a measurable trade-off in known-device accuracy after adaptation (0.71). Overall, the results highlight a practical trade-off between high-purity static profiling and the flexibility of incremental clustering for evolving IoT environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents a two-stage unsupervised pipeline for IoT device traffic profiling using flow features from selected long-duration captures in the Deakin IoT dataset. Baseline profiling applies DBSCAN to isolate outliers and achieve the highest alignment with ground-truth labels (NMI 0.78, outperforming centroid-based methods on purity). Incremental adaptation evaluates stream clustering methods, with BIRCH providing efficient updates (0.13 s per update) and coherent clusters for a held-out novel device (purity 0.87), though at the cost of limited novel traffic capture (share 0.72) and reduced known-device accuracy (0.71) post-adaptation. The work emphasizes practical trade-offs between static high-purity profiling and flexible incremental updates for evolving IoT environments.

Significance. If the empirical results hold under full specification, the paper offers a concrete, reproducible demonstration of density-based clustering for static IoT profiling and BIRCH for low-latency adaptation, with explicit metrics (NMI, purity, update latency) that quantify the accuracy-flexibility trade-off. This could directly inform the design of label-free security systems for heterogeneous, time-varying IoT deployments where static models degrade.

major comments (3)
  1. [Methodology] Methodology section: the exact composition of the flow feature vector (e.g., which packet-size, timing, protocol, or statistical aggregates are extracted) is not defined. Without this, the central claim that DBSCAN yields NMI 0.78 and isolates a substantial outlier portion cannot be verified or reproduced, as the result may depend on particular feature choices.
  2. [Experimental Setup] Experimental Setup and Results sections: the values chosen for DBSCAN's eps and min_samples, and for BIRCH's threshold and branching factor, are not reported, nor is the procedure used to select them. These free parameters directly determine the reported NMI 0.78, purity 0.87, and 0.13 s update time; their omission makes the performance comparison to other classical methods non-reproducible.
  3. [Evaluation] Evaluation section: the selection criteria and representativeness of the 'long-duration captures' from the Deakin dataset are not justified, nor is any statistical significance test provided for the NMI/purity differences. This weakens the claim that the observed trade-offs generalize to real-world evolving IoT traffic.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'selected long-duration captures' should be accompanied by the total number of flows or devices involved to give readers immediate scale.
  2. [Results] Notation: the definitions of 'outlier portion', 'share', and 'known-device accuracy' are used without explicit formulas or pseudocode, reducing clarity when comparing the two stages.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects for improving reproducibility and the strength of our claims. We address each major comment below and will revise the manuscript to incorporate the necessary clarifications and additions.

  1. Referee: [Methodology] Methodology section: the exact composition of the flow feature vector (e.g., which packet-size, timing, protocol, or statistical aggregates are extracted) is not defined. Without this, the central claim that DBSCAN yields NMI 0.78 and isolates a substantial outlier portion cannot be verified or reproduced, as the result may depend on particular feature choices.

    Authors: We agree that a precise definition of the flow feature vector is required for reproducibility. In the revised manuscript, we will expand the Methodology section with a complete enumeration of all extracted features, including packet-size statistics (mean, variance, min/max), timing attributes (inter-arrival times, durations), protocol indicators, and statistical aggregates. This addition will directly support verification of the DBSCAN results, including the NMI of 0.78 and outlier isolation. revision: yes

  2. Referee: [Experimental Setup] Experimental Setup and Results sections: the values chosen for DBSCAN's eps and min_samples, and for BIRCH's threshold and branching factor, are not reported, nor is the procedure used to select them. These free parameters directly determine the reported NMI 0.78, purity 0.87, and 0.13 s update time; their omission makes the performance comparison to other classical methods non-reproducible.

    Authors: We acknowledge the need for full parameter transparency. The revised Experimental Setup section will report the exact values used for DBSCAN (eps and min_samples) and BIRCH (threshold and branching factor), along with the selection procedure (e.g., evaluation via internal validation metrics such as silhouette score on a validation subset). This will enable reproduction of the reported metrics and fair comparison to other methods. revision: yes

  3. Referee: [Evaluation] Evaluation section: the selection criteria and representativeness of the 'long-duration captures' from the Deakin dataset are not justified, nor is any statistical significance test provided for the NMI/purity differences. This weakens the claim that the observed trade-offs generalize to real-world evolving IoT traffic.

    Authors: We appreciate the point on generalizability. We will revise the Evaluation section to explicitly justify the selection of long-duration captures based on their extended temporal span and device diversity, which enable analysis of both static and incremental clustering. We will also add statistical significance tests (e.g., paired t-tests across multiple runs or bootstrap methods) for the NMI and purity differences to better support the observed trade-offs. revision: yes
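
One way the bootstrap option mentioned in the third response could look is sketched below, using only the Python standard library. It resamples flows with replacement and reports a percentile interval for the difference in NMI between two clusterings. The function names, the arithmetic-mean normalisation, the choice to return 0.0 when both labelings are constant, and the toy labels are assumptions for illustration, not the authors' procedure.

```python
import math
import random
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """NMI with arithmetic-mean normalisation (0.0 if both are constant)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(
        (c / n) * math.log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items()
    )
    denom = (entropy(a) + entropy(b)) / 2
    return mi / denom if denom > 0 else 0.0


def bootstrap_nmi_diff(truth, labels_a, labels_b, n_boot=1000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for NMI(truth, A) - NMI(truth, B)."""
    rng = random.Random(seed)
    n = len(truth)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample flows
        t = [truth[i] for i in idx]
        a = [labels_a[i] for i in idx]
        b = [labels_b[i] for i in idx]
        diffs.append(nmi(t, a) - nmi(t, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy data: clustering A makes two mistakes, clustering B merges two devices.
truth = [0] * 20 + [1] * 20 + [2] * 20
labels_a = list(truth)
labels_a[0], labels_a[20] = 1, 2
labels_b = [0] * 20 + [1] * 40

point_diff = nmi(truth, labels_a) - nmi(truth, labels_b)
lo, hi = bootstrap_nmi_diff(truth, labels_a, labels_b)
print(f"NMI difference {point_diff:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would support a claim that one method aligns better with the device labels on this sample; it says nothing about whether the sample itself is representative.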

Circularity Check

0 steps flagged

No circularity: purely empirical clustering results on external dataset

The paper applies standard DBSCAN and other clustering algorithms to flow features from the public Deakin IoT dataset, then reports direct empirical metrics such as NMI 0.78 against ground-truth device labels. No equations, derivations, or fitted parameters are defined inside the paper that later appear as 'predictions' or results by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The evaluation remains a straightforward comparison of classical methods on held-out data, fully independent of any internal reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The pipeline rests on the domain assumption that flow-level statistics are stable enough to separate device identities and on standard clustering hyperparameters that are tuned to the data.

free parameters (2)
  • DBSCAN eps and min_samples
    Core density parameters chosen to isolate outliers and produce the reported NMI; values not stated in abstract.
  • BIRCH threshold and branching factor
    Control incremental cluster formation and update speed; values not stated in abstract.
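
As a rough sketch of how the DBSCAN parameters above might be chosen without labels, the snippet below grid-searches `eps` and `min_samples` and scores each setting by the silhouette of its non-noise points, assuming scikit-learn and synthetic data. The grid values, the helper name `select_dbscan_params`, and the `max_noise` guard are hypothetical; the paper's actual selection procedure is not reported on this page.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def select_dbscan_params(X, eps_grid, min_samples_grid, max_noise=0.3):
    """Return the best (score, eps, min_samples, n_clusters, noise) or None.

    Scores each setting by the silhouette of its non-noise points and skips
    settings that label more than `max_noise` of the data as outliers
    (`max_noise` is a hypothetical guard, not taken from the paper).
    """
    best = None
    for eps in eps_grid:
        for min_samples in min_samples_grid:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            kept = labels != -1
            noise_fraction = 1.0 - float(np.mean(kept))
            n_clusters = len(set(labels[kept]))
            # Silhouette needs at least two clusters among the kept points.
            if n_clusters < 2 or noise_fraction > max_noise:
                continue
            score = float(silhouette_score(X[kept], labels[kept]))
            candidate = (score, eps, min_samples, n_clusters, noise_fraction)
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best


# Synthetic stand-in for scaled flow features (assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[(0.0, 0.0), (8.0, 0.0), (0.0, 8.0)],
    cluster_std=0.5,
    random_state=0,
)
eps_grid = [0.2, 0.5, 1.0, 2.0]
best = select_dbscan_params(X, eps_grid, min_samples_grid=[5, 10])
print(best)
```

Scoring only the non-noise points tends to favor settings that discard hard samples, so some cap on the noise fraction, or a second criterion, is needed for the search to be meaningful.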
axioms (2)
  • domain assumption: Flow features (packet sizes, timings, protocols) are sufficient to distinguish device identities.
    Invoked by the entire flow-feature-based pipeline.
  • domain assumption: Ground-truth device labels in the Deakin dataset are accurate and stable across the capture duration.
    Required for computing NMI and purity scores.
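
For the first assumption, a flow feature vector of the kind discussed here (packet sizes, timings, protocols) could be assembled roughly as below. The feature names, the `(timestamp, size)` input format, and the TCP/UDP indicator encoding are assumptions for illustration, loosely following the categories named in the simulated rebuttal; the paper's actual feature list is not given on this page.

```python
from statistics import mean, pvariance


def flow_features(packets, protocol):
    """Build features from (timestamp_seconds, size_bytes) pairs of one flow.

    Expects at least one packet. All feature names are hypothetical.
    """
    packets = sorted(packets)  # order by timestamp
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "pkt_size_mean": mean(sizes),
        "pkt_size_var": pvariance(sizes),
        "pkt_size_min": min(sizes),
        "pkt_size_max": max(sizes),
        "duration": times[-1] - times[0],
        "iat_mean": mean(gaps) if gaps else 0.0,  # mean inter-arrival time
        "is_tcp": 1 if protocol.lower() == "tcp" else 0,
        "is_udp": 1 if protocol.lower() == "udp" else 0,
    }


example = flow_features([(0.0, 100), (0.5, 200), (1.5, 300)], "TCP")
print(example)
```

Features on very different scales (bytes versus seconds) would normally be standardized before any distance-based clustering, since DBSCAN's `eps` applies one radius across all dimensions.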
