pith. machine review for the scientific record.

arxiv: 2604.08800 · v1 · submitted 2026-04-09 · 💻 cs.CR · cs.LG

Recognition: unknown

Tracing the Chain: Deep Learning for Stepping-Stone Intrusion Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:47 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords stepping-stone intrusion detection · deep learning · flow correlation · network security · transformer models · synthetic data · intrusion detection · tunneling protocols

The pith

A deep learning model called ESPRESSO detects stepping-stone intrusions by correlating network flows, achieving a true positive rate above 99 percent at very low false positive rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that deep learning can reliably detect stepping-stone intrusions, in which attackers route sessions through chains of compromised hosts to hide their origin. This matters because classical statistical methods cannot achieve the extremely low false positive rates needed for operational network defense. ESPRESSO combines a transformer-based feature extractor, time-aligned multi-channel interval features, and online triplet metric learning to correlate incoming and outgoing flows at relay hosts. The authors built a synthetic data generator covering five tunneling protocols and show that the model outperforms the prior DeepCoFFEA baseline in both host-mode and network-mode settings. They also demonstrate chain-length prediction for spotting malicious pivoting and identify timing perturbations as the main vulnerability.

Core claim

ESPRESSO substantially outperforms the state-of-the-art DeepCoFFEA baseline across all five protocols and both host-mode and network-mode detection scenarios, achieving a true positive rate exceeding 0.99 at a false positive rate of 10^{-3} for standard bursty protocols in network-mode.
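Concretely, "TPR at an FPR of 10^{-3}" is read off by thresholding pairwise correlation scores at the matching quantile of the uncorrelated-pair scores. A minimal sketch of that computation; the Gaussian score distributions are invented for illustration and are not the paper's data:

```python
import numpy as np

def tpr_at_fpr(pos_scores, neg_scores, target_fpr=1e-3):
    """TPR at a fixed FPR: pick the decision threshold so that at most a
    target_fpr fraction of negative (uncorrelated) pairs score above it,
    then measure the fraction of positive (correlated) pairs that do."""
    neg = np.asarray(neg_scores)
    # Threshold = (1 - target_fpr) quantile of the negative score distribution.
    thresh = np.quantile(neg, 1.0 - target_fpr)
    return float(np.mean(np.asarray(pos_scores) > thresh))

# Toy example with well-separated score distributions (purely illustrative).
rng = np.random.default_rng(0)
pos = rng.normal(6.0, 1.0, 10_000)    # correlated flow pairs
neg = rng.normal(0.0, 1.0, 100_000)   # uncorrelated pairs
print(tpr_at_fpr(pos, neg, 1e-3))
```

With overlapping distributions the same function shows why the 10^{-3} operating point is so demanding: the threshold sits deep in the tail of the negative scores.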

What carries the argument

ESPRESSO, a flow correlation model that combines transformer-based feature extraction, time-aligned multi-channel interval features, and online triplet metric learning to match incoming and outgoing network flows.
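The third ingredient, online triplet metric learning, can be sketched in NumPy. This toy mirrors the easy/hard/semi-hard negative taxonomy of Figure 3 and uses the margin of 1.0 quoted in the simulated rebuttal; the distance metric and batch handling are assumptions, not the paper's implementation:

```python
import numpy as np

def semi_hard_triplet_loss(anchors, positives, negatives, margin=1.0):
    """Online semi-hard mining sketch: for each (anchor, positive) embedding
    pair, prefer a negative that is farther from the anchor than the positive
    but still inside the margin band (semi-hard); fall back to the hardest
    (closest) negative if none exists. Then apply the triplet hinge loss."""
    losses = []
    for a, p in zip(anchors, positives):
        d_ap = np.linalg.norm(a - p)                          # anchor-positive distance
        d_an = np.linalg.norm(negatives - a, axis=1)          # anchor-negative distances
        band = d_an[(d_an > d_ap) & (d_an < d_ap + margin)]   # semi-hard band
        d_neg = band.min() if band.size else d_an.min()       # else hardest negative
        losses.append(max(d_ap - d_neg + margin, 0.0))        # hinge: push negatives out
    return float(np.mean(losses))
```

In actual training the negatives would be the other flow embeddings in the mini-batch, which is what makes the mining "online".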

If this is right

  • Reliable detection at the low false positive rates required for operational use becomes feasible.
  • Chain length prediction offers a way to distinguish malicious pivoting from benign activity.
  • Timing-based perturbations are revealed as the primary vulnerability, pointing to where robustness improvements are needed.
  • The method works across SSH, SOCAT, ICMP, DNS, and mixed multi-protocol chains in both host and network modes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The synthetic data generator could serve as a public benchmark for testing other flow correlation techniques.
  • Incorporating adversarial training against timing changes might strengthen similar deep learning detectors in the future.
  • This line of work could extend to analyzing traffic in anonymity systems where flow correlation is also a core challenge.
  • Real-time deployment of such models would require addressing computational costs of transformer inference on high-speed links.

Load-bearing premise

The synthetic data collection tool generates traffic whose statistical and timing properties sufficiently match those of real-world stepping-stone intrusions.

What would settle it

Evaluating ESPRESSO on a dataset of real captured stepping-stone flows from actual compromised hosts would determine whether the reported detection rates hold beyond the synthetic examples.

Figures

Figures reproduced from arXiv: 2604.08800 by Matthew Wright, Nate Mathews, Nicholas Hopper.

Figure 1. The standard stepping-stone intrusion detection model.
Figure 2. Convolutional projection during transformer self-attention.
Figure 3. Triplet learning margin with easy, hard, and semi-hard negatives.
Figure 4. Architecture comparison between DCF and ESPRESSO. ESPRESSO generates a global feature sequence first and then applies windowing, whereas DCF performs windowing before feature extraction.
Figure 5. Diagram of synthetic stepping-stone data collection.
Figure 6. Illustration of the network traffic correlation scenarios.
Figure 7. ROC curves presenting correlation efficacy of (a) …
Figure 8. ROC curves presenting correlation efficacy of (a) …
Figure 10. ROC curves demonstrating correlation performance on …
Figure 11. Comparison of window feature similarity of ESPRESSO trained with different loss strategies on one randomly chosen …
Figure 12. ROC comparison of ESPRESSO trained using single …
Figure 13. ROC results for additional benchmarks on the Multi …
Figure 14. Full chain correlation accuracy as a function of FPR.
Figure 15. ROC curves for (a) ESPRESSO and (b) DCF on …
Figure 16. ROC curves for (a) ESPRESSO and (b) DCF on the …
Figure 18. ROC curves for (a) ESPRESSO and (b) DCF on …
Figure 19. ROC curves comparing ESPRESSO (online mining), …
Figure 20. ROC curves comparing ESPRESSO, GreenTea, …
Figure 23. ROC curves comparing ESPRESSO and DeepCoFFEA …
Figure 24. ROC curves presenting correlation efficacy across five datasets in the …
Figure 25. ROC curves demonstrating correlation performance of …
read the original abstract

Stepping-stone intrusions (SSIs) are a prevalent network evasion technique in which attackers route sessions through chains of compromised intermediate hosts to obscure their origin. Effective SSI detection requires correlating the incoming and outgoing flows at each relay host at extremely low false positive rates -- a stringent requirement that renders classical statistical methods inadequate in operational settings. We apply ESPRESSO, a deep learning flow correlation model combining a transformer-based feature extraction network, time-aligned multi-channel interval features, and online triplet metric learning, to the problem of stepping-stone intrusion detection. To support training and evaluation, we develop a synthetic data collection tool that generates realistic stepping-stone traffic across five tunneling protocols: SSH, SOCAT, ICMP, DNS, and mixed multi-protocol chains. Across all five protocols and in both host-mode and network-mode detection scenarios, ESPRESSO substantially outperforms the state-of-the-art DeepCoFFEA baseline, achieving a true positive rate exceeding 0.99 at a false positive rate of $10^{-3}$ for standard bursty protocols in network-mode. We further demonstrate chain length prediction as a tool for distinguishing malicious from benign pivoting, and conduct a systematic robustness analysis revealing that timing-based perturbations are the primary vulnerability of correlation-based stepping-stone detectors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents ESPRESSO, a deep learning model for stepping-stone intrusion detection that combines a transformer-based feature extractor, time-aligned multi-channel interval features, and online triplet metric learning. The authors introduce a synthetic data collection tool to generate stepping-stone traffic across five protocols (SSH, SOCAT, ICMP, DNS, and mixed chains). They report that ESPRESSO substantially outperforms the DeepCoFFEA baseline in both host-mode and network-mode scenarios, achieving TPR > 0.99 at FPR = 10^{-3} for standard bursty protocols in network mode. The work also includes chain-length prediction for distinguishing malicious pivoting and a robustness analysis highlighting timing perturbations as the primary vulnerability.

Significance. If the synthetic traces prove representative of real-world stepping-stone flows, the results would advance network security by demonstrating that modern deep learning techniques can achieve the stringent low false-positive rates required for operational SSI detection, where classical statistical methods are inadequate. The explicit comparison to an external baseline, the systematic robustness study, and the use of triplet learning constitute concrete strengths. The synthetic generator enables reproducible controlled experiments, but its unvalidated fidelity to live traffic is the key limiting factor for translating these findings into practice.

major comments (2)
  1. [Abstract] Abstract: All headline performance claims (TPR exceeding 0.99 at FPR of 10^{-3} across protocols in network mode) rest exclusively on flows produced by the authors' synthetic collection tool. No external validation—such as statistical distance metrics to real pivoting traces, comparison against public datasets, or sensitivity analysis to generator hyperparameters—is reported. Because the central claim is that the model is suitable for operational SSI detection, the absence of evidence that the synthetic distribution reproduces real timing jitter, burst structure, and protocol artifacts is load-bearing and must be addressed.
  2. [Evaluation] Evaluation and data-generation sections: The manuscript provides insufficient detail on data-generation parameters, training procedures (including triplet-loss margin and transformer hyperparameters), statistical tests for performance differences, and measures against overfitting beyond standard train-test splits. These omissions prevent independent assessment of whether the reported gains over DeepCoFFEA are robust or merely in-distribution artifacts of the synthetic generator.
minor comments (1)
  1. [Abstract] The abstract refers to 'time-aligned multi-channel interval features' without a concise definition or diagram; a short methods subsection or figure would improve clarity for readers unfamiliar with the feature construction.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and for highlighting both the strengths of our work and areas for improvement. We appreciate the acknowledgment of our baseline comparison, robustness analysis, and the value of the synthetic generator for controlled experiments. We address the major comments below with commitments to revisions that improve reproducibility and transparency. We note that while we can substantially expand methodological details, direct validation against real-world stepping-stone traces remains constrained by data availability.

read point-by-point responses
  1. Referee: [Abstract] Abstract: All headline performance claims (TPR exceeding 0.99 at FPR of 10^{-3} across protocols in network mode) rest exclusively on flows produced by the authors' synthetic collection tool. No external validation—such as statistical distance metrics to real pivoting traces, comparison against public datasets, or sensitivity analysis to generator hyperparameters—is reported. Because the central claim is that the model is suitable for operational SSI detection, the absence of evidence that the synthetic distribution reproduces real timing jitter, burst structure, and protocol artifacts is load-bearing and must be addressed.

    Authors: We acknowledge that the performance claims are derived from synthetic traces and that external validation would further support operational relevance. Real pivoting datasets are not publicly available due to privacy and ethical constraints, which is why we developed a controllable synthetic generator. In the revision we will add a new subsection on data fidelity that includes: (1) sensitivity analysis sweeping generator hyperparameters (jitter variance, burst length, inter-packet timing distributions); (2) statistical comparisons (e.g., Kolmogorov-Smirnov tests on inter-arrival times and packet-size distributions) against publicly available non-pivoting traces from CAIDA and MAWI; and (3) an explicit limitations paragraph discussing the gap between synthetic and live traffic. These additions will clarify the scope of our claims without overstating generalizability to arbitrary real-world conditions. revision: partial
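The fidelity check proposed here is a standard two-sample Kolmogorov-Smirnov comparison of inter-arrival times; the statistic is simply the largest gap between the two empirical CDFs. A hand-rolled sketch (illustrative; in practice one would use a library routine such as scipy.stats.ks_2samp, and the exponential samples below stand in for real traces):

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample KS statistic: the maximum absolute gap between the
    empirical CDFs of x and y, evaluated over their pooled sample points.
    A small value suggests the two samples are similarly distributed."""
    grid = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return float(np.abs(cdf_x - cdf_y).max())

# Toy check: exponential "inter-arrival times" vs. a same-rate and a
# different-rate sample (synthetic stand-ins, not captured traffic).
rng = np.random.default_rng(1)
same = ks_statistic(rng.exponential(1.0, 5000), rng.exponential(1.0, 5000))
diff = ks_statistic(rng.exponential(1.0, 5000), rng.exponential(2.0, 5000))
print(same, diff)
```

A matched-rate pair yields a statistic near zero while a rate mismatch does not, which is the shape of evidence the proposed data-fidelity subsection would need to report.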

  2. Referee: [Evaluation] Evaluation and data-generation sections: The manuscript provides insufficient detail on data-generation parameters, training procedures (including triplet-loss margin and transformer hyperparameters), statistical tests for performance differences, and measures against overfitting beyond standard train-test splits. These omissions prevent independent assessment of whether the reported gains over DeepCoFFEA are robust or merely in-distribution artifacts of the synthetic generator.

    Authors: We agree that additional implementation details are required for reproducibility. The revised manuscript will expand both sections as follows: data-generation parameters will list all protocol-specific settings, chain-length distributions, delay ranges, and packet-size models; training details will specify the online triplet loss margin (1.0), transformer configuration (6 layers, 8 heads, 512-dimensional embeddings, 0.1 dropout), optimizer (Adam, lr=1e-4), batch size, and epoch count; statistical significance will be reported via paired Wilcoxon signed-rank tests on TPR/FPR across five independent runs with different seeds; and overfitting controls will include early stopping on a held-out validation set, L2 regularization, and explicit train/validation/test splits with no overlap in generated chains. We will also release the full data-generation tool and training code upon acceptance. revision: yes
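The training recipe promised above can be collected in one place. Only the numeric values come from the rebuttal; the key names and structure are illustrative, not the paper's codebase:

```python
# Hypothetical configuration mirroring the hyperparameters quoted in the
# simulated rebuttal; key names are invented for illustration.
ESPRESSO_TRAINING = {
    "triplet_loss": {"mining": "online", "margin": 1.0},
    "transformer": {"layers": 6, "heads": 8, "embed_dim": 512, "dropout": 0.1},
    "optimizer": {"name": "Adam", "lr": 1e-4},
    "evaluation": {"runs": 5, "significance_test": "paired Wilcoxon signed-rank"},
}

# Sanity check: the embedding width must split evenly across attention heads.
t = ESPRESSO_TRAINING["transformer"]
assert t["embed_dim"] % t["heads"] == 0  # 512 / 8 = 64 dims per head
```

Publishing such a table alongside the released code would let readers reproduce the reported gains over DeepCoFFEA without reverse-engineering the setup.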

standing simulated objections not resolved
  • Direct empirical validation of synthetic traces against real-world stepping-stone pivoting flows remains open: no suitable public datasets exist, and ethical considerations preclude new collection of live attack traffic.

Circularity Check

0 steps flagged

No significant circularity detected in derivation or evaluation chain

full rationale

The paper trains and evaluates ESPRESSO on held-out synthetic flows generated by its own collection tool, then compares against the external DeepCoFFEA baseline using standard metrics (TPR at low FPR). No equations, parameters, or predictions reduce to the inputs by construction; the reported performance is an empirical result on a fixed dataset split rather than a tautology. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are present in the provided text. The central claim remains an independent empirical comparison even if the synthetic data's fidelity to real traffic is separately questioned.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The performance claims rest primarily on the domain assumption that synthetic traffic captures essential real-world characteristics of stepping-stone chains; model hyperparameters are fitted during training but do not define the reported metrics by construction.

free parameters (1)
  • triplet loss margin and transformer hyperparameters
    These are tuned during model training on the synthetic dataset to achieve the reported correlation performance.
axioms (1)
  • domain assumption: Synthetic data generated by the custom tool accurately reproduces timing, burstiness, and protocol behaviors of real stepping-stone intrusions.
    All training, evaluation, and robustness results depend on this generated dataset.

pith-pipeline@v0.9.0 · 5515 in / 1451 out tokens · 50795 ms · 2026-05-10T16:47:43.803142+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

79 extracted references · 11 canonical work pages · 1 internal anchor

  1. [1]

    The panama papers: Exposing the rogue offshore finance industry,

    I. C. of Investigative Journalists, “The panama papers: Exposing the rogue offshore finance industry,” 2016

  2. [2]

    Tlp: White-analysis of the cyber attack on the ukrainian power grid-defense use case,

    R. Lee, M. J. Assante, and T. Conway, “Tlp: White-analysis of the cyber attack on the ukrainian power grid-defense use case,” inProc. Electr. Inf. Sharing Anal. Center (E-ISAC), 2016, pp. 1–29

  3. [3]

    Active medical device cyber-attacks,

    L. Ayala, “Active medical device cyber-attacks,” inCybersecurity for hospitals and healthcare facilities. Springer, 2016, pp. 19–37

  4. [4]

    Global energy cyberattacks: Night dragon,

    Mcafee, “Global energy cyberattacks: Night dragon,” 2011

  5. [5]

    Advanced persistent threats and how to monitor and deter them,

    C. Tankard, “Advanced persistent threats and how to monitor and deter them,”Network security, vol. 2011, no. 8, pp. 16–19, 2011

  6. [6]

    E. U. A. for Cybersecurity,Baseline security recommendations for IoT in the context of critical information infrastructures. European Network and Information Security Agency, 2017

  7. [7]

    Holding intruders accountable on the internet,

    S. Staniford-Chen and L. T. Heberlein, “Holding intruders accountable on the internet,” inProceedings 1995 IEEE Symposium on Security and Privacy. IEEE, 1995, pp. 39–49

  8. [8]

    Detecting stepping stones

    Y . Zhang and V . Paxson, “Detecting stepping stones.” inUSENIX Security Symposium, vol. 171, 2000, p. 184

  9. [9]

    Finding a connection chain for tracing intruders,

    K. Yoda and H. Etoh, “Finding a connection chain for tracing intruders,” inEuropean Symposium on Research in Computer Security. Springer, 2000, pp. 191–205

  10. [10]

    Inter-packet delay based correlation for tracing encrypted connections through stepping stones,

    X. Wang, D. S. Reeves, and S. F. Wu, “Inter-packet delay based correlation for tracing encrypted connections through stepping stones,” inProceedings of ESORICS 2002, October 2002, pp. 244–263

  11. [11]

    Multiscale stepping-stone detection: Detecting pairs of jittered interactive streams by exploiting maximum tolerable delay,

    D. L. Donoho, A. G. Flesia, U. Shankar, V . Paxson, J. Coit, and S. Staniford, “Multiscale stepping-stone detection: Detecting pairs of jittered interactive streams by exploiting maximum tolerable delay,” inInternational Workshop on Recent Advances in Intrusion Detection. Springer, 2002, pp. 17–35

  12. [12]

    Detection of interactive stepping stones: Algorithms and confidence bounds,

    A. Blum, D. Song, and S. Venkataraman, “Detection of interactive stepping stones: Algorithms and confidence bounds,” inInternational Workshop on Recent Advances in Intrusion Detection. Springer, 2004, pp. 258–277

  13. [13]

    Correlating tcp/ip packet contexts to detect stepping-stone intrusion,

    J. Yang and D. Woolbright, “Correlating tcp/ip packet contexts to detect stepping-stone intrusion,”Computers & Security, vol. 30, no. 6, pp. 538–546, 2011. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0167404811000721

  14. [14]

    Neural network based approach for stepping stone detection under delay and chaff perturbations,

    R. Kumar and B. Gupta, “Neural network based approach for stepping stone detection under delay and chaff perturbations,” Procedia Computer Science, vol. 85, pp. 155–165, 2016, international Conference on Computational Modelling and Security (CMS 2016). [Online]. Available: https://www.sciencedirect.com/science/article/pii/ S187705091630552X

  15. [15]

    Sleepy watermark tracing: An active network-based intrusion response framework,

    X. Wang, D. S. Reeves, S. F. Wu, and J. Yuill, “Sleepy watermark tracing: An active network-based intrusion response framework,” inIFIP International Information Security Conference. Springer, 2001, pp. 369–384

  16. [16]

    Robust correlation of encrypted attack traffic through stepping stones by manipulation of interpacket delays,

    X. Wang and D. S. Reeves, “Robust correlation of encrypted attack traffic through stepping stones by manipulation of interpacket delays,” inProceedings of the 10th ACM Conference on Computer and Communications Security, ser. CCS ’03. New York, NY , USA: Association for Computing Machinery, 2003, p. 20–29. [Online]. Available: https://doi.org/10.1145/948109.948115

  17. [17]

    Rainbow: A robust and invisible non-blind watermark for network flows

    A. Houmansadr, N. Kiyavash, and N. Borisov, “Rainbow: A robust and invisible non-blind watermark for network flows.” inNetwork & Distributed System Security Symposium (NDSS), 2009

  18. [18]

    The design and implementation of an efficient quaternary network flow watermark technology,

    L. Mo, G. Lv, B. Wang, G. Qiao, and J. Tan, “The design and implementation of an efficient quaternary network flow watermark technology,” in2021 17th International Conference on Mobility, Sensing and Networking (MSN), 2021, pp. 746–751

  19. [19]

    Deepcoffea: Improved flow correlation attacks on tor via metric learning and amplification,

    S. E. Oh, T. Yang, N. Mathews, J. K. Holland, M. S. Rahman, N. Hopper, and M. Wright, “Deepcoffea: Improved flow correlation attacks on tor via metric learning and amplification,” in2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 1915–1932

  20. [20]

    Espresso: Advanced end-to-end flow correlation attacks on tor,

    T. Chawla, S. Mittal, N. Mathews, and M. Wright, “Espresso: Advanced end-to-end flow correlation attacks on tor,” inProceedings of the 8th Asia-Pacific Workshop on Networking, ser. APNet ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 219–220. [Online]. Available: https://doi.org/10.1145/3663408.3665824

  21. [21]

    A research survey in stepping-stone intrusion detection,

    L. Wang and J. Yang, “A research survey in stepping-stone intrusion detection,”EURASIP Journal on Wireless Communications and Networking, vol. 2018, no. 1, p. 276, Dec 2018. [Online]. Available: https://doi.org/10.1186/s13638-018-1303-2

  22. [22]

    Detecting encrypted stepping-stone connections,

    T. He and L. Tong, “Detecting encrypted stepping-stone connections,” IEEE Transactions on Signal Processing, vol. 55, no. 5, pp. 1612–1623, 2007

  23. [23]

    Monitoring network traffic to detect stepping-stone intrusion,

    J. Yang, B. Lee, and S. S. Huang, “Monitoring network traffic to detect stepping-stone intrusion,” in22nd International Conference on Advanced Information Networking and Applications - Workshops (aina workshops 2008), 2008, pp. 56–61

  24. [24]

    Identify encrypted packets to detect stepping-stone intrusion,

    J. Yang, L. Wang, S. Shakya, and M. Workman, “Identify encrypted packets to detect stepping-stone intrusion,” inAdvanced Information Networking and Applications, L. Barolli, I. Woungang, and T. Enokido, Eds. Cham: Springer International Publishing, 2021, pp. 536–547

  25. [25]

    Manipulating network traffic to evade stepping-stone intrusion detection,

    J. Yang, L. Wang, A. Lesh, and B. Lockerbie, “Manipulating network traffic to evade stepping-stone intrusion detection,”Internet of Things, vol. 3-4, pp. 34–45, 2018. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S254266051830057X

  26. [26]

    Evading stepping-stone detection with enough chaff,

    H. Clausen, M. S. Gibson, and D. Aspinall, “Evading stepping-stone detection with enough chaff,” inNetwork and System Security, M. Ku- tyłowski, J. Zhang, and C. Chen, Eds. Cham: Springer International Publishing, 2020, pp. 431–446

  27. [27]

    Detecting long connection chains of interactive terminal sessions,

    K. H. Yung, “Detecting long connection chains of interactive terminal sessions,” inInternational Workshop on Recent Advances in Intrusion Detection. Springer, 2002, pp. 1–16

  28. [28]

    A real-time algorithm to detect long connection chains of interactive terminal sessions,

    J. Yang and S.-H. S. Huang, “A real-time algorithm to detect long connection chains of interactive terminal sessions,” inProceedings of the 3rd international conference on Information security, 2004, pp. 198– 203

  29. [29]

    Detect stepping-stone intrusion by mining network traffic using k-means clus- tering,

    L. Wang, J. Yang, M. Mccormick, P.-J. Wan, and X. Xu, “Detect stepping-stone intrusion by mining network traffic using k-means clus- tering,” in2020 IEEE 39th International Performance Computing and Communications Conference (IPCCC). IEEE, 2020, pp. 1–8

  30. [30]

    Mining network traffic with the-means clustering algorithm for stepping-stone intrusion detection,

    L. Wang, J. Yang, X. Xu, and P.-J. Wan, “Mining network traffic with the-means clustering algorithm for stepping-stone intrusion detection,” Wireless Communications and Mobile Computing, vol. 2021, 2021

  31. [31]

    Mining tcp/ip packets to detect stepping-stone intrusion,

    J. Yang and S.-H. S. Huang, “Mining tcp/ip packets to detect stepping-stone intrusion,”Computers & Security, vol. 26, no. 7, pp. 479–484, 2007. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0167404807000934

  32. [32]

    Finn: Fingerprinting network flows using neural networks,

    F. Rezaei and A. Houmansadr, “Finn: Fingerprinting network flows using neural networks,” inAnnual Computer Security Applications Conference, ser. ACSAC. New York, NY , USA: Association for Computing Machinery, 2021, p. 1011–1024. [Online]. Available: https://doi.org/10.1145/3485832.3488010

  33. [33]

    Dynamic interval-based watermarking for tracking down network attacks,

    L. Yu, L. Zhang, Y . Zhang, W. Wen, X. Du, and F. Cao, “Dynamic interval-based watermarking for tracking down network attacks,” in2021 IEEE 21st International Conference on Software Quality, Reliability and Security (QRS), 2021, pp. 52–61

  34. [34]

    Invisible flow watermarks for channels with dependent substitution, deletion, and bursty insertion errors,

    X. Gong, M. Rodrigues, and N. Kiyavash, “Invisible flow watermarks for channels with dependent substitution, deletion, and bursty insertion errors,”IEEE Transactions on Information Forensics and Security, vol. 8, no. 11, pp. 1850–1859, 2013

  35. [35]

    Tagit: Tagging network flows using blind fingerprints,

    F. Rezaei and A. Houmansadr, “Tagit: Tagging network flows using blind fingerprints,”Proceedings on Privacy Enhancing Technologies, vol. 2017, no. 4, pp. 290–307, 2017. [Online]. Available: https: //doi.org/10.1515/popets-2017-0050

  36. [36]

    Botmosaic: Collaborative network watermark for the detection of irc-based botnets,

    A. Houmansadr and N. Borisov, “Botmosaic: Collaborative network watermark for the detection of irc-based botnets,”Journal of Systems and Software, vol. 86, no. 3, pp. 707–715, 2013. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0164121212003068

  37. [37]

    Dropwat: An invisible network flow watermark for data exfiltration traceback,

    A. Iacovazzi, S. Sarda, D. Frassinelli, and Y . Elovici, “Dropwat: An invisible network flow watermark for data exfiltration traceback,”IEEE Transactions on Information Forensics and Security, vol. 13, no. 5, pp. 1139–1154, 2018

  38. [38]

    Non-blind watermarking of network flows,

    A. Houmansadr, N. Kiyavash, and N. Borisov, “Non-blind watermarking of network flows,”IEEE/ACM Transactions on Networking, vol. 22, no. 4, pp. 1232–1244, 2013

  39. [39]

    Network flow watermarking attack on low-latency anonymous communication systems,

    X. Wang, S. Chen, and S. Jajodia, “Network flow watermarking attack on low-latency anonymous communication systems,” inIEEE Symposium on Security and Privacy (S&P), 2007, pp. 116–130

  40. [40]

    Blind detection of spread spectrum flow watermarks,

    W. Jia, F. P. Tso, Z. Ling, X. Fu, D. Xuan, and W. Yu, “Blind detection of spread spectrum flow watermarks,”Security and Communication Networks, vol. 6, no. 3, pp. 257–274, 2013. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/sec.540

  41. [41]

    New attacks on timing-based network flow watermarks,

    Z. Lin and N. Hopper, “New attacks on timing-based network flow watermarks,” in21st USENIX Security Symposium (USENIX Security 12). Bellevue, W A: USENIX Association, Aug. 2012, pp. 381–396. [Online]. Available: https://www.usenix.org/conference/ usenixsecurity12/technical-sessions/presentation/zin

  42. [42]

    On the secrecy of timing-based active watermarking trace-back techniques,

    P. Peng, P. Ning, and D. S. Reeves, “On the secrecy of timing-based active watermarking trace-back techniques,” inProceedings of the 2006 IEEE Symposium on Security and Privacy, ser. SP ’06. USA: IEEE Computer Society, 2006, p. 334–349. [Online]. Available: https://doi.org/10.1109/SP.2006.28

  43. [43]

    Exposing invisible timing-based traffic watermarks with backlit,

    X. Luo, P. Zhou, J. Zhang, R. Perdisci, W. Lee, and R. K. C. Chang, “Exposing invisible timing-based traffic watermarks with backlit,” inProceedings of the 27th Annual Computer Security Applications Conference, ser. ACSAC ’11. New York, NY , USA: Association for Computing Machinery, 2011, p. 197–206. [Online]. Available: https://doi.org/10.1145/2076732.2076760

  44. [44]

    Multi-flow attacks against network flow watermarking schemes

    N. Kiyavash, A. Houmansadr, and N. Borisov, “Multi-flow attacks against network flow watermarking schemes.” inUSENIX security symposium. Berkeley, CA, 2008, pp. 307–320

  45. [45]

    DeepCorr: Strong Flow Correlation Attacks on Tor Using Deep Learning,

    M. Nasr, A. Bahramali, and A. Houmansadr, “DeepCorr: Strong Flow Correlation Attacks on Tor Using Deep Learning,” inACM Conference on Computer and Communications Security (CCS), 2018, p. 1962–1976

  46. [46]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” inProc. Advances in Neural Information Processing Systems, 2017

  47. H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, “CvT: Introducing convolutions to vision transformers,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 22–31

  48. P. Sirinam, N. Mathews, M. S. Rahman, and M. Wright, “Triplet Fingerprinting: More practical and portable website fingerprinting with n-shot learning,” in ACM Conference on Computer and Communications Security (CCS), 2019, pp. 1131–1148

  49. P. Sirinam, M. Imani, M. Juarez, and M. Wright, “Deep Fingerprinting: Undermining website fingerprinting defenses with deep learning,” in ACM Conference on Computer and Communications Security (CCS), 2018

  50. V. Rimmer, D. Preuveneers, M. Juarez, T. Van Goethem, and W. Joosen, “Automated website fingerprinting through deep learning,” in Network and Distributed System Security Symposium (NDSS), 2018

  51. M. S. Rahman, P. Sirinam, N. Mathews, K. G. Gangadhara, and M. Wright, “Tik-Tok: The utility of packet timing in website fingerprinting attacks,” Proceedings on Privacy Enhancing Technologies (PETS), vol. 2020, no. 3, pp. 5–24, 2020

  52. S. Bhat, D. Lu, A. Kwon, and S. Devadas, “Var-CNN: A data-efficient website fingerprinting attack based on deep learning,” Proceedings on Privacy Enhancing Technologies (PETS), vol. 2019, no. 4, pp. 292–310, 2019

  53. M. Shen, K. Ji, Z. Gao, Q. Li, L. Zhu, and K. Xu, “Subverting website fingerprinting defenses with robust traffic representation,” in 32nd USENIX Security Symposium (USENIX Security 23). Anaheim, CA: USENIX Association, Aug. 2023, pp. 607–624. [Online]. Available: https://www.usenix.org/conference/usenixsecurity23/presentation/shen-meng

  54. N. Mathews, J. K. Holland, S. Oh, M. Rahman, N. Hopper, and M. Wright, “SoK: A critical evaluation of efficient website fingerprinting defenses,” in 2023 IEEE Symposium on Security and Privacy (SP). Los Alamitos, CA, USA: IEEE Computer Society, May 2023, pp. 969–986. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/SP46215.2023.10179289

  55. S. Li, H. Guo, and N. Hopper, “Measuring information leakage in website fingerprinting attacks and defenses,” in ACM Conference on Computer and Communications Security (CCS), 2018

  56. Linux Foundation, “netem.” [Online]. Available: https://www.linux.org/docs/man8/tc-netem.html

  57. DEF CON 25, “DEF CON 25 CTF.” [Online]. Available: https://media.defcon.org/DEF%20CON%2025/DEF%20CON%2025%20ctf/

  58. dest-unreach.org, “socat - multipurpose relay.” [Online]. Available: http://www.dest-unreach.org/socat/

  59. Toni Uhlig, “ptunnel-ng.” [Online]. Available: https://github.com/utoni/ptunnel-ng

  60. Ron Bowes, “dnscat2.” [Online]. Available: https://github.com/iagox86/dnscat2

  61. K. Ranasinghe, M. Naseer, M. Hayat, S. Khan, and F. S. Khan, “Orthogonal projection loss,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12333–12343

  62. N. Mathews, J. K. Holland, N. Hopper, and M. Wright, “Laserbeak: Evolving website fingerprinting attacks with attention and multi-channel feature representation,” IEEE Transactions on Information Forensics and Security, vol. 19, pp. 9285–9300, 2024

  63. R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp. 41–75, 1997

  64. I. Kokkinos, “UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5454–5463

  65. X. Yang, H. Sun, X. Sun, M. Yan, Z. Guo, and K. Fu, “Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network,” IEEE Access, vol. 6, pp. 50839–50849, 2018

  66. Z. Tang, L. Li, D. Wang, and R. Vipperla, “Collaborative joint training with multitask recurrent model for speech and speaker recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 3, pp. 493–504, 2017

  67. R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th International Conference on Machine Learning, ser. ICML ’08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 160–167. [Online]. Available: https://doi.org/10.1145/1390156.1390177

  68. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy

  69. M. Juarez, M. Imani, M. Perry, C. Diaz, and M. Wright, “Toward an efficient website fingerprinting defense,” in Computer Security – ESORICS 2016: 21st European Symposium on Research in Computer Security, Heraklion, Greece, September 26-30, 2016, Proceedings, Part I. Springer, 2016, pp. 27–46

  70. K. P. Dyer, S. E. Coull, T. Ristenpart, and T. Shrimpton, “Peek-a-boo, I still see you: Why efficient traffic analysis countermeasures fail,” in 2012 IEEE Symposium on Security and Privacy. IEEE, 2012, pp. 332–346

  71. Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, “Efficient attention: Attention with linear complexities,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3531–3539

  72. S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2020

APPENDIX A: ESPRESSO ON TOR TRAFFIC CORRELATION

ESPRESSO was originally developed and evaluated as a flow correlation attack against the Tor anonymity network, where the goal is to link a user’s ingress connection (entering...

Metrics: Our primary metric for evaluating correlation performance is the True Positive Rate (TPR) vs. False Positive Rate (FPR) curve, also known as the Receiver Operating Characteristic (ROC) curve. To emphasize performance at extremely low FPR values, which are critical for practical flow correlation applications, we plot ROC curves using a logarithmic ...
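The low-FPR ROC evaluation described above can be sketched directly from pairwise similarity scores. The sketch below is illustrative only, not the paper's evaluation code: the score distributions, threshold grid, and sample counts are toy choices, chosen mainly to show why a very large pool of uncorrelated pairs is needed before FPRs near 10^-6 can even be measured.

```python
import numpy as np

def roc_points(scores_pos, scores_neg, thresholds):
    """TPR/FPR at each threshold: a flow pair is declared 'correlated'
    when its similarity score meets or exceeds the threshold."""
    tpr = np.array([(scores_pos >= t).mean() for t in thresholds])
    fpr = np.array([(scores_neg >= t).mean() for t in thresholds])
    return tpr, fpr

# Toy scores: correlated pairs score higher on average than uncorrelated ones.
rng = np.random.default_rng(0)
pos = rng.normal(2.0, 1.0, 10_000)      # correlated (true) pairs
neg = rng.normal(0.0, 1.0, 1_000_000)   # uncorrelated pairs; at least 10^6
                                        # are needed to resolve FPR = 1e-6
thr = np.linspace(-2.0, 8.0, 101)
tpr, fpr = roc_points(pos, neg, thr)
```

Plotting `tpr` against `fpr` with a logarithmic x-axis then reproduces the log-scaled ROC presentation used here: each decade of FPR gets equal visual weight, so differences at 10^-5 or 10^-6 are not crushed against the axis.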

Dataset: We evaluate on the same dataset collected and used in the DCF paper [19]. This dataset consists of network traffic flows collected using a Tor client and an SSH proxy to capture both ingress and egress flows. From this dataset, we use the data collected in June 2022, using 8,662 flow pairs for training, 764 pairs for validation, and 811 pairs enti...

ROC Curve Analysis: Comparison of ESPRESSO and DeepCoFFEA. Figure 19 shows the ROC curves comparing ESPRESSO (trained with online and offline mining) against DeepCoFFEA. The results demonstrate that ESPRESSO significantly outperforms DCF, particularly at very low FPRs (below 10^-6). The online training strategy combined with hard triplet mining gives ESPRE...

Summary Table: Analysis of Max TPR and pAUC. From Table IX, ESPRESSO consistently outperforms DCF and other variants, particularly at very low FPR thresholds. ESPRESSO trained with a margin of 0.5 and online batch-hard mining achieves the highest Max TPR values at all FPR levels. At an FPR threshold of 10^-7, ESPRESSO achieves a Max TPR of 0.811 and a pAUC...
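The online batch-hard mining referred to above selects, within each training batch, the farthest positive and closest negative for every anchor, and applies the triplet margin to that hardest pair. The NumPy sketch below is a minimal illustration of that loss, not the authors' implementation; the margin value 0.5 matches the setting reported here, while the embedding and label arrays are invented for the example.

```python
import numpy as np

def batch_hard_triplet_loss(emb, labels, margin=0.5):
    """Batch-hard triplet loss: for each anchor, use its hardest
    (farthest) positive and hardest (closest) negative in the batch."""
    # Pairwise Euclidean distances between all embeddings in the batch.
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)                    # exclude self-pairs
    diff = labels[:, None] != labels[None, :]
    hardest_pos = np.where(same, d, -np.inf).max(axis=1)
    hardest_neg = np.where(diff, d, np.inf).min(axis=1)
    # Anchors with no in-batch positive contribute zero loss.
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

Well-separated clusters (every positive closer than every negative by at least the margin) drive the loss to zero, which is why hard mining matters: random triplets quickly become trivially satisfied and stop producing gradient signal.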

Experimental Setup: To rigorously evaluate the impact of windowing and backbone architecture, we devised two distinct experimental configurations. a) Input Window Scaling: In our baseline experiments, the DCF model operates on a fixed window size W, which we set to 8s by default. To test sensitivity to this hyperparameter, we retrained and evaluated the mode...
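The fixed-window input in (a) amounts to bucketing a packet trace into consecutive W-second slices before feature extraction. The helper below is a hypothetical sketch of that step under an assumed `(timestamp_s, size)` tuple format; it is not the DCF pipeline's actual flow representation.

```python
def window_flow(pkts, W=8.0):
    """Split a flow into consecutive W-second windows.

    pkts: list of (timestamp_s, size) tuples, sorted by time.
    Returns a list of windows; window k holds packets whose
    timestamp falls in [k*W, (k+1)*W).
    """
    buckets = {}
    for t, s in pkts:
        buckets.setdefault(int(t // W), []).append((t, s))
    n = max(buckets) + 1 if buckets else 0
    # Fill gaps so quiet periods appear as empty windows, preserving alignment.
    return [buckets.get(k, []) for k in range(n)]
```

A larger W trades temporal resolution for more context per window, which is one plausible reading of why the window-size sensitivity reported below varies so much across protocols with different burst patterns.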

Experimental Results: Figure 24 presents the ROC curves for each dataset under each experimental setting, with the standard ESPRESSO configuration included as a reference. The first observation is that the impact of window size is highly dataset-dependent. Increasing the window size of DCF yields significant performance improvements against DNS and ICMP ...

Results: Figure 25 shows the ROC curves for Modified DCF applied with various loss configurations on the SSH-only, DNS-only, and Mixed-protocol datasets. Similar to the results for ESPRESSO, the impact of the loss on correlation is highly variable. On SSH-only data, the combined loss augmentation outperforms other models at 10^-5 FPR, achieving a TPR of 0.4...