arxiv: 2604.23812 · v1 · submitted 2026-04-26 · 💻 cs.CR · cs.LG

SeqShield: A Behavioral Analysis Approach to Uncover Rootkits

Pith reviewed 2026-05-08 05:51 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords rootkit detection · API call sequences · n-gram analysis · behavioral malware detection · Windows security · feature selection · machine learning classification · dynamic analysis

The pith

SeqShield detects rootkits by examining the order of API calls instead of static code signatures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Rootkits evade traditional scanners by rewriting their own code. The paper demonstrates that the sequence in which a program invokes Windows system functions still carries recognizable patterns of malicious activity. By converting those sequences into bigram and trigram features, ranking them by importance, and feeding the reduced set to a classifier, the method identifies rootkit behavior at high accuracy even when the original samples have been heavily mutated.

Core claim

The central claim is that API call sequences, when represented as bigrams and trigrams and then reduced via Gini-importance ranking, supply enough information for a machine-learning model to separate rootkit execution traces from benign ones, reaching 96.72 percent accuracy on optimized bigrams and 97.81 percent on optimized trigrams when tested against the ten-generation metamorphic variants generated by the authors.

What carries the argument

n-gram extraction from API call traces followed by iterative Gini-based feature selection
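
The first half of that pipeline, turning an API call trace into bigram and trigram counts, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `ngrams` and `ngram_counts` are hypothetical, and the API names in the sample trace are chosen for illustration rather than taken from the paper's dataset.

```python
from collections import Counter
from typing import Sequence


def ngrams(trace: Sequence[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive API calls in the trace, in order."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def ngram_counts(trace: Sequence[str], n: int) -> Counter:
    """Count how often each n-gram occurs in one execution trace."""
    return Counter(ngrams(trace, n))


# Hypothetical trace: these API names are illustrative, not from the paper.
trace = [
    "NtOpenProcess",
    "NtAllocateVirtualMemory",
    "NtWriteVirtualMemory",
    "NtOpenProcess",
    "NtAllocateVirtualMemory",
]

bigrams = ngram_counts(trace, 2)    # features built from pairs of calls
trigrams = ngram_counts(trace, 3)   # features built from triples of calls
```

Each distinct n-gram becomes one column of the feature matrix, which is why the feature count grows quickly and a later reduction step is needed.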

If this is right

  • Detection can shift from matching fixed byte patterns to monitoring live sequences of system calls.
  • Reducing the number of n-gram features lowers memory and computation while preserving reported accuracy.
  • The same pipeline can be re-trained when new Windows APIs are introduced.
  • Generating large numbers of metamorphic variants provides a practical way to test robustness before deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sequence-based approach might extend to other Windows malware families that rely on API calls rather than kernel hooking.
  • Continuous monitoring of API traces on a running system could allow earlier intervention than post-execution analysis.
  • If future rootkits adopt direct system-call invocation that bypasses the standard API layer, the current feature set would lose coverage.

Load-bearing premise

The particular API call sequences produced by rootkits always contain stable patterns that survive the mutations the authors applied.

What would settle it

A rootkit whose malicious actions are performed through a completely different collection of API calls, or through direct kernel manipulation that never appears in the recorded user-mode sequence, would produce traces that the trained model labels as benign.

Figures

Figures reproduced from arXiv: 2604.23812 by Anand Handa, Nitesh Kumar, Paras Ghodeshwar, Sandeep K Shukla.

Figure 1: Architecture of SeqShield
Surrounding text (§4.1 Experimental Setup): Our experiments utilize multiple tools to collect and process data in a secure environment. The experiments were conducted on an Intel® Core™ i7-4700 CPU @ 3.40 GHz × 8, running a 64-bit Ubuntu 18.04.6 LTS operating system with 16 GB of RAM and 1 TB of disk space. Ubuntu was used as the host OS, with VirtualBox installed for virtualization. A Windows 7 Pro …
Figure 2: MetaMe Mutation Architecture
Surrounding text: Each malware sample was passed through MetaMe in a recursive process:
  1. If A is the original executable, MetaMe generates A1.
  2. A1 is then passed through MetaMe to generate A2.
  3. This process continues iteratively up to A10, resulting in ten variations of the original sample (A1, A2, ..., A10).
These transformed samples were then analyzed to assess their detectability. When …
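
The recursive process described for Figure 2 amounts to a chain in which each variant is mutated from the previous one, not from the original. A minimal sketch follows, assuming a hypothetical `mutate` callable. The `toy_mutate` placeholder only appends a byte so the example runs; it does not reproduce anything MetaMe actually does.

```python
import hashlib
from typing import Callable


def mutation_chain(original: bytes,
                   mutate: Callable[[bytes], bytes],
                   generations: int = 10) -> list[bytes]:
    """Return [A1, ..., A_generations]; each variant derives from the one before."""
    variants = []
    current = original
    for _ in range(generations):
        current = mutate(current)   # A_k is produced from A_(k-1), not from A
        variants.append(current)
    return variants


def toy_mutate(blob: bytes) -> bytes:
    """Placeholder mutation (assumption): append one padding byte."""
    return blob + b"\x90"


variants = mutation_chain(b"original-sample", toy_mutate)

# Distinct hashes show why byte-level signatures of A do not match A1..A10.
digests = {hashlib.sha256(v).hexdigest() for v in variants}
```

Because every generation changes the bytes, all ten digests differ, which is the property that defeats hash- or signature-based matching and motivates the behavioral features.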
Figure 3: Model performance comparison for Top n-feature where F1-score is maximum
Surrounding text (§5.2 Identifying the Most Contributing Features): To pinpoint the most relevant features, we identified the intersection of the top n×100 features from both Decision Tree and Random Forest classifiers where the F1-score is max. The overlapping features were considered the most informative for rootkit detection. Let us consider the set A,…
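
The intersection step described in the §5.2 excerpt can be illustrated with plain sets. The sketch below assumes each classifier exposes a per-feature importance score. The feature names, the scores, and the helper `top_k` are all invented for illustration, and `k = 3` stands in for the paper's n×100 cut-off.

```python
def top_k(importances: dict[str, float], k: int) -> set[str]:
    """Names of the k features with the highest importance score."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])


# Hypothetical importance scores from two classifiers (values are invented).
decision_tree = {"f1": 0.40, "f2": 0.30, "f3": 0.20, "f4": 0.10, "f5": 0.00}
random_forest = {"f1": 0.35, "f2": 0.05, "f3": 0.25, "f4": 0.30, "f5": 0.05}

k = 3  # stand-in for n * 100 at the n where the F1-score peaks
shared = top_k(decision_tree, k) & top_k(random_forest, k)
```

Here the Decision Tree ranks `f1`, `f2`, `f3` highest and the Random Forest ranks `f1`, `f4`, `f3` highest, so only `f1` and `f3` survive the intersection.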
Figure 4: Model Prediction for Unseen Samples
Surrounding text: … them using our trained machine learning classifier. We conducted experiments on both high-dimensional feature matrices (containing both relevant and irrelevant features) and optimized lower-dimensional feature sets (comprising only the most significant features). Using a dataset of 90 previously unseen rootkit samples. For the higher-dimensional feature matrix, where bo…
Abstract

Rootkits are among the most elusive types of malware, capable of bypassing traditional static analysis methods due to their metamorphic behavior. Signature-based detection techniques struggle against these threats, necessitating a shift toward dynamic analysis approaches. We propose SeqShield, a behavior-based rootkit detection approach designed specifically for the Windows OS, leveraging API call sequences for dynamic behavior analysis. Instead of relying on static signatures, SeqShield examines the execution patterns of API calls, which inherently reflect malicious intent. Analyzing API sequences, we can effectively identify rootkit-like behavior. We also employed a metamorphic code engine to generate 10X mutated variants of rootkits, demonstrating their obfuscation strategies. SeqShield applies n-gram analysis to extract bigram and trigram features from these API call sequences, enabling effective detection of rootkit-like activity. Among the models tested, Random Forest achieves the highest accuracy of 97.27% (bigram) and 96.17% (trigram). To optimize performance and decrease the dimension, we apply feature importance ranking using the Gini Impurity Index, iteratively selecting the most significant features. The optimized lower-dimensional feature matrix significantly enhances detection efficiency without sacrificing accuracy. Using the optimized feature set, our approach achieves 96.72% accuracy for bigrams and 97.81% accuracy for trigrams.
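
The Gini-based ranking mentioned in the abstract rests on a simple quantity: how much a split on a feature reduces the Gini impurity 1 − Σ p_c² of the class labels. The sketch below computes that reduction for a single split on "n-gram absent vs. present". It is an illustration under stated assumptions, not the paper's procedure: tree ensembles accumulate such reductions over many splits, and the toy columns `informative` and `noisy` are invented.

```python
def gini(labels: list[int]) -> float:
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def impurity_decrease(feature: list[int], labels: list[int]) -> float:
    """Gini decrease from splitting samples on feature == 0 versus feature > 0."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x > 0]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Toy data: 1 = rootkit, 0 = benign; columns are hypothetical n-gram counts.
labels      = [1, 1, 1, 0, 0, 0]
informative = [2, 1, 3, 0, 0, 0]   # occurs only in rootkit traces
noisy       = [1, 0, 1, 1, 0, 1]   # occurs equally in both classes

scores = {
    "informative": impurity_decrease(informative, labels),
    "noisy": impurity_decrease(noisy, labels),
}
```

The informative column removes all impurity (a decrease of 0.5, the full impurity of a balanced two-class set), while the noisy column removes none, so a ranking by this score would keep the first and discard the second.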

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SeqShield, a dynamic behavioral detection method for Windows rootkits that extracts bigram and trigram features from API call sequences. It generates ten successive metamorphic variants of each rootkit sample with the MetaMe engine, trains several ML models (Random Forest reported at 97.27% bigram / 96.17% trigram accuracy), and applies post-hoc Gini impurity feature ranking to produce a reduced feature set that yields 96.72% bigram and 97.81% trigram accuracy.

Significance. If the evaluation methodology were strengthened with independent hold-out data and proper cross-validation, the work would offer a useful demonstration that n-gram features on API sequences can distinguish rootkit-like behavior even after limited synthetic obfuscation. The explicit use of a metamorphic generator to test robustness is a positive step toward falsifiable behavioral claims.

major comments (3)
  1. [Abstract] Abstract: the headline accuracies (96.72% bigram, 97.81% trigram) are obtained after Gini-based feature selection performed on the same data used for both training and final reporting, with no mention of dataset cardinality, train-test split ratio, cross-validation procedure, or false-positive rates. This renders the central performance claim difficult to interpret and risks optimistic bias.
  2. [Abstract] Abstract and §3 (method description): the metamorphic engine is described only as producing “10X mutated variants” without enumerating the transformation primitives (register renaming, junk insertion, control-flow flattening, etc.) or the provenance and size of the original rootkit corpus. Without these details it is impossible to assess whether the selected n-gram features generalize beyond the synthetic distribution.
  3. [Abstract] Abstract: no baseline comparisons (e.g., simple frequency thresholds, other n-gram orders, or static signature detectors) or ablation on the contribution of the Gini ranking step are supplied, making it unclear whether the reported lift is attributable to the behavioral n-gram approach or to post-hoc optimization on the evaluation set.
minor comments (2)
  1. [Abstract] The abstract states Random Forest achieves 97.27% (bigram) and 96.17% (trigram) before optimization, yet the optimized figures are 96.72% bigram and 97.81% trigram; the reversal should be explained or corrected.
  2. [Abstract] Notation for n-gram order and feature counts after ranking is used without an explicit table or equation defining the final dimensionality.
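
Major comment 1 asks for false-positive rates alongside accuracy. A small sketch with invented confusion-matrix counts shows why: two detectors can report the same accuracy while flagging benign software at very different rates. None of the numbers below come from the paper.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of benign samples wrongly flagged as rootkits."""
    return fp / (fp + tn)


# Two hypothetical detectors evaluated on 500 rootkit and 500 benign traces.
balanced_errors = dict(tp=485, fp=15, tn=485, fn=15)
skewed_errors = dict(tp=500, fp=30, tn=470, fn=0)

acc_balanced = accuracy(**balanced_errors)
acc_skewed = accuracy(**skewed_errors)
fpr_balanced = false_positive_rate(balanced_errors["fp"], balanced_errors["tn"])
fpr_skewed = false_positive_rate(skewed_errors["fp"], skewed_errors["tn"])
```

Both hypothetical detectors reach 97% accuracy, yet the second raises false alarms on benign traces twice as often (6% versus 3%), a difference that a single accuracy figure hides.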

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that highlight important aspects of our evaluation methodology and presentation. We address each major comment below and will make the indicated revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline accuracies (96.72% bigram, 97.81% trigram) are obtained after Gini-based feature selection performed on the same data used for both training and final reporting, with no mention of dataset cardinality, train-test split ratio, cross-validation procedure, or false-positive rates. This renders the central performance claim difficult to interpret and risks optimistic bias.

    Authors: We agree that the current manuscript does not specify dataset cardinality, train-test split, cross-validation procedure, or false-positive rates in the abstract, and that Gini-based feature selection was performed on the full dataset used for reporting. This creates a legitimate risk of optimistic bias. We will revise the abstract and methods section to include these details and adopt a nested cross-validation approach in which feature selection occurs only within training folds. We will also report false-positive rates and update the performance figures if the re-evaluation changes them. revision: yes
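
    The key property of the promised re-evaluation is that feature selection never sees the test fold. A minimal pure-Python sketch of that property follows. The scoring rule in `select_features` and the `nearest_centroid` classifier are simple stand-ins (assumptions), not the paper's Gini ranking or Random Forest. A full nested scheme would add an inner loop to choose `top_k`.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def select_features(X, y, rows, top_k):
    """Rank columns by |mean(rootkit) - mean(benign)| using only the given rows."""
    pos = [r for r in rows if y[r] == 1]
    neg = [r for r in rows if y[r] == 0]

    def score(col):
        return abs(sum(X[r][col] for r in pos) / len(pos)
                   - sum(X[r][col] for r in neg) / len(neg))

    return sorted(range(len(X[0])), key=score, reverse=True)[:top_k]


def nearest_centroid(X, y, train, test, cols):
    """Stand-in classifier: assign each test row to the closer class centroid."""
    def centroid(label):
        rows = [r for r in train if y[r] == label]
        return [sum(X[r][c] for r in rows) / len(rows) for c in cols]

    c0, c1 = centroid(0), centroid(1)

    def dist(r, c):
        return sum((X[r][col] - v) ** 2 for col, v in zip(cols, c))

    return [1 if dist(r, c1) < dist(r, c0) else 0 for r in test]


def cross_validate(X, y, fit_predict, k=5, top_k=2):
    """Accuracy per fold; selection is redone inside every training fold."""
    accuracies = []
    for train, test in k_fold_indices(len(X), k):
        cols = select_features(X, y, train, top_k)  # test rows are never used here
        preds = fit_predict(X, y, train, test, cols)
        accuracies.append(sum(p == y[t] for p, t in zip(preds, test)) / len(test))
    return accuracies


# Synthetic data (assumption): column 0 separates the classes, columns 1-2 are noise.
rng = random.Random(1)
y = [i % 2 for i in range(20)]
X = [[3 * label + rng.random(), rng.random(), rng.random()] for label in y]

fold_accuracies = cross_validate(X, y, nearest_centroid, k=5, top_k=2)
```

    Selecting features on the full dataset and then cross-validating would let test-fold labels influence which columns are kept, which is the source of optimistic bias the referee describes.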

  2. Referee: [Abstract] Abstract and §3 (method description): the metamorphic engine is described only as producing “10X mutated variants” without enumerating the transformation primitives (register renaming, junk insertion, control-flow flattening, etc.) or the provenance and size of the original rootkit corpus. Without these details it is impossible to assess whether the selected n-gram features generalize beyond the synthetic distribution.

    Authors: We concur that the description of the metamorphic engine is insufficient for assessing generalization. We will expand §3 to enumerate the transformation primitives applied (including register renaming, junk insertion, and control-flow flattening) and to state the provenance and size of the original rootkit corpus. These additions will enable readers to evaluate the scope of the synthetic variants. revision: yes

  3. Referee: [Abstract] Abstract: no baseline comparisons (e.g., simple frequency thresholds, other n-gram orders, or static signature detectors) or ablation on the contribution of the Gini ranking step are supplied, making it unclear whether the reported lift is attributable to the behavioral n-gram approach or to post-hoc optimization on the evaluation set.

    Authors: We accept that the absence of baselines and ablations makes it difficult to attribute performance gains specifically to the n-gram approach versus the feature-ranking step. We will add baseline comparisons (simple frequency thresholds, alternative n-gram orders, and static signature detectors where feasible) and an ablation study isolating the Gini ranking contribution. These will be included in the revised results section. revision: yes

Circularity Check

0 steps flagged

No circularity: standard empirical ML pipeline on explicitly constructed synthetic dataset

full rationale

The paper describes an empirical workflow: collection of API call sequences from rootkits, generation of 10X metamorphic variants via an engine, extraction of bigram/trigram n-gram features, training of classifiers (e.g., Random Forest), and post-hoc Gini-based feature selection to reduce dimensionality while reporting resulting accuracies. No equations, first-principles derivations, or claimed predictions exist that reduce to the inputs by construction. Feature selection is performed on the same dataset used for evaluation, which is a conventional (if potentially optimistic) ML practice, not a hidden self-definition or fitted input renamed as independent prediction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. Results are presented as measured performance on the authors' generated data rather than as externally validated generalizations derived from the method itself. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The claim rests entirely on empirical ML performance; multiple modeling choices are fitted to the generated dataset rather than derived from first principles.

free parameters (3)
  • n-gram order (bigram vs trigram)
    Selected to extract sequence features from API calls
  • Feature count after Gini ranking
    Iteratively chosen to reduce dimension while preserving accuracy
  • Random Forest hyperparameters
    Tuned to achieve highest accuracy among tested models
axioms (2)
  • domain assumption API call sequences reflect malicious intent of rootkits
    Stated as the basis for shifting from static signatures to dynamic sequence analysis
  • domain assumption Metamorphic engine produces representative rootkit variants
    Used to demonstrate detection under obfuscation


