SeqShield: A Behavioral Analysis Approach to Uncover Rootkits
Pith reviewed 2026-05-08 05:51 UTC · model grok-4.3
The pith
SeqShield detects rootkits by examining the order of API calls instead of static code signatures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that API call sequences, when represented as bigrams and trigrams and then reduced via Gini-importance ranking, supply enough information for a machine-learning model to separate rootkit execution traces from benign ones, reaching 96.72 percent accuracy on optimized bigrams and 97.81 percent on optimized trigrams when tested against tenfold metamorphic variants generated by the authors.
What carries the argument
n-gram extraction from API call traces followed by iterative Gini-based feature selection
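The extraction step can be illustrated with a minimal sketch. The API names in the trace below are hypothetical placeholders rather than traces from the paper, and the helper names `ngrams` and `ngram_counts` are introduced here purely for illustration; the paper does not publish its extraction code.

```python
from collections import Counter


def ngrams(trace, n):
    """Return the contiguous n-grams of an API call trace as tuples."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def ngram_counts(trace, n):
    """Count how often each n-gram occurs in one trace."""
    return Counter(ngrams(trace, n))


# Hypothetical trace, assumed only for this example.
trace = [
    "NtOpenProcess",
    "NtAllocateVirtualMemory",
    "NtWriteVirtualMemory",
    "NtCreateThreadEx",
]

bigrams = ngram_counts(trace, 2)   # 3 bigrams for a 4-call trace
trigrams = ngram_counts(trace, 3)  # 2 trigrams for a 4-call trace
```

Each distinct n-gram becomes one column of the feature matrix, which is why the number of features grows quickly with n and why a later reduction step is attractive.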
If this is right
- Detection can shift from matching fixed byte patterns to monitoring live sequences of system calls.
- Reducing the number of n-gram features lowers memory and computation while preserving reported accuracy.
- The same pipeline can be re-trained when new Windows APIs are introduced.
- Generating large numbers of metamorphic variants provides a practical way to test robustness before deployment.
Where Pith is reading between the lines
- The same sequence-based approach might extend to other Windows malware families that rely on API calls rather than kernel hooking.
- Continuous monitoring of API traces on a running system could allow earlier intervention than post-execution analysis.
- If future rootkits adopt direct system-call invocation that bypasses the standard API layer, the current feature set would lose coverage.
Load-bearing premise
The particular API call sequences produced by rootkits always contain stable patterns that survive the mutations the authors applied.
What would settle it
A rootkit whose malicious actions are performed through a completely different collection of API calls, or through direct kernel manipulation that never appears in the recorded user-mode sequence, would produce traces that the trained model labels as benign.
Original abstract
Rootkits are among the most elusive types of malware, capable of bypassing traditional static analysis methods due to their metamorphic behavior. Signature-based detection techniques struggle against these threats, necessitating a shift toward dynamic analysis approaches. We propose SeqShield, a behavior-based rootkit detection approach designed specifically for the Windows OS, leveraging API call sequences for dynamic behavior analysis. Instead of relying on static signatures, SeqShield examines the execution patterns of API calls, which inherently reflect malicious intent. Analyzing API sequences, we can effectively identify rootkit-like behavior. We also employed a metamorphic code engine to generate 10X mutated variants of rootkits, demonstrating their obfuscation strategies. SeqShield applies n-gram analysis to extract bigram and trigram features from these API call sequences, enabling effective detection of rootkit-like activity. Among the models tested, Random Forest achieves the highest accuracy of 97.27% (bigram) and 96.17% (trigram). To optimize performance and decrease the dimension, we apply feature importance ranking using the Gini Impurity Index, iteratively selecting the most significant features. The optimized lower-dimensional feature matrix significantly enhances detection efficiency without sacrificing accuracy. Using the optimized feature set, our approach achieves 96.72% accuracy for bigrams and 97.81% accuracy for trigrams.
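To make the Gini-based ranking concrete, here is a simplified sketch. It scores each n-gram feature by the Gini impurity decrease of a single split on feature presence, then sorts the features by that score. This is an assumption-laden simplification: the abstract does not say how the ranking is computed, and a Random Forest would aggregate such decreases over many trees and thresholds. The toy matrix, labels, and feature names are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(column, labels):
    """Impurity decrease from splitting samples on feature presence (count > 0)."""
    present = [y for x, y in zip(column, labels) if x > 0]
    absent = [y for x, y in zip(column, labels) if x == 0]
    n = len(labels)
    weighted = (len(present) / n) * gini(present) + (len(absent) / n) * gini(absent)
    return gini(labels) - weighted


def rank_features(X, y, names):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = {
        name: gini_decrease([row[j] for row in X], y)
        for j, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy n-gram count matrix (rows = traces) and labels (1 = rootkit, 0 = benign).
names = ["A>B", "B>C", "C>D"]
X = [[1, 0, 1], [2, 1, 0], [0, 1, 1], [0, 0, 1]]
y = [1, 1, 0, 0]

ranked = rank_features(X, y, names)
top_k = [name for name, _ in ranked[:2]]  # keep the two highest-scoring features
```

In this toy data the feature "A>B" separates the classes perfectly and ranks first, while "B>C" carries no information and ranks last. An iterative variant would retrain and re-rank after each reduction.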
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SeqShield, a dynamic behavioral detection method for Windows rootkits that extracts bigram and trigram n-gram features from API call sequences. It generates 10X metamorphic variants of rootkits via a custom engine, trains several ML models (Random Forest reported at 97.27% bigram / 96.17% trigram accuracy), and applies post-hoc Gini impurity feature ranking to produce a reduced feature set that yields 96.72% bigram and 97.81% trigram accuracy.
Significance. If the evaluation methodology were strengthened with independent hold-out data and proper cross-validation, the work would offer a useful demonstration that n-gram features on API sequences can distinguish rootkit-like behavior even after limited synthetic obfuscation. The explicit use of a metamorphic generator to test robustness is a positive step toward falsifiable behavioral claims.
major comments (3)
- [Abstract] Abstract: the headline accuracies (96.72% bigram, 97.81% trigram) are obtained after Gini-based feature selection performed on the same data used for both training and final reporting, with no mention of dataset cardinality, train-test split ratio, cross-validation procedure, or false-positive rates. This renders the central performance claim difficult to interpret and risks optimistic bias.
- [Abstract] Abstract and §3 (method description): the metamorphic engine is described only as producing “10X mutated variants” without enumerating the transformation primitives (register renaming, junk insertion, control-flow flattening, etc.) or the provenance and size of the original rootkit corpus. Without these details it is impossible to assess whether the selected n-gram features generalize beyond the synthetic distribution.
- [Abstract] Abstract: no baseline comparisons (e.g., simple frequency thresholds, other n-gram orders, or static signature detectors) or ablation on the contribution of the Gini ranking step are supplied, making it unclear whether the reported lift is attributable to the behavioral n-gram approach or to post-hoc optimization on the evaluation set.
minor comments (2)
- [Abstract] The abstract states Random Forest achieves 97.27% (bigram) and 96.17% (trigram) before optimization, yet the optimized figures are 96.72% bigram and 97.81% trigram; the reversal should be explained or corrected.
- [Abstract] Notation for n-gram order and feature counts after ranking is used without an explicit table or equation defining the final dimensionality.
Simulated Author's Rebuttal
We thank the referee for the constructive comments that highlight important aspects of our evaluation methodology and presentation. We address each major comment below and will make the indicated revisions to improve clarity and rigor.
- Referee: [Abstract] Abstract: the headline accuracies (96.72% bigram, 97.81% trigram) are obtained after Gini-based feature selection performed on the same data used for both training and final reporting, with no mention of dataset cardinality, train-test split ratio, cross-validation procedure, or false-positive rates. This renders the central performance claim difficult to interpret and risks optimistic bias.
  Authors: We agree that the current manuscript does not specify dataset cardinality, train-test split, cross-validation procedure, or false-positive rates in the abstract, and that Gini-based feature selection was performed on the full dataset used for reporting. This creates a legitimate risk of optimistic bias. We will revise the abstract and methods section to include these details and adopt a nested cross-validation approach in which feature selection occurs only within training folds. We will also report false-positive rates and update the performance figures if the re-evaluation changes them. revision: yes
- Referee: [Abstract] Abstract and §3 (method description): the metamorphic engine is described only as producing “10X mutated variants” without enumerating the transformation primitives (register renaming, junk insertion, control-flow flattening, etc.) or the provenance and size of the original rootkit corpus. Without these details it is impossible to assess whether the selected n-gram features generalize beyond the synthetic distribution.
  Authors: We concur that the description of the metamorphic engine is insufficient for assessing generalization. We will expand §3 to enumerate the transformation primitives applied (including register renaming, junk insertion, and control-flow flattening) and to state the provenance and size of the original rootkit corpus. These additions will enable readers to evaluate the scope of the synthetic variants. revision: yes
- Referee: [Abstract] Abstract: no baseline comparisons (e.g., simple frequency thresholds, other n-gram orders, or static signature detectors) or ablation on the contribution of the Gini ranking step are supplied, making it unclear whether the reported lift is attributable to the behavioral n-gram approach or to post-hoc optimization on the evaluation set.
  Authors: We accept that the absence of baselines and ablations makes it difficult to attribute performance gains specifically to the n-gram approach versus the feature-ranking step. We will add baseline comparisons (simple frequency thresholds, alternative n-gram orders, and static signature detectors where feasible) and an ablation study isolating the Gini ranking contribution. These will be included in the revised results section. revision: yes
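The leakage concern raised in the first exchange can be sketched with scikit-learn. The example below assumes a synthetic count matrix as a stand-in for the n-gram features (the paper's data is not available here) and places feature selection inside a `Pipeline`, so that `cross_val_score` refits the selector on the training portion of every fold. It is not full nested cross-validation, which would add an inner loop for hyperparameter tuning, and the hyperparameter values are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for an n-gram count matrix: 200 traces, 50 features.
X = rng.poisson(1.0, size=(200, 50))
y = rng.integers(0, 2, size=200)
X[:, 0] += 3 * y  # make one feature informative so the data is not pure noise

pipeline = Pipeline([
    # Importance-based selection is refit on each training fold only.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median",
    )),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

Running the same pipeline with the `select` step removed gives a simple ablation of the ranking step, since both variants are then scored on identical held-out folds.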
Circularity Check
No circularity: standard empirical ML pipeline on explicitly constructed synthetic dataset
The paper describes an empirical workflow: collection of API call sequences from rootkits, generation of 10X metamorphic variants via an engine, extraction of bigram/trigram n-gram features, training of classifiers (e.g., Random Forest), and post-hoc Gini-based feature selection to reduce dimensionality while reporting resulting accuracies. No equations, first-principles derivations, or claimed predictions exist that reduce to the inputs by construction. Feature selection is performed on the same dataset used for evaluation, which is a conventional (if potentially optimistic) ML practice, not a hidden self-definition or fitted input renamed as independent prediction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. Results are presented as measured performance on the authors' generated data rather than as externally validated generalizations derived from the method itself. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
free parameters (3)
- n-gram order (bigram vs trigram)
- Feature count after Gini ranking
- Random Forest hyperparameters
axioms (2)
- domain assumption: API call sequences reflect malicious intent of rootkits
- domain assumption: Metamorphic engine produces representative rootkit variants
Reference graph
Works this paper leans on
- [1] Abuse.ch: Malwarebazaar, https://bazaar.abuse.ch/, accessed: March 1, 2025
- [2] MITRE ATT&CK: T1014: Rootkit (2024), https://attack.mitre.org/techniques/T1014/, accessed: Feb 13, 2025
- [3] Bianco, D.J.: The pyramid of pain (2013), https://detect-respond.blogspot.com/2013/03/the-pyramid-of-pain.html, accessed: March 1, 2025
- [4] Cuckoo Sandbox Team: Cuckoo sandbox, https://cuckoosandbox.org/, accessed: Feb 13, 2025
- [5] Dawson, J.A., McDonald, J.T., Shropshire, J., Andel, T.R., Luckett, P., Hively, L.: Rootkit detection through phase-space analysis of power voltage measurements. In: 2017 12th International Conference on Malicious and Unwanted Software (MALWARE). pp. 19–27 (2017). https://doi.org/10.1109/MALWARE.2017.8323953
- [6] Kruegel, C., Robertson, W., Vigna, G.: Detecting kernel-level rootkits through binary analysis. In: 20th Annual Computer Security Applications Conference. pp. 91–100 (2004). https://doi.org/10.1109/CSAC.2004.19
- [7] Luckett, P., McDonald, J.T., Dawson, J.: Neural network analysis of system call timing for rootkit detection. In: 2016 Cybersecurity Symposium (CYBERSEC). pp. 1–6 (2016). https://doi.org/10.1109/CYBERSEC.2016.008
- [8] Mulligan, D., Perzanowski, A.: The magnificence of the disaster: Reconstructing the sony bmg rootkit incident. Aaron K. Perzanowski (10 2010)
- [9] Nadim, M., Akopian, D., Lee, W.: A review on learning-based detection approaches of the kernel-level rootkit. In: 2021 International Conference on Engineering and Emerging Technologies (ICEET). pp. 1–6 (2021). https://doi.org/10.1109/ICEET53442.2021.9659710
- [10] Nadim, M., Lee, W., Akopian, D.: Characteristic features of the kernel-level rootkit for learning-based detection model training. Electronic Imaging 33(3), 34-1–34-1 (2021). https://doi.org/10.2352/ISSN.2470-1173.2021.3.MOBMU-034, https://library.imaging.org/ei/articles/33/3/art00003
- [11] Nadim, M., Lee, W., Akopian, D.: Kernel-level rootkit detection, prevention and behavior profiling: A taxonomy and survey (2023), https://arxiv.org/abs/2304.00473
- [12] Ortega, A.: Metame: A metamorphic engine for evasion (2024), https://github.com/a0rtega/metame, accessed: Feb 13, 2025
- [13] Saha, B., Rani, N., Shukla, S.K.: Malxcap: A method for malware capability extraction. In: Meng, W., Yan, Z., Piuri, V. (eds.) Information Security Practice and Experience. pp. 230–249. Springer Nature Singapore, Singapore (2023)
- [14] Singh, B., Evtyushkin, D., Elwell, J., Riley, R., Cervesato, I.: On the detection of kernel-level rootkits using hardware performance counters. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. p. 483–493. ASIA CCS ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3052973.30...
- [15] Suresh Kumar, S., Stephen, S., Suhainul Rumysia, M.: Rootkit detection using deep learning: A comprehensive survey. In: 2024 10th International Conference on Communication and Signal Processing (ICCSP). pp. 365–370 (2024). https://doi.org/10.1109/ICCSP60870.2024.10543963
- [16] The Volatility Foundation: The volatility framework, https://volatilityfoundation.org, accessed: Feb 13, 2025
- [17] VirusTotal: Virustotal, https://www.virustotal.com/, accessed: Feb 13, 2025
- [18] Wang, X., Zhang, J., Zhang, A., Ren, J.: Tkrd: Trusted kernel rootkit detection for cybersecurity of vms based on machine learning and memory forensic analysis. Mathematical Biosciences and Engineering 16, 2650–2667 (03 2019). https://doi.org/10.3934/mbe.2019132
- [19] Wikipedia contributors: Rootkit (2025), https://en.wikipedia.org/wiki/Rootkit, accessed: Feb 13, 2025
- [20] Zhou, B., Gupta, A., Jahanshahi, R., Egele, M., Joshi, A.: Hardware performance counters can detect malware: Myth or fact? In: Proceedings of the 2018 on Asia Conference on Computer and Communications Security. p. 457–468. ASIACCS ’18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3196494.3196515, https://doi.org/10...
- [21] Zhou, L., Makris, Y.: Hardware-assisted rootkit detection via on-line statistical fingerprinting of process execution. In: 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). pp. 1580–1585 (2018). https://doi.org/10.23919/DATE.2018.8342267
- [22] Zhou, Z., Hooker, G.: Unbiased measurement of feature importance in tree-based methods. ACM Trans. Knowl. Discov. Data 15(2) (Jan 2021). https://doi.org/10.1145/3429445