pith. sign in

arxiv: 2605.22058 · v1 · pith:IYI6EM6Snew · submitted 2026-05-21 · 💻 cs.SE · cs.CR

Finding Missing Input Validation in TEEs via LLM-Assisted Symbolic Execution

Pith reviewed 2026-05-22 04:53 UTC · model grok-4.3

classification 💻 cs.SE cs.CR
keywords Trusted Execution EnvironmentsSymbolic ExecutionInput ValidationLLM-assisted AnalysisVulnerability DetectionKLEESecurity Analysis
0
0 comments X

The pith

SymTEE uses LLMs to generate mock environments that let symbolic execution find missing input validations in TEE code without any real hardware setup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SymTEE as a way to detect missing input validation in Trusted Execution Environment applications by first using AST analysis to pull relevant code slices and then having an LLM turn those slices into KLEE harness programs with simple mock environments. This sidesteps the usual high cost and complexity of building complete TEE build and runtime setups plus the limited visibility caused by hardware isolation. A sympathetic reader would care because unvalidated inputs can create exploitable security holes inside otherwise protected code, yet current analysis methods are too cumbersome for routine use. SymTEE reports perfect precision and 92.3 percent recall across 26 tested vulnerabilities at an average cost of five cents per scan. The work therefore positions LLM-assisted symbolic execution as a practical route to scalable security checks for trusted computing systems.

Core claim

SymTEE extracts TEE code slices that may lack sufficient input validation via AST analysis, then employs an LLM to automatically convert those slices into KLEE-compatible harness programs that include lightweight mock execution environments, allowing symbolic execution to expose the vulnerabilities. On a set of 26 vulnerabilities consisting of 11 real-world and 15 synthetic cases, the framework reaches 100 percent precision and 92.3 percent recall while averaging only $0.05 in analysis cost per case.

What carries the argument

LLM-generated KLEE harness programs that embed lightweight mock execution environments, which enable symbolic analysis of extracted TEE code slices without requiring actual TEE hardware or full runtime configurations.

If this is right

  • 100 percent precision implies reported issues require no manual filtering for false alarms.
  • 92.3 percent recall means the large majority of existing missing-validation problems are caught.
  • An average cost of $0.05 makes repeated or large-scale scans economically practical.
  • The same workflow applies equally to both real-world and synthetic vulnerability examples.
  • The method supplies a scalable alternative for security analysis of trusted computing systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar LLM-driven mock generation could be tried for other classes of bugs inside hardware-isolated environments.
  • Continued LLM capability gains might push recall to 100 percent on the same task.
  • Embedding the pipeline inside continuous-integration tools could let TEE developers run checks automatically.
  • The lowered setup barrier may let smaller teams perform security reviews that previously required specialized hardware access.

Load-bearing premise

The large language model can reliably translate extracted TEE code slices into correct KLEE-compatible harness programs whose mock environments are faithful enough for symbolic analysis to reveal missing input validations.

What would settle it

Apply SymTEE to a new collection of TEE applications containing documented missing-input-validation bugs and check whether the generated harnesses consistently expose those bugs through symbolic execution or instead miss them or produce false paths.

Figures

Figures reproduced from arXiv: 2605.22058 by Chengyan Ma, David Lo, Jieke Shi, Ruidong Han, Ye Liu, Yuqing Niu.

Figure 1
Figure 1. Figure 1: Architecture of a Trusted Execution Environment [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example of missing input length validation in a [PITH_FULL_IMAGE:figures/full_fig_p001_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the SymTEE workflow consisting of three stages: [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: An example of LLM-generated KLEE harness. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
read the original abstract

Trusted Execution Environments (TEEs) provide hardware-enforced isolation that protects sensitive code and data from untrusted software. Despite their strong security guarantees, analyzing TEE applications remains challenging due to the high cost and complexity of configuring complete TEE build and runtime environments, as well as the limited observability imposed by hardware isolation. This paper presents SymTEE, a novel large language model (LLM)-assisted symbolic execution framework for detecting missing input validation issues in TEE applications without requiring real TEE setups. SymTEE begins by leveraging Abstract Syntax Tree (AST) analysis to extract TEE code slices that may lack sufficient input validation, and then employs an LLM (GPT-5 in our case) to automatically convert the extracted slices into KLEE-compatible harness programs containing lightweight mock execution environments for symbolic analysis. Evaluations on 26 vulnerabilities (11 real-world and 15 synthetic) show that SymTEE achieves 100% precision and 92.3% recall in detecting missing input validation vulnerabilities while incurring an average analysis cost of only $0.05. These results demonstrate the effectiveness and practicality of SymTEE's pioneering paradigm of LLM-assisted symbolic execution, where LLMs autonomously generate mock environments to enable automated security analysis without complex setup, providing a more accessible and scalable framework for trusted computing systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents SymTEE, a framework that uses AST analysis to extract code slices from TEE applications potentially lacking input validation, then employs an LLM (GPT-5) to automatically generate KLEE-compatible harness programs with lightweight mock execution environments. Symbolic execution via KLEE is then applied to detect missing input validations without requiring actual TEE hardware or full runtime setups. The evaluation on 26 cases (11 real-world vulnerabilities and 15 synthetic) reports 100% precision and 92.3% recall at an average cost of $0.05 per analysis, positioning the work as a scalable, low-setup paradigm for TEE security analysis.

Significance. If the LLM-generated mocks preserve sufficient TEE semantics for KLEE to reach and expose validation bugs, the approach could meaningfully reduce the cost and complexity of analyzing isolated TEE code. The reported aggregate metrics and low per-analysis cost indicate practical potential for automated security tooling in trusted computing, particularly if future work strengthens the validation of the mock-generation step.

major comments (3)
  1. [Evaluation] Evaluation section: The central claim of 92.3% recall (and thus the overall effectiveness) depends on the LLM-generated lightweight mock environments faithfully reproducing the control- and data-flow conditions needed to expose missing input validations. The manuscript provides no per-case audit of the generated harnesses, no comparison against hand-written reference harnesses for the 11 real-world cases, and no explicit validation that TEE-specific elements (entry points, attestation flows, or isolation boundaries) are preserved rather than hallucinated or simplified away.
  2. [Evaluation] Evaluation section: Aggregate metrics are reported for the full set of 26 cases, but no breakdown is given separating performance on the 11 real-world vulnerabilities from the 15 synthetic ones. Without this disaggregation or a description of how the synthetic cases were constructed, it is difficult to determine whether the 100% precision / 92.3% recall results generalize beyond the evaluated set.
  3. [Method] Method section: The pipeline relies on the LLM to translate extracted slices into correct KLEE harnesses whose mock environments are adequate for symbolic detection. The manuscript does not report how often LLM outputs required manual repair, nor does it quantify the frequency or nature of semantic omissions that could suppress or fabricate the very conditions the analysis targets.
minor comments (2)
  1. [Abstract] Abstract and introduction: The phrase 'GPT-5' should be clarified (model version, access date, or whether it denotes a specific configuration) to allow reproducibility.
  2. [Evaluation] The paper would benefit from a short table or appendix listing the 11 real-world cases with their sources and the key TEE features each exercises.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our evaluation and methodology. We address each major comment in detail below and indicate the changes we will make to the manuscript.

read point-by-point responses
  1. Referee: [Evaluation] The central claim of 92.3% recall depends on the LLM-generated lightweight mock environments faithfully reproducing the control- and data-flow conditions needed to expose missing input validations. The manuscript provides no per-case audit of the generated harnesses, no comparison against hand-written reference harnesses for the 11 real-world cases, and no explicit validation that TEE-specific elements (entry points, attestation flows, or isolation boundaries) are preserved rather than hallucinated or simplified away.

    Authors: We recognize the value of providing more transparency on the mock environment generation. Although the original manuscript focused on aggregate results, we did conduct per-case reviews during development to ensure that the mocks captured necessary TEE semantics such as entry points and basic isolation. To address this concern, we will include in the revised Evaluation section a summary of the validation process for the mocks, including examples of preserved elements for both real-world and synthetic cases. A full side-by-side comparison with hand-written harnesses is not feasible within the scope of this work as it would require duplicating the effort for each case manually; however, the absence of false positives (100% precision) provides indirect evidence that the mocks did not introduce spurious behaviors. We will also add explicit discussion on how attestation flows are mocked using symbolic return values. revision: partial

  2. Referee: [Evaluation] Aggregate metrics are reported for the full set of 26 cases, but no breakdown is given separating performance on the 11 real-world vulnerabilities from the 15 synthetic ones. Without this disaggregation or a description of how the synthetic cases were constructed, it is difficult to determine whether the 100% precision / 92.3% recall results generalize beyond the evaluated set.

    Authors: We agree that disaggregating the results would improve the reader's ability to assess generalizability. In the revised manuscript, we will present separate metrics for the 11 real-world vulnerabilities and the 15 synthetic cases. We will also add a description of how the synthetic cases were constructed. revision: yes

  3. Referee: [Method] The pipeline relies on the LLM to translate extracted slices into correct KLEE harnesses whose mock environments are adequate for symbolic detection. The manuscript does not report how often LLM outputs required manual repair, nor does it quantify the frequency or nature of semantic omissions that could suppress or fabricate the very conditions the analysis targets.

    Authors: This is a valid observation. We will update the Method section to report the rate at which LLM-generated harnesses required manual intervention. We will also discuss the steps taken to mitigate semantic omissions, such as post-generation checks against the original slices, and quantify any instances where omissions were detected and corrected. revision: yes

Circularity Check

0 steps flagged

No circularity: results derived from independent labeled test cases

full rationale

The paper describes an empirical tool SymTEE that extracts code slices via AST analysis and uses an LLM to synthesize KLEE harnesses with mock environments for symbolic execution. Precision and recall are measured directly against a fixed collection of 26 externally labeled vulnerabilities (11 real-world plus 15 synthetic), which function as independent ground truth rather than quantities constructed from the method's own definitions or fitted parameters. No equations, self-referential definitions, or load-bearing self-citations appear in the provided derivation; the central performance claims rest on observable outcomes that can be reproduced by re-running the tool on the same test suite. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the LLM producing harnesses whose mock environments preserve enough TEE semantics for KLEE to detect missing validation; this is an unproven domain assumption rather than a derived result.

axioms (1)
  • domain assumption LLM-generated mock execution environments are sufficiently faithful to real TEE behavior for symbolic execution to expose missing input validations.
    Invoked when the paper states that the generated harnesses enable automated security analysis without complex setup.

pith-pipeline@v0.9.0 · 5771 in / 1363 out tokens · 76900 ms · 2026-05-22T04:53:55.977379+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

  1. [1]

    Waqas Ahmed, Aamir Rasool, Abdul Rehman Javed, Neeraj Kumar, Thippa Reddy Gadekallu, Zunera Jalil, and Natalia Kryvinska. 2021. Security in Next Generation Mobile Payment Systems: A Comprehensive Survey.IEEE Access9 (2021), 115932– 115950

  2. [2]

    Sunil Anasuri. 2023. Confidential Computing Using Trusted Execution Environ- ments.International Journal of AI, BigData, Computational and Management Stud- ies4, 2 (Jun. 2023), 97–110

  3. [3]

    Matthew Areno and Jim Plusquellic. 2012. Securing Trusted Execution Environ- ments with PUF Generated Secret Keys. In2012 IEEE 11th International Conference on Trust, Security and Privacy in Computing and Communications. 1188–1193

  4. [4]

    Roberto Baldoni, Emilio Coppa, Daniele Cono D’elia, Camil Demetrescu, and Irene Finocchi. 2018. A Survey of Symbolic Execution Techniques.ACM Comput. Surv.51, 3, Article 50 (May 2018), 39 pages

  5. [5]

    Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator. InProceed- ings of the Annual Conference on USENIX Annual Technical Conference(Anaheim, CA)(ATEC ’05). USENIX Association, USA, 41

  6. [6]

    Marcel Busch, Aravind Machiry, Chad Spensky, Giovanni Vigna, Christopher Kruegel, and Mathias Payer. 2023. TEEzz: Fuzzing Trusted Applications on COTS Android Devices. In2023 IEEE Symposium on Security and Privacy (SP). 1204–1219

  7. [7]

    Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. InProceedings of the 8th USENIX Conference on Operating Systems Design and Implementation(San Diego, California)(OSDI’08). USENIX Association, USA, 209–224

  8. [8]

    David Cerdeira, Nuno Santos, Pedro Fonseca, and Sandro Pinto. 2020. SoK: Understanding the Prevailing Security Vulnerabilities in TrustZone-assisted TEE Systems. In2020 IEEE Symposium on Security and Privacy (SP). 1416–1432

  9. [9]

    Liheng Chen, Zheming Li, Zheyu Ma, Yuan Li, Baojian Chen, and Chao Zhang

  10. [10]

    In31st An- nual Network and Distributed System Security Symposium, NDSS 2024, San Diego, California, USA, February 26 - March 1, 2024

    EnclaveFuzz: Finding Vulnerabilities in SGX Applications. In31st An- nual Network and Distributed System Security Symposium, NDSS 2024, San Diego, California, USA, February 26 - March 1, 2024. The Internet Society

  11. [11]

    Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea. 2012. The S2E Platform: Design, Implementation, and Applications.ACM Trans. Comput. Syst. 30, 1, Article 2 (Feb. 2012), 49 pages

  12. [12]

    Intel Corporation. 2025. Intel®Software Guard Extensions (Intel® SGX). https://www.intel.com/content/www/us/en/products/docs/accelerator- engines/software-guard-extensions.html. [Accessed 05-11-2025]

  13. [13]

    Leonardo De Moura and Nikolaj Bjørner. 2011. Satisfiability modulo theories: introduction and applications.Commun. ACM54, 9 (Sept. 2011), 69–77

  14. [14]

    Amit Dua, Siddharth Sekhar Barpanda, Neeraj Kumar, and Sudeep Tanwar. 2020. Trustful: A Decentralized Public Key Infrastructure and Identity Management System. In2020 IEEE Globecom Workshops (GC Wkshps. 1–6

  15. [15]

    Guoyun Duan, Yuanzhi Fu, Boyang Zhang, Peiyao Deng, Jianhua Sun, Hao Chen, and Zhiwen Chen. 2023. TEEFuzzer: A fuzzing framework for trusted execution environments with heuristic seed mutation.Future Gener. Comput. Syst.144 (2023), 192–204

  16. [16]

    Zachary Englhardt, Richard Li, Dilini Nissanka, Zhihan Zhang, Girish Narayan- swamy, Joseph Breda, Xin Liu, Shwetak Patel, and Vikram Iyer. 2024. Exploring and Characterizing Large Language Models for Embedded System Development and Debugging. InExtended Abstracts of the CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA)(CHI EA ’24). ...

  17. [17]

    Shufan Fei, Zheng Yan, Wenxiu Ding, and Haomeng Xie. 2021. Security Vulnera- bilities of SGX and Countermeasures: A Survey.ACM Comput. Surv.54, 6, Article 126 (July 2021), 36 pages

  18. [18]

    Fabian Fleischer, Marcel Busch, and Phillip Kuhrt. 2020. Memory corruption attacks within Android TEEs: a case study based on OP-TEE. InProceedings of the 15th International Conference on A vailability, Reliability and Security(Virtual Event, Ireland)(ARES ’20). Association for Computing Machinery, New York, NY, USA, Article 53, 9 pages

  19. [19]

    Akalanka Galappaththi, Sarah Nadi, and Christoph Treude. 2024. An Empiri- cal Study of API Misuses of Data-Centric Libraries. InProceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Mea- surement(Barcelona, Spain)(ESEM ’24). Association for Computing Machinery, New York, NY, USA, 245–256

  20. [20]

    Lee Harrison, Hayawardh Vijayakumar, Rohan Padhye, Koushik Sen, and Michael Grace. 2020. PARTEMU: Enabling Dynamic Analysis of Real-World TrustZone Software Using Emulation. In29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 789–806

  21. [21]

    Junda He, Christoph Treude, and David Lo. 2025. LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead.ACM Trans. Softw. Eng. Methodol.34, 5, Article 124 (May 2025), 30 pages

  22. [22]

    Patrick Jauernig, Ahmad-Reza Sadeghi, and Emmanuel Stapf. 2020. Trusted Execution Environments: Properties, Applications, and Challenges.IEEE Security & Privacy18, 2 (2020), 56–60

  23. [23]

    Mustakimur Rahman Khandaker, Yueqiang Cheng, Zhi Wang, and Tao Wei. 2020. COIN Attacks: On Insecurity of Enclave Untrusted Interfaces in SGX. InPro- ceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems(Lausanne, Switzerland)(ASPLOS ’20). Association for Computing Machinery, New York...

  24. [24]

    Wenhao Li, Yubin Xia, Long Lu, Haibo Chen, and Binyu Zang. 2019. TEEv: virtualizing trusted execution environments on mobile platforms. InProceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execu- tion Environments(Providence, RI, USA)(VEE 2019). Association for Computing Machinery, New York, NY, USA, 2–16

  25. [25]

    Xiaoguo Li, Bowen Zhao, Guomin Yang, Tao Xiang, Jian Weng, and Robert H. Deng. 2023. A Survey of Secure Computation Using Trusted Execution Environ- ments. arXiv:2302.12150 [cs.CR] https://arxiv.org/abs/2302.12150

  26. [26]

    David Lo. 2023. Trustworthy and Synergistic Artificial Intelligence for Software Engineering: Vision and Roadmaps. In2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). 69–85

  27. [27]

    Chengyan Ma, Ruidong Han, Jieke Shi, Ye Liu, Yuqing Niu, Di Lu, Chuang Tian, Jianfeng Ma, Debin Gao, and David Lo. 2025. DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications. arXiv:2502.15281 [cs.CR] https://arxiv.org/abs/2502.15281

  28. [28]

    Sanoop Mallissery and Yu-Sung Wu. 2023. Demystify the Fuzzing Methods: A Comprehensive Survey.ACM Comput. Surv.56, 3, Article 71 (Oct. 2023), 38 pages

  29. [29]

    Ruijie Meng, Van-Thuan Pham, Marcel Böhme, and Abhik Roychoudhury. 2025. AFLNet Five Years Later: On Coverage-Guided Protocol Fuzzing.IEEE Transac- tions on Software Engineering51, 4 (2025), 960–974

  30. [30]

    Zahra Mousavi, Chadni Islam, Muhammad Ali Babar, Alsharif Abuadbba, and Kristen Moore. 2025. Detecting Misuse of Security APIs: A Systematic Review. ACM Comput. Surv.57, 12, Article 303 (July 2025), 39 pages

  31. [31]

    Sarah Nadi, Stefan Krüger, Mira Mezini, and Eric Bodden. 2016. Jumping through hoops: why do Java developers struggle with cryptography APIs?. InProceedings of the 38th International Conference on Software Engineering(Austin, Texas)(ICSE ’16). Association for Computing Machinery, New York, NY, USA, 935–946

  32. [32]

    Olivier Nourry, Yutaro Kashiwa, Bin Lin, Gabriele Bavota, Michele Lanza, and Yasutaka Kamei. 2023. The Human Side of Fuzzing: Challenges Faced by Devel- opers during Fuzzing Activities.ACM Trans. Softw. Eng. Methodol.33, 1, Article 14 (Nov. 2023), 26 pages

  33. [33]

    Arm Limited (or its affiliates). 2025. ARM TrustZone Technology. https://developer.arm.com/documentation/100690/0200/ARM-TrustZone- technology. [Accessed 05-11-2025]

  34. [34]

    Mohamed Sabt, Mohammed Achemlal, and Abdelmadjid Bouabdallah. 2015. Trusted Execution Environment: What It is, and What It is Not. In2015 IEEE Trustcom/BigDataSE/ISPA, Vol. 1. 57–64. FORGE ’26, April 12–13, 2026, Rio de Janeiro, Brazil Ma et al

  35. [35]

    Jieke Shi, Zhou Yang, and David Lo. 2025. Efficient and Green Large Language Models for Software Engineering: Literature Review, Vision, and the Road Ahead. ACM Trans. Softw. Eng. Methodol.34, 5, Article 137 (May 2025), 22 pages

  36. [36]

    Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. 2016. SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In2016 IEEE Symposium on Security and Privacy (SP). 138–157

  37. [37]

    Bo Yang, Kang Yang, Zhenfeng Zhang, Yu Qin, and Dengguo Feng. 2016. AEP-M: Practical Anonymous E-Payment for Mobile Devices Using ARM TrustZone and Divisible E-Cash. InInformation Security, Matt Bishop and Anderson C A Nascimento (Eds.). Springer International Publishing, Cham, 130–146

  38. [38]

    Joobeom Yun, Fayozbek Rustamov, Juhwan Kim, and Youngjoo Shin. 2022. Fuzzing of Embedded Systems: A Survey.ACM Comput. Surv.55, 7, Article 137 (Dec. 2022), 33 pages