pith. machine review for the scientific record.

arxiv: 2604.24203 · v1 · submitted 2026-04-27 · 💻 cs.CR · cs.AI · cs.ET · cs.MA

Recognition: unknown

Agentic Witnessing: Pragmatic and Scalable TEE-Enabled Privacy-Preserving Auditing

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 02:53 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.ET · cs.MA
keywords agentic witnessing · trusted execution environment · privacy-preserving auditing · LLM auditor · model context protocol · boolean queries · codebase verification · semantic properties

The pith

An LLM auditor inside a trusted execution environment lets outsiders verify high-level properties of private data using only yes-or-no questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that isolating an LLM auditor in a hardware-protected enclave allows a verifier to confirm semantic features of a prover's secret dataset without ever seeing the raw information. The verifier poses a small number of simple binary questions while the auditor reads the data locally through a context protocol and returns a yes-or-no answer plus a signed reasoning trace. This trace is cryptographically bound to both the original data and the enclave's hardware root of trust. The approach targets the space between zero-knowledge proofs, which excel at precise math but struggle with meaning, and full disclosure, which leaks proprietary material. If the method holds, it opens a route to scalable checks on complex, unstructured properties such as whether released code actually matches its published description.

Core claim

Agentic Witnessing shifts verification from attested execution to attested reasoning. An LLM-based Auditor runs inside a TEE, uses the Model Context Protocol to inspect the Prover's private dataset on demand, answers a limited set of binary true/false questions from the Verifier, and emits a yes/no verdict together with a cryptographic transcript. The transcript is a signed hash chain that ties the reasoning steps to the original dataset and the TEE's hardware root of trust. The framework was demonstrated by auditing five high-level properties across the codebases of 21 peer-reviewed computer science papers, treating the source code as confidential.

What carries the argument

The TEE-isolated LLM Auditor that dynamically inspects datasets via the Model Context Protocol, answers boolean queries, and produces a signed hash-chain transcript bound to both the data and the hardware root of trust.
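The paper does not give the transcript's exact encoding; as a minimal sketch of the hash-chain idea, the genesis link commits to the dataset digest, each later link commits to the previous link plus one reasoning step, and the final link is signed. Here an HMAC stands in for the TEE's hardware-rooted attestation key; all names are illustrative.

```python
import hashlib
import hmac


def build_transcript(dataset: bytes, reasoning_steps, signing_key: bytes):
    """Hash chain: genesis link commits to the dataset digest, each later
    link commits to the previous link plus one reasoning step. The HMAC
    over the final link stands in for the TEE's attestation signature
    (illustrative, not the paper's exact format)."""
    chain = [hashlib.sha256(dataset).hexdigest()]
    for step in reasoning_steps:
        chain.append(hashlib.sha256((chain[-1] + step).encode()).hexdigest())
    signature = hmac.new(signing_key, chain[-1].encode(), hashlib.sha256).hexdigest()
    return chain, signature


def verify_transcript(dataset: bytes, reasoning_steps, chain, signature, signing_key: bytes):
    """Recompute the chain from the claimed inputs and check the signature."""
    rebuilt, resig = build_transcript(dataset, reasoning_steps, signing_key)
    return rebuilt == chain and hmac.compare_digest(resig, signature)
```

Any tampering with the dataset or the reasoning trace changes every downstream link, so the verifier can detect it without ever seeing the raw data, only the digests.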

Load-bearing premise

The LLM auditor can reliably inspect the dataset through the context protocol and deliver accurate yes/no verdicts on high-level qualitative properties.

What would settle it

Run the auditor on a set of codebases whose true answers to the target properties are already known to an external party, then check whether the auditor's yes/no outputs match those known answers while the raw code remains hidden from the verifier.
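Such a ground-truth check could be scripted; a minimal sketch, assuming a hypothetical `auditor(codebase, prop) -> bool` callable and externally labeled answers (the data layout is illustrative, not the paper's harness):

```python
def settle(auditor, labeled_codebases):
    """Run the auditor on codebases whose true answers are already known
    to an external party, and measure agreement. `auditor` is a
    hypothetical callable (codebase, prop) -> bool."""
    per_property = {}
    for name, (codebase, truths) in labeled_codebases.items():
        per_property[name] = {prop: auditor(codebase, prop) == want
                              for prop, want in truths.items()}
    total = sum(len(v) for v in per_property.values())
    correct = sum(sum(v.values()) for v in per_property.values())
    return per_property, correct / total
```

The per-property breakdown matters as much as the aggregate: a high overall score could hide systematic failure on one of the five properties.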

read the original abstract

Auditing the semantic properties of proprietary data creates a fundamental tension: verification requires transparent access, while proprietary rights demand confidentiality. While Zero-Knowledge Proofs (ZKPs) ensure privacy, they are typically limited to precise algebraic constraints and are ill-suited for verifying qualitative, unstructured properties, such as the logic within a codebase. We propose Agentic Witnessing, a framework that moves verification from attested execution to attested reasoning. The system is composed of three agents: a Verifier (who wants to check properties of a dataset), a Prover (who owns the dataset) and an Auditor (that inspects the dataset). The Verifier is allowed to ask a limited number of simple binary true/false questions to the auditor. By isolating an LLM-based Auditor within a Trusted Execution Environment (TEE), the system enables the Verifier to query a Prover's private data via simple Boolean queries, without exposing the raw dataset. The Auditor uses the Model Context Protocol (MCP) to dynamically inspect the target dataset, producing a yes/no verdict accompanied by a cryptographic transcript: a signed hash chain binding the reasoning trace to both the original dataset and the TEE's hardware root of trust. We demonstrate this architecture by automating the artifact evaluation process for 21 peer-reviewed computer science papers with released codebases on GitHub (e.g. Does the codebase implement the system described in the paper?). We verified five high-level properties of these codebases described in the corresponding publications, treating the source code as private. Our results show that TEE-enabled agentic auditing provides a mechanism for privacy-preserving oversight, effectively decoupling qualitative verification from the need for data disclosure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Agentic Witnessing, a framework for privacy-preserving verification of high-level semantic properties on proprietary datasets. A Verifier poses simple Boolean queries to an LLM-based Auditor isolated inside a TEE; the Auditor uses the Model Context Protocol (MCP) to inspect the Prover's private data and returns yes/no verdicts accompanied by a signed hash chain that binds the reasoning trace to the dataset and the TEE hardware root of trust. The architecture is demonstrated by automating artifact evaluation on 21 peer-reviewed CS papers, verifying five qualitative properties (e.g., whether released code implements the system described in the paper) while treating the codebases as private.

Significance. If the LLM Auditor can be shown to produce reliable verdicts, the approach would offer a pragmatic complement to ZKPs for qualitative auditing tasks that resist algebraic formalization. The TEE-attested reasoning model and the concrete demonstration on real GitHub repositories are strengths that could influence practical privacy-preserving oversight in software and data auditing.

major comments (3)
  1. [§5 (Demonstration)] The 21-paper demonstration (abstract and §5) reports no quantitative metrics—accuracy, precision, recall, false-positive rate, or inter-rater agreement with human experts—for the LLM Auditor’s yes/no verdicts on the five high-level properties. Without these data the central claim that the system “provides a mechanism for privacy-preserving oversight” rests on an untested assumption.
  2. [§4 (Architecture)] The signed hash chain (abstract and §4) attests execution and data binding but cannot attest semantic correctness of the LLM trace. No analysis of hallucination, misinterpretation of code semantics, or failure modes on qualitative properties (e.g., “does the codebase implement the described system”) is provided, leaving the reliability of the attested reasoning unaddressed.
  3. [§3 (System Design)] The manuscript contains no security analysis or threat model for the TEE + MCP integration. It is unclear how the Auditor is prevented from leaking raw data through side channels or how the limited Boolean-query interface is enforced inside the enclave.
minor comments (2)
  1. [Abstract] The abstract states “our results show” yet supplies no numerical outcomes or tables summarizing the 21-paper evaluation; a results table or summary statistics should be added.
  2. [§4] Notation for the five verified properties and the exact MCP interface calls is introduced only informally; a concise table or pseudocode listing would improve clarity.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of attested reasoning in privacy-preserving auditing. We agree that the manuscript would benefit from stronger empirical support, explicit discussion of LLM limitations, and a security analysis. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [§5 (Demonstration)] The 21-paper demonstration (abstract and §5) reports no quantitative metrics—accuracy, precision, recall, false-positive rate, or inter-rater agreement with human experts—for the LLM Auditor’s yes/no verdicts on the five high-level properties. Without these data the central claim that the system “provides a mechanism for privacy-preserving oversight” rests on an untested assumption.

    Authors: We agree that the absence of quantitative metrics leaves the reliability of the LLM Auditor’s verdicts unquantified. The demonstration in §5 was designed as an end-to-end feasibility study on real GitHub repositories rather than a controlled benchmark of LLM accuracy. In the revised manuscript we will add a new evaluation subsection that reports inter-rater agreement with human experts on a representative subset of the 21 codebases, together with precision, recall, and false-positive rates for the five properties. We will also explicitly qualify the central claim to reflect that the system provides attested execution of reasoning rather than verified correctness. revision: yes
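The metrics promised in this response are standard; as a sketch, precision, recall, and false-positive rate computed from yes/no verdicts against ground truth (parallel lists of booleans; illustrative, not the revised manuscript's code):

```python
def binary_metrics(verdicts, truth):
    """Precision, recall, and false-positive rate for boolean verdicts
    against known ground-truth labels of equal length."""
    tp = sum(v and t for v, t in zip(verdicts, truth))          # correct "yes"
    fp = sum(v and not t for v, t in zip(verdicts, truth))      # spurious "yes"
    fn = sum(not v and t for v, t in zip(verdicts, truth))      # missed "yes"
    tn = sum(not v and not t for v, t in zip(verdicts, truth))  # correct "no"
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, fpr
```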

  2. Referee: [§4 (Architecture)] The signed hash chain (abstract and §4) attests execution and data binding but cannot attest semantic correctness of the LLM trace. No analysis of hallucination, misinterpretation of code semantics, or failure modes on qualitative properties (e.g., “does the codebase implement the described system”) is provided, leaving the reliability of the attested reasoning unaddressed.

    Authors: The referee is correct that the hash chain provides cryptographic attestation of execution integrity and data binding inside the TEE but does not guarantee semantic correctness of the LLM trace. We will expand §4 with a dedicated paragraph on failure modes, including hallucinations, misinterpretation of code semantics, and the inherent limits of qualitative property verification. The revision will clarify that the contribution lies in attested reasoning under a restricted Boolean-query interface, note that the Verifier can issue follow-up queries to probe suspicious verdicts, and reference existing literature on LLM reliability in code analysis. revision: yes

  3. Referee: [§3 (System Design)] The manuscript contains no security analysis or threat model for the TEE + MCP integration. It is unclear how the Auditor is prevented from leaking raw data through side channels or how the limited Boolean-query interface is enforced inside the enclave.

    Authors: We acknowledge the lack of an explicit threat model. The original text relies on the standard isolation guarantees of TEEs and the design of the Model Context Protocol. In the revision we will insert a concise threat-model subsection in §3 that (i) states the assumed TEE security properties, (ii) describes how the Boolean-query interface is enforced at the protocol level to prevent arbitrary output, and (iii) discusses side-channel leakage risks together with standard mitigations. We will also note that a full formal analysis of the TEE+MCP composition is left for future work. revision: yes
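Point (ii), enforcing the Boolean-query interface at the protocol level, could take the shape of an output gate at the enclave boundary; a minimal sketch of the idea, not the paper's actual mechanism:

```python
def enforce_boolean_interface(raw_output: str) -> str:
    """Output gate at the enclave boundary: only a normalized yes/no
    token may leave, regardless of what the Auditor generated. Anything
    else is blocked, which also narrows the covert channel to one bit
    per query. A sketch of the enforcement idea only."""
    token = raw_output.strip().lower().rstrip(".")
    if token in ("yes", "true"):
        return "yes"
    if token in ("no", "false"):
        return "no"
    raise ValueError("non-boolean output blocked at enclave boundary")
```

Capping the number of queries then bounds total leakage to one bit per allowed question, which is the quantitative form of the "limited number of simple binary questions" constraint.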

Circularity Check

0 steps flagged

Architectural proposal with case study exhibits no circularity

full rationale

The paper advances an architectural framework (Agentic Witnessing) that isolates an LLM Auditor in a TEE to answer Verifier Boolean queries on Prover data without disclosure, using MCP for inspection and a signed hash chain for attestation. It supports the proposal with a case-study demonstration on 21 GitHub codebases for five high-level properties. No mathematical derivations, equations, fitted parameters, predictions of related quantities, or self-citation chains appear in the provided text or abstract. The central claim is a pragmatic system design whose validity rests on the external TEE guarantees and the (unvalidated) LLM reasoning reliability, not on any internal reduction to its own inputs by construction. This is a standard non-circular engineering proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on domain assumptions about TEE security and LLM reasoning reliability applied to a new auditing workflow; no free parameters or new invented entities with independent evidence are introduced.

axioms (2)
  • domain assumption Trusted Execution Environments provide hardware-rooted isolation, attestation, and tamper resistance for code execution.
    Invoked to guarantee the Auditor's integrity and the validity of the cryptographic transcript.
  • ad hoc to paper An LLM auditor can accurately determine binary truth values for high-level semantic properties of code or data when given inspection access via MCP.
    This is the core functional assumption enabling the yes/no verdicts without which the system cannot operate.
invented entities (2)
  • Agentic Witnessing framework no independent evidence
    purpose: To decouple qualitative verification from data disclosure using attested reasoning.
    Newly proposed system architecture; no independent evidence provided beyond the described demonstration.
  • Model Context Protocol (MCP) no independent evidence
    purpose: To enable dynamic inspection of the private dataset by the TEE-isolated auditor.
    Referenced as the interface for data access; details and prior existence not specified in abstract.

pith-pipeline@v0.9.0 · 5609 in / 1662 out tokens · 43368 ms · 2026-05-08T02:53:55.643988+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

45 extracted references · 33 canonical work pages · 2 internal anchors
