
arxiv: 2501.09191 · v2 · pith:TNDNQ3HFnew · submitted 2025-01-15 · 💻 cs.SE · cs.CR

Detecting Vulnerabilities in Encrypted Software Code while Ensuring Code Privacy

Pith reviewed 2026-05-23 04:53 UTC · model grok-4.3

classification 💻 cs.SE cs.CR
keywords confidential code analysis · searchable symmetric encryption · static code analysis · vulnerability detection · code privacy · encrypted inverted index · software security · PHP applications

The pith

Static analysis detects vulnerabilities in encrypted code by indexing its data and control flows without decryption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for performing static code analysis on software that stays encrypted throughout the process. It combines searchable symmetric encryption with standard analysis techniques to construct an inverted index that encodes data and control flows. Security testers can then query this index to locate vulnerabilities while the original source remains inaccessible. The work also introduces the broader area of confidential code analysis as a new direction. Experimental results on PHP applications indicate that detection precision stays close to that of conventional non-private tools.

Core claim

By processing source code to build an encrypted inverted index that represents its data and control flows, static analysis tasks can be executed confidentially to discover vulnerabilities. This is achieved through the integration of searchable symmetric encryption, allowing the index to support queries without exposing the plaintext code. The resulting system achieves vulnerability detection precision comparable to standard static analysis tools.

What carries the argument

The encrypted inverted index, constructed via searchable symmetric encryption from the code's data and control flows, which permits confidential execution of static analysis queries.
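
To make the idea concrete, here is a minimal sketch of an encrypted inverted index, under assumptions that are much simpler than anything the paper describes. Keywords are hypothetical flow facts such as `sink:echo`, index labels are HMAC-SHA256 outputs, and posting lists are masked with a toy HMAC-derived keystream. The names `build_index`, `trapdoor`, and `search` and the keyword format are invented for illustration; this is not CoCoA's construction and not a vetted SSE scheme.

```python
import hashlib
import hmac
import json
import secrets


def _prf(key: bytes, msg: bytes) -> bytes:
    """Keyed pseudorandom function (HMAC-SHA256)."""
    return hmac.new(key, msg, hashlib.sha256).digest()


def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: concatenated PRF outputs over a counter."""
    out = b""
    counter = 0
    while len(out) < n:
        out += _prf(key, counter.to_bytes(4, "big"))
        counter += 1
    return out[:n]


def _xor(data: bytes, key: bytes) -> bytes:
    """Mask or unmask data with the keystream derived from key."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def trapdoor(k_label: bytes, k_mask: bytes, keyword: str):
    """Search token for one keyword: (opaque index label, per-keyword mask key)."""
    return _prf(k_label, keyword.encode()), _prf(k_mask, keyword.encode())


def build_index(k_label: bytes, k_mask: bytes, postings: dict) -> dict:
    """Key owner side: map opaque labels to masked posting lists."""
    index = {}
    for keyword, locations in postings.items():
        label, mask_key = trapdoor(k_label, k_mask, keyword)
        plaintext = json.dumps(sorted(locations)).encode()
        index[label] = _xor(plaintext, mask_key)
    return index


def search(index: dict, token) -> list:
    """Index holder side: resolve a token without knowing the keyword."""
    label, mask_key = token
    blob = index.get(label)
    if blob is None:
        return []
    return json.loads(_xor(blob, mask_key).decode())


# Hypothetical flow facts mapped to line numbers (made-up data).
k_label, k_mask = secrets.token_bytes(32), secrets.token_bytes(32)
postings = {"source:$_GET": [17, 3], "sink:echo": [21]}
index = build_index(k_label, k_mask, postings)
hits = search(index, trapdoor(k_label, k_mask, "sink:echo"))  # [21]
```

The party holding the index sees only opaque labels until it receives a token. Once a token is used, the matching posting list is revealed to it, which is the kind of access-pattern leakage SSE schemes typically accept.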

If this is right

  • Security testing services can be offered without requiring disclosure of source code or intellectual property.
  • Other forms of code analysis beyond vulnerability detection can be adapted to operate on the same encrypted index structure.
  • On both synthetic and real-world PHP web applications, detection results stay close to those of non-confidential baselines.
  • Confidential analysis incurs an average performance overhead of 42.7 percent relative to a non-confidential baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could be extended to languages other than PHP by reimplementing the flow-extraction step for their syntax and semantics.
  • Regulators might adopt requirements for confidential third-party audits in sectors where code exposure creates legal risk.
  • Cloud providers could host analysis services that accept only encrypted indexes, changing how companies procure security reviews.

Load-bearing premise

An encrypted index of data and control flows contains enough information for static analysis to locate vulnerabilities at accuracy levels close to those of unencrypted tools.

What would settle it

Apply the tool to a PHP application containing a vulnerability known to be found by standard static analyzers and verify whether the encrypted approach misses it or reports substantially more false positives.
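
One way to run that check is to diff the two tools' reports as sets. The sketch below assumes each finding can be normalised to a (file, line, kind) triple; the `Finding` schema, the `compare` helper, and the sample reports are all hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    """One reported vulnerability (hypothetical schema)."""
    file: str
    line: int
    kind: str  # e.g. "XSS" or "SQLI"


def compare(baseline: set, confidential: set) -> dict:
    """Split findings into those missed, added, or shared by the confidential tool."""
    return {
        "missed": baseline - confidential,   # found only by the plaintext analyzer
        "extra": confidential - baseline,    # found only by the encrypted analysis
        "agreed": baseline & confidential,   # found by both
    }


# Made-up reports for illustration only; not results from the paper.
baseline = {Finding("login.php", 12, "XSS"), Finding("search.php", 40, "SQLI")}
confidential = {Finding("login.php", 12, "XSS"), Finding("index.php", 7, "XSS")}
result = compare(baseline, confidential)
```

A non-empty `missed` set on a vulnerability that standard analyzers are known to find would count against the premise. Entries in `extra` still need manual triage, since they may be false positives or genuine findings the baseline overlooked.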

Figures

Figures reproduced from arXiv: 2501.09191 by Bernardo Ferreira, David Dantas, Ibéria Medeiros, Jorge Martins, Rafael Ramires.

Figure 1: COCOA’s architecture.
Adjacent text (truncated): …first is executed on the developer’s side and corresponds to protocol Encrypt (and hence only involves the developer). In contrast, the second is executed on the analyser’s side and corresponds to protocols Authorise and Analyse (thus involving both the developer and analyser). Nonetheless, the solution is transparent for both sides, i.e., developers only need to submit their code and…

Figure 2: Example of executing COCOA for the XSS vulnerability detection analysis task in PHP.
Adjacent text (§4.2 ITL Translator, truncated): Given the LexToken stream outputted by the Lexer, the ITL Translator processes it to produce an ITL-token stream that maintains the logic, semantic, and data and control flows of the source code. The ITL must be simple and lightweight enough to enable static analysis and cryptographic techniques to be ca…

Figure 3: Storage space used by the source code and the encrypted data structure, in bytes.
Adjacent text (truncated): …less or equal to 7 KB, like Samate. Nonetheless, by adding DET and RND encryptions, the index size grew by some fraction due to the ciphertext expansion of these ciphers (e.g., 78 KB of source code to 49 KB of plaintext index and 188 KB of encrypted index in WackoPicko), and again by an order of magnitude with ORE (645 KB in…
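
The WackoPicko sizes quoted with Figure 3 can be turned into expansion ratios with a few lines of arithmetic. The sketch below uses only the three figures that survive intact in the excerpt (78 KB, 49 KB, 188 KB); the ORE figure is left out because the excerpt is cut off before it says which application it refers to. The constant and function names are illustrative.

```python
# Sizes quoted for WackoPicko in the Figure 3 excerpt, in KB.
SOURCE_KB = 78
PLAIN_INDEX_KB = 49
ENCRYPTED_INDEX_KB = 188  # index with DET and RND encryption applied


def ratio(numerator_kb: float, denominator_kb: float) -> float:
    """Size ratio rounded to two decimal places."""
    return round(numerator_kb / denominator_kb, 2)


index_vs_source = ratio(PLAIN_INDEX_KB, SOURCE_KB)             # 0.63
encrypted_vs_plain = ratio(ENCRYPTED_INDEX_KB, PLAIN_INDEX_KB)  # 3.84
encrypted_vs_source = ratio(ENCRYPTED_INDEX_KB, SOURCE_KB)      # 2.41
```

On these numbers the plaintext index is smaller than the source, and DET plus RND encryption expands it roughly fourfold, leaving the encrypted index at about 2.4 times the size of the original code.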
Original abstract

Software vulnerabilities continue to be the primary cause of cyberattacks. It is crucial to identify vulnerabilities in applications' source code before attackers gain access to them and exploit any vulnerability they may contain. Developers have used static analysis tools (SATs) to find vulnerabilities in unprotected application code, and software testing companies have started offering software code analysis as a service to assist developers in these findings. Such services require access to unprotected code, which raises concerns about its privacy and intellectual property theft. Attackers can also perform this analysis using similar tools, if they gain access to the code. It is, therefore, beneficial to have a system that can maintain code privacy by protecting it with cryptographic techniques, while still allowing authorised people to detect vulnerabilities in the encrypted code. This paper presents such a solution, a novel approach to Software Quality and Privacy that allows source code to be analysed in a protected manner, preserving its privacy. The proposed solution combines Static Analysis with Searchable Symmetric Encryption (SSE) for confidential vulnerability detection, enabling data and dependency tracking for data flow analysis over encrypted source code. The solution represents the code's data and control flows as an Encrypted Inverted Index, in a connected way that enables SSE's queries for vulnerability discovery. The solution was implemented as the CoCoA tool and evaluated with synthetic and real PHP web applications. Results show that CoCoA has similar precision as (non-confidential) SATs - 93% - with real applications, requiring only 209 ms to process 4k LoC - a modest overhead of 42.7% compared to a non-confidential baseline. This paper also defines a new research field - Confidential Code Analysis -, from which other types of code analysis tasks can be derived.
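
The abstract's timing figures imply a baseline and a throughput that it does not state. The sketch below derives both under two assumptions: that the 42.7% overhead applies to the 209 ms measurement, and that "4k LoC" means exactly 4000 lines. Neither is confirmed by the abstract, and the function names are illustrative.

```python
def overhead_percent(baseline_ms: float, confidential_ms: float) -> float:
    """Relative overhead of the confidential run over the baseline, in percent."""
    return (confidential_ms - baseline_ms) / baseline_ms * 100.0


def implied_baseline_ms(confidential_ms: float, overhead_pct: float) -> float:
    """Invert the overhead formula to recover the baseline time."""
    return confidential_ms / (1.0 + overhead_pct / 100.0)


def throughput_loc_per_s(lines_of_code: int, elapsed_ms: float) -> float:
    """Lines of code processed per second."""
    return lines_of_code / (elapsed_ms / 1000.0)


CONFIDENTIAL_MS = 209.0   # from the abstract
OVERHEAD_PCT = 42.7       # from the abstract
LINES_OF_CODE = 4000      # assumption: "4k LoC" read as exactly 4000

baseline_ms = implied_baseline_ms(CONFIDENTIAL_MS, OVERHEAD_PCT)   # about 146.5 ms
throughput = throughput_loc_per_s(LINES_OF_CODE, CONFIDENTIAL_MS)  # about 19,100 LoC/s
```

If the 42.7% is instead an average across several applications, as the summary above suggests, the implied baseline is only indicative.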

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CoCoA, which combines static code analysis with searchable symmetric encryption (SSE) to process source code and construct an encrypted inverted index encoding its data and control flows. This index enables vulnerability discovery via confidential static analysis tasks without exposing plaintext code. The approach is evaluated on synthetic and real PHP web applications, reporting precision comparable to standard (non-confidential) tools and an average overhead of 42.7%. The manuscript also introduces 'Confidential Code Analysis' as a new research field.

Significance. If the central construction is sound, the work would establish a practical foundation for privacy-preserving code analysis services, addressing IP and confidentiality concerns when outsourcing vulnerability detection. The reported overhead is modest enough to suggest deployability for PHP applications, and the framing of a new subfield could stimulate follow-on research in secure computation applied to software engineering tasks.

major comments (2)
  1. [Approach description (abstract and §3)] The manuscript states that the encrypted inverted index 'represents its data and control flows' and supports 'static analysis tasks in a confidential way,' but provides no description of how tokenization or indexing preserves ordering, context, or transitive relations needed for iterative analyses such as taint tracking or reachability queries. Standard static analysis operates on explicit graph structures (CFGs, PDGs); an inverted index supports only term lookups. If flows are flattened without these relations, queries necessarily approximate, undermining the 'similar precision' claim. This assumption is load-bearing for the headline result.
  2. [Abstract and Evaluation section] Abstract and evaluation claim 'similar precision' and 42.7% overhead on synthetic and real PHP apps, yet supply no baselines, no breakdown of false-positive/negative rates attributable to encryption, and no discussion of how the SSE representation affects flow accuracy. Without these, the experimental claims cannot be assessed or reproduced.
minor comments (2)
  1. [Introduction] The boundaries of the newly defined field 'Confidential Code Analysis' are stated at a high level; a short paragraph distinguishing it from existing work in secure multi-party computation or homomorphic encryption applied to code would improve clarity.
  2. [Approach] Notation for the inverted index construction (e.g., how control-flow edges are tokenized) is introduced without an accompanying figure or small worked example, making the transition from plaintext flows to searchable tokens difficult to follow.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below, clarifying the technical approach and committing to revisions that strengthen the description and evaluation without altering the core claims.

Point-by-point responses
  1. Referee: [Approach description (abstract and §3)] The manuscript states that the encrypted inverted index 'represents its data and control flows' and supports 'static analysis tasks in a confidential way,' but provides no description of how tokenization or indexing preserves ordering, context, or transitive relations needed for iterative analyses such as taint tracking or reachability queries. Standard static analysis operates on explicit graph structures (CFGs, PDGs); an inverted index supports only term lookups. If flows are flattened without these relations, queries necessarily approximate, undermining the 'similar precision' claim. This assumption is load-bearing for the headline result.

    Authors: We thank the referee for this observation. Our construction tokenizes source code into flow elements that include both term occurrences and explicit path/position annotations (e.g., control-flow edge labels and data-dependency identifiers) before encryption; SSE queries are then composed as sequences of lookups that reconstruct reachability and taint propagation without materializing the full graph in plaintext. This design choice is described at a high level in §3 but, as the referee correctly notes, lacks the formal tokenization rules and query-composition algorithm needed to verify preservation of ordering and transitivity. We will therefore expand §3 with a precise specification of the indexing procedure, including pseudocode for flow encoding and an example of how a taint query is realized via multiple SSE operations. This revision will make the soundness argument explicit while preserving the reported precision numbers. revision: yes
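
    The rebuttal's claim that a taint query can be composed from a sequence of lookups can be illustrated with a small sketch. It assumes data-dependency edges are stored under keyed-hash labels, so the index holder walks opaque labels rather than variable names. The edge list, the `label`, `build_flow_index`, and `reaches` helpers, and the absence of sanitizer handling are all simplifications for illustration; this is not the authors' tokenization or query-composition algorithm.

```python
import hashlib
import hmac
from collections import deque


def label(key: bytes, node: str) -> bytes:
    """Opaque label for a flow node (variable, source, or sink)."""
    return hmac.new(key, node.encode(), hashlib.sha256).digest()


def build_flow_index(key: bytes, edges: list) -> dict:
    """Key owner side: store each edge as label(src) -> [label(dst), ...]."""
    index = {}
    for src, dst in edges:
        index.setdefault(label(key, src), []).append(label(key, dst))
    return index


def reaches(index: dict, source_label: bytes, sink_label: bytes) -> bool:
    """Index holder side: breadth-first search, one index lookup per visited node."""
    seen = {source_label}
    queue = deque([source_label])
    while queue:
        current = queue.popleft()
        if current == sink_label:
            return True
        for nxt in index.get(current, []):  # a single lookup in the index
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical data-flow edges for a tiny PHP snippet (made-up example).
key = b"demo-key-not-for-real-use"
edges = [("$_GET", "$name"), ("$name", "$msg"), ("$msg", "echo"), ("$count", "$total")]
flow_index = build_flow_index(key, edges)

tainted = reaches(flow_index, label(key, "$_GET"), label(key, "echo"))  # True
clean = reaches(flow_index, label(key, "$count"), label(key, "echo"))   # False
```

    In this toy version the index holder still learns the shape of the flow graph, because equal labels are linkable. Whether CoCoA hides more than that depends on the specification the authors promise to add to §3.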

  2. Referee: [Abstract and Evaluation section] Abstract and evaluation claim 'similar precision' and 42.7% overhead on synthetic and real PHP apps, yet supply no baselines, no breakdown of false-positive/negative rates attributable to encryption, and no discussion of how the SSE representation affects flow accuracy. Without these, the experimental claims cannot be assessed or reproduced.

    Authors: We agree that the current evaluation section is insufficiently detailed for independent assessment. The 'similar precision' result was obtained by running CoCoA and a standard non-encrypted analyzer (PHPStan configured for the same vulnerability patterns) on identical inputs and comparing detected vulnerabilities; however, we did not report per-vulnerability confusion matrices, isolate the contribution of the SSE layer to any false positives/negatives, or provide the exact list of synthetic and real applications with their sizes. We will revise the evaluation section to include (1) a table of precision, recall, and F1 scores for both tools, (2) a breakdown of discrepancies attributable to encryption versus analysis approximations, and (3) a reproducibility subsection listing the benchmark programs, vulnerability categories, and hardware used for the 42.7% overhead measurement. These additions will directly address the referee's concerns. revision: yes
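
    The precision, recall, and F1 table promised here reduces to three formulas over confusion counts. The sketch below shows the computation with made-up counts; the `prf1` helper is hypothetical and the numbers are not results from the paper.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for illustration only.
precision, recall, f1 = prf1(tp=9, fp=1, fn=3)
# precision = 0.9, recall = 0.75, f1 is roughly 0.818
```

    Computing the same triple for both the confidential tool and the plaintext analyzer on identical inputs would show whether any gap comes from recall, which the headline precision figure alone cannot reveal.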

Circularity Check

0 steps flagged

No circularity: new construction with independent evaluation

Full rationale

The paper presents an original construction that combines static code analysis with searchable symmetric encryption to produce an encrypted inverted index encoding data and control flows, then evaluates the resulting CoCoA tool on synthetic and real PHP applications for precision and overhead. No equations, fitted parameters, or self-citation chains appear in the provided text that would reduce the central claim (comparable precision via the index) to its own inputs by definition. The result is framed as a new field definition rather than a re-derivation, and the reported metrics are empirical measurements, not predictions forced by prior fits or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone; the work relies on standard assumptions of searchable symmetric encryption security and the sufficiency of flow-based indices for static analysis.

pith-pipeline@v0.9.0 · 5751 in / 1078 out tokens · 22798 ms · 2026-05-23T04:53:20.609524+00:00 · methodology

