Are Sparse Autoencoders Useful for Java Function Bug Detection?

Andre Catarino; Claudia Mamede; Henrique Lopes Cardoso; Rui Abreu; Rui Melo

arXiv: 2505.10375 · v4 · submitted 2025-05-15 · 💻 cs.SE · cs.AI · cs.LG

Rui Melo, Claudia Mamede, Andre Catarino, Rui Abreu, Henrique Lopes Cardoso

Pith reviewed 2026-05-22 15:02 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG

keywords sparse autoencoders · bug detection · Java code · LLM representations · software vulnerabilities · unsupervised feature extraction · code security

The pith

Sparse autoencoders extract features from pretrained LLM internals that detect bugs in Java functions at up to 89% F1 without fine-tuning or supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether sparse autoencoders can turn the hidden activations of GPT-2 Small and Gemma 2B into usable signals for spotting buggy Java code. It shows these features achieve high detection performance while leaving the original LLM untouched. A sympathetic reader would care because this points to a lighter, more inspectable way to use existing language models for security tasks instead of retraining them end to end. The work is framed as the first direct evidence that unsupervised feature extraction from LLM representations can outperform fine-tuned transformer encoder baselines on this problem.

Core claim

Sparse autoencoders trained on the internal representations of pretrained LLMs isolate features that highlight buggy behavior in Java functions, enabling bug detection with an F1 score of up to 89 percent while consistently beating fine-tuned transformer encoder baselines and requiring no task-specific supervision or fine-tuning of the underlying models.
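
For readers calibrating the 89 percent figure, here is a minimal sketch of how an F1 score is computed from confusion counts. The function name and the counts are illustrative assumptions, not values reported in the paper.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; true negatives do not enter."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)  # share of flagged functions that are buggy
    recall = true_pos / (true_pos + false_neg)     # share of buggy functions that are flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts chosen only to illustrate the arithmetic.
score = f1_score(true_pos=89, false_pos=11, false_neg=11)
print(round(score, 2))
```

Because true negatives are absent from the formula, an F1 value alone does not show how a detector behaves on clean functions, so the class balance of the dataset matters when comparing against baselines.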

What carries the argument

Sparse autoencoders applied to the activations of pretrained LLMs, which isolate a small set of interpretable features that correlate with the presence of bugs in code.
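
A minimal sketch of the kind of sparse autoencoder this refers to, assuming the common formulation with a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The text above does not give the paper's architecture, dictionary width, or penalty, so every name, shape, and coefficient below is hypothetical.

```python
def sae_forward(activation, w_enc, b_enc, w_dec, b_dec):
    """Encode one activation vector into sparse features, then reconstruct it.

    w_enc and w_dec are lists of n_features rows, each of length d_model.
    """
    # Encoder: affine map followed by ReLU, so features are non-negative.
    features = [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(w_enc, b_enc)
    ]
    # Decoder: weighted sum of per-feature directions plus a bias.
    recon = [
        sum(w_dec[j][i] * features[j] for j in range(len(features))) + b_dec[i]
        for i in range(len(activation))
    ]
    return features, recon


def sae_loss(activation, features, recon, l1_coeff):
    """Reconstruction error plus an L1 term that pushes features toward zero."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1


# Toy example: a 2-dimensional "activation" and a 3-feature dictionary.
w = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feats, recon = sae_forward([2.0, -3.0], w, [0.0, 0.0, 0.0], w, [0.0, 0.0])
loss = sae_loss([2.0, -3.0], feats, recon, l1_coeff=0.1)
print(feats, recon, loss)
```

In this toy case only one of the three features fires. That sparsity is what makes individual features candidates for inspection, although whether any of them tracks bugs is an empirical question.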

If this is right

Bug detection becomes possible directly from frozen LLM representations rather than requiring task-specific fine-tuning.
The extracted SAE features offer a more interpretable route to understanding what the model sees as buggy behavior.
The same approach could reduce computational cost and data needs for other code-analysis tasks that currently rely on supervised fine-tuning.
Security tools could incorporate these features as a lightweight check alongside traditional static analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the signals are robust, similar SAE-based probes could be applied to other programming languages or to different defect types such as security vulnerabilities.
The method might eventually support debugging by pointing to which specific features in the code trigger the bug flag.
Adoption would shift emphasis from retraining large models toward post-hoc analysis of their existing internal states.

Load-bearing premise

The hidden states inside pretrained language models already carry detectable information about whether a Java function contains a bug, and sparse autoencoders can pull that information out without any labeled examples or model updates.

What would settle it

An experiment that trains sparse autoencoders on the same LLM activations but finds no better-than-chance correlation between the resulting features and independently verified bug labels on a fresh Java dataset would falsify the central claim.
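
One way to make "better than chance" concrete is a label-permutation test on a single feature. The sketch below assumes one scalar activation per function and binary bug labels. The data, function names, and choice of test statistic are illustrative assumptions and are not taken from the paper.

```python
import random


def mean_gap(values, labels):
    """Absolute difference in mean activation between buggy (1) and clean (0)."""
    buggy = [v for v, y in zip(values, labels) if y == 1]
    clean = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(buggy) / len(buggy) - sum(clean) / len(clean))


def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Estimate how often shuffled labels give a gap at least as large as observed."""
    rng = random.Random(seed)
    observed = mean_gap(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real link between feature and label
        if mean_gap(values, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up activations for eight functions, four labeled buggy.
feature_activation = [0.90, 0.80, 0.95, 0.85, 0.10, 0.20, 0.15, 0.05]
bug_label = [1, 1, 1, 1, 0, 0, 0, 0]
p_value = permutation_p_value(feature_activation, bug_label)
print(p_value)
```

A large p-value on held-out data would count against the claim for that feature. When many features are tested, the p-values need a multiple-comparison correction before any single one is read as evidence.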

Original abstract

Software vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoder offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions. We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour without fine-tuning the underlying LLMs. We found that SAE-derived features enable bug detection with an F1 score of up to 89%, consistently outperforming fine-tuned transformer encoder baselines. Our work provides the first empirical evidence that SAEs can be used to detect software bugs directly from the internal representations of pretrained LLMs, without any fine-tuning or task-specific supervision. Code available at https://github.com/rufimelo99/SAE-Java-Bug-Detection

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract claims SAEs on GPT-2 and Gemma activations can detect Java bugs unsupervised at up to 89% F1 and beat fine-tuned baselines, but missing methods leave the result uncheckable.

This paper's core idea is using sparse autoencoders to pull out bug-related signals from the internal activations of models like GPT-2 Small and Gemma 2B for detecting issues in Java functions. They report F1 scores reaching 89% in an unsupervised setup that doesn't require fine-tuning the base LLMs, and they say it beats some fine-tuned transformer encoders.

On the positive side, the approach tries to leverage existing interpretability techniques for a real-world problem in code security. It avoids the usual costs of retraining large models and aims for something more transparent. Making the code available on GitHub is a good step that could let others test the claims.

The weaknesses stand out clearly because we only have the abstract. There's no information about the size or source of the Java function dataset, how buggy samples were identified or labeled, which specific layers or positions in the models they extracted activations from, the exact SAE architecture and training parameters like sparsity level, or how they selected and used the features for the final bug detection without supervision. These gaps make it hard to trust the numbers or understand why it supposedly works better than the baselines.

People focused on applying mechanistic interpretability to software engineering tasks might get some value from seeing this direction explored. It could spark ideas for similar uses in other languages or bug types. Overall, the work is too preliminary for peer review right now. The authors should expand the methods and results sections with full details before it gets sent to referees.

Referee Report

1 major / 1 minor

Summary. The manuscript explores whether Sparse Autoencoders (SAEs) can be applied to internal representations of pretrained LLMs (GPT-2 Small and Gemma 2B) to detect bugs in Java functions. It reports that SAE-derived features achieve an F1 score of up to 89% for this task, consistently outperforming fine-tuned transformer encoder baselines, and presents this as the first empirical evidence that SAEs can detect software bugs directly from LLM activations without fine-tuning or task-specific supervision. Public code is referenced via a GitHub link.

Significance. If the results hold upon full verification, the work would indicate that SAEs can isolate unsupervised signals of buggy behavior within LLM representations, offering a potentially lightweight and interpretable alternative to fine-tuning for vulnerability detection in code. This could contribute to mechanistic interpretability applications in software engineering. The provision of public code is a positive element supporting reproducibility.

major comments (1)

[Abstract] Abstract: The headline result of an F1 score up to 89% from unsupervised SAE features, along with outperformance over fine-tuned baselines, cannot be assessed because the abstract supplies no details on the Java function dataset (including size and how buggy vs. clean labels were assigned), the specific layers or tokens used for activation extraction, SAE training hyperparameters such as sparsity and width, or the post-hoc unsupervised procedure that maps selected features to bug predictions. These omissions are load-bearing for the central claim that SAEs isolate buggy-behavior signals in a fully unsupervised manner.

minor comments (1)

[Abstract] Abstract: The phrasing 'Sparse Autoencoder offer' contains a subject-verb agreement error; 'Autoencoders offer' would be consistent with the title and standard usage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive feedback. We agree that the abstract should be more self-contained to allow readers to evaluate the central claims without immediately consulting the full text. We address the comment below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract: The headline result of an F1 score up to 89% from unsupervised SAE features, along with outperformance over fine-tuned baselines, cannot be assessed because the abstract supplies no details on the Java function dataset (including size and how buggy vs. clean labels were assigned), the specific layers or tokens used for activation extraction, SAE training hyperparameters such as sparsity and width, or the post-hoc unsupervised procedure that maps selected features to bug predictions. These omissions are load-bearing for the central claim that SAEs isolate buggy-behavior signals in a fully unsupervised manner.

Authors: We agree that these details are important for assessing the unsupervised nature of the approach. In the revised manuscript we will expand the abstract to include: (1) a brief description of the Java function dataset, its size, and the labeling procedure for buggy versus clean functions; (2) the specific layers and token positions used for activation extraction from GPT-2 Small and Gemma 2B; (3) the main SAE training hyperparameters, including sparsity level and dictionary width; and (4) a concise outline of the post-hoc unsupervised feature selection and mapping procedure that produces the bug predictions. These elements are already described in Sections 3 and 4 of the full manuscript; summarizing them in the abstract will make the central claim more transparent while preserving the abstract's length constraints.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation with no derivations or self-referential reductions

The paper is an empirical study reporting experimental results on applying SAEs to LLM activations for Java bug detection, with a claimed F1 of up to 89% outperforming baselines. The abstract contains no equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations. The central claim rests on observable performance metrics from a public codebase rather than any self-definitional structure, ansatz smuggling, or uniqueness theorem imported from prior author work. This is a standard empirical evaluation self-contained against external benchmarks and code reproduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of concrete free parameters, axioms, or invented entities. The approach implicitly relies on standard assumptions for training sparse autoencoders on neural activations (such as choice of sparsity penalty and dictionary size), but none are specified or justified in the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions... SAE-derived features enable bug detection with an F1 score of up to 89%

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
cs.CR · 2026-04 · unverdicted · novelty 6.0

SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.