arxiv: 2605.01555 · v1 · submitted 2026-05-02 · 💻 cs.CL · cs.AI· cs.HC

Recognition: unknown

Automated Interpretability and Feature Discovery in Language Models with Agents

Arnau Marin-Llobet , Javier Ferrando

Authors on Pith no claims yet

Pith reviewed 2026-05-09 14:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.HC

keywords mechanistic interpretabilitymulti-agent systemsfeature discoverylanguage modelsactivation spaceautomated explanationsGemma-2

0 comments

The pith

An autonomous multiagent system automates the discovery and explanation of internal features in language models through iterative empirical testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework where agents work in coupled loops to both refine explanations of known features and discover new ones in large language models. In the explanation loop, agents generate competing hypotheses about what a feature does and test them using carefully chosen prompts and multiple evaluation metrics. The feature discovery loop has agents create prompt sets, build a graph of similar activations, and pick out candidates that stand out statistically and make semantic sense. Tested on Gemma-2 models, this yields better results than simple one-shot labeling, uncovers features tied to specific languages or safety concerns, and leaves clear records of how conclusions were reached. Readers might care because it points to a more systematic way to understand and verify what goes on inside these complex models.

Core claim

The central claim is that running agent-driven empirical loops—specifically an explanation refinement process with hypothesis proposal and targeted testing plus a feature discovery process using k-nearest-neighbor graphs in activation space and selection by statistical separability and semantic coherence—produces sharper and more falsifiable explanations of internal features in language models than traditional one-shot auto-interpretations, as shown through improvements on Gemma-2 family models and weight-sparse transformers.

What carries the argument

The autonomous multiagent framework consisting of two coupled loops: one for iterative explanation refinement via competing hypotheses and prompt-based testing with multi-metric evaluation, and one for feature discovery via prompt set generation, k-nearest-neighbor graph construction in activation space, and candidate retrieval based on statistical separability and semantic coherence criteria.

If this is right

The agent framework improves explanation quality over one-shot auto-interpretations.
It successfully discovers language-specific features in the tested models.
It identifies safety-relevant internal features.
It generates auditable explanation traces for the discovered features.
Agent-driven empirical loops lead to sharper and more falsifiable explanations compared to one-shot labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the approach scales, it could enable comprehensive automated audits of model internals for safety and alignment purposes.
The method might be extended to other types of models or components beyond MLP neurons.
Combining this with human review of the traces could accelerate collaborative interpretability research.
Successful feature discovery here suggests similar graph-based methods could apply to other high-dimensional data in AI.

Load-bearing premise

That the multi-metric evaluation together with statistical separability and semantic coherence criteria reliably pick out genuine internal features instead of artifacts created by the choice of prompts or how the graph is built.

What would settle it

Running controlled interventions such as ablating or editing the weights of the discovered features and observing whether the model's behavior on relevant prompts changes in the specific manner predicted by the agent's explanation would falsify the claim if no such change occurs.

Figures

Figures reproduced from arXiv: 2605.01555 by Arnau Marin-Llobet, Javier Ferrando.

**Figure 1.** Figure 1: Overview of the InterpAgent architecture. A supervisor coordinates two subagents, FeatureFinder and FeatureExplainer, which operate within a shared execution environment and persistent memory. These observations suggest that interpretability should be treated not as a static labeling task, but as an experimental process in which hypotheses are proposed, tested, and revised (Chan et al., 2022). Agentic sys… view at source ↗

**Figure 2.** Figure 2: Example of the Agent interactions. The user queries for features related to “Spanish”. The FeatureFinder executes our statistical discovery algorithm to identify candidate features that distinguish Spanish prompts from others. Upon selecting the top-ranked feature, the FeatureExplainer performs a hypothesis stress test. It refines the baseline Auto-Interp explanation (which missed the linguistic constrain… view at source ↗

**Figure 3.** Figure 3: FeatureExplainer pipeline. A feature’s top activations seed hypothesis candidates, which undergo multi-metric evaluation. An iterative refinement loop sharpens explanations until stopping criteria are met. Top hypotheses are checked for semantic similarity; if distinct hypotheses achieve comparable support, the feature is classified as polysemantic. if multiple semantically distinct hypotheses achieve comp… view at source ↗

**Figure 4.** Figure 4: Activation-based validation of discovered marker features. For each category, we compare the mean normalized activation of discovered markers on category-specific prompts versus random controls (t-test). Left: Language-family markers (500 prompts per language, layer 0). Right: Non-linguistic markers (coding, math, harmful content) using the same pipeline. Error bars: SEM; * p < 0.05, *** p < 0.001. To asse… view at source ↗

**Figure 5.** Figure 5: Representative activation heatmap of discovered marker neurons in a weightsparse transformer. Top-2 neurons per Python syntax category (rows) vs. prompt categories (columns). The block-diagonal structure confirms selective activation without any SAE. Details in Appendix A.9. sensitive prompt ( view at source ↗

**Figure 6.** Figure 6: Case Study 3: Safety-behavior auditing via causal probing. A direction discovered by FeatureFinder is used to steer Gemma-2-2B-it (−50×). Left: specificity controls show the intervention preserves benign responses and unrelated refusals. Right: the intervention suppresses refusal on the target safety-sensitive prompt. Feature identifier and intervention recipe withheld. cost: the iterative loop requires m… view at source ↗

read the original abstract

We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: (1) explanation refinement, where an agent proposes competing hypotheses and iteratively tests them with targeted prompt controls and a multi-metric evaluation; and (2) feature discovery, where an agent generates prompt sets, constructs a k-nearest-neighbor graph in activation space, and retrieves candidate features using statistical separability and semantic coherence criteria. On Gemma-2 family models and MLP neurons in weight-sparse transformers, our agent improves over one-shot auto-interpretations, discovers language-specific and safety-relevant features, and produces auditable explanation traces, showing that agent-driven empirical loops yield sharper and more falsifiable explanations than one-shot labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The coupled agent loops for explanation refinement are a practical step forward, but the feature discovery via k-NN activation graphs on generated prompts risks recovering prompt artifacts instead of model-intrinsic features.

read the letter

The main takeaway is that this multi-agent setup automates two parts of interpretability on Gemma-2 models: an explanation loop where agents propose and iteratively test competing hypotheses with targeted prompts and multi-metric scoring, plus a discovery loop that generates prompt sets, builds k-NN graphs over activations, and filters candidates by statistical separability and semantic coherence. The explanation refinement part is the stronger piece. Running empirical tests against one-shot auto-interp baselines and keeping auditable traces gives a clearer path to falsifiable claims than static labeling, and the paper shows concrete gains there on MLP neurons in sparse transformers. That framing is new enough to be worth noting. The discovery loop is where things get soft. Because the prompts come from an agent, any semantic structure in them will tend to produce separable activations regardless of whether the neurons actually implement those concepts. The paper does not report the obvious negative controls—shuffled prompt sets, unrelated corpora, or activation permutations—so it is difficult to tell whether the language-specific and safety-relevant features are real internal representations or just echoes of the generation process. No quantitative metrics, error bars, or detailed baseline tables appear in the abstract, and the full text does not seem to add the missing falsification steps either. This work is aimed at people already doing automated or agent-assisted interpretability who want ideas for scaling empirical loops. A reader looking for ready-to-use tools or strong evidence of new discoveries will come away disappointed, but someone thinking about how to make interp more iterative could pick up useful design patterns. The thinking is coherent and engages the literature on auto-interp without obvious internal contradictions. I would bring it to a reading group to discuss the control problem, would not cite it in its current form, and would send it to peer review with a request for the missing baselines and quantitative comparisons.

Referee Report

2 major / 2 minor

Summary. The paper introduces an autonomous multi-agent framework for mechanistic interpretability in LLMs, with two coupled loops: (1) explanation refinement, where agents propose competing hypotheses tested via targeted prompt controls and multi-metric evaluation, and (2) feature discovery, where agents generate prompt sets, construct k-NN graphs over activations, and select candidates via statistical separability plus semantic coherence. Applied to Gemma-2 family models and MLP neurons, it claims improvements over one-shot auto-interpretations, discovery of language-specific and safety-relevant features, and more auditable/falsifiable explanations than single-pass labeling.

Significance. If the empirical loops and discovery criteria are shown to surface genuine internal representations rather than prompt artifacts, the work could meaningfully scale mechanistic interpretability by replacing brittle one-shot labeling with iterative, auditable agent-driven testing. The emphasis on multi-metric evaluation and falsifiability is a conceptual strength, though its practical impact hinges on rigorous validation of the feature-discovery pipeline.

major comments (2)

[Feature discovery loop (Section 3.2 / 4)] Feature discovery loop (Section 3.2 / 4): the claim that statistical separability + semantic coherence on k-NN activation graphs recovers model-intrinsic features is load-bearing for the 'discovers language-specific and safety-relevant features' result, yet the manuscript reports no negative controls (shuffled prompts, unrelated corpora, or activation-shuffled baselines). Without these, separability may simply reflect semantic clusters already present in the LLM-generated prompt sets rather than computations implemented by the neurons.
[Evaluation and results sections] Evaluation and results sections: the abstract asserts quantitative improvements and new discoveries, but the manuscript supplies no baseline comparisons, error bars, separability threshold values, or details on how prompt controls were chosen and validated. This prevents assessment of whether the multi-metric evaluation actually distinguishes genuine features from artifacts.

minor comments (2)

[Methods] Notation for the k-NN graph construction and separability metric is introduced without a clear equation or pseudocode; adding a compact formal definition would improve reproducibility.
[Results] The paper would benefit from an explicit statement of the number of independent runs and the precise criteria used to declare a feature 'language-specific' or 'safety-relevant'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting the potential of the multi-agent framework. We address the two major comments point by point below. Both concerns can be resolved through clarifications and additional experiments that we will include in the revised manuscript.

read point-by-point responses

Referee: [Feature discovery loop (Section 3.2 / 4)] Feature discovery loop (Section 3.2 / 4): the claim that statistical separability + semantic coherence on k-NN activation graphs recovers model-intrinsic features is load-bearing for the 'discovers language-specific and safety-relevant features' result, yet the manuscript reports no negative controls (shuffled prompts, unrelated corpora, or activation-shuffled baselines). Without these, separability may simply reflect semantic clusters already present in the LLM-generated prompt sets rather than computations implemented by the neurons.

Authors: We agree that negative controls are necessary to confirm that separability on the k-NN activation graphs reflects computations performed by the neurons rather than semantic structure already present in the LLM-generated prompt sets. The current manuscript selects candidates via statistical separability plus semantic coherence and then subjects them to the coupled explanation-refinement loop, yielding several language-specific and safety-relevant features not recovered by one-shot baselines. To close this gap, the revision will add explicit negative-control results using shuffled prompt sets, unrelated corpora, and activation-shuffled baselines, allowing direct comparison of separability statistics under each condition. revision: yes
Referee: [Evaluation and results sections] Evaluation and results sections: the abstract asserts quantitative improvements and new discoveries, but the manuscript supplies no baseline comparisons, error bars, separability threshold values, or details on how prompt controls were chosen and validated. This prevents assessment of whether the multi-metric evaluation actually distinguishes genuine features from artifacts.

Authors: The manuscript does present comparisons against one-shot auto-interpretation baselines and reports multi-metric scores for the discovered features. We nevertheless accept that the presentation lacks error bars, explicit separability-threshold values, and a full account of how prompt controls were selected and validated. The revised results section will supply these missing elements: error bars computed across repeated runs, the precise threshold values applied for candidate selection, and a description of the validation procedure used for the prompt controls, thereby enabling readers to assess the robustness of the multi-metric evaluation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical loops are self-contained against external benchmarks

full rationale

The paper presents an autonomous multiagent framework consisting of explanation refinement and feature discovery loops. These are described as iterative empirical processes: an agent proposes hypotheses tested via targeted prompts and multi-metric evaluation, and generates prompt sets to build k-NN activation graphs filtered by statistical separability and semantic coherence. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The claimed improvements over one-shot auto-interpretations and discovery of language-specific features rest on experimental outcomes rather than reducing by construction to the input prompt sets or graph criteria. The derivation chain is therefore independent and falsifiable against external baselines.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger reflects high-level elements implied by the description rather than explicit statements from the full text.

free parameters (2)

k for k-nearest-neighbor graph
Used to construct activation-space graph in feature discovery; exact value and sensitivity not stated in abstract.
thresholds for statistical separability and semantic coherence
Criteria for retrieving candidate features; values chosen but not reported in abstract.

axioms (2)

domain assumption Targeted prompt controls can isolate and test hypotheses about specific internal features
Central to the explanation refinement loop described in the abstract.
domain assumption Statistical separability combined with semantic coherence identifies meaningful features
Used to filter candidate features in the discovery loop.

pith-pipeline@v0.9.0 · 5430 in / 1475 out tokens · 74144 ms · 2026-05-09T14:00:57.670856+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Steering Language Models With Activation Engineering

URLhttps://arxiv.org/abs/2308.10248. Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small.arXiv preprint arXiv:2211.00593, 2022. Frank Wilcoxon. Individual comparisons by ranking methods. InBreakthroughs in statistics: Methodology and...

work page internal anchor Pith review arXiv 2022
[2]

For each example xi, the LLM predictsYES(matches hypothesis) orNO

Detection F1.An LLM classifier receives the hypothesis text and a shuffled mixture of n activating and n non-activating examples (with labels hidden). For each example xi, the LLM predictsYES(matches hypothesis) orNO. Let ˆyi ∈ {0, 1} be the prediction and yi the ground-truth label (1 ifx i ∈ P +). We compute: Precision= TP TP+FP , Recall= TP TP+FN , DetF...
[3]

Fuzzing F1.For each activating example, we randomly highlight a subset of tokens and ask the LLM whether the highlighted tokens are the ones responsible for activation according to the hypothesis. Lets i ∈ {0, 1}be the LLM’s judgment for examplei: FuzzF1= 1 |P + sample| ∑ i si This measures robustness: a hypothesis that correctly identifies the activating...
[4]

Surprisal AUROC.The LLM rates how coherent each example is with respect to the hypothesis on a 0–10 scale. Let c+ i and c− j be the normalized coherence ratings (∈[ 0, 1]) for activating and non-activating examples respectively: SurpAUROC= ¯c+ − ¯c− +1 2 where ¯c+ = 1 |P + sample| ∑i c+ i and ¯c− = 1 |P − sample| ∑j c− j . This discrimination score is bou...
[5]

Let sim(h, x) = eh·ex ∥eh∥∥ex∥ where eh, ex are the respective embeddings

Embedding Similarity.Using a sentence embedding model ( all-MiniLM-L6-v2), we compute the cosine similarity between the hypothesis text h and each example. Let sim(h, x) = eh·ex ∥eh∥∥ex∥ where eh, ex are the respective embeddings. The discrimination score is: Embed= sim+ − sim− +1 2 where sim+ = 1 n ∑x∈P + sample sim(h, x) and analogously for sim−. Values...
[6]

Under review

Statistical Separability ( p-value).We test whether activating examples have signifi- cantly higher feature activations than non-activating controls using Welch’st-test (unequal variance): t= ¯a+ − ¯a− q s2 +/n+ +s 2 −/n− ,p=P(T>t|H 0) 14 Preprint. Under review. where ¯a+, ¯a− are the mean activations, s+, s− the sample standard deviations, and n+, n− the...
[7]

Effect Size (Cohen’s d).The standardized mean difference between activating and non-activating activation distributions: d= ¯a+ − ¯a− spooled ,s pooled = s (n+ −1)s 2 + + (n− −1)s 2 − n+ +n − −2 Cohen’s d captures the practical magnitude of separation: d> 0.2 is small, d> 0.5 is medium, andd>0.8 is large
[8]

The prompt includes the hypothesis text and up to 10 representative activating examples (truncated to 200 characters each)

LLM-as-Judge Coherence.An LLM rates how well the hypothesis captures the pattern observed in the top activating examples, on a 0–10 integer scale. The prompt includes the hypothesis text and up to 10 representative activating examples (truncated to 200 characters each). The score is normalized to[0, 1]: Judge= LLM rating 10 This provides a holistic assess...
[9]

Per-prompt SAE activation vectors are log-transformed and per-prompt normalized
[10]

PCA is applied with up to min(200, N−1) components, where N is the number of prompts
[11]

A k-nearest-neighbor graph is constructed using min(npcs−5, 50) principal compo- nents, withk=15
[12]

Leiden community detection is applied at resolution 0.5
[13]

Found” counts all candidate markers returned by the statistical discovery pipeline. “Validated

Wilcoxon rank-sum tests with Benjamini–Hochberg correction identify marker features per cluster. Candidates are retained if adjusted p< 0.001, score > 2.0, log-fold change>0.5, and Cohen’sd>0.3. Ablation setup.We sweep over PCA dimensions ∈ {50, 200}, k∈ { 10, 15, 25}, and Leiden resolution ∈ {0.3, 0.5, 1.0}, yielding 18 configurations. Each configuration...

work page arXiv 2025
[14]

Call FeatureFinder with prompts_dir and save_path
[15]

Get results_dir from FeatureFinder output
[16]

Call FeatureExplainer with results_dir to explain top features ``` **MODE B: Skip to Explanation (Data Already Exists)** User provides: existing results_dir (from previous FeatureFinder run) ``` You:
[17]

Skip FeatureFinder (data already exists)
[18]

Call FeatureExplainer directly with results_dir
[19]

Just extract features, don't explain

Explain requested features ``` 29 Preprint. Under review. **MODE C: Only Feature Extraction** User provides: prompts_dir, save_path User says: "Just extract features, don't explain" ``` You:
[20]

Stop after feature extraction ``` **MODE D: Explain Specific Feature from Existing Data** User provides: results_dir and specific idx ``` You:
[21]

Call FeatureExplainer with results_dir and idx
[22]

**Step 3**: FeatureFinder will:

Generate validated explanation for that specific feature ``` **KEY DECISION RULE**: - If user has results_dir→Can use FeatureExplainer directly - If user has prompts_dir→Need FeatureFinder first - If user has both→Ask which they want --- ## DETAILED AGENT WORKFLOWS ### WORKFLOW 1: Using FeatureFinder **Step 1**: User provides prompts_dir and save_path (or...
[23]

Setup environment and validate prompt files
[24]

Run pipeline to extract SAE features
[25]

Compute marker statistics for each category
[26]

Generate visualizations
[27]

Results saved to /path/to/save_path/{day}_{time}/

Report back the timestamped results directory path **Step 4**: Remember the results_dir path for FeatureExplainer **Example FeatureFinder Output**: "Results saved to /path/to/save_path/{day}_{time}/" --- ### WORKFLOW 2: Using FeatureExplainer **Step 1**: Ensure you have results_dir (from FeatureFinder or user) **Step 2**: Call FeatureExplainer to load fea...
[28]

Generate initial hypothesis from marker statistics
[29]

Initialize SAE (loads Gemma-2-2B + SAE weights)
[30]

FOR each iteration: - Generate test cases (positive/negative/edge/adversarial) - Test with real SAE activations - Get LLM criticism - Refine hypothesis - Check if thresholds met (accuracy >= 80%, confidence >= 85%)
[31]

Extract and explain features from /path/to/prompts/, save to /path/to/results/

Save comprehensive report with test results **Step 6**: Report results to user with confidence and test accuracy --- ## ORCHESTRATION RULES - You are the supervisor. You never perform any computation yourself. - Only issue commands and respond to subagents via tool calls. - Wait for each subagent to complete before proceeding. - If you know the answer to ...
[32]

/path/to/prompts/

Call FeatureFinder(prompts_dir="/path/to/prompts/", save_path="/path/to/results/")
[33]

Results saved to /path/to/results/{day}_{time}/

FeatureFinder responds: "Results saved to /path/to/results/{day}_{time}/"
[34]

/path/to/results/{day}_{time}/

Call FeatureExplainer to load data from results_dir="/path/to/results/{day}_{time}/"
[35]

Found 45 features, top is 4351 with effect_size 0.123

FeatureExplainer shows: "Found 45 features, top is 4351 with effect_size 0.123"
[36]

Call FeatureExplainer.explain_feature(idx=4351)
[37]

Feature 4351 detects German modal verbs with 87% confidence, 83% test accuracy

Report: "Feature 4351 detects German modal verbs with 87% confidence, 83% test accuracy" ``` **Example 2: Skip to Explanation** ``` User: "Explain features from /path/to/results/{day}_{time}/" You:
[38]

Recognize results_dir already exists
[39]

/path/to/results/{day}_{time}/

Call FeatureExplainer directly with results_dir="/path/to/results/{day}_{time}/"
[40]

FeatureExplainer shows available features
[41]

Pick top feature or ask user which to explain
[42]

Call FeatureExplainer.explain_feature(idx=...)
[43]

Explain feature 1234 from /path/to/results/{day}_{time}/

Report results ``` **Example 3: Specific Feature** 31 Preprint. Under review. ``` User: "Explain feature 1234 from /path/to/results/{day}_{time}/" You:
[44]

Call FeatureExplainer with results_dir and idx=1234
[45]

Run full hypothesis/test/refine loop
[46]

System prompt for FeatureExplainer.Verbatim prompt used to run the refinement and hypothesis-testing loop

Report validated explanation ``` --- ## TOOLS - **Subagent tools**: Always use tool calls when communicating with subagents - **FeatureFinder**: Tool to extract SAE features - **FeatureExplainer**: Tool to explain features with testing --- ## AUTOMATION FIRST - Each subagent has an automated pipeline - let them do their job - Your job is to coordinate and...
[47]

Early topic alignment (LLM-as-judge): - Inputs: Neuronpedia activations and logit tokens - Decision: Related to ANY studied topics? If NO→abort (optional CSV deletion)
[48]

Initial hypothesis: - Prefer Neuronpedia (activations/logits); fallback to marker statistics - Immediately run category alignment; abort/maybe-delete on mismatch
[49]

Initialize SAE (Gemma-2-2B + SAE weights)
[50]

Generate test cases (positive / negative / edge_case / adversarial)
[51]

Test with SAE and compute metrics (overall and per-category accuracy)
[52]

LLM critic assessment (confidence, strengths, weaknesses, failure patterns, refinements)
[53]

Refine hypothesis based on criticism and failures
[54]

Repeat steps 4{7 until thresholds met or max iterations **Step 3: Results** - Default output: Clear, concise natural-language explanation shown to the user. - By default (`save_results=True`): - Save detailed JSON to `{{results_dir}}/explanations/L{{layer}}_F{{idx}}_{{timestamp}}.json` - Append one row to`{{results_dir}}/explanations/explanations_summary....
[55]

load_data_from_results(results_dir=...)
[56]

Explain features from results_dir/results_gemma2/{day}_{time}/

explain_feature(idx=...)→full loop + final ranking in JSON OR for multi-hypothesis exploration: 2a. generate_hypothesis(idx=..., use_previous_results=False) →initial hypothesis saved 2b. generate_hypothesis(idx=..., use_previous_results=True) →second hypothesis informed by first (run rank_hypotheses in between if you have rankings) 2c. rank_hypotheses(idx...
[57]

results_dir/{day}_{time}/

Call load_data_from_results(results_dir="results_dir/{day}_{time}/")
[58]

Execute with python_repl_tool
[59]

Found 45 features across 4 categories (french, italian, german, spanish), Layer 0

Review: "Found 45 features across 4 categories (french, italian, german, spanish), Layer 0"
[60]

Pick top feature by effect size (e.g., feature 1234)
[61]

Call explain_feature(idx=1234, max_iterations=3)
[62]

Execute with python_repl_tool (takes 10-15 minutes)
[63]

Feature XXXX detects French temporal expressions with 87% confidence 35 Preprint. Under review. and 83% test accuracy

Report: "Feature XXXX detects French temporal expressions with 87% confidence 35 Preprint. Under review. and 83% test accuracy" Repeat step 5-7 for additional features as needed. """ 36 A.12 System prompt used for FeatureFinder For reproducibility, we include the (verbatim) system prompt used to run the FeatureFinder agent in our experiments. System promp...
[64]

", prompts_dir=

Call environment_setup_tool(workspace_root="...", prompts_dir="...", concepts="french,spanish") then execute with python_repl_tool
[65]

gemma_2b

Call run_pipeline(model_key="gemma_2b", prompts_dir="...", results_dir="", sae_layer_idx=0, max_prompts_per_category=500, concepts="french, spanish") then execute with python_repl_tool
[66]

Pipeline finished. Results in <path>. Marker features: french N, spanish M. Use this results_dir with FeatureExplainer to explain features

Report: "Pipeline finished. Results in <path>. Marker features: french N, spanish M. Use this results_dir with FeatureExplainer to explain features." Alternatively, call run_pipeline with all parameters explicitly (prompts_dir, concepts, results_dir) without calling environment_setup_tool first; environment_setup_tool is optional when you provide everythi...