Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense

Jiarui Li; Pengtao Xie; Qi Cao; Ruiyi Zhang; Shuhao Zhang

arxiv: 2605.30837 · v1 · pith:VUHD7PHXnew · submitted 2026-05-29 · 💻 cs.CR · cs.LG

Shuhao Zhang, Jiarui Li, Qi Cao, Ruiyi Zhang, Pengtao Xie

Pith reviewed 2026-06-28 22:15 UTC · model grok-4.3

keywords: prompt injection defense · adaptive detector allocation · pre-hoc reasoning · outcome prediction · safety-utility trade-off · SCOUT-450 benchmark · uncertainty-aware triage · detector heterogeneity

The pith

SCOUT allocates prompt-injection detectors per request by predicting each one's reliability and latency from past similar inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that fixed single-detector pipelines waste resources and leave blind spots because detectors vary in strength across attack types. SCOUT instead predicts per-sample reliability and latency from historical behavior on similar inputs, then chooses which detectors to run and whether to escalate to an LLM judge. A single exposed threshold lets the operator balance safety against benign-pass rate and wall-clock time. On a new benchmark of complex agent-facing attacks, this yields measurable gains that also transfer to three prior benchmarks.

Core claim

SCOUT forecasts each detector's reliability and latency on the current input from its performance on similar past inputs, then uses those forecasts to decide detector allocation and escalation under a single safety-utility threshold. On SCOUT-450 this produces a 46 percent drop in attack-success rate and 40 percent drop in total wall-clock time versus an always-on GPT-4o judge, with a 5.1-point benign-utility cost. The same allocation policy improves the safety-utility frontier on BIPIA, IPI, and IHEval.

What carries the argument

The SCOUT outcome-prediction model that forecasts per-detector reliability and latency to drive per-request allocation and escalation decisions.
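
As a rough illustration of the allocation step, the sketch below filters a pool of light detectors by predicted reliability, takes a trust-weighted vote, and escalates to the LLM judge when the vote is not confident at threshold τ. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `Forecast`, `triage`, `min_pred_corr`, and `global_reliability`, the linear trust mix with ω, and the confidence formula are all hypothetical. Only the default values τ = 0.875 and ω = 0.6 are taken from the figure captions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Forecast:
    pred_corr: float  # predicted probability that the detector is correct here
    pred_lat: float   # predicted latency in ms (not used by this routing sketch)


def triage(
    request: str,
    light_detectors: Dict[str, Callable[[str], bool]],  # True means "attack"
    forecasts: Dict[str, Forecast],
    global_reliability: Dict[str, float],
    judge: Callable[[str], bool],
    tau: float = 0.875,
    omega: float = 0.6,
    min_pred_corr: float = 0.5,  # hypothetical filter cut-off
) -> Tuple[bool, List[str]]:
    """Return (is_attack, detectors_invoked) for one request."""
    # 1. Filter the light pool before execution using predicted reliability.
    selected = [n for n in light_detectors if forecasts[n].pred_corr >= min_pred_corr]

    # 2. Trust-weighted vote; trust mixes local and global reliability (assumed linear).
    attack_w = benign_w = 0.0
    for name in selected:
        trust = omega * forecasts[name].pred_corr + (1 - omega) * global_reliability[name]
        if light_detectors[name](request):
            attack_w += trust
        else:
            benign_w += trust

    total = attack_w + benign_w
    confidence = max(attack_w, benign_w) / total if total > 0 else 0.0

    # 3. Skip the judge when the vote is confident, otherwise escalate.
    if confidence >= tau:
        return attack_w > benign_w, selected
    return judge(request), selected + ["llm_judge"]
```

In this sketch, raising `tau` sends more requests to the judge, which is the direction in which the paper's single threshold trades wall-clock time for quality.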

If this is right

A safety-oriented operating point on SCOUT-450 reduces attack-success rate by 46 percent and wall-clock time by 40 percent relative to always-on GPT-4o.
The same point incurs only a 5.1-point drop in benign utility.
SCOUT improves the safety-utility frontier when applied to BIPIA, IPI, and IHEval.
Operators control the trade-off through one adjustable safety-utility threshold.
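
The headline numbers mix two kinds of change: percentage drops in attack-success rate and wall-clock time, and a drop in benign utility stated in points. The sketch below shows how these quantities could be computed, using the definitions given in the figure captions (attack block rate = 1 − ASR, benign utility = 1 − FPR). The function names and toy labels are hypothetical, and treating the 46 percent figure as a relative reduction is an assumption, not something the abstract states.

```python
from typing import Sequence


def attack_success_rate(is_attack: Sequence[bool], blocked: Sequence[bool]) -> float:
    """Fraction of attack requests that were NOT blocked."""
    outcomes = [b for a, b in zip(is_attack, blocked) if a]
    return sum(1 for b in outcomes if not b) / len(outcomes)


def benign_utility(is_attack: Sequence[bool], blocked: Sequence[bool]) -> float:
    """1 - FPR: fraction of benign requests that were allowed through."""
    outcomes = [b for a, b in zip(is_attack, blocked) if not a]
    return sum(1 for b in outcomes if not b) / len(outcomes)


def relative_reduction(baseline: float, new: float) -> float:
    """Relative drop versus a baseline, e.g. 0.46 for a 46 percent reduction."""
    return (baseline - new) / baseline


# Toy data for illustration only (not from the paper).
labels = [True, True, True, True, False, False]
blocked = [True, True, False, True, False, True]
asr = attack_success_rate(labels, blocked)  # one of four attacks slipped through
bu = benign_utility(labels, blocked)        # one of two benign requests passed
```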

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prediction-driven allocation logic could be tested on heterogeneous detectors in other security settings such as malware or phishing classification.
Collecting additional historical traces over time would likely tighten the reliability forecasts for attacks that evolve after initial training.
Pairing SCOUT with periodic retraining on recent attacks could maintain accuracy as the distribution of injections shifts.

Load-bearing premise

Predictions drawn from how detectors behaved on similar past inputs will accurately forecast their performance on new structurally complex attacks.

What would settle it

A held-out collection of prompt-injection attacks structurally unlike the historical data where the predicted reliabilities produce allocations whose attack-success rate and latency are no better than those of a fixed single-detector pipeline.

Figures

Figures reproduced from arXiv: 2605.30837 by Jiarui Li, Pengtao Xie, Qi Cao, Ruiyi Zhang, Shuhao Zhang.

**Figure 1.** SCOUT allocates detectors per input. (a) Four detectors on the shared SCOUT-450 sample space (UMAP of input embeddings): green dots are correct decisions, blue crosses are false positives, and orange crosses are false negatives. The detectors make different errors and their accuracy varies widely. (b) A single-detector defense commits each request to one fixed detector’s verdict (top). SCOUT (bottom) asks …

**Figure 2.** SCOUT framework overview. (a) Fingerprint construction: for each request x, kNN retrieval over the anchor prompt bank surfaces the top-10 neighbours and returns the matching detector fingerprints. (b) Outcome prediction: a small predictor (Qwen3-4B, SFT+GRPO post-training on SCOUT-30K with chain-of-thought rationales) maps the fingerprint slice to per-detector (pred_corr, pred_lat) estimates. (c) Uncertain…

**Figure 3.** Quality–latency frontier on SCOUT-450. Wall-clock vs. quality across the τ ∈ [0.50, 1.00] sweep of the predictor-filtered cascade; gray triangles are all individual standalone detectors, and X marks the always-on DLLM. Attack block rate ≡ 1 − ASR (higher is better). Sweeping τ trades latency for quality, with Acc, 1 − ASR, and BU improving together.

**Figure 4.** Ablation curves on SCOUT-450 (polynomial fits over the threshold sweep τ ∈ [0.50, 1.00]; x is total wall-clock, log scale). Top row: 1 − ASR (attack block rate). Bottom row: benign utility (1 − FPR). Columns vary one link of SCOUT: predictor recipe, routing rule, and trust mixing ω. Always-DLLM (X) is the high-latency reference. Companion accuracy panels are in Appendix G, …

**Figure 5.** Detector behavioral diversity. Left: median per-request latency (log scale). Right: accuracy. Both are shown for SCOUT-30K and SCOUT-450. Detectors span four orders of magnitude in latency and a wide accuracy band, and each detector’s profile is consistent across the two splits.

**Figure 6.** Per-detector correctness on SCOUT-450. Each panel shows one detector’s predictions overlaid on the shared UMAP of eval_content embeddings. Light gray dots are the full SCOUT-450 distribution; green dots are correct decisions, dark-blue crosses are false positives, and orange crosses are false negatives.

**Figure 7.** Detector correctness atlas on SCOUT-450. Columns are the base-pool and extension detectors; rows are samples grouped by category (harder samples first within each group). Gray cells are correct decisions, dark-blue cells are false positives, and orange cells are false negatives. Top bars give each detector’s correct/false-positive/false-negative rates (accuracy annotated above); the right margin counts how…

**Figure 8.** Composition of the three released splits. (a) SCOUT-30K source carriers: 2,700 unique input samples paired with detector profiles to produce the 29,551 (sample, detector) examples used for SFT and GRPO. (b) Anchor-400, the held-out fingerprint and kNN retrieval set. (c) SCOUT-450, the held-out routing benchmark. Each donut shows the inner attack-vs-benign split (red = attack, blue = benign) and the outer p…

**Figure 9.** SCOUT-450 sample space (UMAP). The shared UMAP of eval_content embeddings, colored by attack category (left), carrier type (middle), and difficulty (right). The per-detector correctness overlays in …

**Figure 10.** Sample difficulty on SCOUT-450. Left: each sample on the shared UMAP, colored by the number of the eight base-pool detectors that misclassify it. Right: histogram of that count.

**Figure 11.** Anchor-400 coverage of the SCOUT-30K training distribution, by category. One panel per attack/benign category, on a shared UMAP of eval_content embeddings; light gray = all training samples, colored points = training samples in the focal category, squares = Anchor-400 anchors in the focal category. Panel titles report per-category query and anchor counts.

**Figure 12.** The hidden_tricky category by concealment strategy. Anchor-400 coverage of the ten concealment strategies that make up hidden_tricky (SCOUT-30K queries vs Anchor-400 anchors), on the shared eval_content UMAP; light gray = all training samples, colored points = training samples using the focal strategy, squares = Anchor-400 anchors. Panel titles report per-strategy query and anchor counts.

**Figure 13.** Fingerprint descriptor space (UMAP). UMAP of the anchor sample_characteristics descriptors used on the document side of retrieval, colored by attack category (left), carrier type (middle), and difficulty (right).

**Figure 14.** Attack-type retrieval and fingerprint compaction (SCOUT-450). (a) Attack-type Coverage@K: fraction of queries whose top-K anchors include a same-attack_type anchor, overall and split into attack and benign queries. (b) Coverage@10 per attack type; bar annotations give the query count n. (c) Mean tokens per anchor record: the LLM-serialized sample_characteristics (∼69) versus the raw eval_content (∼535), t…

**Figure 15.** Latency reward r_lat. Predictions within δ/2 of the ground-truth latency receive full reward, with linear decay to zero at δ, where δ = max(2 ms, 0.5 ℓ_gt). Left: a light detector (ℓ_gt = 20 ms). Right: the LLM judge (ℓ_gt = 1500 ms). The full reward is g_fmt · r_corr · (1 + r_lat), so a correct prediction earns between 1 and 2 depending on latency accuracy, and an incorrect or malformed one earns 0.

**Figure 16.** Uncertainty-aware triage in SCOUT. The predictor emits a per-detector reliability estimate (pred_corr) and latency estimate (pred_lat). The reliability estimate filters the light pool before execution, and the selected light detectors vote with trust weights that mix local fingerprint evidence and global detector reliability. If the weighted vote is confident under threshold τ, SCOUT skips the LLM judge;…

**Figure 17.** Plots both maps on SCOUT-450; the headline τ = 0.875 lies on both curves.

**Figure 18.** Detector-invocation portfolio across the τ-sweep on SCOUT-450 (Base pool). Share of detector invocations under filter+skip routing. Rows: LLM judge (GPT-4o, GPT-5.1). Columns: τ ∈ {0.55, 0.75, 0.875, 1.00}; the first three match the SCOUT rows in …

**Figure 19.** Ablation: accuracy panels on SCOUT-450, companion to …

**Figure 20.** Per-predictor quality and verbosity. Left: pred_corr accuracy broken out by detector group (light pool / DLLM judge / overall). The light-pool bar governs the routing trade-off position in the predictor-recipe block of …

**Figure 21.** Pool extension on SCOUT-450 (RQ3), latency view. Six panels (3 rows × 2 judges) show the SCOUT threshold sweep (base, + D7, + D7 + D8) on Acc, 1 − ASR, and BU plotted against total wall-clock. Columns: GPT-4o (left), GPT-5.1 (right). The always-LLM-judge marker (X) is the right-end reference.

**Figure 22.** Cross-benchmark latency-quality view. SCOUT τ-sweeps on BIPIA, IPI, and IHEval (N=1000 each), plotted against Acc, 1 − ASR, and BU with GPT-5.1 as DLLM. Always-LLM-judge (X) is the right-end reference; SCOUT cuts wall-clock on all three benchmarks.

**Figure 23.** Predicted wall-clock follows the realized trend. Predicted total wall-clock T̂(τ) (from pred_lat, dashed) and realized (solid) across the τ-sweep (ω = 0.6) for all four benchmarks under both judges (rows: SCOUT-450, BIPIA, IPI, IHEval; columns: GPT-4o, GPT-5.1). On SCOUT-450 the curves match in magnitude (gap ≤ 2.5%); on the external benchmarks the predictor underestimates absolute wall-clock (gap up to…
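
The latency reward described in the Figure 15 caption is specified closely enough to write down. The sketch below follows that caption: full reward within δ/2 of the ground-truth latency, linear decay to zero at δ, with δ = max(2 ms, 0.5 × ground-truth latency), and a total reward of g_fmt · r_corr · (1 + r_lat). The function names are hypothetical, and treating g_fmt and r_corr as 0/1 indicators is an assumption inferred from the statement that a correct prediction earns between 1 and 2 while an incorrect or malformed one earns 0.

```python
def latency_reward(pred_lat_ms: float, gt_lat_ms: float) -> float:
    """r_lat: 1 inside delta/2, linear decay to 0 at delta, 0 beyond."""
    delta = max(2.0, 0.5 * gt_lat_ms)
    err = abs(pred_lat_ms - gt_lat_ms)
    if err <= delta / 2:
        return 1.0
    if err >= delta:
        return 0.0
    # Linear interpolation from 1 at err = delta/2 down to 0 at err = delta.
    return (delta - err) / (delta / 2)


def total_reward(well_formed: bool, correct: bool,
                 pred_lat_ms: float, gt_lat_ms: float) -> float:
    """g_fmt * r_corr * (1 + r_lat), with both gates assumed to be 0/1."""
    g_fmt = 1.0 if well_formed else 0.0
    r_corr = 1.0 if correct else 0.0
    return g_fmt * r_corr * (1.0 + latency_reward(pred_lat_ms, gt_lat_ms))
```

For the light detector in the caption (ground truth 20 ms, so δ = 10 ms), a 24 ms prediction is within δ/2 and earns the full latency reward, while a 27.5 ms prediction sits halfway down the decay and earns 0.5.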

Original abstract

Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is always reliable. Yet existing systems still treat detection as a fixed single-detector pipeline, committing every request to one detector's blind spots. We reframe defense as detector allocation: given a heterogeneous pool, decide per request which detectors to run and whether to escalate to an LLM judge. Our framework SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage) makes this decision dynamic by predicting each detector's per-sample reliability and latency from how it behaved on similar past inputs, and exposes a single safety-utility threshold to the operator (where utility bundles benign-pass rate and wall-clock). To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that older prompt-injection sets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock by 40% relative to an always-on GPT-4o judge, at a 5.1-point benign-utility drop. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), improving the safety-utility frontier.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SCOUT reframes detector choice as per-request allocation via similarity-based predictions and adds a targeted benchmark, but the reported gains rest on an unexamined assumption that historical matches will predict outcomes on novel complex attacks.

The main contribution here is treating defense as dynamic allocation across a pool of detectors instead of committing every input to one fixed model, with decisions driven by how similar past cases performed and a single tunable safety-utility threshold. They also release SCOUT-450 to cover agent-facing injections that prior sets under-sample, and they show the method transfers to BIPIA, IPI, and IHEval while beating an always-on GPT-4o baseline on their own data.

The setup is sensible for anyone running multiple detectors in production: it directly attacks the blind-spot problem and gives operators one knob rather than a pile of hyperparameters. The transfer results and the concrete deltas (46% ASR drop, 40% wall-clock cut at modest utility cost) are the parts that feel worth testing further.

The soft spot is exactly the generalization question the stress-test note raises. The allocation depends on a similarity lookup surfacing past cases whose reliability and latency will carry over to structurally new attacks; the abstract gives no ablation on the metric itself, no error bars, and no construction details for the benchmark. If the similarity is mostly lexical or embedding-surface rather than attack-semantic, the pre-hoc predictions will be least reliable on the inputs SCOUT-450 was built to stress. That makes the headline numbers hard to interpret without the methods section.

This is aimed at applied AI-security groups that already maintain detector ensembles and want to cut average cost without losing coverage. It shows clear engagement with the practical constraint rather than just another detector paper.

I would send it to peer review; the framing is useful and the claims are falsifiable once the similarity mechanism and variance are shown.

Referee Report

4 major / 2 minor

Summary. The paper introduces the SCOUT framework for prompt-injection defense, which reframes detection as per-request detector allocation. SCOUT predicts each detector's reliability and latency on a new input by referencing its behavior on similar past inputs, then uses a single tunable safety-utility threshold to decide which detectors to invoke and whether to escalate to an LLM judge. The authors construct the SCOUT-450 benchmark to better capture structurally complex, agent-facing injections and report that a safety-oriented operating point yields a 46% reduction in attack-success rate and 40% reduction in wall-clock time versus an always-on GPT-4o judge (with a 5.1-point benign-utility penalty), while also improving the safety-utility frontier on BIPIA, IPI, and IHEval.

Significance. If the similarity-based prediction mechanism generalizes, the work offers a practical way to exploit detector heterogeneity without committing every query to a single detector's blind spots, potentially lowering both false-negative risk and inference cost in production LLM systems. The release of SCOUT-450 as a benchmark focused on agent-facing attacks is a concrete contribution that could stimulate further research on adaptive defenses.

major comments (4)

[Abstract, §4] Abstract and §4 (evaluation results): The headline claims of 46% ASR reduction and 40% wall-clock reduction are stated without error bars, statistical significance tests, or an explicit description of how the safety-utility threshold was selected on SCOUT-450; this leaves open whether the reported operating point was chosen post-hoc or pre-specified, directly affecting the reproducibility of the safety-utility frontier improvement.
[§3] §3 (SCOUT prediction mechanism): The method for computing per-sample reliability and latency predictions from similar past inputs is described at a high level but lacks an ablation on the similarity metric (embedding model, feature set, or distance function) and does not report how many historical examples are retrieved or how their outcomes are aggregated; because the central claim rests on these predictions transferring to novel attacks, the absence of such controls is load-bearing.
[§4.1] §4.1 (SCOUT-450 construction): The paper states that SCOUT-450 captures "structurally complex, agent-facing injections" under-represented in prior sets, yet provides no quantitative comparison (e.g., attack taxonomy coverage, structural complexity metrics) against BIPIA/IPI/IHEval or details on how the 450 samples were sampled or generated; without this, it is impossible to verify that the benchmark actually stresses the generalization assumption highlighted in the skeptic note.
[§4.3] Transfer experiments (§4.3): The claim that SCOUT improves the safety-utility frontier on the three external benchmarks does not specify whether the safety-utility threshold was held fixed from SCOUT-450 or re-tuned per benchmark, nor does it report per-detector allocation statistics; this information is required to assess whether the gains arise from the pre-hoc reasoning or from benchmark-specific tuning.

minor comments (2)

[§3] Notation for the safety-utility threshold and the similarity lookup should be introduced with an explicit equation or pseudocode early in §3 rather than only in prose.
[Figures in §4] Figure captions for the safety-utility frontier plots should include the exact threshold values used for each curve and the number of runs or seeds underlying any shaded regions.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen reproducibility and methodological transparency. We address each major point below and will incorporate revisions where they improve the manuscript without altering its core claims.

Referee: [Abstract, §4] Abstract and §4 (evaluation results): The headline claims of 46% ASR reduction and 40% wall-clock reduction are stated without error bars, statistical significance tests, or an explicit description of how the safety-utility threshold was selected on SCOUT-450; this leaves open whether the reported operating point was chosen post-hoc or pre-specified, directly affecting the reproducibility of the safety-utility frontier improvement.

Authors: We agree that error bars, significance testing, and explicit threshold-selection details are needed for reproducibility. The operating point was chosen via grid search on a held-out validation split of SCOUT-450 targeting a fixed safety level; we will add this procedure, report standard errors from 5 random seeds, and include paired t-tests against the GPT-4o baseline in the revised §4. revision: yes
Referee: [§3] §3 (SCOUT prediction mechanism): The method for computing per-sample reliability and latency predictions from similar past inputs is described at a high level but lacks an ablation on the similarity metric (embedding model, feature set, or distance function) and does not report how many historical examples are retrieved or how their outcomes are aggregated; because the central claim rests on these predictions transferring to novel attacks, the absence of such controls is load-bearing.

Authors: We will expand §3 with the missing controls: cosine similarity on sentence-transformer embeddings, k=5 nearest neighbors, and mean aggregation of historical outcomes. An ablation table comparing alternative embeddings, distance functions, and k values will be added to demonstrate that the reported gains are robust to these choices. revision: yes
Referee: [§4.1] §4.1 (SCOUT-450 construction): The paper states that SCOUT-450 captures "structurally complex, agent-facing injections" under-represented in prior sets, yet provides no quantitative comparison (e.g., attack taxonomy coverage, structural complexity metrics) against BIPIA/IPI/IHEval or details on how the 450 samples were sampled or generated; without this, it is impossible to verify that the benchmark actually stresses the generalization assumption highlighted in the skeptic note.

Authors: We will augment §4.1 with quantitative comparisons (average prompt length, tool-call count, and a 6-category taxonomy coverage) against the three external sets, plus explicit sampling details (stratified draw from production agent traces plus controlled synthetic generation). These additions directly address the request for evidence that SCOUT-450 stresses the generalization claim. revision: yes
Referee: [§4.3] Transfer experiments (§4.3): The claim that SCOUT improves the safety-utility frontier on the three external benchmarks does not specify whether the safety-utility threshold was held fixed from SCOUT-450 or re-tuned per benchmark, nor does it report per-detector allocation statistics; this information is required to assess whether the gains arise from the pre-hoc reasoning or from benchmark-specific tuning.

Authors: The threshold was held fixed at the SCOUT-450 safety-oriented value for all transfer runs; we will state this explicitly and add per-benchmark allocation statistics (fraction of queries routed to each detector and to the LLM judge) in the revised §4.3 to clarify that gains stem from the pre-hoc mechanism rather than per-benchmark retuning. revision: yes
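
The similarity lookup named in the simulated rebuttal (cosine similarity, k nearest neighbours, mean aggregation of historical outcomes) can be sketched as follows. This mirrors the rebuttal's description only; the Figure 2 caption instead describes top-10 retrieval feeding a learned predictor, so this should not be read as the paper's actual mechanism. The function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_reliability(
    query: Sequence[float],
    anchors: List[Sequence[float]],
    outcomes: List[Sequence[int]],  # outcomes[i][d] = 1 if detector d was correct on anchor i
    k: int = 5,
) -> List[float]:
    """Mean historical correctness per detector over the k most similar anchors."""
    ranked = sorted(range(len(anchors)),
                    key=lambda i: cosine(query, anchors[i]),
                    reverse=True)
    top = ranked[:k]
    n_detectors = len(outcomes[0])
    return [sum(outcomes[i][d] for i in top) / len(top) for d in range(n_detectors)]


# Toy 2-D embeddings and outcomes for two detectors (illustration only).
anchors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
outcomes = [[1, 0], [1, 1], [0, 1], [0, 0]]
estimate = knn_reliability([1.0, 0.0], anchors, outcomes, k=2)
```

With k = 2 the two nearest anchors are the first two, so the first detector's estimate is 1.0 and the second's is 0.5.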

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper frames SCOUT as using similarity-based predictions of per-detector reliability and latency drawn from historical behavior, then evaluates the resulting allocation policy on the new SCOUT-450 benchmark plus three external sets. No equations, self-citations, or definitional steps are visible that would make the reported safety-utility gains (ASR reduction, wall-clock savings) equivalent to the input data or fitted parameters by construction. The similarity lookup is presented as an empirical mechanism rather than a tautology, and the central claims rest on out-of-sample transfer rather than internal re-labeling of the same quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review limits visibility into parameters; the core premise that detectors are heterogeneous and that past behavior predicts future reliability is taken as given without further justification.

free parameters (1)

safety-utility threshold
Single tunable parameter exposed to the operator that bundles benign-pass rate and wall-clock time; its specific value determines the reported 46%/40% operating point.

axioms (1)

Domain assumption: each detector is strong on a different slice of attacks and none is always reliable.
Opening premise of the abstract that motivates the allocation problem.
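
Because the threshold is the one free parameter, how it is chosen matters for the reported operating point. The simulated rebuttal describes a grid search on a held-out validation split targeting a fixed safety level; a minimal sketch of that kind of selection is below. The function name, the dictionary keys, the tie-break on wall-clock time, and every number in `sweep` are hypothetical and for illustration only, not measurements from the paper.

```python
from typing import Dict, List, Optional


def pick_threshold(sweep: List[Dict[str, float]],
                   max_asr: float) -> Optional[Dict[str, float]]:
    """Among sweep points meeting the safety target, return the cheapest one."""
    feasible = [p for p in sweep if p["asr"] <= max_asr]
    if not feasible:
        return None  # no threshold on the grid reaches the target
    return min(feasible, key=lambda p: p["wall_clock_s"])


# Made-up validation measurements, one entry per grid value of tau.
sweep = [
    {"tau": 0.55, "asr": 0.30, "wall_clock_s": 40.0},
    {"tau": 0.75, "asr": 0.18, "wall_clock_s": 90.0},
    {"tau": 0.875, "asr": 0.12, "wall_clock_s": 150.0},
    {"tau": 1.00, "asr": 0.11, "wall_clock_s": 400.0},
]
chosen = pick_threshold(sweep, max_asr=0.15)
```

Fixing the safety target before looking at test data is what distinguishes a pre-specified operating point from one chosen post hoc, which is the distinction the referee's first major comment asks about.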

pith-pipeline@v0.9.1-grok · 5766 in / 1378 out tokens · 22709 ms · 2026-06-28T22:15:17.035631+00:00 · methodology
