Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks

Doohee You

arxiv: 2605.18988 · v1 · pith:KMLQ56I3new · submitted 2026-05-18 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-20 09:03 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords: multimodal attacks · multi-turn defense · TRIAD framework · survival analysis · adversarial robustness · trajectory dynamics · anomaly detection

The pith

The TRIAD framework models multi-turn multimodal conversations as trajectories to bound expected time until an attack succeeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current static defenses miss attacks that spread malicious intent across many conversation turns and modalities because they treat each input in isolation. It introduces the TRIAD framework to represent the full conversational flow as a continuous trajectory and track covariance shifts plus topological acceleration to spot building malicious drift. These signals feed into a time-varying Cox survival model through a Bayesian HMM loop to deliver a mathematically bounded prediction of time-to-failure. A sympathetic reader would care because this could let agentic AI systems anticipate progressive poisoning in real time rather than reacting only after damage occurs.

Core claim

The TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively, by mapping multimodal multi-turn flow to a continuous trajectory monitored with structural anomaly detection, Ledoit-Wolf regularized Mahalanobis distance, and topological trajectory acceleration, all integrated into a time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model feedback loop.

What carries the argument

The Triple-tier Anomaly Defense (TRIAD) framework, which maps conversational flow to a continuous trajectory and integrates covariance-shift monitoring, regularized Mahalanobis distance, and topological acceleration into a Cox proportional-hazards model through a Bayesian HMM feedback loop.
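
As a rough illustration of the distance component named above, the following is a minimal sketch of a Ledoit-Wolf shrunk covariance feeding a Mahalanobis distance. It assumes per-turn embeddings are available as rows of a NumPy array. The function names and the choice of a scaled-identity shrinkage target are assumptions made for illustration; the paper's own estimator and embedding space are not shown in the source.

```python
import numpy as np

def ledoit_wolf_covariance(X):
    """Shrink the sample covariance toward a scaled identity.

    Sketch of the closed-form Ledoit-Wolf estimator; X has shape
    (n_samples, n_features). Returns (shrunk_cov, shrinkage).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # center the baseline turns
    emp_cov = Xc.T @ Xc / n                      # sample covariance
    mu = np.trace(emp_cov) / p                   # scale of the identity target
    # Squared distance between the sample covariance and the target.
    delta = ((emp_cov - mu * np.eye(p)) ** 2).sum() / p
    # Estimated sampling error of the sample covariance entries.
    X2 = Xc ** 2
    beta = ((X2.T @ X2) / n - emp_cov ** 2).sum() / (p * n)
    beta = min(beta, delta)
    shrinkage = 0.0 if delta == 0 else beta / delta
    shrunk = (1.0 - shrinkage) * emp_cov + shrinkage * mu * np.eye(p)
    return shrunk, shrinkage

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from mean under covariance cov."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

Shrinkage keeps the covariance well conditioned when the number of baseline turns is small relative to the embedding dimension, which is the situation a per-conversation monitor would usually face.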

If this is right

Detects cumulative structural poisoning across longitudinal trajectories that turn-by-turn Markov guards miss.
Supplies a predictive, real-time safeguard for autonomous agentic workflows.
Differentiates benign creative exploration from continuous malicious drift using kinematic and geometric features.
Supports continuous safety alignment without requiring periodic empirical retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-monitoring idea could extend to tracking intent drift in multi-agent systems where goals evolve over successive exchanges.
If the positive divergence property holds, hazard thresholds could trigger automated pauses or clarifications before failure occurs.
A natural next test would measure how well the bounds survive when conversations mix more than two modalities or run for dozens of turns.
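
The hazard-threshold idea above can be sketched in a few lines. This is a minimal sketch, assuming a constant per-turn baseline hazard `h0`, a hypothetical coefficient vector `beta`, and a hypothetical alert threshold; none of these values come from the paper.

```python
import math

def cox_hazard(x, beta, h0=0.01):
    """Cox-form hazard h0 * exp(beta . x) for one turn's feature vector x.

    h0 and beta are illustrative assumptions, not fitted values.
    """
    return h0 * math.exp(sum(b * xi for b, xi in zip(beta, x)))

def monitor(trajectory, beta, h0=0.01, threshold=0.2):
    """Scan a conversation turn by turn.

    Returns (alert_turn, survival), where alert_turn is the index of the
    first turn whose hazard exceeds threshold (or None), and survival[t]
    is exp(-cumulative hazard) through turn t.
    """
    cumulative = 0.0
    survival = []
    alert_turn = None
    for t, x in enumerate(trajectory):
        h = cox_hazard(x, beta, h0)
        cumulative += h
        survival.append(math.exp(-cumulative))
        if alert_turn is None and h > threshold:
            alert_turn = t
    return alert_turn, survival
```

A caller could pause the agent or ask a clarifying question at `alert_turn`; whether such a trigger fires early enough depends on how the features behave on real attacks, which the source does not report.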

Load-bearing premise

The assumption that multimodal multi-turn conversational flow can be usefully represented as a continuous trajectory whose covariance shifts and topological acceleration reliably separate benign exploration from malicious drift.

What would settle it

Run controlled multi-turn attacks on an MLLM where malicious content is injected gradually across turns and check whether the predicted time-to-failure bound is violated or whether the model misclassifies clearly benign trajectories as high-hazard.

Figures

Figures reproduced from arXiv: 2605.18988 by Doohee You.

Original abstract

The expansion of Multimodal Large Language Models (MLLMs) and their integration into autonomous agentic workflows has introduced a non-stationary attack surface. Empirical observations indicate that adversaries employ progressive, cross-modal perturbations that evade turn-specific guardrails by distributing malicious intent across longitudinal conversational trajectories. Static defense mechanisms, constrained by the Markov property, evaluate inputs in isolation and fail to detect cumulative structural poisoning. To handle this limitation, this paper formulates safety verification as a dynamic survival prediction and trajectory dynamics problem. The Triple-tier Anomaly Defense (TRIAD) framework is proposed as a predictive model that maps multimodal and multi-turn conversational flow as a continuous trajectory. The framework integrates structural anomaly detection to monitor covariance shifts, a Ledoit-Wolf regularized Mahalanobis distance to monitor covariance shifts in high-dimensional spaces, and topological trajectory acceleration to differentiate benign creative exploration from continuous malicious drift. These kinematic and geometric features are integrated into a time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model (HMM) feedback loop. Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively. This framework provides a computationally efficient, interpretable, and predictive safeguard for real-time agentic AI systems, establishing a rigorous foundation for continuous safety alignment without relying on empirical retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The TRIAD framework mixes existing methods into a new defense for multi-turn multimodal attacks, but the claimed theoretical bound lacks a derivation.

The paper's main contribution is a framework that models multi-turn multimodal conversations as trajectories and uses survival analysis to predict attack success times. It claims to deliver a bounded expected time-to-failure for adversarial cases.

The assembly of techniques is what stands out. It uses Ledoit-Wolf regularized Mahalanobis distance to track covariance shifts in the high-dimensional space of model states. Topological features capture trajectory acceleration to tell creative exploration apart from steady malicious drift. These feed into a time-varying Cox model through a Bayesian Hidden Markov Model feedback loop. This specific combination for handling progressive, cross-modal attacks on agentic MLLMs looks new. The paper does a good job identifying the problem with static defenses. They evaluate inputs in isolation and miss how intent builds over a conversation. Framing safety as a dynamic prediction task with kinematic and geometric features is a reasonable step.

The soft spots center on the theoretical claim. The abstract asserts that the TRIAD framework provides a mathematically bounded expected time-to-failure under perturbations, with malicious acceleration diverging positively. But it supplies no derivation, no explicit equations for the model, and no validation. The bound is said to emerge from integrating the features into the Cox model, yet without the steps it's not clear how the HMM enforces the positive divergence or why the bound follows. The continuous trajectory assumption could also be a stretch. Real conversations have discrete turns and modality changes that might introduce non-stationarities the covariance and topology measures do not handle. If those breaks weaken the separation between benign and malicious paths, the predictive power drops.

This work is for people building safety mechanisms for autonomous multimodal agents. Readers familiar with anomaly detection or survival analysis in other domains might see ways to adapt it. It deserves a serious referee to examine whether the math holds and to suggest experiments that could test the claims. I would send it to peer review. The idea addresses a genuine gap, and feedback on the unshown analysis would help strengthen it.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the TRIAD framework to defend against progressive, cross-modal multi-turn attacks on MLLMs in agentic workflows. It models conversational flows as continuous trajectories, using structural anomaly detection via Ledoit-Wolf regularized Mahalanobis distance to track covariance shifts and topological trajectory acceleration to distinguish benign exploration from malicious drift. These kinematic and geometric features are fed into a time-varying Cox proportional hazards model through a Bayesian HMM feedback loop. The central claim is that this integration yields a mathematically bounded expected time-to-failure under adversarial perturbations, with malicious acceleration diverging positively, providing a predictive, interpretable safeguard without empirical retraining.

Significance. If the claimed bound on expected time-to-failure can be rigorously derived from the feature integration and the continuous-trajectory assumption holds without unmodeled discontinuities, the work would advance dynamic safety verification by adapting survival analysis to non-stationary multimodal attack surfaces, offering an interpretable alternative to static guardrails.

major comments (2)

[Abstract] The assertion that 'Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively' is presented without any derivation, explicit model equations, proof sketch, or integration details showing how the Ledoit-Wolf Mahalanobis and topological acceleration features, when mapped via the Bayesian HMM, enforce the bound or positive divergence in the Cox proportional-hazards form. This is load-bearing for the central claim.
[Framework description (Bayesian HMM feedback loop)] The assumption that multimodal multi-turn flow can be represented as a continuous trajectory whose covariance shifts and topological acceleration reliably separate benign exploration from malicious drift is invoked when mapping inputs to the time-varying Cox model, yet no analysis addresses potential non-stationarities from discrete turn boundaries or cross-modal switches that could invalidate the kinematic features and prevent the claimed bound from following.

minor comments (2)

[Methods] The notation for 'topological trajectory acceleration' and its computation from the continuous trajectory is introduced without a formal definition or pseudocode, hindering reproducibility of the geometric features.
[Evaluation] No empirical validation or simulation results are referenced to support the separation of benign vs. malicious trajectories under the proposed features.
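
To make the reproducibility concern concrete, here is one possible discrete reading of a trajectory acceleration: a central second difference over per-turn distance values. This is a sketch under the assumption of unit spacing between turns; the function name is hypothetical and this is not the paper's definition, which the report says is missing.

```python
def trajectory_acceleration(distances):
    """Central second difference a_t = D[t+1] - 2*D[t] + D[t-1].

    Assumes one distance value per turn and unit spacing between turns.
    Returns one value per interior turn.
    """
    return [
        distances[t + 1] - 2 * distances[t] + distances[t - 1]
        for t in range(1, len(distances) - 1)
    ]
```

Under this reading, a conversation whose distance from baseline grows linearly has zero acceleration, while one whose distance grows quadratically has constant positive acceleration. Whether that separates benign exploration from malicious drift in practice is exactly what the referee asks the authors to show.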

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments on our manuscript. We address the major comments point by point below, indicating the revisions we plan to make to enhance the rigor and clarity of the presentation.

Referee: [Abstract] The assertion that 'Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively' is presented without any derivation, explicit model equations, proof sketch, or integration details showing how the Ledoit-Wolf Mahalanobis and topological acceleration features, when mapped via the Bayesian HMM, enforce the bound or positive divergence in the Cox proportional-hazards form. This is load-bearing for the central claim.

Authors: We agree with the referee that the abstract presents the theoretical claim without sufficient supporting details. The manuscript includes a high-level description of the theoretical analysis, but to fully substantiate the central claim, we will revise the abstract to reference the key theoretical components and add a dedicated subsection or appendix providing the derivation, explicit equations, and proof sketch for how the Ledoit-Wolf regularized Mahalanobis distance and topological acceleration features, integrated through the Bayesian HMM, lead to the bounded expected time-to-failure and positive divergence in the Cox proportional hazards model.
revision: yes

Referee: [Framework description (Bayesian HMM feedback loop)] The assumption that multimodal multi-turn flow can be represented as a continuous trajectory whose covariance shifts and topological acceleration reliably separate benign exploration from malicious drift is invoked when mapping inputs to the time-varying Cox model, yet no analysis addresses potential non-stationarities from discrete turn boundaries or cross-modal switches that could invalidate the kinematic features and prevent the claimed bound from following.

Authors: The continuous trajectory representation is a foundational assumption of the TRIAD framework, and the Bayesian HMM feedback loop is designed to capture and adapt to non-stationarities, including those from discrete turn boundaries and cross-modal switches, by dynamically updating the state and feature mappings. However, we acknowledge that a more explicit analysis of these potential invalidations is warranted. In the revision, we will include additional discussion and analysis in the framework description section to demonstrate that the kinematic features remain reliable and the bound holds under such conditions.
revision: yes
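
For readers unfamiliar with the kind of state update the rebuttal alludes to, the following is a minimal sketch of one forward (filtering) step of a two-state hidden Markov model. The two states (benign, malicious), the Gaussian emission model over a scalar anomaly score, and all parameter values are assumptions for illustration; the source does not specify the paper's HMM.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def update_belief(belief, score, transition, emissions):
    """One Bayesian filtering step of a discrete-state HMM.

    belief: prior state probabilities, e.g. [P(benign), P(malicious)].
    transition[i][j]: P(next state j | current state i).
    emissions[j]: (mean, std) of the anomaly score in state j (assumed Gaussian).
    """
    n = len(belief)
    # Predict: push the prior through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight each state by how well it explains the observed score.
    weighted = [predicted[j] * gaussian_pdf(score, *emissions[j]) for j in range(n)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

A sketch like this adapts its belief at every turn, but it does not by itself show that the kinematic features stay reliable across turn boundaries or modality switches, which is the point the referee raised.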

Circularity Check

1 step flagged

Bounded E[time-to-failure] presented as theoretical result but constructed directly from TRIAD's own trajectory features and Cox-HMM integration

specific steps

fitted input called prediction [Abstract]
"These kinematic and geometric features are integrated into a time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model (HMM) feedback loop. Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively."

The 'theoretical analysis' is invoked immediately after describing the feature extraction and Cox-HMM integration. The bounded E[time-to-failure] and positive divergence are therefore outputs of the same continuous-trajectory representation and covariance/topological features that the framework introduces; the survival bound is statistically forced by the model definition rather than derived from independent premises.

full rationale

The paper's central theoretical claim reduces to a restatement of its modeling assumptions. The abstract defines the TRIAD framework by mapping conversational flow to a continuous trajectory, extracting covariance-shift and topological-acceleration features, and feeding them into a time-varying Cox model via Bayesian HMM. It then asserts that this same construction 'provides a mathematically bounded expected time-to-failure' with positive divergence for malicious cases. No independent derivation, external benchmark, or parameter-free proof is supplied; the bound is therefore equivalent to the input modeling choices by construction. This matches the fitted-input-called-prediction pattern at the level of the survival outcome itself.
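
One way to see why such a bound can follow from modeling choices alone is to write out the standard time-varying Cox form. The symbols below (covariate path x(t), coefficients β, baseline hazard h_0, and the floor h_min) are assumptions introduced for illustration; they are not the paper's equations, which the source does not show.

```latex
h(t \mid x(t)) = h_0(t)\,\exp\!\big(\beta^{\top} x(t)\big), \qquad
S(t) = \exp\!\Big(-\int_0^{t} h(s \mid x(s))\,ds\Big), \qquad
\mathbb{E}[T] = \int_0^{\infty} S(t)\,dt .

\text{If } h(s \mid x(s)) \ge h_{\min} > 0 \text{ for all } s, \text{ then }
S(t) \le e^{-h_{\min} t}
\quad\Longrightarrow\quad
\mathbb{E}[T] \le \int_0^{\infty} e^{-h_{\min} t}\,dt = \frac{1}{h_{\min}} .
```

Under these assumptions, any premise that keeps the adversarial hazard above a positive floor yields a finite bound on expected time-to-failure immediately, so the bound carries no information beyond the premise itself.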

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Only the abstract is available, so the ledger records the modeling choices and new constructs explicitly named or implied in the abstract; no empirical or formal support is provided for any of them.

axioms (2)

domain assumption: Conversational flow can be represented as a continuous trajectory in a high-dimensional multimodal space.
Invoked when the paper states that inputs are mapped as a continuous trajectory for the Cox model.
domain assumption: Covariance shifts and topological acceleration distinguish malicious drift from benign exploration.
Central to the structural anomaly detection and kinematic features described.

invented entities (2)

Triple-tier Anomaly Defense (TRIAD) framework (no independent evidence)
purpose: Predictive safety verification for multi-turn multimodal attacks
Newly proposed integrated system combining anomaly detection, distance metrics, and survival modeling.
Topological trajectory acceleration (no independent evidence)
purpose: Differentiate benign creative exploration from continuous malicious drift
New kinematic feature introduced for the trajectory model.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: maps multimodal and multi-turn conversational flow as a continuous trajectory... Ledoit-Wolf regularized Mahalanobis distance... topological trajectory acceleration... time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model (HMM) feedback loop

IndisputableMonolith/Foundation/ArrowOfTime.lean · forward_accumulates · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Theorem 2: Positive Divergence of Adversarial Acceleration... a_t = d²/dt² D_M(t) remains strictly positive

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 9 internal anchors

[1]

arXiv preprint arXiv:2602.16935

URL https://arxiv.org/abs/2602.16935. arXiv preprint arXiv:2602.16935. Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. Image hijacks: Adversarial images can control generative models at runtime,
[2]

Image hijacks: Adversarial images can control generative models at runtime

URL https://arxiv.org/abs/2309.00236. arXiv preprint arXiv:2309.00236. Anshuman Chhabra, Shrestha Datta, Shahriar Kabir Nahin, and Prasant Mohapatra. Agentic AI security: Threats, defenses, evaluation, and open challenges,
[3]

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

URL https://arxiv.org/abs/2510.23883. arXiv preprint arXiv:2510.23883. David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202,
[4]

arXiv preprint arXiv:2602.01025

URL https://arxiv.org/abs/2602.01025. arXiv preprint arXiv:2602.01025. Badhan Chandra Das, Md Tasnim Jawad, Joaquin Molto, M. Hadi Amini, and Yanzhao Wu. Multi-turn jailbreaking attack in multi-modal large language models,
[5]

arXiv preprint arXiv:2601.05339

URL https://arxiv.org/abs/2601.05339. arXiv preprint arXiv:2601.05339. Jared L. Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1):24, feb
[6]

Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger

doi: 10.1186/s12874-018-0482-1. URL https://doi.org/10.1186/s12874-018-0482-1. Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411,
[7]

A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks

URL https://arxiv.org/abs/1807.03888. arXiv preprint arXiv:1807.03888. Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation Forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422,
[8]

Isolation forest

doi: 10.1109/ICDM.2008.17. Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models,
[9]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

URL https://arxiv.org/abs/2310.04451. arXiv preprint arXiv:2310.04451. Atharva Mehta, Rajesh Kumar, Aman Singla, Kartik Bisht, Yaman Kumar Singla, and Rajiv Ratn Shah. Detecting LLM-assisted academic dishonesty using keystroke dynamics,
[10]

Detecting LLM-Assisted Academic Dishonesty using Keystroke Dynamics

URL https://arxiv.org/abs/2511.12468. arXiv preprint arXiv:2511.12468. Maximilian Mueller and Matthias Hein. Mahalanobis++: Improving OOD detection via feature normalization,
[11]

arXiv preprint arXiv:2505.18032

URL https://arxiv.org/abs/2505.18032. arXiv preprint arXiv:2505.18032. OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, et al. GPT-4 technical report,
[12]

GPT-4 Technical Report

URL https://arxiv.org/abs/2303.08774. arXiv preprint arXiv:2303.08774. Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. Deep learning for anomaly detection: A review. ACM Comput. Surv., 54(2), mar
[13]

Deep learning for anomaly detection

doi: 10.1145/3439950. URL https://doi.org/10.1145/3439950. J. Ramprasath, S. Ramakrishnan, V. Tharani, R. Sushmitha, and D. Arunima. Cloud service anomaly traffic detection using Random Forest. In Shailesh Tiwari, Munesh C. Trivedi, Mohan L. Kolhe, and Brajesh Kumar Singh, editors, Advances in Data and Information Sciences, pages 269–279, Singapore,
[14]

arXiv preprint arXiv:1906.02845

URL https://arxiv.org/abs/1906.02845. arXiv preprint arXiv:1906.02845. Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack,
[15]

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

URL https://arxiv.org/abs/2404.01833. arXiv preprint arXiv:2404.01833. Abhishek Singhania, Christophe Dupuy, Shivam Mangale, and Amani Namboori. Multi-lingual multi-turn automated red teaming for LLMs,
[16]

arXiv preprint arXiv:2504.03174

URL https://arxiv.org/abs/2504.03174. arXiv preprint arXiv:2504.03174. Songze Li, Ruishi He, Xiaojun Jia, Jun Wang, and Zhihui Fu. Knowledge-driven multi-turn jailbreaking on large language models,
[17]

arXiv preprint arXiv:2601.05445

URL https://arxiv.org/abs/2601.05445. arXiv preprint arXiv:2601.05445. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, et al. Gemini: A family of highly capable multimodal models,
[18]

Gemini: A Family of Highly Capable Multimodal Models

URL https://arxiv.org/abs/2312.11805. arXiv preprint arXiv:2312.11805. Xinkai Wang, Beibei Li, Zerui Shao, Ao Liu, Guangquan Xu, and Shouling Ji. PolyJailbreak: Cross-modal jailbreaking attacks on black-box multimodal LLMs,
[19]

arXiv preprint arXiv:2510.17277

URL https://arxiv.org/abs/2510.17277. arXiv preprint arXiv:2510.17277. Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail?,
[20]

Jailbroken: How Does LLM Safety Training Fail?

URL https://arxiv.org/abs/2307.02483. arXiv preprint arXiv:2307.02483. Zixuan Weng, Xiaolong Jin, Jinyuan Jia Regel, and Xiangyu Zhang. Foot-in-the-door: A multi-turn jailbreak for LLMs,
[21]

arXiv preprint arXiv:2502.19820

URL https://arxiv.org/abs/2502.19820. arXiv preprint arXiv:2502.19820. Yubo Li, Ramayya Krishnan, and Rema Padman. Time-to-inconsistency: A survival analysis of large language model robustness to adversarial attacks,
[22]

arXiv preprint arXiv:2510.02712

URL https://arxiv.org/abs/2510.02712. arXiv preprint arXiv:2510.02712. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models,
[23]

Universal and Transferable Adversarial Attacks on Aligned Language Models

URL https://arxiv.org/abs/2307.15043. arXiv preprint arXiv:2307.15043.

[Figure text, truncated in the source: Multimodal Input & Telemetric Covariates V(t); Pillar 1: Structural Scout (Isolation Forest), calculate S_iso^(t), test S_iso^(t) > α; Pillar 2: Distributional Anchoring & Kinematics, calculate D_M^(t) and a_t; CCM: Bayesian Belief Update, HMM State Tracking; Pillar 3: Survival Forecast, Cox Hazard h(...]
