Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks
Pith reviewed 2026-05-20 09:03 UTC · model grok-4.3
The pith
The TRIAD framework models multi-turn multimodal conversations as trajectories to bound expected time until an attack succeeds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRIAD maps multimodal, multi-turn conversational flow to a continuous trajectory and monitors it with structural anomaly detection, a Ledoit-Wolf regularized Mahalanobis distance, and topological trajectory acceleration. These features enter a time-varying Cox Proportional Hazards model through a Bayesian Hidden Markov Model feedback loop. The paper claims that this construction yields a mathematically bounded expected time-to-failure under adversarial perturbations and that malicious acceleration diverges positively.
What carries the argument
The Triple-tier Anomaly Defense (TRIAD) framework, which maps conversational flow to a continuous trajectory and integrates covariance-shift monitoring, regularized Mahalanobis distance, and topological acceleration into a Cox proportional-hazards model through a Bayesian HMM feedback loop.
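As a reading aid, the time-varying Cox form that these features would plug into can be written as below. This is the standard proportional-hazards template, not the paper's own equations, which the abstract does not give. The covariate symbols S_iso(t), D_M(t), and a_t follow the symbols that appear in the paper's extracted text; h_0, β, and x(t) are assumed notation.

```latex
\[
h\bigl(t \mid x(t)\bigr) = h_0(t)\,\exp\!\bigl(\beta^{\top} x(t)\bigr),
\qquad
x(t) = \bigl(S_{\mathrm{iso}}(t),\; D_M(t),\; a_t\bigr)^{\top},
\]
\[
S(t) = \exp\!\Bigl(-\int_0^{t} h\bigl(s \mid x(s)\bigr)\,ds\Bigr),
\qquad
\mathbb{E}[T] = \int_0^{\infty} S(t)\,dt .
\]
```

How the Bayesian HMM posterior enters this form (for example as an extra covariate or as a modulation of β) is not specified in the abstract, so it is left out of the sketch.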
If this is right
- Detects cumulative structural poisoning across longitudinal trajectories that turn-by-turn Markov guards miss.
- Supplies a predictive, real-time safeguard for autonomous agentic workflows.
- Differentiates benign creative exploration from continuous malicious drift using kinematic and geometric features.
- Supports continuous safety alignment without requiring periodic empirical retraining.
Where Pith is reading between the lines
- The same trajectory-monitoring idea could extend to tracking intent drift in multi-agent systems where goals evolve over successive exchanges.
- If the positive divergence property holds, hazard thresholds could trigger automated pauses or clarifications before failure occurs.
- A natural next test would measure how well the bounds survive when conversations mix more than two modalities or run for dozens of turns.
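The threshold idea in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's mechanism: the names `GateThresholds`, `cox_hazard`, and `gate`, the two-level clarify/pause policy, and all numeric values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Hypothetical hazard levels at which the agent reacts."""
    clarify: float  # hazard at or above this asks the user to clarify
    pause: float    # hazard at or above this pauses the workflow


def cox_hazard(baseline: float, betas: list[float], covariates: list[float]) -> float:
    """Proportional-hazards form: h = h0 * exp(beta . x)."""
    if len(betas) != len(covariates):
        raise ValueError("betas and covariates must have the same length")
    return baseline * math.exp(sum(b * x for b, x in zip(betas, covariates)))


def gate(hazard: float, thresholds: GateThresholds) -> str:
    """Map a hazard value to an action, checking the stricter level first."""
    if hazard >= thresholds.pause:
        return "pause"
    if hazard >= thresholds.clarify:
        return "clarify"
    return "continue"


if __name__ == "__main__":
    thresholds = GateThresholds(clarify=0.05, pause=0.2)
    # Assumed per-turn covariates: (anomaly score, Mahalanobis distance, acceleration).
    for covariates in ([0.1, 0.5, 0.0], [0.4, 1.5, 0.3], [0.9, 3.0, 1.2]):
        h = cox_hazard(0.01, [0.5, 1.0, 2.0], covariates)
        print(f"{h:.4f}", gate(h, thresholds))
```

Checking the pause level before the clarify level keeps the policy monotone: a higher hazard never produces a weaker action.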
Load-bearing premise
The assumption that multimodal multi-turn conversational flow can be usefully represented as a continuous trajectory whose covariance shifts and topological acceleration reliably separate benign exploration from malicious drift.
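Because a conversation yields one observation per turn, any "acceleration" of the trajectory has to be approximated from discrete samples. The sketch below uses second differences of a per-turn distance series as a discrete stand-in for a second time derivative. Unit spacing between turns and the helper names are assumptions; the paper's actual definition of topological trajectory acceleration is not given in the abstract.

```python
def finite_differences(values: list[float]) -> list[float]:
    """First differences: values[t] - values[t-1] for consecutive turns."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def turn_acceleration(distances: list[float]) -> list[float]:
    """Second differences D[t] - 2*D[t-1] + D[t-2], one value per turn from the third on."""
    return finite_differences(finite_differences(distances))


if __name__ == "__main__":
    steady_drift = [1.0, 2.0, 3.0, 4.0]   # constant speed away from the reference
    accelerating = [1.0, 2.0, 4.0, 8.0]   # speed grows every turn
    wandering = [1.0, 3.0, 2.0, 3.0]      # back-and-forth exploration
    print(turn_acceleration(steady_drift))  # [0.0, 0.0]
    print(turn_acceleration(accelerating))  # [1.0, 2.0]
    print(turn_acceleration(wandering))     # [-3.0, 2.0]
```

The wandering series shows why the premise is load-bearing: benign exploration can produce large second differences of either sign, so separating it from malicious drift depends on more than the magnitude of a single value.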
What would settle it
Run controlled multi-turn attacks on an MLLM where malicious content is injected gradually across turns and check whether the predicted time-to-failure bound is violated or whether the model misclassifies clearly benign trajectories as high-hazard.
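One way to set up that comparison is to turn per-turn hazards into a predicted number of safe turns and compare it with what controlled runs show. The sketch below assumes a discrete-time reading in which each hazard is the probability of failure at a turn given survival so far. The function names and the fixed horizon are hypothetical, and the paper's own bound may be stated differently.

```python
from typing import Optional


def survival_curve(hazards: list[float]) -> list[float]:
    """Return S[t] = prod_{k<=t} (1 - h_k), the chance of no failure through turn t."""
    curve: list[float] = []
    alive = 1.0
    for h in hazards:
        if not 0.0 <= h <= 1.0:
            raise ValueError("each per-turn hazard must be a probability in [0, 1]")
        alive *= 1.0 - h
        curve.append(alive)
    return curve


def predicted_mean_safe_turns(hazards: list[float]) -> float:
    """Expected number of safe turns within the horizon: the sum of S[t]."""
    return sum(survival_curve(hazards))


def observed_mean_safe_turns(failure_turns: list[Optional[int]], horizon: int) -> float:
    """Average safe turns over runs; None means no failure within the horizon."""
    if not failure_turns:
        raise ValueError("need at least one run")
    safe = [horizon if t is None else min(t - 1, horizon) for t in failure_turns]
    return sum(safe) / len(safe)


if __name__ == "__main__":
    print(predicted_mean_safe_turns([0.5, 0.5]))        # 0.75
    print(observed_mean_safe_turns([1, 3, None], 4))    # 2.0
```

A large gap between the two numbers on gradual-injection runs, or high predicted hazards on clearly benign runs, would be the kind of evidence the test above asks for.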
Original abstract
The expansion of Multimodal Large Language Models (MLLMs) and their integration into autonomous agentic workflows has introduced a non-stationary attack surface. Empirical observations indicate that adversaries employ progressive, cross-modal perturbations that evade turn-specific guardrails by distributing malicious intent across longitudinal conversational trajectories. Static defense mechanisms, constrained by the Markov property, evaluate inputs in isolation and fail to detect cumulative structural poisoning. To handle this limitation, this paper formulates safety verification as a dynamic survival prediction and trajectory dynamics problem. The Triple-tier Anomaly Defense (TRIAD) framework is proposed as a predictive model that maps multimodal and multi-turn conversational flow as a continuous trajectory. The framework integrates structural anomaly detection to monitor covariance shifts, a Ledoit-Wolf regularized Mahalanobis distance to monitor covariance shifts in high-dimensional spaces, and topological trajectory acceleration to differentiate benign creative exploration from continuous malicious drift. These kinematic and geometric features are integrated into a time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model (HMM) feedback loop. Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively. This framework provides a computationally efficient, interpretable, and predictive safeguard for real-time agentic AI systems, establishing a rigorous foundation for continuous safety alignment without relying on empirical retraining.
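For readers unfamiliar with the second tier, the standard definitions behind a shrinkage-regularized Mahalanobis distance are sketched below. The Ledoit-Wolf estimator shrinks the sample covariance S toward a scaled identity with a data-driven intensity δ. The symbols z_t (feature vector at turn t), m (reference mean), and p (dimension) are assumed notation, since the abstract does not say which vectors are measured.

```latex
\[
\hat{\Sigma} = (1-\delta)\,S + \delta\,\mu I_p,
\qquad
\mu = \frac{\operatorname{tr}(S)}{p},
\qquad
\delta \in [0,1],
\]
\[
D_M(t) = \sqrt{\bigl(z_t - m\bigr)^{\top}\,\hat{\Sigma}^{-1}\,\bigl(z_t - m\bigr)} .
\]
```

Whenever δ > 0 and tr(S) > 0, the shrunk estimate is positive definite and therefore invertible even if the dimension p exceeds the number of samples, which is the usual reason to regularize in high-dimensional spaces.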
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the TRIAD framework to defend against progressive, cross-modal multi-turn attacks on MLLMs in agentic workflows. It models conversational flows as continuous trajectories, using structural anomaly detection and a Ledoit-Wolf regularized Mahalanobis distance to track covariance shifts, and topological trajectory acceleration to distinguish benign exploration from malicious drift. These kinematic and geometric features are fed into a time-varying Cox proportional hazards model through a Bayesian HMM feedback loop. The central claim is that this integration yields a mathematically bounded expected time-to-failure under adversarial perturbations, with malicious acceleration diverging positively, providing a predictive, interpretable safeguard without empirical retraining.
Significance. If the claimed bound on expected time-to-failure can be rigorously derived from the feature integration and the continuous-trajectory assumption holds without unmodeled discontinuities, the work would advance dynamic safety verification by adapting survival analysis to non-stationary multimodal attack surfaces, offering an interpretable alternative to static guardrails.
major comments (2)
- [Abstract] The assertion that 'Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively' is presented without any derivation, explicit model equations, proof sketch, or integration details showing how the Ledoit-Wolf Mahalanobis and topological acceleration features, when mapped via the Bayesian HMM, enforce the bound or positive divergence in the Cox proportional-hazards form. This is load-bearing for the central claim.
- [Framework description (Bayesian HMM feedback loop)] The assumption that multimodal multi-turn flow can be represented as a continuous trajectory whose covariance shifts and topological acceleration reliably separate benign exploration from malicious drift is invoked when mapping inputs to the time-varying Cox model, yet no analysis addresses potential non-stationarities from discrete turn boundaries or cross-modal switches that could invalidate the kinematic features and prevent the claimed bound from following.
minor comments (2)
- [Methods] The notation for 'topological trajectory acceleration' and its computation from the continuous trajectory is introduced without a formal definition or pseudocode, hindering reproducibility of the geometric features.
- [Evaluation] No empirical validation or simulation results are referenced to support the separation of benign vs. malicious trajectories under the proposed features.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments on our manuscript. We address the major comments point by point below, indicating the revisions we plan to make to enhance the rigor and clarity of the presentation.
- Referee: [Abstract] The assertion that 'Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively' is presented without any derivation, explicit model equations, proof sketch, or integration details showing how the Ledoit-Wolf Mahalanobis and topological acceleration features, when mapped via the Bayesian HMM, enforce the bound or positive divergence in the Cox proportional-hazards form. This is load-bearing for the central claim.
  Authors: We agree with the referee that the abstract presents the theoretical claim without sufficient supporting details. The manuscript includes a high-level description of the theoretical analysis, but to fully substantiate the central claim, we will revise the abstract to reference the key theoretical components and add a dedicated subsection or appendix providing the derivation, explicit equations, and proof sketch for how the Ledoit-Wolf regularized Mahalanobis distance and topological acceleration features, integrated through the Bayesian HMM, lead to the bounded expected time-to-failure and positive divergence in the Cox proportional hazards model.
  Revision: yes
- Referee: [Framework description (Bayesian HMM feedback loop)] The assumption that multimodal multi-turn flow can be represented as a continuous trajectory whose covariance shifts and topological acceleration reliably separate benign exploration from malicious drift is invoked when mapping inputs to the time-varying Cox model, yet no analysis addresses potential non-stationarities from discrete turn boundaries or cross-modal switches that could invalidate the kinematic features and prevent the claimed bound from following.
  Authors: The continuous trajectory representation is a foundational assumption of the TRIAD framework, and the Bayesian HMM feedback loop is designed to capture and adapt to non-stationarities, including those from discrete turn boundaries and cross-modal switches, by dynamically updating the state and feature mappings. However, we acknowledge that a more explicit analysis of these potential invalidations is warranted. In the revision, we will include additional discussion and analysis in the framework description section to demonstrate that the kinematic features remain reliable and the bound holds under such conditions.
  Revision: yes
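The state tracking that the rebuttal relies on can be illustrated with the standard HMM forward (filtering) step, which is one plausible reading of a Bayesian belief update over hidden conversation states. The two-state benign/malicious setup, the function name, and every number below are assumptions for illustration; the paper's actual state space and emission model are not described in the abstract.

```python
def hmm_belief_update(
    belief: list[float],
    transition: list[list[float]],
    likelihood: list[float],
) -> list[float]:
    """One forward-filter step: predict with the transition matrix, then
    reweight by the likelihood of the new observation and normalize."""
    n = len(belief)
    if len(transition) != n or any(len(row) != n for row in transition):
        raise ValueError("transition must be an n x n matrix")
    if len(likelihood) != n:
        raise ValueError("likelihood must have one entry per state")
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("observation has zero likelihood under every state")
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # States: index 0 = benign, index 1 = malicious (assumed labels).
    transition = [[0.9, 0.1], [0.0, 1.0]]  # malicious is absorbing in this toy model
    belief = [1.0, 0.0]
    # Assumed per-turn likelihoods of the observed features under each state.
    for likelihood in ([0.5, 0.5], [0.1, 0.9], [0.1, 0.9]):
        belief = hmm_belief_update(belief, transition, likelihood)
        print([round(b, 3) for b in belief])
```

A filter like this adapts its belief after each turn, but it does not by itself show that features computed across a turn boundary or a modality switch stay meaningful, which is the point the referee raises.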
Circularity Check
Bounded E[time-to-failure] presented as theoretical result but constructed directly from TRIAD's own trajectory features and Cox-HMM integration
specific steps
- fitted input called prediction [Abstract]
"These kinematic and geometric features are integrated into a time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model (HMM) feedback loop. Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively."
The 'theoretical analysis' is invoked immediately after describing the feature extraction and Cox-HMM integration. The bounded E[time-to-failure] and positive divergence are therefore outputs of the same continuous-trajectory representation and covariance/topological features that the framework introduces; the survival bound is statistically forced by the model definition rather than derived from independent premises.
full rationale
The paper's central theoretical claim reduces to a restatement of its modeling assumptions. The abstract defines the TRIAD framework by mapping conversational flow to a continuous trajectory, extracting covariance-shift and topological-acceleration features, and feeding them into a time-varying Cox model via Bayesian HMM. It then asserts that this same construction 'provides a mathematically bounded expected time-to-failure' with positive divergence for malicious cases. No independent derivation, external benchmark, or parameter-free proof is supplied; the bound is therefore equivalent to the input modeling choices by construction. This matches the fitted-input-called-prediction pattern at the level of the survival outcome itself.
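To see how a bound of this kind can follow from modeling choices alone, consider a generic survival model. This is an illustration and not the paper's derivation, which is not available here; the floor h_min is an assumed quantity.

```latex
\[
h(t) \ge h_{\min} > 0 \ \ \text{for all } t
\;\Longrightarrow\;
S(t) = \exp\!\Bigl(-\int_0^{t} h(s)\,ds\Bigr) \le e^{-h_{\min} t}
\;\Longrightarrow\;
\mathbb{E}[T] = \int_0^{\infty} S(t)\,dt \le \frac{1}{h_{\min}} .
\]
```

In a Cox model such a floor holds whenever the baseline hazard and the linear predictor are both bounded below, which are properties of the model specification rather than of any observed attack. A bound obtained this way says nothing about whether the features track real adversarial behavior.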
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Conversational flow can be represented as a continuous trajectory in a high-dimensional multimodal space
- domain assumption: Covariance shifts and topological acceleration distinguish malicious drift from benign exploration
invented entities (2)
- Triple-tier Anomaly Defense (TRIAD) framework: no independent evidence
- Topological trajectory acceleration: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: maps multimodal and multi-turn conversational flow as a continuous trajectory... Ledoit-Wolf regularized Mahalanobis distance... topological trajectory acceleration... time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model (HMM) feedback loop
- IndisputableMonolith/Foundation/ArrowOfTime.lean, forward_accumulates: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: Theorem 2: Positive Divergence of Adversarial Acceleration... a_t = d²/dt² D_M(t) remains strictly positive
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
[Figure: TRIAD pipeline diagram. Recoverable labels: Multimodal Input & Telemetric Covariates V(t); Pillar 1: Structural Scout (Isolation Forest), calculate S_iso(t), test S_iso(t) > α; Pillar 2: Distributional Anchoring & Kinematics, calculate D_M(t) and a_t; CCM: Bayesian Belief Update, HMM State Tracking; Pillar 3: Survival Forecast, Cox hazard h(...)]