D²-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
Pith reviewed 2026-06-29 22:00 UTC · model grok-4.3
The pith
Hesitation count in denoising steps routes between light and heavy safety probes for diffusion LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety hesitation, defined as the number of intermediate hidden states falling within a small margin of the lightweight probe's decision boundary during the denoising trajectory, predicts probe failure on the final output and thereby provides an effective proxy of sample difficulty. This enables D²-Monitor, a bi-level safety monitor that jointly estimates hesitation and performs base classification with a lightweight probe, activating a more expressive but heavier probe only when hesitation exceeds a threshold. The resulting dynamic routing yields state-of-the-art performance with a compact parameter footprint (≤ 0.85M parameters) and the best effectiveness-efficiency trade-off relative to e
What carries the argument
Bi-level monitor with hesitation-aware routing, where the count of near-boundary intermediate states in the denoising trajectory decides whether to escalate from a lightweight always-on probe to a heavier one.
If this is right
- Monitoring cost scales with per-sample difficulty rather than remaining fixed at the cost of the heaviest probe.
- A compact always-on probe becomes sufficient for the majority of inputs while accuracy is preserved on difficult cases.
- The same hesitation signal can be extracted from any D-LLM that exposes intermediate hidden states during denoising.
- Parameter count stays under one million even when both probes are included.
Where Pith is reading between the lines
- The routing logic may transfer to other iterative generative processes that expose intermediate representations, such as certain autoregressive models with early-exit mechanisms.
- Hesitation counts could serve as a general uncertainty signal for tasks beyond safety classification.
- Replacing the heavier probe with a different architecture or distillation method might further improve the efficiency side of the trade-off.
Load-bearing premise
The number of hesitation steps near the lightweight probe's decision boundary reliably predicts when that probe will fail on the final output, and the same threshold works across the three datasets and four models tested.
What would settle it
Measuring whether hesitation count still predicts probe failure on a fifth D-LLM or on a new safety dataset collected after the original experiments.
Figures
read the original abstract
Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in D-LLM's trajectory predicts probe failure effectively, providing a proxy of sample difficulty. Building on this analysis, we propose $D^2$-Monitor, a bi-level safety monitor for D-LLMs. $D^2$-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes $D^2$-Monitor, a bi-level dynamic safety monitor for diffusion LLMs (D-LLMs). A lightweight probe performs always-on classification while also counting 'safety hesitation' steps (intermediate hidden states falling within a small margin of the probe's decision boundary). When the hesitation count exceeds a threshold, a heavier probe is routed to. The method is evaluated on WildguardMix, ToxicChat, and OpenAI-Moderation across four D-LLMs, claiming SOTA performance with ≤0.85M parameters and the best effectiveness-efficiency trade-off versus eight baselines.
Significance. If the hesitation count is shown to be a robust, non-circular predictor of probe failure that generalizes across datasets and models without per-dataset margin tuning, the bi-level routing approach would offer a practical way to balance monitoring accuracy and compute cost for D-LLMs. The trajectory-level analysis of denoising steps is a distinctive contribution relative to static AR-LLM monitors.
major comments (2)
- [Abstract] Abstract and evaluation sections: the hesitation margin and threshold are never defined numerically or procedurally, nor is any ablation or cross-dataset validation of their predictive power for final-output failure provided. Without these, the central claim that the hesitation count serves as a reliable proxy cannot be verified and risks circularity if the margin was selected using test-set information.
- [Evaluation] Evaluation (implied in abstract): no error bars, dataset statistics, or per-dataset correlation strengths between hesitation count and probe failure are reported. This leaves the SOTA and 'best trade-off' claims ungrounded, especially given the free parameters (margin, threshold) listed in the axiom ledger.
minor comments (1)
- [Abstract] The abstract states results on three datasets and four D-LLMs but supplies no table or figure references; adding explicit pointers to the relevant results tables would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for greater clarity on parameter definitions and additional evaluation details. We will revise the manuscript to address these points directly.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation sections: the hesitation margin and threshold are never defined numerically or procedurally, nor is any ablation or cross-dataset validation of their predictive power for final-output failure provided. Without these, the central claim that the hesitation count serves as a reliable proxy cannot be verified and risks circularity if the margin was selected using test-set information.
Authors: We agree that explicit numerical definitions, the selection procedure, and supporting ablations were insufficiently detailed. In the revision we will state the margin value (0.05) and threshold (2 steps), describe their selection via 5-fold cross-validation on held-out validation splits from each dataset, and add an ablation subsection quantifying how hesitation count predicts probe failure (with accuracy/F1 improvements when routing is enabled). All tuning will be documented as validation-only to eliminate circularity concerns. revision: yes
-
Referee: [Evaluation] Evaluation (implied in abstract): no error bars, dataset statistics, or per-dataset correlation strengths between hesitation count and probe failure are reported. This leaves the SOTA and 'best trade-off' claims ungrounded, especially given the free parameters (margin, threshold) listed in the axiom ledger.
Authors: We acknowledge these omissions. The revised evaluation section will report: standard error bars over five random seeds for all metrics; a table of dataset statistics (size, positive/negative ratio, average trajectory length); and per-dataset Pearson and Spearman correlations between hesitation count and probe failure rate. We will also clarify in the axiom ledger that margin and threshold are fixed after validation-set tuning and not re-tuned on test data, thereby grounding the SOTA and efficiency claims. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper identifies the hesitation signal via empirical analysis of trajectories on the given datasets and then constructs the bi-level router around that observed signal. No equations, fitted parameters, or self-citations are quoted that reduce the final performance metric or routing decision to a quantity defined by the same metric. Evaluation on three external datasets across four D-LLMs supplies independent benchmarks, so the reported gains do not collapse to a self-definitional or fitted-input construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- hesitation threshold
- hesitation margin
axioms (1)
- domain assumption Intermediate hidden states near the lightweight probe boundary indicate cases where the probe will fail
Reference graph
Works this paper leans on
-
[1]
Attention is all you need,
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017
2017
-
[2]
J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat,et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan,et al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv,et al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Evaluating Large Language Models Trained on Code
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman,et al., “Evaluating large language models trained on code,”arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
Competition-level code generation with alphacode,
Y . Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago,et al., “Competition-level code generation with alphacode,”Science, vol. 378, no. 6624, pp. 1092–1097, 2022
2022
-
[7]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano,et al., “Training verifiers to solve math word problems,”arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou,et al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022
2022
-
[9]
Large language diffusion models,
S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. ZHOU, Y . Lin, J.-R. Wen, and C. Li, “Large language diffusion models,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[10]
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models
F. Zhu, R. Wang, S. Nie, X. Zhang, C. Wu, J. Hu, J. Zhou, J. Chen, Y . Lin, J.-R. Wen,et al., “Llada 1.5: Variance-reduced preference optimization for large language diffusion models,”arXiv preprint arXiv:2505.19223, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
T. Bie, M. Cao, K. Chen, L. Du, M. Gong, Z. Gong, Y . Gu, J. Hu, Z. Huang, Z. Lan,et al., “Llada2.0: Scaling up diffusion language models to 100b,”arXiv preprint arXiv:2512.15745, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Mercury: Ultra-Fast Language Models Based on Diffusion
I. Labs, S. Khanna, S. Kharbanda, S. Li, H. Varma, E. Wang, S. Birnbaum, Z. Luo, Y . Miraoui, A. Palrecha, et al., “Mercury: Ultra-fast language models based on diffusion,”arXiv preprint arXiv:2506.17298, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Simple and effective masked diffusion language models,
S. S. Sahoo, M. Arriola, Y . Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V . Kuleshov, “Simple and effective masked diffusion language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 130136–130184, 2024
2024
-
[14]
Simplified and generalized masked diffusion for discrete data,
J. Shi, K. Han, Z. Wang, A. Doucet, and M. Titsias, “Simplified and generalized masked diffusion for discrete data,”Advances in neural information processing systems, vol. 37, pp. 103131–103167, 2024
2024
-
[15]
Introducing mercury 2
I. Labs, “Introducing mercury 2.” https://www.inceptionlabs.ai/blog/ introducing-mercury-2, 2026
2026
-
[16]
Disrupting the first reported ai-orchestrated cyber espionage campaign
Anthropic, “Disrupting the first reported ai-orchestrated cyber espionage campaign.” https://www. anthropic.com/news/disrupting-AI-espionage, 2025
2025
-
[17]
A2d: Any-order, any-step safety alignment for diffusion language models,
W. Jeung, S. Yoon, Y . Cho, D. Jeon, S. Shin, H. Hong, and A. No, “A2d: Any-order, any-step safety alignment for diffusion language models,” inThe Fourteenth International Conference on Learning Representations, 2026
2026
-
[18]
Diffuguard: How intrinsic safety is lost and found in diffusion large language models,
Z. Li, Z. Nie, Z. Zhou, Y . Liu, Y . Zhang, Y . Cheng, Q. Wen, K. Wang, Y . Guo, and J. Zhang, “Diffuguard: How intrinsic safety is lost and found in diffusion large language models,” inThe Fourteenth International Conference on Learning Representations, 2026
2026
-
[19]
M. Nasr, N. Carlini, C. Sitawarin, S. V . Schulhoff, J. Hayes, M. Ilie, J. Pluto, S. Song, H. Chaudhari, I. Shumailov,et al., “The attacker moves second: Stronger adaptive attacks bypass defenses against llm jailbreaks and prompt injections,”arXiv preprint arXiv:2510.09023, 2025. 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms,
S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y . Lin, N. Lambert, Y . Choi, and N. Dziri, “Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms,”Advances in neural information processing systems, vol. 37, pp. 8093–8131, 2024
2024
-
[21]
Detecting strategic deception with linear probes,
N. Goldowsky-Dill, B. Chughtai, S. Heimersheim, and M. Hobbhahn, “Detecting strategic deception with linear probes,” inInternational Conference on Machine Learning, pp. 19755–19786, PMLR, 2025
2025
-
[22]
Simple probes can catch sleeper agents, 2024,
M. MacDiarmid, T. Maxwell, N. Schiefer, J. Mu, J. Kaplan, D. Duvenaud, S. Bowman, A. Tamkin, E. Perez, M. Sharma,et al., “Simple probes can catch sleeper agents, 2024,”URL https://www. anthropic. com/news/probes-catch-sleeper-agents
2024
-
[23]
Detecting high-stakes interactions with activation probes,
A. McKenzie, U. Pawar, P. Blandfort, W. Bankes, D. Krueger, E. S. Lubana, and D. Krasheninnikov, “Detecting high-stakes interactions with activation probes,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[24]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y . Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, et al., “Llama guard: Llm-based input-output safeguard for human-ai conversations,”arXiv preprint arXiv:2312.06674, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Shieldgemma 2: Robust and tractable image content moderation
W. Zeng, D. Kurniawan, R. Mullins, Y . Liu, T. Saha, D. Ike-Njoku, J. Gu, Y . Song, C. Xu, J. Zhou,et al., “Shieldgemma 2: Robust and tractable image content moderation,”arXiv preprint arXiv:2504.01081, 2025
-
[26]
Understanding intermediate layers using linear classifier probes
G. Alain and Y . Bengio, “Understanding intermediate layers using linear classifier probes,”arXiv preprint arXiv:1610.01644, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[27]
Locating and editing factual associations in gpt,
K. Meng, D. Bau, A. Andonian, and Y . Belinkov, “Locating and editing factual associations in gpt,” Advances in neural information processing systems, vol. 35, pp. 17359–17372, 2022
2022
-
[28]
Building production-ready probes for Gemini.arXiv preprint arXiv:2601.11516,
J. Kramár, J. Engels, Z. Wang, B. Chughtai, R. Shah, N. Nanda, and A. Conmy, “Building production-ready probes for gemini,”arXiv preprint arXiv:2601.11516, 2026
-
[29]
Time is a feature: Exploiting temporal dynamics in diffusion language models,
W. Wang, B. Fang, C. Jing, Y . Shen, Y . Shen, Q. Wang, H. Ouyang, H. Chen, and C. Shen, “Time is a feature: Exploiting temporal dynamics in diffusion language models,” inThe Fourteenth International Conference on Learning Representations, 2026
2026
-
[30]
Diffusion language model knows the answer before it decodes,
P. Li, Y . Zhou, D. Muhtar, L. Yin, S. Yan, L. Shen, Y . Liang, S. V osoughi, and S. Liu, “Diffusion language model knows the answer before it decodes,” inThe Fourteenth International Conference on Learning Representations, 2026
2026
-
[31]
Structured denoising diffusion models in discrete state-spaces,
J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg, “Structured denoising diffusion models in discrete state-spaces,”Advances in neural information processing systems, vol. 34, pp. 17981–17993, 2021
2021
-
[32]
Argmax flows and multinomial diffusion: Learning categorical distributions,
E. Hoogeboom, D. Nielsen, P. Jaini, P. Forré, and M. Welling, “Argmax flows and multinomial diffusion: Learning categorical distributions,”Advances in neural information processing systems, vol. 34, pp. 12454– 12465, 2021
2021
-
[33]
Scaling up masked diffusion models on text,
S. Nie, F. Zhu, C. Du, T. Pang, Q. Liu, G. Zeng, M. Lin, and C. Li, “Scaling up masked diffusion models on text,” inThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[34]
Diffusion models: A comprehensive survey of methods and applications,
L. Yang, Z. Zhang, Y . Song, S. Hong, R. Xu, Y . Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,”ACM computing surveys, vol. 56, no. 4, pp. 1–39, 2023
2023
-
[35]
A fragile guardrail: Diffusion llm’s safety blessing and its failure mode,
Z. He, Y . Chen, L. Lin, Y . Wang, S. Chang, E. Sommerlade, P. Torr, J. Yu, A. Bibi, and J. Yu, “A fragile guardrail: Diffusion llm’s safety blessing and its failure mode,”arXiv preprint arXiv:2602.00388, 2026
-
[36]
AutoDAN: Generating stealthy jailbreak prompts on aligned large language models,
X. Liu, N. Xu, M. Chen, and C. Xiao, “AutoDAN: Generating stealthy jailbreak prompts on aligned large language models,” inThe Twelfth International Conference on Learning Representations, 2024
2024
-
[37]
How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms,
Y . Zeng, H. Lin, J. Zhang, D. Yang, R. Jia, and W. Shi, “How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14322–14350, 2024
2024
-
[38]
The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer
Y . Chen, J. Yu, A. Liu, P. Torr, and A. Bibi, “The alignment curse: Cross-modality jailbreak transfer in omni-models,”arXiv preprint arXiv:2602.02557, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[39]
Using gpt-4 for content moderation,
L. Weng, V . Goel, and A. Vallone, “Using gpt-4 for content moderation,” 2023. 12
2023
-
[40]
The linear representation hypothesis and the geometry of large language models,
K. Park, Y . J. Choe, and V . Veitch, “The linear representation hypothesis and the geometry of large language models,” inInternational Conference on Machine Learning, pp. 39643–39666, PMLR, 2024
2024
-
[41]
Simple factuality probes detect hallucinations in long-form natural language generation,
J. Han, N. Band, M. Razzak, J. Kossen, T. G. Rudner, and Y . Gal, “Simple factuality probes detect hallucinations in long-form natural language generation,”Findings of the Association for Computational Linguistics: EMNLP, pp. 16209–16226, 2025
2025
-
[42]
Toxicity detection for free,
Z. Hu, J. Piet, G. Zhao, J. Jiao, and D. Wagner, “Toxicity detection for free,”Advances in Neural Information Processing Systems, vol. 37, pp. 17518–17540, 2024
2024
-
[43]
Branchynet: Fast inference via early exiting from deep neural networks,
S. Teerapittayanon, B. McDanel, and H.-T. Kung, “Branchynet: Fast inference via early exiting from deep neural networks,” in2016 23rd international conference on pattern recognition (ICPR), pp. 2464–2469, IEEE, 2016
2016
-
[44]
Designing and interpreting probes with control tasks,
J. Hewitt and P. Liang, “Designing and interpreting probes with control tasks,” inProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (emnlp-ijcnlp), pp. 2733–2743, 2019
2019
-
[45]
Pareto probing: Trading off accuracy for complexity,
T. Pimentel, N. Saphra, A. Williams, and R. Cotterell, “Pareto probing: Trading off accuracy for complexity,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3138–3153, 2020
2020
-
[46]
Cost-effective constitutional classifiers via representation re-use
H. Cunningham, A. Peng, J. Wei, E. Ong, F. Roger, L. Petrini, M. Wagner, V . Mikulik, and M. Sharma, “Cost-effective constitutional classifiers via representation re-use.” Anthropic Alignment Science Blog, June 2025
2025
-
[47]
Beyond linear probes: Dynamic safety monitoring for language models,
J. Oldfield, P. Torr, I. Patras, A. Bibi, and F. Barez, “Beyond linear probes: Dynamic safety monitoring for language models,” inThe Fourteenth International Conference on Learning Representations, 2026
2026
-
[48]
Constitutional classifiers++: Efficient production-grade defenses against universal jailbreaks,
H. Cunningham, J. Wei, Z. Wang, A. Persic, A. Peng, J. Abderrachid, R. Agarwal, B. Chen, A. Co- hen, A. Dau,et al., “Constitutional classifiers++: Efficient production-grade defenses against universal jailbreaks,”arXiv preprint arXiv:2601.04603, 2026
-
[49]
Probing classifiers: Promises, shortcomings, and advances,
Y . Belinkov, “Probing classifiers: Promises, shortcomings, and advances,”Computational Linguistics, vol. 48, pp. 207–219, 04 2022
2022
-
[50]
A non-linear structural probe,
J. C. White, T. Pimentel, N. Saphra, and R. Cotterell, “A non-linear structural probe,” inProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 132–138, 2021
2021
-
[51]
Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation,
Z. Lin, Z. Wang, Y . Tong, Y . Wang, Y . Guo, Y . Wang, and J. Shang, “Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation,” 2023
2023
-
[52]
A holistic approach to undesired content detection,
T. Markov, C. Zhang, S. Agarwal, T. Eloundou, T. Lee, S. Adler, A. Jiang, and L. Weng, “A holistic approach to undesired content detection,”arXiv preprint arXiv:2208.03274, 2022
-
[53]
Truth as a trajectory: What internal representations reveal about large language model reasoning,
H. Damirchi, I. Meza De la Jara, E. Abbasnejad, A. Shamsi, Z. Zhang, and J. Shi, “Truth as a trajectory: What internal representations reveal about large language model reasoning,”arXiv e-prints, pp. arXiv–2603, 2026
2026
-
[54]
Llada2.1: Speeding up text diffusion via token editing
T. Bie, M. Cao, X. Cao, B. Chen, F. Chen, K. Chen, L. Du, D. Feng, H. Feng, M. Gong,et al., “Llada2.1: Speeding up text diffusion via token editing,”arXiv preprint arXiv:2602.08676, 2026
-
[55]
Obfuscated activations bypass LLM latent-space defenses,
L. Bailey, A. Serrano, A. Sheshadri, M. Seleznyov, J. Taylor, E. Jenner, J. Hilton, S. Casper, C. Guestrin, and S. Emmons, “Obfuscated activations bypass LLM latent-space defenses,” inThe Fourteenth International Conference on Learning Representations, 2026. 13 A Limitation We perform experiments on a variety of D-LLM models, where we show that D2-Monitor...
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.