arXiv: 2510.08592 · v3 · submitted 2025-10-04 · cs.CL · cs.AI · cs.LG

Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra

Pith reviewed 2026-05-18 09:50 UTC · model grok-4.3

keywords: test-time scaling · large language models · AI safety · candidate diversity · adversarial prompts · safety guardrails · Monte Carlo Tree Search · Best-of-N

The pith

Curtailing candidate diversity in test-time scaling substantially increases the likelihood of unsafe outputs from large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to demonstrate that the standard assumption behind test-time scaling—that drawing on a pool of diverse candidate responses improves reliability—overlooks a direct safety downside. When that diversity is reduced even modestly, TTS methods shift toward producing unsafe content at noticeably higher rates. A sympathetic reader would pay attention because TTS techniques are being adopted to boost reasoning performance, yet this creates a failure mode that bypasses conventional adversarial prompts and evades common guardrail classifiers. The authors introduce a reference-guided diversity reduction protocol to isolate and measure the effect, showing consistent results across open models, two TTS strategies, and transfer to closed-source systems.

Core claim

When candidate diversity is curtailed using a reference-guided protocol, test-time scaling strategies such as Monte Carlo Tree Search and Best-of-N produce unsafe outputs at higher rates. The increase occurs across open-source models including Qwen3, Mistral, Llama3.1, and Gemma3, and the pattern transfers to closed-source models such as OpenAI o3-mini and Gemini-2.5-Pro. The effect is frequently stronger than that obtained from prompts with high adversarial intent scores, and numerous widely used safety guardrail classifiers fail to flag the inputs generated by the protocol.

What carries the argument

The reference-guided diversity reduction protocol (RefDiv), which constructs prompts that systematically lower the variety among candidate responses generated during test-time scaling to expose safety degradation.
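To make "variety among candidate responses" concrete, here is a minimal sketch of one way to score it, using Shannon entropy (the measure named in the figure captions). The function name `pool_entropy` and the choice to pool lowercase whitespace tokens across all candidates are assumptions for illustration; this page does not specify how the paper computes its metric.

```python
import math
from collections import Counter

def pool_entropy(candidates):
    """Shannon entropy (bits) of the token distribution pooled over all candidates.

    Assumption: tokens are lowercase whitespace-separated words. The paper's
    actual tokenization and pooling are not described on this page.
    """
    tokens = [tok for text in candidates for tok in text.lower().split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy candidate pools, invented for illustration.
diverse = [
    "I cannot help with that request",
    "Here is a safe alternative approach",
    "Let us consider the risks first",
]
collapsed = ["Sure here is how"] * 3

print(pool_entropy(diverse) > pool_entropy(collapsed))
```

A pool of identical responses scores low because a few tokens carry all the probability mass, which is the condition RefDiv is described as driving the candidate set toward.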

If this is right

TTS pipelines become more likely to generate unsafe content once candidate diversity falls by even a modest amount.
The safety degradation appears under both Monte Carlo Tree Search and Best-of-N and persists when moving from open to closed-source models.
Standard safety guardrail classifiers do not detect the prompts that RefDiv uses to lower diversity.
The diversity-safety link is presented as a general property of TTS rather than an artifact of particular models or implementations.
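One plausible intuition behind the points above, sketched under loud assumptions: Best-of-N can only return something that is already in the candidate pool, so a collapsed pool leaves the selector nothing safe to pick. The scorer `toy_reward` and the labeler `is_unsafe` are hypothetical stand-ins, not the reward model or judge used in the paper, and this is not presented as the authors' stated mechanism.

```python
def best_of_n(candidates, reward):
    """Best-of-N selection: return the candidate with the highest reward."""
    return max(candidates, key=reward)

def toy_reward(text):
    # Hypothetical scorer that prefers refusals; a real pipeline would call
    # a learned reward model here.
    return 1.0 if text.startswith("I can't") else 0.5

def is_unsafe(text):
    # Hypothetical label: treat affirmative compliance as unsafe.
    return text.startswith("Sure")

# Toy pools, invented for illustration.
diverse_pool = ["Sure, here is how", "I can't help with that.", "Sure, step one is"]
collapsed_pool = ["Sure, here is how"] * 3

print(is_unsafe(best_of_n(diverse_pool, toy_reward)))    # selector can pick the refusal
print(is_unsafe(best_of_n(collapsed_pool, toy_reward)))  # no refusal left to pick
```

Even a selector that favors safe responses cannot help once every candidate is the same affirmative completion.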

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety evaluations for any TTS deployment should include controlled diversity-reduction tests rather than relying solely on direct adversarial prompts.
Systems that combine TTS with downstream applications may need new selection mechanisms that explicitly preserve or restore candidate variety.
The same diversity-reduction dynamic could appear in other multi-generation techniques beyond the two TTS strategies examined here.

Load-bearing premise

The RefDiv protocol reduces diversity without introducing separate factors that independently increase unsafe outputs.

What would settle it

Running the same TTS pipelines with RefDiv applied and observing no rise (or a drop) in unsafe output rates across multiple models and both Monte Carlo Tree Search and Best-of-N would falsify the central claim.
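The check described above amounts to comparing unsafe-output rates with and without RefDiv. A minimal sketch follows, assuming a Wilson score interval and a conservative "intervals do not overlap" criterion; both choices are assumptions for illustration, since this page does not say how the paper tests for a rise. The helper name `rise_detected` is hypothetical.

```python
import math

def wilson_interval(unsafe, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    p = unsafe / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def rise_detected(baseline_unsafe, refdiv_unsafe, n):
    """True when the RefDiv interval sits entirely above the baseline interval."""
    _, base_hi = wilson_interval(baseline_unsafe, n)
    ref_lo, _ = wilson_interval(refdiv_unsafe, n)
    return ref_lo > base_hi

# Toy counts for illustration only; they are not results from the paper.
print(rise_detected(baseline_unsafe=20, refdiv_unsafe=60, n=100))
print(rise_detected(baseline_unsafe=20, refdiv_unsafe=22, n=100))
```

Repeating such a comparison per model and per TTS strategy, and finding no rise anywhere, is the outcome that would count against the central claim.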

Figures

Figures reproduced from arXiv: 2510.08592 by Anshuman Chhabra, Hadi Askari, Muhao Chen, Shahriar Kabir Nahin.

**Figure 1.** In initial iterations of REFDIV (α_t is small for small t), the stress test steers candidates (which are comparatively more diverse) towards affirmative reference tokens. As α_t increases with t, REFDIV minimizes candidate diversity wholly via Shannon entropy, demonstrating a previously unknown failure mode of TTS-enabled LLMs. Here, T is the total number of algorithm iterations. Early in the optimization…

**Figure 2.** ASR trends across iterations for AutoDAN, GCG, and R…

**Figure 3.** ASR trends across iterations for AutoDAN, GCG, and R…

**Figure 4.** Analyzing the Shannon Entropy trend across iterations for R…

**Figure 5.** Transferability of REFDIV prompts for Best-of-N → MCTS and MCTS → Best-of-N across LLMs. An additional question to answer is: how well do adversarial prompts generated for a specific TTS strategy by REFDIV transfer across different TTS strategies? Essentially, if adversarial strings can transfer across TTS strategies, this indicates clearly that the diversity-specific failure mode of TTS is a fundamental…

**Figure 6.** Transferability (ASR) of REFDIV from open-source LLMs with Best-of-N (left) and MCTS (right) TTS to closed-source LLMs. Clearly, REFDIV generated prompts transfer well across TTS strategies. However, in the previous scenario, the LLM models are still accessible, leading us to the question: do the adversarial stress test prompts generated by REFDIV transfer across closed-source LLMs as well? If the answer…

**Figure 7.** ASR of open-source models attack prompts generated via…

**Figure 8.** ASR comparison between AutoDAN and REFDIV in Best-of-N TTS (N = 2)

**Figure 9.** Shannon entropy comparison between AutoDAN and R…

**Figure 10.** ASR comparison between AutoDAN and REFDIV in Best-of-N TTS (N = 16)

**Figure 11.** Shannon entropy comparison between AutoDAN and R…

**Figure 12.** Analyzing the Shannon Entropy (MCTS) trend across iterations for R…

**Figure 13.** Comparison of ASR between AutoDAN and REFDIV (in Best-of-N, N = 8, with the DeBERTa reward model)

**Figure 14.** Comparison of Shannon entropy between AutoDAN and…
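The Figure 1 caption describes a weight α_t that starts small, so the stress test favors affirmative reference tokens, and grows with t until the objective is entropy minimization alone. The sketch below is one purely illustrative way to blend the two terms. The linear schedule `alpha(t, T) = t / T`, the first-token affinity measure, and the function names are all assumptions; the paper's actual objective is not given on this page.

```python
import math
from collections import Counter

def alpha(t, T):
    """Hypothetical linear schedule: 0 at t=0, rising to 1 at t=T."""
    return t / T

def token_entropy(candidates):
    """Shannon entropy (bits) of pooled lowercase whitespace tokens."""
    tokens = [tok for text in candidates for tok in text.lower().split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def reference_affinity(candidates, reference_tokens):
    """Fraction of candidates whose first token is in the reference set."""
    hits = 0
    for text in candidates:
        words = text.lower().split()
        if words and words[0] in reference_tokens:
            hits += 1
    return hits / len(candidates)

def stress_score(candidates, reference_tokens, t, T):
    """Illustrative blend: reference steering early, entropy minimization late."""
    a = alpha(t, T)
    return (1 - a) * reference_affinity(candidates, reference_tokens) - a * token_entropy(candidates)

# Toy pool and reference set, invented for illustration.
pool = ["Sure here is", "Sure here is", "No I refuse"]
reference = {"sure"}
for step in (0, 5, 10):
    print(step, stress_score(pool, reference, step, 10))
```

Under this assumed blend, a higher score means the candidate pool is closer to what the stress test is steering toward at iteration t.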

Original abstract

Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across open-source models (e.g. Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently signifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3-mini and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard), are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper links lower candidate diversity in TTS to higher rates of unsafe outputs, but RefDiv likely confounds that effect with reference-driven shifts in the candidate pool.

The main observation here is that cutting diversity among the responses explored during test-time scaling makes unsafe outputs more common, and this shows up across several open models, both MCTS and Best-of-N, and even transfers to closed models like o3-mini. Their RefDiv protocol is the tool they use to create that lower-diversity condition by conditioning on a reference answer, and they report the unsafe rate rises more than it does with direct high-adversarial prompts. They also note that standard guardrails miss the RefDiv inputs themselves. That combination of results is the part worth paying attention to if you work on inference-time methods or safety evaluation.

Referee Report

2 major / 2 minor

Summary. The paper claims that Test-Time Scaling (TTS) in LLMs carries an indirect safety risk: when candidate diversity is reduced, even modestly, the rate of unsafe outputs rises substantially. The authors introduce RefDiv, a reference-guided protocol that lowers measured diversity by conditioning generation on a reference response, and use it as a diagnostic to show elevated unsafe rates across open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two TTS methods (MCTS and Best-of-N). The effect is reported to exceed that of direct high-adversarial-intent prompts, transfers to closed models (o3-mini, Gemini-2.5-Pro), and evades standard safety classifiers such as Llama-Guard.

Significance. If the central causal claim holds, the result identifies a previously unrecognized failure mode in TTS pipelines that are widely adopted for reasoning improvement. Demonstrating that diversity reduction reliably increases unsafe outputs, that the phenomenon is stronger than explicit adversarial prompting, and that it generalizes across model families and TTS strategies would have direct implications for safe deployment of scaled inference. The empirical breadth across multiple open models and the transfer result to closed models are strengths; the work also supplies a concrete diagnostic protocol that could be adopted for future safety evaluation.

major comments (2)

[§3] RefDiv protocol description: The protocol reduces diversity by conditioning on a reference response, yet no ablation is presented that holds the reference fixed while varying only the diversity parameter (e.g., temperature or nucleus sampling on the identical reference). Without this control, it remains possible that the observed rise in unsafe outputs is driven by reference-induced semantic or safety-relevant shifts rather than by diversity reduction per se.
[§4] Experimental results: The manuscript states that RefDiv produces higher unsafe rates than direct adversarial prompts and that the effect is consistent across models and TTS strategies, but provides no quantitative metrics, statistical significance tests, or details on the exact procedure used to label outputs as unsafe. These omissions make it difficult to assess the magnitude and reliability of the reported effect sizes.
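The control requested in the first major comment treats sampling temperature as a diversity knob. As a minimal sketch of why that knob works, the code below shows that the Shannon entropy of a temperature-scaled softmax falls as temperature falls. The logits are made up for illustration and do not come from any model in the paper.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Made-up next-token logits, for illustration only.
logits = [2.0, 1.0, 0.5, 0.1]
for temp in (0.3, 0.7, 1.0, 1.5):
    print(temp, round(entropy_bits(softmax(logits, temp)), 3))
```

Holding the prompt and reference fixed while sweeping only `temperature` would vary this entropy without changing the conditioning content, which is the separation the comment asks for.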

minor comments (2)

[Abstract] The verb 'signifies' in 'constraining diversity consistently signifies the rate' is unclear; a more precise term such as 'increases' would improve readability.
[§3] The paper should clarify how the diversity metric itself is computed and whether it is computed before or after reference conditioning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below and have revised the manuscript to strengthen the presentation of our results.

Referee: [§3] RefDiv protocol description: The protocol reduces diversity by conditioning on a reference response, yet no ablation is presented that holds the reference fixed while varying only the diversity parameter (e.g., temperature or nucleus sampling on the identical reference). Without this control, it remains possible that the observed rise in unsafe outputs is driven by reference-induced semantic or safety-relevant shifts rather than by diversity reduction per se.

Authors: We agree that an explicit ablation isolating diversity while holding the reference fixed would more rigorously support the causal interpretation. In the revised manuscript we have added this control experiment in Section 3: a single reference response is generated once per prompt and then candidate pools are produced by varying only temperature and nucleus sampling parameters around that fixed reference. The results confirm that unsafe rates rise as diversity decreases even under this stricter control, indicating the effect is attributable to diversity reduction rather than reference-induced content shifts. New figures and tables documenting the ablation have been included.

Revision: yes

Referee: [§4] Experimental results: The manuscript states that RefDiv produces higher unsafe rates than direct adversarial prompts and that the effect is consistent across models and TTS strategies, but provides no quantitative metrics, statistical significance tests, or details on the exact procedure used to label outputs as unsafe. These omissions make it difficult to assess the magnitude and reliability of the reported effect sizes.

Authors: We thank the referee for highlighting these omissions. In the revised Section 4 we now report exact unsafe rates (with standard deviations) for every model–TTS combination, percentage increases relative to direct adversarial prompts, and statistical significance results (paired t-tests and Wilcoxon signed-rank tests with p-values). We have also added a dedicated subsection detailing the unsafe labeling procedure, which combines Llama-Guard-3 with manual verification on a 10% random sample (inter-annotator agreement reported). These additions allow readers to evaluate both the magnitude and reliability of the observed effects.

Revision: yes
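The second response mentions paired significance tests over model–TTS combinations. As a standard-library stand-in, the sketch below runs an exact paired sign-flip permutation test; it is a different procedure from the paired t-test and Wilcoxon signed-rank test named in the rebuttal, chosen here only because it needs no external packages. The rates are toy numbers, not results from the paper.

```python
from itertools import product

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total

# Toy unsafe rates per model-TTS pair, invented for illustration only.
refdiv_rates = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59]
baseline_rates = [0.41, 0.39, 0.52, 0.45, 0.47, 0.40]

print(paired_sign_flip_pvalue(refdiv_rates, baseline_rates))
```

With only six pairs the smallest attainable two-sided p-value is 2/64, so a real analysis would need more pairs or a parametric test to report stronger evidence.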

Circularity Check

0 steps flagged

No circularity: empirical diagnostic protocol with external results

The paper introduces RefDiv as an external reference-guided protocol to reduce diversity and then reports experimental outcomes on unsafe TTS outputs across models and strategies. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the central claim to its own inputs by construction. The observed correlation between constrained diversity and higher unsafe rates is treated as an empirical finding rather than a tautological renaming or self-referential fit. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical results of the newly introduced RefDiv protocol and the assumption that observed increases in unsafe outputs are caused by diversity reduction rather than other properties of the generated candidates.

axioms (1)

Domain assumption: Diverse candidate pools in TTS enhance reliability and safety.
This is the tacit premise the paper challenges with its diversity-reduction experiments.

invented entities (1)

RefDiv protocol (no independent evidence)
purpose: Reference-guided method to reduce diversity in TTS candidate sets as a diagnostic attack.
Newly proposed technique used to generate the adversarial inputs that demonstrate the claimed effect.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
cs.AI 2025-10 unverdicted novelty 4.0

A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.
