Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models
Pith reviewed 2026-05-18 09:50 UTC · model grok-4.3
The pith
Curtailing candidate diversity in test-time scaling substantially increases the likelihood of unsafe outputs from large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When candidate diversity is curtailed using a reference-guided protocol, test-time scaling strategies such as Monte Carlo Tree Search and Best-of-N produce unsafe outputs at higher rates. The increase occurs across open-source models including Qwen3, Mistral, Llama3.1, and Gemma3, and the pattern transfers to closed-source models such as OpenAI o3-mini and Gemini-2.5-Pro. The effect is frequently stronger than that obtained from prompts with high adversarial intent scores, and numerous widely used safety guardrail classifiers fail to flag the inputs generated by the protocol.
What carries the argument
The reference-guided diversity reduction protocol (RefDiv) constructs prompts that systematically lower the variety among candidate responses generated during test-time scaling, which exposes the safety degradation.
If this is right
- TTS pipelines become more likely to generate unsafe content once candidate diversity falls by even a modest amount.
- The safety degradation appears under both Monte Carlo Tree Search and Best-of-N and persists when moving from open to closed-source models.
- Standard safety guardrail classifiers do not detect the prompts that RefDiv uses to lower diversity.
- The diversity-safety link is presented as a general property of TTS rather than an artifact of particular models or implementations.
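For readers unfamiliar with Best-of-N, one of the two TTS strategies named above, the following is a minimal sketch of the selection loop. It is not the paper's implementation: `toy_generate` and `toy_score` are hypothetical stand-ins for an LLM sampler and a reward model, and the length-based reward is chosen only so the example runs.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str, random.Random], str],
              score: Callable[[str, str], float],
              n: int = 8,
              seed: int = 0) -> str:
    """Sample n candidates for the prompt and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates: List[str] = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


# Hypothetical stand-in for an LLM sampler (assumption, not from the paper).
def toy_generate(prompt: str, rng: random.Random) -> str:
    return rng.choice(["short answer",
                       "a longer, more detailed answer",
                       "refusal"])


# Hypothetical stand-in for a reward model: prefers longer text.
def toy_score(prompt: str, candidate: str) -> float:
    return float(len(candidate))


if __name__ == "__main__":
    print(best_of_n("example prompt", toy_generate, toy_score, n=8))
```

The selector can only choose among the candidates it is given, which is why the composition of the pool matters: if every candidate is the same, the scoring step has nothing to discriminate between.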
Where Pith is reading between the lines
- Safety evaluations for any TTS deployment should include controlled diversity-reduction tests rather than relying solely on direct adversarial prompts.
- Systems that combine TTS with downstream applications may need new selection mechanisms that explicitly preserve or restore candidate variety.
- The same diversity-reduction dynamic could appear in other multi-generation techniques beyond the two TTS strategies examined here.
Load-bearing premise
The RefDiv protocol reduces diversity without introducing separate factors that independently increase unsafe outputs.
What would settle it
Running the same TTS pipelines with RefDiv applied and observing no rise (or a drop) in unsafe output rates across multiple models and both Monte Carlo Tree Search and Best-of-N would falsify the central claim.
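The check described above can be sketched as a comparison of unsafe rates per (model, TTS strategy) cell. This is an illustration under stated assumptions: the cell names and labels in the usage note are hypothetical, the boolean labels are assumed to come from some external unsafe-output labeling step, and sampling noise is ignored, so a real test would also need a significance criterion.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (model name, TTS strategy)


def unsafe_rate(labels: List[bool]) -> float:
    """Fraction of outputs labeled unsafe (True)."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def rate_changes(baseline: Dict[Cell, List[bool]],
                 refdiv: Dict[Cell, List[bool]]) -> Dict[Cell, float]:
    """Unsafe rate with RefDiv minus unsafe rate without it, per cell."""
    return {cell: unsafe_rate(refdiv[cell]) - unsafe_rate(baseline[cell])
            for cell in baseline}


def no_rise_anywhere(changes: Dict[Cell, float]) -> bool:
    """True when no cell shows a rise in unsafe rate."""
    return all(delta <= 0 for delta in changes.values())
```

For example, with hypothetical labels `baseline = {("model_a", "best_of_n"): [False, False, True, False]}` and `refdiv = {("model_a", "best_of_n"): [True, False, True, False]}`, `rate_changes` reports a rise of 0.25 for that cell, so `no_rise_anywhere` returns `False`.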
Original abstract
Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across open-source models (e.g. Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), constraining diversity consistently signifies the rate at which TTS produces unsafe results. The effect is often stronger than that produced by prompts directly with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3-mini and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard), are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Test-Time Scaling (TTS) in LLMs carries an indirect safety risk: when candidate diversity is reduced, even modestly, the rate of unsafe outputs rises substantially. The authors introduce RefDiv, a reference-guided protocol that lowers measured diversity by conditioning generation on a reference response, and use it as a diagnostic to show elevated unsafe rates across open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and two TTS methods (MCTS and Best-of-N). The effect is reported to exceed that of direct high-adversarial-intent prompts, transfers to closed models (o3-mini, Gemini-2.5-Pro), and evades standard safety classifiers such as Llama-Guard.
Significance. If the central causal claim holds, the result identifies a previously unrecognized failure mode in TTS pipelines that are widely adopted for reasoning improvement. Demonstrating that diversity reduction reliably increases unsafe outputs, that the phenomenon is stronger than explicit adversarial prompting, and that it generalizes across model families and TTS strategies would have direct implications for safe deployment of scaled inference. The empirical breadth across multiple open models and the transfer result to closed models are strengths; the work also supplies a concrete diagnostic protocol that could be adopted for future safety evaluation.
major comments (2)
- [§3] §3 (RefDiv protocol description): The protocol reduces diversity by conditioning on a reference response, yet no ablation is presented that holds the reference fixed while varying only the diversity parameter (e.g., temperature or nucleus sampling on the identical reference). Without this control, it remains possible that the observed rise in unsafe outputs is driven by reference-induced semantic or safety-relevant shifts rather than by diversity reduction per se.
- [§4] §4 (experimental results): The manuscript states that RefDiv produces higher unsafe rates than direct adversarial prompts and that the effect is consistent across models and TTS strategies, but provides no quantitative metrics, statistical significance tests, or details on the exact procedure used to label outputs as unsafe. These omissions make it difficult to assess the magnitude and reliability of the reported effect sizes.
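The control requested in the first major comment varies sampling parameters while the reference stays fixed. As a rough illustration of why temperature acts as a diversity knob, the sketch below computes the Shannon entropy of a temperature-scaled softmax. The logits are hypothetical, nucleus sampling is not covered, and nothing here reflects the paper's actual setup.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits
    for t in (0.3, 0.7, 1.0, 1.5):
        print(t, round(entropy(softmax(logits, t)), 3))
```

Lower temperatures concentrate probability mass on the top-scoring token, so the entropy falls and sampled candidates become more alike. An ablation of the kind the referee asks for would sweep such a parameter and record the unsafe rate at each setting.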
minor comments (2)
- [Abstract] Abstract: the verb 'signifies' in 'constraining diversity consistently signifies the rate' is unclear; a more precise term such as 'increases' would improve readability.
- [§3] The paper should clarify how the diversity metric itself is computed and whether it is computed before or after reference conditioning.
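To make the second minor comment concrete, here is one possible diversity metric: the mean pairwise Jaccard distance between the token sets of the candidates. This is an assumption chosen for illustration. The review does not state which metric the paper uses, and whitespace tokenization is a simplification.

```python
from itertools import combinations
from typing import List, Set


def token_set(text: str) -> Set[str]:
    """Lowercased whitespace tokens (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over token sets; 0.0 if both texts are empty."""
    sa, sb = token_set(a), token_set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def mean_pairwise_distance(candidates: List[str]) -> float:
    """Average Jaccard distance over all unordered candidate pairs."""
    pairs = list(combinations(candidates, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

Applying `mean_pairwise_distance` to the pool sampled without the reference and again to the pool sampled with it would answer the "before or after" question directly, since the two numbers can then be reported side by side.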
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment below and have revised the manuscript to strengthen the presentation of our results.
- Referee: [§3] §3 (RefDiv protocol description): The protocol reduces diversity by conditioning on a reference response, yet no ablation is presented that holds the reference fixed while varying only the diversity parameter (e.g., temperature or nucleus sampling on the identical reference). Without this control, it remains possible that the observed rise in unsafe outputs is driven by reference-induced semantic or safety-relevant shifts rather than by diversity reduction per se.
  Authors: We agree that an explicit ablation isolating diversity while holding the reference fixed would more rigorously support the causal interpretation. In the revised manuscript we have added this control experiment in Section 3: a single reference response is generated once per prompt and then candidate pools are produced by varying only temperature and nucleus sampling parameters around that fixed reference. The results confirm that unsafe rates rise as diversity decreases even under this stricter control, indicating the effect is attributable to diversity reduction rather than reference-induced content shifts. New figures and tables documenting the ablation have been included.
  Revision: yes
- Referee: [§4] §4 (experimental results): The manuscript states that RefDiv produces higher unsafe rates than direct adversarial prompts and that the effect is consistent across models and TTS strategies, but provides no quantitative metrics, statistical significance tests, or details on the exact procedure used to label outputs as unsafe. These omissions make it difficult to assess the magnitude and reliability of the reported effect sizes.
  Authors: We thank the referee for highlighting these omissions. In the revised Section 4 we now report exact unsafe rates (with standard deviations) for every model–TTS combination, percentage increases relative to direct adversarial prompts, and statistical significance results (paired t-tests and Wilcoxon signed-rank tests with p-values). We have also added a dedicated subsection detailing the unsafe labeling procedure, which combines Llama-Guard-3 with manual verification on a 10% random sample (inter-annotator agreement reported). These additions allow readers to evaluate both the magnitude and reliability of the observed effects.
  Revision: yes
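Two of the quantities named in this response can be sketched with the standard library alone: the paired t-statistic and an inter-annotator agreement score. Cohen's kappa is an assumption here, since the response does not say which agreement measure is reported. The Wilcoxon signed-rank test and the p-value lookup are omitted, as they are better taken from a statistics library.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t = mean(d) / (sd(d) / sqrt(n)) for paired differences d = a - b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def cohens_kappa(r1: Sequence[int], r2: Sequence[int]) -> float:
    """Chance-corrected agreement between two annotators' labels."""
    if len(r1) != len(r2) or len(r1) == 0:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(r1)
    l1, l2 = list(r1), list(r2)
    observed = sum(x == y for x, y in zip(l1, l2)) / n
    labels = set(l1) | set(l2)
    expected = sum((l1.count(k) / n) * (l2.count(k) / n) for k in labels)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

In this setting, `a` and `b` would be per-cell unsafe rates with and without RefDiv, and `r1` and `r2` would be two annotators' safe/unsafe labels on the manually verified sample.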
Circularity Check
No circularity: empirical diagnostic protocol with external results
The paper introduces RefDiv as an external reference-guided protocol to reduce diversity and then reports experimental outcomes on unsafe TTS outputs across models and strategies. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the central claim to its own inputs by construction. The observed correlation between constrained diversity and higher unsafe rates is treated as an empirical finding rather than a tautological renaming or self-referential fit. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Diverse candidate pools in TTS enhance reliability and safety.
invented entities (1)
- RefDiv protocol: no independent evidence
Forward citations
Cited by 1 Pith paper
- Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
  A survey that taxonomizes threats to agentic AI, reviews benchmarks and evaluation methods, discusses technical and governance defenses, and identifies open challenges.