
arxiv: 2603.17305 · v2 · pith:F472M7BBnew · submitted 2026-03-18 · 💻 cs.AI · cs.CL · cs.LG

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Pith reviewed 2026-05-21 11:18 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords jailbreak defense · reasoning models · hidden representations · contrastive learning · reinforcement learning · safety alignment · red-teaming

The pith

CRAFT aligns reasoning models for safety by optimizing objectives directly over hidden-state representations, combining contrastive learning with reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CRAFT as a framework that improves robustness to jailbreak attacks by aligning the internal reasoning processes of large models rather than only their final outputs. It does this by applying contrastive representation learning to separate safe and unsafe reasoning trajectories in the hidden state space and then using reinforcement learning to reinforce the safe ones. A theoretical argument shows that adding latent-textual consistency to the GRPO policy optimization step removes superficial alignments as possible stable solutions. Experiments on Qwen3-4B-Thinking and R1-Distill-Llama-8B report large gains in both reasoning safety and final-response safety over base models and prior defenses. The approach matters because it targets the geometry of reasoning traces themselves instead of surface-level filters.

Core claim

CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust safety alignment; incorporating latent-textual consistency into GRPO rules out superficially aligned policies as local optima.

What carries the argument

Contrastive representation learning combined with reinforcement learning over hidden states to create a latent-space geometry separating safe and unsafe reasoning trajectories.
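
To make the idea of separating trajectories in hidden-state space concrete, here is a minimal sketch of a supervised contrastive loss over labeled vectors. It is an illustration only: the abstract does not give the paper's loss, so the function names, the cosine similarity, the temperature value, and the two-label setup are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def supervised_contrastive_loss(hidden, labels, temperature=0.1):
    """Generic supervised contrastive loss (assumed form, not the paper's).

    hidden: list of hidden-state vectors, one per reasoning trace.
    labels: list of labels such as "safe" / "unsafe", same length.
    Anchors with no same-label partner are skipped.
    """
    n = len(hidden)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarities to every other trace.
        logits = {j: cosine(hidden[i], hidden[j]) / temperature
                  for j in range(n) if j != i}
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits.values()))
        # Average negative log-probability of picking a same-label trace.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0
```

The loss is small when traces with the same label sit close together and far from traces with the other label, which is the geometric property the summary above attributes to CRAFT.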

If this is right

  • Reasoning safety improves by an average of 79.0 percent over base models across the evaluated safety benchmarks.
  • Final-response safety improves by an average of 87.7 percent over base models.
  • The method outperforms IPO and SafeKey on multiple safety benchmarks when applied to Qwen3-4B-Thinking and R1-Distill-Llama-8B.
  • Superficially aligned policies are eliminated because latent-textual consistency rules them out as local optima in the GRPO objective.
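
The last point can be illustrated with a toy sketch of group-relative advantages, the within-group reward normalization that GRPO uses. The consistency shaping below is an assumption made for illustration (a penalty on the gap between a textual safety score and a latent safety score); the abstract does not state the paper's actual term, and the names `consistency_shaped_rewards` and `weight` are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def consistency_shaped_rewards(text_safety, latent_safety, weight=1.0):
    """Hypothetical shaping: penalize disagreement between the textual
    safety score and the latent (hidden-state) safety score."""
    return [t - weight * abs(t - l) for t, l in zip(text_safety, latent_safety)]

# Two completions that both look safe in text; the second is only
# superficially safe because its latent score is low (made-up numbers).
text_safety = [1.0, 1.0]
latent_safety = [1.0, 0.2]

plain = group_relative_advantages(text_safety)
shaped = group_relative_advantages(
    consistency_shaped_rewards(text_safety, latent_safety)
)
```

With text-only rewards the two completions are indistinguishable and both advantages are zero. With the assumed shaping, the superficially safe completion receives a negative advantage, which is the intuition behind the claim, not a reproduction of the paper's proof.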

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hidden-state separation technique could be applied to alignment goals other than safety, such as reducing hallucination or enforcing step-by-step correctness.
  • Testing the learned latent geometry against attack types withheld from the contrastive training set would clarify how much of the reported robustness is attack-specific.
  • If the latent geometry proves stable, downstream systems could rely less on external output filters and more on internal reasoning constraints.
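
The held-out attack test suggested above could be organized as a leave-one-attack-family-out split. The sketch below assumes each example is a dict with an `attack` field; that record layout and the `hypothetical_attack` label are illustrative, while GPTFuzzer and AutoDAN are the two attacks named in the paper's Figure 4 caption.

```python
def leave_one_attack_out(examples):
    """Yield (held_out, train, test) folds, withholding one attack family
    from the training side of each fold."""
    families = sorted({ex["attack"] for ex in examples})
    for held_out in families:
        train = [ex for ex in examples if ex["attack"] != held_out]
        test = [ex for ex in examples if ex["attack"] == held_out]
        yield held_out, train, test

# Placeholder records; the prompts are elided on purpose.
examples = [
    {"attack": "GPTFuzzer", "prompt": "..."},
    {"attack": "GPTFuzzer", "prompt": "..."},
    {"attack": "AutoDAN", "prompt": "..."},
    {"attack": "hypothetical_attack", "prompt": "..."},
]

folds = list(leave_one_attack_out(examples))
```

Robustness measured on each fold's `test` side would then reflect transfer to an attack family the contrastive stage never saw.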

Load-bearing premise

Optimizing objectives defined over hidden states via contrastive representation learning and reinforcement learning will produce safety alignments that generalize beyond the tested benchmarks and attack types.

What would settle it

A new jailbreak attack that elicits unsafe final outputs from a CRAFT-aligned model by using a reasoning trajectory whose hidden-state representation was not separated from unsafe ones during training.

Figures

Figures reproduced from arXiv: 2603.17305 by Binghui Wang, Haozheng Luo, Jiahao Yu, Yan Chen, Yimin Wang.

Figure 1: Comparison of reasoning-level safety behaviors under a jailbreak prompt. The base model exhibits superficial safety alignment, where harmful expressions appear in reasoning despite a safe refusal. An aligned baseline reduces explicit toxicity but still retains risky reasoning patterns. CRAFT aligns reasoning at the latent level, avoiding explicit harmful expressions and guiding the reasoning process towar…
Figure 2: Latent separation of reasoning traces. Left: PCA projection of hidden states from DeepSeek-R1-Distill-Llama-8B reveals a clear geometric separation between safe and unsafe reasoning traces, with rethink traces forming a distinct transitional subspace. Right: PCA projection of hidden states from Qwen3-4B-Thinking exhibits the same separation pattern, indicating model-agnostic latent structure.
Figure 3: Overall pipeline of CRAFT. The framework integrates Latent Contrastive Learning for Reasoning (LCLR) with Reinforcement over Reasoning Latents (R2L). LCLR geometrically structures the latent space of reasoning traces by separating safe, unsafe, and rethink states into distinct regions, yielding a stable and interpretable safety representation. Building on this structure, R2L applies latent-aware reinforce…
Figure 4: CRAFT performance under advanced jailbreak attacks. We evaluate CRAFT against two strong jailbreak methods, GPTFuzzer and AutoDAN. Performance is measured using the StrongReject score on final responses and the safety rate of intermediate reasoning traces. Across settings, CRAFT achieves substantial safety improvements, with gains of 72.1% in reasoning-trace safety and 85.9% in final-response safety, demo…
Figure 5: Prompt used for reasoning-trace safety evaluation.
Original abstract

We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes CRAFT, a red-teaming alignment framework for large reasoning models that integrates contrastive representation learning with reinforcement learning over hidden states to separate safe and unsafe reasoning trajectories. It claims a theoretical result that adding latent-textual consistency to GRPO rules out superficially aligned policies as local optima, and reports average empirical gains of 79.0% in reasoning safety and 87.7% in final-response safety over base models (Qwen3-4B-Thinking and R1-Distill-Llama-8B), outperforming IPO and SafeKey on multiple safety benchmarks.

Significance. If the empirical results and theoretical claim hold under scrutiny, the work could meaningfully advance safety alignment by shifting focus from output-level defenses to latent-space geometry in reasoning models, offering a potential path to more robust generalization against jailbreaks.

major comments (3)
  1. [Abstract] The abstract states empirical gains and a theoretical result about GRPO local optima but supplies no experimental details, statistical tests, ablation studies, or derivation steps, so the support for the central claims cannot be assessed from the provided text.
  2. [Theoretical Analysis (inferred from abstract)] The theoretical claim that latent-textual consistency rules out superficial policies appears to rest on an external RL method (GRPO) rather than reducing directly to fitted quantities defined inside the paper; full equations and proof steps are required to verify internal consistency.
  3. [Empirical Evaluation (inferred from abstract)] No results address whether the induced latent geometry prevents new failure modes on out-of-distribution jailbreaks, preserves performance on non-safety reasoning tasks, or avoids reward hacking when the consistency term is removed, which is load-bearing for the generalization claim.
minor comments (2)
  1. [Abstract] Clarify the exact definition and computation of the 79.0% and 87.7% improvement metrics, including baselines and variance across runs.
  2. [Method (inferred)] Provide explicit notation for the contrastive loss and GRPO objective to make the integration with hidden representations reproducible.
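
The first minor comment matters because a figure such as "79.0% improvement" admits several readings. The sketch below contrasts three common definitions; the safety rates are hypothetical values chosen for illustration, not numbers from the paper, and nothing here indicates which definition the authors used.

```python
def absolute_gain(base_rate, aligned_rate):
    """Difference in safety rate, in the same units as the inputs."""
    return aligned_rate - base_rate

def relative_gain(base_rate, aligned_rate):
    """Gain expressed as a fraction of the base model's safety rate."""
    return (aligned_rate - base_rate) / base_rate

def unsafe_rate_reduction(base_rate, aligned_rate):
    """Fraction of the base model's unsafe outputs that were removed."""
    return (aligned_rate - base_rate) / (1.0 - base_rate)

# Hypothetical safety rates (fraction of safe outputs), for illustration only.
base_rate, aligned_rate = 0.40, 0.716

readings = {
    "absolute": absolute_gain(base_rate, aligned_rate),
    "relative": relative_gain(base_rate, aligned_rate),
    "unsafe_reduction": unsafe_rate_reduction(base_rate, aligned_rate),
}
```

For these made-up inputs the same pair of safety rates yields three noticeably different "improvement" figures, which is why the exact definition and baseline need to be stated.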

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the presentation of our theoretical and empirical contributions. We address each major comment below and commit to revisions that will make the support for our claims more explicit and verifiable.

Point-by-point responses
  1. Referee: [Abstract] The abstract states empirical gains and a theoretical result about GRPO local optima but supplies no experimental details, statistical tests, ablation studies, or derivation steps, so the support for the central claims cannot be assessed from the provided text.

    Authors: The abstract is intentionally concise and summarizes the key claims; the full manuscript contains the requested details. Section 4 describes the two base models, safety benchmarks, and evaluation protocol. Table 2 reports mean improvements with standard deviations and paired t-test p-values. Section 5.3 presents ablations on the contrastive loss and consistency term. The theoretical derivation appears in Section 3.2 with the complete proof in Appendix B. To address the concern directly, we will expand the abstract by one sentence referencing the evaluation models and the presence of ablations and statistical tests. revision: partial

  2. Referee: [Theoretical Analysis (inferred from abstract)] The theoretical claim that latent-textual consistency rules out superficial policies appears to rest on an external RL method (GRPO) rather than reducing directly to fitted quantities defined inside the paper; full equations and proof steps are required to verify internal consistency.

    Authors: Section 3 defines the latent-textual consistency term (Equation 5) as an additive regularizer on the GRPO advantage function. We prove that any policy achieving output-level alignment without separating safe/unsafe trajectories in hidden space receives a strictly lower expected advantage and is therefore not a local optimum. The proof reduces the modified objective to the fitted Q-values and value functions used by GRPO. The full derivation, including all intermediate steps and the list of assumptions, is already in Appendix B. We will move a self-contained proof sketch into the main text of Section 3 and add a short paragraph explicitly showing the reduction to the paper's own fitted quantities. revision: yes

  3. Referee: [Empirical Evaluation (inferred from abstract)] No results address whether the induced latent geometry prevents new failure modes on out-of-distribution jailbreaks, preserves performance on non-safety reasoning tasks, or avoids reward hacking when the consistency term is removed, which is load-bearing for the generalization claim.

    Authors: We agree these checks are necessary to substantiate the generalization claim. The current evaluation uses standard in-distribution safety benchmarks; we did not report OOD jailbreak transfer, capability preservation on math or coding tasks, or an explicit ablation that removes only the consistency term. We will add a new subsection (5.4) containing: (i) results on AdvBench and two additional OOD jailbreak suites, (ii) accuracy on GSM8K and HumanEval before and after CRAFT, and (iii) a controlled ablation that disables the consistency regularizer while keeping all other components fixed, with monitoring for reward hacking via hidden-state separation metrics. These experiments will be run on the same two base models. revision: yes
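
The "hidden-state separation metrics" mentioned in the third response are not defined in the text. One simple candidate, offered here as an assumption rather than the authors' metric, is the distance between class centroids divided by the average within-class spread.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def separation_score(safe, unsafe):
    """Between-centroid distance over mean within-class distance.

    Hypothetical monitoring metric: larger values mean the safe and
    unsafe hidden states form tighter, better separated clusters.
    """
    c_safe, c_unsafe = centroid(safe), centroid(unsafe)
    between = math.dist(c_safe, c_unsafe)
    within = (
        sum(math.dist(v, c_safe) for v in safe)
        + sum(math.dist(v, c_unsafe) for v in unsafe)
    ) / (len(safe) + len(unsafe))
    # Small constant avoids division by zero for perfectly tight clusters.
    return between / (within + 1e-12)
```

Tracking such a score across training checkpoints, with and without the consistency regularizer, is one way the proposed ablation could detect a policy whose outputs stay safe while its latent separation collapses.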

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper claims a theoretical result that adding latent-textual consistency to GRPO rules out superficial policies as local optima, alongside empirical gains from contrastive RL on hidden states. No provided equations or steps reduce the safety improvements, the elimination of policies, or the latent geometry directly to fitted inputs, self-defined quantities, or a self-citation chain by construction. GRPO is treated as an external RL method, the contrastive objectives are presented as a methodological integration rather than a renaming or ansatz smuggled from prior self-work, and the reported benchmark improvements stand as independent measurements. The derivation therefore remains self-contained against external benchmarks and does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all technical details are omitted.

pith-pipeline@v0.9.0 · 5728 in / 1146 out tokens · 56295 ms · 2026-05-21T11:18:44.605366+00:00 · methodology


