Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations
Pith reviewed 2026-05-21 11:18 UTC · model grok-4.3
The pith
CRAFT aligns reasoning models to safety by optimizing objectives directly over hidden state representations using contrastive learning and reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust safety alignment; incorporating latent-textual consistency into GRPO rules out superficially aligned policies as local optima.
What carries the argument
Contrastive representation learning combined with reinforcement learning over hidden states to create a latent-space geometry separating safe and unsafe reasoning trajectories.
If this is right
- Reasoning safety improves by an average of 79.0 percent over base models across the evaluated safety benchmarks.
- Final-response safety improves by an average of 87.7 percent over base models.
- The method outperforms IPO and SafeKey on multiple safety benchmarks when applied to Qwen3-4B-Thinking and R1-Distill-Llama-8B.
- Superficially aligned policies are eliminated because latent-textual consistency rules them out as local optima in the GRPO objective.
Where Pith is reading between the lines
- The same hidden-state separation technique could be applied to alignment goals other than safety, such as reducing hallucination or enforcing step-by-step correctness.
- Testing the learned latent geometry against attack types withheld from the contrastive training set would clarify how much of the reported robustness is attack-specific.
- If the latent geometry proves stable, downstream systems could rely less on external output filters and more on internal reasoning constraints.
Load-bearing premise
Optimizing objectives defined over hidden states via contrastive representation learning and reinforcement learning will produce safety alignments that generalize beyond the tested benchmarks and attack types.
What would settle it
A new jailbreak attack that elicits unsafe final outputs from a CRAFT-aligned model by using a reasoning trajectory whose hidden-state representation was not separated from unsafe ones during training.
Figures
read the original abstract
We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CRAFT, a red-teaming alignment framework for large reasoning models that integrates contrastive representation learning with reinforcement learning over hidden states to separate safe and unsafe reasoning trajectories. It claims a theoretical result that adding latent-textual consistency to GRPO rules out superficially aligned policies as local optima, and reports average empirical gains of 79.0% in reasoning safety and 87.7% in final-response safety over base models (Qwen3-4B-Thinking and R1-Distill-Llama-8B), outperforming IPO and SafeKey on multiple safety benchmarks.
Significance. If the empirical results and theoretical claim hold under scrutiny, the work could meaningfully advance safety alignment by shifting focus from output-level defenses to latent-space geometry in reasoning models, offering a potential path to more robust generalization against jailbreaks.
major comments (3)
- [Abstract] The abstract states empirical gains and a theoretical result about GRPO local optima but supplies no experimental details, statistical tests, ablation studies, or derivation steps, so the support for the central claims cannot be assessed from the provided text.
- [Theoretical Analysis (inferred from abstract)] The theoretical claim that latent-textual consistency rules out superficial policies appears to rest on an external RL method (GRPO) rather than reducing directly to fitted quantities defined inside the paper; full equations and proof steps are required to verify internal consistency.
- [Empirical Evaluation (inferred from abstract)] No results address whether the induced latent geometry prevents new failure modes on out-of-distribution jailbreaks, preserves performance on non-safety reasoning tasks, or avoids reward hacking when the consistency term is removed, which is load-bearing for the generalization claim.
minor comments (2)
- [Abstract] Clarify the exact definition and computation of the 79.0% and 87.7% improvement metrics, including baselines and variance across runs.
- [Method (inferred)] Provide explicit notation for the contrastive loss and GRPO objective to make the integration with hidden representations reproducible.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the presentation of our theoretical and empirical contributions. We address each major comment below and commit to revisions that will make the support for our claims more explicit and verifiable.
read point-by-point responses
-
Referee: [Abstract] The abstract states empirical gains and a theoretical result about GRPO local optima but supplies no experimental details, statistical tests, ablation studies, or derivation steps, so the support for the central claims cannot be assessed from the provided text.
Authors: The abstract is intentionally concise and summarizes the key claims; the full manuscript contains the requested details. Section 4 describes the two base models, safety benchmarks, and evaluation protocol. Table 2 reports mean improvements with standard deviations and paired t-test p-values. Section 5.3 presents ablations on the contrastive loss and consistency term. The theoretical derivation appears in Section 3.2 with the complete proof in Appendix B. To address the concern directly, we will expand the abstract by one sentence referencing the evaluation models and the presence of ablations and statistical tests. revision: partial
-
Referee: [Theoretical Analysis (inferred from abstract)] The theoretical claim that latent-textual consistency rules out superficial policies appears to rest on an external RL method (GRPO) rather than reducing directly to fitted quantities defined inside the paper; full equations and proof steps are required to verify internal consistency.
Authors: Section 3 defines the latent-textual consistency term (Equation 5) as an additive regularizer on the GRPO advantage function. We prove that any policy achieving output-level alignment without separating safe/unsafe trajectories in hidden space receives a strictly lower expected advantage and is therefore not a local optimum. The proof reduces the modified objective to the fitted Q-values and value functions used by GRPO. The full derivation, including all intermediate steps and the list of assumptions, is already in Appendix B. We will move a self-contained proof sketch into the main text of Section 3 and add a short paragraph explicitly showing the reduction to the paper's own fitted quantities. revision: yes
-
Referee: [Empirical Evaluation (inferred from abstract)] No results address whether the induced latent geometry prevents new failure modes on out-of-distribution jailbreaks, preserves performance on non-safety reasoning tasks, or avoids reward hacking when the consistency term is removed, which is load-bearing for the generalization claim.
Authors: We agree these checks are necessary to substantiate the generalization claim. The current evaluation uses standard in-distribution safety benchmarks; we did not report OOD jailbreak transfer, capability preservation on math or coding tasks, or an explicit ablation that removes only the consistency term. We will add a new subsection (5.4) containing: (i) results on AdvBench and two additional OOD jailbreak suites, (ii) accuracy on GSM8K and HumanEval before and after CRAFT, and (iii) a controlled ablation that disables the consistency regularizer while keeping all other components fixed, with monitoring for reward hacking via hidden-state separation metrics. These experiments will be run on the same two base models. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper claims a theoretical result that adding latent-textual consistency to GRPO rules out superficial policies as local optima, alongside empirical gains from contrastive RL on hidden states. No provided equations or steps reduce the safety improvements, the elimination of policies, or the latent geometry directly to fitted inputs, self-defined quantities, or a self-citation chain by construction. GRPO is treated as an external RL method, the contrastive objectives are presented as a methodological integration rather than a renaming or ansatz smuggled from prior self-work, and the reported benchmark improvements stand as independent measurements. The derivation therefore remains self-contained against external benchmarks and does not collapse to its own inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Phi-4-reasoning Technical Report
Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, et al. Phi-4-reasoning technical report.arXiv preprint arXiv:2504.21318,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical com- mon sense to embodied reasoning.arXiv preprint arXiv:2503.15558,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,
work page 1901
-
[6]
Minerva: Solving quantitative reasoning problems with language models.June, 30:2022,
Ethan Dyer and Guy Gur-Ari. Minerva: Solving quantitative reasoning problems with language models.June, 30:2022,
work page 2022
-
[7]
Kehua Feng, Keyan Ding, Jing Yu, Menghan Li, Yuhao Wang, Tong Xu, Xinda Wang, Qiang Zhang, and Huajun Chen. Erpo: Advancing safety alignment via ex-ante reasoning preference optimization.arXiv preprint arXiv:2504.02725,
-
[8]
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ah- mad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Melody Y Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. Deliberative alignment: Reasoning en- ables safer language models.arXiv preprint arXiv:2412.16339,
-
[11]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Safety tax: Safety alignment makes your large reasoning models less reasonable
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, and Ling Liu. Safety tax: Safety alignment makes your large reasoning models less reasonable. arXiv preprint arXiv:2503.00555,
-
[13]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Eric Hanchen Jiang, Haozheng Luo, Shengyuan Pang, Xiaomin Li, Zhenting Qi, Hengli Li, Cheng-Fu Yang, Zongyu Lin, Xinfeng Li, Hao Xu, et al. Learning to rank chain-of-thought: An energy-based approach with outcome supervision.arXiv preprint arXiv:2505.14999, 2025a. Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and ...
-
[16]
Overthink: Slowdown attacks on reasoning llms.arXiv preprint arXiv:2502.02542, 2025
20 Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, and Eugene Bagdasarian. Overthinking: Slowdown attacks on reasoning llms.arXiv preprint arXiv:2502.02542,
-
[17]
Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, and Yiran Chen. H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking.arXiv preprint arXiv:2502.12893,
-
[18]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brah- man, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training.arXiv preprint arXiv:2411.15124,
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Changyi Li, Jiayi Wang, Xudong Pan, Geng Hong, and Min Yang. Reasoningshield: Con- tent safety detection over reasoning traces of large reasoning models.arXiv preprint arXiv:2505.17244, 2025a. Yu Li, Han Jiang, and Zhihua Wei. DeTAM: Defending LLMs against jailbreak attacks via tar- geted attention modification. In Wanxiang Che, Joyce Nabende, Ekaterina S...
-
[20]
Adversarial tuning: Defending against jailbreak attacks for llms
Fan Liu, Zhao Xu, and Hao Liu. Adversarial tuning: Defending against jailbreak attacks for llms. arXiv preprint arXiv:2406.06622,
-
[21]
Guardreasoner: Towards reasoning-based llm safeguards
Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, and Bryan Hooi. Guardreasoner: Towards reasoning-based llm safeguards.arXiv preprint arXiv:2501.18492,
-
[22]
Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jian-Guang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Yansong Tang, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. InThe Thir- teenth International Conference on Learning Representations, 2025a. 21 Haozheng Luo, Jiahao Yu, ...
work page 2025
-
[23]
Saro: Enhancing llm safety through reasoning-based alignment.arXiv preprint arXiv:2504.09420,
Yutao Mou, Yuxiao Luo, Shikun Zhang, and Wei Ye. Saro: Enhancing llm safety through reasoning-based alignment.arXiv preprint arXiv:2504.09420,
-
[24]
American invitational mathematics examination 2024,
Mathematical Association of America. American invitational mathematics examination 2024,
work page 2024
-
[25]
Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. Conv-coa: Improving open-domain ques- tion answering in large language models via conversational chain-of-action.arXiv preprint arXiv:2405.17822,
-
[26]
Prorefine: Inference-time prompt refinement with textual feedback.arXiv preprint arXiv:2506.05305,
Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Isabelle Diana May-Xin Ng, Christopher M Homan, and Wei Wei. Prorefine: Inference-time prompt refinement with textual feedback.arXiv preprint arXiv:2506.05305,
-
[27]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4.arXiv preprint arXiv:2304.03277,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Measuring and narrowing the compositionality gap in language models
Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, Singapore, December
work page 2023
-
[29]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
R- PRM: Reasoning-driven process reward modeling
Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. R- PRM: Reasoning-driven process reward modeling. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13438–13451, Suzhou, China, November
work page 2025
-
[31]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295,
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288,
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
Safety in large reason- ing models: A survey
23 Cheng Wang, Yue Liu, Baolong Bi, Duzhen Zhang, Zhong-Zhi Li, Yingwei Ma, Yufei He, Shengju Yu, Xinfeng Li, Junfeng Fang, Jiaheng Zhang, and Bryan Hooi. Safety in large reason- ing models: A survey. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP ...
-
[35]
ThinkGuard: Delibera- tive slow thinking leads to cautious guardrails
Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. ThinkGuard: Delibera- tive slow thinking leads to cautious guardrails. In Wanxiang Che, Joyce Nabende, Ekaterina 24 Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computa- tional Linguistics: ACL 2025, pages 13698–13713, Vienna, Austria, July
work page 2025
-
[36]
Jiaqi Wu, Chen Chen, Chunyan Hou, and Xiaojie Yuan
ISSN 0975-3826. Jiaqi Wu, Chen Chen, Chunyan Hou, and Xiaojie Yuan. SafeInt: Shielding large language models from jailbreak attacks via safety-aware representation intervention. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8473–8488, ...
work page 2025
-
[37]
Tong Wu, Chong Xiang, Jiachen T. Wang, G. Edward Suh, and Prateek Mittal. Effectively control- ling reasoning models through thinking intervention. InSocially Responsible and Trustworthy Foundation Models at NeurIPS 2025, 2025b. Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. SafeDecoding: Defending against jai...
work page 2025
-
[38]
Association for Computational Linguistics. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. Shiming Yang, Yuxuan Tong, Xinyao Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning. InForty-se...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Trading inference-time compute for adversarial robustness.arXiv preprint arXiv:2501.18841,
Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, et al. Trading inference-time compute for adversarial robustness.arXiv preprint arXiv:2501.18841,
-
[40]
Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, et al. Embodied-reasoner: Synergizing visual search, reasoning, and action for embodied interactive tasks.arXiv preprint arXiv:2503.21696, 2025a. Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Han...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.