pith. machine review for the scientific record.

arxiv: 2412.16720 · v2 · submitted 2024-12-21 · 💻 cs.AI

Recognition: unknown

OpenAI o1 System Card

OpenAI: Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry
Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Bohan Zhang, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Chelsea Voss, Chen Shen, Chong Zhang, Chris Koch, Chris Orsinger, Christopher Hesse, Claudia Fischer, Clive Chan, Dan Roberts, Daniel Kappler, Daniel Levy, Daniel Selsam, David Dohan, David Farhi, David Mely, David Robinson, Dimitris Tsipras, Doug Li, Dragos Oprica, Eben Freeman, Eddie Zhang, Edmund Wong, Elizabeth Proehl, Enoch Cheung, Eric Mitchell, Eric Wallace, Erik Ritter, Evan Mays, Fan Wang, Felipe Petroski Such, Filippo Raso, Florencia Leoni, Foivos Tsimpourlas, Francis Song, Fred von Lohmann, Freddie Sulit, Geoff Salmon, Giambattista Parascandolo, Gildas Chabot, Grace Zhao, Greg Brockman, Guillaume Leclerc, Hadi Salman, Haiming Bao, Hao Sheng, Hart Andrin, Hessam Bagherinezhad, Hongyu Ren, Hunter Lightman, Hyung Won Chung, Ian Kivlichan, Ian O'Connell, Ian Osband, Ignasi Clavera Gilaberte, Ilge Akkaya, Ilya Kostrikov, Ilya Sutskever, Irina Kofman, Jakub Pachocki, James Lennon, Jason Wei, Jean Harb, Jerry Twore, Jiacheng Feng, Jiahui Yu, Jiayi Weng, Jie Tang, Jieqi Yu, Joaquin Quiñonero Candela, Joe Palermo, Joel Parish, Johannes Heidecke, John Hallman, John Rizzo, Jonathan Gordon, Jonathan Uesato, Jonathan Ward, Joost Huizinga, Julie Wang, Kai Chen, Kai Xiao, Karan Singhal, Karina Nguyen, Karl Cobbe, Katy Shi, Kayla Wood, Kendra Rimbach, Keren Gu-Lemberg, Kevin Liu, Kevin Lu, Kevin Stone, Kevin Yu, Lama Ahmad, Lauren Yang, Leo Liu, Leon Maksin, Leyton Ho, Liam Fedus, Lilian Weng, Linden Li, Lindsay McCallum, Lindsey Held, Lorenz Kuhn, Lukas Kondraciuk, Lukasz Kaiser, Luke Metz, Madelaine Boyd, Maja Trebacz, Manas Joglekar, Mark Chen, Marko Tintor, Mason Meyer, Matt Jones, Matt Kaufer, Max Schwarzer, Meghan Shah, Mehmet Yatbaz, Melody Y. Guan, Mengyuan Xu, Mengyuan Yan, Mia Glaese, Mianna Chen, Michael Lampe, Michael Malek, Michele Wang, Michelle Fradin, Mike McClay, Mikhail Pavlov, Miles Wang, Mingxuan Wang, Mira Murati, Mo Bavarian, Mostafa Rohaninejad, Nat McAleese, Neil Chowdhury, Nick Ryder, Nikolas Tezak, Noam Brown, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivia Watkins, Patrick Chao, Paul Ashbourne, Pavel Izmailov, Peter Zhokhov, Rachel Dias, Rahul Arora, Randall Lin, Rapha Gontijo Lopes, Raz Gaon, Reah Miyara, Reimar Leike, Renny Hwang, Rhythm Garg, Robin Brown, Roshan James, Rui Shu, Ryan Cheu, Ryan Greene, Saachi Jain, Sam Altman, Sam Toizer, Sam Toyer, Samuel Miserendino, Sandhini Agarwal, Santiago Hernandez, Sasha Baker, Scott McKinney, Scottie Yan, Shengjia Zhao, Shengli Hu, Shibani Santurkar, Shraman Ray Chaudhuri, Shuyuan Zhang, Siyuan Fu, Spencer Papay, Steph Lin, Suchir Balaji, Suvansh Sanjeev, Szymon Sidor, Tal Broda, Aidan Clark, Tao Wang, Taylor Gordon, Ted Sanders, Tejal Patwardhan, Thibault Sottiaux, Thomas Degry, Thomas Dimson, Tianhao Zheng, Timur Garipov, Tom Stasi, Trapit Bansal, Trevor Creech, Troy Peterson, Tyna Eloundou, Valerie Qi, Vineet Kosaraju, Vinnie Monaco, Vitchyr Pong, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wenting Zhan, Wes McCabe, Wojciech Zaremba, Yann Dubois, Yinghai Lu, Yining Chen, Young Cha, Yu Bai, Yuchen He, Yuchen Zhang, Yunyun Wang, Zheng Shao, Zhuohan Li
Authors on Pith: no claims yet
classification: 💻 cs.AI
keywords: models, safety, openai, alignment, chain, evaluations, potential, reason
0 comments
original abstract

The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought. These advanced reasoning capabilities provide new avenues for improving the safety and robustness of our models. In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment. This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence. Our results underscore the need for building robust alignment methods, extensively stress-testing their efficacy, and maintaining meticulous risk management protocols. This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
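The deliberative alignment described in the abstract amounts to making the safety policy visible inside the model's reasoning context, so the chain of thought can check the request against the policy before a final answer is produced. Below is a minimal sketch of that prompting pattern; the names SAFETY_POLICY and build_deliberative_prompt are hypothetical, the policy text is invented for illustration, and nothing here claims to reproduce OpenAI's actual training or inference pipeline.

# Minimal sketch of a deliberative-alignment-style prompt: the policy is
# placed in the reasoning context so the chain of thought can cite it.
# All names and the policy text are hypothetical illustrations.

SAFETY_POLICY = """\
1. Refuse requests for instructions that enable serious harm.
2. For dual-use topics, answer at a high level without operational detail.
3. Otherwise, answer helpfully and completely."""

def build_deliberative_prompt(user_request: str) -> list[dict]:
    """Compose chat messages that place the policy inside the reasoning context."""
    system = (
        "Before answering, reason step by step about whether the request "
        "complies with the policy below, then give a final answer.\n\n"
        "POLICY:\n" + SAFETY_POLICY
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_request},
    ]

if __name__ == "__main__":
    # Inspect the composed messages; any chat-completion API could consume them.
    for msg in build_deliberative_prompt("How do I pick a lock?"):
        print(f"[{msg['role']}] {msg['content'][:70]}...")

The only point of the sketch is that the policy precedes the chain of thought rather than being applied as a filter afterward, which is what lets the model reason about the policy in context.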

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

    cs.CL 2026-04 unverdicted novelty 8.0

    MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....

  2. RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

    cs.RO 2026-04 unverdicted novelty 8.0

    RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

  3. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  4. Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    HAM³ achieves up to 78.3% attack success rate on the GQA benchmark by hierarchically attacking perception, communication, and reasoning layers in multi-modal multi-agent systems.

  5. Formalize, Don't Optimize: The Heuristic Trap in LLM-Generated Combinatorial Solvers

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM-generated combinatorial solvers achieve highest correctness when the model formalizes problems for verified backends rather than attempting to optimize search, which often causes regressions.

  6. Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...

  7. Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

    cs.CL 2026-05 unverdicted novelty 7.0

    DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

  8. DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    DeepRefine refines agent-compiled knowledge bases via multi-turn abductive diagnosis and RL training with a GBD reward, yielding consistent downstream task gains.

  9. Relative Score Policy Optimization for Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

  10. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 7.0

    SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.

  11. V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

  12. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  13. Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

    cs.CV 2026-05 unverdicted novelty 7.0

    RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...

  14. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  15. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

  16. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  17. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  18. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE estimates value baselines for RL in LLMs from the actor's internal states via a lightweight probe and cross-rollout construction, matching DAPO performance with lower compute on math reasoning benchmarks.

  19. Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

    cs.LG 2026-05 unverdicted novelty 7.0

    POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

  20. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  21. Teaching Language Models to Think in Code

    cs.CL 2026-05 unverdicted novelty 7.0

    ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

  22. A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

    cs.LG 2026-05 unverdicted novelty 7.0

    Online mistake bounds for autoregressive output learning can grow logarithmically with generation horizon M under end-to-end feedback but become independent of M with chain-of-thought trajectory access.

  23. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  24. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  25. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  26. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  27. SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

    cs.CR 2026-05 unverdicted novelty 7.0

    SRTJ is a training-free jailbreak method that evolves hierarchical attack rules using iterative verifier feedback and ASP-based constraint-aware composition to achieve stable high success rates on HarmBench across mul...

  28. LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM+ASP framework enables task-agnostic nonmonotonic reasoning by having LLMs generate and self-correct ASP programs using solver feedback, outperforming SMT alternatives on diverse benchmarks.

  29. Investigating More Explainable and Partition-Free Compositionality Estimation for LLMs: A Rule-Generation Perspective

    cs.AI 2026-04 unverdicted novelty 7.0

    A rule-generation perspective lets LLMs write programs as rules for data mapping and applies complexity theory to estimate their compositionality, tested on string-to-grid tasks.

  30. When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

    cs.IR 2026-04 unverdicted novelty 7.0

    ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.

  31. An Empirical Study of Speculative Decoding on Software Engineering Tasks

    cs.SE 2026-04 unverdicted novelty 7.0

    Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.

  32. Improving Vision-language Models with Perception-centric Process Reward Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

  33. EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 7.0

    EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.

  34. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  35. TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics

    cs.AI 2026-04 conditional novelty 7.0

    TacticGen generates realistic, adaptable football tactics via a multi-agent diffusion transformer trained on 3.3M events and 100M frames, supporting rule-, language-, or model-based guidance at inference time.

  36. Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    Metacognitive Consolidation lets LLMs accumulate reusable meta-reasoning skills from past episodes to improve future performance across benchmarks.

  37. TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.

  38. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

    cs.AI 2026-04 unverdicted novelty 7.0

    IG-Search computes step-level information gain rewards from policy probabilities to improve credit assignment in RL training for search-augmented QA, yielding 1.6-point gains over trajectory-level baselines on multi-h...

  39. Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

    cs.LG 2026-04 unverdicted novelty 7.0

    NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

  40. Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

  41. AI Achieves a Perfect LSAT Score

    cs.AI 2026-04 unverdicted novelty 7.0

    Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.

  42. VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...

  43. Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives

    cs.CL 2026-04 unverdicted novelty 7.0

    Social dynamics in LLM collectives cause representative agents to make less accurate decisions as peer pressure increases through larger adversarial groups, more capable peers, longer arguments, and persuasive styles.

  44. Learning 3D Reconstruction with Priors in Test Time

    cs.CV 2026-04 unverdicted novelty 7.0

    Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.

  45. Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

    cs.LG 2026-04 unverdicted novelty 7.0

    RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

  46. Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.

  47. Think Anywhere in Code Generation

    cs.SE 2026-03 unverdicted novelty 7.0

    Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

  48. Internalized Reasoning for Long-Context Visual Document Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    A synthetic pipeline creates and internalizes reasoning traces in VLMs for long-context visual document understanding, with a 32B model surpassing a 235B model on MMLongBenchDoc and showing 12.4x fewer output tokens.

  49. LongTail Driving Scenarios with Reasoning Traces: The KITScenes LongTail Dataset

    cs.CV 2026-03 unverdicted novelty 7.0

    KITScenes LongTail supplies multimodal driving data and multilingual expert reasoning traces to benchmark models on rare scenarios beyond basic safety metrics.

  50. How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

    cs.CL 2026-03 conditional novelty 7.0

    TESSY creates stylistically consistent synthetic data via teacher-student token interleaving, yielding 11.25% and 6.68% gains on code benchmarks where pure teacher data causes 3.25% and 10.02% drops.

  51. What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

    cs.LG 2026-03 unverdicted novelty 7.0

    SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.

  52. MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    cs.AI 2025-05 unverdicted novelty 7.0

    MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

  53. GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    cs.CV 2025-04 unverdicted novelty 7.0

    GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...

  54. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  55. SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    cs.SE 2025-02 unverdicted novelty 7.0

    SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

  56. OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

    cs.AI 2026-05 unverdicted novelty 6.0

    OpenDeepThink improves LLM reasoning by ranking parallel candidate traces via Bradley-Terry aggregation of LLM pairwise judgments, achieving a +405 Codeforces Elo gain on Gemini 3.1 Pro after eight rounds.

  57. Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

    cs.LG 2026-05 conditional novelty 6.0

    ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 per...

  58. Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning

    stat.ML 2026-05 unverdicted novelty 6.0

    A conformal procedure for CoT replaces majority voting with weighted aggregation and calibrates abstention to guarantee low confident-error rates, achieving 90.1% selective accuracy on GSM8K by abstaining on under 5% ...

  59. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  60. STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

    cs.CL 2026-05 unverdicted novelty 6.0

    STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.