hub Mixed citations

Phi-4-reasoning Technical Report

Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen · 2025 · cs.AI · arXiv 2504.21318

Mixed citation behavior. Most common role is background (67%).

38 Pith papers citing it

Background: 67% of classified citations (4 of 6)


abstract

We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts (selected for the right level of complexity and diversity) and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as the DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning language models, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.


citation-role summary

background 4 · baseline 1 · method 1

citation-polarity summary

background 4 · baseline 1 · use method 1

representative citing papers

ThinkProbe: Beyond Accuracy -- Structural Profiling of Open-Ended LLM Reasoning Traces via Non-Generative Thought Graphs

cs.CL · 2026-06-27 · unverdicted · novelty 7.0

ThinkProbe builds non-generative Thought Graphs from 4200 LLM traces across 7 models and 200 questions to extract 5D cognitive profiles, finding model-level stability in reasoning structure that exceeds domain effects in four dimensions.

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

RealMath-Eval benchmark shows LLM judges have an evaluation gap, performing worse on diverse real human math reasoning than on synthetic solutions due to greater error diversity and higher surprisal.

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

IDPR is a response-conditioned inhibitory deliberation method that trains a controller on fast-slow outcome pairs to decide when to override LLM fast answers, improving accuracy from 47.90% to 48.92% with slow reasoning invoked on only 8.20% of a 5,000-example math test set.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

cs.LG · 2026-04-28 · unverdicted · novelty 7.0

VLM judges exhibit task-dependent uncertainty in their scores, with conformal prediction revealing wide intervals for complex tasks and a decoupling between good ranking performance and poor absolute scoring reliability.

When AI Models Become Dependencies: Studying the Evolution of Pre-Trained Model Reuse in Downstream Software Systems

cs.SE · 2026-04-20 · unverdicted · novelty 7.0

Pre-trained models are added late in projects, accumulate rather than get replaced, and change three times less often than libraries, with distinct documentation driven by capability needs and testing uncertainty.

Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

cs.CL · 2025-07-05 · conditional · novelty 7.0

Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

cs.AI · 2025-05-29 · unverdicted · novelty 7.0

MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

Benchmarking Large Language Models on Floating-Point Error Classification

cs.AI · 2026-06-30 · unverdicted · novelty 6.0

Introduces InterFLOPBench benchmark and evaluates 14 LLMs on multi-label classification of six floating-point error categories in C code, with top models exceeding 0.88 overall F1 but lower scores on subtle errors like underflow.

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

cs.CL · 2026-06-09 · conditional · novelty 6.0

CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.

Rethinking Molecular Text Representations for LLMs: An Empirical Study

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Structured text representations like CML and MolJSON outperform SMILES variants on structural tasks while IUPAC dominates semantic tasks such as molecule retrieval across all tested LLMs.

OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0 · 2 refs

OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

SCA applies the Information Bottleneck principle via NIBS and GIBS methods to identify erroneous steps in black-box LLM reasoning and boosts self-correction success by up to 13.5%.

TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.

Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.

Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

SeLaR: Selective Latent Reasoning in Large Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective

cs.LG · 2026-02-10 · unverdicted · novelty 6.0

Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.

Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

cs.CL · 2026-01-29 · unverdicted · novelty 6.0

RSE distills search trajectories into an experience bank for positive and negative recycling, yielding efficiency gains over independent sampling on math reasoning benchmarks.

The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

cs.LG · 2026-01-25 · unverdicted · novelty 6.0

TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

cs.AI · 2025-10-05 · unverdicted · novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

citing papers explorer

Showing 38 of 38 citing papers.

ThinkProbe: Beyond Accuracy -- Structural Profiling of Open-Ended LLM Reasoning Traces via Non-Generative Thought Graphs cs.CL · 2026-06-27 · unverdicted · none · ref 24 · internal anchor
ThinkProbe builds non-generative Thought Graphs from 4200 LLM traces across 7 models and 200 questions to extract 5D cognitive profiles, finding model-level stability in reasoning structure that exceeds domain effects in four dimensions.
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning cs.AI · 2026-06-08 · unverdicted · none · ref 30 · internal anchor
RealMath-Eval benchmark shows LLM judges have an evaluation gap, performing worse on diverse real human math reasoning than on synthetic solutions due to greater error diversity and higher surprisal.
When to Think Deeply: Inhibitory Deliberation for LLM Reasoning cs.CL · 2026-06-04 · unverdicted · none · ref 26 · internal anchor
IDPR is a response-conditioned inhibitory deliberation method that trains a controller on fast-slow outcome pairs to decide when to override LLM fast answers, improving accuracy from 47.90% to 48.92% with slow reasoning invoked on only 8.20% of a 5,000-example math test set.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 109 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 1 · internal anchor
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation cs.LG · 2026-04-28 · unverdicted · none · ref 1 · internal anchor
VLM judges exhibit task-dependent uncertainty in their scores, with conformal prediction revealing wide intervals for complex tasks and a decoupling between good ranking performance and poor absolute scoring reliability.
When AI Models Become Dependencies: Studying the Evolution of Pre-Trained Model Reuse in Downstream Software Systems cs.SE · 2026-04-20 · unverdicted · none · ref 37 · internal anchor
Pre-trained models are added late in projects, accumulate rather than get replaced, and change three times less often than libraries, with distinct documentation driven by capability needs and testing uncertainty.
Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models cs.CL · 2025-07-05 · conditional · none · ref 3 · internal anchor
Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.
MathArena: Evaluating LLMs on Uncontaminated Math Competitions cs.AI · 2025-05-29 · unverdicted · none · ref 1 · internal anchor
MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.
Benchmarking Large Language Models on Floating-Point Error Classification cs.AI · 2026-06-30 · unverdicted · none · ref 3 · internal anchor
Introduces InterFLOPBench benchmark and evaluates 14 LLMs on multi-label classification of six floating-point error categories in C code, with top models exceeding 0.88 overall F1 but lower scores on subtle errors like underflow.
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning cs.CL · 2026-06-16 · unverdicted · none · ref 4 · internal anchor
The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It cs.CL · 2026-06-09 · conditional · none · ref 4 · internal anchor
CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.
Rethinking Molecular Text Representations for LLMs: An Empirical Study cs.LG · 2026-06-02 · unverdicted · none · ref 77 · internal anchor
Structured text representations like CML and MolJSON outperform SMILES variants on structural tasks while IUPAC dominates semantic tasks such as molecule retrieval across all tested LLMs.
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning cs.LG · 2026-05-21 · unverdicted · none · ref 1 · 2 links · internal anchor
OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution cs.CL · 2026-05-19 · unverdicted · none · ref 59 · internal anchor
SCA applies the Information Bottleneck principle via NIBS and GIBS methods to identify erroneous steps in black-box LLM reasoning and boosts self-correction success by up to 13.5%.
TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction cs.AI · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 107 · internal anchor
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding cs.AI · 2026-05-04 · unverdicted · none · ref 57 · internal anchor
CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
SeLaR: Selective Latent Reasoning in Large Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 3 · internal anchor
SeLaR selectively applies latent soft reasoning in LLMs via entropy gating and contrastive regularization, outperforming standard CoT on five benchmarks without training.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning cs.CV · 2026-04-03 · unverdicted · none · ref 1 · internal anchor
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective cs.LG · 2026-02-10 · unverdicted · none · ref 1 · internal anchor
Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.
Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling cs.CL · 2026-01-29 · unverdicted · none · ref 1 · internal anchor
RSE distills search trajectories into an experience bank for positive and negative recycling, yielding efficiency gains over independent sampling on math reasoning benchmarks.
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning cs.LG · 2026-01-25 · unverdicted · none · ref 1 · internal anchor
TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation cs.AI · 2025-10-05 · unverdicted · none · ref 96 · internal anchor
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
Trust Region On-Policy Distillation cs.LG · 2026-05-31 · unverdicted · none · ref 235 · internal anchor
TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.
TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning cs.IR · 2026-05-12 · unverdicted · none · ref 7 · internal anchor
TwiSTAR learns to switch between fast SID retrieval and slow rationale-generating reasoning in generative recommendation, yielding better accuracy-latency trade-offs on three datasets.
Chain-of-Thought Reasoning Enhances In-Context Learning for LLM-Based Mobile Traffic Prediction cs.NI · 2026-05-10 · unverdicted · none · ref 37 · internal anchor
Chain-of-thought reasoning with plan-based demonstrations and similarity retrieval improves LLM mobile traffic prediction accuracy by up to 15% over standard in-context learning on real 5G data.
Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs cs.CL · 2026-05-01 · unverdicted · none · ref 97 · 2 links · internal anchor
MathArena is broadened into a maintained platform with new benchmarks for proofs, research questions, and formal verification, where GPT-5.5 scores 98% on 2026 USAMO and 74% on research-level tasks.
Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations cs.AI · 2026-03-18 · unverdicted · none · ref 1 · internal anchor
CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.
Ranking Reasoning LLMs under Test-Time Scaling cs.LG · 2026-03-11 · accept · none · ref 1 · internal anchor
Many established statistical ranking techniques produce orderings of reasoning LLMs under test-time scaling that closely match a Bayesian gold standard, with mean Kendall tau_b of 0.93-0.95 at full trials and best methods reaching 0.86 at single trials.
Visual Reasoning Agent: Robust Vision Systems in Remote Sensing via Inference-Time Scaling cs.CV · 2025-09-19 · unverdicted · none · ref 13 · internal anchor
VRA is a training-free agentic framework that orchestrates off-the-shelf LVLMs with a reasoning model via iterative verification and refinement, raising accuracy on remote sensing VQA from 52.8% to 78.8% and delivering up to 40.67% gains on hard question types.
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing cs.LG · 2025-07-29 · unverdicted · none · ref 3 · internal anchor
ReasonCache reuses similar KV cache states across reasoning steps in LRMs via collaborative filtering to boost serving throughput by up to 89.2% while preserving accuracy.
Unified Deployment-Aware Evaluation of Open Reasoning Language Models cs.CL · 2026-04-08 · unverdicted · none · ref 2 · 2 links · internal anchor
A controlled multi-model evaluation on shared data subsets shows that deployment metrics and prompting choices create important tradeoffs and alter model rankings beyond accuracy alone.
XekRung Technical Report cs.CR · 2026-04-30 · unverdicted · none · ref 17 · internal anchor
XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 2 · internal anchor
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision? cs.SE · 2026-04-09 · unreviewed · ref 1 · internal anchor
Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment cs.CL · 2026-01-20 · unreviewed · ref 1 · internal anchor
Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unreviewed · ref 170 · internal anchor
