Phi-4 Technical Report
Recognition: 2 theorem links · Lean theorems
Pith reviewed 2026-05-11 06:27 UTC · model grok-4.3
The pith
A 14-billion parameter model surpasses its teacher on STEM question answering by prioritizing high-quality synthetic data throughout training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Despite only minimal changes to the architecture used in the preceding version of the series, phi-4 achieves strong performance relative to its size on reasoning-focused benchmarks. The gains arise from a training process that strategically blends synthetic data into pre-training and post-training, combined with an improved curriculum and innovations in the post-training scheme, allowing results that exceed those of the teacher model on STEM QA.
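The contrast the claim draws, between distilling a teacher's output distribution and training on curated synthetic text, can be sketched as two loss functions. This is an illustrative toy, not the paper's training code; the vocabulary, probabilities, and function names are invented for the example.

```python
import math

def distillation_loss(student_logprobs, teacher_probs):
    """Token-level distillation: cross-entropy of the student against the
    teacher's full output distribution (forward KL up to a constant).
    The student can at best match the teacher's distribution."""
    return -sum(p * lp for p, lp in zip(teacher_probs, student_logprobs))

def synthetic_data_loss(student_logprobs, target_token):
    """Training on synthetic data: ordinary cross-entropy against a single
    token that a generation pipeline emitted. Because the target text can
    be filtered, rewritten, or verified before training, this is where
    quality control enters and where gains beyond distillation can arise."""
    return -student_logprobs[target_token]

# Toy 3-token vocabulary; the numbers below are illustrative only.
teacher_probs = [0.7, 0.2, 0.1]
student_logprobs = [math.log(0.6), math.log(0.3), math.log(0.1)]

print(distillation_loss(student_logprobs, teacher_probs))
print(synthetic_data_loss(student_logprobs, target_token=0))
```

The design point the abstract gestures at is that the second objective is only as good as the synthetic targets, which is why the recipe centers on data quality rather than on matching the teacher.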
What carries the argument
A training recipe centered on data quality that integrates synthetic data across stages and refines the post-training scheme.
If this is right
- Smaller models can exceed teacher performance on reasoning tasks when data quality receives central attention.
- Synthetic data generation techniques can create capabilities that go beyond simple distillation from a larger model.
- Post-training innovations play a decisive role in converting data improvements into measurable gains on technical benchmarks.
- Reasoning-focused performance scales with curriculum design even when architecture changes remain minimal.
Where Pith is reading between the lines
- The same data-quality emphasis could be tested on domains outside STEM to check whether efficiency gains appear more broadly.
- Future work might explore whether the approach reduces the need for ever-larger organic datasets collected from public sources.
- If the pattern holds, training recipes could shift priority from model scale toward controlled data synthesis for specialized tasks.
Load-bearing premise
The chosen benchmarks and evaluation protocol reflect genuine STEM reasoning ability rather than performance shaped by the synthetic data distribution or test format.
What would settle it
If phi-4 scored lower than its teacher model on independent STEM questions created without reference to the synthetic training distribution, the claim of improvement beyond distillation would be falsified.
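Such a settling experiment would amount to a paired comparison of the two models on freshly authored questions. A minimal sketch, assuming per-question correctness labels for both models; the data and function names here are hypothetical, not from the paper.

```python
from math import comb

def sign_test_p(phi4_correct, teacher_correct):
    """One-sided sign test on paired per-question outcomes.

    Counts questions where exactly one model is correct (discordant pairs)
    and asks how surprising it would be, under equal ability, for phi-4 to
    win this few of them. A small p-value with the teacher ahead would
    support the falsification described above."""
    phi4_only = sum(bool(p) and not t for p, t in zip(phi4_correct, teacher_correct))
    teacher_only = sum(bool(t) and not p for p, t in zip(phi4_correct, teacher_correct))
    n = phi4_only + teacher_only
    if n == 0:
        return 1.0
    # P(phi-4 wins <= phi4_only of n discordant pairs | fair coin)
    return sum(comb(n, k) for k in range(phi4_only + 1)) / 2 ** n

# Hypothetical outcomes on 10 independently authored STEM questions.
phi4 =    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
teacher = [1, 1, 1, 1, 1, 0, 0, 1, 1, 1]
print(sign_test_p(phi4, teacher))
```

Pairing on identical questions matters because it removes question-difficulty variance that an unpaired accuracy comparison would absorb.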
Original abstract
We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Phi-4, a 14-billion parameter language model whose training recipe centers on data quality through strategic incorporation of synthetic data at all stages. It claims that, in contrast to prior Phi models that primarily distill from GPT-4, Phi-4 substantially surpasses the teacher on STEM-focused QA tasks, thereby demonstrating that the data-generation and post-training pipeline achieves capabilities beyond distillation. With only minimal architectural changes from Phi-3, the model attains strong results on reasoning benchmarks.
Significance. If the performance claims are substantiated with rigorous evidence, the work would be significant for showing that high-quality synthetic data combined with targeted post-training can allow a smaller model to exceed its larger teacher on reasoning tasks. This would support more efficient development paths that rely less on massive organic web-scale corpora.
Major comments (2)
- Abstract: The central claim that Phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities is load-bearing for the paper's contribution, yet the abstract supplies no benchmark scores, ablation results, data-composition statistics, or evaluation-protocol details to support the assertion.
- Evaluation section: To establish that observed gains reflect genuine reasoning improvements rather than synthetic-data alignment or format overfitting, the manuscript must report contamination checks, exact benchmark-overlap analysis, and identical zero-shot/few-shot protocols for Phi-4 versus the teacher model; without these the outperformance claim remains vulnerable to the artifact explanation raised in the stress-test note.
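The contamination check requested here is commonly implemented as token n-gram overlap between training documents and benchmark items. A minimal sketch under that assumption; the 13-gram window and helper names are illustrative choices, not the paper's actual protocol.

```python
def ngrams(text, n=13):
    """Token n-grams of a text; 13-grams are a common decontamination unit."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(train_doc, benchmark_item, n=13):
    """Flag a benchmark item if any token n-gram also appears in a training
    document. Real pipelines add normalization, hashing, and fuzzy matching;
    this shows only the core overlap test."""
    return bool(ngrams(train_doc, n) & ngrams(benchmark_item, n))

doc = "the quick brown fox jumps over the lazy dog near the river bank today"
item = "quick brown fox jumps over the lazy dog near the river bank today again"
print(contaminated(doc, item))  # the two texts share a 13-token span
```

Reporting the fraction of benchmark items flagged by such a test, for both phi-4's training corpus and the teacher's where possible, is what would separate genuine reasoning gains from leakage.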
Minor comments (1)
- Abstract: Adding at least one concrete benchmark number (e.g., a STEM QA accuracy) would immediately strengthen readability and allow readers to gauge the scale of the reported improvement.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the Phi-4 technical report. We have reviewed the major comments carefully and provide point-by-point responses below. We agree with the need for greater specificity in the abstract and evaluation details and will revise the manuscript accordingly to strengthen the presentation of our claims.
Point-by-point responses
-
Referee: Abstract: The central claim that Phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities is load-bearing for the paper's contribution, yet the abstract supplies no benchmark scores, ablation results, data-composition statistics, or evaluation-protocol details to support the assertion.
Authors: We agree that the abstract would be strengthened by including concrete support for the central claim. In the revised manuscript, we will update the abstract to include specific benchmark scores on STEM QA tasks where phi-4 outperforms the teacher model (GPT-4), along with brief references to the synthetic data incorporation and post-training innovations. Full ablation studies and data-composition statistics are already provided in Sections 3 and 4 of the paper; we will add concise highlights of these to the abstract to better substantiate the claim while preserving its high-level summary style. (Revision: yes)
-
Referee: Evaluation section: To establish that observed gains reflect genuine reasoning improvements rather than synthetic-data alignment or format overfitting, the manuscript must report contamination checks, exact benchmark-overlap analysis, and identical zero-shot/few-shot protocols for Phi-4 versus the teacher model; without these the outperformance claim remains vulnerable to the artifact explanation raised in the stress-test note.
Authors: We acknowledge the value of explicitly ruling out evaluation artifacts. The manuscript already describes the evaluation protocols, including consistent zero-shot settings for model comparisons. To directly address concerns about synthetic-data alignment or format overfitting, we will expand the evaluation section in the revision to include detailed contamination checks, exact benchmark-overlap analysis demonstrating minimal leakage, and explicit confirmation of identical protocols versus the teacher model. These additions will clarify that performance gains arise from our data quality and training curriculum rather than artifacts. We are prepared to incorporate any additional specifics from the stress-test note if provided. (Revision: yes)
Circularity Check
No circularity: empirical benchmark claims rest on external comparisons with no self-referential derivations or reductions.
Full rationale
The paper is an empirical technical report describing the training of Phi-4 using synthetic data and post-training innovations. Its central claim—that the model surpasses its teacher (GPT-4) on STEM QA, evidencing techniques beyond distillation—is supported by benchmark performance numbers rather than any mathematical derivation, equation, or fitted parameter renamed as a prediction. No self-definitional loops, ansatzes smuggled via self-citation, or uniqueness theorems appear in the provided abstract or described content. The training process and evaluation protocol are presented as independent steps whose outputs (benchmark scores) are measured against external standards, not constructed tautologically from the inputs. This is a standard model release paper whose logic chain is the sequence of data curation, training, and testing, with no reduction of results to their own premises by construction.
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · match: unclear · "phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation."
-
IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · match: unclear · "synthetic data throughout the training process... improved data, training curriculum, and innovations in the post-training scheme."
Forward citations
Cited by 60 Pith papers
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
-
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
-
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
CEAEval is a context-aware evaluation system for speech expressive appropriateness, supported by a new Mandarin dataset with multi-dimensional human annotations and a model that outperforms prior systems.
-
When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models
Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.
-
DataDignity: Training Data Attribution for Large Language Models
ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
-
Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs
Misrouter enables input-only attacks on MoE LLMs by optimizing queries on open-source surrogates to route toward weakly aligned experts and transferring them to public APIs.
-
Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
Coral cuts multi-LLM serving costs by up to 2.79x and raises goodput by up to 2.39x on heterogeneous GPUs through adaptive joint optimization and a lossless two-stage decomposition that solves quickly.
-
A foundation model of vision, audition, and language for in-silico neuroscience
TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.
-
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
-
VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection
VoxAfford fuses multi-scale voxel features into MLLM output tokens using cross-attention with a learned compatibility gate to achieve SOTA open-vocabulary 3D affordance detection with ~8% mIoU gain and zero-shot robot...
-
SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass
SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
-
Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression
Auto-FlexSwitch achieves efficient dynamic model merging by decomposing task vectors into sparse masks, signs, and scalars, then making the compression learnable via gating and adaptive bit selection with KNN-based retrieval.
-
Evaluating Temporal Consistency in Multi-Turn Language Models
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
-
MathDuels: Evaluating LLMs as Problem Posers and Solvers
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
-
MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
MORPHOGEN is a new multilingual benchmark for testing LLMs on gender-aware morphological generation via rewriting first-person sentences to the opposite gender in French, Arabic, and Hindi.
-
Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning
Legal2LogicICL improves accuracy and generalization when mapping legal cases to logical formulas by retrieving balanced diverse exemplars at semantic and structural levels, backed by the new Legal2Proleg dataset.
-
RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
RemoteAgent uses RL fine-tuning on VagueEO to align MLLMs for vague EO intent recognition, handling simple tasks internally and routing dense predictions to tools via Model Context Protocol.
-
Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software
LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with co...
-
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
-
The limits of bio-molecular modeling with large language models : a cross-scale evaluation
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
-
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
-
LIFT: Last-Mile Fine-Tuning for Table Explicitation
LIFT pairs a pre-trained LLM for initial table extraction with a fine-tuned SLM for error repair, matching end-to-end SLM fine-tuning on TEDS while needing only 1,000 examples and gaining robustness.
-
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
-
Causal Bias Detection in Generative Artifical Intelligence
A causal framework unifies fairness analysis across generative AI and standard ML by deriving decompositions that separate biases along causal pathways and differences between real-world and model mechanisms.
-
LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
LegalCiteBench reveals that current LLMs achieve under 7% accuracy on closed-book legal citation retrieval and completion tasks, with misleading answer rates above 94% for nearly all tested models.
-
NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics
VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.
-
Skill Neologisms: Towards Skill-based Continual Learning
Skill neologisms are optimized soft tokens that improve LLM performance on targeted skills without weight updates and allow zero-shot composition for continual learning.
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
How Compliant Are GitHub Actions Workflows? A Checklist-Based Study with LLM-Assisted Auditing
GitHub Actions workflows achieve only 28% overall compliance with best practices, with LLMs enabling an 81% reduction in verification effort via hybrid adjudication but still requiring expert oversight for security judgments.
-
What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.
-
AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
LLM-ReSum uses LLM self-evaluation in a closed feedback loop to refine summaries, improving factual accuracy by up to 33% and coverage by 39% with 89% human preference.
-
Estimating Tail Risks in Language Model Output Distributions
Importance sampling with unsafe model variants estimates tail probabilities of harmful language model outputs using 10-20x fewer samples than brute-force Monte Carlo.
-
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific ...
-
Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs
Each tested LLM shows its own characteristic unreliability when engaging in repair during extended math-question dialogues.
-
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
Expressing Social Emotions: Misalignment Between LLMs and Human Cultural Emotion Norms
Frontier LLMs over-express engaging emotions relative to disengaging ones and generate deterministic responses that fail to match the cultural and individual diversity observed in human social emotion expression.
-
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems
SocialGrid benchmark shows even top LLMs achieve below 60% in embodied planning and task completion, with deception detection near random chance regardless of model scale.
-
How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models
LLMs perform substantially better as pragmatic listeners judging language than as speakers generating it, revealing weak alignment between the two roles.
-
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...
-
Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization
Parameter-efficient fine-tuning lets MLLMs serve as effective retrievers for natural-language-guided cross-view geo-localization, beating dual-encoder baselines on GeoText-1652 and CVG-Text while using far fewer train...
-
BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
BadSkill poisons embedded models in agent skills to achieve up to 99.5% attack success rate on triggered tasks with only 3% poison rate while preserving normal behavior on non-trigger inputs.
-
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
Blink enables CPU-free LLM inference via SmartNIC offload and persistent GPU kernel, delivering up to 8.47x lower P99 TTFT, 3.4x lower P99 TPOT, 2.1x higher decode throughput, and 48.6% lower energy per token while re...
-
LLMs Get Lost In Multi-Turn Conversation
LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.
-
Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness
Phi-4 and Gemma-2-9B maintain high intra-model consistency (ICC > 0.89) and ASR robustness for HADS scoring while Llama-3.1-8B degrades sharply, with all models showing score-evidence dissociation.
-
Diversity in Large Language Models under Supervised Fine-Tuning
Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
-
Emotion-Aware Clickbait Attack in Social Media
Emotion-aware stylistic rewrites of headlines using VAD space and LLMs increase misclassification rates of clickbait detectors by up to 30.63%.
-
Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at 23.7% median violation versus 72.8% for open-weight models, indicating they are not y...
-
Retrieval-Guided Generation for Safer Histopathology Image Captioning
Retrieval-guided captioning from similar cases achieves higher semantic alignment (cosine similarity ~0.60 vs ~0.47) and fewer unsupported diagnoses than MedGemma on the ARCH dataset.
-
Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows
Cooperative profiles from behavioral economics games predict LLM team performance in AI-for-science workflows.
-
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
-
Generalization Boundaries of Fine-Tuned Small Language Models for Graph Structural Inference
Fine-tuned small language models (3-4B parameters) preserve ordinal consistency in ranking graph structural properties for graphs larger than training data and from held-out families, showing architecture-specific deg...
-
A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing
The UPDP pipeline filters privacy terms and generates de-identified radiology images that preserve diagnostic pathology information, enabling models with competitive disease detection accuracy but reduced identity lea...
-
SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs
SUMMIR is a multimetric ranking model that orders LLM-generated sports insights by importance while incorporating hallucination detection to improve factual reliability across cricket, soccer, basketball, and baseball...
-
Qwen3 Technical Report
Pith review generated a malformed one-line summary.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.