hub Mixed citations

Textbooks Are All You Need II: phi-1.5 technical report

· 2023 · cs.CL · arXiv 2309.05463

Mixed citation behavior. Most common role is background (64%).

40 Pith papers citing it

Background 64% of classified citations

open full Pith review browse 40 citing papers arXiv PDF

abstract

We continue the investigation into the power of smaller Transformer-based language models as initiated by \textbf{TinyStories} -- a 10 million parameter model that can produce coherent English -- and the follow-up work on \textbf{phi-1}, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate ``textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the ``Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named \textbf{phi-1.5}, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, \textbf{phi-1.5} exhibits many of the traits of much larger LLMs, both good -- such as the ability to ``think step by step" or perform some rudimentary in-context learning -- and bad, including hallucinations and the potential for toxic and biased generations -- encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source \textbf{phi-1.5} to promote further research on these urgent topics.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 method 4 dataset 1

citation-polarity summary

background 9 use method 4 use dataset 1

representative citing papers

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Fine-Tuning Small Reasoning Models for Quantum Field Theory

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

cs.CV · 2026-02-28 · unverdicted · novelty 7.0

Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.

UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

cs.CV · 2025-11-28 · conditional · novelty 7.0

UniGeoSeg releases the first million-scale dataset for instruction-driven remote sensing segmentation and a unified model that achieves state-of-the-art results with strong zero-shot generalization.

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

Hyperfitting improves LLM generation via context-dependent rank reordering from geometric expansion in the terminal transformer block, distinct from temperature scaling, and enables efficient Late-Stage LoRA fine-tuning.

DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

DualOptim+ introduces base and delta optimizer states that adaptively bridge shared and decoupled components based on gradient directional conflicts to improve trade-offs in LLM machine unlearning.

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

TokenUnlearn identifies critical tokens via masking and entropy signals then applies hard selection or soft weighting to unlearn only those tokens, yielding better forgetting and retained utility than sequence-level baselines on TOFU and WMDP.

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

cs.SE · 2026-04-27 · unverdicted · novelty 6.0

Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 disciplines.

CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

cs.CV · 2026-04-24 · unverdicted · novelty 6.0

CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.

Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.

GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization

cs.DB · 2026-04-09 · unverdicted · novelty 6.0

GRACE dynamically constructs and updates coresets for LLM training using representation diversity, gradient-based importance, and k-NN graph propagation to improve efficiency and performance.

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose generative knowledge for discriminative tasks.

ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

cs.RO · 2026-02-09 · unverdicted · novelty 6.0

R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.

Improved Evidence Extraction and Metrics for Document Inconsistency Detection with LLMs

cs.CL · 2026-01-06 · unverdicted · novelty 6.0

New evidence-extraction metrics and a redact-and-retry framework with constrained filtering substantially improve LLM performance on document inconsistency detection, supported by experiments on a released semi-synthetic dataset.

Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation

cs.SE · 2025-06-04 · accept · novelty 6.0

Cross-lingual RACG shows non-trivial but unequal knowledge transfer across 13 programming languages, depending on linguistic affinity and pretraining diversity, with limited reliance on natural language information when using code-specific retrievers.

LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning

cs.LG · 2025-05-22 · conditional · novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.

GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization

cs.CL · 2024-10-31 · unverdicted · novelty 6.0

GigaCheck detects LLM-generated text at both document and span levels by combining fine-tuned language-model embeddings with a DETR-like architecture that treats generated intervals as detectable objects.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

OpenVLA: An Open-Source Vision-Language-Action Model

cs.RO · 2024-06-13 · unverdicted · novelty 6.0

OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

cs.CL · 2024-04-22 · accept · novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

citing papers explorer

Showing 40 of 40 citing papers.

ORPO: Monolithic Preference Optimization without Reference Model cs.CL · 2024-03-12 · conditional · none · ref 33 · internal anchor
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Fine-Tuning Small Reasoning Models for Quantum Field Theory cs.LG · 2026-04-21 · unverdicted · none · ref 36 · internal anchor
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding cs.CV · 2026-02-28 · unverdicted · none · ref 27 · internal anchor
Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes cs.CV · 2025-11-28 · conditional · none · ref 31 · internal anchor
UniGeoSeg releases the first million-scale dataset for instruction-driven remote sensing segmentation and a unified model that achieves state-of-the-art results with strong zero-shot generalization.
Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion cs.CL · 2026-05-21 · unverdicted · none · ref 7 · internal anchor
Hyperfitting improves LLM generation via context-dependent rank reordering from geometric expansion in the terminal transformer block, distinct from temperature scaling, and enables efficient Late-Stage LoRA fine-tuning.
DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models cs.LG · 2026-05-20 · unverdicted · none · ref 3 · internal anchor
DualOptim+ introduces base and delta optimizer states that adaptively bridge shared and decoupled components based on gradient directional conflicts to improve trade-offs in LLM machine unlearning.
Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning cs.LG · 2026-05-16 · unverdicted · none · ref 47 · internal anchor
Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.
ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 48 · internal anchor
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning cs.CL · 2026-05-01 · unverdicted · none · ref 9 · internal anchor
TokenUnlearn identifies critical tokens via masking and entropy signals then applies hard selection or soft weighting to unlearn only those tokens, yielding better forgetting and retained utility than sequence-level baselines on TOFU and WMDP.
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora cs.SE · 2026-04-27 · unverdicted · none · ref 27 · internal anchor
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 disciplines.
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging cs.CV · 2026-04-24 · unverdicted · none · ref 21 · internal anchor
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs cs.CV · 2026-04-23 · unverdicted · none · ref 51 · internal anchor
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
GRACE: A Dynamic Coreset Selection Framework for Large Language Model Optimization cs.DB · 2026-04-09 · unverdicted · none · ref 46 · internal anchor
GRACE dynamically constructs and updates coresets for LLM training using representation diversity, gradient-based importance, and k-NN graph propagation to improve efficiency and performance.
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator cs.CV · 2026-04-09 · unverdicted · none · ref 21 · internal anchor
Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose generative knowledge for discriminative tasks.
ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving cs.CV · 2026-04-03 · unverdicted · none · ref 30 · internal anchor
ExploreVLA augments VLA driving models with future RGB and depth prediction for dense supervision and uses prediction uncertainty as a safety-gated intrinsic reward for RL-based exploration, reaching SOTA PDMS 93.7 on NAVSIM.
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning cs.RO · 2026-02-09 · unverdicted · none · ref 57 · internal anchor
R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
Improved Evidence Extraction and Metrics for Document Inconsistency Detection with LLMs cs.CL · 2026-01-06 · unverdicted · none · ref 3 · internal anchor
New evidence-extraction metrics and a redact-and-retry framework with constrained filtering substantially improve LLM performance on document inconsistency detection, supported by experiments on a released semi-synthetic dataset.
Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation cs.SE · 2025-06-04 · accept · none · ref 51 · internal anchor
Cross-lingual RACG shows non-trivial but unequal knowledge transfer across 13 programming languages, depending on linguistic affinity and pretraining diversity, with limited reliance on natural language information when using code-specific retrievers.
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning cs.LG · 2025-05-22 · conditional · none · ref 20 · internal anchor
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization cs.CL · 2024-10-31 · unverdicted · none · ref 42 · internal anchor
GigaCheck detects LLM-generated text at both document and span levels by combining fine-tuned language-model embeddings with a DETR-like architecture that treats generated intervals as detectable objects.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 107 · internal anchor
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
OpenVLA: An Open-Source Vision-Language-Action Model cs.RO · 2024-06-13 · unverdicted · none · ref 36 · internal anchor
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone cs.CL · 2024-04-22 · accept · none · ref 12 · internal anchor
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 272 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
TOFU: A Task of Fictitious Unlearning for LLMs cs.LG · 2024-01-11 · conditional · none · ref 19 · internal anchor
TOFU is a new benchmark with synthetic profiles and metrics demonstrating that existing unlearning algorithms for LLMs fail to achieve effective forgetting of targeted information.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 230 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
No Data? No Problem: Synthesizing Security Graphs for Better Intrusion Detection cs.CR · 2025-06-06 · unverdicted · none · ref 38 · internal anchor
PROVSYN synthesizes high-fidelity security provenance graphs via graph generation and LLMs to augment imbalanced datasets, improving downstream APT detection accuracy by up to 38% on benchmarks.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 196 · internal anchor
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
SEDD: Scalable and Efficient Dataset Deduplication with GPUs cs.CL · 2025-01-02 · unverdicted · none · ref 9 · internal anchor
SEDD delivers a distributed GPU deduplication system that reports up to 158x speedup over CPU baselines and 7.8x over NeMo Curator on 30M documents while preserving MinHash fidelity above 0.95 Jaccard.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference cs.CL · 2024-12-18 · unverdicted · none · ref 163 · internal anchor
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation cs.CV · 2024-08-22 · unverdicted · none · ref 11 · internal anchor
Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression cs.CL · 2024-06-17 · unverdicted · none · ref 32 · internal anchor
Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions cs.CL · 2023-11-09 · unverdicted · none · ref 184 · internal anchor
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs cs.CR · 2026-04-21 · unverdicted · none · ref 45 · internal anchor
DP-FLogTinyLLM combines federated learning, differential privacy, and LoRA-tuned tiny LLMs to match centralized log anomaly detection performance on Thunderbird and BGL datasets while preserving privacy.
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap cs.SE · 2024-10-28 · unverdicted · none · ref 70 · internal anchor
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
SLM Finetuning for Natural Language to Domain Specific Code Generation in Production cs.LG · 2026-04-10 · unverdicted · none · ref 31 · internal anchor
Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losing general capabilities.
Gemma 2: Improving Open Language Models at a Practical Size cs.CL · 2024-07-31 · conditional · none · ref 93 · internal anchor
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 153 · internal anchor
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
A Survey on Knowledge Distillation of Large Language Models cs.CL · 2024-02-20 · accept · none · ref 39 · internal anchor
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 211 · internal anchor
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Textbooks Are All You Need II: phi-1.5 technical report

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer