hub Baseline reference

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi · 2024 · cs.CL · arXiv 2406.08464

Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.

25 Pith papers citing it

Baseline 55% of classified citations

open full Pith review browse 25 citing papers arXiv PDF

abstract

High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 dataset 4 baseline 2 method 1

citation-polarity summary

background 4 use dataset 4 baseline 2 use method 1

representative citing papers

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

cs.LG · 2025-02-07 · unverdicted · novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

cs.LG · 2026-05-14 · unverdicted · novelty 6.0 · 2 refs

NodeSynth creates evidence-based synthetic queries via a taxonomy generator to evaluate LLMs, revealing up to 5x higher failure rates than human benchmarks and gaps in guard models.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.

Hypothesis generation and updating in large language models

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

cs.CR · 2026-04-30 · unverdicted · novelty 6.0

TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

Domain-adapted clinical LLMs provide only marginal and unstable gains over general models on English clinical MCQA benchmarks, while new Spanish Marmoka models perform better.

Multi-Token Prediction via Self-Distillation

cs.CL · 2026-02-05 · unverdicted · novelty 6.0

Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.

The Ratchet Effect in Silico through Interaction-Driven Cumulative Intelligence in Large Language Models

cs.LG · 2025-07-25 · unverdicted · novelty 6.0

Populations of 1-4B parameter LLMs using peer verification and shared cultural memory achieve 8.8-18.9 point gains on mathematical reasoning tasks and close much of the gap to 70B+ single models.

eLLM: Elastic Memory Management Framework for Efficient LLM Serving

cs.DC · 2025-06-18 · unverdicted · novelty 6.0

eLLM unifies LLM memory management with virtual tensors and elastic ballooning to CPU memory, reporting 2.32x higher decoding throughput and 3x larger batch sizes for 128K inputs.

SmolVLM: Redefining small and efficient multimodal models

cs.AI · 2025-04-07 · unverdicted · novelty 6.0

SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

cs.CL · 2025-08-25 · unverdicted · novelty 5.0

Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

cs.CL · 2025-02-04 · unverdicted · novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

cs.CV · 2024-12-13 · accept · novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.

LLaVA-OneVision: Easy Visual Task Transfer

cs.CV · 2024-08-06 · unverdicted · novelty 5.0

LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression

cs.CL · 2024-06-17 · unverdicted · novelty 5.0

Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

cs.CV · 2025-01-22 · unverdicted · novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

cs.AI · 2024-10-24 · unverdicted · novelty 4.0

Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

cs.CR · 2025-02-02 · unverdicted · novelty 2.0

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

cs.CL · 2026-05-06 · 2 refs

citing papers explorer

Showing 25 of 25 citing papers.

Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 104 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach cs.LG · 2025-02-07 · unverdicted · none · ref 171 · internal anchor
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
NodeSynth: Socially Aligned Synthetic Data for AI Evaluation cs.LG · 2026-05-14 · unverdicted · none · ref 3 · 2 links · internal anchor
NodeSynth creates evidence-based synthetic queries via a taxonomy generator to evaluate LLMs, revealing up to 5x higher failure rates than human benchmarks and gaps in guard models.
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion cs.AI · 2026-05-12 · unverdicted · none · ref 60 · 2 links · internal anchor
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding cs.CL · 2026-05-09 · unverdicted · none · ref 31 · internal anchor
PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
Hypothesis generation and updating in large language models cs.LG · 2026-05-07 · unverdicted · none · ref 52 · internal anchor
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 65 · internal anchor
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning cs.CR · 2026-04-30 · unverdicted · none · ref 32 · internal anchor
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models cs.CL · 2026-04-08 · unverdicted · none · ref 5 · internal anchor
Domain-adapted clinical LLMs provide only marginal and unstable gains over general models on English clinical MCQA benchmarks, while new Spanish Marmoka models perform better.
Multi-Token Prediction via Self-Distillation cs.CL · 2026-02-05 · unverdicted · none · ref 18 · internal anchor
Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.
The Ratchet Effect in Silico through Interaction-Driven Cumulative Intelligence in Large Language Models cs.LG · 2025-07-25 · unverdicted · none · ref 51 · internal anchor
Populations of 1-4B parameter LLMs using peer verification and shared cultural memory achieve 8.8-18.9 point gains on mathematical reasoning tasks and close much of the gap to 70B+ single models.
eLLM: Elastic Memory Management Framework for Efficient LLM Serving cs.DC · 2025-06-18 · unverdicted · none · ref 36 · internal anchor
eLLM unifies LLM memory management with virtual tensors and elastic ballooning to CPU memory, reporting 2.32x higher decoding throughput and 3x larger batch sizes for 128K inputs.
SmolVLM: Redefining small and efficient multimodal models cs.AI · 2025-04-07 · unverdicted · none · ref 34 · internal anchor
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 267 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Enhancing Speech Large Language Models through Reinforced Behavior Alignment cs.CL · 2025-08-25 · unverdicted · none · ref 52 · internal anchor
Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 75 · internal anchor
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 242 · internal anchor
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding cs.CV · 2024-12-13 · accept · none · ref 99 · internal anchor
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
LLaVA-OneVision: Easy Visual Task Transfer cs.CV · 2024-08-06 · unverdicted · none · ref 144 · internal anchor
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression cs.CL · 2024-06-17 · unverdicted · none · ref 60 · internal anchor
Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 95 · internal anchor
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs cs.AI · 2024-10-24 · unverdicted · none · ref 23 · internal anchor
Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety cs.CR · 2025-02-02 · unverdicted · none · ref 209 · internal anchor
A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.
Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training cs.CL · 2026-05-06 · unreviewed · ref 27 · 2 links · internal anchor
SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding cs.DC · 2026-02-10 · unreviewed · ref 53 · internal anchor

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer