ENMP prunes negative LoRA modules via evolutionary search to boost merging performance to new state-of-the-art levels across language and vision tasks.
hub
International Conference on Learning Representations , year=
22 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.
E-PMQ improves 4-bit quantization accuracy on merged models by 8-42 points across CLIP and GLUE tasks through expert-guided calibration and merged-weight anchoring.
A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.
OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
IAP uses RL to train LLMs to explicitly infer and apply implicit user intent in single-turn personalized QA, achieving ~7.5% average macro-score gains over baselines on LaMP-QA.
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Llama 3.1 8B fine-tuned with calibrated 5% synthetic data augmentation reaches 0.6234 F1-macro on multi-class toxicity detection in gaming chat and places fourth among 35 teams.
citing papers explorer
-
Evolutionary Negative Module Pruning for Better LoRA Merging
ENMP prunes negative LoRA modules via evolutionary search to boost merging performance to new state-of-the-art levels across language and vision tasks.
-
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
-
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
-
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA
LoRA adapters should be scaled by 1/sqrt(rank) rather than 1/rank to stabilize learning and enable effective use of higher ranks during fine-tuning of large language models.
-
Detecting Pretraining Data from Large Language Models
Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
-
LIMA: Less Is More for Alignment
Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.
-
E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring
E-PMQ improves 4-bit quantization accuracy on merged models by 8-42 points across CLIP and GLUE tasks through expert-guided calibration and merged-weight anchoring.
-
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.
-
Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time
OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
-
Instructions Shape Production of Language, not Processing
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
-
Training Language Models to Self-Correct via Reinforcement Learning
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
-
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
DoLa reduces hallucinations in LLMs by contrasting logits from later versus earlier layers during decoding, improving truthfulness on TruthfulQA by 12-17 absolute points without fine-tuning or retrieval.
-
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
-
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
-
REPLUG: Retrieval-Augmented Black-Box Language Models
REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
-
Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering
IAP uses RL to train LLMs to explicitly infer and apply implicit user intent in single-turn personalized QA, achieving ~7.5% average macro-score gains over baselines on LaMP-QA.
-
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
-
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
-
PSK@EEUCA 2026: Fine-Tuning Large Language Models with Synthetic Data Augmentation for Multi-Class Toxicity Detection in Gaming Chat
Llama 3.1 8B fine-tuned with calibrated 5% synthetic data augmentation reaches 0.6234 F1-macro on multi-class toxicity detection in gaming chat and places fourth among 35 teams.