GCRL and MISL are unified as control maximization, with three inequivalent GCRL formulations each matched to a MISL objective via bounds on goal-sensitivity.
super hub Mixed citations
Finetuned Language Models Are Zero-Shot Learners
Mixed citation behavior. Most common role is background (68%).
abstract
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and sur
authors
co-cited works
representative citing papers
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.
MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other methods on image translation benchmarks.
ProtoCycle improves text-guided protein design by coupling an LLM planner with tool feedback and reflection to achieve better language alignment and foldability than direct generation.
LLMAR applies LLM reasoning with a self-correction reflection loop to generate semantic user motives for tuning-free recommendations, showing up to 54.6% nDCG@10 gains on a sparse industrial dataset over trained baselines.
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
LLM agents outperform humans in romance-baiting scams, eliciting greater trust and 46% compliance versus 18%, with 0% detection by safety filters and 87% of scam tasks automatable.
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.
MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Sundial uses TimeFlow Loss for native pre-training of Transformers on continuous time series from TimeBench, achieving SOTA point and probabilistic forecasting with millisecond inference.
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
citing papers explorer
-
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
-
Step-Audio 2 Technical Report
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
-
Visual Compositional Tuning
COMPACT synthesizes compositional visual instruction data to reduce VIT training data by 90% while achieving 100.2% of full performance across eight multimodal benchmarks.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
MolReFlect introduces a teacher-student framework that automatically creates fine-grained molecule-text alignments to achieve SOTA results on molecule-caption translation.
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
RouteLLM: Learning to Route LLMs with Preference Data
Router models trained on preference data dynamically select between strong and weak LLMs, cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.
-
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
-
Uncovering Logit Suppression Vulnerabilities in LLM Safety Alignment
SSAG bypasses logit suppression in five LLMs to produce harmful responses at 95% success rate and 86% lower latency; VulMine reaches 77% attack success against defenses.
-
Capabilities of Gemini Models in Medicine
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
-
A Roadmap to Pluralistic Alignment
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
-
Steering Llama 2 via Contrastive Activation Addition
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
Llemma: An Open Language Model For Mathematics
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
-
Simple synthetic data reduces sycophancy in large language models
Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
-
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
Pre-training loss predicts LLM math reasoning better than parameter count; rejection sampling fine-tuning with diverse paths raises LLaMA-7B accuracy on GSM8K from 35.9% with SFT to 49.3%.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
AWQ quantizes LLM weights to low bits by scaling salient channels based on activation statistics, outperforming prior methods on language, coding, math, and multi-modal benchmarks.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
PandaGPT: One Model To Instruction-Follow Them All
A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.
-
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
-
Towards Expert-Level Medical Question Answering with Large Language Models
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
-
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
ART: Automatic multi-step reasoning and tool-use for large language models
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
-
Inner Monologue: Embodied Reasoning through Planning with Language Models
LLMs form an inner monologue from closed-loop language feedback to improve high-level instruction completion in simulated and real robotic rearrangement and kitchen manipulation tasks.
-
MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning
MRKL is a modular neuro-symbolic architecture that integrates LLMs with external knowledge and discrete reasoning to overcome limitations of pure neural language models.
-
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning
SymNoise applies symmetric noise to embeddings during instruction fine-tuning and reports 6.7% higher AlpacaEval scores than NEFTune on LLaMA-2-7B.
-
Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
Compiling agentic workflows into LLM weights creates subterranean agents with near-frontier quality at two orders of magnitude less cost, validated empirically on travel booking, Zoom support, and insurance claims tasks.
-
Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective
SFT on LLMs removes noise-like token interactions in a brief early phase before introducing overfitted ones, explaining inconsistent effectiveness across model scales.
-
Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
Tokenizer fertility varies 1.6x across models on Ukrainian legal text, Qwen uses 60% more tokens than Llama-family models, zero-shot outperforms few-shot by up to 26 points, and pre-war classifiers lose 27.9 points on invasion-era decisions.
-
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
-
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
-
Why Expert Alignment Is Hard: Evidence from Subjective Evaluation
Expert alignment in subjective LLM evaluations is difficult because expert judgments are heterogeneous, partly tacit, dimension-dependent, and temporally unstable.
-
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
-
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
-
Understanding the Prompt Sensitivity
LLMs disperse meaning-preserving prompts internally instead of clustering them, which produces an excessively high upper bound on output log-probability differences via Taylor expansion and Cauchy-Schwarz.
-
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
-
ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization
ACE introduces a solver-adversary loop where an LLM generates both candidate programs and adversarial tests, using execution outcomes for preference optimization to achieve 3-7% pass@1 gains on code benchmarks without ground-truth code.
-
ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
-
Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems
An instruction-tuned 8B LLaMA model parses HPC logs with accuracy matching larger models and processes 600 million Frontier supercomputer logs to reveal temporal patterns and anomalies.
-
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts
Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.