Finetuned Language Models Are Zero-Shot Learners
Mixed citation behavior. Most common role is background (68%).
abstract
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
representative citing papers
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.
MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other methods on image translation benchmarks.
ProtoCycle improves text-guided protein design by coupling an LLM planner with tool feedback and reflection to achieve better language alignment and foldability than direct generation.
LLMAR applies LLM reasoning with a self-correction reflection loop to generate semantic user motives for tuning-free recommendations, showing up to 54.6% nDCG@10 gains on a sparse industrial dataset over trained baselines.
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
LLM agents outperform humans in romance-baiting scams, eliciting greater trust and 46% compliance versus 18%, with 0% detection by safety filters and 87% of scam tasks automatable.
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.
MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Sundial uses TimeFlow Loss for native pre-training of Transformers on continuous time series from TimeBench, achieving SOTA point and probabilistic forecasting with millisecond inference.
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
citing papers explorer
-
Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization
GCRL and MISL are unified as control maximization, with three inequivalent GCRL formulations each matched to a MISL objective via bounds on goal-sensitivity.
-
Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.
-
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
-
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
-
Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations
KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.
-
MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation
MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other methods on image translation benchmarks.
-
ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design
ProtoCycle improves text-guided protein design by coupling an LLM planner with tool feedback and reflection to achieve better language alignment and foldability than direct generation.
-
LLMAR: A Tuning-Free Recommendation Framework for Sparse and Text-Rich Industrial Domains
LLMAR applies LLM reasoning with a self-correction reflection loop to generate semantic user motives for tuning-free recommendations, showing up to 54.6% nDCG@10 gains on a sparse industrial dataset over trained baselines.
-
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
-
AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise
AnyEdit++ proposes Bayes-Chunk, an adaptive segmentation method based on Bayesian Surprise, with theoretical claims of structural independence and causal locality, reporting superior results over baselines on math, code, and narrative tasks.
-
Trustworthy Recommendation in the Era of Large Language Models: Opportunities and Challenges
A systematic review of over 200 studies concludes that LLMs in recommender systems act as a double-edged sword, creating both opportunities and new risks for trustworthiness.
-
Fine-Tuning Improves Information Conveyance in Language Models
Fine-tuning reorganizes uncertainty in LLMs into more efficient information conveyance, as shown by stronger length-entropy correlations and a tripling of entropy-semantic diversity links after controls.
-
MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains
MEMENTO framework uses adaptive web exploration via AET and dual-channel memory to acquire domain expertise from interaction trajectories, yielding +25.6% and +36.5% gains over ReAct baselines in sales automation and legal research.
-
Hypergraph as Language
Hyper-Align is a hypergraph-native framework that serializes high-order relations into LLM-compatible tokens via HIDT-O templates and a HIP projector, outperforming graph-centric methods on HyperAlign-Bench.
-
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
-
Rotation-Preserving Supervised Fine-Tuning
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
UniBCI: Towards a Unified Pretrained Model for Invasive Brain-Computer Interfaces
UniBCI is a unified pretrained model for invasive neural spike data that uses CST tokenization, IAA attention, and self-supervised masked reconstruction to achieve SOTA downstream performance with better generalization and efficiency.
-
RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation
RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.
-
x1: Learning to Think Adaptively Across Languages and Cultures
x1 models adaptively select an advantageous language for reasoning per instance, yielding gains on multilingual math and cultural tasks while showing that scaling does not erase culture-language advantages.
-
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.
-
Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
Task-Centric Personalized Federated Fine-Tuning of Language Models
FedRouter clusters adapters locally per task samples and globally across clients to create task-centric personalized models, improving generalization and reducing task interference in federated fine-tuning.
-
A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks
iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.
-
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
-
Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning
SymNoise applies symmetric noise to embeddings during instruction fine-tuning and reports 6.7% higher AlpacaEval scores than NEFTune on LLaMA-2-7B.
-
Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost
Compiling agentic workflows into LLM weights creates subterranean agents with near-frontier quality at two orders of magnitude less cost, validated empirically on travel booking, Zoom support, and insurance claims tasks.
-
Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective
SFT on LLMs removes noise-like token interactions in a brief early phase before introducing overfitted ones, explaining inconsistent effectiveness across model scales.
-
Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
Tokenizer fertility varies 1.6x across models on Ukrainian legal text, Qwen uses 60% more tokens than Llama-family models, zero-shot outperforms few-shot by up to 26 points, and pre-war classifiers lose 27.9 points on invasion-era decisions.
-
The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
-
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or exceeds other Italian open-source LLMs on most international benchmarks while remaining competitive on ITALIC, though it trails some top international models.
-
Why Expert Alignment Is Hard: Evidence from Subjective Evaluation
Expert alignment in subjective LLM evaluations is difficult because expert judgments are heterogeneous, partly tacit, dimension-dependent, and temporally unstable.
-
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
-
Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
-
Understanding the Prompt Sensitivity
LLMs disperse meaning-preserving prompts internally instead of clustering them, which produces an excessively high upper bound on output log-probability differences via Taylor expansion and Cauchy-Schwarz.
-
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.
-
ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization
ACE introduces a solver-adversary loop where an LLM generates both candidate programs and adversarial tests, using execution outcomes for preference optimization to achieve 3-7% pass@1 gains on code benchmarks without ground-truth code.
-
ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
-
Instruction-Tuned LLMs for Parsing and Mining Unstructured Logs on Leadership HPC Systems
An instruction-tuned 8B LLaMA model parses HPC logs with accuracy matching larger models and processes 600 million Frontier supercomputer logs to reveal temporal patterns and anomalies.
-
Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts
Mainstream conversational models show escalating affective misalignments and ethical guidance failures during staged emotional trajectories, organized into a taxonomy of interactional breakdowns.
-
Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design
Machine interpreting should shift from fidelity metrics to three design priorities—agency, grounding, and experience—drawn from interpreting studies to close the usability gap with human-mediated communication.
-
ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law
A new source-grounded QA dataset for U.S. immigration law is built from official documents and used to fine-tune a 3B model, yielding a 27% mean score improvement over the base model on a held-out sample.
-
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
-
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
-
Co-creation of AI technology, empowering curators of cultural heritage information and guarding research commons
Describes an engineering sequence implementing local RAG chatbots for cultural heritage collections to empower curators while using Dataverse for archiving.
-
Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages
A tutorial synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.