
Finetuned Language Models Are Zero-Shot Learners

Adams Wei Yu, Brian Lester, Jason Wei, Kelvin Guu, Maarten Bosma, Vincent Y. Zhao · 2021 · cs.CL · arXiv 2109.01652

Mixed citation behavior. Most common role is background (68%).

155 Pith papers citing it

Background 68% of classified citations


abstract

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
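As a rough illustration of what "tasks verbalized via natural language instruction templates" can look like in practice, the sketch below renders one natural language inference example into an (input, target) pair for finetuning. The template wordings, the `NLI_TEMPLATES` list, and the `verbalize` helper are hypothetical and written for illustration only; they are not the templates or code used for FLAN.

```python
import random

# Hypothetical instruction templates for a natural language inference task.
# Illustrative wordings only, not the templates from the paper.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?\nOPTIONS:\n{options}",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"?\nOPTIONS:\n{options}",
    "Read the following and decide whether the hypothesis follows.\n"
    "Text: {premise}\nHypothesis: {hypothesis}\nOPTIONS:\n{options}",
]


def verbalize(example, templates, rng):
    """Turn a raw labeled example into an instruction-style training pair."""
    # Sampling a template per example exposes the model to varied phrasings
    # of the same underlying task.
    template = rng.choice(templates)
    options = "\n".join(f"- {option}" for option in example["options"])
    prompt = template.format(
        premise=example["premise"],
        hypothesis=example["hypothesis"],
        options=options,
    )
    return {"input": prompt, "target": example["label"]}


example = {
    "premise": "The dog is asleep on the couch.",
    "hypothesis": "An animal is resting.",
    "options": ["yes", "no"],
    "label": "yes",
}
pair = verbalize(example, NLI_TEMPLATES, random.Random(0))
print(pair["input"])
print("->", pair["target"])
```

Pairs like this, drawn from many tasks, would form the finetuning mixture; zero-shot evaluation then uses templates for task types held out of that mixture.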


citation-role summary

background 33 · other 5 · dataset 3 · method 3

citation-polarity summary

background 30 · unclear 7 · use dataset 3 · support 2 · use method 2
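The counts in the two summaries above can be turned into percentage shares with a few lines of arithmetic. The sketch below assumes the percentages are taken over the 44 classified citations in each summary; under that assumption the headline 68% figure corresponds to the background count in the polarity summary (30 of 44), while the role summary gives 33 of 44, about 75%. The `shares` helper is a hypothetical name, and which tally the page actually uses for its headline is not stated.

```python
# Counts copied from the two summaries on this page.
role_counts = {"background": 33, "other": 5, "dataset": 3, "method": 3}
polarity_counts = {
    "background": 30,
    "unclear": 7,
    "use dataset": 3,
    "support": 2,
    "use method": 2,
}


def shares(counts):
    """Return each label's share of the total, as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


print(sum(role_counts.values()), sum(polarity_counts.values()))
print(shares(role_counts)["background"])      # 75
print(shares(polarity_counts)["background"])  # 68
```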




representative citing papers

Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

GCRL and MISL are unified as control maximization, with three inequivalent GCRL formulations each matched to a MISL objective via bounds on goal-sensitivity.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Discovering Language Model Behaviors with Model-Written Evaluations

cs.CL · 2022-12-19 · unverdicted · novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees

cs.CR · 2026-06-08 · unverdicted · novelty 7.0

PrivCode++ introduces the first DP code generation method protecting both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

OPRD: On-Policy Representation Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

Introduces Lexical Alignment Score and Triangulated Preference Shift metrics to automatically identify lexical overuse in LLMs and attribute portions to preference learning stages via windowed prevalence on PubMed data.

Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

cs.CL · 2026-05-29 · conditional · novelty 7.0

Introduces the MCN multilingual citation-needed detection corpus for 18 languages and demonstrates that fine-tuned small decoder models outperform prompted LLMs in both multilingual and cross-lingual transfer settings.

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

cs.LG · 2026-04-23 · unverdicted · novelty 7.0

Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.

MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other methods on image translation benchmarks.

ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design

q-bio.QM · 2026-04-18 · unverdicted · novelty 7.0

ProtoCycle improves text-guided protein design by coupling an LLM planner with tool feedback and reflection to achieve better language alignment and foldability than direct generation.

LLMAR: A Tuning-Free Recommendation Framework for Sparse and Text-Rich Industrial Domains

cs.IR · 2026-03-25 · unverdicted · novelty 7.0

LLMAR applies LLM reasoning with a self-correction reflection loop to generate semantic user motives for tuning-free recommendations, showing up to 54.6% nDCG@10 gains on a sparse industrial dataset over trained baselines.

Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI

cs.HC · 2026-01-17 · unverdicted · novelty 7.0

Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.

Love, Lies, and Language Models: Investigating AI's Role in Romance-Baiting Scams

cs.CR · 2025-12-18 · unverdicted · novelty 7.0

LLM agents outperform humans in romance-baiting scams, eliciting greater trust and 46% compliance versus 18%, with 0% detection by safety filters and 87% of scam tasks automatable.

Activation Steering with a Feedback Controller

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

cs.LG · 2025-08-28 · unverdicted · novelty 7.0

TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

cs.CV · 2025-08-06 · unverdicted · novelty 7.0

The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.

MetaLint: Easy-to-Hard Generalization for Code Linting

cs.SE · 2025-07-15 · unverdicted · novelty 7.0

MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.

citing papers explorer

Showing 50 of 121 citing papers after filters.

Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization cs.LG · 2026-05-07 · unverdicted · none · ref 4 · internal anchor
GCRL and MISL are unified as control maximization, with three inequivalent GCRL formulations each matched to a MISL objective via bounds on goal-sensitivity.
Discovering Language Model Behaviors with Model-Written Evaluations cs.CL · 2022-12-19 · unverdicted · none · ref 12 · internal anchor
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees cs.CR · 2026-06-08 · unverdicted · none · ref 8 · internal anchor
PrivCode++ introduces the first DP code generation method protecting both prompts and code via latent-conditioned two-stage training, claiming higher utility and stronger privacy than prior baselines.
On the Geometry of On-Policy Distillation cs.LG · 2026-06-05 · unverdicted · none · ref 17 · internal anchor
OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.
OPRD: On-Policy Representation Distillation cs.LG · 2026-06-04 · unverdicted · none · ref 36 · internal anchor
OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.
Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models cs.CL · 2026-06-02 · unverdicted · none · ref 22 · internal anchor
Introduces Lexical Alignment Score and Triangulated Preference Shift metrics to automatically identify lexical overuse in LLMs and attribute portions to preference learning stages via windowed prevalence on PubMed data.
Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning cs.CV · 2026-05-11 · unverdicted · none · ref 42 · internal anchor
DRAPE generates query-image conditioned prompts on the fly for multimodal continual instruction tuning and reports SOTA results on MCIT benchmarks.
PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training cs.LG · 2026-04-23 · unverdicted · none · ref 151 · internal anchor
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms cs.CL · 2026-04-21 · unverdicted · none · ref 13 · internal anchor
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations cs.CL · 2026-04-20 · unverdicted · none · ref 46 · internal anchor
KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.
MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation cs.CL · 2026-04-18 · unverdicted · none · ref 12 · internal anchor
MNAFT identifies language-agnostic and language-specific neurons via activation analysis and selectively fine-tunes only relevant ones in MLLMs to close the modality gap and outperform full fine-tuning and other methods on image translation benchmarks.
ProtoCycle: Reflective Tool-Augmented Planning for Text-Guided Protein Design q-bio.QM · 2026-04-18 · unverdicted · none · ref 6 · internal anchor
ProtoCycle improves text-guided protein design by coupling an LLM planner with tool feedback and reflection to achieve better language alignment and foldability than direct generation.
LLMAR: A Tuning-Free Recommendation Framework for Sparse and Text-Rich Industrial Domains cs.IR · 2026-03-25 · unverdicted · none · ref 38 · internal anchor
LLMAR applies LLM reasoning with a self-correction reflection loop to generate semantic user motives for tuning-free recommendations, showing up to 54.6% nDCG@10 gains on a sparse industrial dataset over trained baselines.
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI cs.HC · 2026-01-17 · unverdicted · none · ref 78 · internal anchor
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
Love, Lies, and Language Models: Investigating AI's Role in Romance-Baiting Scams cs.CR · 2025-12-18 · unverdicted · none · ref 52 · internal anchor
LLM agents outperform humans in romance-baiting scams, eliciting greater trust and 46% compliance versus 18%, with 0% detection by safety filters and 87% of scam tasks automatable.
Activation Steering with a Feedback Controller cs.LG · 2025-10-05 · unverdicted · none · ref 26 · internal anchor
Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.
Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning cs.LG · 2025-08-28 · unverdicted · none · ref 53 · internal anchor
TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting cs.CV · 2025-08-06 · unverdicted · none · ref 58 · internal anchor
The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.
MetaLint: Easy-to-Hard Generalization for Code Linting cs.SE · 2025-07-15 · unverdicted · none · ref 47 · internal anchor
MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 210 · internal anchor
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 126 · internal anchor
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning cs.CV · 2026-07-01 · unverdicted · none · ref 61 · internal anchor
StochasT uses stochastic clustering of language tasks into varying turn depths for the same image to improve LVLMs on both single-turn and multi-turn scenarios without discarding data.
CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning cs.CL · 2026-06-30 · unverdicted · none · ref 79 · internal anchor
CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overestimates reliability.
SHIFT: Gate-Modulated Activation Steering for Knowledge Conflict Mitigation in Retrieval-Augmented Generation cs.CL · 2026-06-26 · unverdicted · none · ref 52 · internal anchor
SHIFT reformulates neuron editing as learnable gate modulation on under 0.01% parameters to let LLMs adaptively balance contextual and parametric knowledge during RAG generation.
Merit or networks? What decides where research is published econ.GN · 2026-06-02 · unverdicted · none · ref 23 · internal anchor
LLM-based pre-publication idea quality scoring on 6208 economics papers shows execution sets a meritocratic floor, idea quality grades intermediate rungs, and connections provide a bounded advantage mainly at top journals.
AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise cs.AI · 2026-05-31 · unverdicted · none · ref 17 · internal anchor
AnyEdit++ proposes Bayes-Chunk, an adaptive segmentation method based on Bayesian Surprise, with theoretical claims of structural independence and causal locality, reporting superior results over baselines on math, code, and narrative tasks.
Trustworthy Recommendation in the Era of Large Language Models: Opportunities and Challenges cs.IR · 2026-05-30 · unverdicted · none · ref 259 · internal anchor
A systematic review of over 200 studies concludes that LLMs in recommender systems act as a double-edged sword, creating both opportunities and new risks for trustworthiness.
Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning cs.CL · 2026-05-29 · unverdicted · none · ref 19 · internal anchor
Introduces a triangulation-based metric to quantify lexical shifts attributable to preference tuning without requiring manual curation of examples.
Graph-GRPO: Dependency-Aware Credit Assignment for Generative E-commerce Search Relevance cs.IR · 2026-05-29 · unverdicted · none · ref 40 · internal anchor
Graph-GRPO builds a dependency graph over CoT steps and propagates outcome rewards to enable finer credit assignment in generative relevance modeling for e-commerce search.
Fine-Tuning Improves Information Conveyance in Language Models cs.CL · 2026-05-29 · unverdicted · none · ref 36 · internal anchor
Fine-tuning reorganizes uncertainty in LLMs into more efficient information conveyance, as shown by stronger length-entropy correlations and a tripling of entropy-semantic diversity links after controls.
MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains cs.AI · 2026-05-28 · unverdicted · none · ref 14 · internal anchor
MEMENTO framework uses adaptive web exploration via AET and dual-channel memory to acquire domain expertise from interaction trajectories, yielding +25.6% and +36.5% gains over ReAct baselines in sales automation and legal research.
Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models cs.SE · 2026-05-25 · unverdicted · none · ref 18 · internal anchor
A two-stage LLM pipeline for taxonomy-based labeling of code changes in patches achieves up to 84% recall and 81% precision on a manually curated benchmark of natural and synthetic patches.
A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback cs.CL · 2026-05-25 · unverdicted · none · ref 35 · internal anchor
A multi-agent LLM system discovers criteria such as Encouraging, Urgent, and Clear for surgical feedback and uses them to score 4.2k instances, outperforming prior content-based approaches in predicting trainee behavior changes and trainer approval.
Hypergraph as Language cs.CL · 2026-05-21 · unverdicted · none · ref 2 · internal anchor
Hyper-Align is a hypergraph-native framework that serializes high-order relations into LLM-compatible tokens via HIDT-O templates and a HIP projector, outperforming graph-centric methods on HyperAlign-Bench.
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection cs.CL · 2026-05-16 · unverdicted · none · ref 58 · internal anchor
MixSD uses dynamic mixing of the model's expert and naive conditionals to create distribution-aligned supervision that improves the memorization-retention tradeoff over standard SFT.
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts cs.CL · 2026-05-13 · unverdicted · none · ref 71 · internal anchor
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices cs.LG · 2026-05-11 · unverdicted · none · ref 2 · 3 links · internal anchor
DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
Rotation-Preserving Supervised Fine-Tuning cs.LG · 2026-05-08 · unverdicted · none · ref 54 · internal anchor
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
Response Time Enhances Alignment with Heterogeneous Preferences cs.LG · 2026-05-07 · unverdicted · none · ref 169 · internal anchor
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
Diversity in Large Language Models under Supervised Fine-Tuning cs.LG · 2026-04-30 · unverdicted · none · ref 35 · 2 links · internal anchor
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
UniBCI: Towards a Unified Pretrained Model for Invasive Brain-Computer Interfaces cs.NE · 2026-04-30 · unverdicted · none · ref 42 · internal anchor
UniBCI is a unified pretrained model for invasive neural spike data that uses CST tokenization, IAA attention, and self-supervised masked reconstruction to achieve SOTA downstream performance with better generalization and efficiency.
RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation cs.CV · 2026-04-19 · unverdicted · none · ref 56 · internal anchor
RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.
x1: Learning to Think Adaptively Across Languages and Cultures cs.CL · 2026-04-18 · unverdicted · none · ref 7 · internal anchor
x1 models adaptively select an advantageous language for reasoning per instance, yielding gains on multilingual math and cultural tasks while showing that scaling does not erase culture-language advantages.
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs cs.AI · 2026-04-15 · unverdicted · none · ref 54 · internal anchor
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.
Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders cs.IR · 2026-04-09 · unverdicted · none · ref 52 · internal anchor
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning cs.CV · 2026-04-03 · unverdicted · none · ref 67 · internal anchor
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Task-Centric Personalized Federated Fine-Tuning of Language Models cs.LG · 2026-03-30 · unverdicted · none · ref 10 · internal anchor
FedRouter clusters adapters locally per task samples and globally across clients to create task-centric personalized models, improving generalization and reducing task interference in federated fine-tuning.
A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks cs.LG · 2026-03-23 · unverdicted · none · ref 19 · 2 links · internal anchor
iAmTime is a time-series foundation model that uses instruction-conditioned in-context learning from demonstrations to perform zero-shot adaptation on forecasting, imputation, classification, and related tasks.
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings cs.LG · 2026-03-11 · unverdicted · none · ref 15 · internal anchor
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards cs.CL · 2025-09-25 · unverdicted · none · ref 44 · internal anchor
RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
