
Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. · 2020

41 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background: 3

citation-polarity summary

background: 3

representative citing papers

SurF: A Generative Model for Multivariate Irregular Time Series Forecasting

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

SurF applies the Time Rescaling Theorem as a learnable bijection to create a single generative model for forecasting irregular multivariate event streams that outperforms or matches baselines on six benchmarks.

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

cs.CR · 2026-04-01 · unverdicted · novelty 7.0

SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26x less latency.

Post-Selection Distributional Model Evaluation

stat.ML · 2026-03-24 · unverdicted · novelty 7.0

PS-DME is a new framework that controls post-selection false coverage rate for distributional KPI estimates via e-values and is provably more sample-efficient than data splitting under explicit conditions.

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

The Adam-SGD gap in large-batch LLM pre-training arises mainly from SGD's restricted effective learning rates caused by small gradients and output-layer spikes; clipping lets SGD recover nearly all of Adam's performance.

AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

AURORA is a representation learning framework that uses contextual orthogonalization and relational alignment to create disentangled, geometrically interpretable latent spaces in healthcare foundation models.

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

Event Fields: Learning Latent Event Structure for Waveform Foundation Models

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Event-centric waveform foundation models are learned via self-supervised consistency on latent event structures and interactions, yielding improved performance and label efficiency over sequence-based baselines on physiological tasks.

A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Depth expansion in normalized residual networks yields provable test-risk improvement through representational, optimization, and generalization gains under first-order descent and norm-control conditions.

MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.

Binomial Gradient-Based Meta-Learning for Enhanced Meta-Gradient Estimation

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

BinomMAML uses a binomial expansion to estimate meta-gradients more accurately than prior approximations, with error bounds that improve on existing methods and decay super-exponentially under mild conditions.

Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.

Rethinking Residual Errors in Compensation-based LLM Quantization

cs.LG · 2026-04-09 · conditional · novelty 6.0

Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.

Reflective Context Learning: Studying the Optimization Primitives of Context Space

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene

M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

cs.LG · 2026-03-15 · unverdicted · novelty 6.0

M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.

Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring

cs.LG · 2025-11-28 · unverdicted · novelty 6.0

Delta-XAI wraps existing XAI methods for online time series and introduces SWING to explain prediction changes while accounting for temporal dependencies.

SAM 3D: 3Dfy Anything in Images

cs.CV · 2025-11-20 · unverdicted · novelty 6.0

SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

Differentiable Filtering for Learning Hidden Markov Models

cs.LG · 2025-11-13 · unverdicted · novelty 6.0

Belief Net learns HMM parameters by implementing the forward filter as a decoder-only neural network whose weights are the logits of the initial, transition, and emission distributions, trained end-to-end with autoregressive loss.

Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions

stat.ML · 2025-11-03 · conditional · novelty 6.0

Derives closed-form optimal attention temperature minimizing ICL generalization error under distribution shift, linked to pre-softmax score moments, with LLM validation.

citing papers explorer

Showing 41 of 41 citing papers.

SurF: A Generative Model for Multivariate Irregular Time Series Forecasting

cs.LG · 2026-05-13 · unverdicted · none · ref 11

SurF applies the Time Rescaling Theorem as a learnable bijection to create a single generative model for forecasting irregular multivariate event streams that outperforms or matches baselines on six benchmarks.

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

cs.CV · 2026-04-23 · unverdicted · none · ref 3

Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.

Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo

cs.LG · 2026-04-07 · unverdicted · none · ref 3

Sequential Monte Carlo sampling from a reward-augmented sequence distribution improves LLM performance on HumanEval by up to 54.9% and MATH500 by up to 8.8%, outperforming standard sampling and GRPO.

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

cs.CR · 2026-04-01 · unverdicted · none · ref 2

SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26x less latency.

Post-Selection Distributional Model Evaluation

stat.ML · 2026-03-24 · unverdicted · none · ref 2

PS-DME is a new framework that controls post-selection false coverage rate for distributional KPI estimates via e-values and is provably more sample-efficient than data splitting under explicit conditions.

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

cs.LG · 2025-10-05 · unverdicted · none · ref 3

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

cs.LG · 2026-05-18 · unverdicted · none · ref 2

The Adam-SGD gap in large-batch LLM pre-training arises mainly from SGD's restricted effective learning rates caused by small gradients and output-layer spikes; clipping lets SGD recover nearly all of Adam's performance.

AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models

cs.LG · 2026-05-18 · unverdicted · none · ref 17

AURORA is a representation learning framework that uses contextual orthogonalization and relational alignment to create disentangled, geometrically interpretable latent spaces in healthcare foundation models.

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

cs.LG · 2026-05-14 · unverdicted · none · ref 3

Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

cs.LG · 2026-05-12 · unverdicted · none · ref 4 · 2 links

Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

cs.CL · 2026-05-12 · unverdicted · none · ref 4

BitLM replaces per-token softmax with bitwise continuous diffusion inside causal blocks to generate multiple tokens in parallel while preserving autoregressive structure.

Event Fields: Learning Latent Event Structure for Waveform Foundation Models

cs.LG · 2026-05-09 · unverdicted · none · ref 16

Event-centric waveform foundation models are learned via self-supervised consistency on latent event structures and interactions, yielding improved performance and label efficiency over sequence-based baselines on physiological tasks.

A Qualitative Test-Risk Mechanism for Scaling Behavior in Normalized Residual Networks

cs.LG · 2026-05-08 · unverdicted · none · ref 1

Depth expansion in normalized residual networks yields provable test-risk improvement through representational, optimization, and generalization gains under first-order descent and norm-control conditions.

MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling

cs.CV · 2026-04-19 · unverdicted · none · ref 3

MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.

Binomial Gradient-Based Meta-Learning for Enhanced Meta-Gradient Estimation

cs.LG · 2026-04-14 · unverdicted · none · ref 2

BinomMAML uses a binomial expansion to estimate meta-gradients more accurately than prior approximations, with error bounds that improve on existing methods and decay super-exponentially under mild conditions.

Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

cs.LG · 2026-04-14 · unverdicted · none · ref 3

Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.

Rethinking Residual Errors in Compensation-based LLM Quantization

cs.LG · 2026-04-09 · conditional · none · ref 2

Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

cs.LG · 2026-04-06 · unverdicted · none · ref 4

Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.

Reflective Context Learning: Studying the Optimization Primitives of Context Space

cs.LG · 2026-04-03 · unverdicted · none · ref 2

Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, and grouped rollouts, yielding improvements on AppWorld, BrowseComp+, and RewardBene

M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

cs.LG · 2026-03-15 · unverdicted · none · ref 3

M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.

Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring

cs.LG · 2025-11-28 · unverdicted · none · ref 1

Delta-XAI wraps existing XAI methods for online time series and introduces SWING to explain prediction changes while accounting for temporal dependencies.

SAM 3D: 3Dfy Anything in Images

cs.CV · 2025-11-20 · unverdicted · none · ref 3

SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

Differentiable Filtering for Learning Hidden Markov Models

cs.LG · 2025-11-13 · unverdicted · none · ref 4

Belief Net learns HMM parameters by implementing the forward filter as a decoder-only neural network whose weights are the logits of the initial, transition, and emission distributions, trained end-to-end with autoregressive loss.

Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions

stat.ML · 2025-11-03 · conditional · none · ref 1

Derives closed-form optimal attention temperature minimizing ICL generalization error under distribution shift, linked to pre-softmax score moments, with LLM validation.

SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization

cs.LG · 2025-10-29 · unverdicted · none · ref 2

SemanticOpt fine-tunes LLMs on structured Bayesian optimization trajectories augmented with natural-language context to jointly use numerical and semantic evidence for black-box optimization.

Speculative Coupled Decoding for Training-Free Lossless Acceleration of Autoregressive Visual Generation

cs.CV · 2025-10-28 · unverdicted · none · ref 8

Speculative Coupled Decoding stabilizes draft sampling in Speculative Jacobi Decoding via an information-theoretic coupling step, delivering up to 4.2x image and 13.6x video speedups with no quality loss or training.

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

cs.LG · 2025-10-21 · unverdicted · none · ref 8

A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.

LILO: Bayesian Optimization with Natural Language Feedback

cs.LG · 2025-10-20 · unverdicted · none · ref 3

LILO integrates LLMs to translate natural language feedback into preference signals for Gaussian process-based Bayesian optimization, outperforming standard preference BO and LLM-only methods on benchmarks.

AtlasKV: Augmenting LLMs with Billion-Scale Knowledge Graphs in 20GB VRAM

cs.CL · 2025-10-20 · unverdicted · none · ref 2

AtlasKV integrates billion-scale KGs into LLMs parametrically with sub-linear complexity and low memory by converting triples into key-value representations handled by the model's attention.

ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

cs.CL · 2025-10-01 · unverdicted · none · ref 3

ReSeek adds self-correction via a JUDGE action and a dense instructive reward (correctness plus utility) to RL training of search agents, yielding higher success and faithfulness on a new contamination-resistant benchmark.

GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference

cs.DC · 2025-09-29 · unverdicted · none · ref 1

GRACE-MoE integrates expert grouping, dynamic replication, and locality-aware routing with hierarchical sparse communication to reduce end-to-end latency in distributed SMoE inference.

MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

cs.CL · 2025-09-08 · unverdicted · none · ref 4

MachineLearningLM uses continued pretraining on SCM-synthesized ML tasks with random-forest distillation to give LLMs robust many-shot in-context learning on tabular classification, reaching random-forest accuracy levels while preserving general chat performance.

A3: an Analytical Low-Rank Approximation Framework for Attention

cs.CL · 2025-05-19 · conditional · none · ref 3

A3 splits Transformer layers into QK, OV, and MLP components and derives analytical low-rank approximations that reduce hidden dimensions while minimizing each component's functional loss, yielding better perplexity than prior low-rank methods on LLaMA models.

WISTERIA: Learning Clinical Representations from Noisy Supervision via Multi-View Consistency in Electronic Health Records

cs.LG · 2026-05-10 · unverdicted · none · ref 17

WISTERIA learns robust clinical representations from noisy EHR labels by enforcing consistency across multiple weak supervision views plus ontology regularization.

Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

cs.LG · 2026-04-19 · unverdicted · none · ref 4

ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

cs.AI · 2026-03-18 · unverdicted · none · ref 5

CRAFT uses contrastive representation learning and RL on hidden states to align reasoning models for improved safety against jailbreaks, reporting 79% and 87.7% gains over base models.

Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

cs.AI · 2025-11-11 · unverdicted · none · ref 2

Prepending a compact learnable prefix to LLMs produces safety gains comparable to next-generation aligned models while preserving fluency and adding negligible parameters.

Online In-Context Distillation for Low-Resource Vision Language Models

cs.CV · 2025-10-20 · unverdicted · none · ref 5

Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.

Diffusion Models as Dataset Distillation Priors

cs.LG · 2025-10-20 · unverdicted · none · ref 1

DAP formalizes a representativeness prior via Mercer kernel similarity in feature space and uses it to guide diffusion reverse process for higher-quality distilled datasets on ImageNet without retraining.

TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

cs.LG · 2025-05-16 · unverdicted · none · ref 4

TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.

On the Quantization Robustness of Diffusion Language Models in Coding Benchmarks

cs.LG · 2026-04-22 · unverdicted · none · ref 3

Diffusion coding model CoDA shows smaller accuracy drops than Qwen3-1.7B under 2-4 bit quantization on HumanEval and MBPP.
