arxiv: 2111.02080 · v6 · submitted 2021-11-03 · 💻 cs.CL · cs.LG


An Explanation of In-context Learning as Implicit Bayesian Inference

Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma


Pith reviewed 2026-05-16 22:18 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords: in-context learning, implicit Bayesian inference, language models, hidden Markov models, pretraining, latent concepts, GINC dataset, scaling effects


The pith

Large language models perform in-context learning by implicitly inferring latent concepts that explain coherence in both pretraining data and prompt examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in-context learning arises naturally when language models are pretrained on data with long-range coherence driven by latent document-level concepts. During pretraining, the model learns to infer which concept is active to predict the next token accurately. At test time, a prompt with input-output examples allows the model to infer the shared latent concept governing the task, enabling it to generalize to new inputs. This holds even with a mismatch between the prompt distribution and pretraining data, as proven in a mixture-of-HMMs model. Experiments on a synthetic dataset called GINC confirm that both Transformers and LSTMs display in-context learning and replicate real-world behaviors like scaling benefits and order sensitivity.

Core claim

In-context learning occurs because the language model treats the prompt examples as observations from the same latent concept that structures coherent pretraining documents. By performing implicit Bayesian inference over which concept is present in the prompt, the model can predict the label for a new query example without any gradient updates. This mechanism is formalized and proven to work in a setting where the pretraining distribution is a mixture of hidden Markov models, each corresponding to a different latent concept, even when the test prompts are drawn from a different distribution.
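The core claim can be written compactly. The notation here is ours and is only a sketch of the idea, not the paper's exact statement: \theta denotes the latent concept, S_n the n prompt examples, x_{\text{test}} the query input, and y the predicted output.

```latex
p(y \mid S_n, x_{\text{test}})
  = \int_{\theta} p(y \mid S_n, x_{\text{test}}, \theta)\,
    p(\theta \mid S_n, x_{\text{test}})\, d\theta,
\qquad
p(\theta \mid S_n, x_{\text{test}})
  \propto p(S_n, x_{\text{test}} \mid \theta)\, p(\theta).
```

The first identity is the law of total probability and the second is Bayes' rule. On this reading, in-context learning succeeds when the posterior over \theta concentrates on the concept shared by the prompt examples as n grows.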

What carries the argument

A mixture of hidden Markov models (HMMs) as the generative model for pretraining documents, where each HMM component represents a distinct latent concept that enforces long-range coherence across the document.
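To make the generative picture above concrete, here is a minimal sketch in plain Python. It assumes a toy mixture of two HMMs with two hidden states and three tokens; every number, and the names `CONCEPTS`, `PRIOR`, `sample_document`, `log_likelihood`, and `concept_posterior`, are illustrative and do not come from the paper or from GINC. It samples a document by first drawing a concept, then scores a token sequence under each concept with the forward algorithm and applies Bayes' rule.

```python
import math
import random

# Hypothetical toy mixture: 2 concepts, 2 hidden states, 3 tokens.
# All probabilities are illustrative, not taken from the paper.
CONCEPTS = [
    {   # concept 0: sticky states, tokens 0 and 1 dominate
        "start": [0.5, 0.5],
        "trans": [[0.9, 0.1], [0.1, 0.9]],
        "emit": [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    },
    {   # concept 1: freely mixing states, token 2 is common
        "start": [0.5, 0.5],
        "trans": [[0.5, 0.5], [0.5, 0.5]],
        "emit": [[0.1, 0.1, 0.8], [0.1, 0.45, 0.45]],
    },
]
PRIOR = [0.5, 0.5]


def sample_document(rng, length):
    """Draw a concept from the prior, then a token sequence from its HMM."""
    concept = rng.choices(range(len(CONCEPTS)), weights=PRIOR)[0]
    hmm = CONCEPTS[concept]
    state = rng.choices(range(len(hmm["start"])), weights=hmm["start"])[0]
    tokens = []
    for _ in range(length):
        emit = hmm["emit"][state]
        tokens.append(rng.choices(range(len(emit)), weights=emit)[0])
        trans = hmm["trans"][state]
        state = rng.choices(range(len(trans)), weights=trans)[0]
    return concept, tokens


def log_likelihood(hmm, tokens):
    """Scaled forward algorithm: log p(tokens | concept), tokens non-empty."""
    n = len(hmm["start"])
    alpha = [hmm["start"][s] * hmm["emit"][s][tokens[0]] for s in range(n)]
    total = sum(alpha)
    log_p = math.log(total)
    alpha = [a / total for a in alpha]
    for tok in tokens[1:]:
        alpha = [
            sum(alpha[r] * hmm["trans"][r][s] for r in range(n)) * hmm["emit"][s][tok]
            for s in range(n)
        ]
        total = sum(alpha)
        log_p += math.log(total)
        alpha = [a / total for a in alpha]
    return log_p


def concept_posterior(tokens):
    """Bayes' rule over concepts: p(concept | tokens) is proportional to prior * likelihood."""
    scores = [math.log(p) + log_likelihood(h, tokens) for p, h in zip(PRIOR, CONCEPTS)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]
```

Calling `concept_posterior` on a run of token 0 puts nearly all mass on concept 0, since concept 1 rarely emits that token. This is exact inference over a known mixture; the paper's claim is that a trained language model approximates this computation implicitly.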

If this is right

Models will exhibit in-context learning on prompts that share a latent concept with the pretraining distribution.
Performance on in-context tasks improves with increased model size even when pretraining loss stays constant.
Predictions are sensitive to the order of examples in the prompt because order affects the inferred posterior over concepts.
Zero-shot performance can exceed few-shot in some cases when additional examples introduce mismatch in the inferred concept.
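The order-sensitivity point in the list above can be illustrated with a toy calculation. The sketch below assumes two hypothetical concepts, named `drift` and `mix` here purely for illustration, and treats a prompt as a plain concatenation of short token tuples. This is a simplification and not the GINC prompt format. Because an HMM likelihood depends on the sequence and not just on the bag of tokens, reordering the same examples changes the posterior over concepts.

```python
import math
from itertools import permutations

# Hypothetical two-concept toy; hidden states and tokens are both {0, 1}.
# "drift" tends to move from state 0 to state 1 and stay there;
# "mix" switches state freely. All probabilities are illustrative.
HMMS = {
    "drift": {
        "start": [0.9, 0.1],
        "trans": [[0.7, 0.3], [0.05, 0.95]],
        "emit": [[0.9, 0.1], [0.1, 0.9]],
    },
    "mix": {
        "start": [0.5, 0.5],
        "trans": [[0.5, 0.5], [0.5, 0.5]],
        "emit": [[0.9, 0.1], [0.1, 0.9]],
    },
}


def likelihood(hmm, tokens):
    """Forward algorithm: p(tokens | concept). Unscaled, fine for short prompts."""
    alpha = [hmm["start"][s] * hmm["emit"][s][tokens[0]] for s in (0, 1)]
    for tok in tokens[1:]:
        alpha = [
            sum(alpha[r] * hmm["trans"][r][s] for r in (0, 1)) * hmm["emit"][s][tok]
            for s in (0, 1)
        ]
    return sum(alpha)


def posterior_drift(examples):
    """p(concept = drift | prompt) under a uniform prior over the two concepts."""
    tokens = [tok for ex in examples for tok in ex]
    like = {name: likelihood(h, tokens) for name, h in HMMS.items()}
    return like["drift"] / (like["drift"] + like["mix"])


# The same three examples in every possible order.
examples = [(0, 0), (0, 1), (1, 1)]
by_order = {order: posterior_drift(order) for order in permutations(examples)}
```

Under `mix` every ordering has the same likelihood, so all of the variation in `by_order` comes from `drift`: orderings that run from 0s to 1s favor `drift`, and the reversed ordering favors `mix`.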

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If true, this predicts that in-context learning will be stronger for tasks where examples share clear latent structures, such as consistent topic or style.
Extending the framework could explain why in-context learning works better with longer contexts that provide more evidence for the latent concept.
The theory suggests designing pretraining data with explicit latent concepts to enhance in-context capabilities without larger models.

Load-bearing premise

Real pretraining data has long-range coherence arising from latent concepts that a mixture of HMMs can capture, and this coherence drives in-context behavior in models trained on web data.

What would settle it

Training a model on synthetic data without latent document-level concepts, such as independent tokens or short-range chains only, and checking whether in-context learning still appears on structured prompts despite low next-token error.
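One way to build the control corpora described above is sketched below. This is our own reading of "independent tokens" and "short-range chains only", not a procedure from the paper, and the function names `unigram_control` and `bigram_control` are hypothetical. Given documents as lists of tokens, the first control resamples every token independently from the corpus unigram distribution. The second resamples each document from a first-order Markov chain fit on the corpus, which keeps local statistics but carries no document-level latent.

```python
import random
from collections import Counter, defaultdict


def unigram_control(docs, rng):
    """Resample every token i.i.d. from the corpus-wide unigram distribution."""
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(counts)
    weights = [counts[t] for t in vocab]
    return [rng.choices(vocab, weights=weights, k=len(doc)) for doc in docs]


def bigram_control(docs, rng):
    """Resample each document from a first-order Markov chain fit on the corpus."""
    starts = Counter(doc[0] for doc in docs if doc)
    unigrams = Counter(tok for doc in docs for tok in doc)
    nexts = defaultdict(Counter)
    for doc in docs:
        for a, b in zip(doc, doc[1:]):
            nexts[a][b] += 1

    def draw(counter):
        items = sorted(counter)
        return rng.choices(items, weights=[counter[i] for i in items])[0]

    out = []
    for doc in docs:
        if not doc:
            out.append([])
            continue
        seq = [draw(starts)]
        while len(seq) < len(doc):
            # Fall back to unigrams for tokens never observed with a successor.
            seq.append(draw(nexts[seq[-1]] or unigrams))
        out.append(seq)
    return out
```

Both controls preserve document lengths and the vocabulary, so a model trained on them sees comparable surface statistics while the long-range concept signal is removed.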

Original abstract

Large language models (LMs) such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. The LM learns from these examples without being explicitly pretrained to learn. Thus, it is unclear what enables in-context learning. In this paper, we study how in-context learning can emerge when pretraining documents have long-range coherence. Here, the LM must infer a latent document-level concept to generate coherent next tokens during pretraining. At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt. We prove when this occurs despite a distribution mismatch between prompts and pretraining data in a setting where the pretraining distribution is a mixture of HMMs. In contrast to messy large-scale datasets used to train LMs capable of in-context learning, we generate a small-scale synthetic dataset (GINC) where Transformers and LSTMs both exhibit in-context learning. Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proves that in-context learning emerges as implicit Bayesian inference over latent concepts in a mixture-of-HMMs pretraining setup and validates the account on a matching synthetic dataset, but the link to real web-scale training data stays an assumption.


The core contribution is a proof that in-context learning arises when a model infers a shared latent document-level concept across prompt examples, and this holds for a mixture-of-HMMs pretraining distribution even with some mismatch to the prompt distribution. The authors also release GINC, a small synthetic dataset generated from the same process, which reproduces scaling improvements, order sensitivity, and cases where zero-shot beats few-shot without any parameter fitting to those outcomes.

Referee Report

2 major / 2 minor

Summary. The paper claims that in-context learning (ICL) emerges in language models via implicit Bayesian inference over latent document-level concepts when pretraining data has long-range coherence. The authors model pretraining as a mixture of HMMs, prove that ICL occurs despite prompt-pretraining mismatch in this setting, and validate the account on a synthetic GINC dataset generated from the identical process, where both Transformers and LSTMs reproduce scaling benefits, order sensitivity, and zero-shot vs. few-shot patterns.

Significance. If the core modeling assumption holds, the work supplies a mechanistic Bayesian explanation for ICL that accounts for several empirical regularities in a controlled, falsifiable manner. The explicit proof for the HMM-mixture case and the independently generated GINC experiments constitute clear strengths, providing a reproducible testbed that goes beyond post-hoc interpretations of real-model behavior.

major comments (2)

[§3] §3 (theoretical derivation): the proof establishes the result only for exact mixture-of-HMMs pretraining; the central claim that this mechanism explains ICL in large language models trained on web-scale data therefore rests on the unverified modeling assumption that real corpora are dominated by comparable long-range latent document structure, which is load-bearing yet receives no direct empirical check.
[§4] §4 (GINC experiments): because the synthetic data is generated from the identical HMM mixture used in the theory, the reported scaling, order, and zero-shot phenomena confirm internal consistency rather than provide an independent test of whether the Bayesian mechanism operates under the distribution mismatch that characterizes actual pretraining corpora.

minor comments (2)

[Abstract] Abstract: the wording should more explicitly separate the proven HMM-mixture regime from the conjectural extension to real pretraining data.
[§2] Notation in the HMM definition: ensure the distinction between prompt-level and document-level latent variables is introduced before the first use of the posterior-inference argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our paper. We address the major concerns regarding the theoretical assumptions and the nature of the GINC experiments below. We have made partial revisions to the manuscript to incorporate additional discussion on these points.


Referee: [§3] §3 (theoretical derivation): the proof establishes the result only for exact mixture-of-HMMs pretraining; the central claim that this mechanism explains ICL in large language models trained on web-scale data therefore rests on the unverified modeling assumption that real corpora are dominated by comparable long-range latent document structure, which is load-bearing yet receives no direct empirical check.

Authors: We agree that our proof is specific to the mixture-of-HMMs model and that generalizing to real LLMs assumes similar long-range latent structure in web data. This assumption is motivated by the fact that natural language documents often maintain coherence around latent concepts (e.g., topics) over extended contexts. Although we do not perform a direct empirical validation on real corpora, which would require scalable methods to identify such latents, the paper provides a proof-of-concept in a controlled setting. In the revised manuscript, we have expanded Section 3 with a discussion of this assumption, its plausibility, and suggestions for future empirical tests.
Revision: partial

Referee: [§4] §4 (GINC experiments): because the synthetic data is generated from the identical HMM mixture used in the theory, the reported scaling, order, and zero-shot phenomena confirm internal consistency rather than provide an independent test of whether the Bayesian mechanism operates under the distribution mismatch that characterizes actual pretraining corpora.

Authors: We acknowledge that GINC matches the theoretical generative process, serving to verify that the Bayesian inference mechanism produces the observed ICL behaviors under the modeled conditions, including the prompt-pretraining mismatch. This internal validation is valuable for establishing the mechanism's sufficiency. We recognize it does not directly probe real-world distribution mismatches. We have revised Section 4 to better articulate the role of GINC as a minimal testbed that reproduces key real-world phenomena, and added a limitations section addressing the gap to real data.
Revision: partial

Circularity Check

0 steps flagged

No circularity: derivation applies standard Bayesian inference to explicitly stated generative model


The paper states an explicit mixture-of-HMMs pretraining distribution, derives that in-context learning corresponds to implicit posterior inference over the shared latent document-level HMM parameters, and proves the result holds under prompt-pretraining mismatch. This is a direct mathematical application of Bayes' rule to the given generative process; no predicted quantity is obtained by fitting to the target in-context observations, and the GINC experiments are generated from the identical process to verify the derived behavior rather than to supply the quantities being predicted. No self-citation, ansatz, or uniqueness theorem is invoked as a load-bearing step for the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the modeling choice that pretraining data can be treated as samples from a mixture of HMMs whose components correspond to latent concepts, plus the assumption that real documents exhibit sufficient long-range coherence for this inference to be useful.

axioms (2)

domain assumption: Pretraining documents exhibit long-range coherence because they are generated from a shared latent concept.
Invoked to justify that next-token prediction requires inferring the latent concept during pretraining.
domain assumption: The pretraining distribution is a mixture of HMMs.
Used as the setting in which the proof of in-context learning is carried out.

pith-pipeline@v0.9.0 · 5527 in / 1493 out tokens · 37934 ms · 2026-05-16T22:18:23.706449+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What learning algorithm is in-context learning? Investigations with linear models
cs.LG 2022-11 accept novelty 8.0

Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.
Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
cs.LG 2026-05 unverdicted novelty 7.0

Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning
cs.LG 2026-05 unverdicted novelty 7.0

PIQL integrates privileged information to accelerate convergence, lower loss, and improve generalization in tabular foundation models.
Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning
cs.LG 2026-05 unverdicted novelty 7.0

PIQL integrates train-time-only privileged information into tabular foundation models via new constructions and a reconstruction architecture to achieve faster convergence and better generalization.
Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations
cs.CL 2026-05 unverdicted novelty 7.0

LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-mo...
"What Are You Really Trying to Do?": Co-Creating Life Goals from Everyday Computer Use
cs.HC 2026-05 unverdicted novelty 7.0

A co-creation process for inferring and refining personal strivings from computer activity logs yields more representative goals and higher user agency than baselines in a 14-person week-long study.
Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks
cs.CV 2026-04 unverdicted novelty 7.0

Multimodal ICL lags text-only ICL in few-shot settings due to weak cross-modal reasoning alignment and unreliable task mapping transfer, with an inference-stage method proposed to strengthen transfer.
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
cs.CL 2026-04 accept novelty 7.0

SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
In-context Learning and Induction Heads
cs.LG 2022-09 unverdicted novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
cs.CL 2026-05 unverdicted novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
Belief or Circuitry? Causal Evidence for In-Context Graph Learning
cs.AI 2026-05 conditional novelty 6.0

Causal evidence from representation analysis and interventions shows LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.
SnapAudit: Active Auditing of Differentially Private In-Context Learning via Snapshot-Based Simulation
cs.CR 2025-11 conditional novelty 6.0

SnapAudit decomposes DP-ICL into a deterministic snapshot stage and a stochastic noise stage, using bootstrap simulation to achieve 80-200x faster auditing and exposing privacy bound violations in existing Gaussian an...
ART: Automatic multi-step reasoning and tool-use for large language models
cs.CL 2023-03 unverdicted novelty 6.0

ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
cs.CL 2022-10 accept novelty 6.0

Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
VIP-COP: Context Optimization for Tabular Foundation Models
cs.LG 2026-05 unverdicted novelty 5.0

VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimen...
One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

Non-linear transformers enable cross-domain generalization in in-context RL by representing value functions from different domains with shared weights inside a shared RKHS.
Can LLMs Take Retrieved Information with a Grain of Salt?
cs.CL 2026-05 unverdicted novelty 5.0

LLMs exhibit systematic failures in obeying expressed certainty in retrieved contexts, but a combination of prior reminders, certainty recalibration, and context simplification reduces obedience errors by 25%.
When Context Sticks: Studying Interference in In-Context Learning
cs.LG 2026-04 unverdicted novelty 5.0

In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.
SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification
cs.CL 2026-04 unverdicted novelty 5.0

SCHK-HTC uses sibling contrastive learning plus hierarchical prompt tuning to improve discrimination between confusable sibling classes in few-shot hierarchical text classification.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 19 Pith papers · 7 internal anchors

[1]

Statistical inference for probabilistic functions of finite state markov chains

Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics, 37 0 (6): 0 1554--1563, 1966

work page 1966
[2]

Blei, Andrew Ng, and M

D. Blei, Andrew Ng, and M. I. Jordan. Latent D irichlet allocation. Journal of Machine Learning Research (JMLR), 3: 0 993--1022, 2003

work page 2003
[3]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[4]

Le, and Christopher D

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. Electra: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations (ICLR), 2020

work page 2020
[5]

A. P. Dempster, Laird N. M., and Rubin D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39 0 (1): 0 1--38, 1977

work page 1977
[6]

BERT : Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. In Association for Computational Linguistics (ACL), pages 4171--4186, 2019

work page 2019
[7]

Making pre-trained language models better few-shot learners

Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv, 2021

work page 2021
[8]

Factorial hidden M arkov models

Zoubin Ghahramani and Michael Jordan. Factorial hidden M arkov models. Machine Learning, 29: 0 245--273, 1997

work page 1997
[9]

Hidden topic Markov models

Amit Gruber, Yair Weiss, and Michal Rosen-Zvi. Hidden topic Markov models. In Artificial Intelligence and Statistics (AISTATS), 2007

work page 2007
[10]

Gunst and O

M. Gunst and O. Shcherbakova. Asymptotic behavior of Bayes estimators for hidden Markov models with application to ion channels. Mathematical Methods of Statistics, 17, 2008

work page 2008
[11]

Hastings

Keith W. Hastings. M onte C arlo sampling methods using M arkov chains and their applications. Biometrika, 57 0 (1): 0 97--109, 1970

work page 1970
[12]

Long short-term memory

Sepp Hochreiter and J \"u rgen Schmidhuber. Long short-term memory. Neural Computation, 9 0 (8): 0 1735--1780, 1997

work page 1997
[13]

The curious case of neural text degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations (ICLR), 2020

work page 2020
[14]

Surface form competition: Why the highest probability answer isn't always right, 2021

Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn't always right, 2021

work page 2021
[15]

How can we know what language models know? In Association for Computational Linguistics (ACL), 2020

Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? In Association for Computational Linguistics (ACL), 2020

work page 2020
[16]

Jordan, Zoubin Ghahramani, Tommi S

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37: 0 183--233, 1999

work page 1999
[17]

TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL), 2017

work page 2017
[18]

Adam: A method for stochastic optimization

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015

work page 2015
[19]

Kleijn and A.W

B.J.K. Kleijn and A.W. van der Vaart. The Bernstein -von mises theorem under misspecification. Electronic Journal of Statistics, 6, 2012

work page 2012
[20]

The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Association for Computational Linguistics (ACL), 2020

work page 2020
[22]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Association for Computational Linguistics (ACL), 2021

work page 2021
[23]

Jurassic-1: Technical details and evaluation

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs, August 2021

work page 2021
[24]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. R o BERT a: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[25]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019

work page 2019
[26]

Rosenbluth, Marshall N

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21 0 (6): 0 1087--1092, 1953

work page 1953
[27]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, German Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Association for Computational Linguistics (ACL), 2016

work page 2016
[28]

Grokking: Generalization beyond overfitting on small algorithmic datasets

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. In ICLR MATH AI Workshop, 2021

work page 2021
[29]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1 0 (8), 2019

work page 2019
[30]

Optimization as a model for few-shot learning

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017

work page 2017
[31]

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matt...

work page 2021
[32]

Exploiting cloze questions for few shot text classification and natural language inference

Timo Schick and Hinrich Schütze. Exploiting cloze questions for few shot text classification and natural language inference. In European Association for Computational Linguistics (EACL), 2021

work page 2021
[33]

Eliciting knowledge from language models using automatically generated prompts

Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Eliciting knowledge from language models using automatically generated prompts. In Empirical Methods in Natural Language Processing (EMNLP), 2020

work page 2020
[34]

How to compare different loss functions and their risks

Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26, 2007

work page 2007
[35]

A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, 1998

work page 1998
[36]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model . https://github.com/kingoflolz/mesh-transformer-jax, May 2021

work page 2021
[38]

Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning

Colin Wei, Sang Michael Xie, and Tengyu Ma. Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning. arXiv, 2021 a

work page 2021
[39]

Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv, 2021 b

work page 2021
[40]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R'emi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace 's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[41]

Understanding deep learning requires rethinking generalization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017

work page 2017
[42]

Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning (ICML), 2021

work page 2021
[43]

Multiclass classification calibration functions

Bernardo Ávila Pires and Csaba Szepesvári. Multiclass classification calibration functions. arXiv, 2016

work page 2016
[44]

Measuring neural net robustness with constraints , year =

Osbert Bastani and Yani Ioannou and Leonidas Lampropoulos and Dimitrios Vytiniotis and Aditya Nori and Antonio Criminisi , booktitle =. Measuring neural net robustness with constraints , year =

work page
[45]

Zico Kolter , booktitle =

Eric Wong and J. Zico Kolter , booktitle =. Provable defenses against adversarial examples via the convex outer adversarial polytope , year =

work page
[46]

A Dual Approach to Scalable Verification of Deep Networks , year =

Krishnamurthy Dvijotham and Robert Stanforth and Sven Gowal and Timothy Mann and Pushmeet Kohli , journal =. A Dual Approach to Scalable Verification of Deep Networks , year =

work page
[47]

Formal guarantees on the robustness of a classifier against adversarial manipulation , year =

Matthias Hein and Maksym Andriushchenko , booktitle =. Formal guarantees on the robustness of a classifier against adversarial manipulation , year =

work page
[48]

Amir Ali Ahmadi and Anirudha Majumdar , journal =

work page
[49]

Training verified learners with learned verifiers , year =

Krishnamurthy Dvijotham and Sven Gowal and Robert Stanforth and Relja Arandjelovic and Brendan O'Donoghue and Jonathan Uesato and Pushmeet Kohli , journal =. Training verified learners with learned verifiers , year =

work page
[50]

Scaling provable adversarial defenses , year =

Eric Wong and Frank Schmidt and Jan Hendrik Metzen and J Zico Kolter , booktitle =. Scaling provable adversarial defenses , year =

[51]

On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models

Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Timothy Mann, and Pushmeet Kohli. On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models.

[52]

Synthetic and natural noise both break neural machine translation

Yonatan Belinkov and Yonatan Bisk. Synthetic and natural noise both break neural machine translation.

[53]

Hotflip: White-box adversarial examples for text classification

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification.

[54]

There is no free lunch in adversarial robustness (but there are unexpected benefits)

Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. There is no free lunch in adversarial robustness (but there are unexpected benefits).

[55]

Adversarially robust generalization requires more data

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data.

[56]

Theoretically principled trade-off between robustness and accuracy

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy.

[57]

Improving the robustness of deep neural networks via stability training

Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training.

[58]

Certified adversarial robustness via randomized smoothing

Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing.

[59]

Semi-supervised self-training of object detection models

Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models.

[60]

Virtual adversarial training: a regularization method for supervised and semi-supervised learning

Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning.

[61]

Autoaugment: Learning augmentation policies from data

Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data.

[62]

Improved regularization of convolutional neural networks with cutout

Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout.

[63]

80 million tiny images: A large data set for nonparametric object and scene recognition

Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition.

[64]

Certified robustness to adversarial examples with differential privacy

Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy.

[65]

Second-Order Adversarial Attack and Certifiable Robustness

Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Second-Order Adversarial Attack and Certifiable Robustness.

[66]

Wide residual networks

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks.

[67]

Realistic evaluation of deep semi-supervised learning algorithms

Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms.

[68]

Unsupervised data augmentation

Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation.

[69]

Temporal ensembling for semi-supervised learning

Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning.

[70]

Regularization with stochastic transformations and perturbations for deep semi-supervised learning

Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning.

[71]

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.

[72]

Semi-supervised learning using gaussian fields and harmonic functions

Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions.

[73]

Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks

Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks.

[74]

Omar Montasser, Steve Hanneke, and Nathan Srebro.

[75]

Adversarial examples from computational constraints

Sebastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints.

[76]

Adversarial spheres

Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres.

[77]

Robustness may be at odds with accuracy

Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy.

[78]

Analysis of classifiers' robustness to adversarial perturbations

Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers' robustness to adversarial perturbations.

[79]

Adversarial logit pairing

Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing.

[80]

Evaluating and understanding the robustness of adversarial logit pairing

Logan Engstrom, Andrew Ilyas, and Anish Athalye. Evaluating and understanding the robustness of adversarial logit pairing.

Showing first 80 references.