pith. machine review for the scientific record.

arxiv: 2111.02080 · v6 · submitted 2021-11-03 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

An Explanation of In-context Learning as Implicit Bayesian Inference

Pith Number: pith:VPSR4ZIC · state: computed
4 claims · 300 references · 2 theorem links. This is the computed registry record for this paper; it is not author-attested yet.

Pith reviewed 2026-05-16 22:18 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: in-context learning · implicit Bayesian inference · language models · hidden Markov models · pretraining · latent concepts · GINC dataset · scaling effects

The pith

Large language models perform in-context learning by implicitly inferring latent concepts that explain coherence in both pretraining data and prompt examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in-context learning arises naturally when language models are pretrained on data with long-range coherence driven by latent document-level concepts. During pretraining, the model learns to infer which concept is active to predict the next token accurately. At test time, a prompt with input-output examples allows the model to infer the shared latent concept governing the task, enabling it to generalize to new inputs. This holds even with a mismatch between the prompt distribution and pretraining data, as proven in a mixture-of-HMMs model. Experiments on a synthetic dataset called GINC confirm that both Transformers and LSTMs display in-context learning and replicate real-world behaviors like scaling benefits and order sensitivity.

Core claim

In-context learning occurs because the language model treats the prompt examples as observations from the same latent concept that structures coherent pretraining documents. By performing implicit Bayesian inference over which concept is present in the prompt, the model can predict the label for a new query example without any gradient updates. This mechanism is formalized and proven to work in a setting where the pretraining distribution is a mixture of hidden Markov models, each corresponding to a different latent concept, even when the test prompts are drawn from a different distribution.
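
The inference described above can be written as a marginalization over the latent concept. The block below is a sketch in assumed notation, not a quotation of the paper's equations: $\theta$ denotes the latent concept, $S_n$ the $n$ input-output examples in the prompt, $x_{\text{test}}$ the query input, and $y$ the predicted output.

```latex
\[
p(y \mid S_n, x_{\text{test}})
  = \int_{\theta} p(y \mid S_n, x_{\text{test}}, \theta)\,
    p(\theta \mid S_n, x_{\text{test}})\, d\theta,
\qquad
p(\theta \mid S_n, x_{\text{test}})
  \propto p(S_n, x_{\text{test}} \mid \theta)\, p(\theta).
\]
```

Read this way, in-context learning succeeds when the posterior $p(\theta \mid S_n, x_{\text{test}})$ concentrates on the concept shared by the prompt examples, so that the prediction is dominated by that concept's conditional distribution.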

What carries the argument

A mixture of hidden Markov models (HMMs) as the generative model for pretraining documents, where each HMM component represents a distinct latent concept that enforces long-range coherence across the document.
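
To make the mixture-of-HMMs picture concrete, here is a minimal sketch of exact posterior inference over concepts given a token sequence. It is not the paper's GINC code: the function names `hmm_log_likelihood` and `concept_posterior`, the two toy concepts, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def hmm_log_likelihood(obs, init, trans, emit):
    """Log p(obs) under one HMM via the scaled forward algorithm."""
    alpha = init * emit[:, obs[0]]  # joint of first hidden state and first token
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the state belief one step, then weight by the emission.
            alpha = (alpha @ trans) * emit[:, obs[t]]
        norm = alpha.sum()  # p(obs[t] | obs[:t])
        log_lik += np.log(norm)
        alpha = alpha / norm  # renormalise to avoid underflow
    return log_lik


def concept_posterior(obs, concepts, prior):
    """Posterior over mixture components (concepts) given a token sequence."""
    log_post = np.log(prior) + np.array(
        [hmm_log_likelihood(obs, *params) for params in concepts]
    )
    log_post -= log_post.max()  # stabilise the softmax
    post = np.exp(log_post)
    return post / post.sum()


# Hypothetical toy concepts: 2 hidden states and 3 token types each.
init = np.array([0.5, 0.5])
sticky = np.array([[0.9, 0.1], [0.1, 0.9]])  # states tend to persist
concept_a = (init, sticky, np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]))
concept_b = (init, sticky, np.array([[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]]))

tokens = [0, 0, 0, 1, 1, 1, 0, 0]  # mostly tokens that concept A favours
posterior = concept_posterior(tokens, [concept_a, concept_b], np.array([0.5, 0.5]))
print(posterior)
```

In this toy setting the posterior places almost all of its mass on the first concept after only eight tokens. A trained language model is not claimed to run this computation explicitly; the sketch only shows what the ideal Bayesian predictor would infer under the assumed generative model.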

If this is right

  • Models will exhibit in-context learning on prompts that share a latent concept with the pretraining distribution.
  • Performance on in-context tasks improves with increased model size even when pretraining loss stays constant.
  • Predictions are sensitive to the order of examples in the prompt because order affects the inferred posterior over concepts.
  • Zero-shot performance can exceed few-shot in some cases when additional examples introduce mismatch in the inferred concept.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If true, this predicts that in-context learning will be stronger for tasks where examples share clear latent structures, such as consistent topic or style.
  • Extending the framework could explain why in-context learning works better with longer contexts that provide more evidence for the latent concept.
  • The theory suggests designing pretraining data with explicit latent concepts to enhance in-context capabilities without larger models.

Load-bearing premise

Real pretraining data has long-range coherence arising from latent concepts that a mixture of HMMs can capture, and this coherence drives in-context behavior in models trained on web data.

What would settle it

Training a model on synthetic data without latent document-level concepts, such as independent tokens or short-range chains only, and checking whether in-context learning still appears on structured prompts despite low next-token error.
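
A rough sketch of how such control corpora might be generated is given below. Everything in it is an assumption made for illustration: the vocabulary size, document length, helper names `sample_iid` and `sample_markov`, and the randomly drawn transition matrix are hypothetical and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, doc_len = 5, 20  # hypothetical sizes for the sketch


def sample_iid(unigram, length, rng):
    """Document of independent tokens: no structure beyond token frequency."""
    return rng.choice(len(unigram), size=length, p=unigram)


def sample_markov(trans, length, rng):
    """Document from a first-order Markov chain: short-range structure only."""
    tokens = [rng.integers(trans.shape[0])]
    for _ in range(length - 1):
        # The next token depends only on the previous token.
        tokens.append(rng.choice(trans.shape[0], p=trans[tokens[-1]]))
    return np.array(tokens)


unigram = np.full(vocab_size, 1.0 / vocab_size)
trans = rng.dirichlet(np.ones(vocab_size), size=vocab_size)  # rows sum to 1

iid_doc = sample_iid(unigram, doc_len, rng)
markov_doc = sample_markov(trans, doc_len, rng)
print(iid_doc, markov_doc)
```

Neither generator has a document-level latent variable, so a model pretrained on such documents would have no concept to infer; comparing its in-context behavior with that of a model pretrained on concept-structured data is the kind of contrast the test calls for.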

Original abstract

Large language models (LMs) such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. The LM learns from these examples without being explicitly pretrained to learn. Thus, it is unclear what enables in-context learning. In this paper, we study how in-context learning can emerge when pretraining documents have long-range coherence. Here, the LM must infer a latent document-level concept to generate coherent next tokens during pretraining. At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt. We prove when this occurs despite a distribution mismatch between prompts and pretraining data in a setting where the pretraining distribution is a mixture of HMMs. In contrast to messy large-scale datasets used to train LMs capable of in-context learning, we generate a small-scale synthetic dataset (GINC) where Transformers and LSTMs both exhibit in-context learning. Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in-context learning (ICL) emerges in language models via implicit Bayesian inference over latent document-level concepts when pretraining data has long-range coherence. The authors model pretraining as a mixture of HMMs, prove that ICL occurs despite prompt-pretraining mismatch in this setting, and validate the account on the synthetic GINC dataset, which is generated from the same process and on which both Transformers and LSTMs reproduce scaling benefits, order sensitivity, and zero-shot vs. few-shot patterns.

Significance. If the core modeling assumption holds, the work supplies a mechanistic Bayesian explanation for ICL that accounts for several empirical regularities in a controlled, falsifiable manner. The explicit proof for the HMM-mixture case and the independently generated GINC experiments constitute clear strengths, providing a reproducible testbed that goes beyond post-hoc interpretations of real-model behavior.

major comments (2)
  1. [§3] §3 (theoretical derivation): the proof establishes the result only for exact mixture-of-HMMs pretraining; the central claim that this mechanism explains ICL in large language models trained on web-scale data therefore rests on the unverified modeling assumption that real corpora are dominated by comparable long-range latent document structure, which is load-bearing yet receives no direct empirical check.
  2. [§4] §4 (GINC experiments): because the synthetic data is generated from the identical HMM mixture used in the theory, the reported scaling, order, and zero-shot phenomena confirm internal consistency rather than provide an independent test of whether the Bayesian mechanism operates under the distribution mismatch that characterizes actual pretraining corpora.
minor comments (2)
  1. [Abstract] Abstract: the wording should more explicitly separate the proven HMM-mixture regime from the conjectural extension to real pretraining data.
  2. [§2] Notation in the HMM definition: ensure the distinction between prompt-level and document-level latent variables is introduced before the first use of the posterior-inference argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our paper. We address the major concerns regarding the theoretical assumptions and the nature of the GINC experiments below. We have made partial revisions to the manuscript to incorporate additional discussion on these points.

point-by-point responses
  1. Referee: [§3] §3 (theoretical derivation): the proof establishes the result only for exact mixture-of-HMMs pretraining; the central claim that this mechanism explains ICL in large language models trained on web-scale data therefore rests on the unverified modeling assumption that real corpora are dominated by comparable long-range latent document structure, which is load-bearing yet receives no direct empirical check.

    Authors: We agree that our proof is specific to the mixture-of-HMMs model and that generalizing to real LLMs assumes similar long-range latent structure in web data. This assumption is motivated by the fact that natural language documents often maintain coherence around latent concepts (e.g., topics) over extended contexts. Although we do not perform a direct empirical validation on real corpora, which would require scalable methods to identify such latents, the paper provides a proof-of-concept in a controlled setting. In the revised manuscript, we have expanded Section 3 with a discussion of this assumption, its plausibility, and suggestions for future empirical tests.

    revision: partial

  2. Referee: [§4] §4 (GINC experiments): because the synthetic data is generated from the identical HMM mixture used in the theory, the reported scaling, order, and zero-shot phenomena confirm internal consistency rather than provide an independent test of whether the Bayesian mechanism operates under the distribution mismatch that characterizes actual pretraining corpora.

    Authors: We acknowledge that GINC matches the theoretical generative process, serving to verify that the Bayesian inference mechanism produces the observed ICL behaviors under the modeled conditions, including the prompt-pretraining mismatch. This internal validation is valuable for establishing the mechanism's sufficiency. We recognize it does not directly probe real-world distribution mismatches. We have revised Section 4 to better articulate the role of GINC as a minimal testbed that reproduces key real-world phenomena, and added a limitations section addressing the gap to real data.

    revision: partial

Circularity Check

0 steps flagged

No circularity: derivation applies standard Bayesian inference to explicitly stated generative model

full rationale

The paper states an explicit mixture-of-HMMs pretraining distribution, derives that in-context learning corresponds to implicit posterior inference over the shared latent document-level HMM parameters, and proves the result holds under prompt-pretraining mismatch. This is a direct mathematical application of Bayes rule to the given generative process; no predicted quantity is obtained by fitting to the target in-context observations, and the GINC experiments are generated from the identical process to verify the derived behavior rather than to supply the quantities being predicted. No self-citation, ansatz, or uniqueness theorem is invoked as a load-bearing step for the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the modeling choice that pretraining data can be treated as samples from a mixture of HMMs whose components correspond to latent concepts, plus the assumption that real documents exhibit sufficient long-range coherence for this inference to be useful.

axioms (2)
  • domain assumption: Pretraining documents exhibit long-range coherence because they are generated from a shared latent concept.
    Invoked to justify that next-token prediction requires inferring the latent concept during pretraining.
  • domain assumption: The pretraining distribution is a mixture of HMMs.
    Used as the setting in which the proof of in-context learning is carried out.

pith-pipeline@v0.9.0 · 5527 in / 1493 out tokens · 37934 ms · 2026-05-16T22:18:23.706449+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What learning algorithm is in-context learning? Investigations with linear models

    cs.LG 2022-11 accept novelty 8.0

    Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.

  2. Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

    cs.LG 2026-05 unverdicted novelty 7.0

    Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.

  3. Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    PIQL integrates privileged information to accelerate convergence, lower loss, and improve generalization in tabular foundation models.

  4. Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    PIQL integrates train-time-only privileged information into tabular foundation models via new constructions and a reconstruction architecture to achieve faster convergence and better generalization.

  5. Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-mo...

  6. "What Are You Really Trying to Do?": Co-Creating Life Goals from Everyday Computer Use

    cs.HC 2026-05 unverdicted novelty 7.0

    A co-creation process for inferring and refining personal strivings from computer activity logs yields more representative goals and higher user agency than baselines in a 14-person week-long study.

  7. Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

    cs.CV 2026-04 unverdicted novelty 7.0

    Multimodal ICL lags text-only ICL in few-shot settings due to weak cross-modal reasoning alignment and unreliable task mapping transfer, with an inference-stage method proposed to strengthen transfer.

  8. SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

    cs.CL 2026-04 accept novelty 7.0

    SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

  9. In-context Learning and Induction Heads

    cs.LG 2022-09 unverdicted novelty 7.0

    Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...

  10. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  11. Belief or Circuitry? Causal Evidence for In-Context Graph Learning

    cs.AI 2026-05 conditional novelty 6.0

    Causal evidence from representation analysis and interventions shows LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.

  12. SnapAudit: Active Auditing of Differentially Private In-Context Learning via Snapshot-Based Simulation

    cs.CR 2025-11 conditional novelty 6.0

    SnapAudit decomposes DP-ICL into a deterministic snapshot stage and a stochastic noise stage, using bootstrap simulation to achieve 80-200x faster auditing and exposing privacy bound violations in existing Gaussian an...

  13. ART: Automatic multi-step reasoning and tool-use for large language models

    cs.CL 2023-03 unverdicted novelty 6.0

    ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

  14. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    cs.CL 2022-10 accept novelty 6.0

    Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

  15. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  16. VIP-COP: Context Optimization for Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 5.0

    VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimen...

  17. One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Non-linear transformers enable cross-domain generalization in in-context RL by representing value functions from different domains with shared weights inside a shared RKHS.

  18. Can LLMs Take Retrieved Information with a Grain of Salt?

    cs.CL 2026-05 unverdicted novelty 5.0

    LLMs exhibit systematic failures in obeying expressed certainty in retrieved contexts, but a combination of prior reminders, certainty recalibration, and context simplification reduces obedience errors by 25%.

  19. When Context Sticks: Studying Interference in In-Context Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.

  20. SCHK-HTC: Sibling Contrastive Learning with Hierarchical Knowledge-Aware Prompt Tuning for Hierarchical Text Classification

    cs.CL 2026-04 unverdicted novelty 5.0

    SCHK-HTC uses sibling contrastive learning plus hierarchical prompt tuning to improve discrimination between confusable sibling classes in few-shot hierarchical text classification.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 19 Pith papers · 7 internal anchors

  1. [1]

    Statistical inference for probabilistic functions of finite state markov chains

    Leonard E Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics, 37 0 (6): 0 1554--1563, 1966

  2. [2]

    Blei, Andrew Ng, and M

    D. Blei, Andrew Ng, and M. I. Jordan. Latent D irichlet allocation. Journal of Machine Learning Research (JMLR), 3: 0 993--1022, 2003

  3. [3]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  4. [4]

    Le, and Christopher D

    Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. Electra: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations (ICLR), 2020

  5. [5]

    A. P. Dempster, Laird N. M., and Rubin D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39 0 (1): 0 1--38, 1977

  6. [6]

    BERT : Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. In Association for Computational Linguistics (ACL), pages 4171--4186, 2019

  7. [7]

    Making pre-trained language models better few-shot learners

    Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv, 2021

  8. [8]

    Factorial hidden M arkov models

    Zoubin Ghahramani and Michael Jordan. Factorial hidden M arkov models. Machine Learning, 29: 0 245--273, 1997

  9. [9]

    Hidden topic Markov models

    Amit Gruber, Yair Weiss, and Michal Rosen-Zvi. Hidden topic Markov models. In Artificial Intelligence and Statistics (AISTATS), 2007

  10. [10]

    Gunst and O

    M. Gunst and O. Shcherbakova. Asymptotic behavior of Bayes estimators for hidden Markov models with application to ion channels. Mathematical Methods of Statistics, 17, 2008

  11. [11]

    Hastings

    Keith W. Hastings. M onte C arlo sampling methods using M arkov chains and their applications. Biometrika, 57 0 (1): 0 97--109, 1970

  12. [12]

    Long short-term memory

    Sepp Hochreiter and J \"u rgen Schmidhuber. Long short-term memory. Neural Computation, 9 0 (8): 0 1735--1780, 1997

  13. [13]

    The curious case of neural text degeneration

    Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations (ICLR), 2020

  14. [14]

    Surface form competition: Why the highest probability answer isn't always right, 2021

    Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface form competition: Why the highest probability answer isn't always right, 2021

  15. [15]

    How can we know what language models know? In Association for Computational Linguistics (ACL), 2020

    Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? In Association for Computational Linguistics (ACL), 2020

  16. [16]

    Jordan, Zoubin Ghahramani, Tommi S

    Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37: 0 183--233, 1999

  17. [17]

    TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL), 2017

  18. [18]

    Adam: A method for stochastic optimization

    Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015

  19. [19]

    Kleijn and A.W

    B.J.K. Kleijn and A.W. van der Vaart. The Bernstein -von mises theorem under misspecification. Electronic Journal of Statistics, 6, 2012

  20. [20]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021

  21. [21]

    Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Association for Computational Linguistics (ACL), 2020

  22. [22]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Association for Computational Linguistics (ACL), 2021

  23. [23]

    Jurassic-1: Technical details and evaluation

    Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. Jurassic-1: Technical details and evaluation. Technical report, AI21 Labs, August 2021

  24. [24]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. R o BERT a: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  25. [25]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019

  26. [26]

    Rosenbluth, Marshall N

    Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21 0 (6): 0 1087--1092, 1953

  27. [27]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, German Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Association for Computational Linguistics (ACL), 2016

  28. [28]

    Grokking: Generalization beyond overfitting on small algorithmic datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. In ICLR MATH AI Workshop, 2021

  29. [29]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1 0 (8), 2019

  30. [30]

    Optimization as a model for few-shot learning

    Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017

  31. [31]

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matt...

  32. [32]

    Exploiting cloze questions for few shot text classification and natural language inference

    Timo Schick and Hinrich Schütze. Exploiting cloze questions for few shot text classification and natural language inference. In European Association for Computational Linguistics (EACL), 2021

  33. [33]

    Eliciting knowledge from language models using automatically generated prompts

    Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Eliciting knowledge from language models using automatically generated prompts. In Empirical Methods in Natural Language Processing (EMNLP), 2020

  34. [34]

    How to compare different loss functions and their risks

    Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26, 2007

  35. [35]

    A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, 1998

  36. [36]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017

  37. [37]

    GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

    Ben Wang and Aran Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model . https://github.com/kingoflolz/mesh-transformer-jax, May 2021

  38. [38]

    Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning

    Colin Wei, Sang Michael Xie, and Tengyu Ma. Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning. arXiv, 2021 a

  39. [39]

    Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M

    Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv, 2021 b

  40. [40]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R'emi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace 's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

  41. [41]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017

  42. [42]

    Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh

    Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning (ICML), 2021

  43. [43]

    Multiclass classification calibration functions

    Bernardo Ávila Pires and Csaba Szepesvári. Multiclass classification calibration functions. arXiv, 2016

  44. [44]

    Measuring neural net robustness with constraints , year =

    Osbert Bastani and Yani Ioannou and Leonidas Lampropoulos and Dimitrios Vytiniotis and Aditya Nori and Antonio Criminisi , booktitle =. Measuring neural net robustness with constraints , year =

  45. [45]

    Zico Kolter , booktitle =

    Eric Wong and J. Zico Kolter , booktitle =. Provable defenses against adversarial examples via the convex outer adversarial polytope , year =

  46. [46]

    A Dual Approach to Scalable Verification of Deep Networks , year =

    Krishnamurthy Dvijotham and Robert Stanforth and Sven Gowal and Timothy Mann and Pushmeet Kohli , journal =. A Dual Approach to Scalable Verification of Deep Networks , year =

  47. [47]

    Formal guarantees on the robustness of a classifier against adversarial manipulation , year =

    Matthias Hein and Maksym Andriushchenko , booktitle =. Formal guarantees on the robustness of a classifier against adversarial manipulation , year =

  48. [48]

    Amir Ali Ahmadi and Anirudha Majumdar , journal =

  49. [49]

    Training verified learners with learned verifiers , year =

    Krishnamurthy Dvijotham and Sven Gowal and Robert Stanforth and Relja Arandjelovic and Brendan O'Donoghue and Jonathan Uesato and Pushmeet Kohli , journal =. Training verified learners with learned verifiers , year =

  50. [50]

    Scaling provable adversarial defenses , year =

    Eric Wong and Frank Schmidt and Jan Hendrik Metzen and J Zico Kolter , booktitle =. Scaling provable adversarial defenses , year =

  51. [51]

    On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models , year =

    Sven Gowal and Krishnamurthy Dvijotham and Robert Stanforth and Rudy Bunel and Chongli Qin and Jonathan Uesato and Timothy Mann and Pushmeet Kohli , journal =. On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models , year =

  52. [52]

    Synthetic and natural noise both break neural machine translation , year =

    Yonatan Belinkov and Yonatan Bisk , booktitle =. Synthetic and natural noise both break neural machine translation , year =

  53. [53]

    Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adversarial examples for text classification.

  54. [54]

    Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. There is no free lunch in adversarial robustness (but there are unexpected benefits).

  55. [55]

    Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data.

  56. [56]

    Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy.

  57. [57]

    Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training.

  58. [58]

    Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing.

  59. [59]

    Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models.

  60. [60]

    Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning.

  61. [61]

    Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data.

  62. [62]

    Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout.

  63. [63]

    Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition.

  64. [64]

    Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy.

  65. [65]

    Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Second-Order Adversarial Attack and Certifiable Robustness.

  66. [66]

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks.

  67. [67]

    Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms.

  68. [68]

    Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. Unsupervised data augmentation.

  69. [69]

    Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning.

  70. [70]

    Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning.

  71. [71]

    Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.

  72. [72]

    Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using gaussian fields and harmonic functions.

  73. [73]

    Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks.

  74. [74]

    Omar Montasser, Steve Hanneke, and Nathan Srebro.

  75. [75]

    Sebastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints.

  76. [76]

    Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres.

  77. [77]

    Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy.

  78. [78]

    Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers' robustness to adversarial perturbations.

  79. [79]

    Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing.

  80. [80]

    Logan Engstrom, Andrew Ilyas, and Anish Athalye. Evaluating and understanding the robustness of adversarial logit pairing.

Showing first 80 references.