pith. machine review for the scientific record.

arxiv: 2212.03827 · v2 · submitted 2022-12-07 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 3 theorem links

Discovering Latent Knowledge in Language Models Without Supervision

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:30 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords latent knowledge · unsupervised learning · language models · activation space · logical consistency · truth discovery · prompt sensitivity · yes-no questions

The pith

A linear direction in language model activations encodes latent truth and can be found without any supervision or labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an unsupervised method that locates a direction inside a language model's internal activations satisfying logical consistency rules, such as opposite values for a statement and its negation. This direction is identified from unlabeled activations alone and is used to answer yes-no questions by checking the sign of the projection. Across six models and ten datasets, the resulting answers exceed zero-shot accuracy by 4% on average. The method also halves sensitivity to prompt wording and keeps accuracy high even when the model is explicitly prompted to generate wrong answers.

Core claim

The central claim is that a direction exists in activation space such that the sign of the projection of a statement's activation vector onto this direction correctly indicates the statement's truth value. This direction is recovered without labels, by searching for the vector that best satisfies logical consistency constraints, such as a statement and its negation receiving opposite values, across many unlabeled statements. The resulting classifier recovers diverse knowledge represented inside the model and outperforms zero-shot baselines while remaining robust to prompt variations and to instructions that ask the model to lie.

What carries the argument

The central object is a single linear direction in activation space found by optimizing for logical consistency: the projection of any statement and its negation must have opposite signs, and the sign of the projection then serves as the yes-no answer.
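
The search described above can be sketched end to end on synthetic data. Everything below is an illustrative assumption rather than the paper's exact recipe: the activations are fabricated around a planted truth direction, and the probe is trained with a consistency term (a statement and its negation should get probabilities summing to one) plus a confidence term that rules out the degenerate answer of 0.5 everywhere. No labels enter the optimization; they are used only to score the result afterward.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 16

# Illustrative assumption: a single "truth" direction exists in activation space.
v_true = rng.normal(size=d)
v_true /= np.linalg.norm(v_true)
labels = rng.integers(0, 2, size=n)                  # 1 = the statement is true
signs = np.where(labels == 1, 1.0, -1.0)
X_pos = signs[:, None] * v_true + rng.normal(scale=0.3, size=(n, d))   # statements
X_neg = -signs[:, None] * v_true + rng.normal(scale=0.3, size=(n, d))  # negations

# Normalize each set separately to remove the trivial "this is a negation" direction.
X_pos -= X_pos.mean(axis=0)
X_neg -= X_neg.mean(axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on a consistency-plus-confidence loss; no labels are used here.
w, b = rng.normal(scale=0.1, size=d), 0.0
lr = 0.5
for _ in range(2000):
    p_pos, p_neg = sigmoid(X_pos @ w + b), sigmoid(X_neg @ w + b)
    # Consistency: p(s) + p(not s) should equal 1.
    g = 2.0 * (p_pos + p_neg - 1.0)
    dz_pos = g * p_pos * (1.0 - p_pos)
    dz_neg = g * p_neg * (1.0 - p_neg)
    # Confidence: push the smaller probability toward 0, ruling out p = 0.5 everywhere.
    m = p_pos < p_neg
    dz_pos += np.where(m, 2.0 * p_pos, 0.0) * p_pos * (1.0 - p_pos)
    dz_neg += np.where(~m, 2.0 * p_neg, 0.0) * p_neg * (1.0 - p_neg)
    w -= lr * (X_pos.T @ dz_pos + X_neg.T @ dz_neg) / n
    b -= lr * (dz_pos.sum() + dz_neg.sum()) / n

# The sign of the projection answers the yes-no question, up to one global flip.
pred = (X_pos @ w + b > 0).astype(int)
acc = max((pred == labels).mean(), 1.0 - (pred == labels).mean())
print(f"held-in accuracy: {acc:.2f}")
```

The final `max(acc, 1 - acc)` reflects the method's inherent sign ambiguity: consistency alone cannot say which end of the recovered direction means "true".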

If this is right

  • What the model knows internally can be read out separately from the text it generates under a given prompt.
  • Prompt engineering becomes less necessary for eliciting truthful answers.
  • The technique works even on models trained by imitation learning that may reproduce human errors in their outputs.
  • Accuracy holds when models are explicitly instructed to produce incorrect answers, showing the direction tracks internal knowledge rather than surface generation.
  • The same consistency-based search can be repeated on new models without task-specific labels or fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar directions might be recoverable for other abstract properties such as uncertainty or logical consistency itself.
  • Model alignment procedures could directly optimize or verify against these latent directions instead of generated text.
  • The approach suggests that internal truth representations in language models are often approximately linear and therefore relatively easy to isolate.

Load-bearing premise

The assumption that a single linear direction exists in activation space whose projections are logically consistent and specifically track the model's knowledge of truth rather than some other consistent property.

What would settle it

If the sign of projections onto the recovered direction predicts ground-truth answers on a held-out set of yes-no questions no better than the zero-shot baseline, or if no direction satisfies the consistency constraints across a diverse collection of statements.

read the original abstract

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes an unsupervised method to extract latent knowledge from language models by searching for a linear direction in activation space that satisfies logical consistency (a statement and its negation have opposite projections). This direction is then used to answer yes-no questions on unlabeled activations. Across 6 models and 10 QA datasets, the method reports a 4% average improvement over zero-shot baselines, halves prompt sensitivity, and maintains accuracy under misleading prompts.

Significance. If the central claim holds, the result is significant: it offers a purely unsupervised route to internal model knowledge that is distinct from generated outputs and less sensitive to prompting. The empirical scope (multiple models and datasets) and the robustness experiments provide concrete evidence that consistency-based directions can recover factual information without ground-truth labels or model outputs.

major comments (3)
  1. [§3.2] Consistency objective: The method selects the direction v that maximizes logical consistency (proj(a(s)) ≈ −proj(a(¬s))). This property is satisfied by any binary feature that flips under negation, not necessarily truth. The manuscript does not include an ablation that compares the consistency-selected direction against other high-consistency directions (e.g., random vectors or directions optimized for different objectives) to show that accuracy collapses when consistency holds but truth correlation is removed.
  2. [§4.3] Robustness to lying prompts: The reported robustness is measured by prompting the model to generate incorrect answers while still using the fixed direction v. It is unclear whether v is recomputed on the new activations or held fixed from the original unlabeled set; if recomputed, the experiment does not isolate latent knowledge from prompt-induced changes in the activation distribution.
  3. [Table 2, §4.1] The 4% average gain is reported across 10 datasets, but per-dataset variance is large (some datasets show <1% gain). The manuscript should report whether the gain is statistically significant after multiple-comparison correction and whether it remains when the direction is selected on a held-out subset of statements rather than the full unlabeled pool.
minor comments (2)
  1. [Figure 2] The legend does not distinguish the zero-shot baseline from the consistency direction; add explicit labels or a separate panel.
  2. [§3.1] Notation for the projection operator is introduced without an explicit equation; add Eq. (X) defining proj_v(a) = v·a / ||v||.
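
The first major comment can be made concrete with a stylized construction (the synthetic data and feature names are invented for this demonstration, not taken from the paper): build activations in which two independent binary features both flip under negation. Both corresponding directions then satisfy the consistency property equally well, yet only one tracks truth.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 12

v_truth = np.zeros(d); v_truth[0] = 1.0   # flips under negation AND tracks truth
v_other = np.zeros(d); v_other[1] = 1.0   # flips under negation, truth-independent

truth = rng.integers(0, 2, size=n) * 2 - 1
other = rng.integers(0, 2, size=n) * 2 - 1
A_pos = truth[:, None] * v_truth + other[:, None] * v_other \
        + rng.normal(scale=0.1, size=(n, d))
A_neg = -truth[:, None] * v_truth - other[:, None] * v_other \
        + rng.normal(scale=0.1, size=(n, d))

def consistency_gap(v):
    # Mean |proj(s) + proj(not s)|: near zero for any negation-flipping direction.
    return np.abs(A_pos @ v + A_neg @ v).mean()

def truth_accuracy(v):
    pred = np.sign(A_pos @ v)
    acc = (pred == truth).mean()
    return max(acc, 1.0 - acc)            # allow the global sign flip

for name, v in [("truth direction", v_truth), ("other direction", v_other)]:
    print(f"{name}: gap={consistency_gap(v):.3f}, accuracy={truth_accuracy(v):.2f}")
```

Both directions pass the consistency check with the same small gap, so the consistency objective alone cannot separate them; only evaluation against external labels can, which is exactly the ablation the referee asks for.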

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate clarifications and additional experiments where appropriate.

read point-by-point responses
  1. Referee: [§3.2] Consistency objective: The method selects the direction v that maximizes logical consistency (proj(a(s)) ≈ −proj(a(¬s))). This property is satisfied by any binary feature that flips under negation, not necessarily truth. The manuscript does not include an ablation that compares the consistency-selected direction against other high-consistency directions (e.g., random vectors or directions optimized for different objectives) to show that accuracy collapses when consistency holds but truth correlation is removed.

    Authors: We agree that logical consistency is a necessary but not sufficient condition for recovering factual truth, as other negating features could in principle satisfy the objective. Our empirical results across multiple models and datasets show that the consistency-optimized direction reliably correlates with ground-truth labels on held-out factual questions, outperforming zero-shot baselines. To directly address the concern, we will add an ablation study in the revised manuscript that compares the consistency-selected direction against random vectors and directions optimized for alternative objectives, confirming that high consistency without factual correlation does not yield comparable accuracy. revision: yes

  2. Referee: [§4.3] Robustness to lying prompts: The reported robustness is measured by prompting the model to generate incorrect answers while still using the fixed direction v. It is unclear whether v is recomputed on the new activations or held fixed from the original unlabeled set; if recomputed, the experiment does not isolate latent knowledge from prompt-induced changes in the activation distribution.

    Authors: The direction v is computed once on the original unlabeled set of activations and held fixed for all robustness experiments, including those with lying prompts. This isolates the latent knowledge encoded in the fixed direction from any prompt-induced shifts in the activation distribution. We will revise §4.3 to explicitly state this procedure and include a brief diagram clarifying the experimental flow. revision: yes

  3. Referee: [Table 2, §4.1] The 4% average gain is reported across 10 datasets, but per-dataset variance is large (some datasets show <1% gain). The manuscript should report whether the gain is statistically significant after multiple-comparison correction and whether it remains when the direction is selected on a held-out subset of statements rather than the full unlabeled pool.

    Authors: We acknowledge the per-dataset variance in Table 2. In the revision we will add statistical significance tests for the average improvement (with Bonferroni correction for multiple comparisons) and report per-dataset p-values. We will also include new results in which the direction is selected using only a held-out subset of statements, confirming that the reported gains persist and are not due to using the full unlabeled pool for direction selection. revision: yes
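
One hedged way to run the promised check is a per-dataset two-proportion z-test under a Bonferroni-adjusted threshold. The correct-answer counts below are invented for illustration, and a paired test on per-question outcomes would be more powerful than this unpaired approximation:

```python
from math import erf, sqrt

def one_sided_p(k_method, k_base, n):
    """One-sided z-test that the method's accuracy exceeds the baseline's."""
    p1, p2 = k_method / n, k_base / n
    pooled = (k_method + k_base) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = (p1 - p2) / se
    return 0.5 * (1 - erf(z / sqrt(2)))   # P(N(0,1) > z)

# Hypothetical per-dataset correct counts out of n = 400 held-out questions.
n = 400
method   = [350, 344, 310, 298, 372, 330, 289, 341, 355, 322]
baseline = [332, 329, 305, 296, 340, 318, 287, 322, 338, 310]

alpha, m = 0.05, len(method)
for i, (k1, k2) in enumerate(zip(method, baseline), start=1):
    p = one_sided_p(k1, k2, n)
    flag = "significant" if p < alpha / m else "n.s."   # Bonferroni: alpha / m
    print(f"dataset {i:2d}: p = {p:.4f}  {flag}")
```

With made-up counts like these, only the datasets with large absolute gains survive the corrected threshold, which is precisely the pattern the referee suspects behind the 4% average.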

Circularity Check

0 steps flagged

No significant circularity; the unsupervised consistency objective is evaluated against external labels

full rationale

The derivation finds a direction v in activation space by maximizing logical consistency (proj(a(s)) ≈ -proj(a(¬s))) over unlabeled statements. This objective uses only model activations and the negation operator; no ground-truth labels enter the optimization. Reported accuracy is measured against 10 external QA datasets whose labels are never used to select or fit v. No equation reduces the final accuracy to a fitted parameter, and no load-bearing step relies on self-citation of an unverified uniqueness result. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of a linear direction in activation space that encodes truth via logical consistency. No free parameters are explicitly fitted in the abstract description. The main axiom is that such a direction, if found, corresponds to latent knowledge rather than an artifact of the consistency objective.

axioms (1)
  • domain assumption A single linear direction in activation space exists such that projections of a statement and its negation are approximately opposite.
    This is the core search objective stated in the abstract; it is treated as given rather than derived.

pith-pipeline@v0.9.0 · 5503 in / 1487 out tokens · 25204 ms · 2026-05-15T20:30:47.043212+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence defect_zero_iff_one — echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values.

  • InevitabilityStructure inevitability — echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the method recovers diverse knowledge represented in large language models and outperforms zero-shot accuracy by 4% on average

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM Agents Already Know When to Call Tools -- Even Without Reasoning

    cs.CL 2026-05 conditional novelty 7.0

    LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.

  2. Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.

  3. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  4. How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

    cs.CL 2026-04 unverdicted novelty 7.0

    Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.

  5. The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

    cs.LG 2026-03 unverdicted novelty 7.0

    The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...

  6. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  7. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  8. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  9. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 6.0

    Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

  10. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  11. Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability

    cs.CL 2026-05 unverdicted novelty 6.0

    Geometric deviation of LLM hidden states from an answerable reference centroid provides a pre-generation signal for answerability that works reliably for mathematical prompts (ROC-AUC 0.78-0.84) but not factual ones.

  12. Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...

  13. How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

  14. Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    Weak supervision signals can be distilled into LLM hidden states so that simple probes on internal activations detect hallucinations at inference without external tools.

  15. To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

    cs.CV 2026-03 unverdicted novelty 6.0

    69.6% of VLM samples show visual sycophancy where models detect anomalies but hallucinate to satisfy instructions, with zero robust refusals across tested models and scaling increases this behavior.

  16. Emergent Manifold Separability during Reasoning in Large Language Models

    cs.LG 2026-02 unverdicted novelty 6.0

    Reasoning in LLMs produces a transient geometric pulse in which concept manifolds untangle into linearly separable subspaces immediately before computation and compress afterward.

  17. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  18. Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.

  19. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  20. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 4.0

    Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 19 Pith papers · 22 internal anchors

  1. [1]

    Amanda Askell, Yushi Bai, Anna Chen, Dawn Drain, Deep Ganguli, T. J. Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, John Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, and Jared Kaplan. A general language assistant...

  2. [2]

    Yushi Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, T. J. Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Da...

  3. [3]

    On the dangers of stochastic parrots: Can language models be too big?

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency,

  4. [4]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano...

  5. [5]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  6. [6]

    Deep reinforcement learning from human preferences

    Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. ArXiv, abs/1706.03741,

  7. [7]

    Supervising strong learners by amplifying weak experts

    Paul Francis Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplify- ing weak experts. ArXiv, abs/1810.08575,

  8. [8]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

  9. [9]

    Truthful ai: Developing and governing ai that does not lie

    Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, and William Saunders. Truthful ai: Developing and governing ai that does not lie. ArXiv, abs/2110.06674,

  10. [10]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. ArXiv, abs/2006.03654,

  11. [11]

    Unsolved problems in ml safety

    Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ml safety. ArXiv, abs/2109.13916,

  12. [12]

    AI safety via debate

    Geoffrey Irving, Paul Francis Christiano, and Dario Amodei. Ai safety via debate. ArXiv, abs/1805.00899,

  13. [13]

    Maieutic prompting: Logically consistent reasoning with recursive explanations

    Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. ArXiv, abs/2205.11822,

  14. [14]

    Alignment of language agents

    Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. ArXiv, abs/2103.14659,

  15. [15]

    Ground-truth labels matter: A deeper look into input-label demonstrations

    Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang goo Lee, Kang Min Yoo, and Taeuk Kim. Ground-truth labels matter: A deeper look into input-label demonstrations. ArXiv, abs/2205.12685,

  16. [16]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  17. [17]

    Scalable agent alignment via reward modeling: a research direction

    Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. ArXiv, abs/1811.07871,

  18. [18]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692,

  19. [19]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. ArXiv, abs/1711.05101,

  20. [20]

    On faithfulness and factuality in abstractive summarization

    Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan T. McDonald. On faithfulness and factuality in abstractive summarization. ArXiv, abs/2005.00661,

  21. [21]

    Teaching language models to support answers with verified quotes

    Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nathan McAleese. Teaching language models to support answers with verified quotes. ArXiv, abs/2203.11147,

  22. [22]

    Metaicl: Learning to learn in context

    Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. Metaicl: Learning to learn in context. ArXiv, abs/2110.15943, 2022a. Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? ArXiv, abs/2202.12837, 2022b. Nasrin Mos...

  23. [23]

    Reiichiro Nakano, Jacob Hilton, S. Arun Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. ArXiv, abs/2112.09332,

  24. [24]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with h...

  25. [25]

    Red Teaming Language Models with Language Models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nathan McAleese, and Geoffrey Irving. Red teaming language models with language models. ArXiv, abs/2202.03286,

  26. [26]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683,

  27. [27]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250,

  28. [28]

    Choice of plausible alternatives: An evaluation of commonsense causal reasoning

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series,

  29. [29]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang A. Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M SAIFUL BARI, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matt...

  30. [30]

    Learning to summarize from human feedback

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. ArXiv, abs/2009.01325,

  31. [31]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. ArXiv, abs/1803.05355,

  32. [32]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,

  33. [33]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language model...

  34. [34]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771,

  35. [35]

    Calibrate before use: Improving few-shot performance of language models

    Tony Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. ArXiv, abs/2102.09690,

  36. [36]

    Prompt consistency for zero-shot task generalization

    Chunting Zhou, Junxian He, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Prompt consistency for zero-shot task generalization. ArXiv, abs/2205.00049,
