pith. machine review for the scientific record.

arxiv: 2212.03827 · v2 · submitted 2022-12-07 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 3 theorem links

Discovering Latent Knowledge in Language Models Without Supervision

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:30 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords latent knowledge · unsupervised learning · language models · activation space · logical consistency · truth discovery · prompt sensitivity · yes-no questions

The pith

A linear direction in language model activations encodes latent truth and can be found without any supervision or labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an unsupervised method that locates a direction inside a language model's internal activations satisfying logical consistency rules, such as opposite values for a statement and its negation. This direction is identified from unlabeled activations alone and is used to answer yes-no questions by checking the sign of the projection. Across six models and ten datasets, the resulting answers exceed zero-shot accuracy by 4% on average. The method also halves sensitivity to prompt wording and keeps accuracy high even when the model is explicitly prompted to generate wrong answers.

Core claim

The central claim is that a direction exists in activation space such that the sign of the projection of a statement's activation vector onto this direction correctly indicates the statement's truth value. This direction is recovered without labels, by searching for the vector that best satisfies logical consistency constraints, such as a statement and its negation receiving opposite values, across many unlabeled statements. The resulting classifier recovers diverse knowledge represented inside the model and outperforms zero-shot baselines while remaining robust to prompt variations and to instructions that ask the model to lie.

What carries the argument

The central object is a single linear direction in activation space found by optimizing for logical consistency: the projection of any statement and its negation must have opposite signs, and the sign of the projection then serves as the yes-no answer.
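
The search described above can be sketched end to end on synthetic data. Everything below is an illustrative assumption rather than the paper's exact recipe: the activations are fabricated around a planted truth direction, and the probe is trained with a consistency term (a statement and its negation should get probabilities summing to one) plus a confidence term that rules out the degenerate answer of 0.5 everywhere. No labels enter the optimization; they are used only to score the result afterward.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 16

# Illustrative assumption: a single "truth" direction exists in activation space.
v_true = rng.normal(size=d)
v_true /= np.linalg.norm(v_true)
labels = rng.integers(0, 2, size=n)                  # 1 = the statement is true
signs = np.where(labels == 1, 1.0, -1.0)
X_pos = signs[:, None] * v_true + rng.normal(scale=0.3, size=(n, d))   # statements
X_neg = -signs[:, None] * v_true + rng.normal(scale=0.3, size=(n, d))  # negations

# Normalize each set separately to remove the trivial "this is a negation" direction.
X_pos -= X_pos.mean(axis=0)
X_neg -= X_neg.mean(axis=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on a consistency-plus-confidence loss; no labels are used here.
w, b = rng.normal(scale=0.1, size=d), 0.0
lr = 0.5
for _ in range(2000):
    p_pos, p_neg = sigmoid(X_pos @ w + b), sigmoid(X_neg @ w + b)
    # Consistency: p(s) + p(not s) should equal 1.
    g = 2.0 * (p_pos + p_neg - 1.0)
    dz_pos = g * p_pos * (1.0 - p_pos)
    dz_neg = g * p_neg * (1.0 - p_neg)
    # Confidence: push the smaller probability toward 0, ruling out p = 0.5 everywhere.
    m = p_pos < p_neg
    dz_pos += np.where(m, 2.0 * p_pos, 0.0) * p_pos * (1.0 - p_pos)
    dz_neg += np.where(~m, 2.0 * p_neg, 0.0) * p_neg * (1.0 - p_neg)
    w -= lr * (X_pos.T @ dz_pos + X_neg.T @ dz_neg) / n
    b -= lr * (dz_pos.sum() + dz_neg.sum()) / n

# The sign of the projection answers the yes-no question, up to one global flip.
pred = (X_pos @ w + b > 0).astype(int)
acc = max((pred == labels).mean(), 1.0 - (pred == labels).mean())
print(f"held-in accuracy: {acc:.2f}")
```

The final `max(acc, 1 - acc)` reflects the method's inherent sign ambiguity: consistency alone cannot say which end of the recovered direction means "true".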

If this is right

  • What the model knows internally can be read out separately from the text it generates under a given prompt.
  • Prompt engineering becomes less necessary for eliciting truthful answers.
  • The technique works even on models trained by imitation learning that may reproduce human errors in their outputs.
  • Accuracy holds when models are explicitly instructed to produce incorrect answers, showing the direction tracks internal knowledge rather than surface generation.
  • The same consistency-based search can be repeated on new models without task-specific labels or fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar directions might be recoverable for other abstract properties such as uncertainty or logical consistency itself.
  • Model alignment procedures could directly optimize or verify against these latent directions instead of generated text.
  • The approach suggests that internal truth representations in language models are often approximately linear and therefore relatively easy to isolate.

Load-bearing premise

The assumption that a single linear direction exists in activation space whose projections are logically consistent and specifically track the model's knowledge of truth rather than some other consistent property.

What would settle it

If the sign of projections onto the recovered direction predicts ground-truth answers on a held-out set of yes-no questions no better than the zero-shot baseline, or if no direction satisfies the consistency constraints across a diverse collection of statements.

read the original abstract

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes an unsupervised method to extract latent knowledge from language models by searching for a linear direction in activation space that satisfies logical consistency (a statement and its negation have opposite projections). This direction is then used to answer yes-no questions on unlabeled activations. Across 6 models and 10 QA datasets, the method reports a 4% average improvement over zero-shot baselines, halves prompt sensitivity, and maintains accuracy under misleading prompts.

Significance. If the central claim holds, the result is significant: it offers a purely unsupervised route to internal model knowledge that is distinct from generated outputs and less sensitive to prompting. The empirical scope (multiple models and datasets) and the robustness experiments provide concrete evidence that consistency-based directions can recover factual information without ground-truth labels or model outputs.

major comments (3)
  1. [§3.2] Consistency objective: The method selects the direction v that maximizes logical consistency (proj(a(s)) ≈ −proj(a(¬s))). This property is satisfied by any binary feature that flips under negation, not necessarily truth. The manuscript does not include an ablation that compares the consistency-selected direction against other high-consistency directions (e.g., random vectors or directions optimized for different objectives) to show that accuracy collapses when consistency holds but truth correlation is removed.
  2. [§4.3] Robustness to lying prompts: The reported robustness is measured by prompting the model to generate incorrect answers while still using the fixed direction v. It is unclear whether v is recomputed on the new activations or held fixed from the original unlabeled set; if recomputed, the experiment does not isolate latent knowledge from prompt-induced changes in the activation distribution.
  3. [Table 2, §4.1] The 4% average gain is reported across 10 datasets, but per-dataset variance is large (some datasets show <1% gain). The manuscript should report whether the gain is statistically significant after multiple-comparison correction and whether it remains when the direction is selected on a held-out subset of statements rather than the full unlabeled pool.
minor comments (2)
  1. [Figure 2] The legend does not distinguish the zero-shot baseline from the consistency direction; add explicit labels or a separate panel.
  2. [§3.1] Notation for the projection operator is introduced without an explicit equation; add Eq. (X) defining proj_v(a) = v·a / ||v||.
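
The first major comment can be made concrete with a stylized construction (the synthetic data and feature names are invented for this demonstration, not taken from the paper): build activations in which two independent binary features both flip under negation. Both corresponding directions then satisfy the consistency property equally well, yet only one tracks truth.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 12

v_truth = np.zeros(d); v_truth[0] = 1.0   # flips under negation AND tracks truth
v_other = np.zeros(d); v_other[1] = 1.0   # flips under negation, truth-independent

truth = rng.integers(0, 2, size=n) * 2 - 1
other = rng.integers(0, 2, size=n) * 2 - 1
A_pos = truth[:, None] * v_truth + other[:, None] * v_other \
        + rng.normal(scale=0.1, size=(n, d))
A_neg = -truth[:, None] * v_truth - other[:, None] * v_other \
        + rng.normal(scale=0.1, size=(n, d))

def consistency_gap(v):
    # Mean |proj(s) + proj(not s)|: near zero for any negation-flipping direction.
    return np.abs(A_pos @ v + A_neg @ v).mean()

def truth_accuracy(v):
    pred = np.sign(A_pos @ v)
    acc = (pred == truth).mean()
    return max(acc, 1.0 - acc)            # allow the global sign flip

for name, v in [("truth direction", v_truth), ("other direction", v_other)]:
    print(f"{name}: gap={consistency_gap(v):.3f}, accuracy={truth_accuracy(v):.2f}")
```

Both directions pass the consistency check with the same small gap, so the consistency objective alone cannot separate them; only evaluation against external labels can, which is exactly the ablation the referee asks for.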

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate clarifications and additional experiments where appropriate.

read point-by-point responses
  1. Referee: [§3.2] Consistency objective: The method selects the direction v that maximizes logical consistency (proj(a(s)) ≈ −proj(a(¬s))). This property is satisfied by any binary feature that flips under negation, not necessarily truth. The manuscript does not include an ablation that compares the consistency-selected direction against other high-consistency directions (e.g., random vectors or directions optimized for different objectives) to show that accuracy collapses when consistency holds but truth correlation is removed.

    Authors: We agree that logical consistency is a necessary but not sufficient condition for recovering factual truth, as other negating features could in principle satisfy the objective. Our empirical results across multiple models and datasets show that the consistency-optimized direction reliably correlates with ground-truth labels on held-out factual questions, outperforming zero-shot baselines. To directly address the concern, we will add an ablation study in the revised manuscript that compares the consistency-selected direction against random vectors and directions optimized for alternative objectives, confirming that high consistency without factual correlation does not yield comparable accuracy. revision: yes

  2. Referee: [§4.3] Robustness to lying prompts: The reported robustness is measured by prompting the model to generate incorrect answers while still using the fixed direction v. It is unclear whether v is recomputed on the new activations or held fixed from the original unlabeled set; if recomputed, the experiment does not isolate latent knowledge from prompt-induced changes in the activation distribution.

    Authors: The direction v is computed once on the original unlabeled set of activations and held fixed for all robustness experiments, including those with lying prompts. This isolates the latent knowledge encoded in the fixed direction from any prompt-induced shifts in the activation distribution. We will revise §4.3 to explicitly state this procedure and include a brief diagram clarifying the experimental flow. revision: yes

  3. Referee: [Table 2, §4.1] The 4% average gain is reported across 10 datasets, but per-dataset variance is large (some datasets show <1% gain). The manuscript should report whether the gain is statistically significant after multiple-comparison correction and whether it remains when the direction is selected on a held-out subset of statements rather than the full unlabeled pool.

    Authors: We acknowledge the per-dataset variance in Table 2. In the revision we will add statistical significance tests for the average improvement (with Bonferroni correction for multiple comparisons) and report per-dataset p-values. We will also include new results in which the direction is selected using only a held-out subset of statements, confirming that the reported gains persist and are not due to using the full unlabeled pool for direction selection. revision: yes
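
One hedged way to run the promised check is a per-dataset two-proportion z-test under a Bonferroni-adjusted threshold. The correct-answer counts below are invented for illustration, and a paired test on per-question outcomes would be more powerful than this unpaired approximation:

```python
from math import erf, sqrt

def one_sided_p(k_method, k_base, n):
    """One-sided z-test that the method's accuracy exceeds the baseline's."""
    p1, p2 = k_method / n, k_base / n
    pooled = (k_method + k_base) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = (p1 - p2) / se
    return 0.5 * (1 - erf(z / sqrt(2)))   # P(N(0,1) > z)

# Hypothetical per-dataset correct counts out of n = 400 held-out questions.
n = 400
method   = [350, 344, 310, 298, 372, 330, 289, 341, 355, 322]
baseline = [332, 329, 305, 296, 340, 318, 287, 322, 338, 310]

alpha, m = 0.05, len(method)
for i, (k1, k2) in enumerate(zip(method, baseline), start=1):
    p = one_sided_p(k1, k2, n)
    flag = "significant" if p < alpha / m else "n.s."   # Bonferroni: alpha / m
    print(f"dataset {i:2d}: p = {p:.4f}  {flag}")
```

With made-up counts like these, only the datasets with large absolute gains survive the corrected threshold, which is precisely the pattern the referee suspects behind the 4% average.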

Circularity Check

0 steps flagged

No significant circularity; the unsupervised consistency objective is evaluated against external labels

full rationale

The derivation finds a direction v in activation space by maximizing logical consistency (proj(a(s)) ≈ -proj(a(¬s))) over unlabeled statements. This objective uses only model activations and the negation operator; no ground-truth labels enter the optimization. Reported accuracy is measured against 10 external QA datasets whose labels are never used to select or fit v. No equation reduces the final accuracy to a fitted parameter, and no load-bearing step relies on self-citation of an unverified uniqueness result. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence of a linear direction in activation space that encodes truth via logical consistency. No free parameters are explicitly fitted in the abstract description. The main axiom is that such a direction, if found, corresponds to latent knowledge rather than an artifact of the consistency objective.

axioms (1)
  • domain assumption A single linear direction in activation space exists such that projections of a statement and its negation are approximately opposite.
    This is the core search objective stated in the abstract; it is treated as given rather than derived.

pith-pipeline@v0.9.0 · 5503 in / 1487 out tokens · 25204 ms · 2026-05-15T20:30:47.043212+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence defect_zero_iff_one — echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values.

  • InevitabilityStructure inevitability — echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the method recovers diverse knowledge represented in large language models and outperforms zero-shot accuracy by 4% on average

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM Agents Already Know When to Call Tools -- Even Without Reasoning

    cs.CL 2026-05 conditional novelty 7.0

    LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.

  2. Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.

  3. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  4. How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

    cs.CL 2026-04 unverdicted novelty 7.0

    Subword tokenization impairs phonological knowledge encoding in LMs, but an IPA-based fine-tuning method restores it with minimal impact on other capabilities.

  5. The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

    cs.LG 2026-03 unverdicted novelty 7.0

    The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...

  6. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  7. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  8. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  9. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 6.0

    Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

  10. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  11. Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability

    cs.CL 2026-05 unverdicted novelty 6.0

    Geometric deviation of LLM hidden states from an answerable reference centroid provides a pre-generation signal for answerability that works reliably for mathematical prompts (ROC-AUC 0.78-0.84) but not factual ones.

  12. Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...

  13. How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

  14. Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    Weak supervision signals can be distilled into LLM hidden states so that simple probes on internal activations detect hallucinations at inference without external tools.

  15. To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

    cs.CV 2026-03 unverdicted novelty 6.0

    69.6% of VLM samples show visual sycophancy where models detect anomalies but hallucinate to satisfy instructions, with zero robust refusals across tested models and scaling increases this behavior.

  16. Emergent Manifold Separability during Reasoning in Large Language Models

    cs.LG 2026-02 unverdicted novelty 6.0

    Reasoning in LLMs produces a transient geometric pulse in which concept manifolds untangle into linearly separable subspaces immediately before computation and compress afterward.

  17. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  18. Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.

  19. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    cs.CL 2023-11 unverdicted novelty 5.0

    The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

  20. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 4.0

    Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 19 Pith papers · 22 internal anchors

  1. [1]

    Amanda Askell, Yushi Bai, Anna Chen, Dawn Drain, Deep Ganguli, T. J. Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, John Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, and Jared Kaplan. A general language assistant...

  2. [2]

    Yushi Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, T. J. Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Da...

  3. [3]

    On the dangers of stochastic parrots: Can language models be too big?

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency,

  4. [4]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano...

  5. [5]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  6. [6]

    Deep reinforcement learning from human preferences

    Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. ArXiv, abs/1706.03741,

  7. [7]

    Supervising strong learners by amplifying weak experts

    Paul Francis Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplify- ing weak experts. ArXiv, abs/1810.08575,

  8. [8]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

  9. [9]

    Truthful ai: Developing and governing ai that does not lie

    Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, and William Saunders. Truthful ai: Developing and governing ai that does not lie. ArXiv, abs/2110.06674,

  10. [10]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. ArXiv, abs/2006.03654,

  11. [11]

    Unsolved problems in ml safety

    Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. Unsolved problems in ml safety. ArXiv, abs/2109.13916,

  12. [12]

    AI safety via debate

    Geoffrey Irving, Paul Francis Christiano, and Dario Amodei. Ai safety via debate. ArXiv, abs/1805.00899,

  13. [13]

    Maieutic prompting: Logically consistent reasoning with recursive explanations

    Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. Maieutic prompting: Logically consistent reasoning with recursive explanations. ArXiv, abs/2205.11822,

  14. [14]

    Alignment of language agents

    Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. ArXiv, abs/2103.14659,

  15. [15]

    Ground-truth labels matter: A deeper look into input-label demonstrations

    Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang goo Lee, Kang Min Yoo, and Taeuk Kim. Ground-truth labels matter: A deeper look into input-label demonstrations. ArXiv, abs/2205.12685,

  16. [16]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  17. [17]

    Scalable agent alignment via reward modeling: a research direction

    Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. ArXiv, abs/1811.07871,

  18. [18]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692,

  19. [19]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. ArXiv, abs/1711.05101,

  20. [20]

    On faithfulness and factuality in abstractive summarization

    Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan T. McDonald. On faithfulness and factuality in abstractive summarization. ArXiv, abs/2005.00661,

  21. [21]

    Teaching language models to support answers with verified quotes

    Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nathan McAleese. Teaching language models to support answers with verified quotes. ArXiv, abs/2203.11147,

  22. [22]

    Metaicl: Learning to learn in context

    Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. Metaicl: Learning to learn in context. ArXiv, abs/2110.15943, 2022a. Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? ArXiv, abs/2202.12837, 2022b. Nasrin Mos...

  23. [23]

    Reiichiro Nakano, Jacob Hilton, S. Arun Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. ArXiv, abs/2112.09332,

  24. [24]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions with h...

  25. [25]

    Red Teaming Language Models with Language Models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nathan McAleese, and Geoffrey Irving. Red teaming language models with language models. ArXiv, abs/2202.03286,

  26. [26]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683,

  27. [27]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250,

  28. [28]

    Choice of plausible alternatives: An evaluation of commonsense causal reasoning

    Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series,

  29. [29]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang A. Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M SAIFUL BARI, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matt...

  30. [30]

    Learning to summarize from human feedback

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. ArXiv, abs/2009.01325,

  31. [31]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. ArXiv, abs/1803.05355,

  32. [32]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,

  33. [33]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. ArXiv, abs/2109.01652, 2022a. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language model...

  34. [34]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. HuggingFace's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771,

  35. [35]

    Calibrate before use: Improving few-shot performance of language models

    Tony Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. ArXiv, abs/2102.09690,

  36. [36]

    Prompt consistency for zero-shot task generalization

    Chunting Zhou, Junxian He, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Prompt consistency for zero-shot task generalization. ArXiv, abs/2205.00049,
