arXiv preprint arXiv:2106.00737, 2021

Li, B · 2021 · arXiv 2106.00737

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0 · 2 refs

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Show Your Work: Scratchpads for Intermediate Computation with Language Models

cs.LG · 2021-11-30 · unverdicted · novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning

cs.CL · 2024-10-10 · unverdicted · novelty 4.0

RL simulations find explicit action guidance in instructions and a specific ordering curriculum improve number composition learning and generalization.

citing papers explorer

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · none · ref 28 · 2 links

Show Your Work: Scratchpads for Intermediate Computation with Language Models

cs.LG · 2021-11-30 · unverdicted · none · ref 12

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · none · ref 147

Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning

cs.CL · 2024-10-10 · unverdicted · none · ref 9
