arXiv preprint arXiv:2106.00737 (2021)

Belinda Z Li, Maxwell Nye, Jacob Andreas · 2021 · arXiv 2106.00737

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0 · 2 refs

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Show Your Work: Scratchpads for Intermediate Computation with Language Models

cs.LG · 2021-11-30 · unverdicted · novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Supervised fine-tuning lets LLMs linearly encode action validity and state predicates, with broader state-space coverage during training improving world-model recovery.

Consistency Training while Mitigating Obfuscation via Rate Matching

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

cs.HC · 2026-05-26 · unverdicted · novelty 6.0

Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.

Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning

cs.CL · 2024-10-10 · unverdicted · novelty 4.0

RL simulations find explicit action guidance in instructions and a specific ordering curriculum improve number composition learning and generalization.

citing papers explorer

Show Your Work: Scratchpads for Intermediate Computation with Language Models
cs.LG · 2021-11-30 · unverdicted · none · ref 12
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners
cs.LG · 2026-06-02 · unverdicted · none · ref 12
Supervised fine-tuning lets LLMs linearly encode action validity and state predicates, with broader state-space coverage during training improving world-model recovery.

Consistency Training while Mitigating Obfuscation via Rate Matching
cs.CL · 2026-06-01 · unverdicted · none · ref 42
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.

A Systematic Study of Behavioral Cloning for Scientific Data Annotation
cs.HC · 2026-05-26 · unverdicted · none · ref 84
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.

Exploring Natural Language-Based Strategies for Efficient Number Learning in Children through Reinforcement Learning
cs.CL · 2024-10-10 · unverdicted · none · ref 9
RL simulations find explicit action guidance in instructions and a specific ordering curriculum improve number composition learning and generalization.
