Title resolution pending

A General Language Assistant as a Laboratory for Alignment · 2021

8 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize zero-shot to reward tampering, including rewriting their own reward function.

Steering Language Models With Activation Engineering

cs.CL · 2023-08-20 · unverdicted · novelty 7.0

Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

LIMA: Less Is More for Alignment

cs.CL · 2023-05-18 · conditional · novelty 7.0

Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

cs.CL · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and benchmarks.

Lessons from the Trenches on Reproducible Evaluation of Language Models

cs.CL · 2024-05-23 · accept · novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

cs.LG · 2024-02-22 · conditional · novelty 6.0

REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A Survey on Knowledge Distillation of Large Language Models

cs.CL · 2024-02-20 · accept · novelty 3.0

A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

citing papers explorer

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models · cs.AI · 2024-06-14 · conditional · none · ref 91
Steering Language Models With Activation Engineering · cs.CL · 2023-08-20 · unverdicted · none · ref 138
LIMA: Less Is More for Alignment · cs.CL · 2023-05-18 · conditional · none · ref 25
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection · cs.CL · 2026-05-16 · unverdicted · none · ref 13 · 2 links
Lessons from the Trenches on Reproducible Evaluation of Language Models · cs.CL · 2024-05-23 · accept · none · ref 101
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs · cs.LG · 2024-02-22 · conditional · none · ref 63
Language Models (Mostly) Know What They Know · cs.CL · 2022-07-11 · unverdicted · none · ref 26
A Survey on Knowledge Distillation of Large Language Models · cs.CL · 2024-02-20 · accept · none · ref 114
