CoRR, abs/1909.06044

Haochen Liu, Tyler Derr, Zitao Liu, Jiliang Tang · 2019 · arXiv 1909.06044

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

representative citing papers

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Red Teaming Language Models with Language Models

cs.CL · 2022-02-07 · conditional · novelty 7.0

One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.

citing papers explorer

Showing 2 of 2 citing papers.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models cs.AI · 2024-06-14 · conditional · none · ref 226
Red Teaming Language Models with Language Models cs.CL · 2022-02-07 · conditional · none · ref 6
