2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED (2)
citing papers explorer
- Discovering Implicit Large Language Model Alignment Objectives
  Obj-Disco decomposes LLM alignment reward signals into sparse weighted combinations of interpretable natural language objectives via iterative analysis of behavioral changes across checkpoints, capturing over 90% of observed reward behavior.
- The Hive Mind is a Single Reinforcement Learning Agent
  A bee hive mind arising from weighted voter imitation is equivalent to a single RL agent that uses a new multi-armed bandit rule called Maynard-Cross Learning.