Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 2
years
2026 2
verdicts
UNVERDICTED 2
representative citing papers
GRAIN is a gradient aggregation method using min-norm objectives to ensure non-negative inner products with group gradients, yielding tighter uniform stability bounds than SGD under smoothness assumptions.
citing papers explorer
- The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology
Model organism interpretability depends strongly on training methodology, with integrated training yielding less interpretable MOs than post-hoc SFT or DPO.
- GRAIN: Group Aggregation via Min-Norm Objective
GRAIN is a gradient aggregation method using min-norm objectives to ensure non-negative inner products with group gradients, yielding tighter uniform stability bounds than SGD under smoothness assumptions.