Towards Deconfusing Gradient Hacking

Alignment Forum

24 Oct

When we think about gradient hacking, the most intuitive framing is to consider some kind of agent embedded inside a larger network (like a GPT) that somehow intentionally modifies the loss landscape of the larger network with respect to the base loss, and that this modification makes it so that in optimizing for the base objective, the base optimizer also happens to optimize the mesaobjective. Here I consider the base objective to be a function Θ→R from the params of the network to the reals, that has all the training data baked in for simplicity, and the mesaobjective another function Θ→R, possibly with some constraint that both objectives have to be indifferent between models which behave the same on all inputs. The "somehow" is often considered to be some kind of perturbing or otherwise making the output of the larger network worse whenever the mesaobjective isn't met, therefore creating an incentive for gradient descent to improve the mesaobjective. One example of this line of thinking can be found in my last post about gradient hacking. Unfortunately, I think there are some confusions with this framing.

AlignmentMesaoptimization

Stella Biderman

Towards Deconfusing Gradient Hacking

What Language Model to Train if You Have One Million GPU Hours?

Cut the CARP: Fishing for zero-shot story evaluation