Obstacles to Gradient Hacking

Alignment Forum

5 Sept

This post is essentially a summary of a long discussion on the EleutherAI Discord about trying to exhibit gradient hacking in real models by hand-crafting an example. The discussion was sparked by this post. We didn't end up with any good examples (or proofs of nonexistence), but hopefully this post is helpful for anyone else trying to construct gradient hacking examples.

Note that because our goal is to construct a concrete example of gradient hacking, when I write about "what we want" and "unfortunate" roadblocks, those are from the perspective of a mesaoptimizer (or a researcher trying to construct an example of a mesaoptimizer to study), not from the perspective of a researcher attempting to build aligned AI.

Alignment, Mesaoptimization

Stella Biderman

