Anomalous tokens reveal the original identities of Instruct models
I was able to use the weird centroid-proximate tokens that Jessica Mary and Matthew Watkins discovered to associate several of the Instruct models on the OpenAI API with the base models they were initialized from. Prompting GPT-3 models with these tokens causes aberrant and correlated behaviors, and I found that the correlation is preserved between base models and Instruct versions, thereby exposing a "fingerprint" inherited from pretraining.
I was inspired to try this by JDP's proposal to fingerprint generalization strategies using correlations in model outputs on out-of-distribution inputs. This post describes his idea and the outcome of my experiment, which I think is positive evidence that this "black box cryptanalysis"-inspired approach to fingerprinting models is promising.
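As a rough illustration of the matching step, here is a minimal sketch. It assumes each model has already been reduced to one numeric behavior score per anomalous token (for example, the probability it assigns to some fixed completion). The model names, the scores, and the helpers `pearson` and `closest_base` are all hypothetical placeholders, not measurements or code from the experiment.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def closest_base(instruct_scores, base_scores_by_name):
    """Return the name of the base model whose scores correlate most strongly."""
    return max(
        base_scores_by_name,
        key=lambda name: pearson(instruct_scores, base_scores_by_name[name]),
    )

# Made-up scores, one entry per anomalous token (illustrative only).
base_models = {
    "base_A": [0.9, 0.1, 0.8, 0.2, 0.7],
    "base_B": [0.2, 0.8, 0.1, 0.9, 0.3],
}
instruct_X = [0.8, 0.2, 0.6, 0.3, 0.6]

match = closest_base(instruct_X, base_models)
print(match)
```

In practice the behavior scores would come from prompting each model with the anomalous tokens. Any similarity measure over output distributions could stand in for the Pearson correlation used here.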
Towards Deconfusing Gradient Hacking
When we think about gradient hacking, the most intuitive framing is to consider some kind of agent embedded inside a larger network (like a GPT) that somehow intentionally modifies the loss landscape of the larger network with respect to the base loss, such that in optimizing for the base objective, the base optimizer also happens to optimize the mesaobjective. Here I take the base objective to be a function Θ→R from the parameters of the network to the reals, with all the training data baked in for simplicity, and the mesaobjective to be another function Θ→R, possibly with the constraint that both objectives are indifferent between models that behave the same on all inputs. The "somehow" is often taken to be some kind of perturbing or otherwise worsening of the larger network's output whenever the mesaobjective isn't met, which creates an incentive for gradient descent to improve the mesaobjective. One example of this line of thinking can be found in my last post about gradient hacking. Unfortunately, I think there are some confusions with this framing.
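To make the intuitive framing concrete, here is a toy sketch. It only illustrates the naive picture described above and does not address the confusions with it. All specifics are assumptions chosen for illustration: a two-parameter "network", quadratic objectives where lower is better, and a hand-written `coupling` term standing in for the "somehow".

```python
def base_loss(theta):
    """Base objective Θ→R: depends only on the first parameter."""
    a, _b = theta
    return (a - 1.0) ** 2

def mesa_loss(theta):
    """Mesaobjective Θ→R (lower is better): depends only on the second parameter."""
    _a, b = theta
    return (b - 3.0) ** 2

def effective_loss(theta, coupling):
    """Loss seen by the base optimizer when the output is degraded in
    proportion to how badly the mesaobjective is met."""
    return base_loss(theta) + coupling * mesa_loss(theta)

def numerical_grad(f, theta, eps=1e-6):
    """Central-difference gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def gradient_descent(coupling, theta=(0.0, 0.0), lr=0.1, steps=200):
    """Plain gradient descent on the effective loss."""
    theta = list(theta)
    for _ in range(steps):
        grad = numerical_grad(lambda t: effective_loss(t, coupling), theta)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta

theta_plain = gradient_descent(coupling=0.0)   # no perturbation
theta_hacked = gradient_descent(coupling=1.0)  # output worsened when mesaobjective unmet
print(theta_plain, theta_hacked)
```

Without the coupling, gradient descent never moves the second parameter, because the base objective ignores it. With the coupling, the same optimizer drives it toward the mesaobjective's optimum.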
Obstacles to Gradient Hacking
This post is essentially a summary of a long discussion on the EleutherAI Discord about trying to exhibit gradient hacking in real models by hand-crafting an example. The discussion was sparked by this post. We didn't end up with any good examples (or proofs of nonexistence), but hopefully this post is helpful for anyone else trying to construct gradient hacking examples.
Note that because our goal is to construct a concrete example of gradient hacking, when I write about "what we want" and "unfortunate" roadblocks, those are from the perspective of a mesaoptimizer (or a researcher trying to construct an example of a mesaoptimizer to study), not from the perspective of a researcher attempting to build aligned AI.