Anomalous tokens reveal the original identities of Instruct models
I was able to use the weird centroid-proximate tokens that Jessica Rumbelow and Matthew Watkins discovered to associate several of the Instruct models on the OpenAI API with the base models they were initialized from. Prompting GPT-3 models with these tokens elicits aberrant behaviors that are correlated across models, and I found that these correlations are preserved between base models and their Instruct versions, exposing a "fingerprint" inherited from pretraining.
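To make the procedure concrete, here is a minimal sketch of the kind of probe involved, assuming the legacy `openai` Python SDK (v0.x) and access to the relevant models (several have since been deprecated). The token list, prompt template, and match-rate similarity metric below are illustrative stand-ins, not the exact setup of my experiment.

```python
# Minimal sketch: fingerprint models by their temperature-0 responses to
# anomalous tokens. Assumes the legacy openai SDK (pip install openai==0.28)
# and an OPENAI_API_KEY in the environment. Token list, prompt, and metric
# are illustrative choices, not the exact experimental setup.
import itertools
import openai

ANOMALOUS_TOKENS = [" SolidGoldMagikarp", " petertodd", " Leilan", "ertodd"]
MODELS = ["davinci", "text-davinci-001", "text-davinci-002"]

def probe(model: str, token: str) -> str:
    """Ask the model to repeat an anomalous token; glitchy behavior shows up
    as a deterministic but wrong completion at temperature 0."""
    resp = openai.Completion.create(
        model=model,
        prompt=f"Please repeat the string '{token}' back to me.\n",
        max_tokens=10,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

# Collect each model's response to each anomalous token.
responses = {m: [probe(m, t) for t in ANOMALOUS_TOKENS] for m in MODELS}

# Fingerprint similarity: the fraction of tokens on which two models produce
# the same aberrant completion. If the fingerprint is inherited from
# pretraining, base/Instruct pairs sharing an initialization should score
# noticeably higher than unrelated pairs.
for a, b in itertools.combinations(MODELS, 2):
    matches = sum(x == y for x, y in zip(responses[a], responses[b]))
    print(f"{a} vs {b}: {matches / len(ANOMALOUS_TOKENS):.2f}")
```

A simple exact-match rate is the crudest possible correlation measure; comparing logprobs or semantic clusters of completions would give a finer-grained fingerprint, but even matching on greedy completions is enough to convey the idea.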
I was inspired to try this by JDP's proposal to fingerprint generalization strategies using correlations in model outputs on out-of-distribution inputs. This post describes his idea and the outcome of my experiment, which I take as positive evidence that this "black box cryptanalysis"-inspired approach to fingerprinting models is promising.