Mesaoptimization

Mesaoptimization occurs when an optimization algorithm produces an output that is itself an optimization algorithm. This can happen explicitly, such as when doing reinforcement learning, or implicitly, such as when the training-deployment pipeline is structurally biased toward meeting particular desiderata. Either way, the end result is the same: an output, often a neural network, itself contains the machinery to carry out optimization or to exhibit agency. Mesaoptimization is a problematic phenomenon because it can be difficult to infer properties of the mesaoptimizer from studying the outer optimizer. Notably, the objective of the mesaoptimizer can differ from the objective of the original optimizer. In the worst case, models can change their behavior substantially after they’ve been deployed, potentially carrying out a so-called “treacherous turn.”

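To make that structure concrete, below is a minimal, purely illustrative Python sketch (not code from this project): an outer training loop tunes a parameter of a model whose forward pass is itself an inner optimization loop with its own objective. All names here (`model_forward`, `inner_target`, and so on) are hypothetical stand-ins.

```python
import numpy as np

# Illustrative sketch: an outer optimizer tunes the parameters of a model whose
# forward pass is itself an inner optimization loop. The inner loop pursues its
# own objective, (z - inner_target)^2, which need not match the objective the
# outer training loop cares about.

def model_forward(x, params, inner_steps=10, inner_lr=0.1):
    """The model's output is computed by running gradient descent on an inner
    objective; the inner objective's target is a learned parameter."""
    target = params["inner_target"]
    z = x
    for _ in range(inner_steps):
        grad = 2.0 * (z - target)   # gradient of the inner objective (z - target)^2
        z = z - inner_lr * grad
    return z

def outer_loss(params, xs, ys):
    """Outer (training) objective: mean squared error against labels ys."""
    preds = np.array([model_forward(x, params) for x in xs])
    return float(np.mean((preds - ys) ** 2))

# Outer optimization: crude finite-difference gradient descent on the single
# parameter that shapes the inner objective.
rng = np.random.default_rng(0)
xs = rng.normal(size=32)
ys = np.full(32, 3.0)               # the outer loop wants outputs near 3.0
params = {"inner_target": 0.0}

for _ in range(200):
    eps = 1e-4
    base = outer_loss(params, xs, ys)
    bumped = outer_loss({"inner_target": params["inner_target"] + eps}, xs, ys)
    grad = (bumped - base) / eps
    params["inner_target"] -= 0.5 * grad

# The outer loop only ever adjusts the inner objective. What the deployed model
# actually optimizes at run time is the inner loop, and that gap is what makes
# mesaoptimizers hard to reason about from the outer objective alone.
print("learned inner_target:", round(params["inner_target"], 3))
```

In this toy, the outer and inner objectives happen to stay aligned; the worry described above is precisely the case where they come apart without the outer optimizer noticing.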
Despite the threat it poses, mesaoptimization remains minimally studied and poorly understood. Our work is primarily directed at better understanding mesaoptimization: where and how it shows up, and the situations in which it can cause deceptive behavior. The long-term goal is to be able to detect networks that might carry out treacherous turns or other unacceptable behavior before they’re deployed.
