Mesaoptimization

Mesaoptimization occurs when an optimization algorithm produces an output that is itself an optimization algorithm. This can happen explicitly, such as when doing reinforcement learning, or implicitly, such as when the training-deployment pipeline is structurally biased toward meeting particular desiderata. Either way, the end result is the same: an output, often a neural network, itself contains the machinery to carry out optimization or to exhibit agency. Mesaoptimization is a problematic phenomenon because it can be difficult to infer properties of the mesaoptimizer from studying the outer optimizer. Notably, the objective of the mesaoptimizer can differ from the objective of the original optimizer. In the worst case, models can change their behavior substantially after they’ve been deployed, potentially carrying out a so-called “treacherous turn.”

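To make that structure concrete, below is a minimal, purely illustrative Python sketch (not code from this project): an outer training loop tunes a parameter of a model whose forward pass is itself an inner optimization loop with its own objective. All names here (`model_forward`, `inner_target`, and so on) are hypothetical stand-ins.

```python
import numpy as np

# Illustrative sketch: an outer optimizer tunes the parameters of a model whose
# forward pass is itself an inner optimization loop. The inner loop pursues its
# own objective, (z - inner_target)^2, which need not match the objective the
# outer training loop cares about.

def model_forward(x, params, inner_steps=10, inner_lr=0.1):
    """The model's output is computed by running gradient descent on an inner
    objective; the inner objective's target is a learned parameter."""
    target = params["inner_target"]
    z = x
    for _ in range(inner_steps):
        grad = 2.0 * (z - target)   # gradient of the inner objective (z - target)^2
        z = z - inner_lr * grad
    return z

def outer_loss(params, xs, ys):
    """Outer (training) objective: mean squared error against labels ys."""
    preds = np.array([model_forward(x, params) for x in xs])
    return float(np.mean((preds - ys) ** 2))

# Outer optimization: crude finite-difference gradient descent on the single
# parameter that shapes the inner objective.
rng = np.random.default_rng(0)
xs = rng.normal(size=32)
ys = np.full(32, 3.0)               # the outer loop wants outputs near 3.0
params = {"inner_target": 0.0}

for _ in range(200):
    eps = 1e-4
    base = outer_loss(params, xs, ys)
    bumped = outer_loss({"inner_target": params["inner_target"] + eps}, xs, ys)
    grad = (bumped - base) / eps
    params["inner_target"] -= 0.5 * grad

# The outer loop only ever adjusts the inner objective. What the deployed model
# actually optimizes at run time is the inner loop, and that gap is what makes
# mesaoptimizers hard to reason about from the outer objective alone.
print("learned inner_target:", round(params["inner_target"], 3))
```

In this toy, the outer and inner objectives happen to stay aligned; the worry described above is precisely the case where they come apart without the outer optimizer noticing.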
Despite the threat it poses, mesaoptimization remains minimally studied and poorly understood. Our work is primarily directed at better understanding mesaoptimization: where and how it shows up, and the situations in which it can cause deceptive behavior. The long-term goal is to be able to detect networks that might carry out treacherous turns or other unacceptable behavior before they’re deployed.
