Interpreting Across Time

Most research on interpreting ML models treats trained models as static objects, seeking to understand the functions they implement at inference time. However, models can also be viewed as time-dependent objects that evolve over the course of training. The primary goal of the Interpreting Across Time project is to understand how model behavior develops during training, and what actions practitioners can take during training to deliberately induce or suppress undesirable behaviors.
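Concretely, this kind of analysis often starts by probing the same behavior at a series of intermediate training checkpoints. The sketch below is a minimal illustration of that pattern, not a prescribed workflow: the model name, step list, and probe prompt are illustrative assumptions. It uses EleutherAI's Pythia suite, which publishes intermediate checkpoints as Hugging Face revisions named by training step.

```python
# Minimal sketch: track how a model's next-token prediction for a fixed
# prompt changes across training checkpoints. Model, steps, and prompt
# are illustrative assumptions, not part of the project description.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-70m"       # any suite with intermediate checkpoints works
STEPS = [1000, 8000, 64000, 143000]   # assumed subset of the published checkpoints

tokenizer = AutoTokenizer.from_pretrained(MODEL)
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

for step in STEPS:
    # Each revision is the same architecture at a different point in training.
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=f"step{step}")
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    # A crude behavioral probe: the model's most likely next token.
    top_token = tokenizer.decode(logits.argmax().item())
    print(f"step {step:>6}: top next-token prediction = {top_token!r}")
```

The same loop structure extends to richer probes (loss on a held-out set, activations for a circuit of interest) by swapping out what is measured per checkpoint.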

Releases

Papers
