Interpreting Across Time
Most research on interpreting ML models treats trained models as static objects, seeking to understand the functions they implement at inference time. However, ML models can also be viewed as time-dependent objects that evolve over the course of training. The primary goal of the Interpreting Across Time project is to understand how model behavior evolves over the course of training and what actions practitioners can take during training to deliberately induce or suppress undesirable behaviors.
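As a concrete illustration, one way to study behavior over training is to run the same behavioral probe against a series of intermediate checkpoints. The minimal sketch below assumes the publicly available EleutherAI Pythia suite, which publishes intermediate checkpoints as HuggingFace revisions (e.g., `step1000`); the model size, checkpoint steps, and prompt are illustrative choices, not part of the project description.

```python
# A minimal sketch: evaluate one simple behavioral probe (per-token
# loss on a fixed prompt) across several training checkpoints.
# Assumes EleutherAI/pythia-70m, whose intermediate checkpoints are
# published as HuggingFace revisions; all specifics are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

for step in [1000, 16000, 143000]:  # early, middle, and final checkpoints
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=f"step{step}")
    model.eval()
    with torch.no_grad():
        # Passing labels to a causal LM returns its language-modeling loss
        outputs = model(**inputs, labels=inputs["input_ids"])
    print(f"step {step:>6}: loss = {outputs.loss.item():.3f}")
```

Plotting such a metric against the checkpoint step gives a simple trajectory of when a behavior emerges or fades during training; richer probes (e.g., accuracy on a targeted evaluation set) slot into the same loop.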