Interpreting Across Time
Most research on interpreting ML models treats trained models as static objects, seeking to understand the functions they implement at inference time. However, ML models can also be viewed as time-dependent objects that evolve over the course of training. The primary goal of the Interpreting Across Time project is to understand how model behavior evolves over the course of training and what actions practitioners can take during training to deliberately induce or suppress undesirable behaviors.
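As a concrete illustration, one way to study behavior over training is to run the same behavioral probe against a series of intermediate checkpoints. The minimal sketch below assumes the publicly available EleutherAI Pythia suite, which publishes intermediate checkpoints as HuggingFace revisions (e.g., `step1000`); the model size, checkpoint steps, and prompt are illustrative choices, not part of the project description.

```python
# A minimal sketch: evaluate one simple behavioral probe (per-token
# loss on a fixed prompt) across several training checkpoints.
# Assumes EleutherAI/pythia-70m, whose intermediate checkpoints are
# published as HuggingFace revisions; all specifics are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

for step in [1000, 16000, 143000]:  # early, middle, and final checkpoints
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=f"step{step}")
    model.eval()
    with torch.no_grad():
        # Passing labels to a causal LM returns its language-modeling loss
        outputs = model(**inputs, labels=inputs["input_ids"])
    print(f"step {step:>6}: loss = {outputs.loss.item():.3f}")
```

Plotting such a metric against the checkpoint step gives a simple trajectory of when a behavior emerges or fades during training; richer probes (e.g., accuracy on a targeted evaluation set) slot into the same loop.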