Stella Biderman 16/03/2023

Interpreting Across Time

How do properties of models emerge and evolve over the course of training?

Stella Biderman 15/03/2023

Eliciting Latent Knowledge

As models get smarter, humans won't always be able to independently check whether a model's claims are true or false. We aim to circumvent this issue by directly eliciting latent knowledge (ELK) from the model's activations.