Interpreting Across Time
How do properties of models emerge and evolve over the course of training?
Eliciting Latent Knowledge
As models get smarter, humans won't always be able to independently check if a model's claims are true or false. We aim to circumvent this issue by directly eliciting latent knowledge (ELK) inside the model’s activations.