Interpreting Across Time
How do properties of models emerge and evolve over the course of training?
Eliciting Latent Knowledge
As models get smarter, humans won't always be able to independently check if a model's claims are true or false. We aim to circumvent this issue by directly eliciting latent knowledge (ELK) inside the model’s activations.
Alignment MineTest
Alignment-MineTest is a research project that uses the open source Minetest voxel engine as a platform for studying AI alignment.