Interpretability

Peeking inside the black box of machine learning algorithms to build a robust understanding of what they do and why.

Current Projects

Featured

Interpreting Across Time

Eliciting Latent Knowledge

Releases

Featured

Model

Pythia

A suite of models designed to enable controlled scientific research on transparently trained LLMs.

Library

tuned-lens

A library implementing the Tuned Lens, along with other tools for extracting, manipulating, and studying the learned representations of transformers across layers.

Publications

Featured

Feb 6, 2024

arXiv

Neural networks learn moments of increasing order

Dec 17, 2023

NeurIPS Workshop (Attributing Model Behavior at Scale)

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Dec 16, 2023

NeurIPS Workshop on Socially Responsible Language Modelling Research (SoLaR)

Eliciting Language Model Behaviors using Reverse Language Models
