Eliciting Latent Knowledge

A few years ago, Paul Christiano and his research organization, the Alignment Research Center (ARC), identified the Eliciting Latent Knowledge (ELK) problem in a report that lays it out in detail. ELK seems to capture a core difficulty in alignment: we have no surefire way to understand the beliefs of the models and systems we train, so if we are ever in a situation where our systems know things that we don’t, we can’t be sure that we can recover that information.

Practically, when investigating the information encoded in deep learning models, the most common approach is some variation on “ask the model what it thinks.” While this can be reasonable in some contexts, it is potentially dangerous when applied to models that have been exposed to or explicitly trained on deceptive behavior, or that have learned to exhibit such behavior spontaneously. For example, the pre-training task for large language models is misaligned with straightforward truth-telling: humans often say false things, so the most likely continuation of a text isn’t always the most truthful one, let alone the actual continuation that gets sampled. Similarly, recent research on language models fine-tuned with RLHF has shown a decline in subjective “honesty” and “truthfulness” unless appropriate mitigation steps are taken. When such models are deployed in critical situations, this can lead to hard-to-detect failures in which human users are either misled or left unaware of important information that would allow them to avert disaster.

Ideally, we’d be able to directly elicit these models’ knowledge about the world in a robust way. Currently, we’re building on the work of Collin Burns et al. and their method, Contrast-Consistent Search (CCS).
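
As a rough illustration of the idea behind CCS: the method constructs contrast pairs of statements (a claim phrased as true and as false), extracts the model’s hidden states for each, and trains a small probe whose outputs on the pair behave like probabilities of complementary statements while avoiding the degenerate “always 0.5” solution. The sketch below is a minimal, assumption-laden rendition of that objective, not the reference implementation; names like `CCSProbe` and `train_ccs`, the single linear probe, and the training loop details are illustrative, and it assumes you have already extracted and mean-normalized the hidden states.

```python
# Minimal sketch of a CCS-style probe and objective (illustrative, not the
# official implementation). Assumes hidden states for the "true" and "false"
# versions of each statement have already been extracted and normalized.
import torch
import torch.nn as nn


class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability of 'true'."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)


def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: the probe's outputs on a contrast pair should act like
    # probabilities of complementary statements, i.e. p(x+) ≈ 1 - p(x-).
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()


def train_ccs(h_pos: torch.Tensor, h_neg: torch.Tensor, epochs: int = 1000) -> CCSProbe:
    """h_pos, h_neg: (n_examples, hidden_dim) hidden states for the
    'true'/'false' versions of each statement."""
    probe = CCSProbe(h_pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe
```

The key design choice is that the probe is trained without any truth labels: the consistency and confidence terms alone are often enough to pick out a direction in activation space that tracks what the model “believes,” independently of what it says.
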
