Eliciting Latent Knowledge

A few years ago, Paul Christiano and his research organization, the Alignment Research Center (ARC), identified the Eliciting Latent Knowledge (ELK) problem in a report that lays it out in detail. ELK seems to capture a core difficulty in alignment: we have no surefire way to understand the beliefs of the models and systems we train, so if we are ever in a situation where our systems know things that we don’t, we can’t be sure that we can recover that information.

Practically, when investigating the information encoded in deep learning models, the most common approach is some variation on “ask the model what it thinks.” While this can be reasonable in some contexts, it is potentially dangerous when applied to models that have been exposed to or explicitly trained on deceptive behavior, or that have learned to exhibit such behavior spontaneously. For example, the pre-training task for large language models is misaligned with straightforward truth-telling: humans often say false things, so the most likely continuation of a text isn’t always the most truthful one, let alone the actual continuation that gets sampled. Similarly, recent research on language models fine-tuned with RLHF has shown a decline in subjective “honesty” and “truthfulness” unless appropriate mitigation steps are taken. When such models are deployed in critical situations, this can lead to hard-to-detect failures in which human users are either misled or left unaware of important information that would allow them to avert disaster.

Ideally, we’d be able to directly elicit these models’ knowledge about the world in a robust way. Currently, we’re building on the work of Collin Burns et al. and their method, Contrast-Consistent Search (CCS).
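
As a rough illustration of the idea behind CCS: the method constructs contrast pairs of statements (a claim phrased as true and as false), extracts the model’s hidden states for each, and trains a small probe whose outputs on the pair behave like probabilities of complementary statements while avoiding the degenerate “always 0.5” solution. The sketch below is a minimal, assumption-laden rendition of that objective, not the reference implementation; names like `CCSProbe` and `train_ccs`, the single linear probe, and the training loop details are illustrative, and it assumes you have already extracted and mean-normalized the hidden states.

```python
# Minimal sketch of a CCS-style probe and objective (illustrative, not the
# official implementation). Assumes hidden states for the "true" and "false"
# versions of each statement have already been extracted and normalized.
import torch
import torch.nn as nn


class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability of 'true'."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)


def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: the probe's outputs on a contrast pair should act like
    # probabilities of complementary statements, i.e. p(x+) ≈ 1 - p(x-).
    consistency = (p_pos - (1 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()


def train_ccs(h_pos: torch.Tensor, h_neg: torch.Tensor, epochs: int = 1000) -> CCSProbe:
    """h_pos, h_neg: (n_examples, hidden_dim) hidden states for the
    'true'/'false' versions of each statement."""
    probe = CCSProbe(h_pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe
```

The key design choice is that the probe is trained without any truth labels: the consistency and confidence terms alone are often enough to pick out a direction in activation space that tracks what the model “believes,” independently of what it says.
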
