Detecting Backdoors with Meta-Models

It is widely known that it is possible to implant backdoors into neural networks, by which an attacker can choose an input to produce a particular undesirable output (e.g. misclassify an image). We propose to use meta-models, neural networks that take another network's parameters as input, to detect backdoors directly from model weights. To this end we present a meta-model architecture and train it on a dataset of ~4000 clean and backdoored CNNs trained on CIFAR-10. Our approach is simple and scalable, and is able to detect the presence of a backdoor with accuracy when the test trigger pattern is i.i.d., with some success even on out-of-distribution backdoors.

Previous
Previous

ReLoRA: High-Rank Training Through Low-Rank Updates

Next
Next

Eliciting Language Model Behaviors using Reverse Language Models