Generalisation Hacking

A first look at adversarial generalisation failures in deliberative alignment.