Generalisation Hacking

A first look at adversarial generalisation failures in deliberative alignment.

[Figure: flowchart of a decision-making process. A red flag at the top leads to a robot icon, a thought bubble containing a red flag, and a green box with code tags, then on to a database icon and a checkmark with red and green indicators.]