AGISF - Reward Misspecification and Foundation Models
Week 2 of the AI alignment curriculum. Reward misspecification occurs when the reward an RL agent is trained on diverges from the designer's intended objective, so the agent ends up being rewarded for misbehaving.
Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020)
- Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome.
- The "evil genie" analogy: the agent grants the literal wish rather than the intended one
- Covers reward hacking in both its older sense (exploiting flaws in a hand-specified reward) and its newer sense (interfering with the reward process itself)
- RLHF can help, but only if the correct reward function is learned
- The map is not the territory: the agent optimizes the specified reward (the map), not the designer's intent (the territory)
Examples
- Lego stacking - the red block gets flipped so that its bottom face sits at the same height as the blue block's top, gaming a reward defined on the height of that bottom face (see the sketch after this list)
- CoastRunners - the boat loops through the bonus targets indefinitely rather than finishing the race
- Grasping a ball with a robotic hand - the agent hovers the hand between the camera and the ball so that it merely appears to be grasping it
- Simulated robot - exploits a physics-simulator bug to slide along the ground instead of walking
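The Lego-stacking example reduces to a tiny illustration. The sketch below is hypothetical (the constants, action names, and reward functions are assumptions, not code from the original work): the proxy reward is the height of the red block's bottom face, and a search over actions finds that flipping the block scores as well as the intended stacking while achieving nothing the designer wanted.

```python
# Toy model of the Lego-stacking misspecification (hypothetical names/values).
BLUE_TOP = 1.0   # height of the blue block's top face
RED_SIZE = 1.0   # edge length of the red block

def proxy_reward(action: str) -> float:
    """Reward used in training: height of the red block's bottom face."""
    if action == "stack_on_blue":
        return BLUE_TOP   # bottom face rests on top of the blue block
    if action == "flip_red":
        return RED_SIZE   # block flipped upside down on the floor
    return 0.0            # block left where it is

def true_reward(action: str) -> float:
    """What the designer actually wanted: red block stacked on the blue block."""
    return 1.0 if action == "stack_on_blue" else 0.0

actions = ["do_nothing", "flip_red", "stack_on_blue"]
best = max(actions, key=proxy_reward)   # the easy flip ties with the intended stack
print(best, proxy_reward(best), true_reward(best))
# -> flip_red 1.0 0.0: maximal proxy reward, zero true reward
```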
Deep RL from human preferences: blog post (Christiano et al., 2017)
- Learned reward models can behave surprisingly on out-of-distribution examples
- Hidden assumptions in the specification will come back to bite you
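The core of the Christiano et al. (2017) approach is fitting a reward model to pairwise human comparisons of trajectory segments, using a Bradley-Terry style loss: the probability that the human prefers segment A is a logistic function of the predicted reward difference. The sketch below is a minimal, hypothetical PyTorch version; `RewardModel`, the toy data, and the training loop are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an observation feature vector to a scalar reward estimate."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (timesteps, obs_dim); sum predicted rewards over the segment
        return self.net(segment).sum()

def preference_loss(model, seg_a, seg_b, human_prefers_a: float) -> torch.Tensor:
    """Bradley-Terry / logistic loss on the reward difference between segments."""
    r_a, r_b = model(seg_a), model(seg_b)
    p_a = torch.sigmoid(r_a - r_b)                  # modelled P(human prefers A)
    target = torch.tensor(human_prefers_a)
    return -(target * torch.log(p_a) + (1 - target) * torch.log(1 - p_a))

# Toy usage: one gradient step on a single human comparison label.
obs_dim = 8
model = RewardModel(obs_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(20, obs_dim), torch.randn(20, obs_dim)
loss = preference_loss(model, seg_a, seg_b, human_prefers_a=1.0)
opt.zero_grad(); loss.backward(); opt.step()
```

The out-of-distribution bullet above shows up exactly here: the learned reward model is only trustworthy on segments resembling those the human actually compared.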
On the opportunities and risks of foundation models (Bommasani et al., 2022)
- Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unanticipated consequences
- Homogenization indicates the consolidation of methodologies for building machine learning systems across a wide range of applications; it provides strong leverage across many tasks but also creates single points of failure
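A minimal sketch of the single-point-of-failure concern (hypothetical dimensions and task names, not from the paper): many downstream models are thin heads adapted from one shared frozen backbone, so any defect in the backbone is inherited by every adapted model.

```python
import torch
import torch.nn as nn

# Stand-in "foundation model": one shared, frozen backbone.
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False   # frozen and reused by every downstream task

# Each application only trains a small task-specific head on top.
task_heads = {
    "sentiment": nn.Linear(256, 2),
    "toxicity": nn.Linear(256, 2),
    "topic": nn.Linear(256, 10),
}

x = torch.randn(1, 128)
features = backbone(x)   # every task sees these features: a single point of failure
outputs = {task: head(features) for task, head in task_heads.items()}
```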