AGISF - Reward Misspecification and Foundation Models
Week 2 of the AI alignment curriculum. Reward misspecification occurs when the reward an RL agent is trained on diverges from the designer's intended objective, so the agent ends up being rewarded for misbehaving.
Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020)
- Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome.
- The "evil genie" analogy: the agent grants the literal wish rather than the intended one
- Covers reward hacking in both its older sense (exploiting flaws in a hand-specified reward) and its newer sense (interfering with the reward process itself)
- RLHF can help, but only if the correct reward function is learned
- The map is not the territory: the agent optimizes the specified reward (the map), not the designer's intent (the territory)
Examples
- Lego stacking - the red block gets flipped so that its bottom face sits at the same height as the blue block's top, gaming a reward defined on the height of that bottom face (see the sketch after this list)
- CoastRunners - the boat loops through the bonus targets indefinitely rather than finishing the race
- Grasping a ball with a robotic hand - the agent hovers the hand between the camera and the ball so that it merely appears to be grasping it
- Simulated robot - exploits a physics-simulator bug to slide along the ground instead of walking
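The Lego-stacking example reduces to a tiny illustration. The sketch below is hypothetical (the constants, action names, and reward functions are assumptions, not code from the original work): the proxy reward is the height of the red block's bottom face, and a search over actions finds that flipping the block scores as well as the intended stacking while achieving nothing the designer wanted.

```python
# Toy model of the Lego-stacking misspecification (hypothetical names/values).
BLUE_TOP = 1.0   # height of the blue block's top face
RED_SIZE = 1.0   # edge length of the red block

def proxy_reward(action: str) -> float:
    """Reward used in training: height of the red block's bottom face."""
    if action == "stack_on_blue":
        return BLUE_TOP   # bottom face rests on top of the blue block
    if action == "flip_red":
        return RED_SIZE   # block flipped upside down on the floor
    return 0.0            # block left where it is

def true_reward(action: str) -> float:
    """What the designer actually wanted: red block stacked on the blue block."""
    return 1.0 if action == "stack_on_blue" else 0.0

actions = ["do_nothing", "flip_red", "stack_on_blue"]
best = max(actions, key=proxy_reward)   # the easy flip ties with the intended stack
print(best, proxy_reward(best), true_reward(best))
# -> flip_red 1.0 0.0: maximal proxy reward, zero true reward
```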
Deep RL from human preferences: blog post (Christiano et al., 2017)
- Learned reward models can behave surprisingly on out-of-distribution examples
- Hidden assumptions in the specification will come back to bite you
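The core of the Christiano et al. (2017) approach is fitting a reward model to pairwise human comparisons of trajectory segments, using a Bradley-Terry style loss: the probability that the human prefers segment A is a logistic function of the predicted reward difference. The sketch below is a minimal, hypothetical PyTorch version; `RewardModel`, the toy data, and the training loop are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an observation feature vector to a scalar reward estimate."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (timesteps, obs_dim); sum predicted rewards over the segment
        return self.net(segment).sum()

def preference_loss(model, seg_a, seg_b, human_prefers_a: float) -> torch.Tensor:
    """Bradley-Terry / logistic loss on the reward difference between segments."""
    r_a, r_b = model(seg_a), model(seg_b)
    p_a = torch.sigmoid(r_a - r_b)                  # modelled P(human prefers A)
    target = torch.tensor(human_prefers_a)
    return -(target * torch.log(p_a) + (1 - target) * torch.log(1 - p_a))

# Toy usage: one gradient step on a single human comparison label.
obs_dim = 8
model = RewardModel(obs_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(20, obs_dim), torch.randn(20, obs_dim)
loss = preference_loss(model, seg_a, seg_b, human_prefers_a=1.0)
opt.zero_grad(); loss.backward(); opt.step()
```

The out-of-distribution bullet above shows up exactly here: the learned reward model is only trustworthy on segments resembling those the human actually compared.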
On the opportunities and risks of foundation models (Bommasani et al., 2022)
- Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unanticipated consequences
- Homogenization indicates the consolidation of methodologies for building machine learning systems across a wide range of applications; it provides strong leverage across many tasks but also creates single points of failure
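A minimal sketch of the single-point-of-failure concern (hypothetical dimensions and task names, not from the paper): many downstream models are thin heads adapted from one shared frozen backbone, so any defect in the backbone is inherited by every adapted model.

```python
import torch
import torch.nn as nn

# Stand-in "foundation model": one shared, frozen backbone.
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False   # frozen and reused by every downstream task

# Each application only trains a small task-specific head on top.
task_heads = {
    "sentiment": nn.Linear(256, 2),
    "toxicity": nn.Linear(256, 2),
    "topic": nn.Linear(256, 10),
}

x = torch.randn(1, 128)
features = backbone(x)   # every task sees these features: a single point of failure
outputs = {task: head(features) for task, head in task_heads.items()}
```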