AGISF - Reward Misspecification and Foundation Models

1 minute read

Week 2 of the AI alignment curriculum. Reward misspecification occurs when the reward function we give an RL agent rewards behaviour other than what we actually intended.

Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020)

  • Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome.
  • Evil genies: you get literally what you asked for, not what you meant
  • Amounts to “hacking” in both the old and the new sense of the word
  • RLHF can help, but only if the correct reward function is learned
  • The map is not the territory: the specified reward is only a map of the intended outcome, and agents learn from the map

Examples

  • Lego stacking - the reward is the height of the red block’s bottom face, so the agent flips the red block over; its bottom ends up at the same height as the blue block’s top without any stacking
  • Coast Runners - the boat loops to collect respawning bonuses rather than finishing the race (sketched below)
  • Grasping ball with robotic hand - the agent hovers the hand between the camera and the ball so it merely looks like a grasp to the human evaluator
  • Simulated robot - exploits a physics bug to slide along the ground instead of walking
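
A minimal sketch of the failure pattern, loosely modelled on the Coast Runners case; the numbers and field names below are made up for illustration. The agent only ever optimises the specified reward, so the gap only shows up when the intended outcome is checked separately.

```python
# Toy illustration (hypothetical numbers): the specified reward counts bonus
# pickups, while the intended objective is finishing the race.

def specified_reward(trajectory):
    # The "map": the proxy the agent actually optimises.
    return sum(step["bonuses"] for step in trajectory)

def intended_objective(trajectory):
    # The "territory": what the designer wanted, never shown to the agent.
    return 1.0 if trajectory[-1]["finished"] else 0.0

# Two behaviours the boat could learn.
loop_for_bonuses = [{"bonuses": 3, "finished": False}] * 10
race_to_finish = [{"bonuses": 1, "finished": False}] * 9 + [{"bonuses": 1, "finished": True}]

for name, traj in [("loop for bonuses", loop_for_bonuses),
                   ("race to finish", race_to_finish)]:
    print(f"{name}: proxy={specified_reward(traj)}, intent={intended_objective(traj)}")

# The looping behaviour wins on the proxy (30 vs 10) but scores 0 on the intent:
# a high specified reward with a low intended score is specification gaming.
```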

Deep RL from human preferences: blog post (Christiano et al., 2017)

  • The learned reward model can behave surprisingly on out-of-distribution examples
  • Hidden assumptions in the setup will eventually come back to bite you
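
Christiano et al.’s core move is to fit a reward model to human pairwise comparisons and then train the policy against that learned reward. Below is a rough sketch of the preference-fitting step, using a linear reward model and synthetic labels purely for illustration (the paper uses neural networks over trajectory segments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "trajectory segment" features and human pairwise labels:
# for each pair (a, b), the label says whether the human preferred segment a.
dim = 8
segments_a = rng.normal(size=(200, dim))
segments_b = rng.normal(size=(200, dim))
true_w = rng.normal(size=dim)                 # hidden "true" preference direction
prefers_a = (segments_a @ true_w > segments_b @ true_w).astype(float)

w = np.zeros(dim)                             # linear reward model r(s) = w . phi(s)
lr = 0.1
for _ in range(500):
    # Bradley-Terry model: P(a preferred over b) = sigmoid(r(a) - r(b))
    logits = segments_a @ w - segments_b @ w
    p_a = 1.0 / (1.0 + np.exp(-logits))
    # Gradient of the negative log-likelihood of the human labels w.r.t. w
    grad = ((p_a - prefers_a)[:, None] * (segments_a - segments_b)).mean(axis=0)
    w -= lr * grad

# The learned reward then becomes the RL training signal. It is only trustworthy
# near the labelled distribution: far out-of-distribution segments can receive
# confidently wrong scores, which a policy optimising against them will exploit.
print("agreement with labels:",
      ((segments_a @ w > segments_b @ w).astype(float) == prefers_a).mean())
```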

On the opportunities and risks of foundation models (Bommasani et al., 2022)

  • Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unanticipated consequences
  • Homogenization indicates the consolidation of methodologies for building machine learning systems across a wide range of applications; it provides strong leverage towards many tasks but also creates single points of failure

Learning to summarize with human feedback: blog post (Stiennon et al., 2020)

The alignment problem from a deep learning perspective (Ngo, Chan and Mindermann, 2022)