AGISF - Goal misgeneralization

3 minute read

Week 3 of the AI alignment curriculum. Goal misgeneralization refers to scenarios in which agents placed in new situations generalize to behaving in competent yet undesirable ways, because they learned the wrong goals from their training.

Goal Misgeneralisation: Why Correct Specifications Aren’t Enough For Correct Goals (Shah, 2022)

Blog post

  • A correct specification is needed so the learner has the right incentives (and doesn’t exploit bugs), but it doesn’t automatically result in correct goals
  • If there is more than one way to interpret the set of training examples (and there probably is), assume the worst (see the sketch after this list)
  • Good and bad behaviour can result in the same rewards - the reward function acts as a lossy compression of the value of a given set of actions
  • An AGI doesn’t have to be actively deceptive to produce bad outcomes
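A toy sketch of the “more than one interpretation” point (my own illustration, not from the post, with a made-up coin-collecting setup): two policies that are indistinguishable on the training distribution, because the proxy goal happens to coincide with the intended one there, but that diverge as soon as the test distribution changes.

```python
import numpy as np

# Hypothetical setup: the agent is rewarded for reaching a coin. During
# training the coin always sits at the right edge of the level, so the
# proxy goal "walk right" and the intended goal "get the coin" earn
# identical reward - the reward signal cannot tell them apart.
rng = np.random.default_rng(0)

def reward(agent_pos, coin_pos):
    return 1.0 if agent_pos == coin_pos else 0.0

def policy_walk_right(width, coin_pos):
    return width - 1      # learned proxy: always go to the far right

def policy_get_coin(width, coin_pos):
    return coin_pos       # intended goal: go to wherever the coin is

train = [(10, 9)] * 1000                                       # coin fixed at the edge
test = [(10, int(rng.integers(0, 10))) for _ in range(1000)]   # coin position randomized

for name, policy in [("walk right", policy_walk_right), ("get coin", policy_get_coin)]:
    tr = np.mean([reward(policy(w, c), c) for w, c in train])
    te = np.mean([reward(policy(w, c), c) for w, c in test])
    print(f"{name:>10}: train {tr:.2f}, test {te:.2f}")
```

Both policies get perfect training reward, so nothing in the reward signal distinguishes the wrong goal from the right one until deployment.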

Examples

Ball following

  • The goal is to hit the blobs in a specific sequence; the agent instead learns to follow the other agent, as that originally gave good results
  • Seems like a good heuristic? I’d probably do the same?

Tree chopping

  • Chopping trees as fast as possible gave the most reward while the agent wasn’t very competent, so once it becomes competent it just chops down everything rather than harvesting sustainably (a toy model of this is sketched below)
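A minimal sketch of why the learned heuristic stops paying off (my own toy model with made-up logistic regrowth dynamics, not the environment from the paper): while the agent can only chop slowly, “chop as fast as you can” is effectively optimal, but once it is fast enough to clear the forest the same heuristic destroys all future reward.

```python
def harvest(chop_rate, steps=200, trees=10.0, capacity=10.0, regrow=0.1):
    """Toy forest: reward = trees chopped; trees regrow logistically (illustrative only)."""
    total = 0.0
    for _ in range(steps):
        chopped = min(trees, chop_rate)                   # chop as fast as the agent can
        trees -= chopped
        total += chopped
        trees += regrow * trees * (1 - trees / capacity)  # regrowth slows near capacity
    return total

# Incompetent agent: can only chop 0.1 trees/step, so chopping flat-out
# is sustainable anyway and earns the most it possibly could (~20).
print(harvest(chop_rate=0.1))
# Competent agent applying the same learned heuristic: clears the forest
# immediately and earns nothing afterwards (10 total)...
print(harvest(chop_rate=10.0))
# ...even though restrained chopping would have earned far more (~40).
print(harvest(chop_rate=0.2))
```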

Expression evaluator

  • Always asks clarifying questions
  • Might not be that stupid, depending on how you look at it - it could be clarifying what values it’s working with, and it’s showing engagement. Sounds like something that would be done in school

Paper

  • Capability is understood to mean that an agent is good at a task even in novel situations
  • Pretty much the same as the blog post, but with more details and a more sciency presentation

Why alignment could be hard with modern deep learning (Cotra, 2021)

  • Introduces:
    • Saints - actively want to do what you want (i.e. aligned)
    • Sycophants - do whatever they think you want them to do (i.e. whatever gets them praised)
    • Schemers - do whatever they think you want in order to advance their hidden goals (i.e. whatever gets them closer to what they really want)
  • This is sort of like simulacrum levels?

Thought experiments provide a third anchor (Steinhardt, 2022)

  • Introduces anchors as a sort of idea generator:
    • Current ML systems - extrapolating from e.g. LLMs can give valuable insights and also allows hypotheses to be tested
    • Humans - an intuitive source of understanding of various concepts, seeing as they are quite intelligent and can learn
    • Thought experiments with ideal optimizers - useful for working out generalizations and overarching constraints
  • ML tends to change a lot, fast. Yesterday’s models are a lot less interesting and insightful
  • Humans tend to anthropomorphize far too much, e.g. assigning emotions where there are none. Humans are a useful anchor, but a very limited one (like neurosurgeons vs vets)
  • Thought experiments are nice, but whether they make sense needs to be checked empirically. They also tend to assume spherical cows
    • Correctly predicted deception and power seeking
  • Other possible anchors are evolution, ethology or economics

ML systems will have weird failure modes (Steinhardt, 2022)

  • Discusses deceptive alignment, where the model pretends to be aligned during training, then does a treacherous turn
  • Requires the model to:
    • have a coherent policy/plan/reward function that it wants to achieve/protect
    • be able to make long term plans
    • know what the desired outcomes are (so it can pretend to produce them)
  • This is an unlikely set of assumptions (which is duly noted in the article), but it shows that deceptive alignment is possible in theory

The alignment problem from a deep learning perspective (Ngo, Chan and Mindermann, 2022)