Temporal-Difference Learning


I’ve been watching David Silver’s Reinforcement Learning lectures. They’re really well done, and I personally found them at just the right ratio of new to known info, i.e. challenging, but not incomprehensible. I’m not sure if I should take that as a compliment or an insult… Anyway, I was struck by how well humans can be modeled by Temporal-Difference learning. Although I see from Wikipedia that I’m not very original, seeing as there is a whole section there on its relevance for neuroscience… Also Scott Alexander wrote about it recently.

I’d previously read that human decision making tends to be similar to RL, but apart from a general feeling of “yes, humans learn from their experiences, updating their priors in order to maximise reward (or rather minimise punishment)”, I hadn’t gone into any specifics. Hence my watching the aforementioned lectures. What struck me the most, though, was how TD tends to be biased. This shouldn’t really have been a shock, as of course algorithms can be biased - why on earth not? - but I hadn’t grokked the (or a) mechanism by which this could happen.

TD(0) was presented as an alternative to Monte-Carlo (MC) evaluation. MC basically means that the algorithm runs a whole episode (e.g. plays a whole game of Mario) and only afterwards updates its priors on the basis of the result of the game. After doing this a lot of times, by the law of large numbers, the algorithm will pretty much be able to predict its final score from any given game state: if it’s just before the end of the level, it can predict that it’ll most likely finish the level, while if it’s falling down a hole, it can predict that it’s pretty much game over. The main thing here is that it waits for each game to finish before it updates its beliefs.
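To make that concrete, here’s a rough sketch of every-visit MC prediction in Python. None of this is code from the lectures - the `run_episode` interface and the incremental-mean update are just my own illustrative choices:

```python
from collections import defaultdict

def mc_evaluate(run_episode, num_episodes=10_000, gamma=1.0):
    """Every-visit Monte-Carlo prediction: finish the whole episode first,
    then nudge each visited state's value towards the full observed return."""
    values = defaultdict(float)   # V(s): current estimate of the return from s
    counts = defaultdict(int)     # how many returns we've averaged per state

    for _ in range(num_episodes):
        episode = run_episode()   # assumed to return [(state, reward), ...]
        g = 0.0
        # Walk backwards so g accumulates the discounted return from each state
        for state, reward in reversed(episode):
            g = reward + gamma * g
            counts[state] += 1
            # Incremental mean: V(s) <- V(s) + (G - V(s)) / N(s)
            values[state] += (g - values[state]) / counts[state]
    return values
```

The key point is that nothing in `values` changes until `run_episode()` has returned, i.e. until Mario has either finished the level or died.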

TD, on the other hand, keeps on updating while it’s running. So the computer playing Mario will notice that it gets points for collecting coins the moment it grabs one, while the MC version would only factor that in after it finishes a run-through. The following slide from the lecture shows some differences between the two:

MC vs TD
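The contrast shows up directly in the update rule. Here’s a sketch of TD(0) prediction, again with made-up interface names (`reset`, `env_step`) rather than anything from the course:

```python
from collections import defaultdict

def td0_evaluate(reset, env_step, num_episodes=10_000, alpha=0.1, gamma=1.0):
    """TD(0) prediction: update after every single step, bootstrapping on the
    current estimate of the next state's value instead of the final outcome."""
    values = defaultdict(float)

    for _ in range(num_episodes):
        state, done = reset(), False
        while not done:
            next_state, reward, done = env_step(state)
            # One-step target: real reward plus our *guess* about what follows
            target = reward + (0.0 if done else gamma * values[next_state])
            values[state] += alpha * (target - values[state])
            state = next_state
    return values
```

That `gamma * values[next_state]` term is where the bias sneaks in: the target is partly built out of the algorithm’s own current guess, so early errors get folded back into later updates.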

Both approaches will converge to the true predictions (i.e. what will actually happen), though not necessarily at the same speed. TD tends to converge more efficiently - my understanding is that this is because it learns while running, rather than waiting till the end to update its views. This comes at the cost of some bias, which MC doesn’t have, since TD bootstraps from its own current estimates rather than from complete, observed outcomes. These two approaches are also the edge cases of a whole spectrum of strategies, namely n-step TD, where n is the number of steps to look ahead (or back - it’s all a matter of perspective). When n=1, it’s the same as TD(0), which only looks one step ahead before falling back on its current estimates; as n goes to ∞ (or at least to the length of the episode), it becomes the same as MC, as the algorithm looks at the whole run before updating.
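The in-between strategies just change how much of the update target is real, observed reward versus bootstrapped guess. Here’s a rough sketch of the n-step target, with my own indexing convention rather than the lecture’s notation:

```python
def n_step_target(rewards, state_values, t, n, gamma=1.0):
    """n-step return from time t. rewards[k] is the reward received after
    step k; state_values[k] is the current estimate V(S_k), with one extra
    entry for the state reached after the final step (0 if terminal).
    n=1 reproduces the TD(0) target; n >= remaining steps gives the MC return."""
    T = len(rewards)
    horizon = min(t + n, T)
    # Real, observed rewards for up to n steps...
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    # ...then, if the episode hasn't ended by then, bootstrap on an estimate.
    if horizon < T:
        g += gamma ** (horizon - t) * state_values[horizon]
    return g
```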

Wild speculations

Now for the speculations. Both MC and TD seem to explain how humans learn stuff, as well as differences in how individuals learn. On a lifetime scale it’s obvious (at least it seems so…) that a given human cannot learn via MC, as they only seem to have one shot at life, which sort of disqualifies the policy of “run a single trial of life, then update your knowledge”. There are of course tales of people doing just that (I highly recommend all of those), but I think it’s safe to say that isn’t the general experience… So assuming an average life, people have to learn in real time, on the basis of their actions. Just like in TD. After experiencing something, people (hopefully) extract lessons from it and update their priors accordingly. The problem is that they update their priors in the direction of whatever will maximise their value function, but only on the basis of what their look-ahead generates. So if someone isn’t all that good at evaluating all the possible ways in which an action could have influenced them, or isn’t very good at calculating the expected value, they won’t update all that well. This also applies to how far they look ahead. Some people might be able to see 100 steps ahead and update accordingly, others only 1, and so be short-sighted. Isn’t this simply the problem of exploration and exploitation? Although it feels different. Or maybe they simply have a value function that weighs things in non-intuitive ways…

Another difference between MC and TD is bias: MC doesn’t have it, TD does. Which explains a lot, if this is a good model of human learning. Humans are overflowing with biases. That always seemed strange to me - why on earth would they be? What good do biases do? Of course they’re often useful heuristics, but why should that be the whole story? But if the brain acts sort of like TD, bootstrapping each new judgement from its previous ones, then biases are to be expected.

I also wonder at the differences in learning styles between people. I tend to continuously update my beliefs on all kinds of topics, which has the unfortunate side effect of me believing things too soon, too often. My friend has the opposite problem, updating only after a preponderance of evidence totally crushes any doubts. This could also sort of be explained by the difference between MC and TD, though on a much smaller scale - I tend to set my policy/value function for a given topic in a very TD way, continuously updating, while my friend does it in a more MC-like fashion. Though on second thought that seems like it’s stretching the idea too far…