Motivation (successes, AGI, human feedback, history)
Multi-armed bandit problems (stochastic, contextual)
Solving MDPs 1: (Bellman equations, Value iteration)
Solving MDPs 2: (Contraction, Policy iteration)
Temporal difference learning 1: (TD(0), Sarsa, Q-learning)
Temporal difference learning 2: (n-step, Double-Q, DQN)
Policy gradient methods 1: (Tabular)
Policy gradient methods 2: (Variance reduction, Neural)
Combining learning and planning (AlphaZero, MuZero)
Exploration in RL
Multi-agent RL (cooperative vs. adversarial)
Applications: Advertising, RLHF, Robotics, …
Neuroscience and RL