1. 9.4 From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function
2. 9.2 Actor-Critic Reinforcement Learning with Phased Actor