[Study Note] V and Q Functions in RL
Notes on UC Berkeley's CS285 course, Lecture 4.
The goal of reinforcement learning is to find an optimal policy \(\pi^*\) that maximises the expected cumulative reward over time.
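To make "cumulative reward" concrete, here is a minimal sketch that estimates it by averaging over sampled reward sequences and then picks the better of two candidate policies. The function names, the policy labels, and the reward numbers are all hypothetical and chosen only for illustration; they do not come from the lecture.

```python
from statistics import mean


def cumulative_reward(rewards):
    """Sum of the per-step rewards collected along one trajectory."""
    return sum(rewards)


def expected_return(sampled_reward_sequences):
    """Monte Carlo estimate: average cumulative reward over sampled trajectories."""
    return mean(cumulative_reward(r) for r in sampled_reward_sequences)


# Hypothetical reward sequences, two sampled trajectories per candidate policy.
samples = {
    "policy_a": [[1, 0, 2], [0, 0, 1]],
    "policy_b": [[2, 2, 0], [1, 3, 1]],
}

# The preferred policy is the one with the higher estimated expected return.
best = max(samples, key=lambda name: expected_return(samples[name]))
```

In this toy setting the search is over two fixed candidates; in practice the policy is usually parameterised and the maximisation is done by an optimisation procedure rather than by enumeration.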
Define the trajectory \(\tau = (s_0, a_0, s_1, a_1, \dots)\), which is a sequence of states and actions. The probability under