Combining reward maximization with a KL divergence penalty to prevent the policy from drifting too far from the reference policy.
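
One common way to express this trade-off is to subtract a scaled KL estimate from the reward: reward minus beta times KL(policy || reference). The sketch below is a minimal illustration under that assumption, not necessarily the formulation this lesson uses. The function name `kl_penalized_reward`, its parameters, and the default `beta=0.1` are hypothetical, chosen only for the example. The KL term is a single-sample estimate built from per-token log-probabilities of one sampled response.

```python
import math


def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Return the reward minus beta times a sampled KL estimate.

    reward:          scalar reward for one sampled response (assumed given).
    policy_logprobs: per-token log-probs of the response under the policy.
    ref_logprobs:    per-token log-probs of the same response under the
                     frozen reference policy.
    beta:            penalty strength (hypothetical default, tune per task).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")

    # Single-sample estimate of KL(policy || reference) for this response:
    # the sum over tokens of log pi(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

    # A larger beta keeps the policy closer to the reference;
    # beta = 0 recovers plain reward maximization.
    return reward - beta * kl_estimate


if __name__ == "__main__":
    # The policy assigns each of two tokens probability 0.5,
    # while the reference assigns each 0.25.
    policy = [math.log(0.5), math.log(0.5)]
    reference = [math.log(0.25), math.log(0.25)]
    print(kl_penalized_reward(1.0, policy, reference, beta=0.1))
```

When the policy and reference assign the same log-probabilities, the penalty vanishes and the function returns the raw reward. The penalty grows as the policy places more probability than the reference on the responses it samples.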