Course contentsShow
Machine Learning and Deep Learning
Lesson 1771 of 3,53838. Instruction Tuning and AlignmentPro lesson

The RLHF Objective Function

Combining reward maximization with KL divergence penalty to prevent the policy from drifting too far from reference.

This lesson is for subscribers

You've completed the free preview. Subscribe to unlock every lesson in every course.