This lesson is for subscribers
You've completed the free preview. Subscribe to unlock every lesson in every course.
The tendency of RLHF models to agree with users even when incorrect, prioritizing approval over truthfulness.
You've completed the free preview. Subscribe to unlock every lesson in every course.