UNITER: Unified Vision-Language Pretraining

Single-stream architecture that jointly encodes image regions and text tokens with transformer layers.
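
To make "single-stream" concrete, here is a minimal sketch in plain Python. It is an illustration, not UNITER's actual implementation: image-region features and text-token ids are each mapped into one shared hidden size, concatenated into a single sequence, and passed through one self-attention step, so every position can attend across both modalities. All sizes, weights, and names (`D`, `REGION_DIM`, `VOCAB`, `w_img`, `tok_table`) are hypothetical toy values, and the sketch omits position and location embeddings, multiple heads, residual connections, layer normalization, and feed-forward sublayers.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length d_in) by weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq, wq, wk, wv):
    """Single-head scaled dot-product self-attention over one sequence."""
    q = [linear(x, wq) for x in seq]
    k = [linear(x, wk) for x in seq]
    v = [linear(x, wv) for x in seq]
    d = len(q[0])
    outputs, weights = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append([sum(a * vj[c] for a, vj in zip(attn, v)) for c in range(d)])
    return outputs, weights


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D = 8            # shared hidden size (hypothetical)
REGION_DIM = 12  # size of each toy region feature (hypothetical)
VOCAB = 20       # toy vocabulary size (hypothetical)

# Toy inputs: 3 image regions and 4 text tokens.
region_feats = [[rng.uniform(-1, 1) for _ in range(REGION_DIM)] for _ in range(3)]
token_ids = [4, 7, 1, 9]

# Modality-specific embedders project both inputs into the same D-dim space.
w_img = rand_matrix(REGION_DIM, D, rng)
tok_table = rand_matrix(VOCAB, D, rng)
img_emb = [linear(r, w_img) for r in region_feats]
txt_emb = [tok_table[t] for t in token_ids]

# Single stream: one concatenated sequence goes through the same layer.
joint = img_emb + txt_emb

wq, wk, wv = (rand_matrix(D, D, rng) for _ in range(3))
encoded, attn = self_attention(joint, wq, wk, wv)

print(len(encoded), len(encoded[0]))  # 7 positions, each of size 8
```

Each row of `attn` spans all seven positions, so a text token's attention weights cover the three image regions as well as the other tokens. A two-stream design would instead encode each modality with separate layers before fusing them.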
