BiteSizedChunks.com · Learn one small thing at a time.

Course contents

Machine Learning and Deep Learning

683. From Batch GD to Stochastic GD
684. Mini-Batch Gradient Descent
685. Batch Size Effects on Training
686. The Learning Rate: Core Hyperparameter
687. Learning Rate Too High or Too Low
688. SGD with Momentum: Concept
689. SGD with Momentum: Mathematics
690. Nesterov Accelerated Gradient
691. The Problem SGD Variants Solve
692. Adagrad: Adaptive Learning Rates
693. Adagrad's Limitations
694. RMSprop: Exponential Averaging of Gradients
695. Adam: Combining Momentum and Adaptation
696. Adam Hyperparameters and Defaults
697. AdamW: Decoupled Weight Decay
698. Choosing an Optimizer in Practice
699. Why Fixed Learning Rates Fail
700. Momentum-Based Optimization
701. Nesterov Accelerated Gradient
702. AdaGrad: Per-Parameter Learning Rates
703. AdaGrad's Diminishing Learning Rate Problem
704. RMSprop: Exponential Moving Average of Gradients
705. Adam: Combining Momentum and Adaptive Rates
706. Adam's Bias Correction Mechanism
707. AdamW: Decoupled Weight Decay
708. NAdam: Nesterov-Accelerated Adam
709. AdaMax and AdaBound Variants
710. Choosing Hyperparameters for Adaptive Optimizers
711. When to Use SGD vs Adam
712. Implementing Adaptive Optimizers in PyTorch
713. Why Learning Rate Scheduling Matters
714. Step Decay Schedules
715. Exponential Decay
716. Polynomial Decay
717. Cosine Annealing
718. Cosine Annealing with Warm Restarts
719. Linear Warmup
720. ReduceLROnPlateau: Adaptive Scheduling
721. One Cycle Learning Rate Policy
722. Cyclical Learning Rates
723. Implementing Schedules in PyTorch
724. Choosing and Tuning LR Schedules
725. The Exploding Gradient Problem
726. Gradient Norm and When to Clip
727. Gradient Clipping by Value
728. Gradient Clipping by Norm
729. Choosing Clipping Thresholds
730. Gradient Clipping in PyTorch
731. Gradient Accumulation for Stability
732. Mixed Precision and Gradient Scaling

Lesson 692 of 3,538 · 17. Optimization for Deep Learning · Pro lesson

Adagrad: Adaptive Learning Rates

Accumulating each parameter's squared gradients and dividing its learning rate by the square root of that running sum, so parameters with large past gradients take smaller steps.
