BiteSizedChunks.com
Learn one small thing at a time.

Course contents

Machine Learning and Deep Learning

1. Scalars, Vectors, and Matrices: Definitions
2. Vector Operations: Addition and Scalar Multiplication
3. Dot Product and Vector Similarity
4. Vector Norms and Distance Metrics
5. Matrix-Vector Multiplication
6. Matrix-Matrix Multiplication
7. Matrix Transpose and Symmetry
8. Identity Matrix and Matrix Inverse
9. Systems of Linear Equations
10. Linear Independence and Span
11. Basis and Dimension
12. Column Space and Null Space
13. Rank of a Matrix
14. Determinants and Their Properties
15. Trace of a Matrix
16. Eigenvalues and Eigenvectors: Definitions
17. Computing Eigenvalues and Eigenvectors
18. Eigendecomposition of Matrices
19. Diagonalization and Its Applications
20. Orthogonality and Orthonormal Vectors
21. Orthogonal Matrices and Their Properties
22. Singular Value Decomposition (SVD): Concept
23. Computing and Interpreting SVD
24. Matrix Approximation with SVD
25. Positive Definite and Semidefinite Matrices
26. Quadratic Forms
27. Matrix Calculus: Gradients of Matrix Expressions
28. Numerical Stability in Linear Algebra
29. Functions and Continuity
30. Limits: The Foundation of Derivatives
31. The Derivative Definition
32. Geometric Interpretation of Derivatives
33. Basic Differentiation Rules
34. Product and Quotient Rules
35. The Chain Rule
36. Derivatives of Exponential Functions
37. Derivatives of Logarithmic Functions
38. Derivatives of Trigonometric Functions
39. Higher-Order Derivatives
40. Implicit Differentiation
41. Partial Derivatives: Introduction
42. The Gradient Vector
43. Directional Derivatives
44. The Multivariable Chain Rule
45. Critical Points and Extrema
46. The Hessian Matrix
47. Second Derivative Test in Multiple Dimensions
48. Taylor Series and Approximations
49. L'Hôpital's Rule
50. The Jacobian Matrix
51. Integration Fundamentals
52. Numerical Differentiation
53. Sample Spaces and Events
54. Probability Axioms and Basic Rules
55. Conditional Probability
56. Independence of Events
57. Bayes' Theorem
58. Random Variables: Discrete and Continuous
59. Probability Mass Functions
60. Probability Density Functions
61. Cumulative Distribution Functions
62. Expectation and Mean
63. Variance and Standard Deviation
64. Common Discrete Distributions: Bernoulli and Binomial
65. Poisson Distribution
66. Uniform Distribution
67. Normal (Gaussian) Distribution
68. Exponential and Gamma Distributions
69. Joint Probability Distributions
70. Marginal and Conditional Distributions
71. Covariance and Correlation
72. Independence of Random Variables
73. Law of Large Numbers
74. Central Limit Theorem
75. Population vs Sample
76. Descriptive Statistics: Central Tendency
77. Descriptive Statistics: Spread and Variability
78. Percentiles and Quantiles
79. Covariance and Correlation
80. The Law of Large Numbers
81. Central Limit Theorem
82. Sampling Distributions
83. Point Estimation Fundamentals
84. Bias and Variance of Estimators
85. Maximum Likelihood Estimation
86. Method of Moments
87. Confidence Intervals
88. Bootstrap Resampling
89. Hypothesis Testing Framework
90. Type I and Type II Errors
91. Common Statistical Tests
92. Multiple Testing Correction
93. What is Mathematical Optimization?
94. Unconstrained vs Constrained Optimization
95. Local vs Global Optima
96. Convex Sets
97. Convex Functions
98. First-Order Optimality Conditions
99. Second-Order Optimality Conditions
100. The Gradient Descent Algorithm
101. Learning Rate and Step Size
102. Convergence Guarantees for Gradient Descent
103. Lipschitz Continuity and Smoothness
104. Strong Convexity
105. Stochastic Gradient Descent Basics
106. Momentum Methods
107. Newton's Method
108. Quasi-Newton Methods
109. Coordinate Descent
110. Constrained Optimization and Lagrange Multipliers
111. KKT Conditions
112. Subgradients and Non-Smooth Optimization

Lesson 1606 of 3,538 · 35. Modern Large Language Models: Architecture · Pro lesson

Causal Self-Attention Masking

Learn how causal masking prevents tokens from attending to future positions, enabling autoregressive generation.
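
To make the idea concrete, here is a minimal sketch in plain Python of how a causal mask could be applied to attention scores. It is an illustration under stated assumptions, not the lesson's own code: the function name `causal_attention_weights` and the toy `scores` matrix are hypothetical. It assumes `scores[i][j]` is the raw score of query position `i` against key position `j`, and it sets every future position (`j > i`) to negative infinity before the softmax so that those positions receive zero weight.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with positions j > i masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Masking: replace scores for future positions (j > i) with -inf,
        # so they contribute exp(-inf) = 0 to the softmax.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)  # always finite: position i itself is never masked
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a 3-token sequence (all equal, for readability).
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
weights = causal_attention_weights(scores)

print(weights[0])  # [1.0, 0.0, 0.0]  token 0 can only attend to itself
print(weights[1])  # [0.5, 0.5, 0.0]  token 1 attends to tokens 0 and 1
```

Because each row only ever places weight on the current and earlier positions, the output for token `i` cannot depend on later tokens, which is the property autoregressive generation relies on.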
