BiteSizedChunks.com: Learn one small thing at a time.

Course contents

39. What is the Hugging Face Hub
40. Navigating the Model Hub Interface
41. Understanding Model Cards
42. Model Licensing and Usage Rights
43. Model Size and Performance Trade-offs
44. Task-Specific Model Selection
45. Model Variants and Checkpoints
46. Community Metrics and Trust Signals
47. Hugging Face CLI and Programmatic Access
48. Private Models and Organization Repos
49. Installing and Importing Transformers
50. Pipeline API for Quick Inference
51. Understanding AutoClasses
52. Tokenizers: Encoding and Decoding
53. Model Inputs and Attention Masks
54. Loading Pre-trained Model Weights
55. Model Outputs and Hidden States
56. Generation Methods: generate()
57. Padding and Truncation Strategies
58. Working with Different Model Types
59. Batch Processing and DataLoaders
60. Saving and Loading Custom Models
61. What is Inference Optimization
62. Measuring Inference Performance
63. CPU vs GPU Inference Trade-offs
64. Batch Size and Throughput
65. Memory Management During Inference
66. Torch Compile and JIT
67. ONNX Runtime Basics
68. Attention Mechanism Optimization
69. Model Pruning Fundamentals
70. Mixed Precision Inference
71. Dynamic vs Static Shape Optimization
72. Profiling Inference Bottlenecks
73. Warm-up and Model Loading
74. Inference Optimization Decision Framework
75. Understanding Device Placement in PyTorch
76. Checking Available Hardware and CUDA Setup
77. Memory Management and GPU Allocation
78. What is Model Quantization
79. Post-Training Quantization with Transformers
80. 8-bit and 4-bit Quantization with bitsandbytes
81. Quantization Trade-offs: Speed vs Quality
82. Mixed Precision and Automatic Device Mapping
83. CPU Inference Optimization Techniques
84. Benchmarking Device and Quantization Configurations

Lesson 62 of 1,886 · 2. Working with Pre-trained Models · Pro lesson

Measuring Inference Performance

Key metrics for inference: latency, throughput, tokens per second, and time to first token.
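
As a rough illustration of how these four metrics relate, the sketch below times a token stream and derives each number from the recorded timestamps. It is a minimal example under stated assumptions, not the lesson's own method: `fake_stream` is a hypothetical stand-in for a real model's streaming output, and the names `summarize`, `measure`, and `requests_per_second` are invented for this example.

```python
import time
from typing import Callable, Dict, Iterable, List


def summarize(start: float, token_times: List[float]) -> Dict[str, float]:
    """Derive per-request metrics from a start time and per-token arrival times (seconds)."""
    if not token_times:
        raise ValueError("no tokens were produced")
    latency = token_times[-1] - start   # latency: time until the full response is done
    ttft = token_times[0] - start       # time to first token
    n = len(token_times)
    return {
        "latency_s": latency,
        "ttft_s": ttft,
        "tokens": n,
        # tokens per second over the whole request, including the wait for the first token
        "tokens_per_s": n / latency if latency > 0 else float("inf"),
    }


def requests_per_second(num_requests: int, wall_time_s: float) -> float:
    """Throughput: completed requests divided by the wall-clock time they took."""
    return num_requests / wall_time_s


def measure(stream: Callable[[], Iterable[str]]) -> Dict[str, float]:
    """Time one streaming request, recording a timestamp as each token arrives."""
    start = time.perf_counter()
    token_times = [time.perf_counter() for _ in stream()]
    return summarize(start, token_times)


def fake_stream() -> Iterable[str]:
    """Hypothetical model stand-in: a slow first token, then faster ones."""
    time.sleep(0.05)   # simulated prompt-processing delay before the first token
    yield "Hello"
    for tok in [",", " world"]:
        time.sleep(0.01)   # simulated per-token decoding delay
        yield tok


if __name__ == "__main__":
    print(measure(fake_stream))
```

Separating `summarize` from `measure` keeps the arithmetic checkable with fixed timestamps, while `measure` handles the actual timing with `time.perf_counter`. With a real model, `fake_stream` would be swapped for whatever yields tokens as they are generated.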