
Data Science Glossary

Key terms from the Data Science course, linked to the lesson that introduces each one.

5,145 terms.

#

`element_line()`
Controls line elements (grid lines, axis lines)
Lesson 1365Customizing Non-Text Elements
Lesson 1366Theme() Function Deep Dive
`element_rect()`
Controls rectangular elements (backgrounds, borders)
Lesson 1365Customizing Non-Text Elements
Lesson 1366Theme() Function Deep Dive
1:1 to 3:1
Breaking even or marginally profitable, but likely not covering operational overhead, support costs, or providing adequate return on capital.
Lesson 1667LTV:CAC Ratio and Profitability
Lesson 1756LTV:CAC Ratio as a Health Metric
80% power
An 80% chance of detecting a true effect if it exists.
Lesson 446Power and Sample Size for ANOVA
Lesson 1495Power Analysis Fundamentals
95% confidence
The procedure captures the true parameter 95% of the time
Lesson 267Interpreting Confidence Levels
Lesson 278Confidence Interval Formula for One Proportion

A

Above 5:1
Excellent margins, but potentially signals underinvestment in growth.
Lesson 1667LTV:CAC Ratio and Profitability
Above average
`WHERE value > (SELECT AVG(value) FROM table)`
Lesson 964Subqueries with Aggregate Functions
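The pattern above can be tried end to end with Python's built-in `sqlite3` module. This is a minimal sketch: the `orders` table and its values are hypothetical and exist only to illustrate the subquery.

```python
import sqlite3

# Hypothetical table and values, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO orders (id, value) VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0), (4, 100.0)],
)

# The scalar subquery computes the overall average (40.0 here);
# the outer query keeps only the rows whose value exceeds it.
above_average = conn.execute(
    "SELECT id, value FROM orders "
    "WHERE value > (SELECT AVG(value) FROM orders) "
    "ORDER BY id"
).fetchall()
conn.close()

print(above_average)
```

Only the order with value 100.0 sits above the average of 40.0, so it is the only row returned.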
Absolute counts
500 users from Jan cohort were active in Week 3
Lesson 1647Building a Cohort Table
Academic scores
Assignments with varying point values
Lesson 43Weighted Mean and Its Applications
Acceleration
If the variance of your statistic changes across different data values (making the distribution asymmetric), BCa accounts for this skewness
Lesson 304BCa Bootstrap Intervals: Bias Correction
Accept limitations
Acknowledge when clean measurement isn't feasible
Lesson 1527Ignoring Network Effects
Accept or reject
if balance is good, proceed; if not, generate a new randomization
Lesson 1492Rerandomization and Practical Implementation
Accept trade-offs
consciously
Lesson 2121Timeboxing and Deadlines
Acceptable error types
Is a false positive worse than a false negative?
Lesson 2117Defining 'Good Enough' with Stakeholders
Access controls
Limit who can use sensitive features (building on GDPR principles you learned)
Lesson 1925Mitigation Strategies and Responsible Disclosure
Access Date
When you retrieved the data.
Lesson 2063Essential Metadata to Capture
Access method
How you retrieved it (SQL query, API call, manual download, automated script)
Lesson 1161Documenting Data Sources
Accessibility
Design for colorblind viewers, provide alt text, and use clear labels.
Lesson 1247The Ethics of Visualization Design
Lesson 2086Stage 2: Data Acquisition and Assessment
Accounting for censored observations
by including them in the "at-risk" count up until they're censored, then removing them
Lesson 809Introduction to the Kaplan-Meier Estimator
Accuracy SLO
"Data quality checks will pass with <0.
Lesson 1860SLA and SLO Definitions
Accuracy varies
Ambiguous addresses ("Main Street") return less precise results
Lesson 1315Geocoding and Reverse Geocoding
ACF
At lag k, the Autocorrelation Function measures the total correlation between a time series and a copy of itself shifted back k periods.
Lesson 728PACF vs ACF: Key Differences
Lesson 733Using ACF and PACF Together
Lesson 798SARIMA Model Selection
ACF of residuals
, you want to see:
Lesson 786ACF and PACF of Residuals
ACF plots
decay very slowly (indicating non-stationarity)
Lesson 734Why Differencing and Detrending Matter
ACF/PACF clues
When your first-differenced ACF still shows slow decay
Lesson 736Higher-Order Differencing
ACF/PACF of residuals
Should show no significant spikes (all patterns captured)
Lesson 799Fitting and Diagnosing SARIMA Models
Acknowledge the constraint
with stakeholders—don't promise ML magic without the ingredients.
Lesson 2124Insufficient or Low-Quality Data
Acknowledge the limitation
State that the intercept has no practical interpretation in your context
Lesson 526When the Intercept Has No Meaning
Acknowledge uncertainty
Use phrases like "on average" or "typically" rather than absolute claims
Lesson 530Communicating Results to Non-Technical Audiences
Acknowledging limitations
Honest interpretation includes caveats—data gaps, model assumptions, confidence intervals.
Lesson 2090Stage 6: Interpretation and Insight Generation
Acquisition channels
are the various pathways through which potential users discover and arrive at your product, website, or service.
Lesson 1711What Are Acquisition Channels?
Actionable across teams
– Different departments can influence it through their work
Lesson 1604What is a North Star Metric?
Active Customers
Engaged users who regularly use your product or make repeat purchases.
Lesson 1704Customer Lifecycle Stages
Actual value (Yᵢ)
= what you measured
Lesson 538What Are Fitted Values?
Actual(t)
is the most recent observed value
Lesson 758Simple Exponential Smoothing (SES)
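For context, the standard simple exponential smoothing update is Forecast(t+1) = α · Actual(t) + (1 − α) · Forecast(t). The sketch below applies it in plain Python. Seeding the first forecast with the first observation is an assumption made here for illustration (other initializations are common), and `ses_forecasts` is a hypothetical helper name.

```python
def ses_forecasts(actuals, alpha, initial_forecast=None):
    """Return one-step-ahead forecasts; element t is the forecast for period t + 1."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    if not actuals:
        raise ValueError("actuals must not be empty")

    # Assumption: start from the first observation unless a seed is supplied.
    forecast = actuals[0] if initial_forecast is None else initial_forecast
    forecasts = []
    for actual in actuals:
        # Blend the most recent observed value with the previous forecast.
        forecast = alpha * actual + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts


demand = [10.0, 12.0, 11.0]
next_period = ses_forecasts(demand, alpha=0.5)
print(next_period)
```

A larger `alpha` makes the forecast track Actual(t) more closely; a smaller one leans on the previous forecast.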
Adapt immediately
Change a threshold or add a condition in minutes
Lesson 2128Data Distribution Shifts Frequently
Add a legend
Always include a size legend so viewers can interpret bubble magnitudes
Lesson 1229Bubble Charts for Three Variables
Add candidate predictors
one at a time or in meaningful blocks (e.
Lesson 703Sequential Model Building Strategy
Add context
Compare the effect size to something meaningful
Lesson 530Communicating Results to Non-Technical Audiences
Add redundant encoding
Don't rely on color alone.
Lesson 1248Color Blindness and Color Palette Design
Adding a constant
If you add a constant *c* to a random variable *X*, the expected value increases by exactly *c*: E[X + c] = E[X] + c.
Lesson 149Properties of Expectation and Variance
Additional detail
on methodology (statistical tests used, data cleaning steps)
Lesson 1949Anticipating Questions: Building in Appendices
Additive changes
are safest: adding new columns doesn't break existing queries that don't reference them.
Lesson 1876Schema Evolution and Backwards Compatibility
Additive models
assume components are added together:
Lesson 743Additive vs Multiplicative Models
Additive seasonality
means seasonal fluctuations stay roughly constant in size regardless of the data's level.
Lesson 766Additive vs Multiplicative Seasonality
Adds a penalty
Includes a penalty term for each additional change-point to avoid over-segmenting (similar to regularization in regression)
Lesson 1416PELT Algorithm: Pruned Exact Linear Time
Adequate range
Data should span a reasonable range of values
Lesson 480Scatterplots and Visual Assessment
Adequate sample size
Expected frequency ≥ 5 in every category
Lesson 419Assumptions and Minimum Expected Frequencies
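One way to check this rule before running a chi-square test is to compute the expected counts directly. The sketch below assumes a test of independence on a contingency table, where each expected count is (row total × column total) / grand total. The function names and the example tables are hypothetical.

```python
def expected_frequencies(observed):
    """Expected counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def meets_minimum_expected(observed, minimum=5):
    """True when every expected count is at least `minimum`."""
    return all(
        cell >= minimum
        for row in expected_frequencies(observed)
        for cell in row
    )


small_table = [[12, 8], [3, 2]]      # second row is too sparse
large_table = [[20, 30], [30, 20]]   # every expected count is 25

print(expected_frequencies(small_table), meets_minimum_expected(small_table))
print(expected_frequencies(large_table), meets_minimum_expected(large_table))
```

Note that the rule applies to expected counts, not observed ones, so a table can contain a small observed cell and still pass.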
Adjust next sprint
Based on feedback, decide whether to refine features, try new models, or pivot
Lesson 2113Timeboxing and Sprint Planning for Data Projects
Adjust the scale
to emphasize meaningful differences; sometimes a gradient from 0-100% works, other times 20-80% highlights actionable variations
Lesson 1649Visualizing Cohort Data with Heatmaps
Adjusted fences
use a skewness coefficient to shift boundaries asymmetrically
Lesson 1388Limitations and Alternatives to IQR Detection
Adjusted p-values
corrected for multiple testing (Tukey, Bonferroni, etc.)
Lesson 462Interpreting and Reporting Post-Hoc Results
Adjusted R-squared
Quick model comparisons, reporting to non-technical audiences, when interpretability matters most.
Lesson 616Adjusted R-Squared vs Other Criteria
Lesson 626Nested vs Non-Nested Models
Lesson 632Parsimony and Occam's Razor
Adjustment
means including confounders directly in a regression model as additional predictors.
Lesson 1431Controlling for Confounders: Adjustment
Administrative records
Lists that don't capture informal workers or undocumented individuals
Lesson 249Coverage Error and Undercoverage
Administrative selection
gatekeepers assign treatment based on need or eligibility
Lesson 1444Selection Bias and Treatment Assignment
Adoption rate
= (Users who've used feature at least once) / (Total active users)
Lesson 1696Feature Adoption and Usage Frequency
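The formula above translates directly into a few lines of Python. This sketch assumes you have two collections of user IDs, one for feature usage events and one for active users; `adoption_rate` and the sample IDs are hypothetical.

```python
def adoption_rate(feature_user_ids, active_user_ids):
    """Share of active users who have used the feature at least once."""
    active = set(active_user_ids)
    if not active:
        raise ValueError("need at least one active user")
    # Sets collapse repeat usage, so each user counts once,
    # and the intersection drops users who are not currently active.
    adopters = set(feature_user_ids) & active
    return len(adopters) / len(active)


rate = adoption_rate(
    feature_user_ids=["a", "a", "b", "z"],  # "a" used it twice; "z" is not active
    active_user_ids=["a", "b", "c", "d"],
)
print(rate)
```

Restricting the numerator to active users keeps the rate between 0 and 1; whether that restriction matches your metric definition is a choice to confirm with stakeholders.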
Adstock
(also called "advertising stock") is a transformation that captures two key phenomena:
Lesson 1739Adstock and Carryover Effects
Advanced composition
Mathematical techniques can provide tighter bounds, so the total might be less than simple addition
Lesson 1900Privacy Budget and Composition
Advantage
Easiest to implement, unbiased in expectation.
Lesson 1437Randomization Mechanisms
Adverse Impact Ratio
extends the 80% rule with confidence intervals
Lesson 1890Measuring Disparate Impact
Advocacy in analyst's clothing
Using your technical authority to push personal or organizational agendas
Lesson 1926The Honest Broker Role
Aesthetics (aes)
How variables map to visual properties like x-position, y-position, color, size, or shape
Lesson 1339What is the Grammar of Graphics?
Lesson 1340The Seven Layers of Grammar
Affected by extremes
One very high or very low value (an outlier) can pull the mean in that direction
Lesson 39The Mean (Arithmetic Average)
After testing
If your first-differenced series still fails stationarity tests (ADF, KPSS)
Lesson 736Higher-Order Differencing
Age groups
A person in the "18-25" bracket cannot also be in the "26-35" bracket
Lesson 81Mutually Exclusive Events
Age in customer data
Even if someone's age of 150 falls within 3 standard deviations (passing the Z-score test), you know it's invalid—humans don't live that long.
Lesson 75Domain-Specific Outlier Rules
Aggregate Functions
– Count or measure each bin
Lesson 912Fundamental Difference: Filter Timing
Aggregate functions calculate
Totals, averages, counts are computed for each group
Lesson 915Combining WHERE and HAVING
Aggregated summaries
Storing `total_order_value` on a customer record instead of calculating it from order lines each time.
Lesson 1074Duplicating Data Across Tables
Aggregation
Switch to hexbin maps or heatmaps for dense datasets
Lesson 1310Point Maps and Scatter Plots on Maps
Aggregations
Sum, count, mean calculations that don't need all data simultaneously
Lesson 1800Chunked Reading with read_csv
Agreed upon
Stakeholders buy in *before* analysis begins
Lesson 2094Defining Success Metrics Upfront
Agreement is confidence
When both tests agree (ADF rejects + KPSS doesn't reject), you can confidently call the series stationary.
Lesson 718Interpreting Stationarity Test Results
Agricultural data
might follow growing seasons that vary by region
Lesson 746Choosing Seasonal Period
AIC (Akaike Information Criterion)
Together with BIC (Bayesian Information Criterion), a score that penalizes models for using too many parameters while rewarding good fit to the data.
Lesson 781Information Criteria: AIC and BIC
AIC/BIC
Formal model selection procedures, comparing non-nested models, automated selection algorithms.
Lesson 616Adjusted R-Squared vs Other Criteria
Airflow
offers multiple ways to declare dependencies:
Lesson 1843Declaring Dependencies in Orchestration Tools
Alation
Together with Apache Atlas, one of the tools that maintain centralized inventories of your data assets.
Lesson 1164Tools for Lineage Tracking
Alertness
(coffee directly increases alertness)
Lesson 1469Building a Simple Causal DAG
Algorithm initialization
Neural networks, k-means clustering, random forests all start with random states
Lesson 2055Why Randomness Matters in Data Science
Algorithmic amplification of harm
occurs when automated systems take existing problems—bias, misinformation, manipulation, or discrimination—and multiply their impact exponentially.
Lesson 1923Algorithmic Amplification of Harm
Align with business reality
Reflect how your sales team actually closes deals
Lesson 1731Custom Rule-Based Attribution
All assumptions met
→ Proceed with standard parametric t-test
Lesson 383Diagnostic Workflow: When to Proceed or Switch Tests
Allocate budget wisely
Identify which touchpoints assist vs.
Lesson 1719The Customer Journey and Touchpoints
Allow multiple users
to access data simultaneously
Lesson 842What is a Database?
Allowed Values
Valid ranges for numeric data or enumerated categories
Lesson 2064Creating Data Dictionaries
Alpha
controls how much weight recent observations get when updating the **baseline level** of your series.
Lesson 769Smoothing Parameters: Alpha, Beta, Gamma
Alpha (α)
– your significance level (usually 0.
Lesson 344Power Analysis in Study Design
Alt text
(alternative text) is a brief written description of a visualization that screen readers can announce.
Lesson 1250Text Alternatives and Screen Reader Compatibility
Alternative
The interaction coefficient differs from zero (it matters)
Lesson 654Testing Interaction Significance
Alternative (Hₐ)
The variables are associated
Lesson 433Conducting Fisher's Exact Test
Alternative analyses
you considered but didn't choose (and why)
Lesson 1949Anticipating Questions: Building in Appendices
Always include confidence intervals
, not just point estimates.
Lesson 1928Communicating Uncertainty Honestly
Always positive
Log-normal variables are strictly greater than zero
Lesson 178Log-Normal Distribution: Definition and Properties
Always qualify columns
in multi-table queries, even when names don't conflict—it makes your intent crystal clear
Lesson 922Selecting Columns from Joined Tables
Always specify join conditions
that relate the tables using foreign key relationships
Lesson 955Avoiding Cartesian Products
Always state units
when reporting slopes ("$150 per square foot," not just "150")
Lesson 525Units and Scale in Interpretation
Always try this first
Use pandas' built-in operations that work on entire columns at once.
Lesson 1806Parallel Processing with apply() Alternatives
Always unique
Unlike `RANK()` or `DENSE_RANK()`, ties receive different numbers based on arbitrary order
Lesson 1007ROW_NUMBER(): Assigning Unique Row Numbers
Always use parentheses
when mixing `AND` and `OR`, even if precedence would give the correct result.
Lesson 870Operator Precedence and Parentheses
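To see why the parentheses matter, recall that SQL evaluates `AND` before `OR`. The sketch below runs the same filter with and without parentheses using Python's built-in `sqlite3` module. The `products` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("novel", "book", 25.0),
        ("comic", "book", 5.0),
        ("puzzle", "toy", 8.0),
        ("robot", "toy", 40.0),
    ],
)

# Without parentheses, AND binds first:
# category = 'book' OR (category = 'toy' AND price < 10)
without_parens = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM products "
        "WHERE category = 'book' OR category = 'toy' AND price < 10 "
        "ORDER BY name"
    )
]

# With parentheses, the price filter applies to both categories.
with_parens = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM products "
        "WHERE (category = 'book' OR category = 'toy') AND price < 10 "
        "ORDER BY name"
    )
]
conn.close()

print(without_parens)
print(with_parens)
```

The unparenthesized query lets the 25.0 novel through because the price condition attaches only to toys.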
Always-valid inference
provides p-values and confidence intervals that remain statistically valid *no matter when you stop* — whether you check once, continuously, or at random times you didn't plan ahead.
Lesson 1513Always-Valid Inference and Confidence Sequences
Amazon
Number of purchases per month — reflects both customer satisfaction and business sustainability.
Lesson 1606Examples of North Star Metrics by Industry
Ambiguity kills analysis
If you're studying "time to employee turnover," does the clock start at date of hire, end of training, or first promotion?
Lesson 803Defining the Event and Time Origin
Amplify historical inequities
baked into training data
Lesson 1888Protected Classes and Sensitive Attributes
Analysis becomes consistent
You always know where to find variables
Lesson 1142What is Tidy Data?
Analysis cells
Alternate between explaining your approach (markdown) and executing it (code)
Lesson 1982Literate Programming with Notebooks
Analysis plan
Statistical test you'll use, significance level (usually α = 0.
Lesson 1485Documentation and Pre-Registration
Analytical
"Which customer segments have the highest lifetime value, and what acquisition channels bring us those segments?"
Lesson 2093Translating Business Questions into Analytical Questions
Analytical goal
Are you comparing values, showing distribution, revealing relationships, tracking change over time, or displaying composition?
Lesson 1230Choosing the Right Chart Type
Analytics
You need to understand trends and make informed decisions
Lesson 4Data Science vs Data Analytics vs Business Intelligence
Anchor Member
The starting point—your initial row(s) with no dependencies.
Lesson 996Recursive CTEs: Introduction
Anderson-Darling test
is another statistical test that checks whether your data follows a normal distribution, but with a special feature: it gives **more weight to the tails** (the extreme values at both ends) than the K-S test does.
Lesson 207Anderson-Darling Test
Lesson 449Normality of Residuals
Animation
Show changes over time or across a third variable sequentially
Lesson 1329Effective Use and Pitfalls of 3D Visualizations
Anonymize rather than delete
where possible for retained data
Lesson 1909Right to Erasure and Data Retention Policies
Anonymous participation options
when power dynamics exist
Lesson 1918Special Populations and Vulnerable Groups
ANOVA framework
(Analysis of Variance), which decomposes total variation into parts explained by the model versus leftover residuals.
Lesson 618Global F-Test for Overall Model Significance
Anscombe's quartet
the famous cautionary tale where four datasets have identical summary statistics but wildly different relationships that only visualization reveals.
Lesson 1222Scatter Plots for Relationships
Answers to likely questions
based on past presentations or stakeholder concerns
Lesson 1949Anticipating Questions: Building in Appendices
Anticipation
occurs when units change behavior *before* treatment actually occurs.
Lesson 1458Common DiD Pitfalls
ANY
Returns `TRUE` if the comparison is true for *at least one* value returned by the subquery
Lesson 963ANY and ALL Operators
Lesson 1506Benjamini-Hochberg Procedure
Any difference
(two-sided/non-directional)
Lesson 345Directionality in Hypothesis Testing
Any matrix data
Where row/column relationships matter
Lesson 1224Heatmaps and Correlation Matrices
Any shape
The original population can be uniform, exponential, Poisson, or anything else.
Lesson 218What the Central Limit Theorem States
Apache Airflow
Along with Prefect and Dagster, logs every execution step.
Lesson 1164Tools for Lineage Tracking
Apache Atlas
maintains a centralized inventory of your data assets.
Lesson 1164Tools for Lineage Tracking
Apache Spark
emerged as a faster alternative, keeping data in memory when possible and supporting iterative algorithms (essential for machine learning).
Lesson 1764The Big Data Technology Landscape
Aperiodicity
The chain doesn't get stuck in cycles
Lesson 1589Markov Chains: The Foundation of MCMC
API (Application Programming Interface)
is like a restaurant menu for data.
Lesson 21APIs and Web Scraping
API limits
Most services cap free requests per day
Lesson 1315Geocoding and Reverse Geocoding
APIs
Requesting data through structured interfaces
Lesson 11Data Collection and Acquisition
Appendices
Technical details, additional charts, validation metrics
Lesson 1966Report Structure and Executive Summary
Application Logic Burden
Unlike foreign key constraints that enforce referential integrity automatically, you must manually keep denormalized data consistent through careful application code or database triggers.
Lesson 1075Handling Data Consistency in Denormalized Schemas
Apply a color scale
where higher retention rates get warmer colors (red, orange) and lower rates get cooler colors (blue, green)
Lesson 1649Visualizing Cohort Data with Heatmaps
Apply conditional logic
"If the first touch was organic search AND a demo was booked, give search 40%"
Lesson 1731Custom Rule-Based Attribution
Apply domain knowledge
could this happen in reality?
Lesson 1209Outlier Detection and Investigation
Apply information criteria
Calculate AIC and BIC to balance fit and complexity
Lesson 633Practical Model Selection Strategy
Apply insights
Set warranty periods just beyond the steep part of the failure curve; flag high-risk product lines
Lesson 837Product Warranty and Failure Analysis
Apply intervention
Only the treatment group sees the new feature, pricing, or campaign
Lesson 1641Isolating Effects with Control Groups
Apply removal effect
Remove one channel completely, recalculate conversion probability
Lesson 1733Markov Chain Attribution Models
Apply the correction factor
The `n/((n-1)(n-2))` part adjusts for sample size, making the estimate more accurate for smaller datasets.
Lesson 65Calculating Skewness
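The correction factor is easiest to see in code. The sketch below assumes the adjusted Fisher-Pearson form, in which deviations are standardized by the sample standard deviation (n − 1 denominator), cubed, summed, and scaled by `n/((n-1)(n-2))`. The source does not state which standard deviation is used, so treat that choice, and the helper name `sample_skewness`, as assumptions.

```python
import statistics


def sample_skewness(values):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed_sum = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_sum


symmetric = sample_skewness([1, 2, 3, 4, 5])      # balanced around the mean
right_skewed = sample_skewness([1, 1, 1, 1, 10])  # one large value stretches the right tail
print(symmetric, right_skewed)
```

Symmetric data gives a skewness of about zero, while a long right tail gives a positive value.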
Appropriate dimensions
for target medium
Lesson 1369Publication-Ready Plot Styling
AR (autoregressive) processes
and determining their order.
Lesson 731PACF for AR Process Identification
AR process
PACF cuts off sharply; ACF decays gradually
Lesson 731PACF for AR Process Identification
AR(1)
Only the first lag is significant; all others fall within the confidence bounds
Lesson 731PACF for AR Process Identification
Lesson 774Autoregressive (AR) Models
Architectural discussion
Sharing skeleton code to validate design decisions
Lesson 2029Draft Pull Requests and WIP Workflows
Area or volume
(acceptable since ratios are meaningful: "twice as much")
Lesson 1238Matching Encoding to Data Type
Lesson 1240Area and Volume Distortions
ARMA
models combine both components, so their PACF shows **gradual decay** (influenced by the MA part) rather than a clean cutoff.
Lesson 732PACF Patterns for Common Models
ARPU
(Average Revenue Per User) = Monthly Recurring Revenue / Number of Customers
Lesson 1666LTV for Subscription Businesses
ARR
is MRR × 12, representing the annualized value of subscriptions.
Lesson 1628SaaS Metrics: MRR, ARR, and Logo Churn
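The ARPU and ARR definitions are both simple arithmetic on monthly recurring revenue (MRR). A minimal sketch, with hypothetical function names and made-up figures:

```python
def arpu(monthly_recurring_revenue, customers):
    """Average Revenue Per User = MRR / number of customers."""
    if customers <= 0:
        raise ValueError("customers must be positive")
    return monthly_recurring_revenue / customers


def arr(monthly_recurring_revenue):
    """Annual Recurring Revenue = MRR x 12."""
    return monthly_recurring_revenue * 12


mrr = 50_000  # hypothetical monthly recurring revenue
print(arpu(mrr, customers=1_000))
print(arr(mrr))
```

Both figures inherit whatever is included in MRR, so one-off fees should be excluded before calling either function.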
Artists
Everything visible on the plot—lines, text, patches, images—are "Artist" objects.
Lesson 1255The Anatomy of a Matplotlib Figure
Ask
What happens if I reject H₀ when it's actually true?
Lesson 334Setting Alpha: Choosing Your Significance Level
Ask "Why" repeatedly
Use the "Five Whys" technique.
Lesson 2102Understanding Stakeholder Goals and Constraints
Ask clarifying questions
When told to "make it more accurate," probe what accuracy means in their context—speed?
Lesson 2105Translating Between Technical and Business Language
Ask domain experts
what size effect would matter
Lesson 609Practical vs Statistical Significance
Ask questions, don't demand
"Have you considered handling NaN values here?"
Lesson 2024Code Review Best Practices
Assess completeness
Are there known gaps, missing periods, or quality issues?
Lesson 2098Identifying Data Availability Gaps Early
Assess variance equality
compare standard deviations or use Levene's test (not yet covered formally, but intuitive: do the spreads look similar?)
Lesson 290Assumptions and Diagnostics for Difference Intervals
Assigns new customers
to the right segment as soon as they arrive
Lesson 1710Operationalizing Segments: Scoring and Deployment
Assumes Normal Distribution
Z-scores interpret best when data follows a normal distribution.
Lesson 201Z-Score Applications and Limitations
Assuming NULLs = zeros
They aren't: AVG skips NULL values instead of counting them as zero.
Lesson 884AVG: Computing Averages
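The difference is easy to confirm with Python's built-in `sqlite3` module. In this sketch the `scores` table and its values are hypothetical; `COALESCE(score, 0)` is used only to show what the result would be if NULLs really were treated as zeros.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10.0), ("b", None), ("c", 20.0)],  # None is stored as NULL
)

# AVG(score) divides by the 2 non-NULL rows;
# AVG(COALESCE(score, 0)) divides by all 3 rows.
avg_ignoring_nulls, avg_nulls_as_zero = conn.execute(
    "SELECT AVG(score), AVG(COALESCE(score, 0)) FROM scores"
).fetchone()
conn.close()

print(avg_ignoring_nulls, avg_nulls_as_zero)
```

Which of the two numbers is correct depends on what a missing score means in your data, so make the choice explicit in the query.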
Assumption testing
Early scoping involves assumptions about what matters.
Lesson 2109Why Data Science is Inherently Iterative
Assumption Validation
means checking whether your model's prerequisites are met.
Lesson 2089Stage 5: Model Development and Validation
Assumptions
"Assumed all temperature readings are in Fahrenheit based on metadata; values outside -50°F to 150°F flagged as suspicious"
Lesson 1162Documenting Transformations
Lesson 2100Documenting Assumptions and Open Questions
Assumptions are severely violated
extreme outliers dominate, variance explodes as X increases, or observations aren't independent
Lesson 555When Regression Is and Isn't Appropriate
Assumptions made
Did you assume missing data was random?
Lesson 1917Transparency in Analysis and Models
Assumptions matter more
Violations of homogeneity of variance become more problematic
Lesson 468Balanced vs Unbalanced Designs
Asymmetric
Unlike the normal distribution, it's not symmetric around its mean
Lesson 178Log-Normal Distribution: Definition and Properties
Asymptotic
(the tails approach but never touch zero—technically possible values extend infinitely in both directions)
Lesson 169The Normal Distribution: Definition and Properties
Asymptotic p-values
rely on large-sample approximations (like the Central Limit Theorem).
Lesson 322Exact vs Asymptotic P-Values
Async
Anticipate questions proactively.
Lesson 1957Adapting Delivery Format: Live vs Async
Async formats
Must be self-explanatory.
Lesson 1957Adapting Delivery Format: Live vs Async
At most 2 accept
Sum P(X=0) + P(X=1) + P(X=2)
Lesson 130Calculating Binomial Probabilities
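The sum above can be computed with the binomial probability mass function, P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The sketch below uses only the standard library. The values n = 5 and p = 0.5 are illustrative and do not come from the lesson.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k_max, n, p):
    """P(X <= k_max) = P(X=0) + P(X=1) + ... + P(X=k_max)."""
    return sum(binomial_pmf(k, n, p) for k in range(k_max + 1))


# "At most 2 accept" out of 5 offers, each accepted with probability 0.5
p_at_most_2 = prob_at_most(2, n=5, p=0.5)
print(p_at_most_2)
```

With these inputs the three terms are 1/32, 5/32, and 10/32, which sum to 0.5.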
At-Risk Customers
Previously active users showing warning signs—declining usage, skipped payments, reduced session frequency, or negative support interactions.
Lesson 1704Customer Lifecycle Stages
Atomicity
All operations in a transaction succeed or all fail—no partial completion
Lesson 1110What Are Database Transactions?
ATT
the average effect of treatment *for those who actually received treatment*.
Lesson 1451Estimating Treatment Effects from Matched Samples
Attribute credit
The difference in conversion probability represents that channel's contribution
Lesson 1733Markov Chain Attribution Models
Attribution
You connect marketing spend to actual outcomes—which campaign drove that cohort with 60% Day-30 retention?
Lesson 1711What Are Acquisition Channels?
Lesson 1736MMM vs Attribution: Key Differences
Lesson 1744Incrementality vs Attribution
Attribution decay
models how influence weakens over time.
Lesson 1639Time Windows and Attribution Decay
Audience-specific reports
Executive summary vs technical deep-dive
Lesson 1984Parameterized Reports
Audit backups
erasure applies there too (eventually)
Lesson 1909Right to Erasure and Data Retention Policies
Audit regularly
to catch unauthorized secondary uses
Lesson 1915Secondary Use and Scope Creep
Audit trail
See who changed what, when, and why through commit messages
Lesson 1990What is Version Control and Why Git?
Auditing
When stakeholders question your findings, you need to demonstrate data provenance.
Lesson 2062Why Data Source Documentation Matters
Augmented Dickey-Fuller (ADF) test
on your transformed series.
Lesson 741Testing Stationarity After Transformation
Augmented Dickey-Fuller test
gives you a rigorous, statistical answer.
Lesson 716Augmented Dickey-Fuller Test
Authentication Failures
occur when your credentials are wrong or insufficient.
Lesson 1093Troubleshooting Connection Issues
Author
Name and email of who made the commit
Lesson 1999Viewing Commit History
Auto-correct
known issues with logging (caution required)
Lesson 1826Data Validation and Schema Enforcement
Autocommit mode
Each SQL statement is automatically committed (saved) immediately after it runs.
Lesson 1111Autocommit Mode vs Explicit Transactions
Autocorrelation Function (ACF)
takes this idea further by systematically calculating these relationships at multiple different lags.
Lesson 720The Autocorrelation Function (ACF)
Automate the process
write scripts that loop through randomizations and check balance
Lesson 1492Rerandomization and Practical Implementation
Automated collection
Setting up systems to continuously gather data
Lesson 11Data Collection and Acquisition
Automated pipelines
from raw data to final output
Lesson 1981What Makes a Report Reproducible?
Automated validation frameworks
solve this by letting you define expectations once and apply them consistently across datasets, pipelines, and time.
Lesson 1158Automated Validation Frameworks
Automatic Deduplication
Duplicate rows are removed automatically
Lesson 999UNION: Combining Distinct Results
Automatic derivatives
Calculating gradients for optimization becomes straightforward
Lesson 670Why Exponential Family Matters for GLMs
Automating documentation
means writing scripts that inspect your data and generate complete documentation automatically.
Lesson 2067Automating Documentation with Code
AutoRegressive Integrated Moving Average
The full name behind the acronym ARIMA.
Lesson 773Introduction to ARIMA: Components and Notation
Availability
Actual operating time ÷ planned production time (accounting for breakdowns, changeovers)
Lesson 1636Manufacturing Metrics: OEE, Yield, and Cycle Time
Availability SLO
"The pipeline will successfully complete 99.
Lesson 1860SLA and SLO Definitions
Average balance method
Use `(Start + End) / 2` to account for growth
Lesson 1671Churn Rate Calculation Methods
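A minimal sketch of the average balance method follows. It assumes the numerator is the number of customers lost during the period, which the entry above does not spell out, and the function name and figures are hypothetical.

```python
def churn_rate_average_balance(customers_lost, start_count, end_count):
    """Churn rate with the average of start and end customer counts as the denominator."""
    average_customers = (start_count + end_count) / 2
    if average_customers <= 0:
        raise ValueError("average customer count must be positive")
    return customers_lost / average_customers


# Hypothetical month: 1,000 customers at the start, 1,200 at the end, 50 lost.
rate = churn_rate_average_balance(customers_lost=50, start_count=1_000, end_count=1_200)
print(round(rate, 4))
```

Dividing by the starting count alone would give 5.0% here; averaging the two balances gives about 4.5% because the customer base grew during the period.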
Average Purchase Value
is the mean revenue per transaction.
Lesson 1663Simple LTV: Average Revenue Per Customer
Average those cubes
Sum them all up; this is your "third moment."
Lesson 65Calculating Skewness
Average Treatment Effect (ATE)
, which answers: "On average, how much did the treatment change the outcome compared to no treatment?"
Lesson 1440Treatment Effect Estimation
AVG
An aggregate function used, like MIN and MAX, together with GROUP BY to create rich summaries of grouped data.
Lesson 892GROUP BY with Different Aggregate Functions
Lesson 894NULL Values in GROUP BY
AVG(salary)
calculates average for each department
Lesson 903Combining WHERE and HAVING
Avoid "security through obscurity"
Don't assume hiding risks makes them disappear
Lesson 1925Mitigation Strategies and Responsible Disclosure
Avoid conditioning on colliders
which would create spurious associations
Lesson 1475Using DAGs to Guide Analysis
Avoid conditioning on mediators
on the causal path — which would block part of the effect you want to measure
Lesson 1475Using DAGs to Guide Analysis
Avoid extrapolation
Don't use your model to predict Y values for X values far from your observed range
Lesson 526When the Intercept Has No Meaning
Avoid manipulation
You've learned about truncated axes, area distortions, and cherry-picked ranges—these aren't just technical errors, they're ethical violations when done knowingly.
Lesson 1247The Ethics of Visualization Design
Avoid problematic pairs
Red-green, blue-purple, and light green-yellow combinations are particularly troublesome.
Lesson 1248Color Blindness and Color Palette Design
Avoid redundant evaluations
Don't call the same function multiple times within different WHEN clauses.
Lesson 1037CASE Best Practices and Performance
Avoid unnecessary CTEs
If a simple subquery suffices and is clearer, use it
Lesson 997CTE Best Practices and Performance
Avoiding double-counting
When your data has intentional duplicates but you need unique-value statistics
Lesson 887Aggregates with DISTINCT
Axis
objects—the x-axis and y-axis with their tick marks, labels, and scales.
Lesson 1255The Anatomy of a Matplotlib Figure
Axis Limits
Control what range of data appears using `set_xlim()` and `set_ylim()`.
Lesson 1270Customizing Axes: Labels, Limits, and Scales
Azimuth
The horizontal rotation angle around your plot.
Lesson 1326Viewing Angles and Projection Types

B

Backfilling corrupts data
Re-processing historical data could add duplicate aggregations
Lesson 1847What is Idempotency?
Background geoms
Large shapes, reference regions, or filled areas
Lesson 1355Layer Order and Plot Composition
Bad
`"Pipeline failed processing file"`
Lesson 1857Logging Best Practices
Bad (chronological)
"We collected transaction data from 2020-2024, cleaned 847 outliers, ran correlation analysis, built three models, and found churn is predicted by login frequency."
Lesson 1942The Pyramid Principle: Starting with the Conclusion
Bad (curved pattern)
Suggests non-linear relationship; linear regression isn't appropriate
Lesson 557The Residuals vs Fitted Values Plot
Bad (funnel shape)
Indicates heteroscedasticity; variance increases or decreases with fitted values
Lesson 557The Residuals vs Fitted Values Plot
Bad (outliers)
Points far from the rest may be influential observations
Lesson 557The Residuals vs Fitted Values Plot
Bad hypothesis
"The new button will improve engagement."
Lesson 1479Formulating Hypotheses
Balance
means mixing high-volume/low-margin channels with low-volume/high-ROI ones
Lesson 1716Channel Mix and Portfolio Thinking
Balance Index Overhead
Every index speeds reads but slows writes.
Lesson 1086Index Maintenance and Monitoring
Balance inference
remember that rerandomization changes your p-values slightly (though often negligibly in practice)
Lesson 1492Rerandomization and Practical Implementation
Balance point
The mean is the value where positive and negative distances from it cancel out perfectly
Lesson 39The Mean (Arithmetic Average)
Balance Tables
Create side-by-side summaries showing mean (or proportion) of each covariate in treatment vs.
Lesson 1491Covariate Balance and Diagnostics
Balancing groups
Good matches ensure treatment and control groups look similar *before* treatment
Lesson 1445The Matching Framework
Bars
(`geom_bar` or `geom_col`) showing magnitudes as vertical rectangles
Lesson 1342Geometric Objects (geoms)
Bartlett's Test
is more **powerful** when your data is truly normal, but it's very sensitive to non-normality—it might reject equal variances simply because your data isn't perfectly bell-shaped, not because variances actually differ.
Lesson 380Testing Equal Variances: Levene's and Bartlett's Tests
Baseball batting averages
(hits per at-bat)
Lesson 184Beta Distribution: Bounded Between 0 and 1
Baseline variance
Higher variability requires more data
Lesson 1692Statistical Significance and Iteration
Basemaps
solve this by providing pre-rendered background images that give your audience familiar reference points—like roads, rivers, city names, and borders.
Lesson 1314Basemaps and Map Tiles
Batch
(hours-to-days) permits scheduled ETL/ELT runs during off-peak hours.
Lesson 1825Designing Pipeline Architecture
Batch pipelines
work like a postal service—collect mail throughout the day, then deliver it all at scheduled times (hourly, daily, nightly).
Lesson 1824Batch vs Streaming Pipelines
Bayesian
"There's a 95% probability the true conversion rate is between 12% and 18%."
Lesson 1564Comparing Bayesian and Frequentist Proportion Inference
Bayesian A/B testing
treats the conversion rate as a random variable with a probability distribution.
Lesson 1580Bayesian vs Frequentist A/B Testing
Bayesian inference
is the extension of this idea into a full statistical methodology.
Lesson 116From Bayes' Theorem to Bayesian Inference
Bayesian Information Criterion (BIC)
is a model selection tool that helps you choose between competing regression models.
Lesson 630Bayesian Information Criterion (BIC)
Bayesian interpretation
treats probability as a **degree of belief** or **quantification of uncertainty**.
Lesson 1540Comparing Bayesian and Frequentist Interpretations
Be honest about uncertainty
"High confidence" or "preliminary estimate" builds trust without undermining your conclusion.
Lesson 1944Executive Summary Best Practices
Be selective
Test only coefficients you care about based on theory, not all of them exploratorily.
Lesson 624Multiple Testing Considerations
Be specific and actionable
Instead of "this is confusing," try "Consider renaming `df2` to `customer_features` to clarify what this dataframe contains."
Lesson 2024Code Review Best Practices
Bed utilization rate
= (occupied bed-days / available bed-days) measures capacity efficiency.
Lesson 1633Healthcare Metrics: Patient Outcomes and Operational Efficiency
Behavior
feature usage, purchase frequency, engagement level
Lesson 1701What is Customer Segmentation?
Below maximum
`WHERE value < (SELECT MAX(value) FROM table)`
Lesson 964Subqueries with Aggregate Functions
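The subquery pattern above can be tried end to end with Python's built-in `sqlite3` module. This is a minimal sketch: the table name `readings` and its rows are hypothetical stand-ins for the `table` placeholder in the entry.

```python
import sqlite3

# In-memory database with a hypothetical `readings` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value INTEGER)")
conn.executemany(
    "INSERT INTO readings (value) VALUES (?)",
    [(3,), (7,), (10,), (10,)],
)

# The inner query finds the maximum once; the outer query keeps rows below it.
rows = conn.execute(
    "SELECT value FROM readings "
    "WHERE value < (SELECT MAX(value) FROM readings) "
    "ORDER BY value"
).fetchall()
below_max = [row[0] for row in rows]
conn.close()

print(below_max)
```

Both rows holding the maximum are excluded because the comparison is strict (`<`).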
Benchmarking salaries
across companies while maintaining confidentiality
Lesson 1903Secure Multi-Party Computation
Benchmarks
Compare against industry standards, baseline models, or competitors.
Lesson 1939Context and Comparison: Making Numbers MeaningfulLesson 1962Contextualizing Numbers
Benjamini-Hochberg
(exploratory, control FDR)
Lesson 1508Pre-Registration and Correction Strategy
Benjamini-Hochberg (BH) procedure
takes a different approach.
Lesson 1506Benjamini-Hochberg Procedure
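The entry does not spell out the steps, so the following is a minimal sketch of the standard Benjamini-Hochberg step-up rule: sort the p-values, find the largest rank k with p(k) ≤ (k/m)·q, and reject the k smallest. The function name `benjamini_hochberg`, the level `q`, and the p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True if rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest 1-based rank k whose p-value is at or below its threshold (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank

    # Reject every hypothesis ranked at or below max_k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


p_values = [0.001, 0.008, 0.039, 0.041, 0.20]   # made-up example values
decisions = benjamini_hochberg(p_values, q=0.05)
print(decisions)
```

Note that the rule looks for the largest qualifying rank, so a p-value can be rejected even if a smaller-ranked one missed its own threshold.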
Benjamini-Hochberg (FDR)
When you're exploring metrics and can tolerate some false positives
Lesson 1507Multiple Testing in A/B Test Variations
Bernoulli trial
is a single experiment or observation that can result in exactly two outcomes: we call one outcome "success" and the other "failure."
Lesson 123Bernoulli Trial Definition and PropertiesLesson 126From Bernoulli to Binomial: Multiple Trials
Bernoulli/Binomial
→ Logit link (log(p/(1-p)) = Xβ)
Lesson 676Canonical vs Non-Canonical Links
Best practice
Sort only when necessary for your analysis or presentation.
Lesson 880Performance Considerations and Best Practices
Beta
controls how quickly the **trend component** (upward or downward direction) updates.
Lesson 769Smoothing Parameters: Alpha, Beta, Gamma
Beta distribution
is the natural choice because it:
Lesson 1581Setting Priors for A/B Tests
Beta-Binomial conjugate pair
, your posterior is a Beta distribution: `Beta(α + successes, β + failures)`.
Lesson 1562Credible Intervals for Proportions
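The conjugate update `Beta(α + successes, β + failures)` is simple enough to compute by hand. The sketch below assumes a `Beta(2, 8)` prior (the prior mentioned elsewhere in this glossary) and made-up data of 15 successes and 85 failures; the helper name is hypothetical.

```python
def beta_binomial_posterior(alpha, beta, successes, failures):
    """Conjugate update: add successes to alpha and failures to beta."""
    return alpha + successes, beta + failures


# Prior Beta(2, 8), then 15 conversions out of 100 trials (illustrative numbers).
post_alpha, post_beta = beta_binomial_posterior(2, 8, successes=15, failures=85)

# The mean of a Beta(a, b) distribution is a / (a + b).
posterior_mean = post_alpha / (post_alpha + post_beta)
print(post_alpha, post_beta, round(posterior_mean, 4))
```

A credible interval would additionally need Beta quantiles, which the standard library does not provide; a statistics package would be the usual route for that step.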
Beta-Binomial model
(proportion problems), if your posterior is `Beta(α, β)`:
Lesson 1561Posterior Mean and Mode
Beta(2, 8) prior
(you think a conversion rate is probably low).
Lesson 1560Computing the Posterior Distribution
Beta(α, β) prior
representing your initial belief
Lesson 1560Computing the Posterior Distribution
Better analysis
When controlling for smoking status, the correlation disappeared or even reversed, showing coffee might be protective.
Lesson 1426Real-World Examples: Correlation vs Causation
Better objective
"Deliver a seamless first-time user experience by Q2"
Lesson 1609Setting Effective Objectives
Between Groups
(or "Treatment"): Variation explained by differences among your group means
Lesson 444The ANOVA Table
Between-group variance (numerator)
Measures how spread out the group means are from the overall mean
Lesson 440The F-Statistic and Its Distribution
Betweenness centrality
How often a node lies on shortest paths between others (the "bridges")
Lesson 1320Network Metrics and Visual Analysis
BI
You need regular reports on key business metrics
Lesson 4Data Science vs Data Analytics vs Business Intelligence
Bias
It ignores everything that happened after the first click—potentially undervaluing nurture campaigns and retargeting that actually closed the deal.
Lesson 1723Comparing Single-Touch Models
Bias and noise
Sensor errors, bot traffic, or sampling issues
Lesson 1762Extended Dimensions: Veracity and Value
Bias Correction
If your bootstrap distribution is systematically shifted from the sample statistic, BCa corrects for this
Lesson 304BCa Bootstrap Intervals: Bias Correction
Biased assignment
Certain user types might be systematically excluded or included
Lesson 1524Sample Ratio Mismatch (SRM)
BIC (Bayesian Information Criterion)
are scores that penalize models for using too many parameters while rewarding good fit to the data.
Lesson 781Information Criteria: AIC and BIC
Big Compute
problems occur when the calculations themselves are expensive, even with modest data sizes.
Lesson 1765Big Data vs Big Compute
Big Data
problems arise when you have so much data that it won't fit in memory or takes too long to read/write.
Lesson 1765Big Data vs Big Compute
Biggest impact
Which step affects the most users?
Lesson 1685Actionable Insights from Funnel Analysis
BigQuery
Serverless model; Google manages all infrastructure automatically
Lesson 1813Modern Cloud Data Warehouses: Snowflake, BigQuery, Redshift
Bimodal
Two distinct peaks, suggesting two subgroups (e.
Lesson 1175Histograms for Distribution Shape
Bimodal or multimodal patterns
(multiple peaks)
Lesson 1286Violin Plots and Distribution Shape
Bin data
`stat_bin()` aggregates continuous data into intervals
Lesson 1352Statistical Transformations with stat_* Layers
Binary assets
that must be versioned with code
Lesson 2033Git Large File Storage (LFS) for Data Assets
Binary or semi-structured
Git can't show meaningful diffs, so every change duplicates the entire file
Lesson 2070Separating Data from Code
Binning matters
Too few bins and you miss important details; too many bins and you see noise instead of pattern.
Lesson 1220Histograms for Continuous Distributions
Binomial data
k successes in n trials
Lesson 1560Computing the Posterior Distribution
Biological Gradient
Is there a dose-response relationship?
Lesson 498Bradford Hill Criteria for Causation
Block randomization
divides the assignment process into small "blocks" of fixed size, ensuring balance within each block.
Lesson 1488Block Randomization
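One way to picture block randomization is the sketch below, which fills each fixed-size block with an equal number of slots per arm and shuffles within the block. The function name, the default block size of 4, the arm labels, and the seed are all assumptions chosen for illustration.

```python
import random


def block_randomize(n_participants, block_size=4, arms=("treatment", "control"), seed=0):
    """Assign participants to arms so that every full block is balanced."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)            # seeded for reproducibility
    assignments = []
    while len(assignments) < n_participants:
        # Each block contains the same number of slots for every arm.
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)               # random order within the block
        assignments.extend(block)
    return assignments[:n_participants]  # trim a partial final block if needed


assignments = block_randomize(12)
print(assignments)
```

Because balance is enforced per block, the arms can never drift apart by more than half a block at any point during enrollment.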
Blood pressure
readings across populations
Lesson 179When Variables Are Log-Normally Distributed
BLUE
the Best Linear Unbiased Estimator.
Lesson 521Properties of Least Squares Estimators
Bonferroni
Divide your α by the number of tests (conservative, appropriate for critical decisions)
Lesson 1507Multiple Testing in A/B Test VariationsLesson 1508Pre-Registration and Correction Strategy
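The Bonferroni rule can be sketched in a few lines: divide α by the number of tests and compare each p-value to that stricter threshold. The α of 0.05 and the p-values are made-up illustration values.

```python
alpha = 0.05
p_values = [0.004, 0.03, 0.2, 0.011, 0.009]   # illustrative p-values, one per test

n_tests = len(p_values)
per_test_alpha = alpha / n_tests               # stricter threshold for each test

# A test is significant only if its p-value clears the adjusted threshold.
significant = [p <= per_test_alpha for p in p_values]
print(per_test_alpha, significant)
```

Two of these p-values (0.03 and 0.011) would pass an unadjusted 0.05 threshold but fail after the correction, which is the conservatism the entry refers to.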
Books
table has a primary key `book_id`
Lesson 1051Introduction to Foreign Keys
Boolean logic
every condition produces either TRUE or FALSE.
Lesson 865Introduction to Logical Operators in SQL
bootstrap distribution
shows you the range and variability of what your estimate might be, forming the foundation for confidence intervals.
Lesson 299How Bootstrap Resampling WorksLesson 300Bootstrap Distribution of a Statistic
Bootstrap methods
work by resampling *with replacement* from your observed data thousands of times.
Lesson 291Non-Parametric Alternatives for Difference Intervals
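Resampling with replacement can be sketched with the standard library alone. The example below bootstraps the median of a small made-up sample and reads off a percentile interval; the percentile method is only one of several interval constructions, and the data, seed, and resample count are assumptions.

```python
import random
from statistics import median


def bootstrap_medians(data, n_resamples=2000, seed=0):
    """Median of each resample drawn with replacement from the observed data."""
    rng = random.Random(seed)
    n = len(data)
    return [median(rng.choices(data, k=n)) for _ in range(n_resamples)]


data = [12, 15, 9, 20, 14, 11, 18, 13]   # illustrative sample
boot = sorted(bootstrap_medians(data))

# Percentile interval: cut 2.5% of the resampled medians from each tail.
lower = boot[round(0.025 * len(boot))]
upper = boot[round(0.975 * len(boot)) - 1]
print(lower, upper)
```

The same loop works for other statistics (a mean difference, a ratio, a correlation) by replacing `median` with the function of interest.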
Bootstrapping
Resampling methods generate different samples
Lesson 2055Why Randomness Matters in Data Science
Boston's coefficient
doesn't appear (it's built into the intercept)
Lesson 643Interpreting Coefficients Relative to Reference
Both ACF and PACF
Gradual decay (no clear cutoff)
Lesson 733Using ACF and PACF Together
Both are Aces
P(A ∩ B) = (4/52) × (3/51) ≈ 0.0045
Lesson 104Dependent Events and Joint Probability
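The product for drawing two Aces without replacement can be verified exactly with fractions. This is a small sketch; the variable names are illustrative.

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)                # 4 Aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)   # 3 Aces left among 51 cards

# Multiplication rule for dependent events: P(A and B) = P(A) * P(B | A).
p_both_aces = p_first_ace * p_second_ace_given_first
print(p_both_aces, float(p_both_aces))
```

The exact value is 1/221, roughly 0.0045.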
Both say non-stationary
Apply differencing or detrending.
Lesson 718Interpreting Stationarity Test Results
Both say stationary
Proceed with modeling—no transformation needed.
Lesson 718Interpreting Stationarity Test Results
Both transformed
With `log(Y) = β₀ + β₁log(X)`, β₁ becomes an *elasticity*—the percent change in Y per 1% change in X.
Lesson 594Interpreting Models After Transformation
Bottom layers
Technical details, methodology, data sources—available as appendices if questioned.
Lesson 1952The Pyramid Principle: Leading with Conclusions
Bottom/reference
Common ancestor (the base)
Lesson 2019Using Diff Tools for Conflict Resolution
Box plots
show the distribution through **five key numbers**: the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
Lesson 1223Box Plots and Violin PlotsLesson 1268Box Plots and Violin PlotsLesson 1343Statistical Transformations
Box-Cox transformation
solves this by testing a *family* of power transformations, controlled by a single parameter called **lambda (λ)**.
Lesson 214Box-Cox TransformationLesson 593Box-Cox Transformation
boxplot
draws a box from the first quartile (Q1) to the third quartile (Q3), with a line at the median.
Lesson 55Visualizing SpreadLesson 1285Categorical Plots: stripplot, swarmplot, boxplot
Boy Scout Rule
Leave code slightly cleaner than you found it whenever you touch it
Lesson 2137Refactoring Strategies and Debt Paydown
Branch A
changes line 42 of `analysis.
Lesson 2010Merge Conflicts: What They Are
Branch B
changes that *same* line 42 a different way
Lesson 2010Merge Conflicts: What They Are
Branches solve this problem
You can:
Lesson 2005What are Branches and Why Use Them?
Branching logic
After analyzing a dataset, you might trigger different validation pipelines depending on data quality scores or record counts.
Lesson 1844Dynamic Dependencies
Brand awareness campaigns
where every impression matters similarly
Lesson 1727Linear Attribution Model
Brand awareness efforts
Which channels are best at introducing new prospects?
Lesson 1720First-Touch Attribution Model
Breadth
means splitting into many parallel branches at fewer levels.
Lesson 1623Depth vs Breadth in Metric Trees
Breaking changes
(renaming, deleting columns, changing data types) require careful handling:
Lesson 1876Schema Evolution and Backwards Compatibility
Breaking point
Above 10-20 GB (or ~50% of available RAM), Pandas becomes unreliable or crashes
Lesson 1783Data Size Thresholds: When Pandas Isn't Enough
Bright spots
Anomalously high retention cohorts teach you what worked
Lesson 1649Visualizing Cohort Data with Heatmaps
Broken LTV:CAC ratio
If churn is too high, you may never recover acquisition costs
Lesson 1670What is Churn and Why It Matters
Bubble charts
extend this by encoding a third numeric variable through the **size of each point (bubble)**.
Lesson 1229Bubble Charts for Three Variables
Bug fixes
Create a `hotfix` branch to quickly patch issues
Lesson 2005What are Branches and Why Use Them?
Build a bootstrap distribution
of the test statistic under H₀
Lesson 396Bootstrap Hypothesis Testing
Build backup slides
with technical details for deep dives
Lesson 1956Anticipating and Addressing Audience Questions
Build comprehensive models
Capture the full story your data tells
Lesson 1190Introduction to Multivariate Analysis
Build intuition
before diving into formulas
Lesson 259Simulating Sampling Distributions
Build the transition graph
Map all observed customer journeys as state transitions (e.
Lesson 1733Markov Chain Attribution Models
Build trust incrementally
Regular check-ins demonstrate progress and keep stakeholders engaged.
Lesson 2111Fast Feedback Loops with Stakeholders
Built-in transformations
(ggplot2, some Seaborn):
Lesson 1373Statistical Transformations: Built-in vs Manual
Burden of proof
The prosecution must prove guilt beyond reasonable doubt
Lesson 312Hypothesis Testing as a Legal Analogy
Burn-in
refers to discarding the first portion of your MCMC samples—typically the first 10-50% of iterations.
Lesson 1592Burn-in, Thinning, and Convergence Diagnostics
Business → Technical
When a stakeholder says "We need to reduce customer churn," you translate this into: "Build a classification model predicting 30-day cancellation probability, optimized for recall since false negatives cost more than false positives, using histori...
Lesson 2105Translating Between Technical and Business Language
Business decisions can't wait
Supply chain adjustments based on current demand patterns
Lesson 1788Streaming Data and Real-Time Requirements
Business documentation
Process flows, compliance rules, product specs
Lesson 1201Domain Knowledge as a Hypothesis Source
Business implication
If you're a pool safety company, don't target ice cream shops for partnerships based on this correlation—focus on seasonal weather patterns instead.
Lesson 1426Real-World Examples: Correlation vs Causation
Business intelligence
"What's our cheapest product?"
Lesson 885MIN and MAX: Finding Extremes
Business Intelligence (BI) professional
creates a dashboard showing last quarter's sales by region
Lesson 4Data Science vs Data Analytics vs Business Intelligence
Business KPIs
Sales, transactions, or user activity with weekly or monthly seasonality
Lesson 1411Applications and Limitations
Business logic violations
Withdrawing more money than available
Lesson 1109Input Validation and Defense in Depth
Business needs evolve
What mattered last quarter might not matter now
Lesson 15Deployment, Monitoring, and Iteration
Business processes
How does data flow through the organization?
Lesson 1168Understanding Domain Context
Business question
"Why are customers leaving?"
Lesson 2085Stage 1: Problem Definition and Scoping
Business relevance
It forces you to ask: "What size of change would actually move the needle for our business?"
Lesson 1494Effect Size: The Minimum Detectable Effect
Business requirements
Must results be explainable to non-technical stakeholders?
Lesson 1169Clarifying Assumptions and Constraints
Business rules
(minimum/maximum prices, valid categories)
Lesson 75Domain-Specific Outlier Rules
Business strategy shifts
If your company pivots from growth-at-all-costs to sustainable profitability, your North Star metric and its supporting branches must change.
Lesson 1626Maintaining and Evolving Metric Trees
Business Understanding
Knowing how organizations actually work helps you focus on problems that matter, not just technically interesting puzzles.
Lesson 7The Data Science Skill Stack
Business-Friendly Labels
Instead of "Cluster 3," assign meaningful names like "High-Value Loyalists," "At-Risk Champions," or "New Bargain Hunters."
Lesson 1709Segment Profiling and Interpretation
Buttons
trigger specific actions when clicked:
Lesson 1332Streamlit Widgets: Inputs and Controls
By cohort/channel
Some channels may have better LTV:CAC but slower payback, affecting budget allocation
Lesson 1757Payback Period: Definition and Importance
by how much
A confidence interval for the difference gives you a range of plausible values for the true difference between two population proportions.
Lesson 412Confidence Interval for DifferenceLesson 1955Framing Insights in Business Language

C

C(n-1, r-1)
counts arrangements of those *r-1* successes
Lesson 135The Negative Binomial Distribution: Waiting for r Successes
C(n, k)
The number of ways to choose k successes from n trials (called "n choose k" or the binomial coefficient)
Lesson 127Binomial Distribution PMF
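Python's `math.comb` computes C(n, k) directly. The sketch below also plugs it into the standard binomial PMF, C(n, k) · p^k · (1 − p)^(n − k); the helper name and the example of 2 successes in 5 fair trials are illustrative assumptions.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


ways = comb(5, 2)                # number of ways to choose 2 successes from 5 trials
prob = binomial_pmf(2, 5, 0.5)   # exactly 2 heads in 5 fair coin flips
print(ways, prob)
```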
Caching
means storing the results of expensive computations so you can reuse them instead of recalculating.
Lesson 1337Dashboard Performance and CachingLesson 1782Spark Performance Basics: Partitions and Caching
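The referenced lessons cover caching in dashboards and Spark; as a library-neutral stand-in, the sketch below shows the same idea with `functools.lru_cache` from the standard library. The function `expensive_square` is a hypothetical placeholder for a slow computation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive_square(n):
    """Placeholder for a costly computation; results are stored per argument."""
    return n * n


first = expensive_square(12)    # computed and stored in the cache
second = expensive_square(12)   # served from the cache, not recomputed

info = expensive_square.cache_info()
print(first, second, info.hits, info.misses)
```

Caching like this is only safe when the function's result depends solely on its arguments; otherwise stale values can be returned.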
Calculate
standardized residuals for each cell after your significant Chi-Squared test
Lesson 428Post-Hoc Analysis and ResidualsLesson 994CTEs for Simplifying Complex Joins
Calculate a p-value
If you observe an extreme imbalance (e.
Lesson 391The Sign Test for Medians
Calculate cumulative contribution
to the total metric
Lesson 1698Power User Curves and Engagement Distribution
Calculate difference
Treatment effect = (Treatment metric) - (Control metric)
Lesson 1641Isolating Effects with Control Groups
Calculate differences
Subtract the hypothesized median from each observation
Lesson 391The Sign Test for Medians
Calculate error metrics
Compare predictions to actual values
Lesson 790Out-of-Sample Forecast Evaluation
Calculate forecast errors
on your data
Lesson 772Holt-Winters Parameter Optimization
Calculate incrementality
Lift = (Test performance - Expected baseline) ÷ Expected baseline
Lesson 1746Geo-Lift ExperimentsLesson 1747Ghost Ads and PSA Tests
Calculate intercept β₀
Use β₀ = ȳ - β₁x̄
Lesson 522Implementing Least Squares from Scratch
Calculate LTV per cohort
by summing or projecting total revenue per customer in that group
Lesson 1664Cohort-Based LTV Calculation
Calculate P(A and B)
the probability both events occur together
Lesson 102Testing for Independence
Calculate P(A)
the probability of event A occurring
Lesson 102Testing for Independence
Calculate P(B)
the probability of event B occurring
Lesson 102Testing for Independence
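Once the three probabilities are in hand, the independence check compares P(A and B) with P(A) × P(B). The sketch below uses a made-up die-roll example (A = "even", B = "at most 4") and exact fractions to avoid rounding issues; the helper name is hypothetical.

```python
from fractions import Fraction


def is_independent(p_a, p_b, p_a_and_b):
    """Events are independent exactly when P(A and B) equals P(A) * P(B)."""
    return p_a_and_b == p_a * p_b


# Fair six-sided die: A = {2, 4, 6}, B = {1, 2, 3, 4}, A and B = {2, 4}.
p_a = Fraction(3, 6)
p_b = Fraction(4, 6)
p_a_and_b = Fraction(2, 6)

independent = is_independent(p_a, p_b, p_a_and_b)
print(independent)
```

With probabilities estimated from data, exact equality is too strict; a tolerance or a formal test would be the usual substitute.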
Calculate paired differences
(just like the paired t-test or Sign Test)
Lesson 392Wilcoxon Signed-Rank Test
Calculate probabilities
Convert to the standard normal distribution (from your previous lesson) to find exact probabilities
Lesson 195Z-Score Definition and Interpretation
Calculate slope β₁
Use the formula involving sums of products and squared deviations from the mean
Lesson 522Implementing Least Squares from Scratch
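The slope and intercept entries for simple linear regression can be combined into one short from-scratch sketch. It uses the standard formulas β₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and β₀ = ȳ − β₁x̄; the function name and the data are illustrative assumptions.

```python
from statistics import mean


def least_squares(x, y):
    """Return (intercept, slope) for simple linear regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of products of deviations, and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx                # slope
    beta0 = y_bar - beta1 * x_bar      # intercept
    return beta0, beta1


# Points lying exactly on y = 1 + 2x, so the fit should recover those values.
beta0, beta1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(beta0, beta1)
```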
Calculate statistics
like mean, median, min, max, std, or count for each group
Lesson 1185Grouped Summary Statistics
Calculate the business impact
in dollars, time, or customers
Lesson 1956Anticipating and Addressing Audience Questions
Calculate the expected value
E(X) = Σ [outcome × probability]
Lesson 152Decision Making Under Uncertainty
Calculate the p-value
as the proportion of permuted statistics as extreme or more extreme than your observed value
Lesson 395Permutation Tests for Means and Beyond
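The p-value step can be sketched as a two-sided permutation test for a difference in means. The function name, the number of permutations, the seed, and the data are assumptions; random shuffling approximates the full set of permutations rather than enumerating it.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Share of shuffled splits whose mean difference is at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                        # random relabeling of observations
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:                       # as extreme or more extreme
            extreme += 1
    return extreme / n_permutations


# Clearly separated made-up groups, so the p-value should be small.
p_value = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
print(p_value)
```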
Calculate the rolling average
at each point using that window
Lesson 739Moving Average Detrending
Calculate the tail probabilities
For 95%, that's (1 - 0.
Lesson 1575Computing Equal-Tailed Credible Intervals
Calculate the test statistic
using only b and c
Lesson 436Conducting McNemar's Test
Calculate the treatment effect
within each stratum
Lesson 1430Controlling for Confounders: Stratification
Calculate the U statistic
based on these rank sums
Lesson 393Mann-Whitney U Test (Wilcoxon Rank-Sum)
Calculate transition probabilities
For each state, determine the likelihood of moving to the next state
Lesson 1733Markov Chain Attribution Models
Calculate your statistic
(median, correlation, ratio, etc.)
Lesson 306Bootstrap for Non-Standard Problems
Calculate your test statistic
(you learned this in lesson 316)
Lesson 319Calculating P-Values from Test Statistics
Calculate your Z-score
from raw data (you've already learned this!)
Lesson 198Using Z-Tables for Probability
Calculated fields
Storing computed values (like `order_total`) instead of recalculating from line items every time.
Lesson 1071When to Denormalize: Performance Trade-offs
Calculating differences
Compare current vs previous values (sales growth, price changes)
Lesson 1023Introduction to Window Functions: LAG and LEAD
Calculating the posterior distribution
means applying Bayes' theorem to compute exactly how probable each parameter value is, given both your starting assumptions and the observed data.
Lesson 1545Calculating the Posterior Distribution
Calculations
Computing percentages or ratios using aggregates
Lesson 967Subqueries in the SELECT Clause
Caliper Matching
adds a safety rule: only match if the propensity scores are within a maximum distance (the "caliper").
Lesson 1448Propensity Score Matching Methods
Call Centers
A help desk receives 30 calls per day on average.
Lesson 144Poisson Applications: Arrivals and Events
Call-in polls
Only passionate viewers with free time participate
Lesson 246Volunteer and Self-Selection Bias
Campaign A
70% chance of $50,000 profit, 30% chance of $0
Lesson 152Decision Making Under Uncertainty
Campaign B
40% chance of $100,000 profit, 60% chance of a $10,000 loss
Lesson 152Decision Making Under Uncertainty
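Applying E(X) = Σ [outcome × probability] to the two campaigns gives a concrete comparison. The sketch below uses exact fractions for the probabilities and treats the loss as a negative outcome; the helper name is hypothetical.

```python
from fractions import Fraction


def expected_value(outcomes):
    """Sum of outcome * probability over (outcome, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)


campaign_a = [(50_000, Fraction(7, 10)), (0, Fraction(3, 10))]
campaign_b = [(100_000, Fraction(4, 10)), (-10_000, Fraction(6, 10))]

ev_a = expected_value(campaign_a)   # 0.7 * 50,000
ev_b = expected_value(campaign_b)   # 0.4 * 100,000 - 0.6 * 10,000
print(ev_a, ev_b)
```

Campaign A comes out at $35,000 and Campaign B at $34,000, so the expected values are close even though B carries a real chance of losing money.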
canonical link
for binomial (binary) outcomes, meaning it naturally pairs with the exponential family representation of the binomial distribution.
Lesson 673The Logit LinkLesson 676Canonical vs Non-Canonical LinksLesson 678Choosing the Right Link FunctionLesson 690The Poisson Distribution as a GLM
Capture non-linear monotonic patterns
(a curved upward trend still gets positive correlation)
Lesson 486Spearman's Rank Correlation Coefficient
Career advancement
Publishing sensational findings (even if overstated) could boost your reputation.
Lesson 1930Managing Conflicts of Interest
Carryover effect
Advertising impact persists and decays over time, like a drug slowly leaving your bloodstream
Lesson 1739Adstock and Carryover Effects
CartoDB
Clean, minimal styles for data-first presentations
Lesson 1314Basemaps and Map Tiles
Case Studies
simulate real problems: "How would you measure the success of a new feature?"
Lesson 2142Interviewing: Technical and Behavioral Prep
Cash-constrained companies
prioritize rapid payback above all, even if it means higher CAC or lower ROAS, because they literally can't afford to wait.
Lesson 1759Optimizing ROAS, CAC, and Payback Together
categorical
and **numerical variables**, making your **data cleaning and preparation** work much easier than dealing with messy external sources.
Lesson 20Primary Data Sources: Databases and Data WarehousesLesson 634Categorical Variables in Regression
categorical × categorical
interaction captures whether the combined effect of two categories differs from their individual additive effects—like whether a specific treatment works differently depending on disease severity level.
Lesson 687Categorical Predictors and Interactions in Logistic ModelsLesson 1182Choosing Analysis Methods by Variable Types
Categorical comparisons
Use `geom_bar` or `geom_col`
Lesson 1342Geometric Objects (geoms)
Categorical plots
Compare groups (box plots, violin plots, bar plots)
Lesson 1281Introduction to Seaborn's Statistical Plots
Categorical-to-Categorical
Build contingency tables and apply association measures like Cramér's V or chi-square tests.
Lesson 1210Relationship Exploration: Correlation and Association
Category or product line
if your queries consistently filter these
Lesson 1812Partitioning and Clustering Strategies
Causal clarity
Can you tie drop-off to specific friction (forms too long, unclear CTAs)?
Lesson 1685Actionable Insights from Funnel Analysis
Causal question
Does traffic *cause* revenue, or do successful companies simply attract both?
Lesson 1426Real-World Examples: Correlation vs Causation
Causal reasoning
Ask "what causes this feature to have predictive power?"
Lesson 1883Protected Classes and Proxy Variables
Causation
means one variable *directly causes* changes in another.
Lesson 1420Defining Correlation and Causation
Cause must precede effect
If A causes B, then A must happen before B.
Lesson 1425Identifying Potential Causal Relationships
Caused by
both your treatment and outcome
Lesson 1432Colliders and Bad Controls
CC-BY
Requires attribution when data is used
Lesson 2082Choosing a License for Data Science Projects
CC-BY-SA
Requires derivatives to use the same license (like GPL for data)
Lesson 2082Choosing a License for Data Science Projects
CC0 (Public Domain)
Maximum openness, no restrictions
Lesson 2082Choosing a License for Data Science Projects
CDF
For x in [a, b], F(x) = (x - a)/(b - a) — a straight line from 0 to 1
Lesson 161The Continuous Uniform Distribution
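The CDF formula translates directly into code. The sketch below adds the two flat pieces outside [a, b] (0 below a, 1 above b), which the entry implies but does not state; the function name is illustrative.

```python
def uniform_cdf(x, a, b):
    """CDF of the continuous uniform distribution on [a, b]."""
    if a >= b:
        raise ValueError("a must be smaller than b")
    if x < a:
        return 0.0                    # no probability mass below a
    if x > b:
        return 1.0                    # all probability mass is at or below b
    return (x - a) / (b - a)          # straight line from 0 to 1 on [a, b]


quarter = uniform_cdf(3, 2, 6)        # one quarter of the way from 2 to 6
print(quarter)
```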
Cell proportions
divide each cell by the grand total, giving you joint probabilities like P(A and B).
Lesson 98Conditional Probability with Tables
Cells
Metrics like user count, retention rate, cumulative revenue, or conversion rate
Lesson 1647Building a Cohort Table
Censored observations
subjects still "at risk" but whose outcome is unknown (they left the study, were lost to follow-up, or the study ended)
Lesson 812Handling Event Times and CensoringLesson 839Time-to-Conversion in Marketing Funnels
Censored observations contribute
to the "at risk" count up until their censoring time, then they're removed from the calculation.
Lesson 812Handling Event Times and Censoring
Census data
Does your sample reflect regional population proportions?
Lesson 421Applications: Uniform, Genetic Ratios, and Distributions
Centered
at the true population mean (μ)
Lesson 252Sampling Distribution of the Sample Mean
Centered around zero
positive and negative deviations should balance out
Lesson 709Irregular Component: Random Noise
Centered moving averages
use data points from *both* before and after the target time.
Lesson 753Centered vs Trailing Moving Averages
Centering
solves this by transforming each predictor to have a mean of zero.
Lesson 656Centering Variables in InteractionsLesson 661Centering Predictors for Polynomials
Centers
Which group has the highest median?
Lesson 1186Box Plots and Violin Plots by Group
Central Limit Theorem
(which you'll learn later) shows that averages tend to be normally distributed
Lesson 169The Normal Distribution: Definition and PropertiesLesson 223Standard Error and the CLT
Central Limit Theorem (CLT)
is one of the most important results in statistics.
Lesson 218What the Central Limit Theorem States
Central tendency
is the statistical concept of finding a single representative value that describes the "center" or "typical" value of a dataset.
Lesson 38What is Central Tendency?Lesson 1172What is Univariate Analysis?Lesson 1220Histograms for Continuous Distributions
Centrality measures
identify important nodes:
Lesson 1320Network Metrics and Visual Analysis
Centralized storage with structure
ensures documentation lives where everyone can find it.
Lesson 2068Data Provenance Best Practices
Change-point detection
identifies moments in time when the statistical properties of your data fundamentally shift.
Lesson 1412What is Change-Point Detection?
Changes to be committed
(staged): Files you've added to the staging area with `git add` but haven't committed yet.
Lesson 1997Viewing Repository State with git status
Changing spread
The fluctuations get wider or narrower over time (violates constant variance)
Lesson 715Visual Tests for Stationarity
Channel concentration
percentage of volume from top channel (lower is safer)
Lesson 1716Channel Mix and Portfolio Thinking
Chartjunk
refers to anything in a visualization that doesn't represent data or support comprehension:
Lesson 1246Visual Clutter and Chartjunk
Check access permissions
Can you actually query these databases or files?
Lesson 2098Identifying Data Availability Gaps Early
Check associations with outcome
Does it also correlate with your dependent variable?
Lesson 1429Identifying Confounders in Practice
Check associations with treatment
Does the potential confounder correlate with your independent variable?
Lesson 1429Identifying Confounders in Practice
Check connection parameters
Validate host, port, database name, and connection string format
Lesson 1093Troubleshooting Connection Issues
Check context
does it appear in a cluster of suspicious records?
Lesson 1209Outlier Detection and Investigation
Check contrast ratios
between text/elements and backgrounds
Lesson 1254Testing Visualizations for Accessibility
Check covariate balance
against your threshold
Lesson 1492Rerandomization and Practical Implementation
Check for cycles
(remove them—DAGs are acyclic!)
Lesson 1469Building a Simple Causal DAG
Check it
Plot log-odds against each continuous predictor; look for straight-line patterns, not curves.
Lesson 686Assumptions and Diagnostics in Logistic Regression
Check normality
Q-Q plots or Shapiro-Wilk test per group
Lesson 290Assumptions and Diagnostics for Difference Intervals
Check response patterns
Low response rates (under 50%) often signal nonresponse bias.
Lesson 250Strategies for Bias Detection and Mitigation
Check result counts
if you expect hundreds of rows but get millions, investigate immediately
Lesson 955Avoiding Cartesian Products
Check retention schedules
what *must* you keep by law?
Lesson 1909Right to Erasure and Data Retention Policies
Check source documentation
or file metadata when available
Lesson 1135Detecting and Fixing Encoding Issues
Check statistical significance
Use t-tests and F-tests to identify meaningful predictors
Lesson 633Practical Model Selection Strategy
Check the lineage
Use your pipeline's metadata to identify which upstream tables, files, or APIs fed into the problematic dataset
Lesson 1870Root Cause Analysis for Quality Issues
Check the p-value
(statistical significance)
Lesson 609Practical vs Statistical Significance
Checks each row
in the main query to see if its column value matches *any* value from the subquery results
Lesson 961IN Operator with Subqueries
Cherry-picking time ranges
means deliberately selecting start and end dates that support a preferred narrative while hiding inconvenient context.
Lesson 1241Cherry-Picking Time Ranges
Chi-square tests
compare outcome distributions across groups
Lesson 1890Measuring Disparate Impact
Chi-Squared test
uses an approximation based on a mathematical distribution, while **Fisher's Exact Test** calculates the exact probability by considering all possible table arrangements.
Lesson 434Fisher's Exact vs Chi-Squared: When to Use Each
Chi-Squared Test of Independence
helps you answer questions like: "Is there a relationship between gender and product preference?"
Lesson 422Introduction to Chi-Squared Test of Independence
Children and minors
They lack legal capacity and cognitive maturity to understand data implications
Lesson 1918Special Populations and Vulnerable Groups
Choose Dagster when
You're managing complex data transformations, need strong guarantees about data quality, or want asset-centric workflows.
Lesson 1839Alternative Orchestration Tools
Choose intensity metrics
Frequency (daily visits), depth (features used), or duration (session length)
Lesson 1693Defining User Engagement
Choose Luigi when
You have simpler pipelines, want minimal infrastructure, or need quick prototyping without heavy tooling.
Lesson 1839Alternative Orchestration Tools
Choose Prefect when
You want rapid development, need dynamic pipelines, or prefer writing pure Python without Airflow's constraints.
Lesson 1839Alternative Orchestration Tools
Choosing measures
Remember comparing mean, median, and mode?
Lesson 63Understanding Distribution Shape
Choosing references wisely
pick a meaningful baseline for comparison
Lesson 643Interpreting Coefficients Relative to Reference
Choosing α before analysis
means deciding your threshold for rejecting the null hypothesis—typically 0.
Lesson 329Choosing α Before Analysis
Churn
is when customers stop doing business with you.
Lesson 1670What is Churn and Why It Matters
Churn analysis
measures the percentage who *stop using* your product in a given period (Week 1: 10% churned, Week 2: 8% churned).
Lesson 1660Retention Curves vs Churn AnalysisLesson 1678What is Funnel Analysis?
Churn prediction
becomes more accurate when built separately for high-value versus low-value segments
Lesson 1701What is Customer Segmentation?
Churn reason
(from attribution analysis): If they left due to missing features, notify them when those ship
Lesson 1676Win-Back and Retention Strategies
Churned Customers
Those who've stopped paying, canceled subscriptions, or haven't engaged in your defined inactivity window.
Lesson 1704Customer Lifecycle Stages
Circular
Emphasizing connections over clustering
Lesson 1318Network Layout Algorithms
City populations
A few megacities dwarf most towns
Lesson 190The Pareto Distribution: Heavy Tails and Power Laws
City sizes
A few megacities contain most urban population
Lesson 191Pareto Principle and the 80/20 Rule
Claim
"Mobile users convert at lower rates than desktop users"
Lesson 1946Supporting Your Claims with Evidence
Class attributes
represent **columns**
Lesson 1117What is an ORM and Why Use It?
Clean experimentation
Test new packages without risking your system-wide installation
Lesson 2039Virtual Environments: Concept and Benefits
Clean working directory
When nothing needs attention
Lesson 1998Checking Repository Status
Clean, documented code
with clear README files
Lesson 2091Stage 7: Communication and Handoff
Cleaner pipelines
Data arrives pre-formatted
Lesson 1802Filtering During Read with dtype and Converters
Clear
"Achieve 95% recall on fraud cases while maintaining false positive rate below 2%"
Lesson 2094Defining Success Metrics Upfront
Clear dependencies
on packages, data sources, and environments
Lesson 1981What Makes a Report Reproducible?
Clear factorial design
(every combination of variables)
Lesson 1482Control and Treatment Design
Clear metrics
(not "satisfaction," but "NPS score")
Lesson 2093Translating Business Questions into Analytical Questions
Clear outputs before committing
Use "Restart & Clear Output" before staging your notebook.
Lesson 2030Version Control for Notebooks: Challenges and Solutions
Clear problem statement
What question did you answer?
Lesson 2141Building a Portfolio and Personal Brand
Click-through rates
in digital marketing (proportion of clicks)
Lesson 184Beta Distribution: Bounded Between 0 and 1
Climate data
Policy impacts or environmental shifts
Lesson 1412What is Change-Point Detection?
Closeness centrality
How quickly a node can reach all others (the "efficient communicators")
Lesson 1320Network Metrics and Visual Analysis
Cloud data warehouses
(Snowflake, BigQuery, Redshift) providing scalable compute
Lesson 1821Hybrid Approaches and Modern Data Stacks
Cluster
or **multistage sampling** concentrates your effort geographically.
Lesson 243Choosing the Right Sampling MethodLesson 1481Unit of Randomization
Cluster randomization
Randomize by groups (e.
Lesson 1527Ignoring Network Effects
Cluster sampling
is a technique where you divide your population into groups (called **clusters**), randomly select some of those clusters, and then survey all or some members within the chosen clusters.
Lesson 237Cluster SamplingLesson 243Choosing the Right Sampling Method
Clustered Data
Students within the same classroom, patients from the same hospital, or measurements from the same family are often more similar to each other than to observations from different clusters.
Lesson 381Independence Assumption and Its ViolationsLesson 548Independence of Observations
Clustering coefficient
measures how tightly a node's neighbors are connected to each other—like whether your friends also know each other.
Lesson 1320Network Metrics and Visual Analysis
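A minimal sketch of the local clustering coefficient, assuming an undirected graph stored as a dictionary of neighbor sets. The `friends` graph and the function name `local_clustering` are hypothetical, chosen only to mirror the "do your friends know each other" framing.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of possible links among a node's neighbors that actually exist."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        # With fewer than two neighbors there are no neighbor pairs to check.
        return 0.0
    # Count neighbor pairs that are themselves connected.
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    possible = k * (k - 1) / 2
    return links / possible


# Hypothetical friendship network: ann and bob know each other, cat knows only "you".
friends = {
    "you": {"ann", "bob", "cat"},
    "ann": {"you", "bob"},
    "bob": {"you", "ann"},
    "cat": {"you"},
}

you_clustering = local_clustering(friends, "you")
```

Here one of the three possible pairs among "you"'s friends is connected, so the coefficient is one third.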
Clusters of high correlations
reveal groups of variables that measure similar underlying concepts.
Lesson 511Reading and Interpreting Correlation Matrices
Clusters or trends
Independence assumption might be violated
Lesson 556What Are Residuals and Why Plot Them?
Coarsen
Temporarily bin continuous variables into meaningful categories (e.
Lesson 1449Coarsened Exact Matching (CEM)
Coarsened Exact Matching
solves this through a clever three-step process:
Lesson 1449Coarsened Exact Matching (CEM)
Code chunks
Sections of R code enclosed in special delimiters that execute and display results
Lesson 1983R Markdown for Dynamic Reports
Code clarity
`src/` contains reusable functions and scripts.
Lesson 2032Organizing Repository Structure for Data Science
Code contribution process
Should contributors fork your repo?
Lesson 2083Contributing Guidelines and Contact Information
Code debt
Copy-pasting notebook cells instead of writing reusable functions
Lesson 2131What is Technical Debt in Data Science?
Code Licenses
(your scripts and algorithms):
Lesson 2082Choosing a License for Data Science Projects
Code management
means tracking changes to your scripts and notebooks, usually with tools like version control systems.
Lesson 29Code and Environment Management
Code references
Scripts or notebook cells that performed each transformation
Lesson 2065Tracking Data Lineage
Code review
Share branches for review before merging into `main`
Lesson 2005What are Branches and Why Use Them?
Code review happens
Team members examine your changes, spot bugs, suggest improvements, and ensure standards are met
Lesson 2022Understanding Pull Requests
Code standards
What style guide do you follow (PEP 8)?
Lesson 2083Contributing Guidelines and Contact Information
Code versions
Git commit hashes, script versions, or package versions
Lesson 1988Embedding Data Lineage and Metadata
Coefficient of Variation
is your tool when comparing datasets with different units or scales (e.
Lesson 54When to Use Each Measure
Coefficient of Variation (CV)
solves this by expressing variability as a *percentage of the mean*.
Lesson 53Coefficient of Variation
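As a rough illustration, the CV can be computed as the sample standard deviation divided by the mean, times 100. The data below are made up, and the helper name `coefficient_of_variation` is an assumption for this sketch.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m * 100


# Illustrative data in different units: same spread, different means.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
```

Both lists have the same standard deviation, but the weights vary more relative to their mean, so their CV is larger.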
Coefficient p-values
Statistical significance of specific dummies shifts because you're testing different comparisons
Lesson 647Impact on Model Results and Reporting
Coefficient values
Each dummy variable coefficient represents the difference from the reference, so new reference = new differences
Lesson 647Impact on Model Results and Reporting
Coffee
→ **Alertness** (coffee directly increases alertness)
Lesson 1469Building a Simple Causal DAG
Cohen's d
for t-tests (difference between means in standard deviation units)
Lesson 384What is Effect Size?
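One common way to compute Cohen's d divides the difference in means by the pooled standard deviation. The sketch below assumes two independent groups and uses made-up scores; other variants of d exist.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference between group means in pooled-standard-deviation units."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Illustrative data only.
treatment = [12, 14, 16, 18, 20]
control = [10, 12, 14, 16, 18]

d = cohens_d(treatment, control)
```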
Coherence
Does the causal interpretation align with existing theory and evidence?
Lesson 498Bradford Hill Criteria for CausationLesson 1563Sequential Updating with New Data
Cohort analysis
is a technique that divides users or customers into groups—called cohorts—based on a shared characteristic or experience within a defined time window.
Lesson 1644What is Cohort Analysis?Lesson 1661What is Customer Lifetime Value (LTV)?Lesson 1678What is Funnel Analysis?Lesson 1701What is Customer Segmentation?Lesson 1715Comparing Channel Performance
Cohort comparison
Use log-rank tests to compare retention across pricing tiers or customer segments
Lesson 838Subscription and Membership Duration Modeling
Cohort-based payback analysis
breaks down payback periods by customer segment (acquisition channel, geography, plan type, etc.
Lesson 1758Cohort-Based Payback Analysis
Coin flip
Sample space: {Heads, Tails}
Lesson 82Collectively Exhaustive Events
Collaborate
with domain experts or specialists
Lesson 34Recognizing Boundaries of Competence
Collaboration actions
(weight: 0.
Lesson 1699Engagement Scoring Systems
Collaboration-friendly
Everyone knows where to find things.
Lesson 2032Organizing Repository Structure for Data Science
Collaborative fraud detection
across banks without sharing customer data
Lesson 1903Secure Multi-Party Computation
Collect more data
to increase sample size
Lesson 426Assumptions and Sample Size Requirements
Collectively exhaustive events
are a group of events whose union contains *every possible outcome* in the sample space: nothing is left out.
Lesson 82Collectively Exhaustive Events
College attended
may proxy for race, class, and family wealth
Lesson 1883Protected Classes and Proxy Variables
Collibra
, **Alation**, and **Apache Atlas** maintain centralized inventories of your data assets.
Lesson 1164Tools for Lineage Tracking
Colliders
Where paths meet (X → C ← Y).
Lesson 1471Mediators and Colliders
Collinearity
makes models unstable and coefficients hard to interpret
Lesson 1197Identifying Variable Importance and Redundancy
Color (hue)
Different colors stand out immediately (red among blues)
Lesson 1235Pre-Attentive AttributesLesson 1310Point Maps and Scatter Plots on Maps
Color (intensity)
Sequential scales for continuous variables (temperature, risk level)
Lesson 1310Point Maps and Scatter Plots on Maps
Color blindness simulators
(like Coblis or Chrome DevTools) show your chart through the lens of deuteranopia, protanopia, or other color vision deficiencies
Lesson 1254Testing Visualizations for Accessibility
Color choices
Ensure colorblind-friendly palettes and grayscale compatibility.
Lesson 1369Publication-Ready Plot Styling
Color encoding
Use color to represent the third dimension on a 2D plot (like heatmaps)
Lesson 1329Effective Use and Pitfalls of 3D VisualizationsLesson 1362When to Use Facets vs. Other Approaches
Color mapping
adds another dimension, using different hues or intensity to show groupings or continuous scales.
Lesson 1265Scatter Plots: Relationships Between Variables
Color-coding
in scatter plots can reveal when different groups show different trends
Lesson 1195Interaction Effects Between Variables
ColorBrewer palettes
offer scientifically-designed color schemes for categorical, sequential, or diverging data:
Lesson 1368Color Scales and Palettes
Colors
can be specified multiple ways in Matplotlib:
Lesson 1272Colors, Markers, and Line Styles
Column charts
arrange categories along the horizontal axis with vertical bars extending upward.
Lesson 1219Bar Charts and Column Charts
Column Count Must Match
Both queries must return the same number of columns
Lesson 999UNION: Combining Distinct Results
Column Names
The result uses column names from the first SELECT
Lesson 999UNION: Combining Distinct ResultsLesson 1151Schema Validation
Column order
(optional, if order matters for your workflow)
Lesson 1151Schema Validation
Column presence
All required columns exist
Lesson 1151Schema Validation
Column proportions
divide each cell by its column total.
Lesson 98Conditional Probability with Tables
Column Types
map Python objects to SQL data types.
Lesson 1121Column Types, Constraints, and Relationships
Columns (Fields/Attributes)
Each column represents a specific property or feature.
Lesson 843Relational Database Concepts
Combination rule
Require both high probability (e.
Lesson 1585Early Stopping in Bayesian Tests
Combine
Aggregate results back together
Lesson 1768Data Parallelism Fundamentals
Combine adjacent categories
to increase expected counts
Lesson 419Assumptions and Minimum Expected Frequencies
Combine all observations
from both groups
Lesson 393Mann-Whitney U Test (Wilcoxon Rank-Sum)
Combine all rows
from both queries
Lesson 998Introduction to Set Operations
Combine categories
if logically defensible
Lesson 426Assumptions and Sample Size Requirements
Comfortable zone
Datasets under 1-2 GB work smoothly in Pandas on typical machines
Lesson 1783Data Size Thresholds: When Pandas Isn't Enough
Command
The script or code to run
Lesson 1874DVC Pipelines and Stages
Command-line tools
that accept parameters and integrate with schedulers
Lesson 2074Notebooks vs Scripts: When to Use Each
Commit hash
A unique 40-character identifier (like `a3f2b8c.
Lesson 1999Viewing Commit History
Commit the merge
with `git commit` (Git will provide a default merge commit message)
Lesson 2011Resolving Merge Conflicts
Commit thoughtfully
Make atomic commits after completing logical units of work, not after every cell execution.
Lesson 2030Version Control for Notebooks: Challenges and Solutions
Common data sources
and their quirks in that sector
Lesson 2145Transitioning Between Industries and Domains
Common Table Expression (CTE)
is a named temporary result set that you define at the beginning of a query using the `WITH` clause.
Lesson 989What are Common Table Expressions (CTEs)?
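A small sketch of a CTE in action, run through Python's built-in `sqlite3` module against an in-memory database. The `orders` table, its rows, and the CTE name `customer_totals` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 40.0), ("ana", 70.0), ("ben", 20.0), ("cy", 150.0)],
)

# The WITH clause names a temporary result set that the main query then reads
# as if it were a table.
query = """
WITH customer_totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total
FROM customer_totals
WHERE total > 100
ORDER BY customer
"""

big_spenders = conn.execute(query).fetchall()
conn.close()
```

The CTE exists only for the duration of this one query; nothing named `customer_totals` is stored in the database.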
Communicate timeline risks early
If you discover the analysis will take longer than expected, flag it immediately.
Lesson 2099Aligning with Business Timelines and Decision Points
Communicating results
with stakeholders who benefit from narrative + code + visuals in one document
Lesson 2074Notebooks vs Scripts: When to Use Each
Communication
You must explain complex findings to people who don't speak "data."
Lesson 7The Data Science Skill Stack
Communication bridge
Owner translates technical nuances for business stakeholders
Lesson 1619What is Metric Ownership?
Community channels
Link to Slack, Discord, or discussion forums
Lesson 2083Contributing Guidelines and Contact Information
Community detection
algorithms group nodes into clusters based on connection patterns, revealing natural subdivisions in the network.
Lesson 1320Network Metrics and Visual Analysis
Company Level
Your North Star Metric becomes the top-level objective.
Lesson 1608Connecting North Star Metrics to OKRs
Compare across cohorts
to identify trends, improvements, or degradation
Lesson 1664Cohort-Based LTV Calculation
Compare across tables
Filter rows in one table based on criteria from another
Lesson 959Introduction to Subqueries in WHERE
Compare apples to apples
Compare January sales to July sales fairly
Lesson 748Seasonally Adjusted Data
Compare apples to oranges
Compare test scores from different exams with different scales
Lesson 195Z-Score Definition and Interpretation
Compare cohorts instantly
Did the January cohort retain better than February's?
Lesson 1656Visualizing Retention Curves
Compare costs
Which error would cause more harm?
Lesson 334Setting Alpha: Choosing Your Significance Level
Compare effects
across strata—if the relationship disappears or reverses, the confounder was key
Lesson 1430Controlling for Confounders: Stratification
Compare expected values
across alternatives
Lesson 152Decision Making Under Uncertainty
Compare nested models
Use partial F-tests when adding/removing specific variables
Lesson 633Practical Model Selection Strategy
Compare posteriors
to see which hypothesis is most supported by the evidence.
Lesson 113Multiple Hypotheses and Total ProbabilityLesson 1572Sensitivity Analysis and Prior Robustness
Compare stratified analyses
Calculate effects within each confounder level—are they consistent or wildly different?
Lesson 1429Identifying Confounders in Practice
Compare the smaller sum
to critical values or compute a p-value
Lesson 392Wilcoxon Signed-Rank Test
Compare visually and numerically
Do summary statistics (mean, variance, extreme values) of simulated data match your observed data?
Lesson 1596Posterior Predictive Checks and Model Comparison
Compare your observed statistic
to this distribution to get a p-value
Lesson 396Bootstrap Hypothesis Testing
Comparing datasets
Detect records in a source system missing from a target
Lesson 1002EXCEPT: Finding Differences
Comparing groups
Use SE to gauge if observed differences are substantial
Lesson 265Using Standard Error in Practice
Comparing means across categories
(like average sales by quarter)
Lesson 1288Point Plots for Trend Visualization
Comparing metrics
Find records where one value exceeds another
Lesson 947Self-Joins for Comparisons Within a Table
Comparing models
requires matching units (you can't directly compare slopes from different scales)
Lesson 525Units and Scale in Interpretation
Comparing multiple curves
(different cohorts or product versions) reveals which changes improved stickiness
Lesson 1653What are Retention Curves?
Comparisons
between a row and an aggregate (e.
Lesson 1005Introduction to Window Functions
Complement Rule
gives you a shortcut:
Lesson 89The Complement Rule
Complementary events
save work when one tail is shorter.
Lesson 130Calculating Binomial Probabilities
Complementary probabilities
Using P(A') = 1 - P(A) for efficiency
Lesson 130Calculating Binomial Probabilities
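To make the shortcut concrete, this sketch computes P(X ≥ 1) for a binomial variable two ways: by summing every term from 1 to n, and by taking one minus P(X = 0). The values n = 10 and p = 0.1 are arbitrary, and `binom_pmf` is a helper written for this example.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.1

# Long way: add up P(X = 1) through P(X = n).
direct = sum(binom_pmf(k, n, p) for k in range(1, n + 1))

# Shortcut: P(X >= 1) = 1 - P(X = 0), a single term.
via_complement = 1 - binom_pmf(0, n, p)
```

Both routes give the same probability, but the complement needs one term instead of ten.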
Complete rows
If entire rows are missing, perhaps certain groups weren't measured
Lesson 1179Identifying Missing Values Patterns
Completeness checks
are your detective work for finding exactly where data is missing, how much is missing, and whether the missingness follows patterns.
Lesson 1153Completeness Checks
Completion Rate
Percentage of content finished by viewers.
Lesson 1635Media and Content Metrics: Watch Time and Content Performance
Complex aggregations
Multiple groupBy operations with window functions over large groups
Lesson 1784Computation Complexity: Beyond Data Size
Complex constraints
Stan's type system handles parameter boundaries and transformations elegantly
Lesson 1595Stan: High-Performance Bayesian Inference
Complex layouts
Use `constrained_layout=True`
Lesson 1277Adjusting Subplot Spacing and Layout
Complex models
Multi-parameter models where conjugacy breaks down anyway
Lesson 1556Choosing Between Conjugate and Non-Conjugate Priors
Complex queries
When you need multiple derived tables or nested subqueries
Lesson 974When to Use FROM Subqueries vs CTEs
Complex trends
that aren't straight lines
Lesson 745STL Decomposition (Seasonal-Trend Loess)
Complexity costs
Adding that tenth feature interaction makes your model unmaintainable
Lesson 2116Diminishing Returns and the 80/20 Rule
Complexity penalty
A term that increases with the number of parameters (k)
Lesson 629Akaike Information Criterion (AIC)
Compliance
Meet regulations like GDPR while still enabling data-driven work
Lesson 1901Synthetic Data Generation
Composite keys
Multiple columns together, like `(order_id, product_id)`
Lesson 1048What Are Primary Keys?
Compositional changes
occur when the *makeup* of your treatment or control groups changes over time.
Lesson 1458Common DiD Pitfalls
Compounding growth drag
Even with strong acquisition, high churn prevents the compounding effects of a growing base
Lesson 1670What is Churn and Why It Matters
Computational complexity
the number and cost of operations you perform—can make processing even modest-sized datasets painfully slow on a single machine.
Lesson 1784Computation Complexity: Beyond Data Size
Computational complexity increases
Different methods (Type I, II, III sums of squares) can give different results
Lesson 468Balanced vs Unbalanced Designs
Computational efficiency
Process data in manageable chunks
Lesson 1538Updating Beliefs with Sequential Data
Computational resources
Can you process millions of rows or just thousands?
Lesson 1169Clarifying Assumptions and Constraints
Computational simplicity
No need for sampling algorithms or numerical integration
Lesson 1555Advantages and Limitations of Conjugate Priors
Computationally efficient
You only need the last forecast and the new observation
Lesson 757Introduction to Exponential Smoothing
Compute baseline conversion probability
The chance a random user converts given the current channel mix
Lesson 1733Markov Chain Attribution Models
Compute means
Find x̄ (mean of x values) and ȳ (mean of y values)
Lesson 522Implementing Least Squares from Scratch
Compute on encrypted values
using special arithmetic that preserves secrecy
Lesson 1903Secure Multi-Party Computation
Compute summaries
`stat_summary()` calculates means, medians, or custom functions
Lesson 1352Statistical Transformations with stat_* Layers
Computer Science
builds software systems and algorithms.
Lesson 1Defining Data Science
Computer Science & Programming
Lesson 1Defining Data Science
Concentration of values
(wider sections = more data points)
Lesson 1286Violin Plots and Distribution Shape
Conclusion
The die doesn't appear to follow a uniform distribution; it's likely biased
Lesson 420Interpreting Chi-Squared Test ResultsLesson 733Using ACF and PACF Together
Conclusion cells
Summarize findings and recommendations
Lesson 1982Literate Programming with Notebooks
Conditional dependencies
Some tools support dynamic dependency creation
Lesson 1843Declaring Dependencies in Orchestration Tools
Conditional probability
captures exactly this: the probability of event A happening when we *already know* event B has occurred.
Lesson 92Definition and Notation of Conditional ProbabilityLesson 96Conditional Probability in Tree Diagrams
Conditional values
Different logic per row based on related data
Lesson 967Subqueries in the SELECT Clause
Confidence bands
Usually shown as blue shaded regions or dashed lines (typically at ±2/√n).
Lesson 722ACF Plots and Interpretation
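A rough sketch of how the ±2/√n band is used: compute the sample autocorrelation at each lag and flag the lags that fall outside the band. The series is a made-up upward trend, and `autocorrelation` is a plain-Python helper written for this example rather than a library function.

```python
from math import sqrt


def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom


# Illustrative series: a steady upward trend of 16 observations.
series = list(range(1, 17))

# Approximate band: autocorrelations inside +/- bound are treated as noise.
bound = 2 / sqrt(len(series))

significant_lags = [
    lag for lag in range(1, 5) if abs(autocorrelation(series, lag)) > bound
]
```

With a trending series like this one, the short lags fall outside the band and the autocorrelation decays slowly as the lag grows.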
Confirmation bias
Analyzing data only until it supports a desired conclusion
Lesson 1926The Honest Broker Role
Conflict (insight)
What surprising or important pattern did you discover?
Lesson 1933The Power of Narrative in Data Communication
Confusing logic
Code must constantly check `type` to interpret what's valid
Lesson 1148Handling Multiple Types in One Table
Confusion
New team members (or your future self) waste time trying to understand if old experiments are still relevant
Lesson 2135Dead Experimental Code and Feature Sprawl
Confusion matrices
See model prediction patterns
Lesson 1224Heatmaps and Correlation Matrices
conjugate prior
is a prior distribution that, when combined with a specific likelihood function, produces a posterior distribution from the same probability family as the prior.
Lesson 1550What Are Conjugate Priors?Lesson 1551Beta-Binomial Conjugacy
Connecting to objectives
Every insight should tie back to the problem you scoped at the start.
Lesson 2090Stage 6: Interpretation and Insight Generation
Connection pooling
is like a parking lot for database connections.
Lesson 1092Connection Pooling Basics
Connection to normal
If *ln(X)* ~ Normal(μ, σ²), then *X* ~ Log-Normal
Lesson 178Log-Normal Distribution: Definition and Properties
Cons
Stale data between refreshes, storage overhead, refresh time on large datasets
Lesson 1076Materialized Views and Summary Tables
Consecutive rankings
for categorization (like price tiers: budget, mid-range, premium)
Lesson 1009DENSE_RANK(): Ranking Without Gaps
Consider `nbdime` or `jupytext`
Tools like `nbdime` provide notebook-aware diffs.
Lesson 2030Version Control for Notebooks: Challenges and Solutions
Consider accessibility
Approximately 8% of men and 0.
Lesson 1961Color as Communication Tool
Consider adversarial users
Who benefits from gaming your system?
Lesson 1924Red Team Thinking for Data Scientists
Consider d=2 cautiously
If d=1 didn't work, try second-order differencing (differencing the already-differenced series).
Lesson 778Determining Differencing Order (d)
Consider JOINs instead
Correlated subqueries can often be rewritten as LEFT JOINs with GROUP BY, executing more efficiently
Lesson 969Performance Considerations for SELECT Subqueries
Consider JOINs instead when
You have many conditions (10+) or conditions change frequently.
Lesson 1037CASE Best Practices and Performance
Consider ramp-up periods
Exclude the first few days from analysis
Lesson 1525Novelty and Primacy Effects
Consider robustness
With n > 30-40, t-tests handle mild violations well (Central Limit Theorem)
Lesson 398Choosing Between Parametric and Non-Parametric Tests
Consider simpler alternatives
regression with strong domain priors, expert-designed scoring systems, or rule-based logic.
Lesson 2124Insufficient or Low-Quality Data
Consider the confidence interval
width
Lesson 609Practical vs Statistical Significance
Consider UUID/GUID
for distributed systems where different databases generate records independently
Lesson 1050Choosing Effective Primary Keys
Consider WHERE filters when
You only need to include/exclude rows, not transform values.
Lesson 1037CASE Best Practices and Performance
Consistency risks
if updates fail partially
Lesson 1071When to Denormalize: Performance Trade-offs
Consistency with benchmarks
Does your entire interval fall in the "large effect" range, or does it span from "small" to "large"?
Lesson 387Confidence Intervals for Effect Sizes
Consistent analysis syntax
Functions like `groupby()`, `pivot_table()`, and aggregation operations work the same way across different datasets.
Lesson 1149Benefits of Tidy Data for Downstream Work
Consistent spread
The scatter shouldn't fan out or compress at one end
Lesson 480Scatterplots and Visual Assessment
Constant autocorrelation structure
the relationship between observations at different time lags remains stable
Lesson 712What is Stationarity?
Constant mean
the average value doesn't drift up or down
Lesson 712What is Stationarity?
Constant variance
the spread or volatility stays the same
Lesson 712What is Stationarity?
Constant variance (homoscedasticity)
Do residuals spread evenly across all predicted values, or do they fan out or compress?
Lesson 544The Role of Residuals in Diagnostics
Constant-width seasonal swings
→ additive
Lesson 710Additive vs Multiplicative Models
Consultation
involving your Data Protection Officer and potentially data subjects
Lesson 1910Data Protection Impact Assessments (DPIAs)
Consume massive memory
your database must store or stream millions of rows
Lesson 943CROSS JOIN Results: Size and Structure
Consume memory
holding all unique combinations
Lesson 911Performance Considerations with Multiple Groups
Contact/Contribution
Who maintains this and how to get involved
Lesson 2077The Purpose and Anatomy of a Good README
Container tools
that package code *and* environment together
Lesson 29Code and Environment Management
Content Acquisition Cost (CAC)
Total spend (licensing or production) divided by content hours.
Lesson 1635Media and Content Metrics: Watch Time and Content Performance
Content Library Depth
Number of titles and hours of available content.
Lesson 1635Media and Content Metrics: Watch Time and Content Performance
Content marketing
(blog posts, videos, podcasts)
Lesson 1711What Are Acquisition Channels?
Content Platform
Discover Content → Click → Watch/Read → Like/Share
Lesson 1678What is Funnel Analysis?
Content platforms
Account creation or premium upgrade
Lesson 1686Defining Conversions and Conversion Rate
Context expertise
Owner knows when the metric is actionable vs.
Lesson 1619What is Metric Ownership?
Contextual understanding
Some intersections carry unique historical disadvantages that single-attribute analysis misses entirely
Lesson 1893Intersectionality in Fairness
Continue the rebase
Run `git rebase --continue` to move to the next commit
Lesson 2018Resolving Conflicts During Rebase
Continuity
The eye naturally follows smooth, continuous paths.
Lesson 1236Gestalt Principles in Visualization
Continuity correction
For small counts (b + c < 25), use the corrected formula: χ² = (|b - c| - 1)² / (b + c)
Lesson 436Conducting McNemar's Test
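The corrected formula can be sketched directly, assuming `b` and `c` are the two discordant-pair counts from the 2×2 paired table. The counts used here are invented.

```python
def mcnemar_corrected(b, c):
    """McNemar chi-squared statistic with continuity correction.

    b and c are the discordant-pair counts (pairs that changed in one
    direction versus the other).
    """
    if b + c == 0:
        raise ValueError("statistic is undefined when there are no discordant pairs")
    return (abs(b - c) - 1) ** 2 / (b + c)


# Illustrative counts: 15 pairs switched one way, 5 switched the other way.
statistic = mcnemar_corrected(15, 5)
```

The resulting statistic is compared against a chi-squared distribution with 1 degree of freedom.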
Continuous (water)
You might have 250ml, or 250.
Lesson 18Numerical Variables: Discrete and Continuous
Continuous data
mapping numeric ranges to positions or gradients
Lesson 1344Scales and Coordinate Systems
Continuous monitoring required
IoT sensors tracking equipment failures need instant alerts
Lesson 1788Streaming Data and Real-Time Requirements
Continuous numerical data
represents *measurements* that can take any value within a range, including decimals.
Lesson 18Numerical Variables: Discrete and Continuous
continuous positive values
that tend to be skewed rather than symmetric.
Lesson 183Applications of the Gamma DistributionLesson 678Choosing the Right Link Function
Continuous predictors
(like age, blood pressure, or income) take numerical values along a scale, while **categorical predictors** (like treatment group, gender, or risk category) represent distinct groups.
Lesson 829Continuous and Categorical Predictors
Continuous relationships
Use `geom_point` or `geom_line`
Lesson 1342Geometric Objects (geoms)
Continuous unbounded data
The **identity link** (standard linear regression) is appropriate.
Lesson 678Choosing the Right Link Function
Contour plots
Display 3D surfaces as 2D contour lines, like topographic maps
Lesson 1329Effective Use and Pitfalls of 3D Visualizations
Contracting funnel
The opposite—wide on the left, narrow on the right.
Lesson 559Detecting Heteroscedasticity (Non-Constant Variance)
Contradicts defined success metrics
(e.
Lesson 2107Saying No and Pushing Back Constructively
Contrast checkers
verify that your text and visual elements meet minimum visibility standards
Lesson 1254Testing Visualizations for Accessibility
Control backfills
Re-running a task may require re-running its entire downstream chain
Lesson 1841Upstream and Downstream Dependencies
Control charts
for process stability without strong seasonality
Lesson 1411Applications and Limitations
Control for confounders
Isolate true relationships from spurious ones
Lesson 1190Introduction to Multivariate Analysis
Control for confounding variables
you learned about in partial correlation
Lesson 595From Simple to Multiple Linear Regression
Control group, before intervention
(baseline)
Lesson 1452The Difference-in-Differences Setup
Control Limits
are the "voice of the process."
Lesson 1400Control Limits vs Specification Limits
controlled experiment
solves this problem through **randomization**.
Lesson 499Why Controlled Experiments Are NeededLesson 1477Core Principles of A/B Testing
Controlling deletes
Cascading actions (DELETE and UPDATE) you learned about help maintain integrity when parent records change
Lesson 1055What is Referential Integrity?
Controls
Account for confounding variables (income, membership duration)
Lesson 1204From Hypothesis to Analysis Plan
Controls for attention effects
Users still see *something* in the ad slot
Lesson 1747Ghost Ads and PSA Tests
Convenience
or **quota sampling** may be pragmatic (but acknowledge the bias risk).
Lesson 243Choosing the Right Sampling Method
Convenience sampling gone wrong
Surveying only people easy to reach (like students in your class) when you want to understand all adults.
Lesson 244Selection Bias and Its Causes
Convenience wins
Metrics like clicks, page views, or session duration give fast feedback.
Lesson 1530Mismatched Metrics and Goals
Convention
Most SQL developers write keywords in UPPERCASE to distinguish them from table/column names, but lowercase works equally well.
Lesson 847Basic SQL Syntax Rules
conversion
is any desired action a user completes that moves them closer to delivering value to your business.
Lesson 1686Defining Conversions and Conversion RateLesson 1690Landing Page and CTA Optimization
Conversion probability curves
What percentage converts by day 7, 30, or 90?
Lesson 839Time-to-Conversion in Marketing Funnels
Convert between zones
Transform a timestamp from one timezone to another (e.
Lesson 1042Working with Timestamps and Time Zones
Cookie banners
are designed to extract "consent" through friction and confusion.
Lesson 1914Consent in Digital Contexts
Cookiecutter
is a command-line tool that creates projects from templates.
Lesson 2076Code Organization Templates and Cookiecutter
Cookiecutter Data Science
, which implements best practices you've already learned: separating raw from processed data, organizing notebooks, keeping configuration separate, and more.
Lesson 2076Code Organization Templates and Cookiecutter
Coordinated disclosure
If external disclosure is needed, work with security/ethics experts to time and frame it appropriately
Lesson 1925Mitigation Strategies and Responsible Disclosure
Coordinates (coord)
The space where data is plotted (Cartesian, polar, map projections)
Lesson 1340The Seven Layers of Grammar
Coordinating dependencies
some tasks must wait for others (e.
Lesson 1769Task Parallelism and Work Distribution
Copyleft licenses
(like GPL, AGPL) require that derivative works also be open source under the same license.
Lesson 2081Understanding Open Source Licenses
Core
is the foundation level, and the **ORM** (Object-Relational Mapper) is built on top of it.
Lesson 1118SQLAlchemy Core vs ORM
Core feature usage
(weight: 0.
Lesson 1699Engagement Scoring Systems
Correct
"We are 95% confident the population proportion lies between 0.
Lesson 281Interpreting Proportion Confidence Intervals
Correct interpretation
"Being sick causes people to go to the hospital.
Lesson 496Reverse Causality
Correct period
Algorithm properly separates normal seasonal peaks from true anomalies
Lesson 1409Setting Detection Parameters
Correctly explaining results
"NYC homes cost $15k more than Boston homes" (not "NYC homes cost $15k")
Lesson 643Interpreting Coefficients Relative to Reference
Correctness first
Does the logic actually work?
Lesson 2024Code Review Best Practices
Correlated SELECT subqueries
run repeatedly:
Lesson 969Performance Considerations for SELECT Subqueries
Correlated subqueries
reference columns from the outer query.
Lesson 968Correlated vs Non-Correlated Subqueries in SELECT
correlated subquery
references the outer query and must run for *every row* being evaluated.
Lesson 966Performance Considerations for WHERE SubqueriesLesson 975What is a Correlated Subquery?
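A minimal sketch of a correlated subquery, using Python's built-in `sqlite3` module. The `employees` table and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary REAL);
    INSERT INTO employees VALUES
        ('Ana', 'eng', 120), ('Ben', 'eng', 100),
        ('Cal', 'ops', 80),  ('Dee', 'ops', 60);
""")

# The inner query references e.dept from the outer row, so it is
# logically re-evaluated for every employee being tested.
above_dept_avg = conn.execute("""
    SELECT e.name
    FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(i.salary) FROM employees AS i WHERE i.dept = e.dept
    )
    ORDER BY e.name
""").fetchall()

print(above_dept_avg)
```

Because the inner query depends on the outer row, it cannot be computed once and reused, which is why this pattern can be slow on large tables.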
Correlation
means two variables move together in a statistically observable pattern.
Lesson 1420Defining Correlation and Causation
Correlation coefficient (r)
for relationships between variables
Lesson 384What is Effect Size?
Correlation IDs
to trace a record through multiple systems
Lesson 1857Logging Best Practices
Correlation matrices and heatmaps
(Lesson 1192) reveal pairs of highly correlated variables—strong candidates for redundancy
Lesson 1197Identifying Variable Importance and Redundancy
correlation matrix
solves this by computing correlations between *every pair* of variables and organizing them into a grid.
Lesson 510Correlation Matrices: Construction and DisplayLesson 513Applications: Feature Selection and MulticollinearityLesson 1192Correlation Matrices and Heatmaps
Correlation matrix examination
Drop one variable from pairs with correlation > 0.
Lesson 585Remedies: Variable Selection
Cost efficiency
How much you spend to operate
Lesson 1516Business Metrics: Definition and Examples
Cost forecasting
Knowing the hazard function helps finance teams predict warranty claim volumes and budget accordingly.
Lesson 837Product Warranty and Failure Analysis
Cost less to maintain
No retraining pipelines, drift monitoring, or GPU compute
Lesson 2128Data Distribution Shifts Frequently
Cost Per Acquisition (CPA)
by targeting cheaper traffic sources, they might inadvertently decrease **Average Order Value (AOV)** that the sales team tracks.
Lesson 1625Cross-Functional Metric DependenciesLesson 1714Channel-Level MetricsLesson 1715Comparing Channel Performance
Cost per patient encounter
divides total expenses by the number of patient visits or admissions, making it a critical profitability metric.
Lesson 1633Healthcare Metrics: Patient Outcomes and Operational Efficiency
Cost-benefit analysis
Is the effect large enough to justify intervention costs?
Lesson 386Effect Size Interpretation GuidelinesLesson 2126Cost and Complexity Exceed Benefit
Cost-effective
Reduces travel and administrative costs by concentrating data collection in selected clusters
Lesson 238Multistage Sampling
COUNT
, **SUM**, **AVG**, **MIN**, and **MAX**—together with **GROUP BY** to create rich summaries of grouped data.
Lesson 892GROUP BY with Different Aggregate Functions
Count the significant spikes
before the cut-off
Lesson 777Identifying MA Order (q) Using ACF
Count the signs
Ignore zeros; count how many differences are positive (+) and how many are negative (−)
Lesson 391The Sign Test for Medians
COUNT(column_name)
counts only the rows where that specific column has a **non-null value**.
Lesson 882COUNT: Counting Rows and Non-Null ValuesLesson 894NULL Values in GROUP BY
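The difference between `COUNT(*)` and `COUNT(column_name)` can be checked with a small in-memory SQLite table. The `customers` table and its values are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT);
    INSERT INTO customers VALUES
        (1, 'a@example.com'), (2, NULL), (3, 'c@example.com');
""")

# COUNT(*) counts every row; COUNT(email) skips rows where email IS NULL.
total_rows, rows_with_email = conn.execute(
    "SELECT COUNT(*), COUNT(email) FROM customers"
).fetchone()

print(total_rows, rows_with_email)
```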
COUNT(right_table_column)
ignores NULLs → correct for "how many matches"
Lesson 933Aggregating with LEFT JOINs
Counter-example
Drawing two cards from a deck *without replacement* creates dependence.
Lesson 101Defining Statistical Independence
Counter-metrics
and **guardrails** are defensive metrics designed to catch these problems before they damage your business.
Lesson 1624Counter-Metrics and GuardrailsLesson 1635Media and Content Metrics: Watch Time and Content Performance
Counting distinct performance levels
rather than absolute positions
Lesson 1009DENSE_RANK(): Ranking Without Gaps
Course-correct quickly
Discover that your chosen metric doesn't align with business goals *before* building a complete pipeline.
Lesson 2111Fast Feedback Loops with Stakeholders
Courses
(CourseID, CourseName)
Lesson 1065Second Normal Form (2NF)
covariance
as the "raw" measure of how two variables move together.
Lesson 478The Formula for Pearson's rLesson 519Computing β₁: The Slope Estimate
Covariate balance
means the distribution of baseline characteristics—age, prior purchase behavior, device type, etc.
Lesson 1491Covariate Balance and Diagnostics
Coverage error
occurs when your **sampling frame** (the actual list or method you use to select your sample) doesn't include everyone in the target population.
Lesson 249Coverage Error and Undercoverage
Cox & Snell R²
Based on likelihood ratios but capped below 1
Lesson 702Pseudo R-Squared Measures
Cox models
to identify which covariates (customer age, past purchases, email open time) predict faster responses
Lesson 841Campaign Response Time Analysis
Cox Proportional Hazards Model
(or Cox regression) is a semi-parametric method that lets you predict how covariates (like age, treatment, or risk factors) affect survival time **without** assuming what the baseline hazard distribution looks like.
Lesson 825What is the Cox Proportional Hazards Model?Lesson 835Customer Churn Prediction with Survival AnalysisLesson 836Employee Turnover and Retention AnalysisLesson 839Time-to-Conversion in Marketing FunnelsLesson 840Loan Default Timing and Credit Risk
Cramér's V
is the standard effect size measure for chi-squared tests of independence.
Lesson 429Effect Size: Cramér's V and Phi
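As a rough sketch, Cramér's V can be computed from a contingency table with the usual formula V = sqrt(χ² / (n · min(r − 1, c − 1))). The function name `cramers_v` is hypothetical, and the code assumes no row or column total is zero.

```python
from math import sqrt

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(col_totals)) - 1  # min(rows - 1, cols - 1)
    return sqrt(chi2 / (n * k))

print(cramers_v([[10, 0], [0, 10]]))  # perfectly associated table
print(cramers_v([[5, 5], [5, 5]]))    # no association
```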
Crash queries
some databases have row limits or timeouts
Lesson 943CROSS JOIN Results: Size and Structure
Create a narrative flow
Guide readers from question → exploration → findings → conclusions in a linear, readable format
Lesson 1982Literate Programming with Notebooks
Create disparate impact
even without intent (your model systematically disadvantages a protected group)
Lesson 1888Protected Classes and Sensitive Attributes
Create dynamic task groups
that adapt to data
Lesson 1836Task Dependencies and Flow Control
Create predictable rhythms
Schedule recurring meetings at the project's start.
Lesson 2104Communication Cadence and Updates
Create strata
by grouping units with identical covariate values
Lesson 1489Stratified Randomization Fundamentals
Create unexpected results
accidental CROSS JOINs are a common SQL mistake
Lesson 943CROSS JOIN Results: Size and Structure
Creates a smoother series
that's easier to analyze
Lesson 755Moving Averages for Trend Estimation
Creating
and modifying table structures
Lesson 844What is SQL?
Creating new features
from existing ones: combining columns, extracting date components (day of week, month), or calculating ratios that encode domain knowledge.
Lesson 2088Stage 4: Feature Engineering and Preparation
Creative costs
design, copywriting, video production
Lesson 1753Customer Acquisition Cost (CAC): Components and Calculation
credible interval
is the Bayesian alternative to a confidence interval.
Lesson 1562Credible Intervals for ProportionsLesson 1573What is a Credible Interval?
Credit history location
→ historical discrimination effects
Lesson 1889Proxy Variables and Redlining
Credit scoring
built on historical lending bias against minorities
Lesson 1881Historical and Societal Bias
Criminal justice data
reflecting decades of discriminatory policing practices
Lesson 1881Historical and Societal Bias
Critical caveat
Due to Jensen's inequality, `exp(E[log(Y)])` ≠ `E[Y]`.
Lesson 594Interpreting Models After Transformation
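A small numeric check of this caveat, using three made-up values. Exponentiating the mean of the logs recovers the geometric mean, which here is well below the arithmetic mean.

```python
import math

y = [1.0, 10.0, 100.0]  # illustrative values only

mean_y = sum(y) / len(y)                            # E[Y]
mean_log_y = sum(math.log(v) for v in y) / len(y)   # E[log(Y)]
back_transformed = math.exp(mean_log_y)             # exp(E[log(Y)])

print(mean_y, back_transformed)
```

For this sample the back-transformed value is about 10 while the arithmetic mean is 37, so naively exponentiating a log-scale prediction understates the mean of `Y`.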
Critical insight
Controlling for a collider *opens* a spurious path between X and Y, creating bias where none existed.
Lesson 1471Mediators and Colliders
Critical pipeline failures
Page on-call engineer via PagerDuty
Lesson 1851Error Logging and Notifications
Critical point
`AVG()` automatically *ignores* NULL values.
Lesson 884AVG: Computing Averages
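The effect of NULL handling on `AVG()` can be seen in a short SQLite sketch. The `scores` table is hypothetical; `COALESCE` is shown only to contrast with treating missing values as zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (value REAL);
    INSERT INTO scores VALUES (10), (20), (NULL);
""")

# AVG skips the NULL row: (10 + 20) / 2.
avg_ignoring_null = conn.execute("SELECT AVG(value) FROM scores").fetchone()[0]

# Replacing NULL with 0 changes the denominator: (10 + 20 + 0) / 3.
avg_null_as_zero = conn.execute(
    "SELECT AVG(COALESCE(value, 0)) FROM scores"
).fetchone()[0]

print(avg_ignoring_null, avg_null_as_zero)
```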
Critical requirement
MVT demands significantly more traffic than A/B testing because you're splitting visitors across many more variants.
Lesson 1689Multivariate Testing and Personalization
Critical rule
Each step must map to at least one clear event.
Lesson 1679Defining Funnel Steps and Events
Critical/Page
Data corruption, complete pipeline failure, SLA breach—requires immediate human intervention
Lesson 1858Alerting Strategies
Critically
You *cannot* say "there's a 95% probability the true proportion lies in this interval" under the frequentist interpretation—the parameter either is or isn't in that specific interval.
Lesson 1564Comparing Bayesian and Frequentist Proportion Inference
Cross-filtering
Filtering data in one view updates all views
Lesson 1304Subplots and Linked Interactions
Cross-tabulations
Visualize frequency patterns across two categorical variables
Lesson 1224Heatmaps and Correlation Matrices
Crossing or converging lines
→ Interaction present; one factor's effect changes depending on the other
Lesson 466Visualizing Interactions
CRS
defines exactly how coordinates relate to positions on Earth.
Lesson 1308Geographic Data Types and Coordinate Systems
CSV
is human-readable and universal but slow to parse and memory-intensive.
Lesson 1133Performance Considerations Across Formats
CTE name
(like `cte_name`) that you choose
Lesson 990Basic CTE Syntax and Structure
CTEs
are named and defined upfront (we'll cover these in detail soon):
Lesson 974When to Use FROM Subqueries vs CTEsLesson 991CTEs vs Subqueries: When to Use Each
Cube Root Transformation
(`x^(1/3)`) is useful for:
Lesson 213Square Root and Cube Root Transformations
Cube the z-scores
For each value, subtract the mean, divide by standard deviation, then cube it.
Lesson 65Calculating Skewness
Cubing
keeps the sign, so values below the mean contribute negatively and values above contribute positively.
Lesson 65Calculating Skewness
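The cube-the-z-scores recipe can be sketched as follows. This version assumes the population standard deviation and a plain average of the cubed z-scores; sample-adjusted skewness formulas differ slightly, and the function name `skewness` is illustrative.

```python
import statistics

def skewness(values):
    """Average of cubed z-scores (population form, no bias correction)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    cubed_z = [((v - mean) / sd) ** 3 for v in values]
    return statistics.fmean(cubed_z)

print(skewness([1, 2, 3, 4, 5]))   # symmetric data
print(skewness([1, 1, 1, 10]))     # long right tail
```

Symmetric data gives a value of zero because the negative and positive cubes cancel, while a long right tail gives a positive value.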
Cultural buy-in
Stakeholders may question "peeking" at results, requiring education
Lesson 1515Trade-offs: Sample Size, Speed, and Complexity
Cumulative metrics
$12,500 total revenue from Jan cohort by Week 3
Lesson 1647Building a Cohort Table
Cumulative probabilities
P(X ≤ k) — at most k successes, or P(X ≥ k) — at least k successes
Lesson 130Calculating Binomial Probabilities
Cumulative probability
requires summing multiple exact probabilities.
Lesson 130Calculating Binomial Probabilities
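A minimal sketch of summing exact binomial probabilities to get a cumulative one. The helper names `binom_pmf` and `binom_cdf` and the example of 10 trials with p = 0.5 are assumptions for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k): sum the exact probabilities for 0..k successes."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

at_most_2 = binom_cdf(2, 10, 0.5)   # P(X <= 2)
at_least_3 = 1 - at_most_2          # P(X >= 3) via the complement

print(at_most_2, at_least_3)
```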
Curiosity
Great data scientists ask "why?
Lesson 7The Data Science Skill Stack
CURRENT ROW
Start or end at the current row being processed
Lesson 1020UNBOUNDED and CURRENT ROW Keywords
Curved patterns
Your relationship isn't actually linear (violates linearity assumption)
Lesson 556What Are Residuals and Why Plot Them?Lesson 1189Detecting Nonlinear Relationships
Custom functions
Tailor alpha spending (how the Type I error budget is allocated across interim looks) to your business needs
Lesson 1512Group Sequential Testing
Custom metrics
unique to your business problem
Lesson 306Bootstrap for Non-Standard Problems
Custom order
Based on domain meaning (e.
Lesson 1178Bar Charts for Categorical Data
Customer Arrivals
A coffee shop averages 15 customers per hour.
Lesson 144Poisson Applications: Arrivals and Events
Customer behavior
Whether users stay, leave, or convert
Lesson 1516Business Metrics: Definition and Examples
Customer demographics
Do purchases align with market segment sizes?
Lesson 421Applications: Uniform, Genetic Ratios, and Distributions
Customer Lifespan
measures how long customers stay active (in the same time unit).
Lesson 1663Simple LTV: Average Revenue Per Customer
Customer Lifetime
= 1 / Monthly Churn Rate
Lesson 1666LTV for Subscription Businesses
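The formula translates directly into code. This sketch assumes a constant monthly churn rate expressed as a fraction; the function name is hypothetical.

```python
def customer_lifetime_months(monthly_churn_rate):
    """Expected customer lifetime in months, assuming constant churn."""
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("monthly_churn_rate must be in (0, 1]")
    return 1 / monthly_churn_rate

# A 5% monthly churn rate implies roughly a 20-month expected lifetime.
print(customer_lifetime_months(0.05))
```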
Customer Lifetime Value (CLV)
Predicted total revenue from a customer
Lesson 1516Business Metrics: Definition and Examples
Customer Lifetime Value (LTV)
is the total revenue a customer generates over their entire relationship with a business—from their first purchase to their last interaction before churning.
Lesson 1661What is Customer Lifetime Value (LTV)?
Customer preferences
Your marketing theory predicts 40% will choose red, 35% blue, 25% green.
Lesson 414Introduction to Chi-Squared Goodness of Fit Test
Customer Retention Rate
Percentage of customers who remain active
Lesson 1516Business Metrics: Definition and Examples
Customer segments
A customer classified as "new" cannot simultaneously be "returning"
Lesson 81Mutually Exclusive Events
Customer service calls arriving
If no one has called in the last 5 minutes, that doesn't make a call in the next minute more or less likely
Lesson 167Memoryless Property of Exponential
Customer success failures
(poor onboarding, lack of support)
Lesson 1675Churn Attribution and Root Cause Analysis
Customers table
customer ID, name, email
Lesson 918What is an INNER JOIN?
CUSUM
tracks the *cumulative sum* of deviations from a target value.
Lesson 1403CUSUM and EWMA ChartsLesson 1415CUSUM: Cumulative Sum Control Chart
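As a sketch of the core idea only, the running sum of deviations from a target can be computed in a few lines. This is the plain cumulative-sum form; production control charts usually add a slack allowance and a decision threshold, which are omitted here. The function name and sample data are illustrative.

```python
from itertools import accumulate

def cusum(values, target):
    """Running sum of (value - target) for each observation."""
    return list(accumulate(v - target for v in values))

# A small sustained shift above the target of 10 accumulates quickly.
print(cusum([10, 11, 12, 12, 13], target=10))
```

Each individual reading is only slightly above target, but the cumulative sum keeps growing, which is what makes small persistent shifts visible.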
Cut technical debt
Features with low adoption *and* low frequency are candidates for removal
Lesson 1696Feature Adoption and Usage Frequency
Cycle through each parameter
, sampling from its conditional distribution:
Lesson 1591Gibbs Sampling for Multivariate Posteriors
Cycle Time
tracks how long it takes to complete one unit from start to finish, while **throughput** measures actual units produced per time period.
Lesson 1636Manufacturing Metrics: OEE, Yield, and Cycle Time
Cyclical
Variable period (could be 3 years, then 5 years, then 4 years)
Lesson 708Cyclical Patterns: Non-Fixed Fluctuations

D

DAGs (Directed Acyclic Graphs)
are the heart of Airflow.
Lesson 1833Introduction to Apache Airflow
Dagster
log every execution step.
Lesson 1164Tools for Lineage Tracking
Daily Active Users (DAU)
counts how many unique users performed a meaningful action in your product on a given day.
Lesson 1694Daily Active Users (DAU) and Monthly Active Users (MAU)
Daily data
with weekly seasonality → period = 7
Lesson 746Choosing Seasonal Period
Damage vulnerable populations
through automated decisions
Lesson 1888Protected Classes and Sensitive Attributes
Damaged trust
in data science as a field
Lesson 34Recognizing Boundaries of Competence
Damped trend methods
add a *damping parameter* (usually denoted φ, pronounced "phi") that gradually flattens the trend over time.
Lesson 762Damped Trend Methods
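A hedged sketch of how the damping parameter flattens a forecast, using the standard damped-trend form ŷ(t+h) = level + (φ + φ² + … + φʰ) · trend. It assumes the level and trend have already been estimated; the function name and numbers are illustrative.

```python
def damped_forecast(level, trend, phi, horizon):
    """h-step-ahead forecast with a damped trend (0 < phi <= 1)."""
    damping = sum(phi**i for i in range(1, horizon + 1))
    return level + damping * trend

# phi = 1 reproduces an undamped linear trend; phi < 1 flattens it.
print(damped_forecast(100, 10, phi=1.0, horizon=3))
print(damped_forecast(100, 10, phi=0.5, horizon=2))
```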
Dampens irregular fluctuations
(the noise component)
Lesson 755Moving Averages for Trend Estimation
Dark patterns
are interface designs that deliberately trick users into giving up data or privacy rights.
Lesson 1914Consent in Digital Contexts
Dash
(by Plotly) offers **more control and flexibility**.
Lesson 1330Introduction to Interactive Dashboards
dashboard
is an interactive, real-time (or near-real-time) monitoring tool that updates automatically as underlying data changes.
Lesson 1974Defining Dashboards and ReportsLesson 1997Viewing Repository State with git status
Dashboards excel at monitoring
, letting stakeholders track KPIs, spot anomalies, and maintain situational awareness.
Lesson 1980Hybrid Approaches and When to Use Both
Data access
Are there privacy restrictions or missing historical data?
Lesson 1169Clarifying Assumptions and Constraints
Data Analyst
investigates *why* the Northeast region underperformed and identifies the key factors
Lesson 4Data Science vs Data Analytics vs Business Intelligence
Data Analysts
focus on *understanding the past and present*.
Lesson 2138Data Analyst vs Data Scientist vs ML Engineer
Data auditing
Find customers who placed orders in 2023 but not in 2024
Lesson 1002EXCEPT: Finding Differences
Data augmentation
Add more representative examples from underrepresented groups to balance your dataset.
Lesson 1894Auditing and Remediation Strategies
Data availability
You might have a task that checks which data sources updated today, then spawns processing tasks only for those sources.
Lesson 1844Dynamic Dependencies
Data Cleaning and Preparation
to extract meaningful insights—for example, analyzing customer reviews (text) to understand sentiment, or processing images to detect patterns.
Lesson 16Structured vs Unstructured DataLesson 20Primary Data Sources: Databases and Data WarehousesLesson 38What is Central Tendency?
Data Collection and Acquisition
(which you learned earlier), you'll encounter both types.
Lesson 16Structured vs Unstructured DataLesson 20Primary Data Sources: Databases and Data Warehouses
Data Completeness
The percentage of expected records that successfully arrived.
Lesson 1856Key Metrics to Monitor
Data consistency risk
Aggregates can become stale or incorrect if updates fail
Lesson 1073Storing Computed Values and Aggregates
Data corruption during export
Special characters lost when saving to formats that don't support Unicode.
Lesson 1139Dealing with Special Characters and Unicode
Data debt
Undocumented preprocessing steps that become "tribal knowledge"
Lesson 2131What is Technical Debt in Data Science?
Data decays quickly
Real-time bidding for ads loses value after milliseconds
Lesson 1788Streaming Data and Real-Time Requirements
Data drift
New patterns emerge that your model hasn't seen
Lesson 15Deployment, Monitoring, and Iteration
Data exploration
Understanding the range of values in a column
Lesson 873Understanding DISTINCT: Removing Duplicate Rows
Data Freshness
The time lag between when data is generated and when it's available for use.
Lesson 1856Key Metrics to Monitor
Data Freshness SLO
"Dashboard data will be no more than 4 hours old during business hours"
Lesson 1860SLA and SLO Definitions
Data isolation
Keep `data/` and `outputs/` in `.
Lesson 2032Organizing Repository Structure for Data Science
Data lineage
is the documented history of data from its original source through every transformation, merge, filter, and calculation until it reaches its final form in a report, model, or dashboard.
Lesson 1159What is Data Lineage?Lesson 1875Data Lineage and ProvenanceLesson 1908Data Subject Access Requests (DSARs)
Data locality
means running your computation where the data already lives.
Lesson 1772Data Locality and Network Bottlenecks
Data minimization
Collect only what you actually need (goodbye "vacuum up everything" strategies)
Lesson 1904What is GDPR and Why It MattersLesson 1905Core Principles of GDPR
Data parallelism
is like having five chefs all chopping vegetables using the same technique.
Lesson 1769Task Parallelism and Work Distribution
Data pipeline issues
Some users' data might not be logged correctly
Lesson 1524Sample Ratio Mismatch (SRM)
Data pipeline maintenance
Sources change schemas, APIs deprecate, databases get restructured
Lesson 1979Maintenance and Sustainability Considerations
Data points overlap
Points in front hide those behind, potentially concealing important patterns
Lesson 1329Effective Use and Pitfalls of 3D Visualizations
Data Poisoning
Adversaries might deliberately feed corrupted data into your pipeline to manipulate model outputs (e.
Lesson 1920Anticipating Misuse of Data Products
Data Protection Impact Assessment
"
Lesson 1931When to Push Back on Requests
Data provenance
is the documented history of your data: where it came from, who collected it, when it was gathered, and every transformation it underwent.
Lesson 23Data Provenance and MetadataLesson 26Reproducibility vs. ReplicabilityLesson 1875Data Lineage and Provenance
Data quality
"What's the earliest date in this dataset?
Lesson 885MIN and MAX: Finding Extremes
Data quality checks
Comparing `COUNT(*)` vs `COUNT(DISTINCT column)` reveals how many duplicates exist
Lesson 887Aggregates with DISTINCT
Data quality drift
Your model expects feature X to fall between 0 and 100, but upstream changes produce values from 0 to 1000.
Lesson 2136Monitoring Gaps and Silent Failures
Data quality validation
enforces business rules:
Lesson 1826Data Validation and Schema Enforcement
data reconciliation
finding discrepancies between two datasets, like customers in your CRM but not in your billing system, or vice versa.
Lesson 938Symmetric Difference PatternLesson 941Use Cases: Data Reconciliation
Data requirements
Ensure sufficient sample size in each age group
Lesson 1204From Hypothesis to Analysis Plan
Data rows
Each subsequent line represents one record
Lesson 1125CSV Files: Structure and Common Issues
Data Science Lifecycle
, lessons 9-15) are so important.
Lesson 26Reproducibility vs. Replicability
Data science problem
"Build a binary classification model to predict 30-day churn probability, achieving minimum 80% recall to catch potential churners, using historical customer behavior data from the past 2 years.
Lesson 2085Stage 1: Problem Definition and Scoping
Data science specifics
Check for proper handling of missing data, appropriate train-test splits, reproducibility (random seeds), and whether assumptions of statistical methods are met.
Lesson 2024Code Review Best Practices
Data Scientist
builds a predictive model to forecast next quarter's sales and recommends which products to promote
Lesson 4Data Science vs Data Analytics vs Business Intelligence
Data Scientists
focus on *predicting and prescribing*.
Lesson 2138Data Analyst vs Data Scientist vs ML Engineer
Data shows
Customer purchases spike on Tuesdays
Lesson 1201Domain Knowledge as a Hypothesis Source
Data splitting
`train_test_split` shuffles differently each run
Lesson 2055Why Randomness Matters in Data Science
Data surprises
Real-world data is messy.
Lesson 2109Why Data Science is Inherently Iterative
Data to visualize
Your x and y values
Lesson 1257Creating Your First Plot
Data type matching
`int64`, `float64`, `object`, `datetime64` match expectations
Lesson 1151Schema Validation
Data types
Is `age` an integer, not a string?
Lesson 1151Schema Validation
Data updates frequently
and users want current information
Lesson 1330Introduction to Interactive Dashboards
Data Version Control (DVC)
extends Git to handle large data files.
Lesson 2066Version Control for Data Files
Data versioning issues
The dataset changes, but nobody tracks which version was used (Data Provenance)
Lesson 30The Reproducibility Crisis and Solutions
Data Volume
The number of records processed per run.
Lesson 1856Key Metrics to Monitor
data warehouse
, on the other hand, is like a massive library that collects copies of information from many different filing cabinets across an entire organization.
Lesson 20Primary Data Sources: Databases and Data WarehousesLesson 1807Data Warehouse vs Database: Architecture and Purpose
Data-driven/algorithmic
Uses statistical models to weight contributions based on observed patterns and incrementality testing
Lesson 1637What is Metric Attribution?
data-ink ratio
the proportion of ink (or pixels) in your chart that actually represents data versus non-data elements.
Lesson 1237Chart Junk and Data-Ink RatioLesson 1246Visual Clutter and Chartjunk
database
as a digital filing cabinet where organizations store their day-to-day information in an organized way.
Lesson 20Primary Data Sources: Databases and Data WarehousesLesson 842What is a Database?
Database Management System (DBMS)
is software that sits between you and your database files, handling all the complex operations of storing, organizing, retrieving, and managing data.
Lesson 845Database Management Systems (DBMS)
Database portability
Switch from SQLite to PostgreSQL with minimal code changes
Lesson 1117What is an ORM and Why Use It?
Database snapshots
or compressed archives
Lesson 2033Git Large File Storage (LFS) for Data Assets
Datadog
automate this process, offering dashboards that show pipeline status at a glance and trigger alerts when thresholds are breached.
Lesson 1861Monitoring Tools and Dashboards
DataFrames
organize your data into named columns—like a spreadsheet or SQL table—but distributed across a cluster.
Lesson 1778DataFrames and Spark SQL Basics
Dataset
A collection of typed objects (numbers, strings, custom objects).
Lesson 1777RDDs: Resilient Distributed Datasets Fundamentals
Datasets
over ~10 MB that change occasionally
Lesson 2033Git Large File Storage (LFS) for Data Assets
Date columns
Find the earliest and most recent dates
Lesson 885MIN and MAX: Finding Extremes
Date fields
(`order_date`, `created_at`) — most queries filter by time ranges
Lesson 1812Partitioning and Clustering Strategies
Date sequences
Start dates should precede end dates.
Lesson 1155Consistency Checks Across Fields
Date truncation
cuts off the precision beyond a certain level, effectively "rounding down" a timestamp to the beginning of that time period.
Lesson 1043Date Truncation and Rounding
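In SQL this is typically done with a database-specific function; the same "round down to the start of the period" idea can be sketched in Python with `datetime.replace`. The `truncate` helper and its supported units are assumptions for this example.

```python
from datetime import datetime

def truncate(ts, unit):
    """Round a timestamp down to the start of its month, day, or hour."""
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")

ts = datetime(2024, 3, 15, 14, 37, 52)
print(truncate(ts, "month"), truncate(ts, "day"), truncate(ts, "hour"))
```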
DATETIME
, **TIMESTAMP**: Date and time values
Lesson 846Tables, Schemas, and Data Types
Dating app data
Attractiveness and personality both lead to getting matches.
Lesson 1473Conditioning on Colliders: Selection Bias
DAU
counts unique users engaging with your platform each day, while **MAU** tracks monthly uniques.
Lesson 1631Social Media Metrics: DAU/MAU and Content Engagement
Day 3
1 event occurs, 5 customers at risk
Lesson 811Computing the Kaplan-Meier Product
Day 5
1 event occurs, 4 still at risk
Lesson 811Computing the Kaplan-Meier Product
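The Day 3 and Day 5 steps above can be chained into the Kaplan-Meier product: at each event time, multiply the running survival estimate by (1 − events / at risk). The function name and tuple layout are illustrative.

```python
def kaplan_meier(steps):
    """steps: iterable of (day, events, at_risk). Returns [(day, survival)]."""
    survival = 1.0
    curve = []
    for day, events, at_risk in steps:
        survival *= 1 - events / at_risk
        curve.append((day, survival))
    return curve

# Day 3: 1 event among 5 at risk; Day 5: 1 event among 4 at risk.
curve = kaplan_meier([(3, 1, 5), (5, 1, 4)])
print(curve)
```

The estimate drops to 4/5 after Day 3 and to 4/5 × 3/4 = 3/5 after Day 5.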
Day-1 (D1)
Did users find immediate value?
Lesson 1657Day-1, Day-7, Day-30 Benchmarks
Day-7 (D7)
Are users forming a habit?
Lesson 1657Day-1, Day-7, Day-30 Benchmarks
Days or weeks saved
in fast-moving business environments
Lesson 1515Trade-offs: Sample Size, Speed, and Complexity
dbt
(transforms + documentation), and **OpenLineage** (open standard) embed lineage capture directly into your code.
Lesson 1164Tools for Lineage TrackingLesson 1821Hybrid Approaches and Modern Data Stacks
Dead Letter Queue
is a separate storage location where permanently failed tasks or messages are routed after exhausting all retry attempts.
Lesson 1852Dead Letter Queues
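A minimal in-memory sketch of the pattern: retry each record a fixed number of times, then route it to a dead letter list instead of failing the whole run. Real pipelines would use a durable queue or table; `process_with_dlq`, `handler`, and `max_attempts` are hypothetical names.

```python
def process_with_dlq(records, handler, max_attempts=3):
    """Apply handler to each record; collect permanent failures separately."""
    results, dead_letters = [], []
    for record in records:
        last_error = None
        for _ in range(max_attempts):
            try:
                results.append(handler(record))
                break
            except Exception as exc:  # broad on purpose for this sketch
                last_error = exc
        else:
            # All attempts failed: keep the record and the error for later review.
            dead_letters.append(
                {"record": record, "error": repr(last_error), "attempts": max_attempts}
            )
    return results, dead_letters

results, dead_letters = process_with_dlq([1, 0, 5], lambda x: 10 // x)
print(results, dead_letters)
```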
Dead Letter Queues
Verify that permanently failed records actually land in your dead letter queue for later investigation.
Lesson 1854Testing Error Handling
DEBUG
Detailed diagnostics (variable values, loop iterations)
Lesson 1857Logging Best Practices
Decide
If the proposal is higher, always go there.
Lesson 1590The Metropolis-Hastings Algorithm
Deciles
(10 groups): Divide data into tenths.
Lesson 57Quantiles: Quartiles, Deciles, and Beyond
DECIMAL
Numbers with decimals (e.
Lesson 846Tables, Schemas, and Data Types
Decision
We either reject H₀ (evidence is convincing) or fail to reject H₀ (insufficient evidence)
Lesson 312Hypothesis Testing as a Legal Analogy
Decision hesitation
(need to consult others, compare options)
Lesson 1681Time-Based Funnel Analysis
Decision points
"If we can explain 60% of the variance, we can make the call"
Lesson 2117Defining 'Good Enough' with Stakeholders
Decision-focused
Compute expected gains or losses for business decisions
Lesson 1570Comparing Two Means: Bayesian Approach
Decisions Made
Document any filtering or transformation decisions.
Lesson 1180Documenting Univariate Findings
Declining Session Frequency
A user who visited daily but now appears weekly is sending a signal.
Lesson 1700Leading Indicators of Disengagement
Decomposing Seasonality
concept you've learned, the technique involves:
Lesson 1408Handling Multiple Seasonal Periods
Decreased Depth of Activity
Fewer actions per session, less content consumed, or shallow navigation compared to their historical baseline.
Lesson 1700Leading Indicators of Disengagement
Decreasing adjusted R-squared
when you add a variable means that variable adds more noise than signal—it's not worth including
Lesson 614Interpreting Adjusted R-Squared Values
Dedicate regular time blocks
for learning—even 30 minutes daily beats sporadic weekend marathons.
Lesson 2143Continuous Learning and Skill Development
Default position
The defendant is innocent (H₀)
Lesson 312Hypothesis Testing as a Legal Analogy
Defect Rate
quantifies quality problems, often measured in defects per million opportunities (DPMO) in Six Sigma environments.
Lesson 1636Manufacturing Metrics: OEE, Yield, and Cycle Time
Defense in depth
means building multiple independent security layers so that if one fails, others still protect you.
Lesson 1109Input Validation and Defense in Depth
Defensive Deletes
If your pipeline needs to "delete and reload," make deletions specific.
Lesson 1848Designing Idempotent Operations
Define active users
Who counts as "engaged"?
Lesson 1693Defining User Engagement
Define an error metric
(like Mean Squared Error or Mean Absolute Error)
Lesson 772Holt-Winters Parameter Optimization
Define constraints
minimum spend per channel (contractual obligations), maximum spend (capacity limits), total budget
Lesson 1742Budget Optimization Using MMM
Define flexible step completion
– count a step as completed the first time it occurs, regardless of order
Lesson 1683Multi-Path and Non-Linear Funnels
Define targets
for LTV:CAC ratio (typically 3:1 minimum)
Lesson 1759Optimizing ROAS, CAC, and Payback Together
Define terms once
When you must use jargon, explain it immediately
Lesson 1967Writing Clear and Concise Analysis Sections
Define the event
Product failure or malfunction
Lesson 837Product Warranty and Failure Analysis
Define your population
clearly (e.
Lesson 234Simple Random Sampling
Defined
using intuitive methods (e.
Lesson 1868Great Expectations Framework
Defining success criteria
What does "good enough" look like?
Lesson 2085Stage 1: Problem Definition and Scoping
Definition drift
Changing what "active" means breaks trend comparisons
Lesson 1694Daily Active Users (DAU) and Monthly Active Users (MAU)
Deflating r
Conversely, an outlier that doesn't follow the general pattern (like an extremely tall person who weighs very little due to illness) can weaken an otherwise strong correlation by pulling the line away from the main cluster.
Lesson 481Outliers and Their Impact on r
Degrading trends
Newer cohorts drop off faster.
Lesson 1650Comparing Cohorts Over Time
Degree centrality
How many connections a node has (the "popular" nodes)
Lesson 1320Network Metrics and Visual Analysis
degrees of freedom (df)
, typically *n - 1* for a mean (where *n* is sample size).
Lesson 268Critical Values and the t-DistributionLesson 270Degrees of Freedom in t-Intervals
Delayed feedback
Subscription renewals happen after 12 months
Lesson 1517Surrogate Metrics: When Direct Measurement is Impractical
Delayed response
You discover issues after substantial damage occurs
Lesson 1617The Danger of Lagging-Only MetricsLesson 1739Adstock and Carryover Effects
DELETE
With cascading rules, the database must find and handle all child records
Lesson 1060Trade-offs: Performance vs Integrity
DELETE protection
You cannot delete a parent record if children still reference it (unless you specify cascading behavior)
Lesson 1052Foreign Key Constraints
Delete ruthlessly
If code isn't in production and hasn't been touched in months, remove it.
Lesson 2135Dead Experimental Code and Feature Sprawl
Delimiter
The character separating values (usually a comma, but sometimes tabs, semicolons, or pipes)
Lesson 1125CSV Files: Structure and Common Issues
Deliver faster
You can answer the question in hours instead of weeks
Lesson 2110The Minimum Viable Analysis (MVA)
Deliver incrementally
Instead of one massive final report, provide preliminary findings early.
Lesson 2099Aligning with Business Timelines and Decision Points
Demand mechanism
Can you explain *how* improving this metric causes the outcome?
Lesson 1615Correlation Without Causation
Demographic parity
Do groups receive positive outcomes at similar rates?
Lesson 1884Detecting Bias in Your Data
Demographic Parity (Statistical Parity)
Lesson 1887Defining Fairness in Data Science
Demographic statistics
Regions with different population counts
Lesson 43Weighted Mean and Its Applications
denominator
(SE) scales that difference by how much variability you'd expect due to random sampling.
Lesson 353Calculating the t-StatisticLesson 478The Formula for Pearson's r
DENSE_RANK()
produces: 1, 2, 2, 3 (no gap!)
Lesson 1009DENSE_RANK(): Ranking Without Gaps
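A plain-Python sketch of the ranking behaviour described above, set against a gap-leaving rank for contrast. Both helper functions are illustrative stand-ins, not SQL, and they rank in ascending order by assumption.

```python
def dense_rank(values):
    """Ties share a rank and the next distinct value gets the next integer."""
    rank_of = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [rank_of[v] for v in values]


def rank_with_gaps(values):
    """Ties share a rank, but the following rank skips ahead."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


scores = [10, 20, 20, 30]
dense = dense_rank(scores)
gapped = rank_with_gaps(scores)
print(dense, gapped)
```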
Department B (less competitive)
Lesson 1428The Simpson's Paradox Example
Department choice
is the confounder.
Lesson 1428The Simpson's Paradox Example
Dependency hell
occurs when your project requires specific package versions, but those requirements conflict with each other or with what's installed in different environments.
Lesson 2048The Dependency Hell Problem
Dependency isolation
is critical in data science because:
Lesson 2039Virtual Environments: Concept and Benefits
Dependent (paired) samples
have a natural one-to-one correspondence between observations.
Lesson 360Independent vs. Dependent Samples
Dependent samples
require a **paired t-test**, which analyzes the *differences* within each pair, effectively reducing the problem to a one-sample test on those differences
Lesson 360Independent vs. Dependent Samples
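To make the "one-sample test on the differences" idea concrete, here is a minimal sketch that computes the paired t-statistic by hand. The before/after scores are invented for illustration, and the function only returns the statistic and degrees of freedom; it does not look up a p-value.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """t = mean(d) / (sd(d) / sqrt(n)), where d are the within-pair differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical scores for the same five people measured twice.
before = [72, 65, 80, 58, 90]
after = [75, 70, 82, 60, 93]
t_stat, df = paired_t_statistic(before, after)
print(round(t_stat, 3), df)
```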
Dependent variable
your time series values
Lesson 738Linear Detrending
Deployment constraints
(pure Python libraries install easier in restricted environments)
Lesson 1087Database Drivers and Connection Libraries
Deployment instructions
specifying environment requirements
Lesson 2091Stage 7: Communication and Handoff
Deployment lag
By the time you retrain and deploy, the distribution may have shifted again
Lesson 2128Data Distribution Shifts Frequently
Depth
means creating many hierarchical levels.
Lesson 1623Depth vs Breadth in Metric Trees
Descendants
of both pathways (e.
Lesson 1432Colliders and Bad Controls
Descriptive Statistics
Calculate summary metrics for each segment—average purchase frequency, mean customer lifetime value, median recency, typical basket size.
Lesson 1709Segment Profiling and Interpretation
Desired significance level (α)
Typically 0.
Lesson 505Sample Size and Power for Correlation Tests
Destination loaders
Write processed data to warehouses, lakes, or operational systems
Lesson 1822What is a Data Pipeline?
Detect interactions
How one variable's effect depends on another
Lesson 1190Introduction to Multivariate Analysis
Detecting multiple anomalies simultaneously
without the masking effect where one outlier hides another
Lesson 1405What is Seasonal Hybrid ESD?
Detection delay
measures the time lag between when the change actually occurs and when your algorithm flags it.
Lesson 1418Evaluating Change-Point Detection Methods
Detection lag
grows exponentially—the longer it takes to notice, the more data and decisions are affected
Lesson 2136Monitoring Gaps and Silent Failures
Determine proportions
Calculate what percentage of the total population each stratum represents
Lesson 236Stratified Sampling
Deterministic with ORDER BY
The same query produces the same numbering
Lesson 1007ROW_NUMBER(): Assigning Unique Row Numbers
Detrend the Series
Subtract the trend from the original data: `Y - T`.
Lesson 744Classical Decomposition Methods
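The subtraction `Y - T` can be sketched as follows. The trend here is estimated with a centered moving average over an odd window, which is one common choice and an assumption of this example; the series is made up.

```python
def centered_moving_average(series, window):
    """Trend estimate T: mean of `window` points centered on each position.

    Assumes an odd window; positions without a full window are left as None.
    """
    half = window // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        trend[i] = sum(series[i - half : i + half + 1]) / window
    return trend


def detrend(series, trend):
    """Return Y - T wherever the trend is defined."""
    return [None if t is None else y - t for y, t in zip(series, trend)]


y = [10, 12, 14, 13, 15, 17, 16]
t = centered_moving_average(y, window=3)
detrended = detrend(y, t)
print(detrended)
```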
Deuteranopia/Deuteranomaly
(green-weak): difficulty distinguishing red from green
Lesson 1248Color Blindness and Color Palette Design
Development branch (`develop`)
The integration point for completed experiments that passed initial validation.
Lesson 2035Branching Strategies for Experiments
Deviance
is the generalization of RSS for all GLMs.
Lesson 697Deviance: A Measure of Model Fit
Deviation from mean
`value - AVG(value) OVER (PARTITION BY category)`
Lesson 1019Comparing Values to Window Aggregates
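The window expression above keeps every row and attaches the group average to it. A plain-Python sketch of the same calculation, using hypothetical (category, value) rows:

```python
from collections import defaultdict


def deviation_from_group_mean(rows):
    """For each (category, value) row, return value minus its category mean.

    Like a window function, the output has one entry per input row.
    """
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    return [(c, v, v - means[c]) for c, v in rows]


rows = [("east", 100), ("east", 140), ("west", 50), ("west", 70), ("west", 90)]
result = deviation_from_group_mean(rows)
for row in result:
    print(row)
```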
Device
Each phone/computer gets assigned
Lesson 1481Unit of Randomization
DFFITS
(pronounced "dee-fits") zooms in on a more specific question: *"How much does the predicted value for observation i change when we remove observation i from the dataset?"*
Lesson 576DFFITS: Influence on Fitted ValuesLesson 589Deciding Whether to Remove Outliers
Diagnostic outcomes
A test result cannot be both "positive" and "negative" at the same time
Lesson 81Mutually Exclusive Events
Diagonal patterns
Gradual retention decline is normal; sudden cliff-drops warrant investigation
Lesson 1649Visualizing Cohort Data with Heatmaps
DiD estimate
Treatment effect = Treatment change - Control change
Lesson 1452The Difference-in-Differences Setup
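A worked instance of the formula above, with invented before/after averages for the two groups:

```python
def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences: treatment change minus control change."""
    treatment_change = treat_post - treat_pre
    control_change = control_post - control_pre
    return treatment_change - control_change


# Hypothetical averages: treatment rose by 30, control rose by 15.
effect = did_estimate(treat_pre=100, treat_post=130, control_pre=90, control_post=105)
print(effect)
```

The control group's change stands in for what would have happened to the treated group anyway, so only the remainder is attributed to the treatment.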
Difference from average
`sale_amount - AVG(sale_amount) OVER (PARTITION BY region)`
Lesson 1019Comparing Values to Window Aggregates
Differences
between two sample means (comparing groups)
Lesson 225CLT for Sums and Other Statistics
Differences are subtle
small variations must be detectable
Lesson 1233Position as the Most Effective Channel
Differences in distribution shape
between groups, not just center/spread
Lesson 1286Violin Plots and Distribution Shape
differential privacy
(which adds noise to protect individuals), MPC provides exact computation with zero information leakage about inputs—assuming parties don't collude.
Lesson 1903Secure Multi-Party ComputationLesson 1911GDPR Compliance for Data Scientists
Difficult to change
Update the same value in multiple places
Lesson 2072Configuration Files vs Hard-Coded Values
Difficulty interpreting effects
You can't confidently say "holding all else constant, X increases Y by.
Lesson 580What is Multicollinearity?
dimension tables
(containing descriptive attributes like customer names, product details, or dates).
Lesson 956Star Schema JoinsLesson 1808Star Schema and Fact TablesLesson 1809Dimension Tables and Slowly Changing Dimensions
Dimensionality reduction
solves this by mathematically projecting your high-dimensional data into a lower-dimensional space while preserving as much important structure as possible.
Lesson 1196Dimensionality Reduction for Visualization
Dimensions matter
Journals often require specific figure sizes (e.
Lesson 1369Publication-Ready Plot Styling
diminishing returns
, and we capture it mathematically with **saturation curves**.
Lesson 1740Saturation Curves and Diminishing ReturnsLesson 2116Diminishing Returns and the 80/20 Rule
direct
relationship between two variables, you must "control for" or "hold constant" potential confounders.
Lesson 509Confounding Variables and ControlLesson 1712Common Channel Categories
Direct attention
Add a single annotation or callout box pointing to what matters: "Sales dropped 30% here."
Lesson 1958Simplifying Visual Complexity
Direct download
Obtaining existing datasets or files
Lesson 11Data Collection and Acquisition
Direct identifier removal
means stripping obvious PII like names, social security numbers, email addresses, and phone numbers.
Lesson 1895Data Anonymization Basics
Direct interpretation
Roughly tells you the "typical distance" values fall from the mean
Lesson 49Standard Deviation: Interpretable Spread
Direct probability statements
"There's a 92% chance treatment A has a higher mean than control"
Lesson 1570Comparing Two Means: Bayesian Approach
Direct traffic
(users typing your URL directly)
Lesson 1711What Are Acquisition Channels?
Directed
Tasks flow in specific directions (task A → task B)
Lesson 1833Introduction to Apache Airflow
Directed Acyclic Graph
is a mathematical structure where nodes (tasks) are connected by directed edges (dependencies) with one ironclad rule: **no cycles allowed**.
Lesson 1842Directed Acyclic Graphs (DAGs)
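The "no cycles" rule is what makes a valid execution order possible. A small sketch using the standard library's `graphlib` (Python 3.9+); the task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """dependencies maps each task to the set of tasks it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: extract -> transform -> load.
pipeline = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
order = execution_order(pipeline)
print(order)

# Two tasks that depend on each other form a cycle, so no order exists.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```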
Directed Acyclic Graph (DAG)
is a visual diagram where:
Lesson 1468Introduction to Directed Acyclic Graphs (DAGs)
Directed edges
(arrows) represent causal relationships pointing from cause to effect
Lesson 1468Introduction to Directed Acyclic Graphs (DAGs)
Directed graphs
show asymmetrical relationships.
Lesson 1316Introduction to Network Graphs and Graph Theory Basics
Direction
Do points trend upward (positive) or downward (negative)?
Lesson 480Scatterplots and Visual AssessmentLesson 2122When Uncertainty Is Acceptable
Directional (one-tailed)
"The new landing page will *increase* sign-ups by 5%"
Lesson 1479Formulating Hypotheses
Directional alignment
When the surrogate goes up, the business metric should too (not down!)
Lesson 1518The Relationship Between Surrogate and Business Metrics
Dirty reads
Reading uncommitted data that gets rolled back
Lesson 1116Transaction Isolation and Concurrency
Disagreement requires judgment
If tests conflict, lean on your visual evidence and domain knowledge.
Lesson 718Interpreting Stationarity Test Results
Disclose conflicts
openly to stakeholders
Lesson 35Conflicts of Interest and Independence
Disclose failed model iterations
and why they didn't work
Lesson 1929Avoiding Cherry-Picking Results
Discovery-driven iteration
Your analysis reveals unexpected patterns, missing data, or invalid assumptions.
Lesson 2092Iteration and Feedback Loops in Practice
Discrete (apples)
You have exactly 5 apples or 6 apples.
Lesson 18Numerical Variables: Discrete and Continuous
Discrete data
assigning categories to distinct colors or positions
Lesson 1344Scales and Coordinate Systems
Discrete numerical data
consists of whole numbers that represent *counts* of distinct items.
Lesson 18Numerical Variables: Discrete and Continuous
Discrete or categorical variables
(e.
Lesson 1446Exact Matching
Discriminatory Application
Your fair hiring model might be selectively applied only to certain demographics while others bypass it entirely.
Lesson 1920Anticipating Misuse of Data Products
Discussion is centralized
Questions, explanations, and decisions live alongside the code
Lesson 2022Understanding Pull Requests
Display issues
Characters appearing as , boxes, or garbled text—usually an encoding mismatch.
Lesson 1139Dealing with Special Characters and Unicode
Display outputs inline
Charts, tables, and statistical results appear right below the code that generated them
Lesson 1982Literate Programming with Notebooks
Distance from max
`MAX(value) OVER (PARTITION BY category) - value`
Lesson 1019Comparing Values to Window Aggregates
Distinguish stakeholder types
The person requesting may not be the end user.
Lesson 2102Understanding Stakeholder Goals and Constraints
Distracts
your audience's attention from the actual data
Lesson 1963Removing Chartjunk
Distributed
Your data is partitioned across multiple machines.
Lesson 1777RDDs: Resilient Distributed Datasets Fundamentals
Distributing heterogeneous jobs
to available workers
Lesson 1769Task Parallelism and Work Distribution
Distribution Characteristics
Note shape (skewed, bimodal), spread, and central tendency.
Lesson 1180Documenting Univariate Findings
Distribution Checks
Compare your data's distribution against historical baselines.
Lesson 1157Statistical Anomaly Detection in QA
Distribution comparisons
ensuring all histograms use the same bin ranges
Lesson 1276Sharing Axes Between Subplots
Distribution plots
Show how data is spread (histograms, KDE plots)
Lesson 1281Introduction to Seaborn's Statistical Plots
Distribution shape
describes the overall form or silhouette of your data when visualized—whether values cluster symmetrically in the middle, bunch up on one side, or spread out evenly.
Lesson 63Understanding Distribution ShapeLesson 1267Histograms and Distribution Plots
Diversification
Relying on a single channel is risky; tracking reveals over-dependence
Lesson 1711What Are Acquisition Channels?Lesson 1716Channel Mix and Portfolio Thinking
Do missingness patterns correlate
with other variables?
Lesson 1207Missing Data Assessment and Strategy
Docker container
runs a lightweight, isolated instance of a complete operating system environment.
Lesson 2045Docker for Complete Environment Reproducibility
Document active experiments
Maintain a simple tracking file listing which experiments are ongoing, which succeeded, and which are archived.
Lesson 2135Dead Experimental Code and Feature Sprawl
Document and mitigate
Create threat models; build monitoring, rate limits, access controls, or kill switches
Lesson 1924Red Team Thinking for Data Scientists
Document changes
so teams understand why the structure evolved.
Lesson 1626Maintaining and Evolving Metric Trees
Document data sources
In your README, specify where data lives and how to access it
Lesson 2070Separating Data from Code
Document original purpose
explicitly in your consent forms and data governance policies
Lesson 1915Secondary Use and Scope Creep
Document sensitivity
Report when conclusions are stable or when they depend on prior choice
Lesson 1572Sensitivity Analysis and Prior Robustness
Document trade-offs
Where do optimizations create tension?
Lesson 1625Cross-Functional Metric Dependencies
Document your assumptions
(why did you draw each arrow?)
Lesson 1469Building a Simple Causal DAG
Document your methods
before seeing results (prevents post-hoc justification)
Lesson 35Conflicts of Interest and Independence
Documentation debt
Skipping README updates or data dictionaries
Lesson 2131What is Technical Debt in Data Science?
Documentation Licenses
(README, tutorials, papers):
Lesson 2082Choosing a License for Data Science Projects
Documentation Standards
, and **Ethical Principles**.
Lesson 33Transparency and Explainability
Domain context
is the background knowledge about the field you're analyzing: its terminology, business processes, constraints, typical patterns, and unwritten rules.
Lesson 1168Understanding Domain Context
Domain knowledge suggests
"Our email campaign goes out Monday evenings—let's check if opens predict Tuesday purchases"
Lesson 1201Domain Knowledge as a Hypothesis Source
Domain rules
Values outside physically possible ranges (negative age, 500% growth rate)
Lesson 1209Outlier Detection and Investigation
Domain validity
Your model might fit your training data beautifully (high R-squared) but make nonsensical predictions outside the observed range.
Lesson 537When R-Squared is Not Enough
Domain-specific rules
when you have expert knowledge about what constitutes "normal"
Lesson 1411Applications and Limitations
Don't
This is still vulnerable to SQL injection if the list contains user input.
Lesson 1108Handling IN Clauses Safely
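One way to avoid the problem, sketched with `sqlite3`: build only the placeholder list as a string and pass the values as parameters. The `users` table and the `fetch_by_ids` helper are hypothetical, and the `?` placeholder style is specific to drivers that use it.

```python
import sqlite3


def fetch_by_ids(conn, ids):
    """Query with one `?` placeholder per value; values never enter the SQL text."""
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    query = f"SELECT id, name FROM users WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(query, list(ids)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "ben"), (3, "cy")])

rows = fetch_by_ids(conn, [1, 3])
# The injection attempt is treated as a single literal value and matches nothing.
malicious = fetch_by_ids(conn, ["1 OR 1=1"])
print(rows, malicious)
```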
Don't skip the diagnostics
Check histograms, Q-Q plots, and variance equality tests *before* running your test.
Lesson 368Common Pitfalls and Best Practices
Don't use SUM for
Counting rows (use `COUNT`), averaging (use `AVG`), or non-numeric data (it only works with numbers).
Lesson 883SUM: Calculating Totals
Double funnel
Variance is small in the middle but large at both extremes
Lesson 559Detecting Heteroscedasticity (Non-Constant Variance)
Double-counting in partitions
When using the law of total probability, make sure your conditioning events are mutually exclusive and collectively exhaustive—no overlap, no gaps.
Lesson 100Common Conditional Probability Mistakes
Doubling your sample size
doesn't cut the standard error in half; it reduces it by a factor of √2 ≈ 1.41.
Lesson 223Standard Error and the CLT
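The square-root relationship can be checked numerically. This sketch assumes the usual formula SE = s / √n, with an arbitrary standard deviation of 10.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


se_100 = standard_error(sd=10, n=100)
se_200 = standard_error(sd=10, n=200)  # doubled sample size
se_400 = standard_error(sd=10, n=400)  # quadrupled sample size

ratio = se_100 / se_200
print(ratio, se_100 / se_400)
```

Halving the standard error takes four times the sample, not twice.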
Download buttons
for saving charts as static images
Lesson 1300Creating Basic Interactive Charts with Plotly Express
Downside
Wastes compute and time on unchanged data.
Lesson 1828Incremental vs Full Load Strategies
Downstream
`train_model` and `generate_report` (direct and transitive)
Lesson 1841Upstream and Downstream Dependencies
Downstream dependencies
are the tasks that rely on *your* task's output.
Lesson 1841Upstream and Downstream Dependencies
Downward (negative) trend
Values generally decrease (e.
Lesson 706Trend: Long-Term Direction
Draft pull requests
are a special PR state that signals "this is work-in-progress: feedback welcome, but don't merge yet."
Lesson 2029Draft Pull Requests and WIP Workflows
Draw a random sample
of size n from that population
Lesson 222Visualizing the CLT with Simulations
Draw arrows
from causes to effects
Lesson 1469Building a Simple Causal DAG
Drop non-significant predictors
if they don't contribute beyond noise.
Lesson 703Sequential Model Building Strategy
Dry runs
Execute the DAG structure logic (declaring dependencies, checking conditions) without running the actual data processing.
Lesson 1846Testing and Validating Dependency Graphs
Dual use
refers to technology, methods, or data that can be applied for both beneficial and harmful purposes.
Lesson 1919Defining Dual Use in Data ScienceLesson 1920Anticipating Misuse of Data ProductsLesson 1931When to Push Back on Requests
Dummy variable encoding
creates separate binary (0/1) columns for each category.
Lesson 635Dummy Variable Encoding Basics
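A minimal sketch of the encoding in plain Python. The `is_` column prefix is an illustrative naming choice, and this version keeps a column for every category; regression workflows commonly drop one as the reference level.

```python
def dummy_encode(values):
    """Turn a list of category labels into one 0/1 column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


colors = ["red", "green", "red", "blue"]
encoded = dummy_encode(colors)
for row in encoded:
    print(row)
```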
Dunn's test
follows Kruskal-Wallis to identify which specific group pairs are significantly different.
Lesson 473Post-Hoc Tests After Kruskal-Wallis: Dunn's Test
Dunnett's test
is specialized for situations where you have one control or reference group and several experimental treatments.
Lesson 460Dunnett's Test for Control Comparisons
Duplicate rows
from the join can inflate your aggregates
Lesson 933Aggregating with LEFT JOINs
Durability
Once committed, changes persist even if the system crashes
Lesson 1110What Are Database Transactions?
Duration
How long the test will run
Lesson 1485Documentation and Pre-Registration
During deep dives
(testing relationships, checking distributions): exploration
Lesson 1216Choosing the Right Purpose
During off-peak hours
(to avoid impacting production systems)
Lesson 1831What is Job Scheduling?
During Training
Models need an objective function: a mathematical definition of "better."
Lesson 2130No Clear Success Metric or Feedback Loop
During Validation
Even if you train something, how do you know it works?
Lesson 2130No Clear Success Metric or Feedback Loop
Dynamic dependencies
let your pipeline decide its own structure while running.
Lesson 1844Dynamic Dependencies
Dynamic filtering
Filter based on calculated values (like averages, maximums) rather than hardcoded numbers
Lesson 959Introduction to Subqueries in WHERE

E

E(Y) = b'(θ)
the first derivative of the cumulant function gives you the expected value
Lesson 667Mean and Variance in the Exponential Family
Early detection
Catch bad data before it corrupts downstream analysis
Lesson 1158Automated Validation Frameworks
Early feedback on approach
"Before I process these 50 datasets, does this transformation logic look right?
Lesson 2029Draft Pull Requests and WIP Workflows
Early in your workflow
(profiling, hypothesis generation, outlier detection): exploration
Lesson 1216Choosing the Right Purpose
Early quality detection
Comparing survival curves (using the log-rank test) between production batches, suppliers, or manufacturing plants reveals if one group has significantly higher failure rates.
Lesson 837Product Warranty and Failure Analysis
Early research
Studies showed coffee drinkers had higher rates of heart disease.
Lesson 1426Real-World Examples: Correlation vs Causation
Early-life failures
(infant mortality).
Lesson 189Fitting Weibull Models to Lifetime Data
Early-stage customer discovery
What sparks initial interest?
Lesson 1720First-Touch Attribution Model
Easier collaboration
When your team knows data will always arrive in tidy format, everyone can use the same templates, functions, and workflows without custom adaptations.
Lesson 1149Benefits of Tidy Data for Downstream Work
Easier dimension updates
Changing a category name happens in one place
Lesson 1810Snowflake Schema and Normalization Trade-offs
Easier Maintenance
Smaller, focused tables are simpler to understand, query, and modify than massive, repetitive tables.
Lesson 1061Introduction to Normalization
Easiest wins
Where do top-performing segments reveal best practices?
Lesson 1685Actionable Insights from Funnel Analysis
Easy to interpret
it's in the same units as your data.
Lesson 801Forecast Evaluation Metrics
Economic business cycles
are the classic example.
Lesson 708Cyclical Patterns: Non-Fixed Fluctuations
Economic growth
might correlate with **increased coffee consumption**, not because coffee drives the economy, but because both rise with population and urbanization.
Lesson 1422Spurious Correlations
Edge Color/Style
Differentiate relationship types with color or dashed/solid lines (friend vs.
Lesson 1319Styling Network Visualizations
Edge Weight
Make thicker lines represent stronger relationships (more messages, higher correlation).
Lesson 1319Styling Network Visualizations
Edit the section
to reflect the desired final state
Lesson 2011Resolving Merge Conflicts
Education
Does the average test score in your classroom differ from the district standard of 75?
Lesson 351When to Use a One-Sample t-Test
Education and income
Does higher education lead to higher income, or do wealthier families afford better education?
Lesson 1424Reverse Causality
Educational records
reflecting unequal access to opportunities
Lesson 1881Historical and Societal Bias
Effect
Compresses large values more than small ones, pulling in the right tail of skewed distributions.
Lesson 592Common Transformations: Log, Square Root, Reciprocal
Effect Size (δ)
The minimum detectable difference you care about—determined by your MDE (Minimum Detectable Effect).
Lesson 1496The Four Parameters of Sample Size Calculation
Effective sample size (ESS)
Estimates how many independent samples you truly have after accounting for autocorrelation
Lesson 1592Burn-in, Thinning, and Convergence Diagnostics
Ego-network analysis
Model and measure the spillover explicitly
Lesson 1527Ignoring Network Effects
elbow method
on within-cluster variance or **silhouette scores** to quantify segment quality at each cut point.
Lesson 1706Hierarchical Clustering for SegmentationLesson 1708Choosing the Number of Segments
Electricity demand
might show daily (24-hour) *and* weekly (168-hour) patterns
Lesson 746Choosing Seasonal Period
Elevation
How high above (or below) the horizontal plane your camera sits, measured in degrees.
Lesson 1326Viewing Angles and Projection Types
Eliminates bottlenecks
Shared resources (like shared memory or a central database) become traffic jams as you scale.
Lesson 1771Shared-Nothing Architecture
Eliminates sign problems
Squaring makes all errors positive, so they can't cancel each other out.
Lesson 517The Least Squares Criterion
ELT flips this order
Extract, Load, *then* Transform.
Lesson 1816What is ELT? Extract, Load, Transform Explained
Email
Campaigns sent to your owned email list
Lesson 1712Common Channel Categories
Email campaigns
(newsletters, promotional sends)
Lesson 1711What Are Acquisition Channels?
Email list size
(without open or click rates)
Lesson 1612What Are Vanity Metrics?
embarrassingly parallel
(easy to split across machines), while others require extensive data shuffling or coordination.
Lesson 1786Data Processing Patterns Best Suited for SparkLesson 1790What is Dask and When to Use It
Emotional connection
means helping your audience *feel* the human stakes behind the numbers.
Lesson 1941Emotional Connection Without Manipulation
Emphasizes larger errors
A point that's 4 units away contributes 16 to the sum, while one that's 2 units away contributes only 4.
Lesson 517The Least Squares Criterion
Emphasizes smaller values
– the square root compresses large values and stretches small ones, making patterns clearer
Lesson 560Scale-Location Plot (Spread-Location Plot)
Emphasizing trends over time
or ordered categories
Lesson 1288Point Plots for Trend Visualization
Empirical Rule
is your quick mental map for normal distributions.
Lesson 171The 68-95-99.7 Rule (Empirical Rule)
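The three percentages in the rule's name can be recovered from the standard normal CDF. A short check using the standard library's `statistics.NormalDist`:

```python
from statistics import NormalDist


def share_within(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


within = {k: round(share_within(k) * 100, 1) for k in (1, 2, 3)}
print(within)
```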
Employee-manager relationships
An `employees` table where each employee has a `manager_id` pointing to another employee in the same table
Lesson 945Introduction to Self-Joins
Empty result sets
`AVG()` on zero rows returns NULL, not zero
Lesson 884AVG: Computing Averages
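This behaviour is easy to confirm against an in-memory SQLite database, where SQL NULL comes back as Python `None`. The `orders` table is hypothetical, and wrapping the aggregate in `COALESCE` is one possible way to substitute a default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")  # deliberately left empty

# AVG over zero rows yields NULL, which sqlite3 returns as None.
avg_empty = conn.execute("SELECT AVG(amount) FROM orders").fetchone()[0]

# COALESCE replaces the NULL with an explicit fallback value.
avg_or_zero = conn.execute("SELECT COALESCE(AVG(amount), 0) FROM orders").fetchone()[0]
print(avg_empty, avg_or_zero)
```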
Enable comparison
You can compare typical values across groups ("Team A averages 15 sales per week vs.
Lesson 38What is Central Tendency?
Enable step-by-step execution
Others can run each cell independently to verify your work or experiment with modifications
Lesson 1982Literate Programming with Notebooks
Enables decision-making
You can calculate probabilities, credible intervals, and expected values directly from it
Lesson 1537The Posterior Distribution
Enables parallelism
by identifying independent tasks
Lesson 1790What is Dask and When to Use It
Enclosure
Elements surrounded by a boundary are perceived as a group.
Lesson 1236Gestalt Principles in Visualization
End-to-end integration tests
Run your pipeline on sample data in a test environment to verify the execution order produces expected results.
Lesson 1846Testing and Validating Dependency Graphs
Enforces Status Checks
Automated tests, linters, or CI/CD pipelines must pass before merging.
Lesson 2027Protecting Branches and Required Reviews
Engagement Rate
= (Likes + Comments + Shares) / Impressions × 100
Lesson 1631Social Media Metrics: DAU/MAU and Content Engagement
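The formula translates directly into a small function. The counts below are invented, and raising on zero impressions is a design choice of this sketch rather than part of the definition.

```python
def engagement_rate(likes, comments, shares, impressions):
    """(Likes + Comments + Shares) / Impressions x 100, as a percentage."""
    if impressions <= 0:
        raise ValueError("impressions must be greater than zero")
    return (likes + comments + shares) / impressions * 100


rate = engagement_rate(likes=120, comments=30, shares=50, impressions=4000)
print(rate)
```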
Engagement scoring
might show that 20% of users generate 80% of value (power users)
Lesson 1701What is Customer Segmentation?
Engaging storytelling
Making trends memorable and intuitive
Lesson 1306Animation and Time-Based Transitions
Engineering Team Objective
Optimize performance
Lesson 1608Connecting North Star Metrics to OKRs
Enhancements (additional layers)
Smoothing lines, confidence bands, annotations
Lesson 1347Understanding Layers in ggplot2
Enrollments
(StudentID, CourseID, StudentName, CourseName, Grade)
Lesson 1065Second Normal Form (2NF)
Ensure data integrity
(no duplicate or corrupted records)
Lesson 842What is a Database?
Ensure immutability
Changing a primary key causes cascading headaches
Lesson 1050Choosing Effective Primary Keys
Enter
or **Space** (to activate), and **arrow keys** (for fine control).
Lesson 1253Interactive Accessibility: Keyboard Navigation
Entry
Transaction begins automatically
Lesson 1114Transaction Context Managers in Python
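A sketch of the pattern with `sqlite3`, where a connection used as a context manager commits if the block succeeds and rolls back if it raises. The `accounts` table and the simulated failure are hypothetical; in the module's default mode the transaction is opened implicitly before the `UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('ana', 100)")
conn.commit()

try:
    with conn:  # commit on success, rollback if the block raises
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'ana'")
        raise ValueError("insufficient funds")  # simulated failure mid-transaction
except ValueError:
    pass

# The failed update was rolled back, so the original balance remains.
balance = conn.execute("SELECT balance FROM accounts").fetchone()[0]
print(balance)
```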
Environment artifacts
virtual environments, cache folders
Lesson 1996The .gitignore File
Environment details
Which worker, timestamp, resource usage
Lesson 1851Error Logging and Notifications
Environment differences
Code runs differently on different machines (Code and Environment Management)
Lesson 30The Reproducibility Crisis and Solutions
Environment files
that list all software dependencies and versions
Lesson 29Code and Environment Management
Environment management
means recording *exactly* which software versions you used.
Lesson 29Code and Environment Management
Environment-driven iteration
External changes (new regulations, market shifts, updated systems) force you to revisit earlier decisions.
Lesson 2092Iteration and Feedback Loops in Practice
Environment-specific
Your local paths won't work on a colleague's machine or cloud server
Lesson 2072Configuration Files vs Hard-Coded Values
Epidemiological data
that helps track disease spread can reveal individuals' health status or movements, enabling discrimination or persecution.
Lesson 1919Defining Dual Use in Data Science
Equal information
Posterior sits roughly halfway between
Lesson 1567Posterior Mean as Weighted Average
Equal opportunity
Among qualified individuals, do groups succeed at similar rates?
Lesson 1884Detecting Bias in Your Data
Equal probability
(typically 50/50)
Lesson 1487Simple Random Assignment
Equal to minimum
`WHERE value = (SELECT MIN(value) FROM table)`
Lesson 964Subqueries with Aggregate Functions
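The pattern above can be run against an in-memory SQLite database. The `readings` table and its values are hypothetical; note that every row tied for the minimum is returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 3.5), ("b", 1.2), ("c", 1.2), ("d", 7.0)],
)

# The subquery computes the minimum once; the outer query keeps matching rows.
lowest = conn.execute(
    "SELECT sensor FROM readings "
    "WHERE value = (SELECT MIN(value) FROM readings) "
    "ORDER BY sensor"
).fetchall()
print(lowest)
```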
Equal-area
Preserves area ratios but distorts shapes
Lesson 1308Geographic Data Types and Coordinate Systems
Equal-width vs equal-frequency
Different bin strategies tell different stories
Lesson 1245Misleading Aggregations and Binning
Equality searches
(`WHERE id = 100`) jump straight to the target
Lesson 1079B-Tree Indexes: Structure and Mechanics
Erasure isn't absolute
GDPR includes exemptions when you must keep data:
Lesson 1909Right to Erasure and Data Retention Policies
Ergodicity
Long-run averages converge to expectations under the stationary distribution
Lesson 1589Markov Chains: The Foundation of MCMC
ERROR
Failures requiring attention
Lesson 1857Logging Best Practices
Error bars
attach vertical or horizontal lines to a point (often a mean) showing ±1 standard deviation, ±2 SE, or confidence intervals.
Lesson 55Visualizing SpreadLesson 1244Omitting Uncertainty and Variability
Error classification
Transient network issue or data quality problem?
Lesson 1851Error Logging and Notifications
Error handling
Logs failures, sends alerts, and implements retry logic
Lesson 1822What is a Data Pipeline?
Error measures
For numerical predictions, how far off were your guesses on average?
Lesson 14Model Evaluation and Validation
Escalation patterns
Moving from chat to phone to "speak to a manager"
Lesson 1673Leading Indicators of Churn
Establish coordination protocols
When do teams need to align before taking action?
Lesson 1625Cross-Functional Metric Dependencies
Estimate densities
`stat_density()` creates smooth distribution curves
Lesson 1352Statistical Transformations with stat_* Layers
Eta-squared
is the most straightforward effect size for ANOVA.
Lesson 445Effect Size: Eta-Squared and Omega-Squared
Ethical collection
– Was this data gathered with people's informed consent?
Lesson 36Responsible Data Sourcing and Use
Ethical violations
if you mishandle sensitive data or misrepresent uncertainty
Lesson 34Recognizing Boundaries of Competence
ETL
stands for **Extract, Transform, Load**—a traditional data integration pattern that moves data from source systems into a data warehouse or analytics platform.
Lesson 1815What is ETL? Extract, Transform, Load Explained
Etsy
Gross Merchandise Sales (GMS) — captures value for both buyers (finding unique items) and sellers (making sales).
Lesson 1606Examples of North Star Metrics by Industry
Evaluate improvement
Does deviance drop meaningfully?
Lesson 703Sequential Model Building Strategy
Evaluate trade-offs
Parametric tests have higher power when assumptions hold; non-parametric tests are safer when assumptions are questionable
Lesson 398Choosing Between Parametric and Non-Parametric Tests
Evaluates segments
For each possible segmentation (sets of change-points), calculates a cost based on how well each segment fits the data
Lesson 1416PELT Algorithm: Pruned Exact Linear Time
Event frequencies
Average 2.
Lesson 1647Building a Cohort Table
Event indicator
(1 = event occurred, 0 = censored)
Lesson 828Fitting the Cox Model
Event logs
specific actions like "clicked_ad", "opened_email", "viewed_product"
Lesson 1719The Customer Journey and Touchpoints
Events
subjects who experienced the event (death, churn, failure) at that exact time
Lesson 812Handling Event Times and CensoringLesson 1679Defining Funnel Steps and Events
Every selection is independent
picking one individual doesn't affect who else gets picked
Lesson 234Simple Random Sampling
Everything-to-Target
Always examine relationships *with* your target variable first.
Lesson 1210Relationship Exploration: Correlation and Association
Evidence is strong
Overwhelming data swamps your initial belief
Lesson 115Prior Sensitivity Analysis
Evidence is weak
A mildly positive test result won't overcome a very skeptical or very confident prior
Lesson 115Prior Sensitivity Analysis
Evolving data
Show how metrics change over time
Lesson 1327Creating Animations with FuncAnimation
EWMA
applies weighted averaging where recent observations matter more than older ones.
Lesson 1403CUSUM and EWMA Charts
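The weighting can be sketched with the usual recursion z_t = α·x_t + (1 - α)·z_{t-1}. Seeding with the first observation and the parameter name `alpha` are assumptions of this example; control charts often seed with a target mean instead.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average; larger alpha favors recent points."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [values[0]]  # assumption: start from the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


series = [10, 10, 10, 20]
smoothed = ewma(series, alpha=0.5)
print(smoothed)
```

With `alpha=0.5` the jump to 20 moves the average only halfway, which is the damping that makes small sustained shifts easier to see than single spikes.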
Exact duplicate detection
Find rows where *all* columns match exactly—these are often accidental copies from data loading errors.
Lesson 1154Uniqueness and Duplication Checks
Exact p-values
use the true, theoretical probability distribution without approximation.
Lesson 322Exact vs Asymptotic P-Values
Exact pinning
guarantees that everyone running your code uses identical package versions, maximizing reproducibility.
Lesson 2050Pinning Versions vs Flexible Ranges
Exact probabilities
P(X = k) — exactly k successes
Lesson 130Calculating Binomial Probabilities
Exactly 4 accept
Calculate P(X=4) directly with the PMF
Lesson 130Calculating Binomial Probabilities
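As a rough illustration of the two binomial entries above, the sketch below evaluates P(X = k) with the standard binomial PMF. The values n = 10 and p = 0.3 are hypothetical, chosen only for the example, and `binomial_pmf` is an illustrative helper name rather than anything from the course.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical scenario: 10 offers, each accepted independently with probability 0.3.
n, p = 10, 0.3
prob_exactly_4 = binomial_pmf(4, n, p)
print(f"P(X = 4) = {prob_exactly_4:.4f}")  # prints P(X = 4) = 0.2001
```

Summing `binomial_pmf(k, n, p)` over a range of k gives cumulative probabilities such as "at most 4" or "at least 4".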
exactly the same
as taking the Pearson correlation coefficient between X and Y and squaring it.
Lesson 534R-Squared vs Correlation SquaredLesson 647Impact on Model Results and Reporting
Exactly two outcomes
Success or failure, no middle ground
Lesson 123Bernoulli Trial Definition and Properties
Examine edge cases
Look for missing groups entirely—this reveals coverage error.
Lesson 250Strategies for Bias Detection and Mitigation
Examine the coefficient magnitude
in its real-world units
Lesson 609Practical vs Statistical Significance
Examine transformations
Review each transformation step—was a join dropping records?
Lesson 1870Root Cause Analysis for Quality Issues
Example (left-tailed)
H₀: μ = 100 vs H₁: μ < 100
Lesson 311One-Sided vs Two-Sided Alternatives
Example (right-tailed)
H₀: μ = 100 vs H₁: μ > 100
Lesson 311One-Sided vs Two-Sided Alternatives
Example 1
In educational research, improving test scores by d = 0.
Lesson 386Effect Size Interpretation Guidelines
Example 2
In pharmaceutical trials, a pain medication with d = 0.
Lesson 386Effect Size Interpretation Guidelines
Example 3
In physics experiments measuring fundamental constants, even d = 0.
Lesson 386Effect Size Interpretation Guidelines
Example analogy
A company's sales look higher in stores with fewer employees.
Lesson 1194Simpson's Paradox and ConfoundingLesson 1937The Hero's Journey: Making Your Audience the Hero
Example scenario
You want to find average order values by region, but only for orders placed in 2023.
Lesson 895Combining WHERE and GROUP BYLesson 898HAVING Clause FundamentalsLesson 1600Business Examples: Revenue vs Pipeline
Example use case
Instead of joining `orders` and `products` and summing totals every time someone checks monthly sales, create a materialized view that stores those monthly totals.
Lesson 1076Materialized Views and Summary Tables
Example Values
A few representative samples
Lesson 2064Creating Data Dictionaries
Example violation
In a customer churn study, if high-risk customers are more likely to stop using your product *and* more likely to unsubscribe from your tracking emails (causing censoring), your results will be biased.
Lesson 821Assumptions of the Log-Rank Test
Excel
adds formatting overhead and reads even slower than CSV, especially with multiple sheets.
Lesson 1133Performance Considerations Across Formats
Excess Kurtosis (Pearson's)
Sometimes the calculation stops before the final "-3" adjustment.
Lesson 67Calculating Kurtosis
Excessive borders
and boxes around every element
Lesson 1963Removing Chartjunk
Excessive grid lines
Too many or overly prominent gridlines
Lesson 1246Visual Clutter and Chartjunk
exchangeability
if the groups had swapped assignments, we'd expect the same average outcome.
Lesson 1438Ensuring Balance Between GroupsLesson 1443Observational Studies vs Randomized Experiments
Excluding multiple values
Lesson 868The NOT Operator
Excluding pattern matches
Lesson 868The NOT Operator
Executes the subquery first
and gets back multiple rows (each with one value)
Lesson 961IN Operator with Subqueries
Execution order chaos
Tasks might run simultaneously when they should be sequential, causing some to fail because required data isn't ready yet.
Lesson 1840What is Dependency Management in Pipelines?
Executive Summary
(1 page): Key findings, recommendations, business impact
Lesson 1966Report Structure and Executive Summary
Executive/Business stakeholders
Lead with directional findings and practical significance.
Lesson 1953Adjusting Statistical Depth by Audience
Executives
making strategic decisions need clean, simple charts that communicate the main point at a glance: think bar charts showing three key metrics or a single trend line.
Lesson 1954Tailoring Visualizations to Audience Needs
Executor
Determines *how* tasks run (locally, distributed, etc.)
Lesson 1833Introduction to Apache Airflow
Exercise and health
Do healthier people exercise more, or does exercise make people healthier?
Lesson 1424Reverse Causality
Existing knowledge
What do subject-matter experts already know?
Lesson 1168Understanding Domain Context
EXISTS
stops searching as soon as it finds *any* matching row.
Lesson 985EXISTS vs IN: Performance Considerations
Exit
Always happens, ensuring clean boundaries
Lesson 1114Transaction Context Managers in Python
Exogeneity
means that your predictor variable X is determined *outside* the model and is completely independent of the error term ε.
Lesson 553Exogeneity: X Must Be Independent of Errors
Expanding funnel
Residuals start tight on the left and fan out wider on the right.
Lesson 559Detecting Heteroscedasticity (Non-Constant Variance)
Expectations and Success Criteria
Document what each stakeholder considers "success."
Lesson 2101Identifying and Mapping Stakeholders
Expected
The count you would expect if the null hypothesis were true (which you calculated in the previous lesson)
Lesson 417The Chi-Squared Test Statistic Formula
Expected Impact
Prevent 20-30 account cancellations/month ($50K-75K MRR saved)
Lesson 1948The Recommendation Slide: Making It Actionable
Expected loss threshold
Stop when the expected loss of choosing variant B is below $X
Lesson 1585Early Stopping in Bayesian Tests
Expected outputs
What files or results should appear
Lesson 1989Best Practices for Sharing Reproducible Reports
Expected uniqueness violated
An ID column contains repeats
Lesson 1154Uniqueness and Duplication Checks
Experiment
with different sample sizes, population shapes, or statistics
Lesson 259Simulating Sampling DistributionsLesson 498Bradford Hill Criteria for Causation
Experiment snapshots
`exp-baseline-xgboost`, `exp-feature-engineering-v3`
Lesson 2037Tagging Releases and Experiment Snapshots
Experimental tracking nightmare
Hard to remember which parameter combinations you've tested
Lesson 2072Configuration Files vs Hard-Coded Values
Experimentation
Create branches to test new approaches without breaking working code
Lesson 1990What is Version Control and Why Git?Lesson 2005What are Branches and Why Use Them?
Expert input
(what practitioners say "never happens")
Lesson 75Domain-Specific Outlier Rules
Explanatory visualization
is your public communication tool.
Lesson 1213Exploratory vs Explanatory Visualization
Explicit transactions
You manually control when a group of statements is committed or rolled back.
Lesson 1111Autocommit Mode vs Explicit Transactions
Exploitation
playing the machine you currently believe is best
Lesson 1586Multi-Armed Bandit Connections
Exploration
trying different machines to learn which pays best
Lesson 1586Multi-Armed Bandit Connections
Exploratory Analysis
means investigating your data to discover patterns, spot anomalies, and understand relationships between different pieces of information.
Lesson 13Exploratory Analysis and ModelingLesson 38What is Central Tendency?Lesson 1395When to Use Grubbs' TestLesson 1727Linear Attribution Model
Exploratory data analysis
where you need to see data distributions, plot trends, and test hypotheses interactively
Lesson 2074Notebooks vs Scripts: When to Use Each
Exploratory research
Might use α = 0.
Lesson 342Alpha Level Trade-offs
Exploratory visualization
is your private investigation tool.
Lesson 1213Exploratory vs Explanatory Visualization
Exploring data
You're getting familiar with a new table and want to see what's in it
Lesson 851Selecting All Columns with Asterisk
Exponential
works when failure or arrival rates are constant over time (memoryless property).
Lesson 193Choosing Between Distributions in PracticeLesson 664What is the Exponential Family of Distributions?
Exponential complexity
With multiple attributes, the number of subgroups grows quickly
Lesson 1893Intersectionality in Fairness
Exponential decay
Sharp initial drop, then gradual decline (common in digital marketing)
Lesson 1639Time Windows and Attribution Decay
exponential distribution
flips this around—it models *how long you wait* until the next event occurs.
Lesson 164The Exponential DistributionLesson 182Special Cases: Exponential and Chi-Squared
exponential family
is a special class of probability distributions that can all be written in the same mathematical form.
Lesson 664What is the Exponential Family of Distributions?Lesson 666Natural Parameter and Sufficient StatisticsLesson 690The Poisson Distribution as a GLM
Exponential smoothing
uses a declining weight scheme controlled by parameter `α`.
Lesson 764Exponential Smoothing vs Moving Averages
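A minimal sketch of the declining weight scheme, assuming the usual simple exponential smoothing recursion (new level = α × latest observation + (1 − α) × previous level). Seeding the level with the first observation is an assumption made here for simplicity, and the series is made up.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    Each level is alpha * observation + (1 - alpha) * previous level, so the
    weight on an observation k steps back shrinks like (1 - alpha) ** k.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [series[0]]  # assumption: seed the level with the first observation
    for y in series[1:]:
        levels.append(alpha * y + (1 - alpha) * levels[-1])
    return levels

# Hypothetical series, smoothed with alpha = 0.5.
smoothed = exponential_smoothing([10, 12, 11, 13], alpha=0.5)
print(smoothed)  # [10, 11.0, 11.0, 12.0]
```

A larger α reacts faster to recent changes; a smaller α behaves more like a long moving average.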
Exponentiate the bounds
Transform to odds ratio scale:
Lesson 685Confidence Intervals for Odds Ratios
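The transformation can be sketched as follows: build the interval on the log-odds scale, then apply `exp` to each bound. The coefficient and standard error below are hypothetical, and 1.96 is the usual normal critical value for an approximate 95% interval.

```python
from math import exp

# Hypothetical logistic-regression output on the log-odds scale.
beta, se = 0.40, 0.15
z = 1.96  # approximate critical value for a 95% interval

# Interval on the log-odds scale.
lower_log, upper_log = beta - z * se, beta + z * se

# Exponentiate the point estimate and both bounds to reach the odds-ratio scale.
odds_ratio = exp(beta)
odds_ratio_ci = (exp(lower_log), exp(upper_log))
print(f"OR = {odds_ratio:.2f}, 95% CI = ({odds_ratio_ci[0]:.2f}, {odds_ratio_ci[1]:.2f})")
```

Because `exp` is nonlinear, the resulting interval is not symmetric around the odds ratio. An interval that excludes 1 corresponds to a log-odds interval that excludes 0.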
Expose data issues early
Basic analysis quickly reveals data quality problems
Lesson 2110The Minimum Viable Analysis (MVA)
Extended
Multiple months if seasonal patterns matter to your metric
Lesson 1484Duration and Timing Considerations
External benchmarks
for your product category
Lesson 1657Day-1, Day-7, Day-30 Benchmarks
External conditions
A task queries an API to see which regions have new data, then dynamically creates one downstream task per region.
Lesson 1844Dynamic Dependencies
External task dependencies
explicitly declare that a task in Pipeline A depends on a task in Pipeline B.
Lesson 1845Cross-Pipeline Dependencies
External validity
asks: *Do these results apply beyond your study's specific conditions?*
Lesson 1441Internal vs External Validity
Extra transparency
about consequences of declining
Lesson 1918Special Populations and Vulnerable Groups
Extract and lightly transform
sensitive data (masking PII) before loading to comply with regulations
Lesson 1821Hybrid Approaches and Modern Data Stacks
Extract the Irregular (I)
Subtract both trend and seasonal components from the original: `I = Y - T - S`.
Lesson 744Classical Decomposition Methods
Extract the timezone
Determine what zone a timestamp uses
Lesson 1042Working with Timestamps and Time Zones
Extract the Trend (T)
Apply a moving average to smooth out the data.
Lesson 744Classical Decomposition Methods
Extract, Transform, Load
a traditional data integration pattern that moves data from source systems into a data warehouse or analytics platform.
Lesson 1815What is ETL? Extract, Transform, Load Explained
Extracting
data from operational databases, APIs, or files
Lesson 1816What is ELT? Extract, Load, Transform Explained
Extraction timestamp
Exact date and time you pulled the data
Lesson 1161Documenting Data Sources
Extraction tools
(Fivetran, Airbyte) that load raw data with minimal transformation
Lesson 1821Hybrid Approaches and Modern Data Stacks
Extreme outliers
Even with larger samples, severe outliers can distort the mean and inflate the standard error, making your t-statistic misleading.
Lesson 390When Parametric Tests Fail: Violations of Assumptions
Extreme predictions
When estimating values far from your data's center
Lesson 550Normality of Residuals

F

F < 10
Weak instrument—your second-stage estimates may be severely biased
Lesson 1467Testing Instrument Strength and Validity
F-ratio
is simply the ratio of these two mean squares:
Lesson 443Mean Squares and the F-Ratio
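For a one-way ANOVA, the two mean squares are usually the between-group and within-group mean squares, and the ratio is F = MS_between / MS_within. The sketch below assumes that setting; the three small groups are invented data and `one_way_f_ratio` is an illustrative name.

```python
from statistics import mean

def one_way_f_ratio(groups):
    """Return (MS_between, MS_within, F) for a list of samples, one list per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Between-group sum of squares: group means around the grand mean, weighted by size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)        # df_between = k - 1
    ms_within = ss_within / (n_total - k)    # df_within = N - k
    return ms_between, ms_within, ms_between / ms_within

# Hypothetical data: three groups of three observations each.
ms_b, ms_w, f_ratio = one_way_f_ratio([[4, 5, 6], [6, 7, 8], [9, 10, 11]])
print(f"MS_between = {ms_b:.2f}, MS_within = {ms_w:.2f}, F = {f_ratio:.2f}")
```

An F-ratio near 1 suggests the group means vary about as much as chance alone would produce; a large value points to real differences between groups.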
F-statistic
and **p-value** in the "Between Groups" row tell you whether your groups differ significantly.
Lesson 444The ANOVA TableLesson 464Main Effects in Two-Way ANOVALesson 1467Testing Instrument Strength and Validity
F-statistic is large
and the **p-value is small** (typically < 0.
Lesson 627The F-Test for Model Comparison
F-test for model comparison
gives you a statistical answer.
Lesson 627The F-Test for Model Comparison
F(Factor A)
= MS_A / MS_within
Lesson 467Two-Way ANOVA F-Tests
F(Factor B)
= MS_B / MS_within
Lesson 467Two-Way ANOVA F-Tests
F(Interaction)
= MS_A×B / MS_within
Lesson 467Two-Way ANOVA F-Tests
Facebook/Meta
Monthly Active Users (MAU) or Daily Active Users (DAU) — engagement captures the platform's value through connection and content sharing.
Lesson 1606Examples of North Star Metrics by Industry
Faceted plots
split your data by a third variable, showing if the pattern changes across groups
Lesson 1195Interaction Effects Between Variables
fact table
(containing measurements like sales amounts, quantities, or counts) connects to multiple **dimension tables** (containing descriptive attributes like customer names, product details, or dates).
Lesson 956Star Schema JoinsLesson 1808Star Schema and Fact Tables
Factor A
Teaching method (online, hybrid, in-person)
Lesson 463Introduction to Two-Way ANOVA
Factor A main effect
Does Factor A matter, overall?
Lesson 464Main Effects in Two-Way ANOVA
Factor B
Time of day (morning, afternoon)
Lesson 463Introduction to Two-Way ANOVA
Factor B main effect
Does Factor B matter, overall?
Lesson 464Main Effects in Two-Way ANOVA
Factor impact
Do email campaigns accelerate conversion compared to ads?
Lesson 839Time-to-Conversion in Marketing Funnels
Failed jobs create duplicates
Rerunning after a crash might insert the same records twice
Lesson 1847What is Idempotency?
Failure
Rolls back automatically if an exception occurs
Lesson 1114Transaction Context Managers in Python
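One way to see the automatic rollback is with the standard-library `sqlite3` module, whose connection object works as a transaction context manager: the block commits if it finishes normally and rolls back if an exception escapes. The `accounts` table and the simulated failure are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if the block raises
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        raise RuntimeError("simulated failure before crediting bob")
except RuntimeError:
    pass  # the exception still propagates; the context manager only handles the rollback

# The half-finished transfer was rolled back, so both balances are unchanged.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
conn.close()
```

Note that with `sqlite3` the `with conn:` block manages the transaction only; it does not close the connection.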
Fair dice or spinners
Are all six faces equally likely?
Lesson 421Applications: Uniform, Genetic Ratios, and Distributions
fairness
in how credit flows, enabling better resource allocation and learning.
Lesson 1643Building Attribution FrameworksLesson 1878What is Bias in Data?
Fairness audits
Test model outcomes across demographic groups
Lesson 1883Protected Classes and Proxy Variables
Fairness through awareness
takes the opposite approach: explicitly include sensitive attributes so you can measure and correct for disparate impact.
Lesson 1892Fairness Through Unawareness vs Awareness
Fairness through unawareness
sounds intuitive—if the model can't see protected attributes, it can't discriminate, right?
Lesson 1892Fairness Through Unawareness vs Awareness
False Discovery Rate (FDR)
Controls the expected proportion of false positives among all significant results
Lesson 512Testing Significance in Correlation MatricesLesson 1505False Discovery Rate (FDR)Lesson 1506Benjamini-Hochberg Procedure
False Negatives (FN)
Missed actual change-points
Lesson 1418Evaluating Change-Point Detection Methods
False Positive Rate
You should see "significant" results only at your chosen alpha level (e.
Lesson 1483Pre-Experiment Validation
False Positives (FP)
Flagged changes where none exist (false alarms)
Lesson 1418Evaluating Change-Point Detection Methods
False precision
The mathematical convenience might tempt you to use an inappropriate prior
Lesson 1555Advantages and Limitations of Conjugate Priors
Falsifiability
Can be proven wrong with data
Lesson 1200Formulating Specific, Testable Hypotheses
Familiar API
If you know SQL or pandas, DataFrames feel natural
Lesson 1778DataFrames and Spark SQL Basics
family-wise error rate
(the probability of making *any* Type I error across all tests) balloons.
Lesson 337Error Rates in Practice: Multiple TestingLesson 1501The Multiple Testing Problem
Family-Wise Error Rate (FWER)
is the probability of making **at least one false discovery** (Type I error) across a "family" of hypothesis tests conducted simultaneously.
Lesson 1502Family-Wise Error Rate (FWER)Lesson 1505False Discovery Rate (FDR)
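To see how quickly the family-wise error rate grows, the sketch below uses the textbook formula 1 − (1 − α)^m. It assumes the m tests are independent and each is run at the same level α; the numbers are illustrative.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """P(at least one Type I error) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

# With alpha = 0.05, the chance of at least one false positive grows with the number of tests.
for m in (1, 5, 20):
    print(f"m = {m:>2}: FWER = {family_wise_error_rate(0.05, m):.3f}")
```

With 20 independent tests at α = 0.05, the probability of at least one false discovery is roughly 64%, which is why corrections such as Bonferroni or FDR control are used.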
Fan-in
Multiple tasks must complete before one starts
Lesson 1843Declaring Dependencies in Orchestration Tools
Fan-out
One task triggers multiple parallel tasks
Lesson 1843Declaring Dependencies in Orchestration Tools
Fast-moving funnel
Users zip through in minutes (smooth experience)
Lesson 1681Time-Based Funnel Analysis
Faster payback
= more cash to reinvest in growth
Lesson 1757Payback Period: Definition and Importance
Fault isolation
If one node crashes, others keep running.
Lesson 1771Shared-Nothing Architecture
Favor parsimony
When models perform similarly, choose the simpler one (Occam's Razor)
Lesson 633Practical Model Selection Strategy
Feather
is a lightweight columnar format optimized for speed.
Lesson 1129Parquet and Feather: Columnar Formats
Feature bloat
Models train slower and become harder to explain when filled with irrelevant features
Lesson 2135Dead Experimental Code and Feature Sprawl
Feature branches
you're working on alone
Lesson 2020The Golden Rule of Rebase
Feature development
Build new model features without disrupting production code
Lesson 2005What are Branches and Why Use Them?
Feature engineering needs
"Create interaction term between X and Y"
Lesson 1212EDA Summary Documentation and Next Steps
Feature engineering repeats
New behaviors may require new features entirely
Lesson 2128Data Distribution Shifts Frequently
Feature requests
Stakeholders inevitably want new views or filters
Lesson 1979Maintenance and Sustainability Considerations
Feature selection discipline
After adding features, measure their importance and remove low-contributors before the next iteration.
Lesson 2135Dead Experimental Code and Feature Sprawl
Feature sprawl
happens when you keep accumulating features for models without ever pruning the ones that don't contribute value.
Lesson 2135Dead Experimental Code and Feature Sprawl
Features
Early behavior metrics like days to first purchase, first-order value, login frequency in week one, number of products viewed, engagement with onboarding emails
Lesson 1668Predictive LTV Models
Feedback loops
A biased model's decisions become tomorrow's training data (e.
Lesson 1882Algorithmic Amplification of BiasLesson 1923Algorithmic Amplification of Harm
Few covariates
to match on (2-4 variables)
Lesson 1446Exact Matching
Fewer Type I errors
– You're less likely to reject H₀ when it's actually true
Lesson 342Alpha Level Trade-offs
Field conventions
(what does your discipline expect?)
Lesson 324Common Significance Levels: 0.05, 0.01, and 0.10
Field standards
Psychology and social sciences commonly use α = 0.
Lesson 342Alpha Level Trade-offs
Figure
is the entire building—the blank canvas or container that holds everything.
Lesson 1255The Anatomy of a Matplotlib Figure
Fill rate
% of buyer requests successfully matched
Lesson 1630Marketplace Metrics: GMV, Take Rate, and Liquidity
Filling gaps
Use LEAD to preview the next non-null value
Lesson 1023Introduction to Window Functions: LAG and LEAD
Filter conditions
WHERE clauses applied at different stages.
Lesson 1084Reading and Interpreting Query Execution Plans
Filter early
Use `WHERE` before `DISTINCT` or `ORDER BY` to reduce the data volume
Lesson 880Performance Considerations and Best PracticesLesson 997CTE Best Practices and Performance
Filtering
Writing matching rows to a new file
Lesson 1800Chunked Reading with read_csv
Filtering conditions
Applying `WHERE` clauses early reduces row counts before joining
Lesson 951Join Order and Performance
Filtering early
Place restrictive `WHERE` conditions before or with early joins
Lesson 951Join Order and Performance
Financial analysis
(comparing stocks with vastly different price ranges)
Lesson 200Comparing Values Across Different Distributions
Financial conflicts
You're analyzing sales data for a product your company desperately needs to succeed.
Lesson 35Conflicts of Interest and IndependenceLesson 1930Managing Conflicts of Interest
Financial portfolios
Assets with different investment amounts
Lesson 43Weighted Mean and Its Applications
Financing decisions
Investors scrutinize this metric to assess capital efficiency
Lesson 1757Payback Period: Definition and Importance
Find minimal adjustment sets
the smallest set of variables that, when conditioned on, blocks all backdoor paths
Lesson 1475Using DAGs to Guide Analysis
Find only common rows
between queries
Lesson 998Introduction to Set Operations
Find shared drivers
Do two metrics both depend on the same underlying factor?
Lesson 1625Cross-Functional Metric Dependencies
Find the column
for the second decimal place (e.
Lesson 198Using Z-Tables for Probability
Find the largest i
where pᵢ ≤ (i/m) × α
Lesson 1506Benjamini-Hochberg Procedure
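The "largest i" step is what makes Benjamini-Hochberg a step-up procedure: every hypothesis ranked at or below that i is rejected, even if its own p-value missed its own threshold. A minimal sketch, with invented p-values and an illustrative function name:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans (same order as p_values): True means rejected."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda idx: p_values[idx])

    # Find the largest rank i whose p-value satisfies p_(i) <= (i / m) * alpha.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_rank = rank

    # Reject every hypothesis ranked at or below that i.
    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from five tests, in their original order.
p_vals = [0.9, 0.035, 0.005, 0.039, 0.03]
decisions = benjamini_hochberg(p_vals, alpha=0.05)
print(decisions)  # [False, True, True, True, True]
```

Here 0.03 and 0.035 exceed their own thresholds (0.02 and 0.03), yet they are rejected because the fourth-ranked p-value, 0.039, falls under its threshold of 0.04.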
Find the midpoint
of each group (class interval)
Lesson 45Central Tendency for Grouped Data
Find the quantiles
of your posterior distribution that capture that probability mass
Lesson 1562Credible Intervals for ProportionsLesson 1575Computing Equal-Tailed Credible Intervals
Find the row
corresponding to the first two digits of your Z-score (e.
Lesson 198Using Z-Tables for Probability
Finding duplicates
Match rows where key fields are identical
Lesson 947Self-Joins for Comparisons Within a Table
Finding gaps
Identify products in inventory but never sold
Lesson 1002EXCEPT: Finding Differences
Finding unique categories
What products do we sell?
Lesson 873Understanding DISTINCT: Removing Duplicate Rows
Findings
Your main insights with supporting visualizations
Lesson 1966Report Structure and Executive Summary
Finite Population Correction (FPC)
factor adjusts for this by *shrinking* the standard error to reflect the extra precision:
Lesson 264Finite Population Correction
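A common form of the correction is FPC = √((N − n) / (N − 1)), which multiplies the usual standard error. The sketch below assumes that form for the standard error of a mean; the population size, sample size, and standard deviation are hypothetical.

```python
from math import sqrt

def finite_population_correction(population_size: int, sample_size: int) -> float:
    """FPC = sqrt((N - n) / (N - 1)); always between 0 and 1 when 1 < n < N."""
    return sqrt((population_size - sample_size) / (population_size - 1))

# Hypothetical survey: 100 people sampled without replacement from a population of 1,000.
N, n, sample_sd = 1000, 100, 15.0

se_unadjusted = sample_sd / sqrt(n)
fpc = finite_population_correction(N, n)
se_adjusted = se_unadjusted * fpc
print(f"FPC = {fpc:.4f}, SE = {se_unadjusted:.3f} -> {se_adjusted:.3f}")
```

When the sample is a tiny fraction of the population, the factor is close to 1 and the correction makes little practical difference.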
Firewall Issues
block traffic between your application and database.
Lesson 1093Troubleshooting Connection Issues
First batch of data
Start with Beta(2, 2) prior → observe 10 successes, 15 failures → get Beta(12, 17) posterior
Lesson 1563Sequential Updating with New Data
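The update in this entry follows the Beta-Binomial conjugate rule: add successes to the first parameter and failures to the second. The sketch below reproduces the Beta(2, 2) → Beta(12, 17) step; the second batch (5 successes, 5 failures) is a made-up continuation, and `update_beta` is an illustrative name.

```python
def update_beta(alpha: float, beta: float, successes: int, failures: int):
    """Conjugate update: Beta(alpha, beta) prior -> Beta(alpha + s, beta + f) posterior."""
    return alpha + successes, beta + failures

prior = (2, 2)

# First batch from the entry above: 10 successes, 15 failures.
posterior_1 = update_beta(*prior, successes=10, failures=15)
print(posterior_1)  # (12, 17)

# Hypothetical second batch: yesterday's posterior becomes today's prior.
posterior_2 = update_beta(*posterior_1, successes=5, failures=5)

# Processing all the data at once gives the same posterior as updating sequentially.
all_at_once = update_beta(*prior, successes=15, failures=20)
posterior_mean = posterior_2[0] / sum(posterior_2)
print(posterior_2, all_at_once, round(posterior_mean, 3))
```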
First difference
Treatment group's change = (After - Before)
Lesson 1452The Difference-in-Differences Setup
First difference (Control group)
Calculate the change in the control group over the same period: `(Y_control_after - Y_control_before)`
Lesson 1454Calculating the DiD Estimator
First difference (Treatment group)
Calculate the change in the treatment group from before to after the intervention: `(Y_treatment_after - Y_treatment_before)`
Lesson 1454Calculating the DiD Estimator
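Putting the two first differences together, the DiD estimate is the treatment group's change minus the control group's change. A minimal sketch with invented group means; the variable names mirror the formulas in the entries above.

```python
# Hypothetical average outcomes before and after the intervention.
y_treatment_before, y_treatment_after = 50.0, 65.0
y_control_before, y_control_after = 48.0, 53.0

# First differences: the change within each group over the same period.
treatment_change = y_treatment_after - y_treatment_before  # 15.0
control_change = y_control_after - y_control_before        # 5.0

# Second difference: the treatment change net of the trend seen in the control group.
did_estimate = treatment_change - control_change
print(f"DiD estimate = {did_estimate}")  # DiD estimate = 10.0
```

The estimate is only credible as a causal effect if the two groups would have followed parallel trends without the intervention.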
First evidence (fingerprint found)
Apply Bayes' Theorem → posterior becomes 60%.
Lesson 114Sequential Updating
First join
(LEFT): All customers appear, even those without orders (orders columns show NULL)
Lesson 952Mixing Join Types
First meaningful action
When a user performs a core action that indicates true engagement
Lesson 1646Defining Cohort Start Events
First name
can reveal gender or ethnic background
Lesson 1883Protected Classes and Proxy Variables
First or last names
→ ethnicity, religion, gender
Lesson 1889Proxy Variables and Redlining
First purchase
When a user converts from browser to buyer
Lesson 1646Defining Cohort Start Events
First touch
(30%) – The initial interaction that brings awareness
Lesson 1730W-Shaped Attribution Model
First touch matters
Someone discovered you somehow—that channel deserves significant credit
Lesson 1729Position-Based (U-Shaped) Attribution
First-Pass Yield
measures the percentage of units that pass quality checks without rework on the first attempt.
Lesson 1636Manufacturing Metrics: OEE, Yield, and Cycle Time
Fisher's z-transformation
, which converts *r* into a value *z'* that *is* approximately normally distributed:
Lesson 503Confidence Intervals for Correlation Coefficients
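The transformation is z' = atanh(r) = ½ · ln((1 + r) / (1 − r)), with approximate standard error 1 / √(n − 3). A sketch of the resulting confidence interval follows; the correlation and sample size are hypothetical and 1.96 is the usual 95% normal critical value.

```python
from math import atanh, tanh, sqrt

def correlation_confidence_interval(r: float, n: int, z_crit: float = 1.96):
    """Approximate CI for a Pearson correlation via Fisher's z-transformation."""
    z_prime = atanh(r)              # transform r to the approximately normal z' scale
    se = 1 / sqrt(n - 3)            # standard error of z'
    lower_z, upper_z = z_prime - z_crit * se, z_prime + z_crit * se
    return tanh(lower_z), tanh(upper_z)  # back-transform the bounds to the r scale

# Hypothetical result: r = 0.6 from n = 50 paired observations.
lower, upper = correlation_confidence_interval(0.6, 50)
print(f"95% CI for r: ({lower:.3f}, {upper:.3f})")
```

The back-transformed interval is not symmetric around r, which reflects the skewed sampling distribution of correlations away from zero.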
Fit models
Kaplan-Meier for overall survival curves; Cox models to test effects of predictors like manufacturing date, component supplier, or usage intensity
Lesson 837Product Warranty and Failure Analysis
Fit the Holt-Winters model
with each combination
Lesson 772Holt-Winters Parameter Optimization
Fitness trackers
revealing military base locations through jogging patterns
Lesson 1922Surveillance and Secondary Data Uses
Fitted value (Ŷᵢ)
= β₀ + β₁Xᵢ
Lesson 538What Are Fitted Values?
Fitted value (ŷ)
"Here's what the model *predicts* based on the linear relationship"
Lesson 543Residuals as Unexplained Variation
fitted values
(group means) on the x-axis and **residuals** (observed minus predicted) on the y-axis.
Lesson 451Diagnostic Plots for ANOVALesson 538What Are Fitted Values?
Fix
Always use specific join conditions.
Lesson 949Avoiding Common Self-Join Pitfalls
Fixed aspect ratios
ensure equal spacing (crucial for maps)
Lesson 1344Scales and Coordinate Systems
Fixed n
Known number of trials (products, patients, voters, visitors)
Lesson 131Real-World Applications of Binomial Distributions
Fixed probability
The probability p stays the same each time
Lesson 123Bernoulli Trial Definition and Properties
Fixed random seeds
where randomness is involved
Lesson 1981What Makes a Report Reproducible?
Fixing inconsistencies
Standardizing formats (like dates, phone numbers, or categories) so everything follows the same pattern.
Lesson 12Data Cleaning and Preparation
Flag
Create indicator columns for "was missing"
Lesson 1207Missing Data Assessment and Strategy
Flag legitimate values
as outliers in skewed data
Lesson 1390Assumptions of Grubbs' Test
Flat (uniform) priors
over a reasonable range
Lesson 1565Prior Distributions for Normal Means
Flat rolling mean
= constant average (good!)
Lesson 715Visual Tests for Stationarity
Flat rolling std
= constant variability (good!)
Lesson 715Visual Tests for Stationarity
Flexible
Combines the strengths of multiple sampling techniques you've already learned
Lesson 238Multistage SamplingLesson 1557The Beta-Binomial Model
Flexible ranges
allow newer patch or minor versions, enabling automatic security fixes and bug patches without manual intervention.
Lesson 2050Pinning Versions vs Flexible Ranges
Flipped coordinates
swap x and y axes for horizontal layouts
Lesson 1344Scales and Coordinate Systems
FLOAT
or **DECIMAL**: Numbers with decimals (e.
Lesson 846Tables, Schemas, and Data Types
Focus indicators
are critical: users must see *where* they are in the interface at all times.
Lesson 1253Interactive Accessibility: Keyboard Navigation
Focus on large-data scenarios
With abundant data, the likelihood dominates and priors matter less (robustness naturally increases)
Lesson 1572Sensitivity Analysis and Prior Robustness
Focus on the slope
The slope still tells you about the relationship *within your data range*
Lesson 526When the Intercept Has No Meaning
Folium
and **Plotly** transform your geographic data into engaging web visualizations.
Lesson 1313Interactive Maps with Folium and Plotly
Follow ethical guidelines
established by your organization or profession
Lesson 35Conflicts of Interest and Independence
Follow multiple channels
academic papers for cutting-edge research, industry blogs for practical applications, documentation for tool updates, and community forums for real-world problem-solving patterns.
Lesson 2143Continuous Learning and Skill Development
Follow up with nonrespondents
Send reminders, offer incentives, or use different contact methods to reduce nonresponse bias.
Lesson 250Strategies for Bias Detection and Mitigation
Follow-up analysis
needed to refine the approach
Lesson 1970Recommendations and Next Steps
Follow-ups
If confirmed, design experiment to optimize marketing for that segment
Lesson 1204From Hypothesis to Analysis Plan
Font face
Make titles bold with `face = "bold"` or italicize annotations
Lesson 1364Customizing Text Elements
Font size
Measured in points; larger for titles, smaller for tick labels
Lesson 1297Font Properties and Text StylingLesson 1364Customizing Text Elements
Font sizing
At publication dimensions, default fonts become tiny.
Lesson 1369Publication-Ready Plot Styling
Font style
'normal', 'italic', or 'oblique'
Lesson 1297Font Properties and Text Styling
Font weight
'normal', 'bold', 'light', or numeric values (100-900)
Lesson 1297Font Properties and Text Styling
Foot traffic
customers entering stores—acts as a leading indicator.
Lesson 1634Retail Metrics: Same-Store Sales and Inventory Turnover
For a one-sided test
at α = 0.
Lesson 326Critical Values
For a two-sided test
at α = 0.
Lesson 326Critical Values
For comparisons
between categories → use bar charts or column charts
Lesson 1230Choosing the Right Chart Type
For distributions
of continuous data → use histograms or box plots
Lesson 1230Choosing the Right Chart Type
For each p-value
, calculate its threshold: (i/m) × α, where i is its rank, m is the total number of tests, and α is your target FDR level (e.
Lesson 1506Benjamini-Hochberg Procedure
For expensive aggregations
approximate when possible, or use `.
Lesson 1796Limitations and Differences from Pandas
For floats
If precision beyond ~7 significant digits isn't critical for your analysis, `float32` cuts memory in half with minimal impact on most calculations.
Lesson 1799Optimal Data Types and Downcasting
For nested models
Use the **Partial F-Test** (which you learned in lesson 623) to formally test whether the extra predictors significantly improve the model
Lesson 626Nested vs Non-Nested Models
For non-nested models
Compare using **Adjusted R-Squared**, **AIC**, or **BIC**, but you cannot use the Partial F-Test
Lesson 626Nested vs Non-Nested Models
For part-to-whole composition
→ use stacked bar charts (or reluctantly, pie charts)
Lesson 1230Choosing the Right Chart Type
For performance
`UNION ALL` skips the expensive duplicate-checking step, making it significantly faster on large datasets
Lesson 1000UNION ALL: Preserving Duplicates
For positive values
It applies a transformation similar to Box-Cox
Lesson 215Yeo-Johnson Transformation
For relationships
between two numeric variables → use scatter plots
Lesson 1230Choosing the Right Chart Type
For sorting
minimize sorts or do them after filtering to smaller datasets.
Lesson 1796Limitations and Differences from Pandas
For strings
The `category` dtype stores each unique value once plus integer codes, which yields large memory savings when cardinality is low relative to row count.
Lesson 1799Optimal Data Types and Downcasting
For three variables
→ use bubble charts or heatmaps
Lesson 1230Choosing the Right Chart Type
For trends over time
→ use line charts
Lesson 1230Choosing the Right Chart Type
Forecast accurately
Build models on the stable, adjusted series
Lesson 748Seasonally Adjusted Data
Forecast ahead
Generate predictions for the held-out period
Lesson 790Out-of-Sample Forecast Evaluation
Forecast future churn
more accurately using cohort-specific curves
Lesson 1672Cohort-Based Churn Analysis
Forecast future values
based on historical direction
Lesson 706Trend: Long-Term Direction
Forecast(t)
is the previous forecast
Lesson 758Simple Exponential Smoothing (SES)
Foreign Key
A column that references the primary key in another table (like `customer_id` in an orders table)
Lesson 843Relational Database ConceptsLesson 921Primary and Foreign Key RelationshipsLesson 1051Introduction to Foreign Keys
Form
Is the relationship linear (straight-line pattern) or curved?
Lesson 480Scatterplots and Visual Assessment
Formal definition
A **sampling distribution** is the probability distribution of a statistic (like the mean, median, or proportion) computed from *all possible samples* of a fixed size drawn from the same population.
Lesson 251What is a Sampling Distribution?
formal hypothesis test
that helps you determine whether your data is normally distributed.
Lesson 205Shapiro-Wilk TestLesson 1389What is Grubbs' Test?
Formal reviews
Monthly presentations for key milestones and decision points
Lesson 2104Communication Cadence and Updates
Formal test second
Does it confirm major concerns, or is it just picking up minor noise?
Lesson 570Q-Q Plots vs Formal Normality Tests: When Visual Checks Matter
Format errors
Malformed email addresses or phone numbers
Lesson 1109Input Validation and Defense in Depth
Format expectations
You assume dates come in one format, numeric codes have certain meanings, or null values are handled consistently—until they're not.
Lesson 2133Undocumented Data Dependencies
Format inconsistencies
Dates in different formats, mixed capitalizations
Lesson 1150What is Data Validation?
Format selection
Vector formats (PDF, SVG) scale perfectly for print; PNG works for web at 300+ DPI.
Lesson 1369Publication-Ready Plot Styling
Foundation (base layer)
Your data and aesthetic mappings using `ggplot()`
Lesson 1347Understanding Layers in ggplot2
Foundation for inference
Understanding the sampling distribution lets us say things like "we're 95% confident the true population mean is between X and Y"—which we'll explore in future lessons.
Lesson 251What is a Sampling Distribution?
Fragmentation
occurs when data pages become scattered physically on disk due to inserts, updates, and deletes.
Lesson 1086Index Maintenance and Monitoring
frame
the subset of rows within your partition that the function operates on.
Lesson 1015ROWS vs RANGE Frame SpecificationsLesson 1016Cumulative Sums and Running Totals
Frame count or data
how many frames to generate
Lesson 1327Creating Animations with FuncAnimation
Framing the technical problem
Is this supervised learning?
Lesson 2085Stage 1: Problem Definition and Scoping
Frequency order (descending)
Best for spotting the most/least common categories at a glance—this is usually recommended for EDA
Lesson 1178Bar Charts for Categorical Data
Frequentist
"If we ran this test repeatedly, 95% of intervals constructed this way would capture the true rate."
Lesson 1564Comparing Bayesian and Frequentist Proportion Inference
Frequentist A/B testing
treats the true conversion rate as a fixed (but unknown) parameter.
Lesson 1580Bayesian vs Frequentist A/B Testing
Frequentist interpretation
treats probability as a **long-run frequency**.
Lesson 1540Comparing Bayesian and Frequentist Interpretations
Frequently accessed lookup values
Product categories, customer names, or status labels that rarely change but are queried constantly.
Lesson 1074Duplicating Data Across Tables
Freshness
Maximum age of data (e.
Lesson 1869Data Quality Metrics and SLAs
Friction zone
2-10 GB datasets may work but cause slowdowns and memory pressure
Lesson 1783Data Size Thresholds: When Pandas Isn't Enough
From domain expertise
If historical data suggests 100 conversions from 500 trials, you could use α = 100, β = 400 as an informative prior that reflects actual experience.
Lesson 1558Choosing Informative Priors for Proportions
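A short sketch of what the α = 100, β = 400 prior implies, assuming a Beta prior with a binomial likelihood (the usual conjugate setup for proportions). The new-data counts are hypothetical.

```python
def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Informative prior from the entry: 100 conversions, 400 non-conversions.
prior_alpha, prior_beta = 100, 400
print(beta_mean(prior_alpha, prior_beta))  # prior mean conversion rate

# Hypothetical new experiment: 30 conversions out of 100 trials.
new_successes, new_failures = 30, 70

# Conjugate update: add successes to alpha and failures to beta.
post_alpha = prior_alpha + new_successes
post_beta = prior_beta + new_failures
print(beta_mean(post_alpha, post_beta))  # posterior mean, pulled only slightly toward 0.30
```

Because the prior carries the weight of 500 earlier trials, 100 new trials move the estimate only a little.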
From percentile to z-score
Work backwards.
Lesson 199Finding Percentiles with Z-Scores
FROM subqueries
are embedded inline:
Lesson 974When to Use FROM Subqueries vs CTEs
FROM table1
Your starting table (often called the "left" table)
Lesson 919Basic INNER JOIN Syntax
From z-score to percentile
Look up your z-score in a z-table.
Lesson 199Finding Percentiles with Z-Scores
FULL OUTER JOIN
(also called FULL JOIN) returns all rows from both tables, regardless of whether there's a matching row in the other table.
Lesson 935What is a FULL OUTER JOIN?Lesson 936FULL OUTER JOIN Syntax
Funnel analysis
is a method that visualizes and measures how users move through a defined sequence of steps (a "funnel") toward completing a desired action—like making a purchase, signing up, or subscribing.
Lesson 1678What is Funnel Analysis?
Funnel or cone shape
Variance increases or decreases with predictions
Lesson 560Scale-Location Plot (Spread-Location Plot)
Funnel shapes
Variance changes as predictions increase (violates homoscedasticity)
Lesson 556What Are Residuals and Why Plot Them?
Future-proofing
You can resurrect old projects years later
Lesson 2047What is Dependency Management?
Fuzzy
A job training program *offered* to all unemployed workers over age 55.
Lesson 1461Sharp vs Fuzzy RDD

G

Gaming the System
A credit scoring model could be reverse-engineered by fraudsters who game input features to appear creditworthy.
Lesson 1920Anticipating Misuse of Data Products
Gamma distribution
is a continuous probability distribution that describes positive real numbers (values greater than zero).
Lesson 181Gamma Distribution: Shape and Rate ParametersLesson 1552Gamma-Poisson Conjugacy
Gap analysis
Find when values changed from the prior period
Lesson 1024LAG Function: Accessing Previous Row Values
gaps
(empty bins) that might signal data collection issues or natural separations, and **outliers** (isolated bars far from the main cluster).
Lesson 1175Histograms for Distribution ShapeLesson 1220Histograms for Continuous Distributions
Gaussian mechanism
adds noise from a normal (Gaussian) distribution.
Lesson 1899Adding Noise for Privacy
GDPR principles
, your organization's **conflicts of interest** policy, or industry standards rather than personal objections.
Lesson 1931When to Push Back on Requests
General Multiplication Rule
, and it works for *any* two events—dependent or independent.
Lesson 88General Multiplication Rule
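The rule in its usual form is P(A and B) = P(A) × P(B | A). A small worked example, using a standard 52-card deck as an illustrative scenario of two dependent events:

```python
from fractions import Fraction

# Event A: the first card drawn is an ace.
p_first_ace = Fraction(4, 52)

# Event B given A: the second card is an ace, with one ace already removed.
p_second_ace_given_first = Fraction(3, 51)

# General multiplication rule: P(A and B) = P(A) * P(B | A)
p_both_aces = p_first_ace * p_second_ace_given_first
print(p_both_aces)  # 1/221
```

If the events were independent, P(B | A) would equal P(B) and the rule would reduce to the simple product P(A) × P(B).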
Generalization
replaces specific values with broader categories: exact ages become age ranges (25-30), precise locations become regions, exact salaries become brackets.
Lesson 1895Data Anonymization BasicsLesson 1896K-Anonymity
Generalized Linear Models
that handle non-normal outcomes.
Lesson 664What is the Exponential Family of Distributions?
Generalized Linear Models (GLMs)
, which extend regression to non-normal outcomes.
Lesson 668Common Distributions as Exponential Family Members
Generate a randomization
using your chosen method
Lesson 1492Rerandomization and Practical Implementation
Generate replicated data
For each posterior sample of parameters, simulate a new dataset
Lesson 1596Posterior Predictive Checks and Model Comparison
Generate scenarios
"What if budget increases 20%?"
Lesson 1742Budget Optimization Using MMM
Generate the file automatically
from your current environment
Lesson 1987Environment and Dependency Management
Generated outputs
model binaries (`.
Lesson 1996The .gitignore File
Genetics
You expect a 9:3:3:1 ratio of phenotypes.
Lesson 414Introduction to Chi-Squared Goodness of Fit Test
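A sketch of how the 9:3:3:1 expectation feeds a goodness-of-fit statistic. The observed counts are illustrative, and 7.815 is the standard chi-squared critical value for 3 degrees of freedom at α = 0.05.

```python
observed = [315, 108, 101, 32]   # illustrative phenotype counts
ratio = [9, 3, 3, 1]             # expected 9:3:3:1 ratio

total = sum(observed)
expected = [total * r / sum(ratio) for r in ratio]

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 7.815           # df = 4 categories - 1 = 3, alpha = 0.05
print(round(chi2, 3), chi2 > critical_value)
```

Here the statistic falls well below the critical value, so these counts are consistent with the expected ratio.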
GeoDataFrame
like a pandas DataFrame, but with a special `geometry` column containing the actual shapes.
Lesson 1311Working with Shapefiles and GeoJSON
Geographic limitations
Missing homeless populations or remote areas
Lesson 249Coverage Error and Undercoverage
Geographic regions
(`country`, `region`) — when analysis is region-specific
Lesson 1812Partitioning and Clustering Strategies
GeoJSON
is a newer, web-friendly format built on JSON.
Lesson 1311Working with Shapefiles and GeoJSON
Geometric objects (geoms)
The actual visual marks—points, lines, bars, polygons—that represent your data
Lesson 1339What is the Grammar of Graphics?Lesson 1342Geometric Objects (geoms)
Geometries (geom)
The visual marks representing data (points, lines, bars, boxes)
Lesson 1340The Seven Layers of Grammar
ggplot2
ships with a distinctive gray background with white gridlines—a deliberate choice to reduce visual clutter while maintaining reference lines.
Lesson 1371Default Aesthetics and Design ChoicesLesson 1373Statistical Transformations: Built-in vs Manual
Ghost ads
and **PSA (Public Service Announcement) tests** are incrementality testing techniques where you show *neutral content* instead of your real ads to a control group.
Lesson 1747Ghost Ads and PSA Tests
Global F-test
Asks "Does *at least one* predictor help explain the outcome?"
Lesson 622Relationship Between F-Test and t-Tests
Go deeper
when you need to diagnose problems in a specific area (e.
Lesson 1623Depth vs Breadth in Metric Trees
Go wider
when you need comprehensive coverage (e.
Lesson 1623Depth vs Breadth in Metric Trees
Goals or targets
"We've achieved 85% of our annual target"
Lesson 1962Contextualizing Numbers
Good
Random cloud of points scattered evenly around the horizontal line at y=0, with consistent spread
Lesson 557The Residuals vs Fitted Values PlotLesson 562Index Plots and Time-Ordered ResidualsLesson 1857Logging Best Practices
Good (pyramid)
"Reducing churn requires increasing early-user engagement."
Lesson 1942The Pyramid Principle: Starting with the Conclusion
Good example
"Bar chart comparing Q4 sales across five regions."
Lesson 1250Text Alternatives and Screen Reader Compatibility
Good hypothesis
"Changing the checkout button from blue to green will increase conversion rate by at least 2 percentage points."
Lesson 1479Formulating Hypotheses
Goodhart's Law
*"When a measure becomes a target, it ceases to be a good measure."*
Lesson 1521Risks of Optimizing for Surrogates
Goodness-of-fit
How well the model explains the data (measured via likelihood)
Lesson 629Akaike Information Criterion (AIC)Lesson 700AIC and BIC for Model Selection
Goodness-of-fit tests
Formal tests compare observed versus expected frequencies.
Lesson 693Overdispersion in Count Data
GPL
Requires derivative works to also be open source (more restrictive)
Lesson 2082Choosing a License for Data Science Projects
graceful degradation
the pipeline remains healthy and productive while you handle edge cases systematically.
Lesson 1852Dead Letter QueuesLesson 1854Testing Error Handling
Grafana
, and **Datadog** automate this process, offering dashboards that show pipeline status at a glance and trigger alerts when thresholds are breached.
Lesson 1861Monitoring Tools and Dashboards
Grammar of Graphics
is a systematic approach to creating visualizations by combining independent building blocks, rather than selecting from a fixed menu of chart types.
Lesson 1339What is the Grammar of Graphics?
Graph
Visual representation of task relationships
Lesson 1833Introduction to Apache Airflow
Graph algorithms
Computing connected components or PageRank involves recursive traversals
Lesson 1784Computation Complexity: Beyond Data Size
Great Expectations
is the leading Python library for this purpose.
Lesson 1158Automated Validation FrameworksLesson 1164Tools for Lineage Tracking
Greater sensitivity to effects
With less noise in your estimate, even small differences between your null hypothesis and reality become detectable.
Lesson 340Power and Sample Size Relationship
Greenwood's formula
gives us the standard error (SE) of the Kaplan-Meier estimator at any time t.
Lesson 814Standard Errors and Confidence Intervals
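A minimal sketch of Greenwood's formula, SE[S(t)] = S(t) × sqrt(Σ dᵢ / (nᵢ (nᵢ − dᵢ))), summed over event times up to t. The function name and the risk table are hypothetical, and the sketch assumes fewer events than subjects at risk at every time (dᵢ < nᵢ).

```python
import math


def km_with_greenwood(table):
    """table: (time, n_at_risk, n_events) tuples sorted by time.

    Returns (time, survival_estimate, standard_error) per event time.
    """
    survival = 1.0
    greenwood_sum = 0.0
    results = []
    for time, n_at_risk, n_events in table:
        # Kaplan-Meier step: multiply by the conditional survival fraction.
        survival *= (n_at_risk - n_events) / n_at_risk
        # Greenwood term for this event time (requires n_events < n_at_risk).
        greenwood_sum += n_events / (n_at_risk * (n_at_risk - n_events))
        results.append((time, survival, survival * math.sqrt(greenwood_sum)))
    return results


risk_table = [(1, 10, 1), (3, 8, 2)]  # hypothetical data
for time, s, se in km_with_greenwood(risk_table):
    print(time, round(s, 3), round(se, 3))
```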
GridSpec
lets you treat your figure like a flexible grid where subplots can span multiple cells, much like merging cells in a spreadsheet.
Lesson 1278GridSpec for Complex Layouts
Gross Profit Margin
Revenue minus cost of goods sold
Lesson 1516Business Metrics: Definition and Examples
Group A
Values tightly clustered around 50 (median = 50)
Lesson 394Interpreting Rank-Based Tests: Medians vs Distributions
Group B
Values spread widely from 20 to 80 (median = 50)
Lesson 394Interpreting Rank-Based Tests: Medians vs Distributions
GROUP BY groups
The remaining rows are organized into groups
Lesson 915Combining WHERE and HAVING
Group your data
by the categorical variable (e.
Lesson 1185Grouped Summary Statistics
Grouped (side-by-side) bars
excel when you want to compare specific values across categories.
Lesson 1188Stacked and Grouped Bar Charts
Grouped analyses
compare correlation coefficients or slopes between subgroups
Lesson 1195Interaction Effects Between Variables
Grouped bar charts
place bars side-by-side for easy direct comparison between groups.
Lesson 1188Stacked and Grouped Bar Charts
grouped bars
when precise, side-by-side comparison of subcategories is your priority.
Lesson 1226Stacked and Grouped Bar ChartsLesson 1266Bar Plots: Categorical Comparisons
Growth stage startups
often tolerate higher CAC and longer payback because they're prioritizing market capture.
Lesson 1759Optimizing ROAS, CAC, and Payback Together
Grubbs' test tables
(organized by sample size and α level) or calculate them using formulas involving the t-distribution.
Lesson 1392Critical Values and Significance Testing
guardrails
are defensive metrics designed to catch these problems before they damage your business.
Lesson 1624Counter-Metrics and GuardrailsLesson 1925Mitigation Strategies and Responsible Disclosure
Guide onboarding
Focus new users on high-value features first
Lesson 1696Feature Adoption and Usage Frequency

H

Hₐ
At least one method produces a different average score
Lesson 439ANOVA Hypotheses and Research Questions
H statistic
that measures how much the rank sums vary between groups.
Lesson 471Kruskal-Wallis H Test: The Non-Parametric One-Way ANOVA
H₀ (Null Hypothesis)
All groups have equal variances
Lesson 380Testing Equal Variances: Levene's and Bartlett's Tests
H₀: p = 0.10
(the claim is correct)
Lesson 401Setting Up Hypotheses for Proportions
H₁ (Alternative Hypothesis)
At least one group has a different variance
Lesson 380Testing Equal Variances: Levene's and Bartlett's Tests
H₁: p > 0.10
(nausea rate exceeds the claim)
Lesson 401Setting Up Hypotheses for Proportions
H₁: μ_d > 0
(training increases scores on average)
Lesson 373Hypotheses for Paired t-Tests
Hadoop MapReduce
was the original distributed processing engine—breaking jobs into "map" (process chunks independently) and "reduce" (combine results) phases.
Lesson 1764The Big Data Technology Landscape
Hamiltonian Monte Carlo (HMC)
borrows physics concepts to guide sampling intelligently.
Lesson 1593Hamiltonian Monte Carlo and NUTS
Handle edge cases explicitly
New scenario?
Lesson 2128Data Distribution Shifts Frequently
Handle staggered timing
Each unit's treatment effect is estimated relative to its own adoption date
Lesson 1457Multiple Time Periods and Staggered Adoption
Handling missing values
Deciding what to do when data points are absent—should you fill them in, remove those rows, or use another strategy?
Lesson 12Data Cleaning and Preparation
Hard to enforce rules
You can't easily prevent invalid combinations (like a refund with no related sale)
Lesson 1148Handling Multiple Types in One Table
Harder interpretation
when you're drowning in similar variables
Lesson 1197Identifying Variable Importance and Redundancy
Harder to judge accurately
this is why pie charts are often criticized.
Lesson 1231Channels of Visual Encoding
HARKing
(Hypothesizing After Results are Known), where you retrofit explanations to unexpected patterns.
Lesson 1485Documentation and Pre-Registration
Hash joins
excel with large tables that fit in memory—they're fast but require space to build the hash structure.
Lesson 957Join Strategies: Nested Loop, Hash, Merge
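The build-then-probe idea behind a hash join can be sketched in a few lines. This is an illustration of the technique, not how any particular database implements it; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Inner join two lists of dicts on `key` using an in-memory hash table."""
    # Build phase: hash one input (ideally the smaller one) by join key.
    # This is the memory cost the definition refers to.
    index = defaultdict(list)
    for row in build_rows:
        index[row[key]].append(row)

    # Probe phase: a single pass over the other input, one lookup per row.
    joined = []
    for row in probe_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


customers = [{"customer_id": 1, "name": "Ada"}, {"customer_id": 2, "name": "Grace"}]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 1},
    {"order_id": 102, "customer_id": 3},  # no matching customer, dropped by the inner join
]
print(hash_join(customers, orders, "customer_id"))
```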
HAVING filters last
It removes entire groups based on their aggregated values
Lesson 915Combining WHERE and HAVING
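The filtering order (WHERE on rows, then GROUP BY, then HAVING on groups) can be seen in a small `sqlite3` sketch. The table and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("east", "paid", 100), ("east", "paid", 250), ("east", "refunded", 900),
        ("west", "paid", 50), ("west", "paid", 60),
    ],
)

rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'        -- 1. row filter, applied before grouping
    GROUP BY region              -- 2. remaining rows are organized into groups
    HAVING SUM(amount) > 200     -- 3. group filter, applied to aggregated values
    ORDER BY region
""").fetchall()

print(rows)
```

The refunded 900 never reaches the sum because WHERE removes it first; the west group is then dropped by HAVING because its paid total is only 110.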
Header row
The first line often contains column names (`name`, `age`, `city`)
Lesson 1125CSV Files: Structure and Common Issues
Health monitoring
Disease onset or treatment effectiveness
Lesson 1412What is Change-Point Detection?
Health standards
Is the average blood pressure of patients in a clinic different from the national average of 120 mmHg?
Lesson 351When to Use a One-Sample t-Test
Health studies
Volunteers may be more health-conscious than average
Lesson 246Volunteer and Self-Selection Bias
Healthcare
Predicting disease outbreaks or personalizing treatment plans
Lesson 6Common Data Science Applications
Heavy grid lines
that compete with your data points
Lesson 1963Removing Chartjunk
Heavy gridlines
Use subtle, minimal guides only when necessary
Lesson 1237Chart Junk and Data-Ink Ratio
Heavy-tailed distributions
(with extreme outliers): even larger samples required
Lesson 220Sample Size Requirements for the CLTLesson 1379Assumptions and Limitations
Hedging
protects against channel-specific risks (platform bans, seasonal dips)
Lesson 1716Channel Mix and Portfolio Thinking
Height
Taller bars (positive or negative) indicate stronger correlation
Lesson 722ACF Plots and Interpretation
Height and Weight
If you're predicting adult weight from height, the intercept represents the predicted weight when height = 0 inches.
Lesson 526When the Intercept Has No Meaning
Heroku
is a general-purpose cloud platform that works with both Streamlit and Dash.
Lesson 1338Deployment and Sharing Dashboards
Hidden randomness
Random processes without fixed seeds produce varying results (Random Seeds)
Lesson 30The Reproducibility Crisis and Solutions
Hidden subgroups
Averaging diverse populations together (Simpson's Paradox territory)
Lesson 1245Misleading Aggregations and Binning
Hide cyclicality
Show only the upswing of a seasonal pattern while ignoring the inevitable downturn
Lesson 1241Cherry-Picking Time Ranges
Hide technical depth strategically
Methodology, statistical tests, and data quality checks belong in appendices or backup slides (lesson 1949).
Lesson 1965Progressive Disclosure Techniques
Hiding data
Adding a dense geom (like `geom_ribbon()`) last can hide points underneath
Lesson 1355Layer Order and Plot Composition
Hiding trade-offs
Your values might mask important considerations others prioritize differently
Lesson 1927Separating Analysis from Advocacy
Hierarchical
Clear parent-child or directed flow relationships
Lesson 1318Network Layout Algorithms
Hierarchical relationships
If you have `city`, `state`, and `country` columns, does "Boston" really belong to "Texas" or "Canada"?
Lesson 1155Consistency Checks Across Fields
High influence
= actually changes the fitted model (unusual X *and* unusual Y given that X)
Lesson 574Influence: Impact on Fitted Model
high leverage
not because their score is unusual, but because their study time is far from the typical range.
Lesson 572Leverage: Distance in X-SpaceLesson 574Influence: Impact on Fitted Model
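For simple linear regression with an intercept, leverage depends only on the X values: hᵢᵢ = 1/n + (xᵢ − x̄)² / Σ(xⱼ − x̄)². A sketch with hypothetical study times:

```python
def leverages(x):
    """Leverage of each observation in a one-predictor regression with intercept."""
    n = len(x)
    mean_x = sum(x) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / ss_x for xi in x]


study_hours = [2, 3, 4, 5, 20]  # hypothetical; the last student studies far more than the rest
h = leverages(study_hours)
print([round(value, 3) for value in h])
```

No outcome values appear anywhere in the calculation, which is why a point can have high leverage even when its score is unremarkable.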
High missingness
(>50%): Variable may be unusable
Lesson 1179Identifying Missing Values Patterns
High noise-to-signal ratio
When errors dominate true patterns, models learn randomness instead of relationships.
Lesson 2124Insufficient or Low-Quality Data
High p-value (≥ α)
The observed frequencies are reasonably close to expected frequencies.
Lesson 420Interpreting Chi-Squared Test Results
High power
to detect non-normality makes it a go-to choice.
Lesson 378Testing Normality: Statistical Tests
High-resolution export
in required format
Lesson 1369Publication-Ready Plot Styling
High-risk, engaged recently
In-product interventions (tooltips, feature prompts) based on usage gaps
Lesson 1676Win-Back and Retention Strategies
High-risk, high-value
Personalized outreach, account manager check-ins, or special loyalty offers
Lesson 1676Win-Back and Retention Strategies
Higher adjusted R-squared
suggests a better balance of fit and simplicity
Lesson 615Comparing Models with Adjusted R-Squared
Higher alpha (0.10)
Like a more-sensitive detector.
Lesson 334Setting Alpha: Choosing Your Significance Level
Higher confidence (e.g., 99%)
More reliable method, but wider intervals
Lesson 267Interpreting Confidence Levels
Higher confidence level
→ Wider margin (you cast a wider net to be more certain)
Lesson 294Margin of Error and Its Components
Higher evidence bar
– Only stronger signals will be deemed "significant"
Lesson 342Alpha Level Trade-offs
Higher variance/standard deviation
= outcomes are more spread out, more unpredictable
Lesson 148Variance and Standard Deviation of Discrete Distributions
Higher λ
means events happen more frequently → shorter waiting times
Lesson 164The Exponential Distribution
Highest Density Interval (HDI)
takes a smarter approach: it finds the *shortest possible* interval that still contains your desired probability mass (say, 95%).
Lesson 1576Highest Density Intervals (HDI)
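One common way to approximate an HDI from posterior samples is to slide a window over the sorted draws and keep the narrowest one that holds the required share of points. The sketch below assumes a unimodal posterior; the function name and the simulated gamma draws are illustrative.

```python
import math
import random


def hdi(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = math.ceil(mass * n)  # number of points the interval must contain
    # Try every window of k consecutive sorted points; keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


rng = random.Random(0)
draws = [rng.gammavariate(2.0, 1.0) for _ in range(10_000)]  # skewed stand-in posterior
low, high = hdi(draws)
print(round(low, 2), round(high, 2))
```

For a skewed distribution like this one, the HDI sits closer to the mode than an equal-tailed interval would and is never wider than it.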
Highlight, don't decorate
Use bold or saturated colors for the 1-3 most important data points you want your audience to notice first.
Lesson 1961Color as Communication Tool
Highly skewed distributions
(like income data): you may need n = 50, 100, or more
Lesson 220Sample Size Requirements for the CLT
Hill function
(also called logistic or S-curve):
Lesson 1740Saturation Curves and Diminishing Returns
Histogram of residuals
Should look roughly bell-shaped
Lesson 449Normality of Residuals
Historical patterns
If a job normally takes 10-15 minutes, alert at 30+ minutes, not 16
Lesson 1858Alerting StrategiesLesson 1878What is Bias in Data?
Historical performance
"This is our best quarter in three years"
Lesson 1962Contextualizing Numbers
Historical snapshots
Copying current product prices into order records preserves what the customer actually paid, even if prices change later.
Lesson 1074Duplicating Data Across Tables
Historical Trends
Show how values change over time.
Lesson 1939Context and Comparison: Making Numbers Meaningful
Holt-Winters exponential smoothing
comes in.
Lesson 765Introduction to Holt-Winters Method
Holt-Winters Multiplicative Model
is designed for time series where seasonal fluctuations change in size as the overall level of the series changes.
Lesson 768Holt-Winters Multiplicative Model
Holt's Method
adds a second equation to track the trend separately.
Lesson 761Double Exponential Smoothing (Holt's Method)
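A sketch of the two equations, assuming the standard additive form of Holt's method: one update for the level and one for the trend. The initialization (level from the first point, trend from the first difference) is one common choice, and the function name is hypothetical.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Forecast `horizon` steps ahead with double exponential smoothing."""
    # Assumed initialization: first observation and first difference.
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        # Equation 1: smooth the level toward the new observation.
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        # Equation 2: smooth the trend toward the latest change in level.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # An h-step forecast extends the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


print(holt_forecast([10, 12, 14, 16], alpha=0.5, beta=0.5, horizon=2))
```

On a perfectly linear series such as this one, the forecasts simply continue the line.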
Horizontal patterns
One cohort performing differently across all periods indicates something unique about that acquisition group
Lesson 1649Visualizing Cohort Data with Heatmaps
Horizontal scaling (scale-out)
means distributing work across multiple machines working in parallel.
Lesson 1767Scale-Up vs Scale-Out Architectures
Horizontal trend line
Good news—variance is roughly constant (homoscedasticity)
Lesson 560Scale-Location Plot (Spread-Location Plot)
Hospital studies
Disease severity and access to care both lead to hospitalization.
Lesson 1473Conditioning on Colliders: Selection Bias
Hourly data
with daily seasonality → period = 24
Lesson 746Choosing Seasonal Period
Hover tooltips
displaying data values when you move your mouse over points
Lesson 1300Creating Basic Interactive Charts with Plotly Express
How many extra parameters
you added (degrees of freedom cost)
Lesson 627The F-Test for Model Comparison
How much better
the full model fits the data (lower RSS—residual sum of squares)
Lesson 627The F-Test for Model Comparison
How strongly
it would need to relate to the treatment (exposure)
Lesson 1434Sensitivity Analysis for Confounding
How to report bugs
Where should users file issues?
Lesson 2083Contributing Guidelines and Contact Information
How to suggest features
Is there a template or discussion forum?
Lesson 2083Contributing Guidelines and Contact Information
HR < 1
Decreased hazard (protective effect).
Lesson 827Hazard Ratios and Interpretation
HubSpot
Weekly active teams using the platform: activation and ongoing engagement signal product-market fit.
Lesson 1606Examples of North Star Metrics by Industry
Hue
is what we typically call "color": red, green, blue, purple, etc.
Lesson 1234Color: Hue, Saturation, and LuminanceLesson 1238Matching Encoding to Data Type
Human-readability
CSV > JSON > Excel > Parquet/Feather
Lesson 1133Performance Considerations Across Formats
Human-readable units
Same units as your original data
Lesson 49Standard Deviation: Interpretable Spread
Hybrid approach
– handling both global and local anomalies in one framework
Lesson 1405What is Seasonal Hybrid ESD?
Hypothesis
The specific change and expected directional effect
Lesson 1485Documentation and Pre-Registration
Hypothesis Testing
Z-scores help us ask "Is this result surprising?"
Lesson 201Z-Score Applications and Limitations
Hypothesize
Based on funnel analysis, identify a bottleneck
Lesson 1692Statistical Significance and Iteration

I

I Chart (Individuals Chart)
Plots each single measurement and tracks whether the process mean is stable.
Lesson 1404Control Charts for Individual Observations
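A sketch of the usual I chart limits, assuming the conventional formula: center line ± 2.66 × the average moving range (2.66 = 3 / 1.128, the constant for moving ranges of two consecutive points). The measurements are hypothetical.

```python
def i_chart_limits(values):
    """Return (lower control limit, center line, upper control limit)."""
    center = sum(values) / len(values)
    # Moving range: absolute difference between consecutive measurements.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    margin = 2.66 * mr_bar
    return center - margin, center, center + margin


measurements = [10, 12, 11, 13, 12]  # hypothetical process readings
lcl, center, ucl = i_chart_limits(measurements)
out_of_control = [m for m in measurements if not lcl <= m <= ucl]
print(round(lcl, 2), round(center, 2), round(ucl, 2), out_of_control)
```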
I-MR charts
(Individual and Moving Range charts) come in.
Lesson 1404Control Charts for Individual Observations
idempotency
so rerunning doesn't corrupt data, **checkpointing** to resume mid-pipeline, and **monitoring/alerts** for quick detection.
Lesson 1825Designing Pipeline ArchitectureLesson 1847What is Idempotency?Lesson 1850Retry StrategiesLesson 1853Partial Failure Recovery
Identify
cells with residuals beyond ±2 (moderate) or ±3 (strong)
Lesson 428Post-Hoc Analysis and Residuals
Identify "flattening"
When curves level off, you've found your core retained users, the ones likely to stick around long-term.
Lesson 1656Visualizing Retention Curves
Identify all relevant periods
hourly (24), daily (7), weekly, etc.
Lesson 1408Handling Multiple Seasonal Periods
Identify all systems
holding that person's data (data lineage helps here!)
Lesson 1909Right to Erasure and Data Retention Policies
Identify backdoor paths
between treatment and outcome
Lesson 1475Using DAGs to Guide Analysis
Identify conflicts
Run `git status` to see which files have conflicts (marked as "both modified")
Lesson 2018Resolving Conflicts During Rebase
Identify core value actions
What behaviors indicate someone is getting value?
Lesson 1693Defining User Engagement
Identify data sources
Where should each variable come from?
Lesson 2098Identifying Data Availability Gaps Early
Identify direct causal relationships
(does X directly cause Y?)
Lesson 1469Building a Simple Causal DAG
Identify direct links
Does your metric directly influence another team's metric?
Lesson 1625Cross-Functional Metric Dependencies
Identify meaningful strata
Divide your population into non-overlapping groups based on important characteristics (age, income, region, education level, etc.)
Lesson 236Stratified Sampling
Identify patterns
A single representative value helps you spot trends over time or differences between categories
Lesson 38What is Central Tendency?
Identify power features
High adoption + high frequency = core value drivers
Lesson 1696Feature Adoption and Usage Frequency
Identify problematic cohorts
that need intervention
Lesson 1672Cohort-Based Churn Analysis
Identify stratification variables
(usually 1-3 key covariates)
Lesson 1489Stratified Randomization Fundamentals
Identify the confounder
(from your previous analysis)
Lesson 1430Controlling for Confounders: Stratification
Identify the pre-rebase state
Look for the entry just before you started the problematic rebase
Lesson 2021Recovering from Rebase Mistakes
Identify the reference distribution
(standard normal for Z, t-distribution for t, etc.)
Lesson 319Calculating P-Values from Test Statistics
Identify Unused Indexes
Query your database's system catalogs to find indexes that are never or rarely used.
Lesson 1086Index Maintenance and Monitoring
Identifying actionable next steps
What should the business *do* differently?
Lesson 2090Stage 6: Interpretation and Insight Generation
Identifying trends
Detect consecutive increases or decreases
Lesson 1023Introduction to Window Functions: LAG and LEAD
Identity
Coefficients are direct additive effects (simplest interpretation).
Lesson 678Choosing the Right Link Function
If it changes direction
The control variable was suppressing the true relationship (a suppressor effect)
Lesson 508Interpreting Partial Correlations
If it remains strong
The relationship between your two variables is genuine, independent of the control variable(s)
Lesson 508Interpreting Partial Correlations
If p-value < α
Reject H₀ (the result is "statistically significant")
Lesson 323What is a Significance Level (α)?
If p-value ≤ α
Reject H₀ (the data are unlikely under the null hypothesis)
Lesson 327Decision Rules: Reject or Fail to RejectLesson 356Making Decisions and Stating Conclusions
If p-value ≥ α
Fail to reject H₀ (insufficient evidence)
Lesson 323What is a Significance Level (α)?
If you reject H₀
You have sufficient evidence to support the alternative hypothesis.
Lesson 404Making Decisions and Drawing Conclusions
Ignore baseline context
Start your chart at an unusual low point to make normal recovery look exceptional
Lesson 1241Cherry-Picking Time Ranges
Ignored anomalies
(revenue drops 15%, but who investigates?)
Lesson 1619What is Metric Ownership?
Ignoring geographic size bias
Large empty regions dominate visually even with low values.
Lesson 1309Choropleth Maps: Basics and Best Practices
Ignoring the base rates
When calculating P(A|B), people forget that the prior probability P(A) matters enormously.
Lesson 100Common Conditional Probability Mistakes
Ignoring the clock
You're two weeks past the deadline chasing marginal improvements while stakeholders have moved on or made decisions without you.
Lesson 2119Signs You're Over-Engineering
Immutable Data Patterns
Rather than updating records in place, append new versions with timestamps or version numbers.
Lesson 1848Designing Idempotent Operations
Imperfect measurement instruments
A broken thermometer that reads 2°C high introduces systematic error.
Lesson 1880Measurement and Label Bias
Implement access controls
that enforce purpose-based restrictions
Lesson 1915Secondary Use and Scope Creep
Implement pagination
(showing results in batches, like "page 1 of 100")
Lesson 877LIMIT: Restricting the Number of Rows Returned
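Pagination is typically built from `LIMIT` plus `OFFSET`, with a stable `ORDER BY` so pages do not overlap. A sketch using `sqlite3`; the table, the `fetch_page` helper, and the page size are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item-{i}",) for i in range(1, 26)],  # 25 rows
)


def fetch_page(conn, page, page_size=10):
    """Return one batch of rows; pages are numbered from 1."""
    offset = (page - 1) * page_size
    return conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()


print(len(fetch_page(conn, 1)), len(fetch_page(conn, 3)))  # full page, then the 5 leftover rows
```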
Implementation bugs
Maybe your randomization code has an off-by-one error or timestamp issues
Lesson 1524Sample Ratio Mismatch (SRM)
Implicit transformations
You depend on data that's already been filtered or aggregated upstream, but that logic changes without notice.
Lesson 2133Undocumented Data Dependencies
Impractical test duration
– If your required sample size is large but your daily traffic is small, you'll need to run the test for weeks or months.
Lesson 1493Why Sample Size Matters in A/B Tests
Improve Data Integrity
When that customer moves, you update one row in one table, not dozens of scattered records.
Lesson 1061Introduction to Normalization
Improve interpretability
Clearer story about what drives your outcome
Lesson 585Remedies: Variable Selection
Improving trends
Later cohorts retain better than earlier ones.
Lesson 1650Comparing Cohorts Over Time
Impute
Replace with mean, median, mode, or modeled values
Lesson 1207Missing Data Assessment and Strategy
IN
typically builds a complete list of values first, then checks membership.
Lesson 985EXISTS vs IN: Performance Considerations
In coordinated sequences
(extract → transform → load, in order)
Lesson 1831What is Job Scheduling?
In final deliverables
(reports, presentations, dashboards): explanation
Lesson 1216Choosing the Right Purpose
In Production
ML systems degrade over time as data distributions shift.
Lesson 2130No Clear Success Metric or Feedback Loop
In-place modifications
aren't supported—methods like `df.
Lesson 1796Limitations and Differences from Pandas
Incapacitated individuals
People with cognitive impairments, dementia, or mental health conditions may not fully comprehend what they're consenting to
Lesson 1918Special Populations and Vulnerable Groups
Include a quick-start section
that gets someone from zero to a working result in under five minutes—this builds confidence and engagement.
Lesson 2080Usage Examples and Running Your Code
Include null results
that show no effect or relationship
Lesson 1929Avoiding Cherry-Picking Results
Includes the row
if there's a match
Lesson 961IN Operator with Subqueries
Inconsistent definitions
across teams (is "active user" last 7 or 30 days?)
Lesson 1619What is Metric Ownership?
Inconsistent formats
User-entered data with typos, duplicates, or conflicting values
Lesson 1762Extended Dimensions: Veracity and Value
Inconsistent standards
If your data collection team changes definitions midway (e.
Lesson 1880Measurement and Label Bias
Incorporate offline touchpoints
Sales calls, conferences, or direct mail that standard models ignore
Lesson 1731Custom Rule-Based Attribution
Incorporates uncertainty
It's a full distribution, not just a point estimate
Lesson 1537The Posterior Distribution
Incorporates uncertainty naturally
The width of your posterior reflects how confident you are
Lesson 1570Comparing Two Means: Bayesian Approach
Incorrect conclusions
that harm decision-making
Lesson 34Recognizing Boundaries of Competence
Increase I/O
transferring massive result sets
Lesson 911Performance Considerations with Multiple Groups
Increased CAC pressure
You need constant acquisition just to maintain size, let alone grow
Lesson 1670What is Churn and Why It Matters
Increases cognitive load
people work harder to extract meaning
Lesson 1963Removing Chartjunk
Increases statistical power
by focusing only on within-pair changes
Lesson 370Differences as the Unit of Analysis
Incremental collaboration
Breaking large features into reviewable chunks while still working
Lesson 2029Draft Pull Requests and WIP Workflows
Incremental efficiency
does adding channel X improve overall LTV:CAC?
Lesson 1716Channel Mix and Portfolio Thinking
Incremental testing
runs initiatives sequentially or uses holdout groups to isolate each team's effect.
Lesson 1640Attribution in Multi-Team Environments
Incrementality correlation
Do the model's channel credits align with incrementality tests (like the control group experiments covered earlier)?
Lesson 1734Comparing and Validating Attribution Models
Independence of Paired Differences
Lesson 374Assumptions of the Paired t-Test
Independence violated
→ Reconsider your analysis approach entirely
Lesson 383Diagnostic Workflow: When to Proceed or Switch Tests
Independent advocates
for incapacitated individuals
Lesson 1918Special Populations and Vulnerable Groups
Independent groups
different subjects in each group, not repeated measures
Lesson 438When to Use One-Way ANOVA
independent observations
and a **sufficiently large sample size** (typically n ≥ 30, though this depends on the population distribution).
Lesson 225CLT for Sums and Other StatisticsLesson 1389What is Grubbs' Test?
Independent samples
come from two different, unrelated groups.
Lesson 360Independent vs. Dependent Samples
Independent variable
time (often just the index: 1, 2, 3, .
Lesson 738Linear Detrending
index
is a separate data structure that the database maintains to help find rows quickly without scanning the entire table.
Lesson 1078What Are Indexes and Why They MatterLesson 1804Index Optimization and Reset Strategies
Index bloat
happens when deleted records leave empty space that isn't automatically reclaimed, making indexes larger than necessary.
Lesson 1086Index Maintenance and Monitoring
Index plots
of residuals to spot specific observation numbers
Lesson 587Identifying Outliers in Regression Context
Index supporting columns
Ensure columns referenced in correlated conditions are indexed
Lesson 969Performance Considerations for SELECT Subqueries
Index usage
Are your indexes still effective?
Lesson 1077Measuring Performance Impact of Denormalization
Index-based selection
is severely limited.
Lesson 1796Limitations and Differences from Pandas
Indexes
Using indexed columns can make certain join orders faster
Lesson 951Join Order and Performance
Individual t-tests
Ask "Does *this specific* predictor add value?"
Lesson 622Relationship Between F-Test and t-Tests
Industry norms
What metrics matter in this field?
Lesson 1168Understanding Domain Context
Industry research
Published studies, competitor analyses, domain blogs
Lesson 1201Domain Knowledge as a Hypothesis Source
Industry standards
"We're 20% above market average"
Lesson 1962Contextualizing Numbers
Inference
When you need trustworthy hypothesis tests and prediction intervals
Lesson 550Normality of ResidualsLesson 1594PyMC: Probabilistic Programming in Python
Inflated standard errors
The uncertainty around coefficient estimates increases dramatically
Lesson 580What is Multicollinearity?
Inflating r
Imagine plotting height vs.
Lesson 481Outliers and Their Impact on r
Influence
is about *actual impact*—how much the regression line would change if you removed that observation.
Lesson 571What Are Leverage and Influence?Lesson 574Influence: Impact on Fitted ModelLesson 2101Identifying and Mapping Stakeholders
Influence vs. Interest Matrix
Plot stakeholders on two axes:
Lesson 2101Identifying and Mapping Stakeholders
Influenced by time trends
Both variables increase over time independently
Lesson 494Spurious Correlations and Coincidence
INFO
Normal operations (job started, file processed)
Lesson 1857Logging Best Practices
Info/Log only
Minor retries succeeded, small delays—for forensic review later
Lesson 1858Alerting Strategies
Informed
Clear, jargon-free explanation of:
Lesson 1912What is Informed Consent in Data Science?
Infrastructure costs
Hosting, computing resources, and database connections
Lesson 1979Maintenance and Sustainability Considerations
Infrastructure debt
Manual processes that should be automated
Lesson 2131What is Technical Debt in Data Science?
Initial belief (prior)
Maybe there's a 20% chance the suspect is guilty based on background.
Lesson 114Sequential Updating
INNER JOIN
is SQL's way of bringing together information from two separate tables based on a relationship between them.
Lesson 918What is an INNER JOIN?Lesson 928LEFT JOIN vs INNER JOIN: When to Use Each
INNER JOIN table2
The table you're joining to (the "right" table)
Lesson 919Basic INNER JOIN Syntax
INNER JOINs
to connect each relevant dimension
Lesson 956Star Schema Joins
Inner query
Sum sales by department
Lesson 973Nested Subqueries in FROM
Inner query alias
(`inner`): identifies columns from the subquery
Lesson 976Basic Correlated Subquery Syntax
Input data context
What data was being processed when it broke?
Lesson 1851Error Logging and Notifications
Input metadata
(file names, row counts, date ranges)
Lesson 1857Logging Best Practices
Input(s)
The component property you're monitoring (e.
Lesson 1335Dash Callbacks: Adding Interactivity
INSERT protection
You cannot add a child record unless the referenced parent exists
Lesson 1052Foreign Key Constraints
INSERT/UPDATE
The database verifies the foreign key value exists in the parent table
Lesson 1060Trade-offs: Performance vs Integrity
Inside the bounds
(between the dashed lines): The autocorrelation is **not statistically significant**—it could easily be random noise
Lesson 723Significance Bounds in ACF Plots
Inspect source data
Go to the original data source.
Lesson 1870Root Cause Analysis for Quality Issues
Inspect your data first
If an integer column only contains values between 0 and 100, you don't need `int64`—`int8` (range: -128 to 127) suffices.
Lesson 1799Optimal Data Types and Downcasting
Installation Instructions
Step-by-step commands to set up the environment
Lesson 2077The Purpose and Anatomy of a Good README
Institutional review
(like ethics boards) before data collection
Lesson 1918Special Populations and Vulnerable Groups
Instrumentation Issues
Logging errors, tracking bugs, or data pipeline problems often surface during A/A tests
Lesson 1483Pre-Experiment Validation
Insurance claims
Total claim amounts in a period
Lesson 181Gamma Distribution: Shape and Rate Parameters
INTEGER
or **INT**: Whole numbers (e.
Lesson 846Tables, Schemas, and Data Types
Integer division
Some databases may truncate decimal places if the column is an integer type
Lesson 884AVG: Computing Averages
Integrates segments
into downstream workflows like marketing automation, pricing engines, or customer support tools
Lesson 1710Operationalizing Segments: Scoring and Deployment
Integrity and Confidentiality
Lesson 1905Core Principles of GDPR
Intent matters
Ask yourself: "Am I creating this visualization to inform or to persuade dishonestly?"
Lesson 1247The Ethics of Visualization Design
Intent-to-Treat (ITT)
means you analyze every participant in the group they were *originally randomized to*, regardless of what they actually did.
Lesson 1439Intent-to-Treat AnalysisLesson 1748Intent-to-Treat Analysis
Interaction analysis
(do color and size amplify each other's effects?)
Lesson 1482Control and Treatment Design
Interaction effects analysis
(Lesson 1195) shows whether variables work together or independently
Lesson 1197Identifying Variable Importance and Redundancy
Interaction is available
(rotation helps overcome perspective distortion)
Lesson 1323Introduction to 3D Plotting in Matplotlib
Interaction plots
make these non-additive effects visible at a glance.
Lesson 466Visualizing Interactions
interaction term
represents a relationship where the effect of one predictor on your outcome variable *depends on* the level or value of another predictor.
Lesson 648What are Interaction Terms?Lesson 653Interpreting Categorical × Categorical InteractionsLesson 1455DiD with Regression
Interactive 2D plots
Let users filter and explore without perspective distortion
Lesson 1329Effective Use and Pitfalls of 3D Visualizations
Interactive dashboards are primary
While R has Shiny, Python's Streamlit and Dash often integrate more naturally into broader Python ecosystems.
Lesson 1375Choosing Tools: When to Use R vs Python for Visualization
Interactive zoom
Let users explore crowded areas at different scales
Lesson 1310Point Maps and Scatter Plots on Maps
Interest
How much do they care about the outcome?
Lesson 2101Identifying and Mapping Stakeholders
Interleave explanation with code
Write markdown cells that introduce your analysis approach, then show the actual code that implements it
Lesson 1982Literate Programming with Notebooks
Intermediate outputs
Cleaned datasets, feature engineering results
Lesson 2065Tracking Data Lineage
Internal databases
Your organization's own records (sales, customer info, logs)
Lesson 11Data Collection and Acquisition
Internal first
Alert your organization's leadership and legal/ethics teams
Lesson 1925Mitigation Strategies and Responsible Disclosure
Internal validity
asks: *Are the results truly caused by what you think caused them?*
Lesson 1441Internal vs External Validity
Interpretation cells
Discuss what results mean (markdown referencing outputs above)
Lesson 1982Literate Programming with Notebooks
Interpretation guideline
context (small/medium/large, or domain-specific benchmarks)
Lesson 389Reporting Effect Sizes in Practice
Interpreting the condition number
Lesson 583Condition Number and Eigenvalues
Interpreting variability
Standard deviation and variance assume certain shapes.
Lesson 63Understanding Distribution Shape
Intersectionality
recognizes that a Black woman's experience isn't just "being Black" plus "being a woman"—it's a unique combined experience.
Lesson 1893Intersectionality in Fairness
Interval censoring
means you know the event occurred within a specific time window, but not the precise moment.
Lesson 805Left and Interval Censoring
Interval data
(numeric with no true zero: temperature in Celsius, dates) suits:
Lesson 1238Matching Encoding to Data Type
Interval/ratio data
Meaningful numeric measurements
Lesson 398Choosing Between Parametric and Non-Parametric Tests
Introduces scope creep
that derails core objectives
Lesson 2107Saying No and Pushing Back Constructively
Introduction
Problem statement, objectives, context
Lesson 1966Report Structure and Executive Summary
Introduction cells
State the question and context (markdown)
Lesson 1982Literate Programming with Notebooks
Intuitive interpretation
The Beta parameters have natural meanings:
Lesson 1551Beta-Binomial Conjugacy
Invalid inference
Hypothesis tests and confidence intervals are incorrect
Lesson 734Why Differencing and Detrending Matter
Invalid statistical inference
Standard errors, confidence intervals, and hypothesis tests become meaningless because they assume stability that isn't there.
Lesson 713Why Stationarity Matters
Inventory turnover ratio
measures how many times you sell and replace stock annually:
Lesson 1634Retail Metrics: Same-Store Sales and Inventory Turnover
Inverse-Gamma
part models uncertainty about σ²
Lesson 1568Unknown Variance: Normal-Inverse-Gamma Model
Invest in data collection
if possible, but accept that sometimes you need to deliver value *now* with what you have.
Lesson 2124Insufficient or Low-Quality Data
Investment advice
based only on winning stocks ignores all the losers that went to zero
Lesson 247Survivorship Bias
Involuntary churn
occurs without customer intent—usually from failed payments, expired credit cards, or technical issues.
Lesson 1670What is Churn and Why It MattersLesson 1671Churn Rate Calculation Methods
IoT sensor data
Temperature, energy consumption, or manufacturing metrics with predictable rhythms
Lesson 1411Applications and Limitations
IQR
(Interquartile Range) shines.
Lesson 54When to Use Each Measure
IQR method
makes no such assumption—it relies on quartiles and is robust to skewed or non-normal distributions.
Lesson 1386IQR Method vs Z-Score: When to Use Each
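The IQR method above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the lesson's own code: the function name `iqr_outliers`, the sample values, and the conventional 1.5 multiplier are assumptions, and `statistics.quantiles` uses its default "exclusive" quartile convention.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default "exclusive" method
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample with one extreme value
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
flagged = iqr_outliers(sample)
print(flagged)
```

Because the fences come from quartiles rather than the mean and standard deviation, the extreme value does not inflate its own threshold.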
IQR methods
give you rules of thumb for flagging outliers, Grubbs' Test takes a more rigorous approach.
Lesson 1389What is Grubbs' Test?
Irreducibility
The chain can eventually reach any state from any other state
Lesson 1589Markov Chains: The Foundation of MCMC
Irregular
components, you face a fundamental choice: do these pieces combine by adding or by multiplying?
Lesson 710Additive vs Multiplicative ModelsLesson 744Classical Decomposition Methods
Irreversibility matters
Decisions are costly to reverse
Lesson 1522Balancing Speed and Accuracy in Metric Selection
Isolate Seasonality (S)
Average the detrended values for each season (e.
Lesson 744Classical Decomposition Methods
Isolate the problem
Test from different machines or networks to rule out local issues
Lesson 1093Troubleshooting Connection Issues
Isolates brand impact
The PSA has no commercial intent, so any lift from your real ad is truly incremental
Lesson 1747Ghost Ads and PSA Tests
Isolating the treatment effect
Differences in outcomes are more likely due to treatment, not pre-existing differences
Lesson 1445The Matching Framework
Isolation
Concurrent transactions don't interfere with each other
Lesson 1110What Are Database Transactions?
Issue tracker
Direct link to GitHub Issues or your bug tracking system
Lesson 2083Contributing Guidelines and Contact Information
It slows down
every new feature or improvement
Lesson 2132Pipeline Glue Code and Complexity Creep
It's poorly documented
("I'll remember what this does")
Lesson 2132Pipeline Glue Code and Complexity Creep
It's tightly coupled
to specific data formats or versions
Lesson 2132Pipeline Glue Code and Complexity Creep
Iteration
means deliberately refining your approach based on what you learned—testing a new feature, adjusting model complexity, or exploring a different angle after stakeholder feedback.
Lesson 2112Iteration vs Rework: Learning from Each CycleLesson 2142Interviewing: Technical and Behavioral Prep
Iteration is critical
You need to test dozens of variants quickly
Lesson 1522Balancing Speed and Accuracy in Metric Selection
Iterative algorithms
Machine learning models that require hundreds of passes over the data
Lesson 1784Computation Complexity: Beyond Data Size

J

Jarque-Bera test
takes a unique approach: it specifically looks at two shape characteristics—**skewness** and **kurtosis**—and combines them into a single test statistic.
Lesson 208Jarque-Bera Test
Jitter
them (each person shifts slightly so all faces are visible)
Lesson 1353Position Adjustments: Dodge, Stack, and Jitter
Jittering
Slightly randomize positions (when exact location isn't critical)
Lesson 1310Point Maps and Scatter Plots on Maps
Job performance studies
Both competence and charisma can lead to promotion.
Lesson 1473Conditioning on Colliders: Selection Bias
Joining tables
Multiple tables might have overlapping column names, causing confusion
Lesson 851Selecting All Columns with Asterisk
JSON
handles nested data well but creates significant memory overhead with all its bracket and quote characters.
Lesson 1133Performance Considerations Across FormatsLesson 1779Reading and Writing Data in SparkLesson 2072Configuration Files vs Hard-Coded Values
Just right
Reveals the true distribution shape clearly
Lesson 1267Histograms and Distribution Plots

K

K-anonymity
ensures that each record is indistinguishable from at least *k-1* other records when considering quasi-identifiers (age, ZIP code, gender).
Lesson 1895Data Anonymization BasicsLesson 1896K-AnonymityLesson 1897L-Diversity and T-ClosenessLesson 1911GDPR Compliance for Data Scientists
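A K-anonymity check amounts to counting how often each combination of quasi-identifiers appears. The sketch below assumes records are plain dictionaries that have already been generalized; the helper name `is_k_anonymous` and the sample rows are hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(
        tuple(record[col] for col in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical, already-generalized records
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "100**", "diagnosis": "diabetes"},
]
two_anonymous = is_k_anonymous(records, ["age", "zip"], k=2)
three_anonymous = is_k_anonymous(records, ["age", "zip"], k=3)
print(two_anonymous, three_anonymous)
```

Each (age, zip) group here holds two records, so the data passes for k = 2 and fails for k = 3.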
Kaplan-Meier
to estimate response probability curves by segment
Lesson 841Campaign Response Time Analysis
Kaplan-Meier estimator
, you can plot conversion curves that account for censoring (prospects still "alive" but not yet converted).
Lesson 839Time-to-Conversion in Marketing Funnels
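To make the Kaplan-Meier idea concrete, here is a small pure-Python sketch of the product-limit estimate. It is an illustration under stated assumptions: the function name, the sample durations, and the censoring flags are hypothetical, and a production analysis would normally rely on a dedicated survival-analysis library.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time until the event or until censoring
    observed:  True if the event happened, False if the subject was censored
    """
    survival = 1.0
    curve = []
    for t in sorted({d for d, e in zip(durations, observed) if e}):
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical days-to-conversion; False means not yet converted (censored)
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
print(curve)
```

In a marketing funnel, "survival" is the share of prospects who have not yet converted, so the conversion curve is one minus each survival value. Censored prospects still count in the at-risk set until the time they drop out of observation.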
KDE (Kernel Density Estimate)
adds a smooth curve that estimates the underlying probability distribution, helping you see trends the blocky bins might obscure.
Lesson 1267Histograms and Distribution Plots
Keep conditions simple
Complex nested CASE statements are hard to maintain and slower to execute.
Lesson 1037CASE Best Practices and Performance
Keep CTEs focused
Each CTE should represent one logical step
Lesson 997CTE Best Practices and Performance
Keep it simple
Single-column integer keys perform best for joins and indexing
Lesson 1050Choosing Effective Primary KeysLesson 1679Defining Funnel Steps and Events
Keep separate
Analyze complete vs incomplete groups
Lesson 1207Missing Data Assessment and Strategy
Kendall
when you have outliers, skewed distributions, or ordinal (ranked) data.
Lesson 1184Correlation Coefficients in Bivariate Analysis
Kendall correlation
also uses ranks but counts how often pairs of observations agree in their ordering.
Lesson 1184Correlation Coefficients in Bivariate Analysis
Kendall's Tau
counts *concordant and discordant pairs*—comparing every possible pair of observations to see if they agree in direction.
Lesson 490Kendall's Tau vs Spearman's Rho
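The pair-counting idea behind Kendall's Tau can be shown directly. This sketch computes the simple tau-a variant, which applies no tie correction; the function name and the sample data are illustrative assumptions.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move the same way
        elif direction < 0:
            discordant += 1      # the variables move in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Six pairs in total: five agree in direction, one disagrees
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```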
Kernel Density Estimation
is the mathematical technique behind these visualizations.
Lesson 1312Heatmaps and Density Maps for Spatial Data
Kernel Density Estimation (KDE)
is a technique that creates a smooth curve approximating your data's probability distribution.
Lesson 1177Density Plots and KDE
Kernel Density Plots
(or density curves) smooth out the histogram into a continuous curve.
Lesson 203Visual Assessment: Histograms and Density Plots
Kernel Matching
uses a weighted average of *all* control units, with weights based on distance from each treated unit's propensity score.
Lesson 1448Propensity Score Matching Methods
Key advantage
Simple correction; coefficients remain interpretable as log-rate ratios.
Lesson 694Quasi-Poisson and Negative Binomial Models
Key characteristic
Unlike the sampling distribution of the mean (which becomes normal thanks to the CLT), the sample variance, scaled as (n − 1)s²/σ², follows a **chi-squared distribution** with n − 1 degrees of freedom when the population is normal.
Lesson 254Sampling Distribution of the Sample Variance
Key findings
(2-3 bullet points with numbers)
Lesson 1966Report Structure and Executive Summary
Key lesson
Always ask "what else might explain this pattern?"
Lesson 1426Real-World Examples: Correlation vs Causation
Key Results
are 2-5 specific, measurable outcomes that define *how* you'll know you've succeeded.
Lesson 1607Introduction to OKRs (Objectives and Key Results)
Kill the jargon
Replace technical variable names like "churn_propensity_score_v2" with "Customer Risk Level."
Lesson 1958Simplifying Visual Complexity
Know your exit options
Sometimes you must escalate to leadership or, in extreme cases, consider **responsible disclosure** or changing roles.
Lesson 1931When to Push Back on Requests
Knowledge spreads
Reviewers learn about changes they didn't write; contributors get feedback that improves their skills
Lesson 2022Understanding Pull Requests
Known Unknowns
"We don't yet know if historical patterns hold post-merger"
Lesson 2100Documenting Assumptions and Open Questions
Known variance structure
The variance function follows directly from the exponential family form
Lesson 670Why Exponential Family Matters for GLMs
Kolmogorov-Smirnov
to check if numeric distributions match theoretical ones (like normal distribution).
Lesson 1208Distribution Checks for All Variables
Kolmogorov-Smirnov (K-S) test
takes a slightly different approach.
Lesson 206Kolmogorov-Smirnov Test
KPSS fails to reject
(high p-value) → Evidence *for* stationarity
Lesson 717KPSS Test
KPSS rejects
(low p-value) → Evidence *against* stationarity
Lesson 717KPSS Test
Kruskal-Wallis test
The non-parametric cousin of one-way ANOVA, comparing groups using ranks (it tests medians only when the group distributions share the same shape)
Lesson 470When Parametric ANOVA Assumptions Fail
Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test
is a statistical test for stationarity that works *opposite* to the Augmented Dickey-Fuller test you just learned.
Lesson 717KPSS Test

L

L well-represented values
for sensitive attributes.
Lesson 1897L-Diversity and T-Closeness
L-Diversity
addresses this by requiring that each equivalence class (group of indistinguishable records) contains at least **L well-represented values** for sensitive attributes.
Lesson 1897L-Diversity and T-Closeness
L(θ | data)
or **P(data | θ)**
Lesson 1535The Likelihood Function
LA's coefficient
(say, -5): means LA is 5 units lower than Boston
Lesson 643Interpreting Coefficients Relative to Reference
Label bias
happens when subjective human judgment creates inconsistent or skewed labels for supervised learning.
Lesson 1880Measurement and Label Bias
Labels
Add text to critical nodes.
Lesson 1319Styling Network Visualizations
Labels + Color
Add direct text labels or annotations to clarify what colors represent
Lesson 1251Avoiding Reliance on Color Alone
Labels and Titles
Use `set_xlabel()`, `set_ylabel()`, and `set_title()` to give context.
Lesson 1270Customizing Axes: Labels, Limits, and Scales
LAG
and **LEAD** window functions make this trivial by letting you "peek" at other rows directly.
Lesson 1023Introduction to Window Functions: LAG and LEAD
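A short way to see LAG and LEAD "peek" at neighboring rows is to run them against an in-memory SQLite database from Python. The table `daily_sales` and its values are hypothetical, and the sketch assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [(1, 100), (2, 120), (3, 90)],
)

rows = conn.execute(
    """
    SELECT day,
           revenue,
           LAG(revenue)  OVER (ORDER BY day) AS previous_revenue,
           LEAD(revenue) OVER (ORDER BY day) AS next_revenue
    FROM daily_sales
    ORDER BY day
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The first row has no previous value and the last row has no next value, so those positions come back as NULL (`None` in Python).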
Lag 1
Correlation between consecutive observations (today vs.
Lesson 719What is Autocorrelation?Lesson 720The Autocorrelation Function (ACF)
Lag 2
Correlation between observations two steps apart (today vs.
Lesson 719What is Autocorrelation?Lesson 720The Autocorrelation Function (ACF)
Lag k
Correlation at any time lag *k*
Lesson 719What is Autocorrelation?
Lagging
Ultimate business metrics (revenue per user, customer lifetime value)
Lesson 1601Balancing Leading and Lagging Metrics
Lagging indicators
are outcome-focused metrics that tell you *what has already happened*.
Lesson 1597What Are Leading and Lagging Indicators?
Lagging metrics provide accountability
They're your scoreboard, measuring final outcomes.
Lesson 1601Balancing Leading and Lagging Metrics
Language-agnostic
The same DataFrame operations work in Python, Scala, R, and Java
Lesson 1778DataFrames and Spark SQL Basics
Laplace mechanism
adds noise drawn from a Laplace distribution.
Lesson 1899Adding Noise for Privacy
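A rough sketch of the Laplace mechanism follows, using only the standard library. It assumes a counting query with sensitivity 1 and sets the noise scale to sensitivity / ε; the function names, the true count of 1000, and the seeded generator are illustrative assumptions, and real deployments should use a vetted differential-privacy library.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return the count plus Laplace noise with scale = sensitivity / epsilon."""
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)  # seeded only to make the sketch repeatable
noisy = [private_count(1000, epsilon=0.5, rng=rng) for _ in range(10_000)]
average = sum(noisy) / len(noisy)
print(round(average, 2))
```

Each individual release is perturbed, but the noise is centered on zero, so the average over many draws stays close to the true count. A smaller ε widens the noise and strengthens privacy.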
Large Cook's Distance
Point substantially changes the regression line
Lesson 578Visualizing Leverage and Influence
Large data files
store these separately (cloud storage, data lakes)
Lesson 1996The .gitignore File
Large datasets
where finding exact twins is feasible
Lesson 1446Exact Matching
Large magnitude values
(typically |residual| > 2 or 3): potential outliers or poorly-fit observations
Lesson 701Deviance Residuals
Large p-value (≥ 0.05)
Weak evidence.
Lesson 619Interpreting the F-Statistic and P-Value
Large p-value (e.g., 0.40)
Your data isn't unusual at all under H₀.
Lesson 318What is a P-Value?
large sample sizes
, even tiny, meaningless slopes become statistically significant.
Lesson 529Practical vs Statistical SignificanceLesson 1386IQR Method vs Z-Score: When to Use Each
large samples
(n > 5000): Tests often reject normality due to trivial deviations that won't affect your downstream analyses.
Lesson 209Sample Size Considerations in Normality TestsLesson 550Normality of Residuals
Large tables
Retrieving unnecessary columns wastes bandwidth and memory
Lesson 851Selecting All Columns with Asterisk
Larger sample size
→ Narrower margin (more data gives better estimates)
Lesson 294Margin of Error and Its Components
larger sample sizes
and is particularly sensitive to differences in the middle of the distribution.
Lesson 206Kolmogorov-Smirnov TestLesson 630Bayesian Information Criterion (BIC)Lesson 1482Control and Treatment Design
Larger ε
= weaker privacy (less noise, more accuracy)
Lesson 1898Differential Privacy Fundamentals
Last Non-Direct Click Attribution
credits the last touchpoint in the customer journey *before* conversion, **excluding any direct traffic**.
Lesson 1722Last Non-Direct Click Attribution
Last touch
(30%) – The final interaction before purchase
Lesson 1730W-Shaped Attribution Model
Last touch matters
Something convinced them to finally convert
Lesson 1729Position-Based (U-Shaped) Attribution
Latency matters
Fraud detection must happen in seconds, not overnight
Lesson 1788Streaming Data and Real-Time Requirements
Latitude
measures north-south position from the equator (0°) to the poles (±90°).
Lesson 1308Geographic Data Types and Coordinate Systems
Law of Total Probability
lets you split the sample space into separate, non-overlapping scenarios (a partition), calculate the probability within each scenario, and then add them up to get your answer.
Lesson 90The Law of Total ProbabilityLesson 97Law of Total Probability
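As a worked illustration of the Law of Total Probability, the sketch below sums P(defect | machine) × P(machine) over a partition of three machines. The machines and all probabilities are made-up numbers chosen for the example.

```python
# Hypothetical partition: three machines that together produce all parts.
# Each entry is (P(machine), P(defect | machine)).
partition = {
    "machine_a": (0.5, 0.01),
    "machine_b": (0.3, 0.02),
    "machine_c": (0.2, 0.05),
}

# The scenario probabilities must cover the whole sample space.
assert abs(sum(p for p, _ in partition.values()) - 1.0) < 1e-12

# P(defect) = sum over scenarios of P(defect | scenario) * P(scenario)
p_defect = sum(p_scenario * p_given for p_scenario, p_given in partition.values())
print(round(p_defect, 3))
```

The three terms are 0.005, 0.006, and 0.010, which add up to an overall defect probability of 0.021.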
Lawful basis for processing
You can't just collect data because it's convenient—you need explicit consent or a legitimate legal reason
Lesson 1904What is GDPR and Why It Matters
Lawfulness, Fairness, and Transparency
Lesson 1905Core Principles of GDPR
Layer in supporting evidence
Once the headline lands, show the *why*—perhaps a single clear visualization with strong annotation (lessons 1960-1961).
Lesson 1965Progressive Disclosure Techniques
Lazy loading
means only computing what's currently visible, deferring expensive operations until absolutely necessary.
Lesson 1337Dashboard Performance and Caching
LEAD
window functions make this trivial by letting you "peek" at other rows directly.
Lesson 1023Introduction to Window Functions: LAG and LEAD
Lead conversion
(30%) – The moment a prospect becomes a qualified lead (e.
Lesson 1730W-Shaped Attribution Model
Lead generation
Form submission or demo request
Lesson 1686Defining Conversions and Conversion Rate
Lead with findings
Put results before methodology when possible
Lesson 1967Writing Clear and Concise Analysis Sections
Leading
Surrogate metrics (click-through rate, engagement time, sign-up rate)
Lesson 1601Balancing Leading and Lagging Metrics
Leading indicators
are predictive, forward-looking metrics that signal *what is likely to happen in the future*.
Lesson 1597What Are Leading and Lagging Indicators?
Leading indicators of disengagement
are behavioral signals that precede actual churn—like smoke before fire.
Lesson 1700Leading Indicators of Disengagement
Learn & Repeat
Use insights to generate next hypothesis
Lesson 1692Statistical Significance and Iteration
Learn strategically
by identifying skill gaps
Lesson 34Recognizing Boundaries of Competence
Learning the structure
Analyze the original data's distributions, correlations, and statistical properties
Lesson 1901Synthetic Data Generation
least squares criterion
says: choose the line that minimizes the **sum of squared residuals**.
Lesson 517The Least Squares CriterionLesson 518Deriving the Least Squares Estimators
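The least squares criterion can be checked numerically for a simple one-predictor line. The sketch below uses the standard closed-form estimators (slope = Sxy / Sxx, intercept = ȳ − slope · x̄); the function names and data points are illustrative assumptions.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data roughly following y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares_line(x, y)
best = sum_squared_residuals(x, y, b0, b1)
nudged = sum_squared_residuals(x, y, b0, b1 + 0.1)
print(best < nudged)
```

Moving the slope away from the fitted value, in either direction, increases the sum of squared residuals, which is exactly what the criterion says the fitted line avoids.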
Left censoring
occurs when you know an event *has already occurred* before your observation period began, but you don't know exactly when.
Lesson 805Left and Interval Censoring
LEFT JOIN
Returns **all** rows from the left table, plus matching rows from the right (or NULL if no match)
Lesson 928LEFT JOIN vs INNER JOIN: When to Use EachLesson 936FULL OUTER JOIN SyntaxLesson 946Self-Joins for Hierarchical Data
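The difference between LEFT JOIN and INNER JOIN is easiest to see on a customer with no orders. This sketch runs both joins against an in-memory SQLite database; the `customers` and `orders` tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 50);
    """
)

# Same query, with only the join type swapped in
query = """
    SELECT c.name, o.total
    FROM customers AS c
    {join} orders AS o ON o.customer_id = c.id
    ORDER BY c.id
"""
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
conn.close()

print(left_rows)
print(inner_rows)
```

The LEFT JOIN keeps Ben with a NULL total (`None` in Python), while the INNER JOIN drops him because he has no matching order.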
Left pane
Your changes (current branch)
Lesson 2019Using Diff Tools for Conflict Resolution
left to right
(though the optimizer may reorder them internally).
Lesson 950Chaining Multiple JoinsLesson 952Mixing Join Types
Left-only rows
Right-side columns are NULL
Lesson 937Identifying Matched vs Unmatched Rows
Left-skewed (negative skew)
A long tail to the left; most values cluster high (e.
Lesson 1175Histograms for Distribution Shape
Legacy systems
Some older SQL environments don't support CTEs
Lesson 974When to Use FROM Subqueries vs CTEs
Legal compliance
– Are you following laws like GDPR, CCPA, or HIPAA that govern data use in different regions and industries?
Lesson 36Responsible Data Sourcing and UseLesson 2062Why Data Source Documentation Matters
Legend interactivity
to show/hide data series by clicking
Lesson 1300Creating Basic Interactive Charts with Plotly Express
Legends
identify what different visual elements represent—especially crucial when you have multiple lines, colors, or groups.
Lesson 1271Adding Legends, Annotations, and Text
Lends
an available connection when your code requests one
Lesson 1092Connection Pooling Basics
Length of stay (LOS)
tracks average days hospitalized.
Lesson 1633Healthcare Metrics: Patient Outcomes and Operational Efficiency
Lengthening Time-to-Return
The gap between visits grows longer.
Lesson 1700Leading Indicators of Disengagement
Leptokurtic
(kurtosis > 3 or excess kurtosis > 0): Heavy tails and a sharp peak.
Lesson 66Kurtosis: Definition and Interpretation
Less SQL boilerplate
You write Python code, not SQL strings
Lesson 1117What is an ORM and Why Use It?
Less typing
You save keystrokes, reducing errors and speeding up query writing.
Lesson 924Using Table Aliases in Joins
Lesson 804
, you learned about right censoring (when someone drops out before the event happens).
Lesson 805Left and Interval Censoring
Let supporting details orbit
around these three points, but never introduce a fourth major message
Lesson 1940The Rule of Three in Data Storytelling
Level shifts
Sudden jumps to a new baseline that persists
Lesson 715Visual Tests for Stationarity
Levene's test
or the **F-test** (covered in lesson 363), or simply inspect side-by-side boxplots.
Lesson 379The Assumption of Equal Variances (Homoscedasticity)Lesson 380Testing Equal Variances: Levene's and Bartlett's Tests
Leverage
refers to an observation's *position* in the predictor space—specifically, how far its X-value is from the mean of all X-values.
Lesson 571What Are Leverage and Influence?Lesson 573Calculating and Interpreting Hat ValuesLesson 574Influence: Impact on Fitted ModelLesson 575Cook's Distance
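For simple linear regression with one predictor, leverage has a closed form: h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The sketch below computes it for a hypothetical set of X-values; the function name and the data are assumptions made for illustration.

```python
def hat_values(x):
    """Leverage h_i = 1/n + (x_i - mean)^2 / sum((x_j - mean)^2), one predictor."""
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 20]  # hypothetical predictor values; 20 sits far from the rest
h = hat_values(x)
print([round(v, 3) for v in h])
```

The point at X = 20 gets by far the largest hat value, and the values sum to 2, the number of fitted parameters (intercept and slope).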
Leverage associations
Use culturally familiar color meanings: red for danger/stop/negative, green for go/positive, blue for neutral/calm.
Lesson 1961Color as Communication Tool
License
The legal terms under which you can use and share the data.
Lesson 2063Essential Metadata to Capture
Lightweight artifacts
Small CSVs of feature importance, confusion matrices, or performance metrics belong in version control
Lesson 2034Committing Data Artifacts and Model Outputs
Likelihood P(Evidence | Guilty)
probability of seeing this evidence if guilty
Lesson 112Legal Evidence and Jury Reasoning
Likelihood Ratio Test
compares two **nested models**—where one model (the simpler one) is a special case of the other (the more complex one).
Lesson 699The Likelihood Ratio TestLesson 791Comparing Nested and Non-Nested ModelsLesson 830Testing Coefficient Significance
likelihood ratio test (LRT)
compares two nested models by examining how well each explains the data.
Lesson 628Likelihood Ratio TestsLesson 684Likelihood Ratio Tests for Model Comparison
Likelihood: P(B|A)
How probable the evidence B is *if* A is true
Lesson 107Bayes' Theorem Formula and Components
Limit CTE reuse
If you reference a CTE many times, consider a temp table instead
Lesson 997CTE Best Practices and Performance
Limit result set size
Filter rows in the outer query before applying expensive subqueries
Lesson 969Performance Considerations for SELECT Subqueries
Limit your palette
Too many colors create cognitive overload—your audience spends mental energy decoding the legend instead of understanding your insight.
Lesson 1961Color as Communication Tool
Limitations and confidence levels
honesty builds trust
Lesson 2091Stage 7: Communication and Handoff
Limitations and uncertainties
Where might the model fail?
Lesson 1917Transparency in Analysis and Models
Limited control
You can't influence outcomes that already crystallized
Lesson 1617The Danger of Lagging-Only Metrics
Limited flexibility
Your prior beliefs must fit the conjugate family's shape, even if reality suggests otherwise
Lesson 1555Advantages and Limitations of Conjugate Priors
Limits
the total number of concurrent connections to prevent overwhelming the database
Lesson 1092Connection Pooling Basics
Limits and breaks
controlling what range displays and where tick marks appear
Lesson 1344Scales and Coordinate Systems
Line charts
Showing trends over time (monthly revenue, daily user counts)
Lesson 1959Choosing Familiar Chart Types
Line style + Color
Vary dashed, dotted, and solid lines in addition to color
Lesson 1251Avoiding Reliance on Color Alone
Line Styles
control how lines appear:
Lesson 1272Colors, Markers, and Line Styles
Line type
`linetype` — solid, dashed, dotted
Lesson 1341Data and Aesthetic Mappings
lineage
information, so if a partition fails, it can rebuild just that piece—providing fault tolerance without constant replication overhead.
Lesson 1774What is Apache Spark and Why Use It?Lesson 1871Why Version Control for Data?
Linear decay
Attribution drops steadily over time (e.
Lesson 1639Time Windows and Attribution Decay
Linear interpolation
draws an imaginary line between the 7th and 8th values and picks the point halfway between them.
Lesson 58Calculating Percentiles: Methods and Algorithms
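One common way to implement linear interpolation is sketched below. It assumes the convention where the p-th percentile sits at fractional position `p/100 * (n - 1)` in the sorted data; other conventions exist and give slightly different answers. The function name is hypothetical.

```python
def percentile_linear(values, p):
    """Percentile via linear interpolation between the two nearest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    data = sorted(values)
    rank = (p / 100) * (len(data) - 1)      # fractional position in sorted data
    lower = int(rank)                        # index of the value just below
    upper = min(lower + 1, len(data) - 1)    # index of the value just above
    fraction = rank - lower                  # how far to move between them
    return data[lower] + fraction * (data[upper] - data[lower])


print(percentile_linear([1, 2, 3, 4], 50))
```

When the fractional position falls exactly halfway between two values, the result is their midpoint, which matches the "halfway" description above.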
Linear pattern
Points should roughly follow a straight line, not a curve
Lesson 480Scatterplots and Visual Assessment
Linear scalability
Add more nodes, get proportionally more capacity.
Lesson 1771Shared-Nothing Architecture
Linearity assumption
Are patterns randomly scattered, or do residuals show curves that suggest a non-linear relationship?
Lesson 544The Role of Residuals in DiagnosticsLesson 547Linearity: The Relationship Must Be LinearLesson 558Identifying Non-Linearity in Residual Plots
Lines
(`geom_line`) connecting observations in sequence
Lesson 1342Geometric Objects (geoms)
Linestyle
controls whether your line is solid, dashed, dotted, or dash-dotted.
Lesson 1258Customizing Lines: Colors, Styles, and Markers
link function
transforms the expected value of your response variable so it can be modeled with a linear predictor.
Lesson 671What is a Link Function?Lesson 672The Identity LinkLesson 690The Poisson Distribution as a GLM
Link to business costs
"Type I vs Type II errors" becomes "cost of investigating false alarms vs cost of missing real problems"
Lesson 2105Translating Between Technical and Business Language
Linked selections
Selecting points in one plot highlights them in others
Lesson 1304Subplots and Linked Interactions
List all possible outcomes
and their probabilities
Lesson 152Decision Making Under Uncertainty
List relevant variables
in your research question
Lesson 1469Building a Simple Causal DAG
List required features
What specific variables does your analysis need?
Lesson 2098Identifying Data Availability Gaps Early
Live presentations
Keep slides sparse.
Lesson 1957Adapting Delivery Format: Live vs Async
Ljung-Box test
is a formal hypothesis test that checks whether residuals show significant autocorrelation at multiple lags at once.
Lesson 783Ljung-Box Test for Residual IndependenceLesson 799Fitting and Diagnosing SARIMA Models
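The statistic behind the test can be sketched as follows, assuming the standard form Q = n(n + 2) Σ ρ_k² / (n − k) summed over lags 1 to h. The helper names are illustrative; in practice a statistics library would also supply the p-value.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numer / denom


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic pooling squared autocorrelations up to max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )


# A steadily trending series is strongly autocorrelated, so Q is large.
q = ljung_box_q(list(range(50)), max_lag=5)
print(f"Q = {q:.1f}")
```

Q is compared against a chi-squared distribution. With 5 lags and no fitted parameters, the 5% critical value is about 11.07. When the residuals come from a fitted ARMA model, the degrees of freedom are usually reduced by the number of estimated parameters.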
Load
only clean, aggregated, ready-to-query data into the warehouse
Lesson 1817Historical Context: Why ETL Came First
Load balancing
assigning work so no worker sits idle while others are overloaded
Lesson 1769Task Parallelism and Work Distribution
Load raw data
into your cloud warehouse for most sources
Lesson 1821Hybrid Approaches and Modern Data Stacks
Loading
raw data directly into staging tables in the warehouse
Lesson 1816What is ELT? Extract, Load, Transform Explained
Loans
table tracks who borrowed what
Lesson 1051Introduction to Foreign Keys
Local branches
that exist only on your machine
Lesson 2020The Golden Rule of Rebase
Local control
Changes in one region don't affect distant regions
Lesson 662Polynomial Features vs Splines
Locks
Constraints can hold locks longer, blocking concurrent operations
Lesson 1060Trade-offs: Performance vs Integrity
Log transformation of X
If you modeled `Y = β₀ + β₁log(X)`, multiplying X by a factor *c* changes Y by β₁·log(c), so doubling X changes Y by about 0.69·β₁ when the natural log is used.
Lesson 594Interpreting Models After Transformation
Log transformation of Y
If you modeled `log(Y) = β₀ + β₁X`, a one-unit increase in X multiplies Y by e^β₁; for small β₁, the coefficient approximates the *proportional* change in Y.
Lesson 594Interpreting Models After Transformation
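The two interpretations can be checked numerically. This is a small sketch assuming natural logarithms were used in the model; the function names and coefficient values are made up for illustration.

```python
import math


def change_in_y_when_x_scaled(beta1: float, factor: float) -> float:
    """Model Y = b0 + b1*log(X): additive change in Y when X is multiplied by factor."""
    return beta1 * math.log(factor)


def percent_change_in_y(beta1: float, delta_x: float = 1.0) -> float:
    """Model log(Y) = b0 + b1*X: percent change in Y when X rises by delta_x."""
    return (math.exp(beta1 * delta_x) - 1.0) * 100.0


# Doubling X with b1 = 3.0 shifts Y by 3.0 * ln(2).
print(round(change_in_y_when_x_scaled(3.0, 2.0), 4))

# With b1 = 0.05, one more unit of X raises Y by slightly more than 5%.
print(round(percent_change_in_y(0.05), 4))
```

The second result shows why reading the coefficient directly as a percentage is an approximation that works best when the coefficient is small.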
Log-normal
suits variables that are products of many small multiplicative factors—like incomes, stock prices, or city sizes.
Lesson 193Choosing Between Distributions in Practice
Log-rank tests
to compare response timing across different campaign variants
Lesson 841Campaign Response Time Analysis
Logically Connected
Every recommendation must flow directly from your analysis.
Lesson 1970Recommendations and Next Steps
logit link function
you just learned transforms these probabilities into a scale where linear modeling works, then transforms back to give valid probabilities.
Lesson 679Logistic Regression Setup and the Binary ResponseLesson 680The Logit Link Function and Odds
Logo churn
counts *how many customers* cancel (e.
Lesson 1628SaaS Metrics: MRR, ARR, and Logo Churn
Long flat sections
Periods with no events or only censored observations
Lesson 815Survival Curve Plots and Interpretation
Long format
(the tidy version) stacks these observations vertically, using one column for the variable name (`month`) and another for its value (`sales`).
Lesson 1144Common Violations: Wide vs Long Format
Long-term business viability
A 2% floor vs 20% changes unit economics dramatically
Lesson 1658Flattening and Asymptotic Behavior
Long, complex journeys
where no single touchpoint dominates
Lesson 1727Linear Attribution Model
Longer-term fluctuations
tied to economic or business cycles, but *without* a fixed period.
Lesson 705The Four Classical Components
Longitude
measures east-west position from the Prime Meridian (0°) through ±180°.
Lesson 1308Geographic Data Types and Coordinate Systems
LOO
(Leave-One-Out cross-validation) to compare them:
Lesson 1596Posterior Predictive Checks and Model Comparison
Look for confounders
What third factor might drive both metrics?
Lesson 1615Correlation Without Causation
Look for patterns
Are curves converging?
Lesson 1659Comparing Retention Across Cohorts
Look for subgroup patterns
Split your data by the suspected confounder—does the treatment-outcome relationship change or reverse?
Lesson 1429Identifying Confounders in Practice
Looks unprofessional
in serious analytical contexts
Lesson 1963Removing Chartjunk
Lookups
Fetching related data without a JOIN
Lesson 967Subqueries in the SELECT Clause
Losing credibility
Decision-makers can't tell where facts end and opinions begin
Lesson 1927Separating Analysis from Advocacy
Love plots
(or balance plots) display SMDs before and after matching, making it easy to see which covariates improved and which remain problematic.
Lesson 1450Assessing Balance After Matching
Low baseline rate example
If your current conversion is 2%, improving to 3% (a 50% relative lift!
Lesson 1499Adjusting for Baseline Conversion Rates
Low p-value (< α)
Your observed frequencies differ significantly from expected frequencies.
Lesson 420Interpreting Chi-Squared Test Results
Low statistical power
– You might miss real effects (false negatives).
Lesson 1493Why Sample Size Matters in A/B Tests
Lower alpha (0.01)
Like a less-sensitive detector.
Lesson 334Setting Alpha: Choosing Your Significance Level
Lower bound only
Mean - t*(SE)
Lesson 275One-Sided Confidence Bounds
Lower confidence
to 95% (slightly riskier, smaller sample needed)
Lesson 295Trade-offs: Precision, Confidence, and Cost
Lower confidence (e.g., 90%)
Narrower intervals, but the procedure captures the true parameter less often
Lesson 267Interpreting Confidence Levels
Lower is better
The model with the smallest AIC or BIC is preferred
Lesson 781Information Criteria: AIC and BIC
Lower peak
The center is slightly flatter than a normal curve
Lesson 352The t-Distribution and Degrees of Freedom
Lower threshold (A)
Based on acceptable Type II error (β, false negative rate)
Lesson 1511Sequential Probability Ratio Test (SPRT)
Lower values are better
they indicate a superior balance of fit and simplicity.
Lesson 785Information Criteria: AIC and BIC
Lower variance/standard deviation
= outcomes cluster tightly around the expected value, more predictable
Lesson 148Variance and Standard Deviation of Discrete Distributions
Lower λ
means events happen less frequently → longer waiting times
Lesson 164The Exponential Distribution
LTV:CAC ratio
divides lifetime value by customer acquisition cost (CAC) to reveal whether you're spending wisely.
Lesson 1667LTV:CAC Ratio and ProfitabilityLesson 1669LTV Segmentation and TargetingLesson 1756LTV:CAC Ratio as a Health Metric
Luminance
(or lightness/value) is how bright or dark the color appears, from near-black to near-white.
Lesson 1234Color: Hue, Saturation, and Luminance

M

MA process
ACF cuts off sharply; PACF decays gradually
Lesson 731PACF for AR Process Identification
MA(q)
process shows **gradual exponential decay** or a damped sinusoidal pattern in the PACF—no clean cutoff.
Lesson 732PACF Patterns for Common ModelsLesson 775Moving Average (MA) ModelsLesson 777Identifying MA Order (q) Using ACF
Machine failures
For certain systems, past survival time doesn't change future failure risk
Lesson 167Memoryless Property of Exponential
Machine Learning (ML)
These are techniques that let computers find patterns and make predictions automatically.
Lesson 7The Data Science Skill Stack
Machine Learning Feature Scaling
Many algorithms (like k-nearest neighbors or neural networks) perform better when features are standardized to similar ranges.
Lesson 201Z-Score Applications and Limitations
Machine learning methods
(Isolation Forest, autoencoders) for complex multivariate patterns
Lesson 1411Applications and Limitations
MAD
(Mean Absolute Deviation) is useful when you want interpretability similar to standard deviation but with less sensitivity to outliers.
Lesson 54When to Use Each Measure
MAD (Median Absolute Deviation)
instead of standard deviation (a robust measure of spread you learned earlier)
Lesson 73Modified Z-Score Using MAD
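A sketch of the modified z-score built on the median absolute deviation. The 0.6745 scaling constant and the 3.5 cutoff are a widely used convention, assumed here rather than taken from the lesson; the function name is illustrative.

```python
from statistics import median


def modified_z_scores(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)  # median absolute deviation
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (v - center) / mad for v in values]


data = [10, 12, 11, 13, 12, 11, 40]
scores = modified_z_scores(data)
outliers = [x for x, z in zip(data, scores) if abs(z) > 3.5]
print(outliers)
```

Because both the center and the spread are medians, the single extreme value cannot inflate the yardstick it is measured against, which is the main advantage over a standard-deviation-based z-score.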
MAE (Mean Absolute Error)
Average of absolute differences; easy to interpret in original units
Lesson 790Out-of-Sample Forecast Evaluation
Mahalanobis distance
measures how far a point is from the center of a multivariate distribution, accounting for correlations between variables.
Lesson 74Multivariate Outlier DetectionLesson 1381Multivariate Z-Score Methods
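To make the role of correlation concrete, here is a minimal two-variable sketch that inverts the 2×2 covariance matrix by hand. The function name and example numbers are hypothetical; real analyses with more variables would use a linear algebra library.

```python
import math


def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance of a 2-D point from a mean, given a 2x2 covariance."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix [[a, b], [c, d]] is 1/det * [[d, -b], [-c, a]].
    quad = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(quad)


# Two variables with correlation 0.8 and unit variances.
cov = ((1.0, 0.8), (0.8, 1.0))
along = mahalanobis_2d((1.0, 1.0), (0.0, 0.0), cov)     # follows the correlation
against = mahalanobis_2d((1.0, -1.0), (0.0, 0.0), cov)  # contradicts it
print(round(along, 3), round(against, 3))
```

Both points are the same Euclidean distance from the center, yet the one that runs against the correlation is much farther in Mahalanobis terms, which is why the measure is useful for multivariate outlier detection.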
Main branch (`main`)
Your production-ready, validated code.
Lesson 2035Branching Strategies for Experiments
Main effect of degree
the intercept difference between groups
Lesson 652Interpreting Categorical × Continuous Interactions
Main effect of experience
the baseline slope (for the reference group, no degree)
Lesson 652Interpreting Categorical × Continuous Interactions
Main effects
test whether each factor matters *on its own*, averaging across all levels of the other factor.
Lesson 464Main Effects in Two-Way ANOVALesson 465Interaction EffectsLesson 1689Multivariate Testing and Personalization
Main ingredients (geom layers)
Points, lines, bars added with `geom_*()` functions
Lesson 1347Understanding Layers in ggplot2
Maintain integrity
by being transparent about what you can and cannot deliver
Lesson 34Recognizing Boundaries of Competence
Maintain predictive power
Often little loss in R-squared
Lesson 585Remedies: Variable Selection
Maintain valid α
and power guarantees
Lesson 1510Sequential Testing Overview
Maintainability
Adding or reordering parameters won't break your code
Lesson 1106Parameter Placeholders: Named Parameters
Maintenance
Rule-based systems are trivial to update.
Lesson 2123Simple Rules Beat Complex Models
Maintenance burden
Outdated code may break when dependencies change, triggering false alarms
Lesson 2135Dead Experimental Code and Feature Sprawl
Maintenance scheduling
When k > 1, rising hazard rates signal when preventive maintenance is cost-effective
Lesson 188Weibull Distribution: Hazard Function and Reliability
Make better decisions
React to actual changes, not predictable cycles
Lesson 748Seasonally Adjusted Data
Make decisions
Knowing the typical outcome helps guide business choices and predictions
Lesson 38What is Central Tendency?Lesson 2121Timeboxing and Deadlines
Make decisions faster
on average
Lesson 1510Sequential Testing Overview
Make probabilistic statements
Calculate P(μ₁ > μ₂ | data) or credible intervals for δ
Lesson 1570Comparing Two Means: Bayesian Approach
Makes everything positive
– no more negative values, so you're only looking at spread magnitude
Lesson 560Scale-Location Plot (Spread-Location Plot)
Making findings accessible
Translate technical metrics into business language.
Lesson 2090Stage 6: Interpretation and Insight Generation
Making multiplicative relationships additive
– Easier to model and interpret
Lesson 212Log Transformations
Manager
Lead a team (4-8 people), conduct 1-on-1s, handle performance reviews, remove blockers
Lesson 2140Individual Contributor vs Management Tracks
Manages memory
by processing chunks sequentially when needed
Lesson 1790What is Dask and When to Use It
Managing execution order
through task graphs or workflow engines
Lesson 1769Task Parallelism and Work Distribution
Mann-Whitney U test
(also called the Wilcoxon Rank-Sum test) offers a robust alternative to the two-sample t-test.
Lesson 393Mann-Whitney U Test (Wilcoxon Rank-Sum)
Manual entry
Recording observations or measurements
Lesson 11Data Collection and Acquisition
Manual transformations
(typical Python workflow):
Lesson 1373Statistical Transformations: Built-in vs Manual
Manually edit
Choose which changes to keep, or combine them, then remove the conflict markers
Lesson 2018Resolving Conflicts During Rebase
Manufacturing
Predicting when machines need maintenance before they break
Lesson 6Common Data Science ApplicationsLesson 1412What is Change-Point Detection?
Manufacturing Defects
A production line averages 2 defects per 1,000 units.
Lesson 144Poisson Applications: Arrivals and Events
Map common path variations
– use path analysis or Sankey diagrams to visualize popular alternate routes
Lesson 1683Multi-Path and Non-Linear Funnels
Map projections
are mathematical transformations that flatten the globe onto a plane—like peeling an orange and trying to lay the peel flat.
Lesson 1308Geographic Data Types and Coordinate Systems
Map secondary applications
What could someone build *on top of* your work that causes harm?
Lesson 1924Red Team Thinking for Data Scientists
MAR (Missing at Random)
Missingness relates to *observed* data (e.
Lesson 1207Missing Data Assessment and Strategy
marginal effect
the change in your outcome variable when that predictor increases by one unit, *while holding all other predictors constant*.
Lesson 604Marginal Effects and Ceteris ParibusLesson 659Interpreting Polynomial Regression Coefficients
marginal likelihood
is the denominator that normalizes the posterior distribution.
Lesson 1536The Evidence (Marginal Likelihood)Lesson 1546The Role of the Normalizing Constant
Markdown text
Plain text with simple formatting (headers, lists, bold, italics)
Lesson 1983R Markdown for Dynamic Reports
Marker size reduction
Smaller points reduce overlap
Lesson 1310Point Maps and Scatter Plots on Maps
Marker styles
change the shape of each point (circles, squares, triangles, etc.
Lesson 1265Scatter Plots: Relationships Between Variables
Market or product changes
Launching in new markets, adding product lines, or facing competitive threats requires rethinking which metrics matter most and how they interconnect.
Lesson 1626Maintaining and Evolving Metric Trees
Marketing
Identifying which customers are likely to cancel subscriptions
Lesson 6Common Data Science Applications
Marketing effectiveness
Compare cohorts from different channels
Lesson 1644What is Cohort Analysis?
Marketing expenses
ad spend across all channels, content creation, marketing tools and software
Lesson 1753Customer Acquisition Cost (CAC): Components and Calculation
Marketplaces
More sellers (treatment) improve selection for all buyers
Lesson 1527Ignoring Network Effects
Markov chain
is a sequence of states where the next state depends *only* on the current state, not on how you got there.
Lesson 1589Markov Chains: The Foundation of MCMCLesson 1733Markov Chain Attribution Models
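The "next state depends only on the current state" idea can be sketched with a transition matrix, where row i holds the probabilities of moving from state i to each state. The two-state matrix below is an invented example.

```python
def step(distribution, transition):
    """Advance a probability distribution over states by one step of the chain."""
    n = len(distribution)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]


# Hypothetical two-state chain: rows sum to 1.
transition = [
    [0.9, 0.1],
    [0.5, 0.5],
]

dist = [1.0, 0.0]  # start in state 0 with certainty
for _ in range(200):
    dist = step(dist, transition)

print([round(p, 4) for p in dist])
```

After many steps the distribution settles near (5/6, 1/6) regardless of the starting state. Convergence to such a stationary distribution is the property that MCMC methods rely on.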
Mask volatility
Show a smooth period while hiding the chaos before and after
Lesson 1241Cherry-Picking Time Ranges
Massively scalable compute engines
(BigQuery, Snowflake, Redshift): These could query petabytes of data in seconds, separating storage from compute power
Lesson 1818The Rise of ELT: Cloud Storage and Compute
Match
Apply exact matching on these coarsened bins—much easier now!
Lesson 1449Coarsened Exact Matching (CEM)
Match regions
Pair test and control geographies with similar historical sales, demographics, and seasonality
Lesson 1746Geo-Lift Experiments
Match the question
If you care about *means*, try parametric first.
Lesson 398Choosing Between Parametric and Non-Parametric Tests
Match user intent
Steps should reflect meaningful progress, not just technical page loads.
Lesson 1679Defining Funnel Steps and Events
Matched records with discrepancies
(e.
Lesson 941Use Cases: Data Reconciliation
Matched rows
Both sides have real values (no NULLs in key columns)
Lesson 937Identifying Matched vs Unmatched Rows
Mathematical convenience
Some priors (called *conjugate priors*) make calculations simpler
Lesson 1534The Prior Distribution
Mathematical dependencies
Subtotals should equal their parts.
Lesson 1155Consistency Checks Across Fields
Mathematical elegance
Squared terms have smooth derivatives, making it possible to solve for the optimal slope and intercept using calculus.
Lesson 517The Least Squares Criterion
Mathematically elegant
If your prior is Beta(α, β) and you observe `k` successes in `n` trials (Binomial likelihood), your posterior is simply Beta(α + k, β + n - k).
Lesson 1551Beta-Binomial Conjugacy
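The update rule above translates directly into code. This is a small sketch with illustrative function names, using a uniform Beta(1, 1) prior as an assumed starting point.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Posterior Beta parameters after observing successes out of trials."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Uniform prior, then 7 successes in 10 trials.
post_alpha, post_beta = beta_binomial_update(1, 1, successes=7, trials=10)
print(post_alpha, post_beta, round(beta_mean(post_alpha, post_beta), 4))
```

No integration or sampling is needed: the posterior is obtained by adding the observed counts to the prior parameters.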
Mathematics & Statistics
Lesson 1Defining Data Science
Matplotlib
historically had a more technical, MATLAB-inspired look with white backgrounds and bold default colors (blue, orange, green).
Lesson 1371Default Aesthetics and Design ChoicesLesson 1373Statistical Transformations: Built-in vs Manual
Matplotlib subplots
Best when you need completely different plot types, custom layouts, or fine-grained control over individual panels
Lesson 1372Faceting: ggplot2 vs Seaborn and Matplotlib Subplots
Matplotlib's Object-Oriented Interface
treats plotting like building with objects.
Lesson 1370Syntax Philosophy: Grammar of Graphics vs Object-Oriented
Matplotlib's subplots
, however, are more imperative and manual.
Lesson 1372Faceting: ggplot2 vs Seaborn and Matplotlib Subplots
Matrix plots
Visualize data tables (heatmaps, cluster maps)
Lesson 1281Introduction to Seaborn's Statistical Plots
Mature businesses
optimize for efficiency—targeting shorter payback (under 12 months), higher ROAS (3x+), and stable CAC.
Lesson 1759Optimizing ROAS, CAC, and Payback Together
MAX
together with **GROUP BY** to create rich summaries of grouped data.
Lesson 892GROUP BY with Different Aggregate FunctionsLesson 894NULL Values in GROUP BY
Maximize ROAS
You'll likely reduce spend, raise CAC (fewer efficient channels), and shorten payback (only safest bets)
Lesson 1759Optimizing ROAS, CAC, and Payback Together
Maximizing external validity
You recruit diverse students from multiple schools, allow natural variation in implementation, and study real-world conditions.
Lesson 1441Internal vs External Validity
Maximizing internal validity
You recruit a highly homogeneous group of students, control every aspect of the environment, use strict protocols, and carefully monitor compliance.
Lesson 1441Internal vs External Validity
Maximum test
Only checks if the *largest* value is an outlier
Lesson 1393Two-Sided vs One-Sided Grubbs' Test
MDE
= minimum detectable effect (absolute difference in means)
Lesson 1498Sample Size Formulas for Continuous Metrics
Mean = 1/λ
The average waiting time is simply the inverse of the rate
Lesson 166Exponential Distribution: Mean and Variance
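The relationship between the rate and the average wait can be checked by simulation. This sketch uses the standard library's `random.expovariate`; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import random


def simulate_mean_wait(rate, n_samples=100_000, seed=0):
    """Average of simulated exponential waiting times with the given rate."""
    rng = random.Random(seed)  # seeded for repeatable results
    return sum(rng.expovariate(rate) for _ in range(n_samples)) / n_samples


fast = simulate_mean_wait(rate=2.0)  # theoretical mean 1/2
slow = simulate_mean_wait(rate=0.5)  # theoretical mean 2
print(round(fast, 3), round(slow, 3))
```

The simulated averages land close to 1/λ in both cases, and the lower rate produces the longer average wait.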
Mean Square Between (MSB)
Variance *between* groups
Lesson 443Mean Squares and the F-Ratio
Mean Square Within (MSW)
Variance *within* groups (also called Mean Square Error, MSE)
Lesson 443Mean Squares and the F-Ratio
Mean Squared Error (MSE)
or **Mean Absolute Error (MAE)** on historical data.
Lesson 759Choosing the Smoothing Parameter αLesson 763Evaluating Exponential Smoothing Models
mean squares
(variance estimates) by dividing sum of squares by their respective df.
Lesson 442Degrees of Freedom in ANOVALesson 443Mean Squares and the F-Ratio
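A compact sketch of how the mean squares and the F-ratio are computed for a one-way ANOVA, assuming the usual degrees of freedom (k − 1 between groups, N − k within groups). The function name and the tiny dataset are made up for illustration.

```python
def one_way_anova_f(groups):
    """Return (MSB, MSW, F) for a list of groups of numeric observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Sum of squares between groups: how far group means sit from the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Sum of squares within groups: spread of observations around their own mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    msb = ss_between / (k - 1)        # divide by between-groups df
    msw = ss_within / (n_total - k)   # divide by within-groups df
    return msb, msw, msb / msw


msb, msw, f_ratio = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(msb, msw, f_ratio)
```

A large F-ratio means the variation between group means is large relative to the variation inside the groups.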
Mean-variance relationship
For Poisson, `Var(Y) = μ` (variance equals the mean)
Lesson 690The Poisson Distribution as a GLM
Measurable Success Criteria
"We need 70% accuracy" or "Reduce customer churn by 15%"
Lesson 10Problem Definition and Scoping
Measure impact
Did that feature launch in March improve Day-30 retention for the March cohort compared to February?
Lesson 1659Comparing Retention Across Cohorts
Measure lift
Compare actual outcomes in test regions vs predicted outcomes (based on control region trends)
Lesson 1746Geo-Lift Experiments
Measure outcomes
Track your KPI for both groups over the same time window
Lesson 1641Isolating Effects with Control Groups
Measure the outcome
(purchases, sign-ups, visits) for both groups
Lesson 1747Ghost Ads and PSA Tests
Measurement
Salary data that systematically underreports gig economy earnings
Lesson 1878What is Bias in Data?
Measurement bias
occurs when your data collection instruments, procedures, or definitions consistently produce inaccurate values.
Lesson 1880Measurement and Label Bias
Measurement error in X
Recording mistakes correlate with unpredictable noise
Lesson 553Exogeneity: X Must Be Independent of Errors
Measures customer value directly
– It reflects how much value customers extract from your product
Lesson 1604What is a North Star Metric?
Medcouple-based detection
measures skewness more robustly than traditional methods
Lesson 1388Limitations and Alternatives to IQR Detection
Media Mix Modeling (MMM)
and **attribution modeling** both help marketers understand marketing effectiveness, but they examine the problem from fundamentally different angles, like comparing satellite imagery to street-level photography.
Lesson 1736MMM vs Attribution: Key Differences
Median (Q2)
the 50th percentile (middle value)
Lesson 59The Five-Number Summary and Box Plots
Median Absolute Deviation (MAD)
(a robust measure of spread)
Lesson 1380Modified Z-Score Using Median
Median survival times
for each group (from lesson 816)
Lesson 817Comparing Multiple Survival Curves
Median time-to-conversion
How long until half your prospects convert?
Lesson 839Time-to-Conversion in Marketing Funnels
mediator
sits *on* the causal path between treatment and outcome.
Lesson 1471Mediators and CollidersLesson 1476Common DAG Patterns and Pitfalls
Mediators
On the path (X → M → Y).
Lesson 1471Mediators and Colliders
Medical datasets
excluding or underrepresenting certain populations
Lesson 1881Historical and Societal Bias
Medical measurements
(comparing height, weight, and blood pressure on the same scale)
Lesson 200Comparing Values Across Different Distributions
Medical trials
that only track patients who completed treatment miss those who got too sick to continue
Lesson 247Survivorship Bias
Medium-risk
Automated email campaigns highlighting underused features that correlate with retention
Lesson 1676Win-Back and Retention Strategies
Meetups and conferences
offer face-to-face learning and relationship building.
Lesson 2144Networking and Community Engagement
Memorable and motivating
Teams should *want* to achieve it
Lesson 1609Setting Effective Objectives
memoryless property
when r = 1, but only the geometric distribution is truly memoryless in the classical sense.
Lesson 137Geometric vs Negative Binomial: Key DifferencesLesson 167Memoryless Property of Exponential
Mental Model
Humans naturally think "group by region, then by product" differently than "group by product, then by region"
Lesson 906Order Matters: Column Sequence in GROUP BY
Mercator
Preserves angles (useful for navigation) but distorts area dramatically near poles
Lesson 1308Geographic Data Types and Coordinate Systems
Merge
creates a new commit that combines two branches, preserving the complete history of both branches.
Lesson 2014Understanding Git Rebase vs MergeLesson 2026Merge Strategies: Merge vs Squash vs Rebase
Merge joins
are efficient for pre-sorted data or when the database can sort cheaply, advancing through both tables in lockstep.
Lesson 957Join Strategies: Nested Loop, Hash, Merge
Mesokurtic
(kurtosis ≈ 3 or excess kurtosis ≈ 0): Matches the normal distribution.
Lesson 66Kurtosis: Definition and Interpretation
Message
The explanation you provided with `git commit`
Lesson 1999Viewing Commit History
Message passing coordination
Nodes use network protocols to exchange data.
Lesson 1771Shared-Nothing Architecture
Messaging apps
Features that change how one person messages affect the recipient
Lesson 1527Ignoring Network Effects
Metadata
is "data about data"—the descriptive information that explains what each piece of data means.
Lesson 23Data Provenance and MetadataLesson 1163Metadata and Data DictionariesLesson 1871Why Version Control for Data?
Metadata Database
Stores DAG runs, task states, and logs
Lesson 1833Introduction to Apache Airflow
Method
Interview whoever agrees until each quota is full
Lesson 240Quota Sampling
Method calls
generate SQL behind the scenes
Lesson 1117What is an ORM and Why Use It?
Methodological transparency
Show your process, not just results
Lesson 2141Building a Portfolio and Personal Brand
Methodology
(brief): High-level approach without technical minutiae
Lesson 1966Report Structure and Executive Summary
Metric attribution
is the process of assigning credit to the specific drivers that caused a metric to move.
Lesson 1637What is Metric Attribution?
Metric Stability
Your success metrics shouldn't show statistically significant differences between identical groups
Lesson 1483Pre-Experiment Validation
metric tree
(also called a "metric hierarchy" or "decomposition tree") is a visual framework that breaks a single top-level metric—often your North Star Metric—into the sub-metrics that mathematically drive it.
Lesson 1621Metric Trees: Structure and PurposeLesson 1632Financial Services Metrics: AUM, NIM, and Credit Metrics
Metrics/Measures
Numerical values you aggregate (sum, average, count)
Lesson 1808Star Schema and Fact Tables
Mid-level managers
Add one layer of confidence.
Lesson 1953Adjusting Statistical Depth by Audience
Mid-period acquisitions
Should new customers acquired *during* the period count in the denominator?
Lesson 1671Churn Rate Calculation Methods
Middle touches aren't irrelevant
They nurtured the relationship and kept your brand top-of-mind
Lesson 1729Position-Based (U-Shaped) Attribution
MIN
together with **GROUP BY** to create rich summaries of grouped data.
Lesson 892GROUP BY with Different Aggregate FunctionsLesson 894NULL Values in GROUP BY
Minimize
data movement across the network
Lesson 1780Transformations vs Actions in Spark
Minimize CAC
ROAS might drop (reaching less-qualified audiences), and payback could extend
Lesson 1759Optimizing ROAS, CAC, and Payback Together
Minimize color
Use color purposefully to highlight the key finding, not to decorate every category.
Lesson 1958Simplifying Visual Complexity
Minimize payback
You may sacrifice ROAS (focusing on quick wins, not best returns) and accept higher CAC initially
Lesson 1759Optimizing ROAS, CAC, and Payback Together
Minimum sizes matter
For digital displays, axis labels should typically be at least 10–12 points; titles 14–18 points.
Lesson 1252Font Size, Typeface, and Readability
Minimum test
Only checks if the *smallest* value is an outlier
Lesson 1393Two-Sided vs One-Sided Grubbs' Test
Minimum Wage Changes
One of the most famous DiD studies compared New Jersey (which raised minimum wage) to Pennsylvania (which didn't).
Lesson 1459Real-World DiD Applications
Minor departures matter more
– slight skewness or a few outliers become "statistically significant"
Lesson 209Sample Size Considerations in Normality Tests
Misaligned incentives
The team running the test is measured on engagement, not profitability, so they optimize for engagement.
Lesson 1530Mismatched Metrics and Goals
Misallocating budget
based on attribution models that reward proximity, not causation
Lesson 1717Incrementality and True Channel Impact
Misallocating resources
to ineffective initiatives
Lesson 1637What is Metric Attribution?
Miss true anomalies
in heavy-tailed distributions
Lesson 1390Assumptions of Grubbs' Test
Missed opportunities
Early signals of success or failure go unnoticed
Lesson 1617The Danger of Lagging-Only Metrics
Missing context
Always include a clear legend, data source, and what the metric represents.
Lesson 1309Choropleth Maps: Basics and Best Practices
Missing details
Analysis steps aren't fully documented (remember Documentation Standards?
Lesson 30The Reproducibility Crisis and Solutions
Missing hidden drivers
that actually moved the needle
Lesson 1637What is Metric Attribution?
Missing information
needed to proceed
Lesson 1681Time-Based Funnel Analysis
Missing patterns
Gaps or unusual concentrations?
Lesson 1208Distribution Checks for All Variables
Missing required values
NULLs where they shouldn't exist
Lesson 1150What is Data Validation?
Missing Value Codes
How nulls or missing data are represented (NA, -999, blank)
Lesson 2064Creating Data Dictionaries
Mistaking correlation patterns
Seeing A and Y correlate and assuming A → Y, when really both are caused by unmeasured C.
Lesson 1476Common DAG Patterns and Pitfalls
MIT or Apache 2.0
Permissive licenses allowing commercial use with minimal restrictions
Lesson 2082Choosing a License for Data Science Projects
Mitigation measures
technical safeguards (encryption, access controls) and organizational policies (training, audits)
Lesson 1910Data Protection Impact Assessments (DPIAs)
Mixed (Numeric × Categorical)
When one variable is numeric and the other categorical (like "salary" across "department"), you're comparing distributions of the numeric variable across groups.
Lesson 1182Choosing Analysis Methods by Variable Types
ML Engineers
focus on *production systems and scale*.
Lesson 2138Data Analyst vs Data Scientist vs ML Engineer
MLlib
provides scalable machine learning algorithms that work on distributed data:
Lesson 1775Spark Components: Core, SQL, MLlib, Streaming
MMM
works at the aggregate level, analyzing total spend and performance across channels over time (typically weeks or months).
Lesson 1736MMM vs Attribution: Key Differences
Modality
One peak (unimodal), two (bimodal), or many?
Lesson 63Understanding Distribution Shape
Model 1
Start with just square footage
Lesson 617Practical Example: Variable Selection
Model 1 (Binary)
Predicts whether an observation is a "certain zero" (structural) using logistic regression.
Lesson 695Zero-Inflated Models
Model 2 (Count)
For those who *can* have the event, predicts the count using Poisson or negative binomial regression.
Lesson 695Zero-Inflated Models
Model A
Predicts house price using square footage (R² = 0.
Lesson 612Why R-Squared Alone Is Misleading
model artifacts
, version everything: the serialized model file, training code commit hash, hyperparameters, training data version, and evaluation metrics.
Lesson 1877Versioning Strategies for Different Data TypesLesson 2091Stage 7: Communication and Handoff
Model B
Predicts house price using square footage + owner's favorite color (R² = 0.
Lesson 612Why R-Squared Alone Is MisleadingLesson 630Bayesian Information Criterion (BIC)
Model checking
Does your fitted model generate realistic fake data?
Lesson 1571Posterior Predictive Distribution for New Data
Model comparison
Test models with different subsets and compare fit metrics
Lesson 585Remedies: Variable Selection
Model debt
Quick fixes to model performance without understanding why they work
Lesson 2131What is Technical Debt in Data Science?
Model Development
involves selecting appropriate statistical methods or machine learning algorithms, training them on your prepared data, and tuning parameters.
Lesson 2089Stage 5: Model Development and Validation
Model insights
A baseline model might show your problem is too easy (100% accuracy suggests data leakage) or impossibly hard (random performance suggests the question can't be answered with available data).
Lesson 2109Why Data Science is Inherently Iterative
Model metadata
Training parameters, hyperparameters, dataset versions, and evaluation scores (stored in JSON or YAML)
Lesson 2034Committing Data Artifacts and Model Outputs
Model mismatch
Real problems often involve non-conjugate likelihoods or complex dependencies that conjugates can't handle
Lesson 1555Advantages and Limitations of Conjugate Priors
Model performance metrics
"This model is 85% accurate on the test set"
Lesson 2122When Uncertainty Is Acceptable
Model performance plateaus
Your validation accuracy improves from 0.
Lesson 2116Diminishing Returns and the 80/20 Rule
Model selection
(certain algorithms handle discrete vs continuous differently)
Lesson 18Numerical Variables: Discrete and Continuous
Model staleness
Your model was trained on 2022 data.
Lesson 2136Monitoring Gaps and Silent Failures
Model versioning gaps
occur when you can't reliably reproduce a model's results because critical information wasn't tracked: which exact code version, which data snapshot, which hyperparameters, which library versions, or even which random seed was used.
Lesson 2134Model Versioning and Reproducibility Gaps
Model versions
`model-churn-v2`, `fraud-detector-deployed`
Lesson 2037Tagging Releases and Experiment Snapshots
Model-ready format
Machine learning libraries expect features in columns and samples in rows.
Lesson 1149Benefits of Tidy Data for Downstream Work
Modeling considerations
"Imbalanced classes—use stratified sampling"
Lesson 1212EDA Summary Documentation and Next Steps
Modeling relationships
Capture how variables relate to each other (e.
Lesson 1901Synthetic Data Generation
Moderate baseline rate example
If your baseline is 50%, improving to 51% has much higher variance (0.
Lesson 1499Adjusting for Baseline Conversion Rates
Moderately skewed data
n ≥ 30 usually works
Lesson 220Sample Size Requirements for the CLT
Modern best practice
Use Welch's t-test by default for two independent samples.
Lesson 362Welch's t-Test for Unequal Variances
Modern tools available
MCMC samplers and probabilistic programming libraries handle non-conjugate cases well
Lesson 1556Choosing Between Conjugate and Non-Conjugate Priors
Modern, Python-first orchestration
focused on developer experience.
Lesson 1839Alternative Orchestration Tools
Modes
`overwrite`, `append`, `ignore`, `errorIfExists`
Lesson 1779Reading and Writing Data in Spark
Modified box plots
adjust the fence calculations to account for skewness:
Lesson 1388Limitations and Alternatives to IQR Detection
Modified files
Files you've changed since your last commit but haven't staged yet
Lesson 1998Checking Repository Status
modified Z-score
replaces the vulnerable mean and standard deviation with robust alternatives:
Lesson 73Modified Z-Score Using MADLesson 1380Modified Z-Score Using Median
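A minimal sketch of the idea, assuming the robust alternatives are the median and the median absolute deviation (MAD), as the lesson titles suggest. The 0.6745 scaling constant and the 3.5 cutoff are common conventions rather than values taken from the lessons, and the function name is hypothetical:

```python
from statistics import median

def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    med = median(values)
    # MAD: the median of absolute deviations from the median.
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified Z-scores are undefined")
    return [0.6745 * (x - med) / mad for x in values]

data = [10, 12, 11, 13, 12, 11, 95]
scores = modified_z_scores(data)

# Assumed cutoff: flag points whose absolute score exceeds 3.5.
outliers = [x for x, s in zip(data, scores) if abs(s) > 3.5]
print(outliers)
```

Because the median and MAD barely move when one extreme value is added, the extreme point cannot mask itself the way it can with a mean and standard deviation.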
Monitor cohort-level performance
to detect when trade-offs shift
Lesson 1759Optimizing ROAS, CAC, and Payback Together
Monitor Index Statistics
Track metrics like index size, scan counts vs seek counts, and fragmentation percentage.
Lesson 1086Index Maintenance and Monitoring
Monitoring infrastructure
You need systems to check stopping conditions regularly (daily, hourly, or continuously)
Lesson 1515Trade-offs: Sample Size, Speed, and Complexity
Monitoring overhead
You need infrastructure to detect drift before it damages performance
Lesson 2128Data Distribution Shifts Frequently
Monitoring recommendations
what to watch when it's live
Lesson 2091Stage 7: Communication and Handoff
Monotonic
S(t) never increases; it either stays flat or decreases
Lesson 810The Survival Function S(t)
monotonic relationships
where one variable consistently increases as the other increases (positive monotonic) or consistently decreases (negative monotonic)—even if the relationship isn't a straight line.
Lesson 486Spearman's Rank Correlation CoefficientLesson 490Kendall's Tau vs Spearman's Rho
Monotonic vs Linear Relationships
Lesson 487When to Use Spearman vs Pearson
Monthly Active Users (MAU)
does the same over a 30-day window.
Lesson 1694Daily Active Users (DAU) and Monthly Active Users (MAU)
Monthly cycles
Credit card spending spikes at month-end
Lesson 707Seasonality: Regular Periodic Patterns
Monthly data
with yearly seasonality → period = 12
Lesson 746Choosing Seasonal Period
Monthly Recurring Revenue (MRR)
, a metric tree might decompose it as:
Lesson 1621Metric Trees: Structure and Purpose
Monthly vs. annual churn
A 5% monthly churn rate doesn't equal 60% annual churn.
Lesson 1671Churn Rate Calculation Methods
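A short worked example of why the two figures differ, assuming churn compounds at a constant monthly rate on the customers who remain (an assumption for illustration, not necessarily the lesson's exact method):

```python
# Assumption: the same monthly churn rate applies each month to the
# customers still active at the start of that month.
monthly_churn = 0.05

monthly_retention = 1 - monthly_churn        # share kept each month
annual_retention = monthly_retention ** 12   # share still active after 12 months
annual_churn = 1 - annual_retention

naive_annual_churn = monthly_churn * 12      # the simple 12x multiplication

print(f"Compounded annual churn: {annual_churn:.1%}")
print(f"Naive 12x estimate:      {naive_annual_churn:.1%}")
```

Under this assumption the compounded figure comes out at roughly 46%, noticeably below the naive 60%, because each month's churn applies to a shrinking base.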
More complex write logic
to keep redundant data synchronized
Lesson 1071When to Denormalize: Performance Trade-offs
More normal-looking
as sample size increases
Lesson 252Sampling Distribution of the Sample Mean
More Type II errors
– You're more likely to miss real effects (false negatives increase)
Lesson 342Alpha Level Trade-offs
More variability
→ Wider margin (unpredictable data means less precision)
Lesson 294Margin of Error and Its Components
Most accurate
channel for quantitative data—humans excel at comparing positions along a common scale.
Lesson 1231Channels of Visual Encoding
Motion
Movement catches the eye before anything else
Lesson 1235Pre-Attentive Attributes
Moving Average (MA) models
use previous *forecast errors* (also called residuals or shocks).
Lesson 775Moving Average (MA) Models
Moving averages
give equal weight to all observations in the window.
Lesson 764Exponential Smoothing vs Moving Averages
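A small sketch of a simple moving average, to make the equal weighting concrete. The function name and sample series are illustrative assumptions:

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values with equal weight."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window   # every value counts 1/window
        for i in range(len(series) - window + 1)
    ]

smoothed = moving_average([1, 2, 3, 4, 5], window=3)
print(smoothed)
```

Each value inside the window contributes exactly `1/window`, and values outside it contribute nothing, which is the contrast with the gradually declining weights of exponential smoothing.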
MRR
is the normalized monthly value of all active subscriptions.
Lesson 1628SaaS Metrics: MRR, ARR, and Logo Churn
MS (Mean Square)
SS divided by df—the average variation per degree of freedom
Lesson 444The ANOVA Table
Multi-panel dashboards
where consistent scales prevent visual confusion
Lesson 1276Sharing Axes Between Subplots
Multi-path funnels
recognize that users can reach the same endpoint through different sequences of events.
Lesson 1683Multi-Path and Non-Linear Funnels
Multi-step ahead forecasting
projects multiple periods into the future (e.
Lesson 794Forecasting Concepts and Horizons
Multi-touch
Credit is distributed across multiple touchpoints (linear, time-decay, position-based)
Lesson 1637What is Metric Attribution?
Multimodal data
Multiple clusters make mean/std deviation misleading
Lesson 1379Assumptions and Limitations
Multiple columns
"What's the average salary for each job title *within* each department?
Lesson 905Grouping by Multiple Columns: Basics
Multiple linear regression
extends the same least squares framework to include several predictor variables simultaneously.
Lesson 595From Simple to Multiple Linear Regression
Multiple lines
Often overlay several cohorts or segments for comparison
Lesson 1653What are Retention Curves?
Multiple outliers expected
Use robust methods like Modified Z-score or clustering techniques
Lesson 1395When to Use Grubbs' Test
Multiple subqueries
execute independently instead of sharing work
Lesson 966Performance Considerations for WHERE Subqueries
Multiple testing correction method
(Bonferroni, Holm-Bonferroni, Benjamini-Hochberg, etc.
Lesson 1508Pre-Registration and Correction Strategy
Multiple views
of the same data are needed (charts, tables, maps together)
Lesson 1330Introduction to Interactive Dashboards
multiplication rule
you just learned works beautifully for two events: P(A ∩ B) = P(A) × P(B|A).
Lesson 95Chain Rule for Multiple EventsLesson 107Bayes' Theorem Formula and Components
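As an illustration of the rule and its extension to a third event, here is a sketch using a standard 52-card deck drawn without replacement (the card example is an assumption chosen for illustration):

```python
from fractions import Fraction

# Two events: P(A ∩ B) = P(A) × P(B|A)
p_first_ace = Fraction(4, 52)                 # P(A): first card is an ace
p_second_ace_given_first = Fraction(3, 51)    # P(B|A): second is an ace too
p_two_aces = p_first_ace * p_second_ace_given_first

# Chain rule for a third event: multiply by P(C | A ∩ B)
p_third_ace_given_first_two = Fraction(2, 50)
p_three_aces = p_two_aces * p_third_ace_given_first_two

print(p_two_aces, p_three_aces)
```

Using `Fraction` keeps the arithmetic exact, so each conditional factor stays visible in the product.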
Multiplicative forecasting formula
Lesson 771Forecasting with Holt-Winters
Multiplicative models
assume components are multiplied:
Lesson 743Additive vs Multiplicative Models
Multiplicative seasonality
means seasonal swings grow or shrink proportionally with the trend level.
Lesson 766Additive vs Multiplicative Seasonality
Multiply by the likelihood
`P(data|θ)` — how probable the observed data is for each possible θ value
Lesson 1545Calculating the Posterior Distribution
Multiplying by a constant
If you multiply *X* by constant *a*, the expectation scales proportionally:
Lesson 149Properties of Expectation and Variance
Must control
for C to isolate A's effect on Y.
Lesson 1476Common DAG Patterns and Pitfalls
Mutual independence
(also called *joint independence*) is stronger.
Lesson 103Mutual Independence vs Pairwise Independence

N

N-1
(one less than your sample size)
Lesson 50Population vs Sample Variance
Nagelkerke R²
Adjusts Cox & Snell to range fully from 0 to 1
Lesson 702Pseudo R-Squared Measures
Name
The actual column identifier
Lesson 1163Metadata and Data Dictionaries
Named parameters
give each placeholder a meaningful name using the `:name` syntax, making your queries self-documenting and easier to modify.
Lesson 1106Parameter Placeholders: Named ParametersLesson 1108Handling IN Clauses Safely
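A minimal sketch of `:name` placeholders using Python's built-in `sqlite3` module with an in-memory database. The table and column names are hypothetical, and other drivers may use a different named-placeholder style:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")

# Each dict key matches a :name placeholder in the statement.
conn.executemany(
    "INSERT INTO employees VALUES (:name, :department, :salary)",
    [
        {"name": "Ada", "department": "Eng", "salary": 120000},
        {"name": "Bo", "department": "Eng", "salary": 90000},
        {"name": "Cy", "department": "Ops", "salary": 130000},
    ],
)

# Values are bound by the driver, never pasted into the SQL string.
rows = conn.execute(
    "SELECT name FROM employees "
    "WHERE department = :dept AND salary >= :min_salary "
    "ORDER BY name",
    {"dept": "Eng", "min_salary": 100000},
).fetchall()
conn.close()

print(rows)
```

Because the parameters are passed as a dictionary, the order of the keys does not matter, and the same name can be reused in several places within one query.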
Naming conventions prevent chaos
Establish patterns like `YYYY-MM-DD_project_dataset_version.
Lesson 2068Data Provenance Best Practices
Narrower
than the original population distribution
Lesson 252Sampling Distribution of the Sample Mean
Natural order
For ordinal categories like "Small, Medium, Large" or months, preserve logical sequence
Lesson 1178Bar Charts for Categorical Data
Natural workflow
Mirrors how we actually learn—bit by bit, not all at once
Lesson 1538Updating Beliefs with Sequential Data
Navigate with keyboard only
can you reach all interactive features?
Lesson 1254Testing Visualizations for Accessibility
Near-duplicates
Similar records that might represent the same entity (e.
Lesson 1154Uniqueness and Duplication Checks
Near-real-time
(minutes) allows micro-batches.
Lesson 1825Designing Pipeline Architecture
Nearest-Neighbor Matching
pairs each treated unit with the control unit(s) having the closest propensity score.
Lesson 1448Propensity Score Matching Methods
Necessity assessment
why this processing is needed, what legal basis applies
Lesson 1910Data Protection Impact Assessments (DPIAs)
Negative coefficient
→ that category has a lower average outcome than the reference
Lesson 637Interpreting Dummy Variable Coefficients
Negative correlation
Points trend downward (as one increases, the other decreases)
Lesson 1222Scatter Plots for Relationships
Negative r
Variables move in opposite directions (car age and resale value, temperature and heating bills)
Lesson 477Interpreting the Correlation Coefficient
Negative residuals
(`e_i < 0`) occur when the actual value is *below* the fitted line.
Lesson 540The Residual Formula
Negative slope
X and Y move opposite (X ↑ means Y ↓)
Lesson 524The Meaning of the Slope
Negative values
= lighter tails than normal (platykurtic)
Lesson 67Calculating KurtosisLesson 720The Autocorrelation Function (ACF)
Neglecting the complement
When you know P(A|B), don't assume you automatically know P(A|not B).
Lesson 100Common Conditional Probability Mistakes
Neither
recognizes the critical middle steps that move customers down the funnel
Lesson 1724Limitations of Single-Touch Attribution
Net Revenue Retention
measures revenue retention *including* expansion from existing customers:
Lesson 1629SaaS Growth Metrics: Quick Ratio and Net Revenue Retention
Netflix
Hours watched — reflects content value and reduces churn risk.
Lesson 1606Examples of North Star Metrics by Industry
Network effects
In social features, randomizing by user may "leak" treatment effects to control users who interact with treated users.
Lesson 1481Unit of RandomizationLesson 1923Algorithmic Amplification of Harm
Network Errors
happen when Python can't reach the database server.
Lesson 1093Troubleshooting Connection Issues
Neural networks
Many deep learning frameworks expect one-hot encoded inputs
Lesson 638One-Hot Encoding Overview
Never
choose your tail configuration after seeing your results.
Lesson 350Choosing the Right Tail Configuration
Never extrapolate
interpretation beyond your data range.
Lesson 523The Meaning of the Intercept
Never hardcode credentials
Store them in environment variables or configuration files:
Lesson 1090Establishing a Connection with psycopg2 (PostgreSQL)
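One way this might look in practice is sketched below. The environment variable names (`DB_HOST`, `DB_NAME`, and so on) and the helper name are assumptions for illustration, not a convention the lesson prescribes:

```python
import os

def load_db_config(env=os.environ):
    """Build connection settings from environment variables.

    Variable names here are hypothetical. Required values raise KeyError
    if missing, which fails fast instead of connecting with bad settings.
    """
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }

# Typical use (not executed here): pass the settings to the driver, e.g.
#   conn = psycopg2.connect(**load_db_config())
```

Keeping the lookup in one function means the credentials never appear in source control, and tests can pass in a plain dictionary instead of touching the real environment.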
New Customers
Recent converters who've made their first purchase or subscription.
Lesson 1704Customer Lifecycle Stages
New hypothesis
Email engagement is the driver, not day-of-week effects
Lesson 1201Domain Knowledge as a Hypothesis Source
New insights from data
Analysis might reveal that what you thought was a key driver (a branch metric) actually has minimal impact on your North Star.
Lesson 1626Maintaining and Evolving Metric Trees
No arbitrary cutoff
All past data contributes (just with declining weight)
Lesson 757Introduction to Exponential Smoothing
No autocorrelation
values shouldn't predict future values
Lesson 709Irregular Component: Random Noise
No change
"Customer satisfaction hasn't changed after the redesign" (before = after)
Lesson 307Defining the Null Hypothesis (H₀)
No decay (flat)
Equal weight throughout the window—simpler but less realistic
Lesson 1639Time Windows and Attribution Decay
No difference
"The mean weight of Group A equals the mean weight of Group B" (μ₁ = μ₂)
Lesson 307Defining the Null Hypothesis (H₀)
No effect
"This new drug has no effect on blood pressure" (effect = 0)
Lesson 307Defining the Null Hypothesis (H₀)
No extreme outliers
A single outlier can dramatically distort r
Lesson 480Scatterplots and Visual Assessment
No manual editing
of intermediate files between steps
Lesson 1981What Makes a Report Reproducible?
No one person understands
the entire chain anymore
Lesson 2132Pipeline Glue Code and Complexity Creep
No outliers
Mean provides more information because it uses all data points.
Lesson 42Comparing Mean, Median, and Mode
No patterns remaining
if you see structure in the residuals, you've missed something
Lesson 709Irregular Component: Random Noise
No relationship
"There's no correlation between study hours and test scores" (correlation = 0)
Lesson 307Defining the Null Hypothesis (H₀)
No repeating groups
there are no columns like "Phone1", "Phone2", "Phone3" storing similar data
Lesson 1064First Normal Form (1NF)
No seasonality
(no repeating patterns)
Lesson 758Simple Exponential Smoothing (SES)
No selection bias
(nothing observable or unobservable influences assignment)
Lesson 1487Simple Random Assignment
No significant spikes
→ No clear direct autoregressive pattern
Lesson 730Interpreting PACF Plots
No trend (flat)
Values fluctuate around a stable mean with no persistent direction
Lesson 706Trend: Long-Term Direction
Node Color
Use color to encode categories (communities, types) or continuous values (temperature scales for metrics).
Lesson 1319Styling Network Visualizations
Node Size
Scale nodes by importance metrics (degree centrality, betweenness) or attributes (population, budget).
Lesson 1319Styling Network Visualizations
Nodes
represent variables (like treatment, outcome, confounders)
Lesson 1468Introduction to Directed Acyclic Graphs (DAGs)
Noise addition
introduces random perturbation to numerical data, making exact values uncertain while preserving statistical properties for aggregate analysis.
Lesson 1895Data Anonymization Basics
Nominal
variables are categories without any inherent ranking or order.
Lesson 17Categorical Variables: Nominal and Ordinal
Nominal data
(categories with no order: fruit types, countries, product names) pairs best with:
Lesson 1238Matching Encoding to Data Type
Non-canonical links
are any other valid link functions you might choose for that distribution.
Lesson 676Canonical vs Non-Canonical Links
Non-correlated subqueries
are completely independent.
Lesson 968Correlated vs Non-Correlated Subqueries in SELECT
Non-directional (two-tailed)
"The new landing page will *change* sign-ups"
Lesson 1479Formulating Hypotheses
Non-independence
If observations are related in ways you haven't accounted for (clustered data, time series correlations), the independence assumption fails completely, and your p-values become meaningless.
Lesson 390When Parametric Tests Fail: Violations of Assumptions
Non-independent pairs
Results are unreliable; reconsider your study design
Lesson 374Assumptions of the Paired t-Test
Non-informative (flat) priors
essentially say "I know nothing" — they let the data dominate the analysis completely.
Lesson 1534The Prior Distribution
Non-linear funnels
acknowledge that users don't always move forward.
Lesson 1683Multi-Path and Non-Linear Funnels
Non-linear relationships
R-squared measures *linear* fit.
Lesson 537When R-Squared is Not Enough
Non-linearity
The relationship between X and Y curves rather than forming a straight line
Lesson 591When and Why to Transform Variables
Non-negative
Can't go below zero
Lesson 689When to Use Poisson Regression
Non-nested models
are competitors that can't be simplified into one another.
Lesson 791Comparing Nested and Non-Nested Models
Non-normal data
Switch to IQR-based detection; it's distribution-agnostic
Lesson 1395When to Use Grubbs' Test
Non-normal differences (large n)
Paired t-test is usually still robust
Lesson 374Assumptions of the Paired t-Test
Non-normal differences (small n)
Consider the Wilcoxon signed-rank test (a non-parametric alternative)
Lesson 374Assumptions of the Paired t-Test
Non-normal residuals
The Q-Q plot shows heavy tails, skewness, or other departures from normality
Lesson 591When and Why to Transform Variables
Non-normality with small samples
If your sample size is small (typically n < 30) and your data show strong skewness, heavy outliers, or non-normal distributions (confirmed through visual checks or tests like Shapiro-Wilk), the t-test's results become unreliable.
Lesson 390When Parametric Tests Fail: Violations of Assumptions
Non-nullability
A primary key can never be `NULL`.
Lesson 1048What Are Primary Keys?
Non-parametric part
The baseline hazard function (risk over time for someone with all covariates = 0) is **not assumed** to follow any distribution—it's left flexible.
Lesson 825What is the Cox Proportional Hazards Model?
non-probability sampling
method where you select individuals or items simply because they're easy to reach.
Lesson 239Convenience SamplingLesson 242Probability vs Non-Probability SamplingLesson 247Survivorship Bias
Non-probability sampling limitations
Lesson 242Probability vs Non-Probability Sampling
Non-regression contexts
Classification tasks, clustering, or similarity calculations
Lesson 638One-Hot Encoding Overview
Non-repeatable reads
Reading the same row twice and getting different values
Lesson 1116Transaction Isolation and Concurrency
Non-response bias
When certain groups don't respond to your survey.
Lesson 244Selection Bias and Its Causes
Non-stationarity
The statistical properties (mean, variance) often change over time—seasonal patterns, trends, and structural breaks are common
Lesson 704What Makes Time Series Data Different?
Non-technical audiences
(executives, stakeholders, general public) typically:
Lesson 1950Identifying Your Audience: Technical vs Non-Technical
Non-trivial
`StudentID → StudentName` (actually tells us something)
Lesson 1063Functional Dependencies
Nonlinear relationships
Curves, U-shapes, or other patterns that aren't straight lines
Lesson 1222Scatter Plots for Relationships
Normal posterior
Use the mean ± (z-score × standard deviation) where z-score comes from the normal distribution.
Lesson 1579Practical Computation of Credible Intervals
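A small sketch of that calculation using only the standard library. The posterior mean and standard deviation below are made-up numbers, and the function name is hypothetical:

```python
from statistics import NormalDist

def normal_credible_interval(post_mean, post_sd, level=0.95):
    """Equal-tailed interval: mean ± z × sd for a normal posterior."""
    # z is the standard normal quantile that leaves (1 - level) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return post_mean - z * post_sd, post_mean + z * post_sd

# Illustrative posterior: mean 0.12, standard deviation 0.02.
low, high = normal_credible_interval(0.12, 0.02)
print(f"95% credible interval: ({low:.4f}, {high:.4f})")
```

For a 95% level the z-score is approximately 1.96, so the interval extends just under two posterior standard deviations on each side of the mean.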
Normal-Inverse-Gamma (NIG)
distribution is a conjugate prior for the normal likelihood when both mean (μ) and variance (σ²) are unknown.
Lesson 1568Unknown Variance: Normal-Inverse-Gamma Model
Normal-Normal
conjugacy (known variance σ²):
Lesson 1554Updating Conjugate Priors with Data
Normal-Normal conjugacy
Normal prior + Normal likelihood = Normal posterior.
Lesson 1553Normal-Normal Conjugacy
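The update can be sketched as follows, assuming the observation variance is known and using the standard precision-weighted formulas. The function name and the numbers are illustrative assumptions:

```python
def normal_normal_update(prior_mean, prior_var, data, known_var):
    """Posterior mean and variance for a normal mean with known variance."""
    n = len(data)
    sample_mean = sum(data) / n

    prior_precision = 1.0 / prior_var   # how much the prior "weighs"
    data_precision = n / known_var      # how much the data "weigh"

    # Precisions add; the posterior mean is a precision-weighted average.
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (
        prior_precision * prior_mean + data_precision * sample_mean
    )
    return post_mean, post_var

post_mean, post_var = normal_normal_update(
    prior_mean=0.0, prior_var=1.0, data=[1.0, 1.0, 1.0, 1.0], known_var=1.0
)
print(post_mean, post_var)
```

With four observations and a prior worth one observation, the posterior mean sits four fifths of the way from the prior mean to the sample mean, and the posterior is again normal.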
Normality checks
Residuals should be roughly normally distributed (histogram, Q-Q plot)
Lesson 799Fitting and Diagnosing SARIMA Models
Normality holds
Your data (or sampling distribution) is approximately normal, especially with small samples (n < 30)
Lesson 398Choosing Between Parametric and Non-Parametric Tests
Normality required
Your data must be approximately normally distributed (Grubbs' is parametric)
Lesson 1389What is Grubbs' Test?
Normality violated
Severe skewness, outliers, or small samples from non-normal populations
Lesson 398Choosing Between Parametric and Non-Parametric Tests
Normality violated, small sample
→ Switch to non-parametric alternative (Mann-Whitney, Wilcoxon signed-rank)
Lesson 383Diagnostic Workflow: When to Proceed or Switch Tests
Normality violations
The Central Limit Theorem saves us here.
Lesson 382Robustness of t-Tests to Assumption Violations
Normalization
splitting tables so each stores one logical entity with proper relationships maintained through primary and foreign keys.
Lesson 1062Data Anomalies: Insert, Update, Delete
Normalize by the evidence
`P(data)` — a scaling constant that ensures probabilities sum to 1
Lesson 1545Calculating the Posterior Distribution
Normalize each metric
to a 0–100 scale first
Lesson 1699Engagement Scoring Systems
Normalize to UTC
Convert any timezone-aware timestamp to UTC for storage
Lesson 1042Working with Timestamps and Time Zones
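The lesson itself covers this in a database setting; as a language-neutral illustration, here is a sketch of the same idea in Python's standard library. The helper name and the sample timestamp are assumptions:

```python
from datetime import datetime, timedelta, timezone

def to_utc(ts):
    """Convert a timezone-aware datetime to UTC before storing it."""
    if ts.tzinfo is None or ts.utcoffset() is None:
        # A naive timestamp has no offset, so converting it would be a guess.
        raise ValueError("timestamp must be timezone-aware")
    return ts.astimezone(timezone.utc)

# 09:30 at a fixed UTC-05:00 offset (illustrative value).
local = datetime(2024, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
stored = to_utc(local)
print(stored.isoformat())
```

Rejecting naive timestamps up front avoids silently interpreting them in whatever zone the server happens to use.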
Normalizing values
means replacing variations with a single standard form.
Lesson 1138Cleaning and Standardizing Text Fields
North Star Metric
(NSM) is the one metric that best captures the core value your product or service delivers to customers.
Lesson 1604What is a North Star Metric?
Not collectively exhaustive
(missing 4 and 5)
Lesson 82Collectively Exhaustive Events
Not quite
Different libraries often maintain **separate random number generators** with independent states.
Lesson 2058Seed Scope and Multiple Libraries
Not reproducible
Others can't easily adapt your paths and settings
Lesson 2072Configuration Files vs Hard-Coded Values
Not Robust
For non-normal data, percentiles or IQR-based methods often work better than z-scores.
Lesson 201Z-Score Applications and Limitations
Not so fast
You've ignored the thousands of failed startups that *also* took big risks but went bankrupt.
Lesson 247Survivorship Bias
Not sure
Try multiple window sizes and compare how well they balance smoothness with responsiveness for your specific problem
Lesson 752Choosing the Window Size
Notebooks vs code
Use notebooks (`notebooks/`) for exploration and communication.
Lesson 2069Project Directory Structure
Novelty bias
Users often react differently to changes initially—either with excitement (novelty effect) or resistance (change aversion).
Lesson 1484Duration and Timing Considerations
Novelty Effect
Users interact more with something *because it's new and different*, not because it's actually better.
Lesson 1525Novelty and Primacy Effects
NPS surveys
that don't correlate with renewal rates in your specific business
Lesson 1616Metrics Divorced from Revenue
Null deviance
measures how poorly an intercept-only model (just predicting the overall mean/rate) fits your data.
Lesson 698Null and Residual Deviance
NULL values
from unmatched rows can affect your counts differently than expected
Lesson 933Aggregating with LEFT JOINs
Number at risk
= all subjects who haven't yet had the event *and* haven't been censored before time *t*
Lesson 812Handling Event Times and Censoring
Number of categories
How many distinct groups or bins your data falls into
Lesson 418Degrees of Freedom in Goodness of Fit
Number of events
= only those who actually experienced the event at time *t*
Lesson 812Handling Event Times and Censoring
Number of variables
Are you showing one variable, comparing two, or exploring relationships among three or more?
Lesson 1230Choosing the Right Chart Type
Numeric × Numeric
When both variables are continuous or discrete numbers, you want to assess linear relationships, strength, and direction.
Lesson 1182Choosing Analysis Methods by Variable Types
Numeric columns
Find the lowest and highest numbers
Lesson 885MIN and MAX: Finding Extremes
Numeric-to-Categorical
Compare distributions using grouped summary statistics and visualizations (box plots by group).
Lesson 1210Relationship Exploration: Correlation and Association
Numeric-to-Numeric
Use correlation coefficients (Pearson, Spearman) and correlation matrices to spot linear and monotonic relationships.
Lesson 1210Relationship Exploration: Correlation and Association
Numerical stability
Stan's implementation of Hamiltonian Monte Carlo (NUTS) includes automatic differentiation and careful numerical engineering
Lesson 1595Stan: High-Performance Bayesian Inference
NUTS (No-U-Turn Sampler)
is an advanced version of HMC that automatically tunes a critical parameter: how long to let the "ball" roll.
Lesson 1593Hamiltonian Monte Carlo and NUTS
NYC's coefficient
(say, +15): means NYC is 15 units higher than Boston
Lesson 643Interpreting Coefficients Relative to Reference

O

O'Brien-Fleming
Spends very little alpha early (conservative early looks), saving most for the final analysis
Lesson 1512Group Sequential Testing
Objective
is a clear, inspiring goal that describes *what* you want to achieve.
Lesson 1607Introduction to OKRs (Objectives and Key Results)
Objectives
are the qualitative, inspirational statements that describe *what* you want to achieve.
Lesson 1609Setting Effective Objectives
Observations are independent
Each observation doesn't influence others
Lesson 399When to Use the One-Sample Z-Test for Proportions
Observe
Collect data for a period (say, one day's worth of conversions)
Lesson 1582Updating Beliefs with Test Data
Observed
The actual count you got in each category
Lesson 417The Chi-Squared Test Statistic Formula
Observed - Expected
the raw difference
Lesson 428Post-Hoc Analysis and Residuals
Offline
Can use computationally intensive methods; you can look both forward *and* backward from any point
Lesson 1414Offline vs Online Change-Point Detection
Offline (batch) change-point detection
works like a detective reviewing cold cases.
Lesson 1414Offline vs Online Change-Point Detection
Omega-squared
provides a less biased, more conservative estimate by adjusting for sample size.
Lesson 445Effect Size: Eta-Squared and Omega-Squared
Omitted variable bias
A third variable influences both X and Y, creating a spurious relationship
Lesson 553Exogeneity: X Must Be Independent of Errors
Omitted variables
Important confounders are missing from your model and hide in the error term
Lesson 1464Instrumental Variables: The Endogeneity Problem
ON
is required for complex conditions (inequalities, multiple different columns)
Lesson 953Join Conditions: ON vs USING
ON condition
The matching rule, usually comparing a column from each table
Lesson 919Basic INNER JOIN Syntax
On macOS
Open Terminal and type `git --version`.
Lesson 1991Installing Git and Initial Configuration
Onboarding completion
When a user finishes setup steps
Lesson 1646Defining Cohort Start Events
Once spent, it's gone
You cannot query indefinitely—eventually you exhaust your budget and must stop
Lesson 1900Privacy Budget and Composition
One categorical independent variable
(the "factor") with **three or more levels/groups**
Lesson 438When to Use One-Way ANOVA
One continuous dependent variable
(the outcome you're measuring)
Lesson 438When to Use One-Way ANOVA
One idea per paragraph
Don't mix method explanation with result interpretation
Lesson 1967Writing Clear and Concise Analysis Sections
One numerical, one categorical
Do salaries differ by department?
Lesson 1181What is Bivariate Analysis?
One slide, one message
Don't cram.
Lesson 1944Executive Summary Best Practices
One-hot encoding
takes a different approach: it creates k dummy variables for k categories—one for *every* level, with no reference category left out.
Lesson 638One-Hot Encoding Overview
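A dependency-free sketch of the idea, with one indicator column per category and no reference level dropped. The helper name and city values are illustrative assumptions:

```python
def one_hot(values):
    """Return (levels, rows): one 0/1 column for every distinct category."""
    levels = sorted(set(values))   # k categories -> k columns
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

levels, encoded = one_hot(["Boston", "NYC", "Chicago", "NYC"])
print(levels)
for row in encoded:
    print(row)
```

Every row contains exactly one 1, so the columns always sum to 1. That redundancy is why regression models usually drop one level, while the contexts listed in this glossary for one-hot encoding keep all k.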
One-sided (maximum)
You're testing product dimensions where oversized items break downstream machinery, but undersized items are fine.
Lesson 1393Two-Sided vs One-Sided Grubbs' Test
One-sided (minimum)
You're checking server response times where slow responses matter, but faster-than-expected times are welcomed.
Lesson 1393Two-Sided vs One-Sided Grubbs' Test
One-sided (one-tailed) tests
These focus on a specific direction:
Lesson 1393Two-Sided vs One-Sided Grubbs' Test
One-size-fits-all
A news app *should* have high DAU/MAU; tax software shouldn't
Lesson 1694Daily Active Users (DAU) and Monthly Active Users (MAU)
One-step ahead forecasting
predicts just the next immediate time period (e.
Lesson 794Forecasting Concepts and Horizons
One-Step-Ahead Forecasting Only
Lesson 756Limitations of Moving Averages
One-tailed
H₁: p₁ > p₂ or H₁: p₁ < p₂ (testing for a specific direction)
Lesson 406Two-Sample Proportion Test Setup
Online
Must be fast enough to keep pace with incoming data; can only look backward at history
Lesson 1414Offline vs Online Change-Point Detection
Online (real-time) change-point detection
is like a security guard monitoring live camera feeds.
Lesson 1414Offline vs Online Change-Point Detection
Online communities
(Reddit's r/datascience, Twitter/X, LinkedIn, Discord servers, Stack Overflow) provide daily touchpoints.
Lesson 2144Networking and Community Engagement
Online reviews
Only people with strong opinions (very happy or very angry) typically write reviews
Lesson 246Volunteer and Self-Selection Bias
Online-only surveys
Excluding people without internet access
Lesson 249Coverage Error and Undercoverage
Open conflicting files
Look for conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`) showing both versions
Lesson 2018Resolving Conflicts During Rebase
Open Questions
"Does the marketing team track promo codes consistently?
Lesson 2100Documenting Assumptions and Open Questions
Open-source contributions
demonstrate your skills publicly while improving tools others use.
Lesson 2144Networking and Community Engagement
Opening balance method
Use only customers at period start (simpler, more conservative)
Lesson 1671Churn Rate Calculation Methods
OpenLineage
(open standard) embed lineage capture directly into your code.
Lesson 1164Tools for Lineage Tracking
OpenStreetMap
is the most popular open-source tile provider, offering street-level detail perfect for urban data visualization.
Lesson 1314Basemaps and Map Tiles
Operational alignment
Does the model match what your marketing team observes qualitatively?
Lesson 1734Comparing and Validating Attribution Models
operators
(templates for tasks like PythonOperator, BashOperator, or SQLOperator).
Lesson 1833Introduction to Apache AirflowLesson 1835Airflow Operators and Tasks
Opportunity
Can you convert core users to power users?
Lesson 1698Power User Curves and Engagement Distribution
Optimal bandwidth selectors
Methods like Imbens-Kalyanaraman or Calonico-Cattaneo-Titiunik that balance bias and variance
Lesson 1463RDD Bandwidth Selection and Local Estimation
Optimal intervention timing
Reach out *before* the high-risk window
Lesson 835Customer Churn Prediction with Survival Analysis
Optimization
Each channel has unique conversion funnels and drop-off patterns you can improve
Lesson 1711What Are Acquisition Channels?Lesson 1716Channel Mix and Portfolio Thinking
Optimization pressure
Algorithms optimize for accuracy on biased data, which means they get *better* at replicating and intensifying discriminatory patterns
Lesson 1882Algorithmic Amplification of Bias
Optimization traps
You improve the surrogate at the expense of the business metric (e.
Lesson 1518The Relationship Between Surrogate and Business Metrics
Optimize
your query by rearranging or combining operations
Lesson 1780Transformations vs Actions in Spark
Optimize execution
Tasks with no mutual dependencies can run in parallel
Lesson 1841Upstream and Downstream Dependencies
Optimize timing
See how much time passes between interactions
Lesson 1719The Customer Journey and Touchpoints
Optimize within bounds
for maximum revenue or profit, not individual metrics
Lesson 1759Optimizing ROAS, CAC, and Payback Together
Optimized execution
Spark's Catalyst optimizer rewrites your queries for performance
Lesson 1778DataFrames and Spark SQL Basics
Optimizes computation
by eliminating redundant operations
Lesson 1790What is Dask and When to Use It
Optimizing warranty periods
By fitting a Cox model or Kaplan-Meier curve to historical failure data, you can estimate what percentage of products will fail within 1 year, 2 years, etc.
Lesson 837Product Warranty and Failure Analysis
Orchestration layer
Manages task scheduling, dependencies, retries, and monitoring
Lesson 1822What is a Data Pipeline?
order
(q) tells you how many previous error terms to include.
Lesson 777Identifying MA Order (q) Using ACFLesson 951Join Order and Performance
Order conditions by likelihood
Place the most frequently matched conditions first to minimize unnecessary evaluations.
Lesson 1037CASE Best Practices and Performance
Order reversal
The reciprocal transformation **reverses the order** of your values when they all share the same sign: the largest positive value becomes the smallest.
Lesson 216Reciprocal and Inverse Transformations
Order your p-values
from smallest to largest: p₁ ≤ p₂ ≤ .
Lesson 1504Holm-Bonferroni MethodLesson 1506Benjamini-Hochberg Procedure
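The ordering step feeds a step-down comparison in the Holm-Bonferroni method. A minimal sketch, assuming a hypothetical helper `holm_bonferroni` and made-up p-values; the thresholds follow the standard rule alpha / (m - k + 1) for the k-th smallest p-value:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Visit the p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # rank is 0-based, so the divisor m - rank equals m - k + 1 for k = rank + 1.
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break  # first failure: this and all larger p-values are not rejected
    return reject


# Illustrative p-values (not from the course material).
p_values = [0.04, 0.001, 0.03, 0.012]
decisions = holm_bonferroni(p_values)
print(decisions)
```

The Benjamini-Hochberg procedure starts from the same ordering but uses different thresholds and works upward from the largest p-value, so it is not covered by this sketch.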
Ordered position
(left-to-right or top-to-bottom)
Lesson 1238Matching Encoding to Data Type
Ordering all event times
from earliest to latest
Lesson 809Introduction to the Kaplan-Meier Estimator
Orders
is a foreign key that must match a `customer_id` in **Customers**.
Lesson 1051Introduction to Foreign Keys
Orders table
order ID, customer ID, order amount
Lesson 918What is an INNER JOIN?
Ordinal
variables have categories with a natural, meaningful order or ranking.
Lesson 17Categorical Variables: Nominal and OrdinalLesson 392Wilcoxon Signed-Rank Test
Organic
Unpaid search traffic from Google, Bing, etc.
Lesson 1712Common Channel Categories
Organic search
(SEO-driven traffic)
Lesson 1711What Are Acquisition Channels?
Organizational conflicts
Your employer wants data to support a predetermined decision.
Lesson 35Conflicts of Interest and Independence
Organizational pressure
Your employer wants a particular conclusion to justify a strategy they've already committed to publicly.
Lesson 1930Managing Conflicts of Interest
Organize your 2×2 table
and identify b and c (the off-diagonal counts)
Lesson 436Conducting McNemar's Test
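Once b and c are identified, the test statistic depends only on those two counts. A minimal sketch, assuming the uncorrected chi-squared form (b - c)² / (b + c) with 1 degree of freedom; the function name `mcnemar` and the counts are illustrative:

```python
import math


def mcnemar(b, c):
    """McNemar's chi-squared statistic (no continuity correction) and its p-value."""
    stat = (b - c) ** 2 / (b + c)
    # For 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical off-diagonal counts from a 2x2 table of paired outcomes.
stat, p_value = mcnemar(b=15, c=5)
print(stat, p_value)
```

With small b + c, an exact binomial version of the test is usually preferred over this approximation.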
Orientation
Tilted lines among horizontal lines pop out
Lesson 1235Pre-Attentive Attributes
Orientation matters
Rotating the view can completely change the story your data tells—a sign the visualization isn't robust
Lesson 1329Effective Use and Pitfalls of 3D Visualizations
Origin
Database name, URL, file path, API endpoint, or vendor name
Lesson 1161Documenting Data Sources
ORM
(Object-Relational Mapper) is built on top of it.
Lesson 1118SQLAlchemy Core vs ORM
ORM (Object-Relational Mapper)
is a tool that lets you interact with database tables using Python objects and classes instead of writing raw SQL queries.
Lesson 1117What is an ORM and Why Use It?
Ornamental borders
Keep focus on the data itself
Lesson 1237Chart Junk and Data-Ink Ratio
Ornamental illustrations
(like pictures of coins on financial charts)
Lesson 1963Removing Chartjunk
Ornate borders and frames
Decorative elements around the chart
Lesson 1246Visual Clutter and Chartjunk
Orphaned tasks
(nodes with no connections)
Lesson 1846Testing and Validating Dependency Graphs
Orthographic projection
All objects maintain their size regardless of distance from the camera.
Lesson 1326Viewing Angles and Projection Types
Other cloud platforms
like AWS, Google Cloud, or Azure offer the most flexibility and scalability but demand more technical knowledge around servers, containers, and networking.
Lesson 1338Deployment and Sharing Dashboards
Outcome-focused
– Measures results, not activities
Lesson 1610Defining Measurable Key Results
Outcomes are mutually exclusive
You can't have both success and failure simultaneously
Lesson 123Bernoulli Trial Definition and Properties
Outdated lists
Phone directories missing new residents or unlisted numbers
Lesson 249Coverage Error and Undercoverage
Outer query
Average those department sums
Lesson 973Nested Subqueries in FROM
Outer query alias
(`outer`): identifies columns from the main query
Lesson 976Basic Correlated Subquery Syntax
outlier
in regression is an observation with an unusual **Y value** given its X value—it doesn't follow the pattern of the other data points.
Lesson 587Identifying Outliers in Regression ContextLesson 1389What is Grubbs' Test?
Outlier Detection
If a data point has a z-score beyond ±3, it's unusual enough to investigate.
Lesson 201Z-Score Applications and LimitationsLesson 1157Statistical Anomaly Detection in QA
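The ±3 rule can be sketched in a few lines. This assumes the sample standard deviation and a hypothetical helper `zscore_outliers`; the data are made up:

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [x for x in values if abs((x - mu) / sigma) > threshold]


# Nineteen typical readings and one extreme reading (illustrative data).
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 50]
outliers = zscore_outliers(readings)
print(outliers)
```

Because the extreme value inflates both the mean and the standard deviation, very small samples can never produce a z-score beyond 3, which is one of the limitations of this approach.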
Outliers and influential points
Which observations have unusually large residuals that might distort your model?
Lesson 544The Role of Residuals in Diagnostics
Outliers present
The median is *robust*—extreme values don't affect it.
Lesson 42Comparing Mean, Median, and Mode
Output
A chi-squared-like test statistic with degrees of freedom = (k - 1), where k = number of conditions
Lesson 474Friedman Test: Non-Parametric Repeated Measures ANOVALesson 1580Bayesian vs Frequentist A/B Testing
Output(s)
The component property you'll update (e.
Lesson 1335Dash Callbacks: Adding Interactivity
Outputs are generated
Everything in `reports/` and `models/` should be reproducible from code—don't edit these files manually.
Lesson 2069Project Directory Structure
Outside the bounds
The autocorrelation is **statistically significant**—there's likely a real pattern at that lag
Lesson 723Significance Bounds in ACF Plots
Over-controlling
Adding every available variable to your model without checking the DAG.
Lesson 1476Common DAG Patterns and Pitfalls
Over-crediting vanity metrics
that didn't drive real outcomes
Lesson 1637What is Metric Attribution?
Over-differencing
can introduce unnecessary complexity and make patterns harder to model.
Lesson 736Higher-Order Differencing
Over-investing
in channels that capture existing demand rather than create it
Lesson 1717Incrementality and True Channel Impact
Over-smoothing
Mean income by state masks extreme inequality within states
Lesson 1245Misleading Aggregations and Binning
Overall Equipment Effectiveness (OEE)
is the gold standard for measuring production efficiency.
Lesson 1636Manufacturing Metrics: OEE, Yield, and Cycle Time
Overall model F-test
The global significance doesn't change
Lesson 647Impact on Model Results and Reporting
Overdispersion
occurs when the actual variance in your data significantly exceeds the mean—violating this core Poisson assumption.
Lesson 693Overdispersion in Count Data
Overfitting risk
increases with unnecessary predictors
Lesson 1197Identifying Variable Importance and Redundancy
Overhead allocation
portion of office space, utilities for marketing/sales teams
Lesson 1753Customer Acquisition Cost (CAC): Components and Calculation
Overlap
Too many bubbles or extreme size differences can create clutter—consider transparency or interactive tooltips
Lesson 1229Bubble Charts for Three Variables
Overlay geoms
Additional data layers for comparison
Lesson 1355Layer Order and Plot Composition

P

p < 0.05
(common threshold): Strong evidence that survival curves differ significantly.
Lesson 822Interpreting Log-Rank Test ResultsLesson 1692Statistical Significance and Iteration
P-value < 0.05
(or your chosen α): Reject the null → series is **stationary**
Lesson 716Augmented Dickey-Fuller Test
P-value ≥ 0.05
Fail to reject → series is **non-stationary** (has unit root)
Lesson 716Augmented Dickey-Fuller Test
P(A ∩ B)
is the probability that *both* A and B occur (the intersection)
Lesson 92Definition and Notation of Conditional Probability
P(A)
= your **prior belief** (what you thought before seeing evidence)
Lesson 108Updating Beliefs with New Evidence
P(A) = Σ P(A|Bᵢ) × P(Bᵢ)
Lesson 97Law of Total Probability
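As a worked instance of the sum above, suppose three events B₁, B₂, B₃ partition the sample space. The numbers below are hypothetical and only illustrate the arithmetic:

```python
# P(B_i): the partition probabilities must sum to 1 (illustrative values).
p_b = [0.5, 0.3, 0.2]
# P(A | B_i): the conditional probability of A within each part of the partition.
p_a_given_b = [0.01, 0.02, 0.05]

# Law of total probability: weight each conditional probability by P(B_i) and sum.
p_a = sum(cond * prior for cond, prior in zip(p_a_given_b, p_b))
print(p_a)
```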
P(B|A) = P(B)
, then A and B are independent
Lesson 105Independence in Conditional Probability
P(both Aces)
= (4/52) × (3/51) ≈ 0.
Lesson 88General Multiplication Rule
P(event)
alongside **P(positive test | event)**
Lesson 110Base Rate Fallacy
P(Evidence | Innocent)
probability of seeing this evidence if innocent
Lesson 112Legal Evidence and Jury Reasoning
P(positive test | event)
Lesson 110Base Rate Fallacy
P(X < a)
Probability that X is less than some value *a* — the area to the *left* of *a*
Lesson 173Calculating Probabilities with the Normal Distribution
P(X = k)
The probability that random variable X equals exactly k successes
Lesson 127Binomial Distribution PMF
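For reference, the binomial PMF can be evaluated directly from its definition. A minimal sketch, with `binomial_pmf` as an illustrative name and n trials with success probability p as the assumed parameters:

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 3 heads in 10 fair coin flips.
prob = binomial_pmf(3, 10, 0.5)
print(prob)
```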
P(X > b)
Probability that X is greater than *b* — the area to the *right* of *b*
Lesson 173Calculating Probabilities with the Normal Distribution
P(X > k)
"More than k events" (complement of cumulative)
Lesson 143Cumulative Poisson Probabilities
P(X ≤ k)
"At most k events" (cumulative probability)
Lesson 143Cumulative Poisson Probabilities
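Both cumulative quantities can be computed without tables by summing the Poisson PMF. A minimal sketch, assuming a rate parameter called `lam`; the function names are illustrative:

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k): sum the Poisson PMF from 0 through k."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


def poisson_sf(k, lam):
    """P(X > k): the complement of the cumulative probability."""
    return 1.0 - poisson_cdf(k, lam)


at_most_3 = poisson_cdf(3, lam=2.0)
more_than_3 = poisson_sf(3, lam=2.0)
print(at_most_3, more_than_3)
```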
P(Z < −1.23)
= same as P(Z > 1.
Lesson 198Using Z-Tables for Probability
PACF
(Partial Autocorrelation Function), however, measures **only the direct relationship** at lag k, controlling for all intermediate lags.
Lesson 728PACF vs ACF: Key DifferencesLesson 733Using ACF and PACF TogetherLesson 798SARIMA Model Selection
PACF of residuals
, you're checking the same thing:
Lesson 786ACF and PACF of Residuals
PACF plot
to identify **p** (AR order).
Lesson 779The Box-Jenkins Methodology
Package versions
(exact versions of every library you import)
Lesson 2038What is Environment Management and Why It Matters
Page 1
`LIMIT 25 OFFSET 0` (rows 1-25)
Lesson 878OFFSET: Skipping Rows for Pagination
Page 2
`LIMIT 25 OFFSET 25` (rows 26-50)
Lesson 878OFFSET: Skipping Rows for Pagination
Page 3
`LIMIT 25 OFFSET 50` (rows 51-75)
Lesson 878OFFSET: Skipping Rows for Pagination
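The three pages above follow one pattern: OFFSET = (page number - 1) × page size. A minimal sketch using Python's built-in `sqlite3` module, with a hypothetical `items` table of 80 rows:

```python
import sqlite3

PAGE_SIZE = 25

# In-memory table with ids 1..80 (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items (id) VALUES (?)", [(i,) for i in range(1, 81)])


def fetch_page(page_number):
    """Return the ids on a 1-based page; ORDER BY keeps the paging deterministic."""
    offset = (page_number - 1) * PAGE_SIZE
    rows = conn.execute(
        "SELECT id FROM items ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    ).fetchall()
    return [row[0] for row in rows]


page_3 = fetch_page(3)
print(page_3[0], page_3[-1])
```

Without an ORDER BY clause the database is free to return rows in any order, so the same row could appear on two pages.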
Page view
→ **CTA click** → **Conversion**
Lesson 1690Landing Page and CTA Optimization
Paid
Any channel where you pay for placement (Google Ads, Facebook Ads, display networks, sponsored content)
Lesson 1712Common Channel Categories
Paid advertising
(Google Ads, Facebook, display networks)
Lesson 1711What Are Acquisition Channels?
Paid CAC
isolates only the costs and customers from *paid advertising channels*:
Lesson 1754Blended CAC vs Paid CAC
Paid Search
has a 4-month payback, while **Referral** pays back in 6 months.
Lesson 1758Cohort-Based Payback Analysis
Paired or Repeated Measurements
If you measure the same subjects twice (before/after treatment), those measurements aren't independent—they're linked to the same person.
Lesson 381Independence Assumption and Its Violations
paired t-test
, which analyzes the *differences* within each pair, effectively reducing the problem to a one-sample test on those differences
Lesson 360Independent vs. Dependent SamplesLesson 369When to Use a Paired t-TestLesson 375Paired t-Test vs Two-Sample t-Test
Paired t-tests
Remember, it's the *differences* that need to be normally distributed, not the original paired observations.
Lesson 376The Assumption of Normality in t-Tests
Pairing comparable units
Each treated unit gets matched with one or more control units based on observed characteristics (covariates)
Lesson 1445The Matching Framework
Pairing related items
Identify rows that share common attributes but differ in others
Lesson 947Self-Joins for Comparisons Within a Table
Pairwise independence
means every *pair* of events is independent.
Lesson 103Mutual Independence vs Pairwise Independence
Pan and zoom
capabilities for exploring dense datasets
Lesson 1300Creating Basic Interactive Charts with Plotly Express
Pandas
, you use `pivot()` or `pivot_table()`:
Lesson 1146Pivoting Data Wider (Cast)
Paper submissions
`paper-submission-neurips2024`
Lesson 2037Tagging Releases and Experiment Snapshots
Parallel lines
→ No interaction; effects are independent
Lesson 466Visualizing Interactions
parallel trends
without treatment, both groups would have changed similarly over time—a critical assumption you'll need to verify in practice.
Lesson 1452The Difference-in-Differences SetupLesson 1453The Parallel Trends AssumptionLesson 1746Geo-Lift Experiments
Parameter uncertainty
The spread of the distribution shows how confident you are.
Lesson 1547Interpreting Posterior Distributions
Parameters
are numerical characteristics that describe a population.
Lesson 228Defining Populations and ParametersLesson 229Defining Samples and Statistics
Parametric part
The model assumes covariates affect hazard through a mathematical formula with parameters (coefficients) you estimate.
Lesson 825What is the Cox Proportional Hazards Model?
Parental/guardian consent
for children, plus age-appropriate explanations
Lesson 1918Special Populations and Vulnerable Groups
Pareto
describes heavy-tailed phenomena where extreme values are common—wealth distributions, file sizes on servers, or social network connections.
Lesson 193Choosing Between Distributions in Practice
Pareto distribution
, which you learned about in the previous lesson.
Lesson 191Pareto Principle and the 80/20 Rule
Pareto Principle
, also called the **80/20 rule**.
Lesson 191Pareto Principle and the 80/20 Rule
Parquet
is a compressed, column-oriented format designed for efficiency.
Lesson 1129Parquet and Feather: Columnar FormatsLesson 1779Reading and Writing Data in Spark
Parquet and Feather
are columnar formats optimized for analytics.
Lesson 1133Performance Considerations Across Formats
Partial
Part of a composite key determines an attribute
Lesson 1063Functional Dependencies
Partial autocorrelation
(PACF) solves this by measuring the *direct* correlation between observations separated by k time steps, *after removing* the influence of all the intermediate lags.
Lesson 727What is Partial Autocorrelation?
Partial duplicate detection
Identify rows that match on key fields (like name and birthdate) but differ elsewhere—these might represent the same entity entered multiple ways.
Lesson 1154Uniqueness and Duplication Checks
Partial F-Test
(which you learned in lesson 623) to formally test whether the extra predictors significantly improve the model
Lesson 626Nested vs Non-Nested Models
Partial failure recovery
uses **checkpoints** and **transaction boundaries** to save progress at strategic points.
Lesson 1853Partial Failure Recovery
Partial Failure Risk
What if update #1 succeeds but update #2 fails?
Lesson 1075Handling Data Consistency in Denormalized Schemas
partition
of a sample space is a special collection of events that satisfies two critical properties simultaneously:
Lesson 83Partitions of the Sample SpaceLesson 97Law of Total ProbabilityLesson 1782Spark Performance Basics: Partitions and Caching
Partitioning
and **clustering** tell the warehouse how to physically organize your data so queries can skip entire chunks of irrelevant data.
Lesson 1812Partitioning and Clustering Strategies
Partnerships
(co-marketing, affiliate programs)
Lesson 1711What Are Acquisition Channels?
Past interactions
Previous purchases, feature usage patterns
Lesson 1689Multivariate Testing and Personalization
Patient satisfaction scores
capture experience quality through surveys such as Net Promoter Score (NPS) or HCAHPS, serving as leading indicators for loyalty and reputation.
Lesson 1633Healthcare Metrics: Patient Outcomes and Operational Efficiency
Pattern
Points curve **above** the line at the upper-right end and **below** the line at the lower-left end, like a gentle S-curve.
Lesson 567Common Q-Q Plot Patterns: Heavy Tails and Light TailsLesson 722ACF Plots and InterpretationLesson 726Using ACF for Model Identification
Pattern + Color
In bar charts or area plots, add hatching, dots, or line patterns alongside color fills
Lesson 1251Avoiding Reliance on Color Alone
Pattern over-generalization
Models find and exploit subtle correlations in biased data that humans might overlook (e.
Lesson 1882Algorithmic Amplification of Bias
Patterns in plots
non-random patterns suggest model misspecification (e.
Lesson 701Deviance Residuals
PCA
when you want speed, interpretability, and care about global structure.
Lesson 1196Dimensionality Reduction for Visualization
PDF acts like weights
, telling you which regions contribute more to the average.
Lesson 159Expected Value and Variance for Continuous Variables
Pearson
detects *linear* relationships: as X increases by a constant amount, Y changes by a constant amount
Lesson 487When to Use Spearman vs PearsonLesson 1184Correlation Coefficients in Bivariate Analysis
Pearson correlation
is your go-to for linear relationships between normally distributed variables.
Lesson 1184Correlation Coefficients in Bivariate Analysis
Pearson's r
measures the strength and direction of the linear relationship between two variables
Lesson 534R-Squared vs Correlation Squared
Peek frequently
without invalidating your test
Lesson 1510Sequential Testing Overview
Peer groups
"Among similar-sized companies, we rank in the top 10%"
Lesson 1962Contextualizing Numbers
Pennies-per-terabyte storage
(Amazon S3, Google Cloud Storage): Storing raw data became so cheap that the cost of keeping everything in its original form was negligible
Lesson 1818The Rise of ELT: Cloud Storage and Compute
Percentage contribution
`value / SUM(value) OVER (PARTITION BY category)`
Lesson 1019Comparing Values to Window Aggregates
Percentage of total
`sale_amount / regional_total * 100`
Lesson 1019Comparing Values to Window Aggregates
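Both percentage expressions divide a row's value by a window total. A minimal sketch using Python's built-in `sqlite3` module, assuming a hypothetical `sales` table and an SQLite build with window-function support (3.25 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 100.0), ("East", 300.0), ("West", 50.0), ("West", 150.0)],
)

# SUM(...) OVER (PARTITION BY region) attaches each region's total to every row,
# so each sale can be expressed as a percentage of its own region.
rows = conn.execute(
    """
    SELECT region,
           sale_amount,
           sale_amount / SUM(sale_amount) OVER (PARTITION BY region) * 100 AS pct_of_region
    FROM sales
    ORDER BY region, sale_amount
    """
).fetchall()

for row in rows:
    print(row)
```

The column is stored as REAL here on purpose: with integer columns, SQLite would perform integer division and truncate the ratio to 0.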
percentile
tells you what percentage of the data falls *below* a specific value.
Lesson 56Understanding Percentiles and Their InterpretationLesson 199Finding Percentiles with Z-Scores
Perfect
multicollinearity means two or more predictors are perfectly linearly related—one can be expressed as an exact linear combination of the others.
Lesson 551No Perfect Multicollinearity in Simple Regression
Performance benchmarks
and expected behavior
Lesson 2091Stage 7: Communication and Handoff
Performance bottlenecks
When specific queries consistently timeout or slow down user experience despite indexing and optimization.
Lesson 1071When to Denormalize: Performance Trade-offs
Performance monitoring
Query optimization as data volume grows
Lesson 1979Maintenance and Sustainability Considerations
Performance needs
(C-based drivers like psycopg2 are faster than pure-Python alternatives)
Lesson 1087Database Drivers and Connection Libraries
Performance thresholds
"The model must achieve at least 85% accuracy" or "reduce processing time by 30%"
Lesson 2117Defining 'Good Enough' with Stakeholders
Performance-driven iteration
Your model doesn't meet accuracy thresholds, prompting cycles through feature engineering, data collection, or even problem rescoping.
Lesson 2092Iteration and Feedback Loops in Practice
Period 0
Usually 100% (the acquisition event itself)
Lesson 1648Cohort Retention Rates
Period 1
% who returned in the first subsequent period
Lesson 1648Cohort Retention Rates
Period 2
% who returned in the second period, and so on
Lesson 1648Cohort Retention Rates
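The period-by-period rates can be sketched as a simple division by the cohort's starting size. The helper name `retention_rates` and the counts are hypothetical:

```python
def retention_rates(active_by_period):
    """Convert active-user counts per period into retention rates.

    active_by_period[0] is the cohort size at acquisition (Period 0).
    """
    cohort_size = active_by_period[0]
    return [active / cohort_size for active in active_by_period]


# A cohort of 1,000 users with 400 active in Period 1 and 250 in Period 2.
rates = retention_rates([1000, 400, 250])
print(rates)
```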
Period selection
Choose periods that match your business cycle.
Lesson 1671Churn Rate Calculation Methods
Periodic patterns
Cyclical or wave-like relationships
Lesson 1189Detecting Nonlinear Relationships
Permanent
Fail fast, log the issue, alert immediately, possibly route to a dead-letter queue for investigation
Lesson 1849Transient vs Permanent Failures
Permanent failures
usually involve:
Lesson 1849Transient vs Permanent Failures
Permissions and licensing
Who authorized access?
Lesson 1161Documenting Data Sources
Permissive licenses
(like MIT, BSD, and Apache 2.
Lesson 2081Understanding Open Source Licenses
Permutation methods
are useful when testing whether two groups differ.
Lesson 291Non-Parametric Alternatives for Difference Intervals
Permutation or bootstrap approaches
Distribution-free methods that don't assume normality
Lesson 470When Parametric ANOVA Assumptions Fail
Permutation tests
offer a clever alternative: they use resampling to build a reference distribution from your own data.
Lesson 502Permutation Tests for Correlation
Person-time
Modeling disease incidence rates with different follow-up durations
Lesson 692Offset Terms for Exposure
Personal conflicts
A friend asks you to help prove their startup idea will work.
Lesson 35Conflicts of Interest and Independence
Personal relationships
You're analyzing data about a friend's project or a competitor of someone close to you.
Lesson 1930Managing Conflicts of Interest
Personalization
High-value segments might receive premium support, exclusive offers, or early access to features.
Lesson 1669LTV Segmentation and Targeting
Personalize experiences
Tailor messaging based on where users are in their journey
Lesson 1719The Customer Journey and Touchpoints
Perspective projection
(default): Objects farther away appear smaller, mimicking how human eyes see the world.
Lesson 1326Viewing Angles and Projection Types
Peto test
(Peto-Peto modification) uses a weighting scheme between log-rank and Wilcoxon.
Lesson 823Log-Rank Test vs Other Tests
Phantom reads
A query returns different rows on repeat execution because another transaction inserted/deleted data
Lesson 1116Transaction Isolation and Concurrency
Phi
is a special case used exclusively for 2×2 contingency tables.
Lesson 429Effect Size: Cramér's V and Phi
Pick a population distribution
(any shape—uniform, exponential, skewed, bimodal, doesn't matter)
Lesson 222Visualizing the CLT with Simulations
Pick comparable cohorts
Same definition (e.
Lesson 1659Comparing Retention Across Cohorts
Pie charts
Displaying parts of a whole (market share, budget allocation) — use sparingly and only with 2-5 slices
Lesson 1959Choosing Familiar Chart Types
Pipeline delays
Data usually arrives at 6 AM but starts arriving at 9 AM.
Lesson 2136Monitoring Gaps and Silent Failures
Pipeline Runtime
How long each run takes from start to finish.
Lesson 1856Key Metrics to Monitor
Pipeline validation
Detect unexpected changes in upstream data sources
Lesson 1871Why Version Control for Data?
Pipenv
is, like Poetry, a modern dependency manager that treats your project like a publishable package from day one.
Lesson 2051Poetry and Modern Python Tools
Pitfall
Conditioning on M blocks the path from A to Y, hiding the causal effect you want to measure.
Lesson 1476Common DAG Patterns and Pitfalls
Plan changes
Modifying a task affects everything downstream
Lesson 1841Upstream and Downstream Dependencies
Planned vs. exploratory comparisons
Pre-specified contrasts vs.
Lesson 824Multiple Group Comparisons
Platform differences
compound the problem: a package compiled for Windows may behave differently than its macOS version, and the ARM architecture on newer Macs requires different binaries than Intel chips.
Lesson 2048The Dependency Hell Problem
Platform-friendly
Ad networks often require the slot to be filled
Lesson 1747Ghost Ads and PSA Tests
Platykurtic
(kurtosis < 3 or excess kurtosis < 0): Light tails and a flatter peak.
Lesson 66Kurtosis: Definition and Interpretation
Plausibility
Does a reasonable mechanism explain *how* X could cause Y, given current scientific knowledge?
Lesson 498Bradford Hill Criteria for Causation
Plot your data
boxplots and histograms for each group
Lesson 290Assumptions and Diagnostics for Difference Intervals
Plotly Express
by specifying an `animation_frame` parameter pointing to your time or category column.
Lesson 1306Animation and Time-Based Transitions
Plotting utilities
(`utils/plotting.
Lesson 2075Utility Modules and Helper Functions
Pocock
Distributes alpha more evenly across all looks
Lesson 1512Group Sequential Testing
Poetry
(and similarly, **Pipenv**) are modern dependency managers that treat your project like a publishable package from day one.
Lesson 2051Poetry and Modern Python Tools
Pointers
to storage locations rather than storing full copies in version control
Lesson 1871Why Version Control for Data?
Pointers, not files
For large models and datasets, commit references (like file hashes or DVC tracking files) rather than the actual binaries
Lesson 2034Committing Data Artifacts and Model Outputs
Points
(`geom_point`) for scatter plots showing individual observations
Lesson 1342Geometric Objects (geoms)
Points along the line
Perfect or near-perfect normality.
Lesson 566Reading Q-Q Plots: Interpreting Points Along the Reference Line
Points on the diagonal
Your data matches the normal distribution well
Lesson 204Q-Q Plots: Theory and Interpretation
Poisson probability tables
or statistical software.
Lesson 143Cumulative Poisson Probabilities
Poisson-distributed variables
(events occurring at a constant rate)
Lesson 213Square Root and Cube Root Transformations
Polar coordinates
transform bar charts into pie charts or create radial plots
Lesson 1344Scales and Coordinate Systems
Polynomial features
let you capture these curves *within* a linear regression framework by adding powers of your existing variables.
Lesson 657What Are Polynomial Features?Lesson 662Polynomial Features vs Splines
Pool all observations
and randomly reassign them to groups
Lesson 395Permutation Tests for Means and Beyond
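The pooling and reassignment step is the core of a permutation test for a difference in means. A minimal sketch, assuming a two-sided test, 2,000 random shuffles, and a fixed seed for repeatability; `permutation_test` and the data are illustrative:

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Approximate two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled observations to two groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


p_value = permutation_test([12, 14, 15, 16, 18, 19], [5, 6, 7, 8, 9, 10])
print(p_value)
```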
Pooled variance
assumes both groups have the same underlying population variance.
Lesson 285Pooled vs Unpooled Variance Approaches
pooled variance t-test
is specifically designed for situations where you can reasonably assume both populations have the **same variance** (even if their means differ).
Lesson 361Pooled Variance t-TestLesson 362Welch's t-Test for Unequal VariancesLesson 379The Assumption of Equal Variances (Homoscedasticity)
Poor decision-making
based on correlation rather than causation
Lesson 1637What is Metric Attribution?
Poor interpretation
(stakeholders misread what the metric actually measures)
Lesson 1619What is Metric Ownership?
Poor model fit
Standard models assume stable variance and mean
Lesson 734Why Differencing and Detrending Matter
Poor objective
"Increase metrics"
Lesson 1609Setting Effective Objectives
Population distribution
is the complete album of everyone's heights in a country — every single person.
Lesson 258Comparing Population, Sample, and Sampling Distributions
Population mean (μ)
The average of all values in the population
Lesson 228Defining Populations and Parameters
Population proportion (p)
The fraction of the population with a certain characteristic
Lesson 228Defining Populations and Parameters
Population standard deviation (σ)
How spread out the population values are
Lesson 228Defining Populations and ParametersLesson 292Sample Size for Estimating a Mean
Population variability
(σ): More spread in the population → larger SE
Lesson 260Defining Standard Error
Population variance
Divide by **N** (total count of all values)
Lesson 50Population vs Sample Variance
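Python's `statistics` module exposes both denominators, which makes the difference easy to see. A minimal sketch with illustrative data:

```python
from statistics import pvariance, variance

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population variance: squared deviations divided by N.
pop_var = pvariance(data)
# Sample variance: squared deviations divided by n - 1.
sample_var = variance(data)

print(pop_var, sample_var)
```

The sample version is always the larger of the two for the same data, because it divides the same sum of squares by a smaller number.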
Population variance (σ²)
Expected variability in each group
Lesson 289Sample Size Requirements for Difference Intervals
Portfolio thinking
means treating your channels like investments:
Lesson 1716Channel Mix and Portfolio Thinking
Position + Color
Use spatial separation or faceting along with color coding
Lesson 1251Avoiding Reliance on Color Alone
Position along an axis
(the most accurate encoding)
Lesson 1238Matching Encoding to Data Type
Position along non-aligned scales
(e.
Lesson 1232Perceptual Accuracy Hierarchy
Positive coefficient
→ that category has a higher average outcome than the reference
Lesson 637Interpreting Dummy Variable Coefficients
Positive correlation
Points trend upward from left to right (as one variable increases, so does the other)
Lesson 1222Scatter Plots for Relationships
Positive r
Variables move together (height and weight, study time and test scores)
Lesson 477Interpreting the Correlation Coefficient
Positive residuals
(`e_i > 0`) occur when the actual value is *above* the fitted line.
Lesson 540The Residual Formula
Positive slope
X and Y move together (X ↑ means Y ↑)
Lesson 524The Meaning of the Slope
Post
is a binary indicator (1 if observation is from post-treatment period, 0 if pre-treatment)
Lesson 1455DiD with Regression
Post-hoc considerations
include:
Lesson 824Multiple Group Comparisons
Post-hoc tests
(meaning "after this") are designed to make pairwise comparisons *after* finding a significant ANOVA result.
Lesson 455Why Post-Hoc Tests Are Needed After ANOVA
posterior distribution
is the end result of Bayesian inference—it's what you *actually care about*.
Lesson 1537The Posterior DistributionLesson 1539Interpreting Posterior ProbabilitiesLesson 1563Sequential Updating with New Data
Posterior distributions
tell you not just "which variant is winning?
Lesson 1586Multi-Armed Bandit ConnectionsLesson 1587Bayesian A/B Testing in Practice
Posterior mean
a weighted average of your prior mean and the sample mean, weighted by their precisions (inverse variances)
Lesson 1553Normal-Normal ConjugacyLesson 1561Posterior Mean and Mode
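The precision-weighted average can be sketched as follows, assuming a normal prior, normally distributed data with known variance, and a hypothetical helper named `normal_posterior`:

```python
def normal_posterior(prior_mean, prior_var, sample_mean, data_var, n):
    """Posterior mean and variance for a normal mean with known data variance."""
    prior_precision = 1.0 / prior_var   # precision is the inverse of variance
    data_precision = n / data_var       # n observations, each with variance data_var
    post_precision = prior_precision + data_precision
    # Each mean is weighted by its share of the total precision.
    post_mean = (prior_precision * prior_mean + data_precision * sample_mean) / post_precision
    return post_mean, 1.0 / post_precision


# Illustrative numbers: prior and data carry equal precision, so the
# posterior mean lands halfway between 10 and 12.
post_mean, post_var = normal_posterior(
    prior_mean=10.0, prior_var=4.0, sample_mean=12.0, data_var=16.0, n=4
)
print(post_mean, post_var)
```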
Posterior Mode
The peak of the posterior distribution, also called the Maximum A Posteriori (MAP) estimate — the single most probable value.
Lesson 1561Posterior Mean and Mode
Posterior predictive checks
answer this by simulating new datasets from your posterior distribution and comparing them to your observed data.
Lesson 1596Posterior Predictive Checks and Model Comparison
Posterior variance
combines information from both the prior and the data
Lesson 1553Normal-Normal Conjugacy
Posterior: P(B|A)
Your *updated* belief about A *after* observing evidence B
Lesson 107Bayes' Theorem Formula and Components
Power analysis
is the process of determining the minimum sample size required to detect an effect of a given size with adequate statistical power, all while controlling your Type I error rate (alpha).
Lesson 344Power Analysis in Study Design
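For a two-group comparison of means, a common normal-approximation formula is n per group = 2 × ((z₁₋α/₂ + z_power) × σ / δ)². A minimal sketch under that assumption (equal group sizes, known common standard deviation, two-sided test); the function name is illustrative:

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, sigma, alpha=0.05, power=0.80):
    """Approximate per-group sample size for detecting a mean difference `effect`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) * sigma / effect) ** 2
    return math.ceil(n)  # round up: sample sizes must be whole numbers


n_per_group = sample_size_per_group(effect=0.5, sigma=1.0)
print(n_per_group)
```

Calculations based on the t-distribution give slightly larger answers, so treat this as a planning estimate rather than an exact requirement.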
Power-imbalanced contexts
Employees consenting to employer tracking, students in research studies, prisoners, or patients in medical settings
Lesson 1918Special Populations and Vulnerable Groups
powerful
when your data is truly normal, but it's very sensitive to non-normality—it might reject equal variances simply because your data isn't perfectly bell-shaped, not because variances actually differ.
Lesson 380Testing Equal Variances: Levene's and Bartlett's TestsLesson 450Homogeneity of Variance (Homoscedasticity)
Practical
Works when no complete population list exists
Lesson 238Multistage Sampling
Practical limit
Most effective trees are 3-5 levels deep with 3-7 branches per node
Lesson 1623Depth vs Breadth in Metric Trees
Praise good work
When you see clever solutions or clear code, say so!
Lesson 2024Code Review Best Practices
Pre-creates
a set number of connections when your application starts
Lesson 1092Connection Pooling Basics
Pre-experiment validation
means running tests to ensure your randomization works properly and your metrics behave as expected *before* you expose users to actual treatment differences.
Lesson 1483Pre-Experiment Validation
Pre-filtering problems
Different groups experiencing different dropout rates during assignment
Lesson 1524Sample Ratio Mismatch (SRM)
Pre-register analyses
Decide your approach *before* seeing results
Lesson 30The Reproducibility Crisis and Solutions
Pre-register your alpha level
(usually 0.
Lesson 368Common Pitfalls and Best Practices
Pre-registration
means writing down your hypotheses, metrics, sample size, stopping rules, and correction methods *before* you peek at any results.
Lesson 1508Pre-Registration and Correction Strategy
precise control
, especially when creating multiple subplots or building complex visualizations.
Lesson 1256Two Interfaces: pyplot vs Object-OrientedLesson 1277Adjusting Subplot Spacing and Layout
Precision & Recall
For classification problems, how many relevant items did you catch, and how many false alarms did you trigger?
Lesson 14Model Evaluation and Validation
Precision is needed
Reading exact values from 3D axes is significantly harder than 2D
Lesson 1329Effective Use and Pitfalls of 3D Visualizations
Precision matters
viewers need to read exact values or make close comparisons
Lesson 1233Position as the Most Effective Channel
Predicting product lifespans
Manufacturers use k < 1 to model defects caught in early testing
Lesson 188Weibull Distribution: Hazard Function and Reliability
prediction intervals
a range where we expect the true value to fall with a certain confidence level (often 80% or 95%).
Lesson 794Forecasting Concepts and HorizonsLesson 800Generating Forecasts with SARIMA
Prediction intervals grow wider
as the forecast horizon extends.
Lesson 800Generating Forecasts with SARIMA
Predictions
Every observation gets the identical predicted value
Lesson 647Impact on Model Results and Reporting
Predictive parity
When the model predicts success, is it equally accurate across groups?
Lesson 1884Detecting Bias in Your Data
Predicts sustainable growth
– When it improves, revenue and retention typically follow
Lesson 1604What is a North Star Metric?
Preliminary evidence
correlation coefficients, group comparisons, or statistical summaries that suggest the hypothesis may hold (e.
Lesson 1203Documenting Hypotheses and Evidence
Preprocessing utilities
(`utils/preprocessing.
Lesson 2075Utility Modules and Helper Functions
Prerequisites
Required software, packages, and versions
Lesson 1989Best Practices for Sharing Reproducible Reports
Present contradictory evidence
that challenges your hypothesis
Lesson 1929Avoiding Cherry-Picking Results
Preserves slower-moving patterns
(the trend component)
Lesson 755Moving Averages for Trend Estimation
Prevention
Use additive decomposition or agreed-upon attribution rules *before* initiatives launch.
Lesson 1642Attribution Pitfalls and Common Errors
Prevents Direct Pushes
No one can use `git push` directly to protected branches—all changes must go through pull requests.
Lesson 2027Protecting Branches and Required Reviews
Prevents Force Pushes
Protects against accidental history rewrites that could break reproducibility.
Lesson 2027Protecting Branches and Required Reviews
Preview data structure
without downloading entire tables
Lesson 877LIMIT: Restricting the Number of Rows Returned
Price sensitivity
(competitor pricing, perceived value)
Lesson 1675Churn Attribution and Root Cause Analysis
Pricing optimization
Test whether annual plans reduce hazard rates compared to monthly
Lesson 838Subscription and Membership Duration Modeling
Primacy Effect
Conversely, existing users are *already comfortable with the old version*.
Lesson 1525Novelty and Primacy Effects
Primary and secondary metrics
you'll measure
Lesson 1508Pre-Registration and Correction Strategy
Primary contact
Your email or project maintainer's handle
Lesson 2083Contributing Guidelines and Contact Information
Primary data geoms
The main visual elements (points, lines, bars)
Lesson 1355Layer Order and Plot Composition
primary metric
(or success metric) must directly align with your business goal.
Lesson 1478Defining Success MetricsLesson 1485Documentation and Pre-Registration
Primary test
Compare mean purchase frequency between age groups using appropriate statistical tests
Lesson 1204From Hypothesis to Analysis Plan
Principal Data Scientist
Strategic technical direction, influence company-wide architecture, recognized external expert
Lesson 2140Individual Contributor vs Management Tracks
Prior knowledge
Frequentist: ignored. Bayesian: incorporated explicitly.
Lesson 1580Bayesian vs Frequentist A/B Testing
Prior mean (μ₀)
Your best guess for the population mean before seeing data
Lesson 1565Prior Distributions for Normal Means
Prior precision
How concentrated is your prior distribution?
Lesson 1549Prior-Likelihood Trade-offs
Prior probability P(Guilty)
base rate of guilt before evidence
Lesson 112Legal Evidence and Jury Reasoning
Prior standard deviation (σ₀)
How uncertain you are about that guess
Lesson 1565Prior Distributions for Normal Means
Prior: P(A)
Your initial belief about A *before* seeing evidence B
Lesson 107Bayes' Theorem Formula and Components
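As a rough illustration of how the prior P(A) feeds into Bayes' theorem, the sketch below computes P(A | B) from a prior and two likelihoods. The function name and the numbers (a 1% prior, a 95% hit rate, a 5% false-alarm rate) are assumptions chosen for the example.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A | B) via Bayes' theorem, expanding P(B) with the law of total probability."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical inputs: P(A) = 0.01, P(B | A) = 0.95, P(B | not A) = 0.05.
p_a_given_b = posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
```

With a prior this small, the posterior stays well below the 95% hit rate, which is why the prior cannot be ignored.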
Prioritize by pain
Refactor the parts of your pipeline that cause the most frequent issues or slow you down most
Lesson 2137Refactoring Strategies and Debt Paydown
Prioritize ruthlessly
what matters most
Lesson 2121Timeboxing and Deadlines
Prioritized
Rank recommendations by impact, feasibility, or urgency.
Lesson 1970Recommendations and Next Steps
Priority level
based on business impact and testability, which hypotheses deserve formal testing first?
Lesson 1203Documenting Hypotheses and Evidence
Priors
Specify distributions for unknown parameters
Lesson 1594PyMC: Probabilistic Programming in Python
Priors are extreme
Starting at 0.
Lesson 115Prior Sensitivity Analysis
Priors are similar
Starting at 30% versus 35% won't create huge differences
Lesson 115Prior Sensitivity Analysis
Privacy Attacks
Models trained on sensitive data might leak information through inference attacks, even if you've applied privacy techniques.
Lesson 1920Anticipating Misuse of Data Products
Privacy budget (ε)
How much privacy you're willing to "spend" (smaller ε = more noise = more privacy)
Lesson 1899Adding Noise for Privacy
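One common way to turn ε into noise is the Laplace mechanism, where the noise scale is sensitivity / ε. The lesson may use a different mechanism; the sketch below assumes Laplace noise, and the function names and the count of 120 are hypothetical.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale for the Laplace mechanism: a smaller epsilon gives a larger scale."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def add_laplace_noise(value, sensitivity, epsilon, rng):
    """Return value plus Laplace(0, scale) noise.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution centered at 0 with that scale.
    """
    scale = laplace_scale(sensitivity, epsilon)
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise


rng = random.Random(0)  # seeded only so the sketch is repeatable
noisy_count = add_laplace_noise(120, sensitivity=1.0, epsilon=0.5, rng=rng)
```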
Privacy-preserving machine learning
where training data stays encrypted throughout
Lesson 1903Secure Multi-Party Computation
Proactive monitoring
Owner spots anomalies and drives root-cause analysis
Lesson 1619What is Metric Ownership?
Probability of Being Best
directly answers this question by computing the probability that a given variant has the highest true conversion rate (or other metric) compared to all other variants.
Lesson 1583Probability of Being BestLesson 1586Multi-Armed Bandit Connections
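A Monte Carlo sketch of this quantity for conversion rates, assuming a uniform Beta(1, 1) prior on each variant. The prior, the function name, and the example counts are assumptions made for illustration.

```python
import random


def prob_being_best(successes, trials, draws=20000, seed=0):
    """Estimate, per variant, the probability that it has the highest true rate.

    Each variant's posterior is Beta(1 + successes, 1 + failures),
    which follows from the assumed Beta(1, 1) prior.
    """
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        samples = [
            rng.betavariate(1 + s, 1 + n - s) for s, n in zip(successes, trials)
        ]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Hypothetical test: variant A converts 100/1000, variant B converts 130/1000.
probs = prob_being_best(successes=[100, 130], trials=[1000, 1000])
```

The same function handles more than two variants, because it counts which sampled rate is largest on each draw.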
Probability sampling
means every member of the population has a *known, non-zero chance* of being selected.
Lesson 242Probability vs Non-Probability Sampling
Probability sampling methods
give you statistical validity.
Lesson 243Choosing the Right Sampling Method
Probability statements
You can make direct claims like "There's a 95% probability the conversion rate is between 0.
Lesson 1547Interpreting Posterior Distributions
Probability threshold
Stop when P(B better than A | data) > 0.
Lesson 1585Early Stopping in Bayesian Tests
Probe edge cases
"What happens if the model is wrong?
Lesson 2102Understanding Stakeholder Goals and Constraints
Probit
has thinner tails (based on the normal distribution)
Lesson 674The Probit LinkLesson 678Choosing the Right Link Function
probit link
does the same job but uses the cumulative distribution function (CDF) of the standard normal distribution instead.
Lesson 674The Probit LinkLesson 676Canonical vs Non-Canonical LinksLesson 677Interpreting Coefficients Under Different Links
Process
Each worker applies the same operation to its chunk independently
Lesson 1768Data Parallelism Fundamentals
Processing speed
One-pass transformation instead of read-then-transform
Lesson 1802Filtering During Read with dtype and Converters
Product A
10,000 new users/month, 10% retention → 1,000 active users
Lesson 1614Growth Without Retention
Product B
2,000 new users/month, 70% retention → 1,400 active users
Lesson 1614Growth Without Retention
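The Product A and Product B figures above come from multiplying new users by the retention rate. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def retained_users(new_users_per_month, retention_rate):
    """Users from one month's intake who remain active, rounded to whole users."""
    return round(new_users_per_month * retention_rate)


product_a = retained_users(10_000, 0.10)  # high acquisition, low retention
product_b = retained_users(2_000, 0.70)   # low acquisition, high retention
```

Product B ends up with more active users despite acquiring one fifth as many.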
Product categories
An item categorized as "electronics" cannot also be "clothing" (assuming mutually exclusive classification)
Lesson 81Mutually Exclusive Events
Product changes
Measure impact on cohorts before vs after a launch
Lesson 1644What is Cohort Analysis?
Product feedback
Early adopters who volunteer feedback aren't typical users
Lesson 246Volunteer and Self-Selection Bias
Product focus
Should you optimize for retention of casual users or delight of power users?
Lesson 1698Power User Curves and Engagement Distribution
Product gaps
(missing features, usability issues)
Lesson 1675Churn Attribution and Root Cause Analysis
Product Launches
Companies use DiD when rolling out features to some markets first.
Lesson 1459Real-World DiD Applications
Product recommendations
Finding pairs of products from the same `products` table
Lesson 945Introduction to Self-Joins
Product reviews
skew positive when only satisfied customers bother to write them
Lesson 247Survivorship Bias
Product Team Objective
Improve discovery experience
Lesson 1608Connecting North Star Metrics to OKRs
Product-market fit quality
Higher floors suggest stronger fit
Lesson 1658Flattening and Asymptotic Behavior
Production code
Applications should specify exactly which columns they need for clarity and performance
Lesson 851Selecting All Columns with Asterisk
Production deployment
Compiled models are easier to integrate into non-Python systems
Lesson 1595Stan: High-Performance Bayesian Inference
Production pipelines
that run automatically (ETL, model training, inference)
Lesson 2074Notebooks vs Scripts: When to Use Each
Production pipelines dominate
Python integrates better with web services, APIs, and deployment infrastructure.
Lesson 1375Choosing Tools: When to Use R vs Python for Visualization
Professional color palettes
(ColorBrewer, viridis)
Lesson 1369Publication-Ready Plot Styling
Profiling reports
go deeper: statistics for numeric columns (mean, min, max), cardinality for categorical fields, missing value percentages, and distribution summaries.
Lesson 2067Automating Documentation with Code
Profitability focus
By identifying unprofitable segments (LTV < CAC), you can adjust targeting criteria, reduce spend, or experiment with lower-cost channels.
Lesson 1669LTV Segmentation and Targeting
Programming
You'll need to write code to clean, analyze, and visualize data.
Lesson 7The Data Science Skill Stack
Project portability
Each project carries its own dependency specification, making deployment predictable
Lesson 2039Virtual Environments: Concept and Benefits
Project Structure
Brief overview of directory organization
Lesson 2077The Purpose and Anatomy of a Good README
Project templates
solve this by providing a blueprint—a cookie cutter, if you will—that stamps out a consistent structure every time you start fresh.
Lesson 2076Code Organization Templates and Cookiecutter
Project Title and Description
One-line summary and brief explanation of the project's purpose
Lesson 2077The Purpose and Anatomy of a Good README
Project-Join Normal Form
(also known as fifth normal form, 5NF) eliminates **join dependencies**.
Lesson 1068Higher Normal Forms: 4NF and 5NF
Project-level
"Final presentation is in 3 weeks—no exceptions"
Lesson 2121Timeboxing and Deadlines
Prometheus
, **Grafana**, and **Datadog** automate this process, offering dashboards that show pipeline status at a glance and trigger alerts when thresholds are breached.
Lesson 1861Monitoring Tools and Dashboards
Proportion test
When your metric is a conversion rate or percentage
Lesson 1749Measuring Statistical Significance
Proportional allocation
assigns credit based on estimated contribution size (e.
Lesson 1640Attribution in Multi-Team Environments
Propose
a new location nearby (a candidate parameter value)
Lesson 1590The Metropolis-Hastings Algorithm
Pros
Lightning-fast reads, simplified queries
Lesson 1076Materialized Views and Summary Tables
Prospects
People who've shown interest but haven't purchased yet.
Lesson 1704Customer Lifecycle Stages
Protanopia/Protanomaly
(red-weak): similar red-green confusion
Lesson 1248Color Blindness and Color Palette Design
Protected classes
are groups of people shielded by law from discrimination.
Lesson 1888Protected Classes and Sensitive Attributes
Protection from SQL injection
Parameterization is automatic
Lesson 1117What is an ORM and Why Use It?
Prototyping models
and experimenting with different approaches
Lesson 2074Notebooks vs Scripts: When to Use Each
Provenance questions
Can you trust data from third-party APIs or scraped sources?
Lesson 1762Extended Dimensions: Veracity and Value
Provide context
Explain *why* something matters.
Lesson 2024Code Review Best Practices
Provide fast retrieval
through indexing and optimized queries
Lesson 842What is a Database?
Proximity
Elements placed close together are perceived as related.
Lesson 1236Gestalt Principles in Visualization
Proxy validation is skipped
Teams assume a surrogate metric correlates with the real goal without validating that relationship (remember lesson 1520: Validating Surrogate Metrics).
Lesson 1530Mismatched Metrics and Goals
proxy variable
is a feature that correlates strongly with a protected attribute, allowing a model to infer sensitive information indirectly.
Lesson 1883Protected Classes and Proxy VariablesLesson 1889Proxy Variables and Redlining
Prunes intelligently
Eliminates candidate change-points that can never be part of the optimal solution, based on proven mathematical conditions
Lesson 1416PELT Algorithm: Pruned Exact Linear Time
Pseudonymization
replaces identifiers with artificial labels—Patient A, Patient B—allowing you to track the same individual across records without knowing their real identity.
Lesson 1895Data Anonymization Basics
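One possible way to generate stable artificial labels is a keyed hash, so the same identifier always maps to the same label. This is only a sketch: the lesson does not specify a method, and the key, the `patient_` prefix, and the 12-character truncation are hypothetical choices.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Map an identifier to a stable artificial label using HMAC-SHA256.

    The same identifier and key always yield the same label, so records
    can still be linked without exposing the original value.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return "patient_" + digest.hexdigest()[:12]


key = b"example-key-keep-this-secret"  # hypothetical key; store real keys securely
label_1 = pseudonymize("Jane Doe", key)
label_2 = pseudonymize("Jane Doe", key)
label_3 = pseudonymize("John Roe", key)
```

Anyone holding the key can recompute the mapping, so pseudonymized data is still re-identifiable and weaker than full anonymization.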
Public datasets
Government databases, research repositories, open data portals
Lesson 11Data Collection and Acquisition
Purchase Frequency
counts how many purchases the typical customer makes in a given period (say, per year).
Lesson 1663Simple LTV: Average Revenue Per Customer
Pure AR (Autoregressive) Process
Lesson 733Using ACF and PACF Together
Purpose limitation
Data collected for one purpose can't be repurposed for unrelated analytics without new consent
Lesson 1904What is GDPR and Why It MattersLesson 1905Core Principles of GDPR
Python class
represents a database **table**
Lesson 1117What is an ORM and Why Use It?
Python's approach
is like learning the second language natively from the start.
Lesson 1374Interactivity: plotly in R vs Python and Integration Patterns
Pyvis
is purpose-built for network visualization.
Lesson 1321Interactive Network Graphs with Plotly and Pyvis

Q

Q-Q linearity
Points hugging the diagonal reference line
Lesson 377Testing Normality: Visual Methods
Q-Q plot (quantile-quantile plot)
Residuals should fall along a straight diagonal line
Lesson 449Normality of ResidualsLesson 788Checking Residual Normality
Q-Q plot first
Does the pattern look problematic for your purposes?
Lesson 570Q-Q Plots vs Formal Normality Tests: When Visual Checks Matter
Q1
(25th percentile): 25% of data falls below this value
Lesson 1383Understanding the Interquartile Range (IQR)
Q1 (First Quartile)
The value at the 25% mark — one quarter of your data falls below this point
Lesson 51Interquartile Range (IQR)
Q3
(75th percentile): 75% of data falls below this value
Lesson 1383Understanding the Interquartile Range (IQR)
Q3 (Third Quartile)
The value at the 75% mark — three quarters of your data falls below this point
Lesson 51Interquartile Range (IQR)
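A short sketch of Q1, Q3, and the IQR using Python's standard library. The data values are made up, and the `inclusive` method is one of several quartile conventions, so other tools may report slightly different cut points.

```python
import statistics

# Hypothetical sample, already sorted for readability.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 21]

# n=4 requests quartiles; the result holds the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of the data
```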
Quadratic or polynomial trends
When your data curves upward or downward in an accelerating pattern
Lesson 736Higher-Order Differencing
Quadrupling your sample size
cuts the standard error in half
Lesson 223Standard Error and the CLT
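The halving follows from the standard error of the mean, σ / √n: multiplying n by 4 multiplies √n by 2. A minimal check, with illustrative values for σ and n:

```python
import math


def standard_error(sigma, n):
    """Standard error of the sample mean for population standard deviation sigma."""
    return sigma / math.sqrt(n)


se_small = standard_error(sigma=10.0, n=100)
se_large = standard_error(sigma=10.0, n=400)  # four times the sample size
```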
Qualitative and aspirational
"Transform user onboarding" beats "Improve metrics"
Lesson 1609Setting Effective Objectives
Quality
Good units ÷ total units produced (capturing defects)
Lesson 1636Manufacturing Metrics: OEE, Yield, and Cycle Time
Quality checks
Validation rules applied, records removed
Lesson 2065Tracking Data Lineage
Quality control
A manufacturing process with high variability produces inconsistent products
Lesson 46What is Variability?Lesson 351When to Use a One-Sample t-Test
Quality control pass rates
(proportion of acceptable products)
Lesson 184Beta Distribution: Bounded Between 0 and 1
Quality gates exist
Automated tests can run, and approval requirements can block poor code from merging
Lesson 2022Understanding Pull Requests
Quality metrics
Products manufactured in different batch sizes
Lesson 43Weighted Mean and Its Applications
Quantiles
are the general family of cut-points that divide ranked data into *any* equal-sized groups.
Lesson 57Quantiles: Quartiles, Deciles, and BeyondLesson 306Bootstrap for Non-Standard Problems
Quarantine new work
Apply strict standards to new features while gradually improving old ones
Lesson 2137Refactoring Strategies and Debt Paydown
Quarterly cycles
Business revenues influenced by fiscal quarters
Lesson 707Seasonality: Regular Periodic Patterns
Quarterly data
with yearly seasonality → period = 4
Lesson 746Choosing Seasonal Period
Quartiles
(4 groups): Cut your data into quarters.
Lesson 57Quantiles: Quartiles, Deciles, and Beyond
Quartiles or deciles
Divide customers into equal-sized groups (top 10%, next 10%, etc.)
Lesson 1669LTV Segmentation and Targeting
Query Complexity
Higher normalization means more tables.
Lesson 1070When to Stop Normalizing
Query execution time
The obvious metric, but run queries multiple times to account for caching
Lesson 1077Measuring Performance Impact of Denormalization
Query performance
Only read columns you need.
Lesson 1811Columnar Storage and Query Optimization
Queue theory
Time until multiple service completions
Lesson 181Gamma Distribution: Shape and Rate Parameters
Quick ad-hoc queries
You're doing temporary analysis and speed matters more than precision
Lesson 851Selecting All Columns with Asterisk
Quick Ratio
measures how much new and expansion revenue you gain versus how much you lose:
Lesson 1629SaaS Growth Metrics: Quick Ratio and Net Revenue Retention
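A widely used form of the SaaS Quick Ratio divides gained recurring revenue by lost recurring revenue. The sketch below assumes that form, since the lesson's exact formula is not shown here; the parameter names and dollar amounts are hypothetical.

```python
def quick_ratio(new_mrr, expansion_mrr, churned_mrr, contraction_mrr):
    """(new + expansion MRR) / (churned + contraction MRR) for one period."""
    lost = churned_mrr + contraction_mrr
    if lost == 0:
        raise ValueError("no revenue lost in this period; ratio is undefined")
    return (new_mrr + expansion_mrr) / lost


# Hypothetical month: 60k gained, 15k lost.
ratio = quick_ratio(
    new_mrr=50_000, expansion_mrr=10_000, churned_mrr=10_000, contraction_mrr=5_000
)
```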
Quick updates
Brief email summaries or Slack messages for "no blockers, progressing as planned"
Lesson 2104Communication Cadence and Updates
Quintiles
(5 groups): Split data into fifths, useful in economic studies and portfolio analysis.
Lesson 57Quantiles: Quartiles, Deciles, and Beyond
Quota
30 people aged 18-35, 30 aged 36-55, 40 aged 56+
Lesson 240Quota Sampling

R

r = -1
Perfect negative linear relationship (as one variable increases, the other decreases proportionally)
Lesson 476What is Pearson Correlation?Lesson 477Interpreting the Correlation Coefficient
r = +1
Perfect positive linear relationship (as one variable increases, the other increases proportionally)
Lesson 476What is Pearson Correlation?Lesson 477Interpreting the Correlation Coefficient
r = 0
No linear relationship (the variables don't follow a straight-line pattern together)
Lesson 476What is Pearson Correlation?Lesson 477Interpreting the Correlation Coefficient
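The three reference values of r can be reproduced with a small hand-rolled Pearson function. The data sets below are constructed for illustration: an increasing line, a decreasing line, and a symmetric V shape with no linear trend.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2 * v + 1 for v in x])  # perfectly increasing line
r_negative = pearson_r(x, [-3 * v for v in x])     # perfectly decreasing line
r_none = pearson_r(x, [2, 1, 0, 1, 2])             # V shape: related, but not linearly
```

The V-shaped case shows that r near 0 rules out a straight-line pattern only, not every kind of relationship.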
R Charts
Best for small subgroups (n ≤ 10).
Lesson 1399Control Charts for Variability (R and S Charts)
R Charts (Range Charts)
track the difference between the highest and lowest values in each sample group.
Lesson 1399Control Charts for Variability (R and S Charts)
R-hat statistic
Compares variance within and between multiple chains; values near 1.
Lesson 1592Burn-in, Thinning, and Convergence Diagnostics
R-squared
(written as R² or r²) tells you the **proportion of variance in Y that is explained by X**.
Lesson 531What is R-Squared?Lesson 543Residuals as Unexplained Variation
R-squared and adjusted R-squared
Model fit is unchanged
Lesson 647Impact on Model Results and Reporting
R's approach
is like having an interpreter who translates your speech (ggplot2 code) into another language (plotly).
Lesson 1374Interactivity: plotly in R vs Python and Integration Patterns
R²
measures the proportion of variance in Y explained by your regression model
Lesson 534R-Squared vs Correlation SquaredLesson 613The Adjusted R-Squared Formula
R² = 0
Your model explains none of the variance; you might as well use the mean of Y as your prediction
Lesson 531What is R-Squared?Lesson 533Interpreting R-Squared Values
R² = 0.15
Only 15% of variance is explained; 85% remains unexplained.
Lesson 533Interpreting R-Squared Values
R² = 0.7
Your model explains 70% of the variance in Y
Lesson 531What is R-Squared?
R² = 0.85
Your model explains 85% of the variance—most of the variation is captured by your regression line.
Lesson 533Interpreting R-Squared Values
R² = 1
Your model perfectly predicts every Y value (rare in real life!)
Lesson 531What is R-Squared?
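The two endpoints, R² = 0 and R² = 1, can be checked with the usual definition R² = 1 − SS_res / SS_tot. The sketch below uses made-up values, and the function name is illustrative.

```python
def r_squared(y, y_hat):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((actual - pred) ** 2 for actual, pred in zip(y, y_hat))
    ss_tot = sum((actual - mean_y) ** 2 for actual in y)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r_squared(y, y)                  # predictions match every value
r2_mean_only = r_squared(y, [2.5] * len(y))   # always predict the mean of Y
```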
Radio silence after complaints
Sometimes the absence of follow-up signals they've given up
Lesson 1673Leading Indicators of Churn
Radioactive decay
An atom that hasn't decayed for an hour is no more "due" to decay than a fresh atom
Lesson 167Memoryless Property of Exponential
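The memoryless property says P(T > s + t | T > s) = P(T > t) for an exponential waiting time. A numerical check under an assumed decay rate (the rate and times are arbitrary):

```python
import math


def survival(t, rate):
    """P(T > t) for an exponential distribution with the given rate."""
    return math.exp(-rate * t)


def conditional_survival(s, t, rate):
    """P(T > s + t | T > s): chance of lasting t longer after surviving s."""
    return survival(s + t, rate) / survival(s, rate)


rate = 2.0  # hypothetical decay rate per hour
fresh_atom = survival(0.5, rate)
old_atom = conditional_survival(s=1.0, t=0.5, rate=rate)  # already survived 1 hour
```

The two probabilities agree up to floating-point error, so the hour already survived has no effect.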
Rainbow palettes
They suggest order where none exists and aren't colorblind-friendly.
Lesson 1309Choropleth Maps: Basics and Best Practices
Random Assignment
Each participant has an equal chance of being assigned to either group
Lesson 1435What is a Randomized Controlled Trial?Lesson 1486Why Randomization Matters in A/B Tests
Random failures
(constant hazard rate).
Lesson 189Fitting Weibull Models to Lifetime Data
Random number generators
Does each digit appear with equal frequency?
Lesson 421Applications: Uniform, Genetic Ratios, and Distributions
Random sampling
Your data comes from a random process
Lesson 419Assumptions and Minimum Expected Frequencies
Randomization Quality
Both groups should have similar characteristics (demographics, behavior patterns) if randomization works correctly
Lesson 1483Pre-Experiment Validation
Randomization unit
User, session, or other unit you defined
Lesson 1485Documentation and Pre-Registration
Randomize assignment
Split users randomly into control and treatment groups (e.
Lesson 1641Isolating Effects with Control Groups
Randomized Controlled Trial (RCT)
is an experimental method where participants are randomly assigned to either a **treatment group** (receives the intervention) or a **control group** (does not receive the intervention).
Lesson 1435What is a Randomized Controlled Trial?Lesson 1677Measuring Churn Reduction Impact
Randomizing by session
gives you more experimental units (higher power), but risks violating independence assumptions and creates inconsistent experiences.
Lesson 1481Unit of Randomization
Randomizing by user
gives cleaner results and consistent experience, but requires more users to detect effects.
Lesson 1481Unit of Randomization
Randomly assign users
to treatment (real ad) or control (PSA/ghost ad)
Lesson 1747Ghost Ads and PSA Tests
Randomly select
some clusters
Lesson 237Cluster Sampling
Range analysis
"What's the age range of our customers?
Lesson 885MIN and MAX: Finding Extremes
Range queries
(`WHERE age BETWEEN 25 AND 35`) find the starting point, then scan consecutive sorted leaves
Lesson 1079B-Tree Indexes: Structure and Mechanics
Range retention
User was active *at any point* from start through that period (cumulative)
Lesson 1648Cohort Retention Rates
Range sliders
excel with time-series data or any ordered sequence where users need to examine specific intervals (e.
Lesson 1303Range Sliders and Zoom Controls
Rank the absolute values
of differences from smallest to largest
Lesson 392Wilcoxon Signed-Rank Test
Rank them
from 1 (smallest) to n (largest), averaging tied ranks
Lesson 393Mann-Whitney U Test (Wilcoxon Rank-Sum)
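A sketch of ranking with averaged ties, the step described above. The function name and the sample values are illustrative.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) to n (largest), averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


ranks = average_ranks([10, 20, 20, 30])  # the two 20s share ranks 2 and 3
```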
Rank users
by their activity level (highest to lowest)
Lesson 1698Power User Curves and Engagement Distribution
ranking
you pool all observations from both groups, assign ranks from smallest to largest (ignoring which group they came from), then sum the ranks for each group.
Lesson 393Mann-Whitney U Test (Wilcoxon Rank-Sum)Lesson 474Friedman Test: Non-Parametric Repeated Measures ANOVALesson 488Computing Spearman Correlation
Rankings
(1st, 2nd, 3rd) alongside the actual data values
Lesson 1005Introduction to Window Functions
Rankings are important
ordering items from high to low
Lesson 1233Position as the Most Effective Channel
Ranks all observations
from smallest to largest across *all* groups combined (ignoring group membership temporarily)
Lesson 471Kruskal-Wallis H Test: The Non-Parametric One-Way ANOVA
Rare Events
Earthquakes per year, typos per page, or accidents per month—anything that happens occasionally but at a predictable average rate.
Lesson 144Poisson Applications: Arrivals and Events
Raster formats
(like PNG, JPG) store pixels.
Lesson 1273Saving Figures: Formats and Resolution
Rate data
Counts per unit of time, space, or population (e.
Lesson 689When to Use Poisson Regression
Rate parameter (β, "beta")
Controls how quickly probability "decays" or spreads out.
Lesson 181Gamma Distribution: Shape and Rate Parameters
rate parameter λ
(lambda), you can calculate the probability of observing *exactly* k events in your interval.
Lesson 140Poisson Probability Mass FunctionLesson 166Exponential Distribution: Mean and Variance
rates
when your observations have unequal exposure times or denominators.
Lesson 692Offset Terms for ExposureLesson 1613Raw Counts vs. Rates and Ratios
Ratio data
(numeric with meaningful zero: height, count, salary) leverages:
Lesson 1238Matching Encoding to Data Type
Ratio to partition average
`value / AVG(value) OVER (PARTITION BY category)`
Lesson 1019Comparing Values to Window Aggregates
Raw Kurtosis (Fisher's)
Under Fisher's definition, the complete formula above subtracts 3 at the end, which gives excess kurtosis; raw (Pearson's) kurtosis omits that subtraction.
Lesson 67Calculating Kurtosis
RDD (Resilient Distributed Dataset)
is Spark's core data structure—a collection of objects distributed across the nodes in your cluster.
Lesson 1777RDDs: Resilient Distributed Datasets Fundamentals
Reactivations
If churned customers return, do you subtract them from "customers lost"?
Lesson 1671Churn Rate Calculation Methods
Reactive mode
You're constantly firefighting instead of preventing fires
Lesson 1617The Danger of Lagging-Only Metrics
Read both versions carefully
to understand what each branch changed
Lesson 2011Resolving Merge Conflicts
Read Committed
You only see committed data, but values can change during your transaction
Lesson 1116Transaction Isolation and Concurrency
Read the intersection
this value is P(Z ≤ your Z-score)
Lesson 198Using Z-Tables for Probability
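The value read from a Z-table can be cross-checked in code with the standard normal CDF. The Z-score of 1.96 is an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

z_score = 1.96
p_below = standard_normal.cdf(z_score)  # P(Z <= 1.96), the table's cell value
p_above = 1.0 - p_below                 # upper-tail probability
```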
Read Uncommitted
You can see other transactions' uncommitted changes (risky!)
Lesson 1116Transaction Isolation and Concurrency
Read-heavy workloads
If a table is queried 10,000 times daily but updated once, duplicating data to avoid joins is worthwhile.
Lesson 1071When to Denormalize: Performance Trade-offsLesson 1073Storing Computed Values and Aggregates
Readability matters
Can you understand what the code does?
Lesson 2024Code Review Best Practices
Readmission rate
measures the percentage of patients returning within 30 days—a lagging indicator of both care quality and discharge planning effectiveness.
Lesson 1633Healthcare Metrics: Patient Outcomes and Operational Efficiency
Real-time
(milliseconds-to-seconds) demands streaming pipelines with immediate processing.
Lesson 1825Designing Pipeline Architecture
Real-time learning
Update beliefs as information arrives rather than waiting
Lesson 1538Updating Beliefs with Sequential Data
Real-world example
Consider medical testing.
Lesson 100Common Conditional Probability Mistakes
Real-World Needs
If you're building an analytics dashboard that constantly needs customer names with their order totals, joining `customers` and `orders` thousands of times per hour might waste resources.
Lesson 1070When to Stop Normalizing
Realism matters more
Your domain knowledge doesn't fit standard conjugate families
Lesson 1556Choosing Between Conjugate and Non-Conjugate Priors
Reassess consent
before any new use case—even internal ones
Lesson 1915Secondary Use and Scope Creep
Rebalance quarterly
as business conditions evolve
Lesson 1759Optimizing ROAS, CAC, and Payback Together
Rebase
rewrites history by moving your branch's commits to start from a different point.
Lesson 2014Understanding Git Rebase vs MergeLesson 2016Rebasing Feature Branches
Rebuild Fragmented Indexes
When fragmentation exceeds 30-40%, rebuild the index to reorganize data pages.
Lesson 1086Index Maintenance and Monitoring
Recalculate the test statistic
for this permuted dataset
Lesson 395Permutation Tests for Means and Beyond
Recalculates centers
based on the customers assigned to them
Lesson 1705K-Means Clustering for Segmentation
Recall
TP / (TP + FN) — of all real changes, how many did you catch?
Lesson 1418Evaluating Change-Point Detection Methods
Recency
How recently did they make a purchase?
Lesson 1703RFM Analysis: Recency, Frequency, Monetary Value
Reciprocal
(`1/Y`) can handle extreme heteroscedasticity but changes interpretation dramatically.
Lesson 591When and Why to Transform Variables
Recognizing boundaries of competence
means honestly assessing what you know versus what a problem requires, and making responsible decisions about whether to proceed alone, seek help, or decline the work entirely.
Lesson 34Recognizing Boundaries of Competence
Recommendation
Launch automated alerts for at-risk accounts
Lesson 1948The Recommendation Slide: Making It Actionable
Recommendations
Actionable next steps tied to findings
Lesson 1966Report Structure and Executive Summary
Recommendations backed by evidence
, not just observations
Lesson 2091Stage 7: Communication and Handoff
Recommended action
(specific and time-bound)
Lesson 1966Report Structure and Executive Summary
Record time-to-event
Days/months until failure (or censoring if still working at study end)
Lesson 837Product Warranty and Failure Analysis
Recovery is risky
Fixing an error by rerunning might make things worse
Lesson 1847What is Idempotency?
Recursive Member
The self-referencing query that adds the next "layer" by joining back to what you've already found.
Lesson 996Recursive CTEs: Introduction
Recursive operations
CTEs support recursion; subqueries don't
Lesson 974When to Use FROM Subqueries vs CTEs
Recuse yourself
from projects where you can't be objective
Lesson 35Conflicts of Interest and Independence
Recycles
the connection back to the pool when you're done (via `close()` or context manager)
Lesson 1092Connection Pooling Basics
Red flags for non-stationarity
Lesson 715Visual Tests for Stationarity
Redshift
Offers both traditional nodes and newer "Spectrum" for separated storage
Lesson 1813Modern Cloud Data Warehouses: Snowflake, BigQuery, Redshift
Reduce multicollinearity
VIF values drop for remaining predictors
Lesson 585Remedies: Variable Selection
Reduce noise
– Random variations get averaged out
Lesson 750What is a Moving Average?
Reduce Redundancy
Instead of storing a customer's address in every order record, you store it once in a `customers` table and reference it using a foreign key.
Lesson 1061Introduction to Normalization
Reduce wasted effort
If the simple answer settles the question, you saved days of work
Lesson 2110The Minimum Viable Analysis (MVA)Lesson 2111Fast Feedback Loops with Stakeholders
Reduced data redundancy
Category names aren't repeated for every product
Lesson 1810Snowflake Schema and Normalization Trade-offs
Reduced Feature Adoption
When active users stop exploring new features or abandon key workflows they once used regularly, disengagement may be brewing.
Lesson 1700Leading Indicators of Disengagement
Reduced human error
No forgotten runs or copy-paste mistakes
Lesson 1986Automated Report Generation
Reduced LTV
Shorter customer lifespans mean less total revenue per customer
Lesson 1670What is Churn and Why It Matters
Reduced opportunity cost
of running inferior variants
Lesson 1515Trade-offs: Sample Size, Speed, and Complexity
Reduced peak memory
Never materialize the "wrong" version
Lesson 1802Filtering During Read with dtype and Converters
Reduced power
– Your ability to detect true effects decreases
Lesson 342Alpha Level Trade-offs
Reduced power per test
With the same overall sample size, each pairwise comparison has less data and thus less ability to detect real effects
Lesson 1528Testing Too Many Variants
Reduced sampling variability
Larger samples produce statistics (like means) that cluster more tightly around the true population value.
Lesson 340Power and Sample Size Relationship
Reduced statistical significance
Even though your overall model might fit well (good R-squared), individual predictors may appear non-significant
Lesson 580What is Multicollinearity?
Reduces clarity
by creating visual noise
Lesson 1963Removing Chartjunk
Reducing skewness
– Converting the stretched-out tail into a more symmetric bell shape
Lesson 212Log Transformations
Redundant variables
highly correlated features that provide similar information
Lesson 1192Correlation Matrices and Heatmaps
Reference lines
Add `geom_hline()` or `geom_vline()` early so data appears over them, or late to emphasize thresholds
Lesson 1355Layer Order and Plot CompositionLesson 1962Contextualizing Numbers
Referral
Traffic from links on other websites (blogs, news articles, partner sites)
Lesson 1712Common Channel CategoriesLesson 1758Cohort-Based Payback Analysis
Referrals
(word-of-mouth, referral programs)
Lesson 1711What Are Acquisition Channels?
Referrer Headers
are automatically sent by browsers, telling your server which website the user came from.
Lesson 1713Tracking Users by Channel
Reframe, don't just refuse
Instead of "I can't do that," try:
Lesson 1931When to Push Back on Requests
Regression models
treat LTV as a continuous outcome.
Lesson 1668Predictive LTV Models
Regression plots
Fit and display linear models
Lesson 1281Introduction to Seaborn's Statistical Plots
Regular audits
Schedule quarterly reviews to identify unused notebooks, deprecated feature columns, and abandoned model variants.
Lesson 2135Dead Experimental Code and Feature Sprawl
Regular, predictable patterns
that repeat at fixed intervals—daily, weekly, monthly, or yearly.
Lesson 705The Four Classical Components
Regularization
adds a penalty to the model that discourages large coefficient values, stabilizing estimates even when predictors overlap.
Lesson 586Remedies: Regularization PreviewLesson 1569Shrinkage and Regularization Effects
Regularly
(daily sales reports)
Lesson 1831What is Job Scheduling?
Regulatory constraints
Are there legal requirements (HIPAA, GDPR) or industry standards that limit what you can analyze or recommend?
Lesson 1168Understanding Domain Context
Regulatory context
(HIPAA for healthcare, SOX for finance)
Lesson 2145Transitioning Between Industries and Domains
Reject
the entire batch (strict pipelines)
Lesson 1826Data Validation and Schema Enforcement
Reject all hypotheses
up to (but not including) that stopping point
Lesson 1504Holm-Bonferroni MethodLesson 1506Benjamini-Hochberg Procedure
Reject H₀
If H₀ is actually true, this is a Type I Error (α); if H₀ is actually false, it is the correct decision (Power = 1-β).
Lesson 338What is Statistical Power?
Rejecting invalid inserts
You can't add a row with a foreign key value that doesn't exist in the parent table
Lesson 1055What is Referential Integrity?
rejection region
the specific zone in your test statistic's distribution where the evidence is strong enough to reject the null hypothesis.
Lesson 325The Rejection RegionLesson 336Visualizing Error Types with Sampling DistributionsLesson 345Directionality in Hypothesis Testing
Rejection region shrinks
– Fewer test statistics will fall in the "reject H₀" zone
Lesson 342Alpha Level Trade-offs
Related issues
(links to tickets or prior discussions)
Lesson 2023Creating a Pull Request
Relational plots
Explore relationships between variables (scatter, line plots with confidence intervals)
Lesson 1281Introduction to Seaborn's Statistical Plots
Relevance
Was it collected recently enough for your problem?
Lesson 23Data Provenance and Metadata
Relevant Scales
Help audiences grasp magnitude.
Lesson 1939Context and Comparison: Making Numbers Meaningful
Remainder
(or residual): Everything left over (like improvisations)—the noise and potential anomalies
Lesson 1406Decomposing Seasonality
Remove all conflict markers
(`<<<<<<<`, `=======`, `>>>>>>>`)
Lesson 2011Resolving Merge Conflicts
Remove chart junk
Delete unnecessary gridlines (keep only what's needed for reading values), drop borders, eliminate 3D effects, and ditch decorative fills.
Lesson 1958Simplifying Visual Complexity
Remove zeros
(ties where difference = 0)
Lesson 392Wilcoxon Signed-Rank Test
Removes between-subject variability
(some people naturally weigh more)
Lesson 370Differences as the Unit of Analysis
Removing duplicates
Identifying and eliminating repeated entries that could skew your analysis.
Lesson 12Data Cleaning and Preparation
Removing outliers
Identifying unusual values that might be errors or genuinely extreme cases requiring special handling.
Lesson 12Data Cleaning and Preparation
Removing redundancy
Cleaning up result sets with unwanted duplicates
Lesson 873Understanding DISTINCT: Removing Duplicate Rows
Repeatable Read
Once you read a value, it stays the same in your transaction
Lesson 1116Transaction Isolation and Concurrency
Replace metrics with outcomes
"5% improvement in precision" becomes "prevents 50 wasted sales calls per month"
Lesson 2105Translating Between Technical and Business Language
Report all preregistered analyses
, not just "successful" ones
Lesson 1929Avoiding Cherry-Picking Results
Reporting and analytics
Dashboards often aggregate data from many tables.
Lesson 1071When to Denormalize: Performance Trade-offs
Reports excel at explanation
, providing the context, methodology, and recommendations that dashboards can't accommodate.
Lesson 1980Hybrid Approaches and When to Use Both
Repository
The sealed, labeled package that's been officially sent and recorded
Lesson 1993The Three States: Working Directory, Staging, Repository
Representativeness
Does your dataset reflect the full population or just a subset?
Lesson 1169Clarifying Assumptions and Constraints
reproducible
when someone else (or future-you) can take the same raw data and the same code, run it again, and get *exactly* the same results, tables, figures, and conclusions.
Lesson 1981What Makes a Report Reproducible?Lesson 2036Code Review Practices for Data Science
Reproducible code
Clean GitHub repos with proper READMEs (as you've learned)
Lesson 2141Building a Portfolio and Personal Brand
Request more budget
(often not feasible)
Lesson 295Trade-offs: Precision, Confidence, and Cost
Required for self-joins
When joining a table to itself (covered later), aliases become essential.
Lesson 924Using Table Aliases in Joins
Required sample size
(what you're solving for)
Lesson 388Effect Size in Sample Size Planning
Required transformations
"Log-transform `income` to reduce skewness"
Lesson 1212EDA Summary Documentation and Next Steps
Requirements
Python/R version, key dependencies or link to `requirements.
Lesson 2077The Purpose and Anatomy of a Good README
Requires Reviews
You can mandate that 1, 2, or more team members approve a pull request before it can merge.
Lesson 2027Protecting Branches and Required Reviews
Rerandomization
is a technique where you check covariate balance *before* starting your experiment, and if balance is poor, you rerandomize until you get acceptable balance.
Lesson 1492Rerandomization and Practical Implementation
Resample your data
with replacement many times (typically 1,000–10,000 times)
Lesson 306Bootstrap for Non-Standard Problems
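The resampling step might be sketched as follows. The data, the choice of the median as the statistic, the helper name `bootstrap_statistics`, and the simple percentile interval are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_statistics(data, stat_fn, n_resamples=2000, seed=0):
    """Recompute stat_fn on n_resamples samples drawn with replacement."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    n = len(data)
    return [stat_fn(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical sample
data = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9, 12.5]

medians = sorted(bootstrap_statistics(data, statistics.median))

# Simple percentile interval: the middle 95% of the bootstrap medians
lower = medians[int(0.025 * len(medians))]
upper = medians[int(0.975 * len(medians)) - 1]
print(f"95% percentile interval for the median: ({lower}, {upper})")
```

Each resample has the same size as the original data. `random.choices` draws with replacement, which is what distinguishes the bootstrap from a permutation.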
Research sharing
Publish datasets for reproducibility without exposing participants
Lesson 1901Synthetic Data Generation
Reset to that state
`git reset --hard <commit-hash>` restores your branch to that exact point
Lesson 2021Recovering from Rebase Mistakes
Residual (eᵢ)
= Yᵢ - Ŷᵢ (the difference you learned about earlier)
Lesson 538What Are Fitted Values?
Residual (e)
"Here's how much the *actual* value differs from that prediction"
Lesson 543Residuals as Unexplained Variation
Residual deviance
measures how poorly your *fitted* model (with all predictors) fits.
Lesson 698Null and Residual Deviance
Residual patterns
A high R-squared can coexist with systematic patterns in your residuals—violations of the core assumptions that make your predictions unreliable.
Lesson 537When R-Squared is Not EnoughLesson 1189Detecting Nonlinear Relationships
Resilient
RDDs automatically recover from node failures.
Lesson 1777RDDs: Resilient Distributed Datasets Fundamentals
Resilient Distributed Dataset (RDD)
a fault-tolerant collection partitioned across nodes.
Lesson 1774What is Apache Spark and Why Use It?
Resolution (action)
What specific decision should stakeholders make based on this evidence?
Lesson 1933The Power of Narrative in Data Communication
Resource allocation
High-LTV customers justify higher acquisition costs (CAC) and more personalized outreach.
Lesson 1669LTV Segmentation and TargetingLesson 1711What Are Acquisition Channels?
Resource management
Prevents exhausting database connection limits
Lesson 1092Connection Pooling Basics
Resource Utilization
CPU, memory, disk I/O, and network usage during pipeline execution.
Lesson 1856Key Metrics to Monitor
Resource waste
Running tasks that depend on failed upstream tasks wastes compute resources and makes debugging harder.
Lesson 1840What is Dependency Management in Pipelines?
Resourced
Include rough estimates of time, cost, or personnel needed.
Lesson 1970Recommendations and Next Steps
Response expectations
"We typically respond within 48 hours"
Lesson 2083Contributing Guidelines and Contact Information
Response variable
Counts (0, 1, 2, 3, .
Lesson 690The Poisson Distribution as a GLM
Restore the signs
to each rank (positive or negative)
Lesson 392Wilcoxon Signed-Rank Test
RESTRICT
(or NO ACTION) prevents the parent operation if children exist:
Lesson 1054Cascading Actions: DELETE and UPDATELesson 1057ON DELETE and ON UPDATE Actions
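A hedged sketch of RESTRICT in action, using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables are hypothetical, and other database engines may report the failure differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key enforcement off unless it is enabled per connection
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customers(id) ON DELETE RESTRICT
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1)")

# The parent row still has a child order, so the delete should be refused
try:
    conn.execute("DELETE FROM customers WHERE id = 1")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True

remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(delete_blocked, remaining)
```

Deleting the child order first would allow the parent delete to succeed.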
Result
All quotas filled, but the sample includes only shoppers willing to stop and talk
Lesson 240Quota SamplingLesson 1566Conjugate Normal-Normal Model
Results/Output
What the project produces and where to find it
Lesson 2077The Purpose and Anatomy of a Good README
Retailer loyalty programs
data sold to data brokers who build detailed consumer profiles
Lesson 1922Surveillance and Secondary Data Uses
retention
strategies aim to prevent at-risk customers from leaving in the first place.
Lesson 1676Win-Back and Retention StrategiesLesson 1696Feature Adoption and Usage Frequency
Retention curves
plot the percentage of users who *remain active* over time (Day-1: 60%, Day-7: 40%, Day-30: 25%).
Lesson 1660Retention Curves vs Churn AnalysisLesson 1661What is Customer Lifetime Value (LTV)?Lesson 1678What is Funnel Analysis?
Retention insights
See if customers stick around longer over time
Lesson 1644What is Cohort Analysis?
Retention rates
45% of Jan cohort returned in Week 3
Lesson 1647Building a Cohort Table
Retraining is constant
You must retrain models regularly to capture new patterns
Lesson 2128Data Distribution Shifts Frequently
Retrieving
data (asking questions)
Lesson 844What is SQL?
Retry exhaustion
Slack channel with link to logs
Lesson 1851Error Logging and Notifications
retry logic
for transient errors, **idempotency** so rerunning doesn't corrupt data, **checkpointing** to resume mid-pipeline, and **monitoring/alerts** for quick detection.
Lesson 1825Designing Pipeline ArchitectureLesson 1854Testing Error Handling
Reusable functions and modules
that multiple projects import
Lesson 2074Notebooks vs Scripts: When to Use Each
Reveal trends
– The underlying direction becomes clearer
Lesson 750What is a Moving Average?
Revealing sequences
How rankings shift over years
Lesson 1306Animation and Time-Based Transitions
Revenue
is the quintessential lagging indicator—it tells you what already happened.
Lesson 1600Business Examples: Revenue vs Pipeline
Revenue accuracy
Which model's channel weights best predict revenue when you shift budget?
Lesson 1734Comparing and Validating Attribution Models
Revenue churn
measures *how much MRR* you lost from cancellations.
Lesson 1628SaaS Metrics: MRR, ARR, and Logo Churn
Revenue forecasting
Estimate lifetime value by modeling expected subscription duration
Lesson 838Subscription and Membership Duration ModelingLesson 1644What is Cohort Analysis?
Revenue generation
How much money comes in
Lesson 1516Business Metrics: Definition and Examples
Revenue per user
= total revenue / users (not just "made $50k!
Lesson 1613Raw Counts vs. Rates and Ratios
reverse
conditional probabilities—it lets you flip P(A|B) into P(B|A).
Lesson 107Bayes' Theorem Formula and ComponentsLesson 430Common Applications and Pitfalls
Reverse causality
occurs when two variables are correlated, but the direction of influence is the reverse of what you thought.
Lesson 496Reverse CausalityLesson 553Exogeneity: X Must Be Independent of ErrorsLesson 1424Reverse CausalityLesson 1464Instrumental Variables: The Endogeneity Problem
Reverse geocoding
works the opposite direction: you have coordinates (42.
Lesson 1315Geocoding and Reverse Geocoding
Reversibility is high
Changes can be rolled back easily if problems emerge later
Lesson 1522Balancing Speed and Accuracy in Metric Selection
Reversing range logic
Lesson 868The NOT Operator
Review against WCAG checklist
document what passes and what needs fixing
Lesson 1254Testing Visualizations for Accessibility
Review checkpoint
Show results to stakeholders at sprint end
Lesson 2113Timeboxing and Sprint Planning for Data Projects
Review logs
Examine both application logs and database server logs for detailed error messages
Lesson 1093Troubleshooting Connection Issues
Review notebook-specific PRs carefully
Understand that diffs may still be noisy even with best practices.
Lesson 2030Version Control for Notebooks: Challenges and Solutions
Review promptly
Respect the author's time by reviewing within a day or two.
Lesson 2024Code Review Best Practices
Review recent changes
Check pipeline code commits, configuration changes, or dependency updates around when the issue started
Lesson 1870Root Cause Analysis for Quality Issues
Reweighting
Adjust training data by giving higher weight to underrepresented or historically disadvantaged groups.
Lesson 1894Auditing and Remediation Strategies
Rework
means repeating work because something was missed, misunderstood, or poorly executed the first time—rerunning analysis because you forgot to document your seed, rebuilding features because requirements weren't clarified, or re-validating a model b...
Lesson 2112Iteration vs Rework: Learning from Each Cycle
Rideshare apps
Drivers in treatment might reduce wait times for riders in control
Lesson 1527Ignoring Network Effects
Ridge regression
modifies least squares by adding a penalty proportional to the *squared* coefficient values.
Lesson 586Remedies: Regularization Preview
Right
H₀: The drug has no effect (μ = 0), H₁: The drug works (μ > 0)
Lesson 313Common Pitfalls in Hypothesis Formulation
RIGHT JOIN
returns *every row from the right (second) table*, along with matching data from the left (first) table where available.
Lesson 929RIGHT JOIN Syntax and SemanticsLesson 936FULL OUTER JOIN Syntax
Right pane
Their changes (incoming branch)
Lesson 2019Using Diff Tools for Conflict Resolution
Right to erasure
("right to be forgotten"): People can request deletion of their data, impacting training datasets and model retraining
Lesson 1904What is GDPR and Why It MattersLesson 1909Right to Erasure and Data Retention PoliciesLesson 1911GDPR Compliance for Data Scientists
Right to explanation
Individuals can demand to understand automated decisions affecting them—black-box models become problematic
Lesson 1904What is GDPR and Why It Matters
Right-continuous
It's continuous from the right side at jump points
Lesson 810The Survival Function S(t)
Right-only rows
Left-side columns are NULL
Lesson 937Identifying Matched vs Unmatched Rows
Right-skewed
The distribution has a long tail extending to the right (high values)
Lesson 178Log-Normal Distribution: Definition and Properties
Right-skewed (positive skew)
A long tail stretches to the right; most values cluster at the lower end (e.
Lesson 1175Histograms for Distribution Shape
Risk
With small samples, you might accidentally get imbalanced groups (e.
Lesson 1437Randomization Mechanisms
Risk assessment
Two investments with the same average return might have wildly different risks
Lesson 46What is Variability?
Risk identification
what could go wrong?
Lesson 1910Data Protection Impact Assessments (DPIAs)
Risk of gaming exists
Surrogates might improve while harming long-term value
Lesson 1522Balancing Speed and Accuracy in Metric Selection
Risk tolerance
High variance might be unacceptable even with better expected value
Lesson 152Decision Making Under Uncertainty
Risk-adjusted returns
balancing profitability with stability
Lesson 1716Channel Mix and Portfolio Thinking
River One: Statistics (1800s–1900s)
Lesson 5The Evolution of Data Science
River Two: Computing (1950s–1990s)
Lesson 5The Evolution of Data Science
ROAS < 1
You're losing money directly on ad spend (spending more than you earn)
Lesson 1751Return on Ad Spend (ROAS): Definition and Calculation
ROAS = 1
Breaking even on ad spend (but likely unprofitable after other costs)
Lesson 1751Return on Ad Spend (ROAS): Definition and Calculation
ROAS > 1
Generating positive return, but profitability depends on margins
Lesson 1751Return on Ad Spend (ROAS): Definition and Calculation
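The three ROAS bands above can be sketched as a small helper. It assumes the usual definition, revenue attributed to ads divided by ad spend. The function names and figures are illustrative.

```python
def roas(ad_revenue, ad_spend):
    """Return on ad spend: revenue attributed to ads per unit spent."""
    if ad_spend <= 0:
        raise ValueError("ad_spend must be positive")
    return ad_revenue / ad_spend

def interpret_roas(value):
    """Map a ROAS value to the three bands in the glossary."""
    if value < 1:
        return "losing money on ad spend"
    if value == 1:
        return "breaking even on ad spend"
    return "positive return; profitability depends on margins"

# Hypothetical campaigns: (revenue, spend)
for revenue, spend in [(500, 1000), (1000, 1000), (3000, 1000)]:
    value = roas(revenue, spend)
    print(f"ROAS {value:.1f}: {interpret_roas(value)}")
```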
Robust regression
techniques offer an alternative: they fit models that automatically downweight or ignore outliers during estimation, so extreme points don't drag your fitted line off course.
Lesson 590Robust Regression Techniques
Robustness Testing
ensures your model performs consistently.
Lesson 2089Stage 5: Model Development and Validation
ROI measurement
Understand true return on marketing investment
Lesson 1718Introduction to Marketing Attribution
Role-play each audience type
with a colleague
Lesson 1956Anticipating and Addressing Audience Questions
Rollback Mechanisms
Simulate a mid-pipeline failure during a database write or transformation.
Lesson 1854Testing Error Handling
Rolling statistics
Mean and variance shouldn't drift systematically
Lesson 741Testing Stationarity After Transformation
Rolling window
Train on a fixed-size window (e.
Lesson 789Overfitting and Cross-Validation for Time Series
Root cause analysis
becomes nearly impossible when you discover issues weeks later
Lesson 2136Monitoring Gaps and Silent Failures
Rotating 3D views
Spin a 3D plot to reveal all angles
Lesson 1327Creating Animations with FuncAnimation
Roughly constant variance
the noise level should be stable
Lesson 709Irregular Component: Random Noise
Row proportions
divide each cell by its row total.
Lesson 98Conditional Probability with Tables
Row-level aggregations
Comparing individual values to group statistics
Lesson 967Subqueries in the SELECT Clause
Row-level analytics
that require context from other rows without losing detail
Lesson 1005Introduction to Window Functions
Rows (Records)
Each row represents a single instance or observation.
Lesson 843Relational Database Concepts
RSS
= Residual Sum of Squares (the sum of all squared residuals)
Lesson 536Residual Standard Error (RSE)
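As a minimal illustration with made-up observed and fitted values, RSS is the sum of the squared differences between them:

```python
# Hypothetical observed values and model predictions
observed = [3.0, 5.0, 7.5, 9.0]
fitted = [3.2, 4.9, 7.0, 9.4]

# Residual = observed minus fitted, one per observation
residuals = [y - y_hat for y, y_hat in zip(observed, fitted)]

# RSS = sum of squared residuals
rss = sum(e ** 2 for e in residuals)
print(f"RSS = {rss:.2f}")
```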
Rule
Check that both np̂ ≥ 10 *and* n(1-p̂) ≥ 10, where n is your sample size and p̂ is your sample proportion.
Lesson 282Checking Assumptions for Proportion Intervals
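A small sketch of this check. Because n times the sample proportion is just the number of successes, and n times its complement is the number of failures, the helper below (a hypothetical name) compares the raw counts directly.

```python
def large_sample_ok(successes, n, threshold=10):
    """True when both successes and failures reach the threshold."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

print(large_sample_ok(45, 100))  # 45 successes, 55 failures
print(large_sample_ok(4, 100))   # only 4 successes
```

Working with counts rather than multiplying n by a floating-point proportion avoids rounding surprises at the boundary.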
Rule 1
Each drawer label describes one type of information (not "Age&Address").
Lesson 1143The Three Rules of Tidy DataLesson 1402Western Electric Rules
Rule 2
Each folder holds one person's complete record (not scattered pieces).
Lesson 1143The Three Rules of Tidy DataLesson 1402Western Electric Rules
Rule 3
Employee files and project files live in separate cabinets (not jumbled together).
Lesson 1143The Three Rules of Tidy DataLesson 1402Western Electric Rules
Rule 4
Eight consecutive points on one side of the centerline (even if within 1σ)
Lesson 1402Western Electric Rules
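Rule 4 is simple enough to sketch directly. The function name and the sample readings are assumptions. A point exactly on the centerline is treated here as breaking the run, which is one possible convention.

```python
def rule4_violations(points, centerline, run_length=8):
    """Indices at which a run of same-side points reaches run_length."""
    flagged = []
    run, prev_side = 0, 0
    for i, x in enumerate(points):
        # +1 above the centerline, -1 below, 0 exactly on it
        side = (x > centerline) - (x < centerline)
        if side != 0 and side == prev_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        prev_side = side
        if run >= run_length:
            flagged.append(i)
    return flagged

# Hypothetical readings around a centerline of 10.0
readings = [9.9, 10.2, 10.1, 10.3, 10.1, 10.2, 10.4, 10.1, 10.2, 9.8]
print(rule4_violations(readings, centerline=10.0))
```

In this example the run above the centerline starts at index 1, so the eighth consecutive point is at index 8.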
Rule of thumb
When sampling without replacement, your sample size should be less than 10% of the population to maintain approximate independence.
Lesson 282Checking Assumptions for Proportion IntervalsLesson 577DFBETAS: Influence on Individual CoefficientsLesson 1467Testing Instrument Strength and ValidityLesson 1481Unit of Randomization
Rule-of-thumb approaches
Use formulas based on sample size and variance
Lesson 1463RDD Bandwidth Selection and Local Estimation
Run optimization
using constrained optimization algorithms (like scipy's `minimize` with bounds)
Lesson 1742Budget Optimization Using MMM
Run statistical tests
Apply Shapiro-Wilk (for smaller samples) or Anderson-Darling (for general use).
Lesson 210Combining Visual and Statistical Methods
Run tests longer
Allow time for behaviors to stabilize (typically 2-4 weeks minimum for behavioral changes)
Lesson 1525Novelty and Primacy Effects
Run the experiment
Increase marketing in test regions for a fixed period
Lesson 1746Geo-Lift Experiments
Running hypothesis tests
(t-tests, z-tests) that rely on normal theory
Lesson 202Why Test for Normality?
Running totals
or moving averages while preserving individual transactions
Lesson 1005Introduction to Window Functions
Runs in linear time
Under typical conditions, achieves O(n) complexity instead of O(n²)—a massive speedup for large datasets
Lesson 1416PELT Algorithm: Pruned Exact Linear Time
Russian nesting dolls
the innermost subquery runs first, its result becomes a table for the next level up, and so on.
Lesson 973Nested Subqueries in FROM

S

S Charts
Preferred for larger subgroups (n > 10) where range becomes less efficient at capturing true variability.
Lesson 1399Control Charts for Variability (R and S Charts)
S(∞) = 0
Eventually, everyone experiences the event (in theory)
Lesson 810The Survival Function S(t)
S(0) = 1
Everyone starts "alive" or event-free
Lesson 810The Survival Function S(t)
SaaS Sign-up
Landing Page → Sign-up Form → Email Verification → Onboarding → First Use
Lesson 1678What is Funnel Analysis?
Sales expenses
sales team salaries and commissions, sales software (CRM, outreach tools), travel and entertainment
Lesson 1753Customer Acquisition Cost (CAC): Components and Calculation
Sales pipeline metrics
, on the other hand, are leading indicators.
Lesson 1600Business Examples: Revenue vs Pipeline
Sales(t)
is your outcome variable at time *t* (weekly sales, conversions, etc.
Lesson 1738The Core MMM Regression Model
Same-store sales (SSS)
, also called "comparable store sales" or "comps," isolates growth from stores open at least 12-13 months, revealing organic performance by controlling for expansion.
Lesson 1634Retail Metrics: Same-Store Sales and Inventory Turnover
Sample distribution
is one snapshot from that album — maybe 100 randomly selected people.
Lesson 258Comparing Population, Sample, and Sampling Distributions
Sample from each stratum
Use simple random sampling *within* each stratum, maintaining the correct proportions
Lesson 236Stratified Sampling
Sample Mean (x̄)
The expected value of the sample mean equals the population mean (μ).
Lesson 255Expected Value of Sample Statistics
Sample Proportion (p̂)
The expected value equals the true population proportion (p).
Lesson 255Expected Value of Sample Statistics
Sample quantiles
(your actual residual values, sorted) on the y-axis
Lesson 565What Q-Q Plots Show: Comparing Residual Distribution to Normal
Sample size challenges
Intersectional groups may be small, making statistical analysis harder
Lesson 1893Intersectionality in Fairness
Sample size is large
More observations make the data speak louder than assumptions
Lesson 115Prior Sensitivity Analysis
Sample size is small
With little data, your starting belief dominates
Lesson 115Prior Sensitivity Analysis
Sample size limitations
"Based on 500 customers, we're confident in the direction but not precise magnitude"
Lesson 2122When Uncertainty Is Acceptable
Sample size matters
Typically, n ≥ 30 is considered sufficient for the CLT to "kick in," though it depends on how non-normal the original population is.
Lesson 218What the Central Limit Theorem States
Sample sizes
(how much data supports this?
Lesson 1244Omitting Uncertainty and Variability
Sample variance
Divide by **N-1** (one less than your sample size)
Lesson 50Population vs Sample VarianceLesson 255Expected Value of Sample Statistics
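Python's standard library exposes both divisors, which makes the difference easy to see. The data below are made up for illustration.

```python
import statistics

# Hypothetical sample of eight measurements
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance divides the squared deviations by N
population_var = statistics.pvariance(data)

# Sample variance divides by N-1
sample_var = statistics.variance(data)

print(population_var, sample_var)
```

The sample variance is always the larger of the two, and the gap shrinks as N grows.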
Sampling
Training a facial recognition model primarily on one demographic
Lesson 1878What is Bias in Data?Lesson 2055Why Randomness Matters in Data Science
Sampling bias
is a systematic error in how you collect your sample that pushes your results in one direction, away from the truth.
Lesson 248Sampling Error vs Sampling BiasLesson 249Coverage Error and UndercoverageLesson 1879Selection Bias and Sampling Bias
sampling distribution
is the probability distribution of a statistic (like the mean, median, or proportion) computed from *all possible samples* of a fixed size drawn from the same population.
Lesson 251What is a Sampling Distribution?Lesson 257Shape of Sampling DistributionsLesson 258Comparing Population, Sample, and Sampling Distributions
Sampling error
is the natural, random variation you get just because you didn't measure everyone.
Lesson 248Sampling Error vs Sampling Bias
Sampling new records
Generate fresh rows that follow the learned patterns but represent no actual person
Lesson 1901Synthetic Data Generation
Sampling zeros
People who *could* experience it but happened not to (e.
Lesson 695Zero-Inflated Models
SARIMA
(Seasonal ARIMA) adds a second layer of similar components that operate specifically on the seasonal lags.
Lesson 795Seasonal ARIMA (SARIMA) Structure
Satellite imagery
Shows actual photographs from above
Lesson 1314Basemaps and Map Tiles
Saturated model
Perfect fit with one parameter per observation
Lesson 697Deviance: A Measure of Model Fit
Saturation
is the intensity or purity of the color, ranging from vivid/vibrant to dull/grayish.
Lesson 1234Color: Hue, Saturation, and Luminance
Saturation/luminance
(lighter to darker shades)
Lesson 1238Matching Encoding to Data Type
Say
"For every additional hour of study time, we expect students' test scores to increase by about 2.
Lesson 530Communicating Results to Non-Technical AudiencesLesson 1955Framing Insights in Business Language
Scale Transformations
Switch to logarithmic scales with `set_xscale('log')` when data spans multiple orders of magnitude (think: population sizes from villages to countries).
Lesson 1270Customizing Axes: Labels, Limits, and Scales
Scale-Location plot
solves this by plotting the *square root* of the *absolute value* of standardized residuals against fitted values.
Lesson 560Scale-Location Plot (Spread-Location Plot)
Scaled fonts
that remain legible
Lesson 1369Publication-Ready Plot Styling
Scatter plot matrices
(Lesson 1191) visually show near-perfect linear relationships
Lesson 1197Identifying Variable Importance and Redundancy
Schedule quarterly reviews
with stakeholders to assess whether the tree still represents reality and strategy.
Lesson 1626Maintaining and Evolving Metric Trees
Scheduler
Monitors DAGs and triggers tasks when dependencies are met
Lesson 1833Introduction to Apache Airflow
Scheduling
is like setting alarm clocks: "Run this job every day at 2 AM.
Lesson 1832Orchestration vs Scheduling
schema
is an organizational container that groups related tables together.
Lesson 846Tables, Schemas, and Data TypesLesson 1151Schema Validation
Schema assumptions
Your code expects a column named `user_id`, but upstream decides to rename it to `customer_id`.
Lesson 2133Undocumented Data Dependencies
Schema awareness
Spark knows your column names and data types
Lesson 1778DataFrames and Spark SQL Basics
Schema Changes
Tracking modifications to data structure (new columns, type changes, renamed fields).
Lesson 1856Key Metrics to MonitorLesson 2136Monitoring Gaps and Silent Failures
Schema extraction
pulls structural information: column names, data types, primary keys, constraints.
Lesson 2067Automating Documentation with Code
Schema validation
checks structural requirements:
Lesson 1826Data Validation and Schema Enforcement
scikit-learn
for prediction-focused workflows and machine learning pipelines.
Lesson 545Extracting Residuals and Fitted Values in PythonLesson 2058Seed Scope and Multiple Libraries
Scoped
"Identify the top 3 pages where users abandon our checkout process, so we can redesign them to increase completed purchases by 10%"
Lesson 10Problem Definition and ScopingLesson 1166Defining the Business Question
Scoping
means setting clear boundaries: What will you measure?
Lesson 10Problem Definition and Scoping
Scoping constraints
What data is available?
Lesson 2085Stage 1: Problem Definition and Scoping
Screen reader testing
with tools like NVDA, JAWS, or VoiceOver reveals whether your alternative text and data tables are actually helpful
Lesson 1254Testing Visualizations for Accessibility
Scripts
are executable files that run a complete workflow — useful for automation and reproducibility.
Lesson 2071Modular Code: Functions and Scripts
SE(p̂)
is the standard error of the proportion: √(p̂(1-p̂)/n)
Lesson 278Confidence Interval Formula for One Proportion
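A sketch of how this standard error feeds a confidence interval, assuming the usual normal-approximation (Wald) form, p̂ ± z·SE(p̂). The function name and the counts are illustrative.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for one proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # SE(p-hat)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return p_hat - z * se, p_hat + z * se

# Hypothetical survey: 120 "yes" answers out of 400
low, high = proportion_ci(120, 400)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

This approximation is only reasonable when the sample contains enough successes and failures. Other interval methods exist for small samples.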
Seaborn FacetGrid
Similar benefits to ggplot2, with convenient statistical plotting functions built in
Lesson 1372Faceting: ggplot2 vs Seaborn and Matplotlib Subplots
Seaborn's FacetGrid
follows a similar declarative philosophy.
Lesson 1372Faceting: ggplot2 vs Seaborn and Matplotlib Subplots
Seamless visualization
Plotting libraries expect data in predictable formats.
Lesson 1149Benefits of Tidy Data for Downstream Work
Search and matching failures
"café" might not match "cafe" in pattern searches.
Lesson 1139Dealing with Special Characters and Unicode
Searched CASE
is more flexible—each WHEN clause can contain any boolean condition.
Lesson 1031Simple CASE vs Searched CASE
Seasonal AR terms
appear as significant spikes in the PACF at seasonal lags that cut off, while the ACF shows a gradual decay at those seasonal intervals.
Lesson 796Identifying Seasonal Patterns
Seasonal decomposition
– separating the data into trend, seasonal, and residual components
Lesson 1405What is Seasonal Hybrid ESD?
Seasonal differencing
works the same way, but instead of subtracting adjacent points, you subtract observations that are *one full season apart*.
Lesson 737Seasonal DifferencingLesson 797Seasonal Differencing
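In code, seasonal differencing is a subtraction at lag `period`. The quarterly series below is a made-up example with a repeating pattern and a small year-over-year increase.

```python
def seasonal_difference(series, period):
    """Subtract the value one full season earlier from each observation."""
    return [series[t] - series[t - period] for t in range(period, len(series))]

# Hypothetical quarterly data: two years, period = 4
quarterly = [10, 20, 30, 40, 12, 22, 32, 42]
print(seasonal_difference(quarterly, period=4))  # [2, 2, 2, 2]
```

The result is shorter than the input by one season, because the first `period` observations have nothing to be compared against.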
Seasonal effects
If your business has monthly billing cycles, holiday shopping patterns, or fiscal calendar impacts, your test duration should span these periods.
Lesson 1484Duration and Timing Considerations
Seasonal equation
Updates the seasonal pattern for each period
Lesson 767Holt-Winters Additive ModelLesson 768Holt-Winters Multiplicative Model
Seasonal fluctuations remain constant
in absolute size regardless of the trend level
Lesson 743Additive vs Multiplicative Models
Seasonal Hybrid ESD
approach you've learned extends to multiple periods by iteratively or simultaneously accounting for each cycle.
Lesson 1408Handling Multiple Seasonal Periods
Seasonal MA terms
show up as significant spikes in the ACF at seasonal lags (12, 24, 36) while cutting off after a certain seasonal lag.
Lesson 796Identifying Seasonal Patterns
Seasonally adjusted data
is your original time series with the seasonal component removed, leaving you with just the trend and irregular components.
Lesson 748Seasonally Adjusted Data
Second batch arrives
Use Beta(12, 17) as your new prior → observe 5 successes, 8 failures → get Beta(17, 25) posterior
Lesson 1563Sequential Updating with New Data
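The arithmetic of this update can be sketched in a few lines, assuming the Beta-Binomial conjugate rule in which successes are added to the first parameter and failures to the second. The helper name is hypothetical.

```python
def update_beta(prior, successes, failures):
    """Conjugate update: Beta(a, b) -> Beta(a + successes, b + failures)."""
    alpha, beta = prior
    return alpha + successes, beta + failures

# Second batch from the entry above: prior Beta(12, 17), 5 successes, 8 failures
posterior = update_beta((12, 17), 5, 8)
print(posterior)  # (17, 25)

# Posterior mean of a Beta(a, b) distribution is a / (a + b)
posterior_mean = posterior[0] / sum(posterior)
print(f"{posterior_mean:.3f}")
```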
Second difference
Control group's change = (After - Before)
Lesson 1452The Difference-in-Differences Setup
Second difference (DiD)
Subtract the control group's change from the treatment group's change:
Lesson 1454Calculating the DiD Estimator
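A minimal numeric sketch of that subtraction. The group means are invented and the function name is an assumption.

```python
def did_estimate(treat_before, treat_after, control_before, control_after):
    """Treatment group's change minus control group's change."""
    treatment_change = treat_after - treat_before
    control_change = control_after - control_before
    return treatment_change - control_change

# Hypothetical average outcomes before and after a policy change
effect = did_estimate(treat_before=100, treat_after=130,
                      control_before=90, control_after=105)
print(effect)  # 15
```

Here the treatment group rose by 30 and the control group by 15, so 15 is the part of the change attributed to the treatment, provided the two groups would otherwise have followed parallel trends.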
Second evidence (witness testimony)
Use that 60% as your new prior → apply Bayes' Theorem again → posterior becomes 85%.
Lesson 114Sequential Updating
Second join
(INNER): Only rows where payment exists stay in the result
Lesson 952Mixing Join Types
Second layer
Three supporting pillars—"Customer surveys show strong demand," "A/B test validated the prediction," "Risk analysis shows minimal downside.
Lesson 1952The Pyramid Principle: Leading with Conclusions
Second-order differencing
means you difference the already-differenced data:
Lesson 736Higher-Order Differencing
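For instance, applying a first difference twice to a made-up quadratic series leaves a constant:

```python
def difference(series):
    """First difference: each value minus the one before it."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical series with a quadratic trend
squares = [1, 4, 9, 16, 25]

first = difference(squares)   # [3, 5, 7, 9], still trending
second = difference(first)    # [2, 2, 2], trend removed
print(first, second)
```

Each round of differencing shortens the series by one observation.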
Secondary metrics
protect you from winning the battle but losing the war.
Lesson 1478Defining Success MetricsLesson 1485Documentation and Pre-Registration
Secondary use
occurs when data collected for one specific purpose gets repurposed for something else—often without obtaining fresh consent from the individuals involved.
Lesson 1915Secondary Use and Scope Creep
Secure auctions
where bids remain secret until the winner is determined
Lesson 1903Secure Multi-Party Computation
Security updates
Credentials refresh, access control adjustments
Lesson 1979Maintenance and Sustainability Considerations
See dynamic effects
Does the policy effect grow or fade over time?
Lesson 1457Multiple Time Periods and Staggered Adoption
Seek peer review
from colleagues with no stake in the outcome
Lesson 35Conflicts of Interest and Independence
Segment by path type
– compare conversion rates across different journey patterns
Lesson 1683Multi-Path and Non-Linear Funnels
Segment by user tenure
Compare new users (no primacy effect) separately from existing users
Lesson 1525Novelty and Primacy Effects
Segment by user type
Power users, casual users, and at-risk users have different engagement profiles
Lesson 1693Defining User Engagement
Segment differences
Compare curves using the log-rank test to see which groups need different retention strategies
Lesson 835Customer Churn Prediction with Survival Analysis
Segment insights
Do paid users stick around longer than free users?
Lesson 1659Comparing Retention Across Cohorts
Segmented analysis
By customer, product line, or geography
Lesson 1984Parameterized Reports
SELECT columns
Choose which columns to display from either or both tables
Lesson 919Basic INNER JOIN Syntax
Select every kth element
Starting from position 3, select every 10th element: the 3rd, 13th, 23rd, 33rd.
Lesson 235Systematic Sampling
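The selection rule in the entry above maps directly onto list slicing. This is a small sketch; the function name `systematic_sample` and the population of 40 numbered items are assumptions made for illustration.

```python
def systematic_sample(items, start, k):
    """Select every k-th element, beginning at the 1-based position `start`."""
    return items[start - 1::k]


population = list(range(1, 41))  # items numbered 1 through 40
sample = systematic_sample(population, start=3, k=10)

print(sample)  # [3, 13, 23, 33]
```

In practice the starting position is usually drawn at random from the first k positions so that every element has the same chance of selection.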
Select the numeric variable
you want to summarize
Lesson 1185Grouped Summary Statistics
Select the parameters
that minimize the chosen error metric
Lesson 772Holt-Winters Parameter Optimization
Selectboxes
provide dropdown menus for choosing from predefined options:
Lesson 1332Streamlit Widgets: Inputs and Controls
Selecting features
that matter most: removing redundant or irrelevant variables that add noise without signal, reducing dimensionality while preserving information.
Lesson 2088Stage 4: Feature Engineering and Preparation
Selective reporting
Hiding inconvenient findings or uncertainty
Lesson 1926The Honest Broker Role
Selectivity
is how well a query condition narrows down the result set.
Lesson 1083Index Selectivity and Cardinality
Self-contained logic
Keep related calculations within a single query instead of multiple separate queries
Lesson 959Introduction to Subqueries in WHERE
Seller utilization
% of available supply actually transacted
Lesson 1630Marketplace Metrics: GMV, Take Rate, and Liquidity
Senior Data Scientist
Own complex projects end-to-end, mentor juniors informally
Lesson 2140Individual Contributor vs Management Tracks
Senior Manager/Director
Manage multiple teams or managers, set team strategy, align with business
Lesson 2140Individual Contributor vs Management Tracks
Sensitive attributes
leak through proxy variables—attributes correlated with protected classes.
Lesson 1888Protected Classes and Sensitive Attributes
Sensitive to all values
Every number in your dataset affects the mean—change one value, and the mean changes
Lesson 39The Mean (Arithmetic Average)
Sensitivity analyses
showing how results change under different assumptions
Lesson 1949Anticipating Questions: Building in Appendices
Sensitivity analysis
is the practice of deliberately varying your prior choices and observing how the posterior distribution responds.
Lesson 1572Sensitivity Analysis and Prior Robustness
Sensors
are specialized operators that continuously check for specific conditions—like whether another pipeline has completed or if a particular file exists in storage.
Lesson 1845Cross-Pipeline Dependencies
Sensors and IoT devices
Real-time measurements from physical equipment
Lesson 11Data Collection and Acquisition
Separate must-fix from suggestions
Use tags like "critical" vs "nit" or "optional."
Lesson 2024Code Review Best Practices
Separate signal from noise
by isolating the long-term pattern from short-term variability
Lesson 706Trend: Long-Term Direction
Separating columns
means splitting one column containing compound data (like "Smith, John" or "2024-01-15 14:30:00") into multiple columns ("LastName", "FirstName" or "Date", "Time").
Lesson 1147Separating and Uniting Columns
separation of concerns
the statistical calculation is independent of how you choose to visualize it.
Lesson 1352Statistical Transformations with stat_* LayersLesson 2069Project Directory Structure
Sequence validation
Check if values follow expected patterns
Lesson 1024LAG Function: Accessing Previous Row Values
Sequential
No gaps in numbering (always 1, 2, 3, ...)
Lesson 1007ROW_NUMBER(): Assigning Unique Row Numbers
Sequential analysis
Analyze patterns across adjacent time periods
Lesson 1023Introduction to Window Functions: LAG and LEAD
Sequential chains
`task_a >> task_b >> task_c`
Lesson 1843Declaring Dependencies in Orchestration Tools
Sequential decomposition
Remove the strongest seasonal component first, then detect weaker ones in the residuals
Lesson 1408Handling Multiple Seasonal Periods
Sequential events
Comparing different timestamps or events within a single `events` table
Lesson 945Introduction to Self-Joins
Sequential ordering
Time flows in one direction; past observations may predict future ones, but not vice versa
Lesson 704What Makes Time Series Data Different?
Sequential testing
(also called *sequential analysis* or *continuous monitoring*) provides statistical methods that account for continuous or repeated looks at accumulating data.
Lesson 1510Sequential Testing Overview
Serializable
Transactions run as if they're completely alone (safest but slowest)
Lesson 1116Transaction Isolation and Concurrency
Server metrics monitoring
CPU, memory, or network traffic that follows daily business cycles
Lesson 1411Applications and Limitations
Service Level Agreement (SLA)
is a formal promise made to stakeholders or customers about minimum service levels, often with consequences if broken.
Lesson 1860SLA and SLO Definitions
Service Level Objective (SLO)
is a specific, measurable target for a service's performance—think of it as your internal goal.
Lesson 1860SLA and SLO Definitions
Session
Each visit to your site gets randomized independently
Lesson 1481Unit of Randomization
Session data
timestamps, referral sources, pages visited
Lesson 1719The Customer Journey and Touchpoints
Session Depth
counts the number of actions or page views within a session.
Lesson 1695Session-Based Engagement Metrics
Session Frequency
measures how often a user starts new sessions over a given period (e.
Lesson 1695Session-Based Engagement Metrics
Session Recency
measures the time since a user's last session.
Lesson 1695Session-Based Engagement Metrics
Sessions
are your workspace for database operations.
Lesson 1122Creating Tables and Session Management
Set alpha accordingly
Lower if Type I is costly; higher if Type II is costly.
Lesson 334Setting Alpha: Choosing Your Significance Level
Set constraints
based on cash position (max payback acceptable)
Lesson 1759Optimizing ROAS, CAC, and Payback Together
SET DEFAULT
Similar to SET NULL, but sets the foreign key to a predefined default value instead.
Lesson 1057ON DELETE and ON UPDATE Actions
Set priors
for each group's mean (often using the Normal-Inverse-Gamma or Normal-Normal models you've learned)
Lesson 1570Comparing Two Means: Bayesian Approach
Set random seeds
Make randomness predictable
Lesson 30The Reproducibility Crisis and Solutions
Set thresholds in advance
document your acceptance criteria before seeing data
Lesson 1492Rerandomization and Practical Implementation
Set time windows carefully
– decide if a 30-day journey with loops still counts as a single funnel attempt
Lesson 1683Multi-Path and Non-Linear Funnels
Set up hypotheses
H₀: The probability of switching in either direction is equal
Lesson 436Conducting McNemar's Test
Set your objective
maximize total conversions, revenue, or profit
Lesson 1742Budget Optimization Using MMM
Setup (context)
What problem motivated this analysis?
Lesson 1933The Power of Narrative in Data Communication
Setup cells
Import libraries and load data (code + output)
Lesson 1982Literate Programming with Notebooks
Shape + Color
In scatter plots, use different point shapes (circles, triangles, squares) in addition to different colors for categories
Lesson 1251Avoiding Reliance on Color Alone
Shape parameter
α_new = α_prior + Σx (add all observed counts)
Lesson 1552Gamma-Poisson Conjugacy
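A short sketch of the update in the entry above. The entry states only the shape rule; the rate rule used here (add the number of observations) is the standard companion under the shape-rate parameterization and is an assumption relative to this glossary. The function name and the counts are hypothetical.

```python
def update_gamma(alpha_prior, beta_prior, counts):
    """Gamma-Poisson conjugate update under the shape-rate parameterization."""
    alpha_new = alpha_prior + sum(counts)  # shape: add all observed counts
    beta_new = beta_prior + len(counts)    # rate: add the number of observations (assumed rule)
    return alpha_new, beta_new


# Illustrative prior Gamma(2, 1) and three observed Poisson counts.
alpha_new, beta_new = update_gamma(2.0, 1.0, counts=[3, 5, 4])
posterior_mean = alpha_new / beta_new  # mean of Gamma(shape, rate) is shape / rate

print(alpha_new, beta_new, posterior_mean)  # 14.0 4.0 3.5
```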
Shape parameter (α, "alpha")
Controls the shape of the curve.
Lesson 181Gamma Distribution: Shape and Rate Parameters
Shapiro-Wilk test
Tests the null hypothesis that residuals are normally distributed
Lesson 449Normality of ResidualsLesson 570Q-Q Plots vs Formal Normality Tests: When Visual Checks Matter
Share of Voice
tracks your brand's mentions versus competitors—critical for measuring platform or brand dominance within a category.
Lesson 1631Social Media Metrics: DAU/MAU and Content Engagement
Share your environment
Specify exact software versions
Lesson 30The Reproducibility Crisis and Solutions
Shared axes
Zooming on the x-axis affects all subplots
Lesson 1304Subplots and Linked Interactions
Shared credit models
treat the outcome as jointly owned, rewarding collaboration.
Lesson 1640Attribution in Multi-Team Environments
Sharp
A scholarship given to *all* students scoring ≥70 on an entrance exam.
Lesson 1461Sharp vs Fuzzy RDD
Shift+Tab
(to move backward), **Enter** or **Space** (to activate), and **arrow keys** (for fine control).
Lesson 1253Interactive Accessibility: Keyboard Navigation
Ship something
rather than nothing
Lesson 2121Timeboxing and Deadlines
Shopping patterns
→ protected class membership
Lesson 1889Proxy Variables and Redlining
Short queries
When the subquery is just a few lines
Lesson 974When to Use FROM Subqueries vs CTEs
Short-term initiatives
(this quarter)
Lesson 1970Recommendations and Next Steps
Shortened Session Duration
Sessions getting briefer over time suggest decreasing value extraction: users aren't finding what they need or are losing interest.
Lesson 1700Leading Indicators of Disengagement
Show sensitivity analyses
that reveal how fragile findings are
Lesson 1929Avoiding Cherry-Picking Results
Show, don't tell
Give viewers the chart without explaining it.
Lesson 1964Testing Visualizations with Audiences
Showing temporal change
Population growth, stock prices, disease spread
Lesson 1306Animation and Time-Based Transitions
Showing uncertainty
through confidence intervals
Lesson 1288Point Plots for Trend Visualization
Shrinkage
means your posterior estimate gets "pulled" away from extreme sample values toward your prior belief.
Lesson 1569Shrinkage and Regularization Effects
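One way to see the "pull" described above is a sketch under a Normal-Normal model with known data variance, which is an assumption made here for illustration rather than something the entry specifies. The posterior mean is then a precision-weighted average of the prior mean and the sample mean. All names and numbers below are hypothetical.

```python
def normal_posterior_mean(prior_mean, prior_var, sample_mean, n, data_var):
    """Posterior mean for a Normal prior and Normal likelihood with known variance."""
    prior_precision = 1.0 / prior_var
    data_precision = n / data_var
    # Weight on the data grows with sample size and shrinks with noisier data.
    weight_on_data = data_precision / (prior_precision + data_precision)
    return weight_on_data * sample_mean + (1.0 - weight_on_data) * prior_mean


# Prior centered at 0; an extreme sample mean of 2 from only 4 noisy observations.
shrunk = normal_posterior_mean(
    prior_mean=0.0, prior_var=1.0, sample_mean=2.0, n=4, data_var=4.0
)
print(shrunk)  # 1.0
```

With these numbers the prior and the data carry equal precision, so the estimate lands halfway between them; more observations would move it closer to the sample mean.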
Signal strength
How informative is each observation?
Lesson 1549Prior-Likelihood Trade-offs
Significance bounds
(also called confidence intervals) help you answer this question.
Lesson 723Significance Bounds in ACF Plots
Significance indicators
often asterisks or yes/no flags
Lesson 462Interpreting and Reporting Post-Hoc Results
significance level
, denoted by the Greek letter **α** (alpha), is a predetermined probability threshold you set *before* conducting a hypothesis test.
Lesson 323What is a Significance Level (α)?Lesson 388Effect Size in Sample Size Planning
Signup date
When a user creates an account (classic acquisition cohort)
Lesson 1646Defining Cohort Start Events
Silent failures
If Task A fails but Task B runs anyway (because it doesn't know to wait), you'll process incomplete or corrupted data without realizing it.
Lesson 1840What is Dependency Management in Pipelines?
Similarity
Objects sharing visual properties (color, shape, size) are seen as belonging together.
Lesson 1236Gestalt Principles in Visualization
Simple area chart
When you want to emphasize cumulative growth or magnitude over time
Lesson 1227Area Charts and Stacked Area Charts
Simple CASE
works like a switch statement in programming.
Lesson 1031Simple CASE vs Searched CASE
Simple composition
If you run 10 queries each with ε=0.
Lesson 1900Privacy Budget and Composition
Simple Exponential Smoothing
for level-only data and **Double Exponential Smoothing (Holt's Method)** for data with trend.
Lesson 765Introduction to Holt-Winters Method
Simple Moving Average (SMA)
smooths out short-term fluctuations in your time series by averaging the most recent *n* data points.
Lesson 751Simple Moving Average (SMA)
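A minimal sketch of the averaging described above, in plain Python. The function name `simple_moving_average` and the sample values are hypothetical; the output starts at the first position where a full window of *n* points is available.

```python
def simple_moving_average(values, n):
    """Average of the most recent n points at each position with a full window."""
    if n <= 0 or n > len(values):
        raise ValueError("n must be between 1 and len(values)")
    return [
        sum(values[i - n + 1 : i + 1]) / n
        for i in range(n - 1, len(values))
    ]


sma = simple_moving_average([10, 12, 14, 16, 18], n=3)
print(sma)  # [12.0, 14.0, 16.0]
```

A larger *n* smooths more aggressively but reacts more slowly to genuine changes in the series.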
Simple random sampling
keeps it straightforward.
Lesson 243Choosing the Right Sampling Method
Simple ratio check
Calculate the ratio of residual deviance to degrees of freedom from your fitted Poisson model.
Lesson 693Overdispersion in Count Data
Simple, one-time transformations
When you need a quick intermediate step and won't reference it again
Lesson 974When to Use FROM Subqueries vs CTEs
Simplify communication
"Our average customer is 34 years old" is clearer than showing a spreadsheet of 10,000 ages
Lesson 38What is Central Tendency?
Simulate color blindness
on your chart—can distinctions still be seen?
Lesson 1254Testing Visualizations for Accessibility
Simulation tools
let you preview how your visualizations appear under different accessibility conditions:
Lesson 1254Testing Visualizations for Accessibility
Simulation visualization
Animate particle movements or algorithm steps
Lesson 1327Creating Animations with FuncAnimation
Simultaneity
X and Y determine each other simultaneously
Lesson 553Exogeneity: X Must Be Independent of Errors
Simultaneous decomposition
Use methods like STL (Seasonal-Trend decomposition using Loess) with multiple seasonal periods specified
Lesson 1408Handling Multiple Seasonal Periods
Single auto-incrementing integer
`customer_id` (1, 2, 3, ...)
Lesson 1048What Are Primary Keys?
Single column
"What's the average salary per department?"
Lesson 905Grouping by Multiple Columns: Basics
Single outlier
Designed to detect one outlier at a time
Lesson 1389What is Grubbs' Test?
Single peak
One clear mode, not multiple humps
Lesson 377Testing Normality: Visual Methods
Single samples vary
Your one sample mean might be 170 cm, but someone else's might be 168 cm.
Lesson 251What is a Sampling Distribution?
Single source of truth
One person ensures consistent calculation and definition
Lesson 1619What is Metric Ownership?
Single trial
One Bernoulli trial = one observation
Lesson 123Bernoulli Trial Definition and Properties
Size of the gap
between groups over time
Lesson 817Comparing Multiple Survival Curves
Size perception
Humans judge area imperfectly, so don't encode critical comparisons in bubble size alone
Lesson 1229Bubble Charts for Three Variables
Size variation
can represent a third numeric variable—larger bubbles for higher values create a "bubble chart" effect.
Lesson 1265Scatter Plots: Relationships Between Variables
Skeptical stakeholders
Meet them where they are.
Lesson 1953Adjusting Statistical Depth by Audience
Skewed distributions
(lopsided): The mean gets "pulled" toward extreme values.
Lesson 42Comparing Mean, Median, and ModeLesson 221CLT for Different Population Distributions
Skewness direction
(long tail left or right)
Lesson 1286Violin Plots and Distribution Shape
Skip tasks conditionally
using trigger rules
Lesson 1836Task Dependencies and Flow Control
Skip the jargon
No one outside your team needs to hear "coefficient" or "residuals"
Lesson 530Communicating Results to Non-Technical Audiences
SLA misses
Dashboard alert for stakeholders
Lesson 1851Error Logging and Notifications
Sleep Quality
→ **Alertness** (poor sleep reduces alertness)
Lesson 1469Building a Simple Causal DAG
Sliders
let users select numeric values within a range—perfect for filtering years, adjusting thresholds, or setting parameters:
Lesson 1332Streamlit Widgets: Inputs and Controls
Slow down dramatically
processing scales with the product, not the sum
Lesson 943CROSS JOIN Results: Size and Structure
Slow onboarding
It's unclear which files or features represent the "real" solution
Lesson 2135Dead Experimental Code and Feature Sprawl
Slow sorting
operations (especially with `ORDER BY`)
Lesson 911Performance Considerations with Multiple Groups
Slow-moving funnel
Users take days between steps (friction, confusion, or decision paralysis)
Lesson 1681Time-Based Funnel Analysis
Slower payback
= need more capital or slower scaling
Lesson 1757Payback Period: Definition and Importance
Slowly decaying ACF
Bars decrease gradually → suggests a trend or non-stationarity
Lesson 722ACF Plots and Interpretation
Small drop
Your predictors may not be useful—the intercept-only model was nearly as good.
Lesson 698Null and Residual Deviance
Small multiples
Show different "slices" of your data in separate 2D panels
Lesson 1329Effective Use and Pitfalls of 3D Visualizations
Small p-value (e.g., 0.01)
Your observed data would be very rare if H₀ were true.
Lesson 318What is a P-Value?
Small sample sizes
(n < 30): Your confidence intervals and p-values rely heavily on the normality assumption
Lesson 550Normality of Residuals
Small tables
The table has few columns and you genuinely need all of them
Lesson 851Selecting All Columns with Asterisk
Smaller storage footprint
Less duplication means less disk space
Lesson 1810Snowflake Schema and Normalization Trade-offs
Smaller ε
= stronger privacy (more noise)
Lesson 1898Differential Privacy Fundamentals
Smart home devices
recording conversations used for product development (and sometimes reviewed by humans)
Lesson 1922Surveillance and Secondary Data Uses
Smooth lines
use `stat_smooth()` to fit regression or loess curves
Lesson 1343Statistical Transformations
Smooth seasonal patterns
– Short-term irregularities fade away
Lesson 750What is a Moving Average?
Smooth trends
`stat_smooth()` fits regression lines or curves
Lesson 1352Statistical Transformations with stat_* Layers
Snapshots
rather than patches (you can't "diff" binary files meaningfully)
Lesson 1871Why Version Control for Data?Lesson 2044Recreating Environments from Specifications
Snowflake
Pure separation; pause compute clusters without affecting data
Lesson 1813Modern Cloud Data Warehouses: Snowflake, BigQuery, Redshift
Social
Unpaid clicks from social media platforms (Facebook, Twitter, LinkedIn, Instagram)
Lesson 1712Common Channel Categories
Social media followers
(without reach or influence)
Lesson 1612What Are Vanity Metrics?
Social media likes
without measuring conversion or brand lift
Lesson 1616Metrics Divorced from Revenue
Software-defined assets
approach treats data assets as first-class citizens.
Lesson 1839Alternative Orchestration Tools
Some aggregations
that require full dataset knowledge are unavailable or slow.
Lesson 1796Limitations and Differences from Pandas
Sorted retrieval
(`ORDER BY`) comes nearly free since data is already ordered
Lesson 1079B-Tree Indexes: Structure and Mechanics
Source connectors
Extract data from databases, APIs, cloud storage, or streaming services
Lesson 1822What is a Data Pipeline?
Source information
Original data location, collection date, version
Lesson 2065Tracking Data Lineage
Source URL or Location
The exact web address, API endpoint, database connection string, or file path where you obtained the data.
Lesson 2063Essential Metadata to Capture
Source/Derivation
Where the data came from or how it was calculated
Lesson 2064Creating Data Dictionaries
Space
(to activate), and **arrow keys** (for fine control).
Lesson 1253Interactive Accessibility: Keyboard Navigation
Spark Core
is the foundation of the entire framework.
Lesson 1775Spark Components: Core, SQL, MLlib, Streaming
Spark Streaming
enables real-time data processing through micro-batching:
Lesson 1775Spark Components: Core, SQL, MLlib, Streaming
Spatial Correlation
Geographic data points near each other (neighboring counties, adjacent plots of land) tend to be more similar than distant ones.
Lesson 381Independence Assumption and Its Violations
Spatial data
Neighboring geographic areas influence each other
Lesson 548Independence of Observations
Spatial grouping
(separate positions, but not ordered)
Lesson 1238Matching Encoding to Data Type
Spatial heatmaps
and **density maps** solve this by showing *where* activity is most concentrated, creating smooth gradients that reveal patterns invisible in raw point data.
Lesson 1312Heatmaps and Density Maps for Spatial Data
Spearman correlation
works with ranked data instead of raw values.
Lesson 1184Correlation Coefficients in Bivariate Analysis
Spearman's Rho
correlates the *ranks* of your data, essentially asking "how well does a linear relationship fit the ranked data?"
Lesson 490Kendall's Tau vs Spearman's Rho
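The two Spearman entries above can be illustrated by ranking each variable and then applying the ordinary Pearson formula to the ranks. This is a plain-Python sketch, not a library implementation; the helper names and the data are hypothetical, and ties receive the average of the ranks they span.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # monotonic but not linear in x
rho = spearman(x, y)
r = pearson(x, y)
```

Because `y` always increases with `x`, the ranks line up perfectly and `rho` equals 1, while the Pearson value on the raw data stays below 1.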
Specific and Actionable
Avoid vague advice like "improve customer retention."
Lesson 1970Recommendations and Next Steps
Specific and quantifiable
– Uses numbers, percentages, or binary outcomes
Lesson 1610Defining Measurable Key Results
Specification Limits
are the "voice of the customer."
Lesson 1400Control Limits vs Specification Limits
Speed and Scale
A biased recommendation algorithm can expose millions to harmful content in hours, far beyond what human curation could achieve.
Lesson 1923Algorithmic Amplification of Harm
Speed and simplicity
No transformation bottleneck during load—get data in fast, ask questions later.
Lesson 1816What is ELT? Extract, Load, Transform Explained
Speed up development
by working with small, fast result sets
Lesson 877LIMIT: Restricting the Number of Rows Returned
Spikes at regular intervals
(e.
Lesson 722ACF Plots and Interpretation
Spillovers
happen when the treatment affects the control group indirectly.
Lesson 1458Common DiD Pitfalls
Splines
and **piecewise methods** offer an alternative approach with some key advantages.
Lesson 662Polynomial Features vs Splines
Split
Partition your data into independent chunks (often by rows)
Lesson 1768Data Parallelism Fundamentals
Split each party's data
into encrypted "shares" distributed among participants
Lesson 1903Secure Multi-Party Computation
Split your data
Reserve the last portion (e.
Lesson 790Out-of-Sample Forecast Evaluation
Split your dataset
into strata based on confounder values (e.
Lesson 1430Controlling for Confounders: Stratification
Spot critical drop-off points
Is Week 1 your danger zone?
Lesson 1656Visualizing Retention Curves
Spot early warning signs
when new cohorts show unusual churn patterns
Lesson 1672Cohort-Based Churn Analysis
Spot real trends
See if unemployment is genuinely rising or just following seasonal patterns
Lesson 748Seasonally Adjusted Data
Spot trends over time
Are newer cohorts retaining better than older ones?
Lesson 1659Comparing Retention Across Cohorts
Spot underutilized gems
Low adoption but high frequency among adopters suggests poor discoverability
Lesson 1696Feature Adoption and Usage Frequency
Spotify's lightweight framework
that emphasizes simplicity and file-based targets.
Lesson 1839Alternative Orchestration Tools
Spreads
Which group shows more variability (wider IQR)?
Lesson 1186Box Plots and Violin Plots by Group
Spring
Unknown structure, want to discover communities
Lesson 1318Network Layout Algorithms
Sprint goal
"Deliver initial churn prediction baseline with three features"
Lesson 2113Timeboxing and Sprint Planning for Data Projects
Sprint-level
"This week's goal is baseline model only"
Lesson 2121Timeboxing and Deadlines
spurious correlation
occurs when two variables appear statistically related but have no genuine cause-and-effect relationship.
Lesson 494Spurious Correlations and CoincidenceLesson 1422Spurious Correlations
Spurious relationships
We might detect patterns or correlations that don't actually exist, leading to false confidence in our forecasts.
Lesson 713Why Stationarity MattersLesson 734Why Differencing and Detrending Matter
SQL and Stats Tests
often come first as screeners.
Lesson 2142Interviewing: Technical and Behavioral Prep
SQL Server
Often case-insensitive, but depends on collation settings
Lesson 862Case Sensitivity in Text FilteringLesson 940Database Support and Alternatives
SQLAlchemy Core
provides a *SQL Expression Language*—a Pythonic way to write SQL queries using functions and methods instead of raw strings.
Lesson 1118SQLAlchemy Core vs ORM
SQLAlchemy ORM
provides a higher-level abstraction where you work with *Python classes and objects* instead of tables and rows.
Lesson 1118SQLAlchemy Core vs ORM
Square root
(`sqrt(Y)`) is gentler than log and works well for count data.
Lesson 591When and Why to Transform Variables
Square Root Transformation
(`sqrt(x)`) works particularly well for:
Lesson 213Square Root and Cube Root Transformations
SS (Sum of Squares)
How much total variation comes from each source
Lesson 444The ANOVA Table
Stability over time
The relationship shouldn't suddenly shift
Lesson 1518The Relationship Between Surrogate and Business Metrics
Stabilize coefficient estimates
Less wobbling between models
Lesson 585Remedies: Variable Selection
Stabilizing variance
– Making the spread of data more consistent across different ranges
Lesson 212Log Transformations
Stable patterns
All cohorts behave similarly.
Lesson 1650Comparing Cohorts Over Time
Stack traces
The full path of execution leading to the failure
Lesson 1851Error Logging and Notifications
Stacked bar charts
pile segments on top of each other to show both part-to-whole relationships and totals.
Lesson 1188Stacked and Grouped Bar Charts
Staff Data Scientist
Technical leadership across multiple projects, set standards, solve org-wide problems
Lesson 2140Individual Contributor vs Management Tracks
Stage 1
Use cluster sampling to randomly select a few states
Lesson 238Multistage Sampling
Stage 2
Use stratified sampling to select universities within those states (ensuring you get different types: public, private, large, small)
Lesson 238Multistage Sampling
Stage 3
Use simple random sampling to select individual students from each chosen university
Lesson 238Multistage Sampling
Stage the resolved files
with `git add <filename>` (or `git add .`)
Lesson 2011Resolving Merge ConflictsLesson 2018Resolving Conflicts During Rebase
Staged files
Files you've added to the staging area with `git add`, ready for the next commit
Lesson 1998Checking Repository Status
Staging Area
(Index): The box where you arrange items you've decided to ship
Lesson 1993The Three States: Working Directory, Staging, Repository
Stakeholder Alignment
Everyone agrees on what "success" looks like before you start
Lesson 10Problem Definition and ScopingLesson 1973Report Review and Quality Checklist
Stakeholder communication
Inform affected communities before public release when possible
Lesson 1925Mitigation Strategies and Responsible Disclosure
Stakeholder confidence
They wonder why you're still working instead of moving forward
Lesson 2120The Opportunity Cost of Iteration
Stakeholder indifference
Additional precision doesn't change the business decision
Lesson 2116Diminishing Returns and the 80/20 Rule
Stakeholder learning
Non-technical partners often don't fully understand what they need until they see something concrete.
Lesson 2109Why Data Science is Inherently Iterative
Stakeholder management
Translating technical work into business impact
Lesson 2142Interviewing: Technical and Behavioral Prep
Stakeholder-driven iteration
Business users see preliminary results and refine requirements.
Lesson 2092Iteration and Feedback Loops in Practice
Stakeholders need self-service analytics
(executives checking KPIs, analysts exploring trends)
Lesson 1330Introduction to Interactive Dashboards
Stakes are low
Minor UI tweaks, button colors, or copy changes
Lesson 1522Balancing Speed and Accuracy in Metric Selection
Stale tracking
(data pipeline breaks, no one notices for weeks)
Lesson 1619What is Metric Ownership?
Stamen Terrain
Emphasizes topography and natural features
Lesson 1314Basemaps and Map Tiles
Standard
2-4 weeks to account for novelty bias
Lesson 1484Duration and Timing Considerations
Standard Deviation (SD)
measures how spread out the *individual values* in your dataset are from the mean.
Lesson 261Standard Error vs Standard Deviation
Standard deviation = 1
One unit on the horizontal axis equals one standard deviation
Lesson 194The Standard Normal Distribution
Standard Error (SE)
measures how spread out the *sample means* would be if you took many samples from the same population.
Lesson 261Standard Error vs Standard Deviation
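The distinction between the two entries above can be sketched with the standard library. The heights are made-up values; the standard error is estimated here with the usual formula, sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics

heights = [168.0, 170.0, 172.0, 169.0, 171.0]  # illustrative sample, in cm

# Standard deviation: spread of the individual values around the sample mean.
sd = statistics.stdev(heights)

# Standard error of the mean: estimated spread of sample means across repeated samples.
se = sd / math.sqrt(len(heights))

print(round(sd, 3), round(se, 3))  # 1.581 0.707
```

Collecting more observations leaves the standard deviation roughly unchanged but shrinks the standard error, because the denominator grows with the sample size.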
standard normal distribution
is a special case of the normal distribution with a **mean (μ) of 0** and a **standard deviation (σ) of 1**.
Lesson 194The Standard Normal DistributionLesson 403Finding P-Values for Proportion Tests
Standard normal tables
(Z-tables) after converting to Z-scores
Lesson 173Calculating Probabilities with the Normal Distribution
Standardization for Comparison
Comparing SAT scores (mean 1050, SD 200) to ACT scores (mean 21, SD 5) directly is meaningless.
Lesson 201Z-Score Applications and Limitations
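A brief sketch of the comparison in the entry above. The means and standard deviations come from the entry; the two individual scores (1250 and 28) are hypothetical.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies above or below the mean."""
    return (value - mean) / sd


sat_z = z_score(1250, mean=1050, sd=200)  # hypothetical SAT score
act_z = z_score(28, mean=21, sd=5)        # hypothetical ACT score

print(sat_z, act_z)  # 1.0 1.4
```

On the common scale, the ACT score sits further above its mean than the SAT score does, a comparison the raw numbers cannot support.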
Standardized coefficients
(also called **beta weights** or **β weights**) put all predictors on the same scale by expressing them in standard deviation units.
Lesson 608Standardized Coefficients (Beta Weights)
Standardized residuals
divide each residual by an estimate of its standard deviation:
Lesson 563Standardized and Studentized ResidualsLesson 588Standardized and Studentized Residuals
Standardizing capitalization
ensures "Apple", "APPLE", and "apple" are recognized as the same.
Lesson 1138Cleaning and Standardizing Text Fields
star schema
is a common data warehouse design where one central **fact table** (containing measurements like sales amounts, quantities, or counts) connects to multiple **dimension tables** (containing descriptive attributes like customer names, product detail...
Lesson 956Star Schema JoinsLesson 1808Star Schema and Fact Tables
start
a transaction block and how to **commit** it to save your work.
Lesson 1112Starting and Committing TransactionsLesson 1582Updating Beliefs with Test Data
Start small and targeted
Don't attempt to rewrite everything at once.
Lesson 2137Refactoring Strategies and Debt Paydown
Start visual
Create a histogram and Q-Q plot.
Lesson 210Combining Visual and Statistical Methods
Start with d=0
Check if your original series is already stationary using visual inspection and the Augmented Dickey-Fuller or KPSS tests you learned earlier.
Lesson 778Determining Differencing Order (d)
Start with domain knowledge
Which predictors make theoretical sense?
Lesson 633Practical Model Selection Strategy
Start with initial guesses
for all parameters (θ₁, θ₂, ...)
Lesson 1591Gibbs Sampling for Multivariate Posteriors
Start with the answer
Lead with your key finding or recommendation (remember the Pyramid Principle from lesson 1952).
Lesson 1965Progressive Disclosure Techniques
Start with your prior
`P(θ)` — your belief about parameter θ before seeing data
Lesson 1545Calculating the Posterior Distribution
Starts at 0
F(-∞) = 0 (no probability accumulated yet)
Lesson 157Cumulative Distribution Functions (CDFs) for Continuous Variables
State conclusions in context
, not just statistical jargon
Lesson 368Common Pitfalls and Best Practices
Static validation
Parse your DAG definition without executing it.
Lesson 1846Testing and Validating Dependency Graphs
Statistical confirmation
Run stationarity tests after differencing.
Lesson 778Determining Differencing Order (d)
Statistical exploration is central
R's grammar of graphics makes iterative statistical visualization seamless.
Lesson 1375Choosing Tools: When to Use R vs Python for Visualization
Statistical hypothesis testing
lets you quantify whether observed differences are likely real effects or just sampling noise.
Lesson 1684Statistical Significance in Funnel Comparisons
Statistical independence
Sessions from the same user aren't independent—they're correlated.
Lesson 1481Unit of Randomization
Statistical power increases
– the test becomes better at detecting *any* deviation
Lesson 209Sample Size Considerations in Normality TestsLesson 341Effect Size and Power
Statistical power varies
Some comparisons have more precision than others
Lesson 468Balanced vs Unbalanced Designs
Statistical significance testing
answers the question: "Is this predictor's coefficient reliably different from zero, or could I have gotten this result just from random variation?"
Lesson 606Statistical Significance of Individual Coefficients
Statistical sophistication
Teams must understand alpha spending functions, confidence sequences, or group boundaries, not just basic t-tests
Lesson 1515Trade-offs: Sample Size, Speed, and Complexity
Statistical test
Run a DiD-style regression using only pre-treatment data, with placebo "treatment" dates.
Lesson 1456Testing Parallel Trends
Statistical testing
you're testing whether categories differ from the reference, not whether they differ from zero
Lesson 643Interpreting Coefficients Relative to Reference
Statistical validation
A test confirming the change isn't random noise
Lesson 1946Supporting Your Claims with Evidence
statistically significant
doesn't mean it's **practically meaningful**.
Lesson 609Practical vs Statistical SignificanceLesson 723Significance Bounds in ACF Plots
Statistics
focuses on testing hypotheses and understanding uncertainty with mathematical rigor.
Lesson 1Defining Data ScienceLesson 7The Data Science Skill StackLesson 229Defining Samples and Statistics
Statistics (stat)
Transformations applied to data (means, counts, smoothing)
Lesson 1340The Seven Layers of Grammar
Status dependencies
Certain field combinations are impossible.
Lesson 1155Consistency Checks Across Fields
Stay interpretable
Stakeholders understand exactly what changed and why
Lesson 2128Data Distribution Shifts Frequently
Steep drops
Many events happening at specific times
Lesson 815Survival Curve Plots and Interpretation
Step 1: Decompose
your historical data into trend, seasonal, and remainder components using your chosen method (classical or STL).
Lesson 749Using Decomposition for Forecasting
Step 5: Recombine
using your model type:
Lesson 749Using Decomposition for Forecasting
Step-by-step instructions
How to run the analysis from start to finish
Lesson 1989Best Practices for Sharing Reproducible Reports
STL
stands for **S**easonal-**T**rend decomposition using **L**oess.
Lesson 745STL Decomposition (Seasonal-Trend Loess)
Stop
at the first p-value that fails to reject; all subsequent tests are also not rejected
Lesson 1504Holm-Bonferroni Method
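A minimal sketch of the Holm-Bonferroni stopping rule, assuming the usual step-down thresholds of alpha / (m - k + 1) for the k-th smallest p-value. The function name and the example p-values are illustrative only.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where H0 is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # rank is 0-based, so the threshold is alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            # First failure: this test and all later ones are not rejected.
            break
    return rejected


# Hypothetical p-values from four tests.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

Here 0.005 and 0.01 clear their thresholds (0.0125 and about 0.0167), but 0.03 fails against 0.025, so testing stops and 0.04 is never rejected even though it is below 0.05.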
Stop and accept H₀
(no significant difference)
Lesson 1511Sequential Probability Ratio Test (SPRT)
Stop and reject H₀
(declare a winner)
Lesson 1511Sequential Probability Ratio Test (SPRT)
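The two SPRT stopping decisions above can be sketched for Bernoulli outcomes. This assumes Wald's approximate boundaries, log((1 - beta) / alpha) and log(beta / (1 - alpha)); the function name, rates, and error levels are illustrative choices, not values from the lesson.

```python
import math


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Sequentially test H0: p = p0 against H1: p = p1 on 0/1 outcomes."""
    upper = math.log((1 - beta) / alpha)  # crossing above: stop and reject H0
    lower = math.log(beta / (1 - alpha))  # crossing below: stop and accept H0
    llr = 0.0  # running log-likelihood ratio
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "continue sampling", len(observations)


# Hypothetical streams: all conversions, then no conversions.
decision_hi = sprt_bernoulli([1] * 10, p0=0.1, p1=0.3)
decision_lo = sprt_bernoulli([0] * 10, p0=0.1, p1=0.3)
```

If neither boundary is crossed, the function reports that sampling should continue, which is the third possible outcome of each SPRT step.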
Stop early
when evidence is strong (saving time and resources)
Lesson 1510Sequential Testing Overview
Stop when stationary
Don't difference more than necessary—if your tests confirm stationarity, stop there.
Lesson 778Determining Differencing Order (d)
Stopping
Frequentist: fixed sample size or sequential correction needed. Bayesian: natural sequential updating, stop anytime.
Lesson 1580Bayesian vs Frequentist A/B Testing
Store data separately
Use cloud storage (S3, Google Cloud), shared drives, or dedicated data warehouses
Lesson 2070Separating Data from Code
Strain applications
displaying or processing thousands of rows
Lesson 911Performance Considerations with Multiple Groups
strata
(homogeneous subgroups) and then sampling proportionally from each stratum.
Lesson 236Stratified SamplingLesson 817Comparing Multiple Survival Curves
Strategic boundaries
Choosing cutoffs that produce desired patterns rather than natural ones
Lesson 1245Misleading Aggregations and Binning
Strategic goals
Long-term company objectives
Lesson 1516Business Metrics: Definition and Examples
Strategic planning
Identify which touchpoints work best at different customer journey stages
Lesson 1718Introduction to Marketing Attribution
Strategically aligned
Connect directly to your North Star Metric or broader business priorities
Lesson 1609Setting Effective Objectives
Stratified Cox models
allow you to account for a variable's effect on survival *without* assuming proportional hazards for that variable.
Lesson 832Stratified Cox Models
Stratified or adjusted approaches
More sophisticated corrections that balance power and error control
Lesson 824Multiple Group Comparisons
Stratified randomization
solves this by first dividing your sample into homogeneous subgroups (strata) based on key covariates, then randomizing *within* each stratum.
Lesson 1489Stratified Randomization Fundamentals
Stratified sampling
solves this by dividing your population into **strata** (homogeneous subgroups) and then sampling proportionally from each stratum.
Lesson 236Stratified SamplingLesson 237Cluster SamplingLesson 240Quota SamplingLesson 243Choosing the Right Sampling MethodLesson 1885Mitigation Strategies: Data Collection
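A small sketch of proportional stratified sampling using only the standard library. The helper name, the `fraction` parameter, and the urban/rural population are hypothetical; a real project might use a dataframe library instead.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum (at least one per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # sampling without replacement
    return sample


# Hypothetical population: 80 urban and 20 rural respondents.
population = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
sample = stratified_sample(population, stratum_of=lambda r: r[0], fraction=0.1)
```

Because each stratum contributes in proportion to its size, the 10-record sample keeps the population's 80/20 split.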
Streaming pipelines
work like a phone call—process information instantly as it flows through.
Lesson 1824Batch vs Streaming Pipelines
Streamlit
prioritizes **simplicity and speed**.
Lesson 1330Introduction to Interactive Dashboards
Streamlit Cloud
is the easiest option for Streamlit apps—simply connect your GitHub repository, and it deploys automatically.
Lesson 1338Deployment and Sharing Dashboards
strong
when it's much more likely to appear if the person is guilty than if innocent.
Lesson 112Legal Evidence and Jury ReasoningLesson 1610Defining Measurable Key Results
Strong correlation
Changes in the surrogate should consistently predict changes in the business metric
Lesson 1518The Relationship Between Surrogate and Business Metrics
Strong relationships
jump out as values near +1 or -1.
Lesson 511Reading and Interpreting Correlation Matrices
Strong validation exists
Your surrogate has proven correlation with business outcomes
Lesson 1522Balancing Speed and Accuracy in Metric Selection
Structural zeros
People who would *never* experience the event (e.
Lesson 695Zero-Inflated Models
Structure
How many columns?
Lesson 1151Schema Validation
Structure your narrative
around these three pillars—each becomes a mini-story within your larger presentation
Lesson 1940The Rule of Three in Data Storytelling
Structured Query Language
The full name of SQL, the standard language for querying relational databases.
Lesson 844What is SQL?
Student's t distributions
for heavier tails (more robust to outliers)
Lesson 1565Prior Distributions for Normal Means
Studentized residuals
go further: they refit the model *without* that specific observation and see how much it differs:
Lesson 563Standardized and Studentized ResidualsLesson 588Standardized and Studentized Residuals
Students
(StudentID, StudentName)
Lesson 1065Second Normal Form (2NF)
Style and consistency
Does it follow team conventions?
Lesson 2024Code Review Best Practices
Subgroup analyses
you plan to run, if any
Lesson 1508Pre-Registration and Correction Strategy
Subgroup analysis
Always disaggregate your fairness metrics across combinations of protected attributes (gender × race, age × disability status, etc.
Lesson 1893Intersectionality in Fairness
Subject matter experts
Talk to salespeople, operations staff, customers
Lesson 1201Domain Knowledge as a Hypothesis Source
Subjective labeling
When humans label training data (tagging images, rating sentiment, or classifying documents), their personal biases, cultural backgrounds, and varying interpretations create inconsistency.
Lesson 1880Measurement and Label Bias
Subscriber Acquisition Cost
Marketing spend divided by new subscribers, but media-specific: track which content drives sign-ups.
Lesson 1635Media and Content Metrics: Watch Time and Content Performance
Subscription duration modeling
treats cancellation as the "event" and subscription length as the "time" variable, letting you predict when customers are most likely to churn and what drives retention.
Lesson 838Subscription and Membership Duration Modeling
Subscription start
When a user begins a paid plan
Lesson 1646Defining Cohort Start Events
SUBSTRING
extracts a specific portion of a string
Lesson 1044String Manipulation: CONCAT, LENGTH, and SUBSTRING
Subtract 1
Because once you know the counts for all but one category, the last one is determined (they must sum to your total sample size)
Lesson 418Degrees of Freedom in Goodness of Fit
Subtract estimated parameters
If you had to estimate any population parameters from your data (like a mean or proportion), you lose additional degrees of freedom
Lesson 418Degrees of Freedom in Goodness of Fit
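The two subtraction rules for goodness-of-fit degrees of freedom combine into one line of arithmetic. The function name and the two example scenarios are illustrative.

```python
def gof_degrees_of_freedom(n_categories, n_estimated_params=0):
    """Categories, minus 1 for the fixed total, minus each estimated parameter."""
    return n_categories - 1 - n_estimated_params


# A six-sided die tested against a fully specified fair model.
df_die = gof_degrees_of_freedom(6)
# Eight count bins tested against a Poisson whose rate was estimated from the data.
df_poisson = gof_degrees_of_freedom(8, n_estimated_params=1)
```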
Subtract the mean
(x - μ): This centers your data point.
Lesson 196Calculating Z-Scores from Raw Data
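A short sketch of the full z-score calculation: center each value by subtracting the mean, then divide by the standard deviation. It assumes the population standard deviation (`pstdev`) is wanted; with a sample, `statistics.stdev` would be the usual substitute. The data are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mu) / sigma for every x in values."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


# Hypothetical raw data with mean 5 and population standard deviation 2.
scores = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
```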
Success
Commits automatically when the block completes
Lesson 1114Transaction Context Managers in Python
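One concrete instance of this behavior is the standard library's `sqlite3` connection, which can be used as a context manager: the transaction is committed if the block finishes normally and rolled back if it raises. The table and rows below are hypothetical, and other database drivers may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")

# Success path: the block completes, so the insert is committed.
with conn:
    conn.execute("INSERT INTO accounts VALUES ('alice', 100)")

# Failure path: the exception causes a rollback, so 'bob' is never stored.
try:
    with conn:
        conn.execute("INSERT INTO accounts VALUES ('bob', 50)")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

names = [row[0] for row in conn.execute("SELECT name FROM accounts ORDER BY name")]
```

Note that `with conn:` manages the transaction only; it does not close the connection.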
Success metrics
to track implementation
Lesson 1970Recommendations and Next Steps
Sudden shifts
equipment calibration changes, policy updates, or batch effects
Lesson 562Index Plots and Time-Ordered Residuals
Sum of absolute residuals
Better, but mathematically difficult to work with (no smooth derivative).
Lesson 517The Least Squares Criterion
Sum of raw residuals
No—positive and negative errors cancel out.
Lesson 517The Least Squares Criterion
Sum the positive ranks
(W⁺) and **negative ranks** (W⁻)
Lesson 392Wilcoxon Signed-Rank Test
Sum the ranks
for each group separately
Lesson 393Mann-Whitney U Test (Wilcoxon Rank-Sum)
SUM()
naturally ignores NULLs, so unmatched rows contribute nothing (which is usually what you want)
Lesson 933Aggregating with LEFT JOINs
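A sketch of this behavior using an in-memory SQLite database from Python; the tables and rows are hypothetical. One nuance worth seeing: when a group has no matching rows at all, `SUM()` returns NULL rather than 0, so `COALESCE` is a common way to turn that into a zero.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO orders VALUES (1, 20.0), (1, 5.0);
""")

# Ben has no orders, so the LEFT JOIN gives him one row with a NULL amount.
rows = conn.execute("""
SELECT c.name,
       SUM(o.amount)              AS total,
       COALESCE(SUM(o.amount), 0) AS total_or_zero
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name
ORDER BY c.id
""").fetchall()
```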
Sums of measurements
(total wait time, cumulative sales)
Lesson 225CLT for Sums and Other Statistics
Support complex relationships
between different types of information (customers → orders → products)
Lesson 842What is a Database?
Supporting observations
the patterns, anomalies, or visualizations that sparked the hypothesis (e.
Lesson 1203Documenting Hypotheses and Evidence
Suppression
removes certain values entirely when they're too identifying—like removing ZIP codes for rural areas where few people live.
Lesson 1895Data Anonymization BasicsLesson 1896K-Anonymity
Surface plots
provide a solid, colored representation that emphasizes the overall shape and makes valleys and peaks immediately visible.
Lesson 13253D Surface and Wireframe Plots
Surrogate
30-day engagement score or feature adoption rate
Lesson 1517Surrogate Metrics: When Direct Measurement is Impractical
Surrogate keys
are artificial identifiers created solely for database purposes—typically auto-incrementing integers or UUIDs.
Lesson 1050Choosing Effective Primary Keys
Survey data
When respondents represent different population sizes
Lesson 43Weighted Mean and Its Applications
Survey response rates
(proportion who respond)
Lesson 184Beta Distribution: Bounded Between 0 and 1
Surveys and questionnaires
Directly asking people for information
Lesson 11Data Collection and Acquisition
Survival bias
only certain types complete treatment and remain observable
Lesson 1444Selection Bias and Treatment Assignment
Survival models
predict both *how long* a customer will remain active and *how much* they'll spend during that time.
Lesson 1668Predictive LTV Models
Swamping
A valid point gets falsely flagged because outliers distort the statistics
Lesson 1407The ESD Component
Symmetric distributions
(bell-shaped): Mean, median, and mode are roughly equal—use any, though mean is most common.
Lesson 42Comparing Mean, Median, and ModeLesson 220Sample Size Requirements for the CLTLesson 221CLT for Different Population Distributions
Symmetrically distributed data
without extreme outliers
Lesson 39The Mean (Arithmetic Average)
Symmetry around zero
well-specified models should show roughly symmetric deviance residuals
Lesson 701Deviance Residuals
System dependencies
(compilers, system libraries)
Lesson 2038What is Environment Management and Why It Matters
Systematic deviations
Non-normal distribution
Lesson 204Q-Q Plots: Theory and Interpretation
Systematic sampling
is efficient and easy.
Lesson 243Choosing the Right Sampling Method

T

T-Closeness
goes further: the distribution of sensitive attributes in each group must be **close to the overall distribution** in the dataset (within threshold T).
Lesson 1897L-Diversity and T-Closeness
Tab
key (to move forward), **Shift+Tab** (to move backward), **Enter** or **Space** (to activate), and **arrow keys** (for fine control).
Lesson 1253Interactive Accessibility: Keyboard Navigation
table
is like a spreadsheet in a database—it stores data in rows and columns.
Lesson 846Tables, Schemas, and Data TypesLesson 1117What is an ORM and Why Use It?
Table name qualification
means prefixing column names with their table name using dot notation:
Lesson 922Selecting Columns from Joined Tables
Table sizes
Joining smaller tables first reduces intermediate result sets
Lesson 951Join Order and Performance
tables
with rows and columns, making it easy to store large volumes of information efficiently and access it reliably.
Lesson 842What is a Database?Lesson 843Relational Database Concepts
Take-Home Projects
test end-to-end skills: EDA, feature engineering, modeling, and communication.
Lesson 2142Interviewing: Technical and Behavioral Prep
Target ROAS
Varies by industry and margins, but often 3-4+ for healthy profitability
Lesson 1751Return on Ad Spend (ROAS): Definition and CalculationLesson 1752Target ROAS and Break-Even Analysis
Target variable
Actual LTV (from mature cohorts where you've observed full lifecycles)
Lesson 1668Predictive LTV Models
Targeted interventions
Identify high-risk periods (e.
Lesson 838Subscription and Membership Duration Modeling
Task-level
"Spend 2 hours exploring correlations, then move on"
Lesson 2121Timeboxing and Deadlines
tasks
(individual units of work) and **operators** (templates for tasks like PythonOperator, BashOperator, or SQLOperator).
Lesson 1833Introduction to Apache AirflowLesson 1835Airflow Operators and Tasks
Tau-a
Simplest, doesn't adjust for ties (rarely used)
Lesson 491Handling Ties in Rank Correlations
Tau-b
Adjusts for ties in both variables (most common)
Lesson 491Handling Ties in Rank Correlations
Tau-c
Adjusts for table size in contingency tables
Lesson 491Handling Ties in Rank Correlations
Tax Reforms
When a city or state changes tax policy, neighboring regions serve as control groups.
Lesson 1459Real-World DiD Applications
Teaching and documentation
where the process matters as much as the result
Lesson 2074Notebooks vs Scripts: When to Use Each
Teaching and prototyping
Perfect for learning Bayesian concepts or quickly testing ideas
Lesson 1555Advantages and Limitations of Conjugate Priors
Team alignment
Give marketing, product, and leadership a shared view of what's working
Lesson 1718Introduction to Marketing AttributionLesson 1727Linear Attribution Model
Team capacity
Your colleagues who depend on your work are blocked
Lesson 2120The Opportunity Cost of Iteration
Team-Level Key Results
Each team then defines 3-5 measurable Key Results that directly influence the North Star.
Lesson 1608Connecting North Star Metrics to OKRs
Technical
What systems must the solution integrate with?
Lesson 2102Understanding Stakeholder Goals and Constraints
Technical → Business
When stakeholders ask "How accurate is the model?
Lesson 2105Translating Between Technical and Business Language
Technical attributes
Browser, operating system, connection speed
Lesson 1682Segmenting Funnels by User Attributes
Technical audiences
(data scientists, engineers, analysts) typically:
Lesson 1950Identifying Your Audience: Technical vs Non-Technical
Technical costs
landing pages, tracking infrastructure, A/B testing tools
Lesson 1753Customer Acquisition Cost (CAC): Components and Calculation
Technical deep-dives
for the data-savvy audience members
Lesson 1949Anticipating Questions: Building in Appendices
Technical friction
(slow loading, complex forms)
Lesson 1681Time-Based Funnel Analysis
Technical peers
, on the other hand, often need diagnostic depth: distributions, error bars, residual plots, correlation matrices.
Lesson 1954Tailoring Visualizations to Audience Needs
Technical peers/data scientists
Show your work.
Lesson 1953Adjusting Statistical Depth by Audience
Technical reviewers
can evaluate your conclusion before diving into methods
Lesson 1942The Pyramid Principle: Starting with the Conclusion
Temperature readings
If you're monitoring a freezer that must stay below 0°C, a reading of 5°C is an outlier *by definition*, even if an equipment malfunction has pulled the mean close to it.
Lesson 75Domain-Specific Outlier Rules
Templates are your foundation
Create standardized templates for data documentation that include:
Lesson 2068Data Provenance Best Practices
Temporal dependence
Values at time *t* depend on values at *t-1*, *t-2*, etc.
Lesson 704What Makes Time Series Data Different?
Temporality
The cause must come *before* the effect—this is the only non-negotiable criterion.
Lesson 498Bradford Hill Criteria for Causation
Tenure and LTV
High-LTV churners warrant more personalized, generous offers
Lesson 1676Win-Back and Retention Strategies
Terms below were extracted from bolded phrases in lesson content. Click a lesson reference to jump there.
Terms of Service
present another illusion.
Lesson 1914Consent in Digital Contexts
Terms of service respect
– If you're scraping a website or using an API, are you honoring the platform's rules?
Lesson 36Responsible Data Sourcing and Use
Test
Run controlled experiment until statistical significance
Lesson 1692Statistical Significance and Iteration
Test before replacing
When proposing a new branch, validate that it truly influences parent metrics before permanently adding it to the tree.
Lesson 1626Maintaining and Evolving Metric Trees
Test causality
Run experiments where you deliberately move the metric and observe effects.
Lesson 1615Correlation Without Causation
Test credentials
Try connecting with a database client tool (like `psql` or SQLite browser) using the same credentials
Lesson 1093Troubleshooting Connection Issues
Test duration
creates its own problems.
Lesson 1500Practical Considerations and Trade-offs
Test parallel trends visually
Pre-treatment coefficients should be near zero
Lesson 1457Multiple Time Periods and Staggered Adoption
Test queries safely
by previewing just a handful of rows
Lesson 877LIMIT: Restricting the Number of Rows Returned
Test restoration
by recreating the environment on a fresh machine
Lesson 1987Environment and Dependency Management
Test segments
scores should separate retained vs churned cohorts clearly
Lesson 1699Engagement Scoring Systems
Test sequentially
starting with the smallest p-value
Lesson 1504Holm-Bonferroni Method
Test set
Fresh data held back until the very end for a final, unbiased evaluation
Lesson 14Model Evaluation and Validation
Test significance
using likelihood ratio tests, Wald tests, or AIC/BIC comparisons—tools you've already learned.
Lesson 703Sequential Model Building Strategy
Test small first
Use `LIMIT` while developing queries to avoid long waits
Lesson 880Performance Considerations and Best Practices
Test with multiple people
One person's confusion might be unique; three people struggling with the same element reveals a design problem.
Lesson 1964Testing Visualizations with Audiences
Test your work
Use CVD simulation tools to preview your visualizations as colorblind viewers see them.
Lesson 1248Color Blindness and Color Palette Design
Testable hypothesis
"Customers in Segment A have an average purchase frequency at least 20% higher than Segment B customers.
Lesson 1200Formulating Specific, Testable Hypotheses
Testing and development
Engineers can work with realistic data without privacy concerns
Lesson 1901Synthetic Data Generation
Testing becomes impossible
You can't validate pipeline logic if each run changes the outcome
Lesson 1847What is Idempotency?
Testing Multiple Claims Simultaneously
Lesson 313Common Pitfalls in Hypothesis Formulation
Text columns
Find alphabetically first and last values (based on sorting order)
Lesson 885MIN and MAX: Finding Extremes
Text inputs
capture free-form text for searches or custom labels:
Lesson 1332Streamlit Widgets: Inputs and Controls
Text labels
add context anywhere on your plot.
Lesson 1271Adding Legends, Annotations, and Text
Text processing
Regular expressions, tokenization, or NLP on millions of documents
Lesson 1784Computation Complexity: Beyond Data Size
That's analysis
Recommending Policy A because "efficiency matters most" is **advocacy**—it injects your (or your organization's) values into the decision.
Lesson 1927Separating Analysis from Advocacy
Their branch's latest commit
The tip of the branch you're merging in
Lesson 2009Three-Way Merges
Then aggregate again
at a different level
Lesson 973Nested Subqueries in FROM
Then filter
those aggregates
Lesson 973Nested Subqueries in FROM
Then join
the clean, preprocessed results
Lesson 994CTEs for Simplifying Complex Joins
Theoretical quantiles
(what we'd expect from a perfect normal distribution) on the x-axis
Lesson 565What Q-Q Plots Show: Comparing Residual Distribution to Normal
Theory
Does domain knowledge suggest this predictor matters?
Lesson 625Practical Workflow: Testing and Interpreting Predictors
They all still apply
when you add more predictors.
Lesson 601Assumptions for Multiple Linear Regression
They generate moments
Taking derivatives at t=0 gives you the "raw moments" of the distribution.
Lesson 150Moment Generating Functions
They penalize complexity
Adding unnecessary parameters increases the score
Lesson 781Information Criteria: AIC and BIC
They simplify algebra
MGFs make it easier to prove properties about sums of independent random variables (like that the sum of independent Poisson variables is also Poisson).
Lesson 150Moment Generating Functions
They uniquely identify distributions
If two random variables have the same MGF (finite in a neighborhood of t = 0), they have the same probability distribution.
Lesson 150Moment Generating Functions
They're correlated
and run against large tables (thousands of executions)
Lesson 966Performance Considerations for WHERE Subqueries
They're reasonable
The conjugate family genuinely captures your prior knowledge
Lesson 1556Choosing Between Conjugate and Non-Conjugate Priors
Thin tails
No extreme outliers pulling away
Lesson 377Testing Normality: Visual Methods
Think about costs
of acting on this information
Lesson 609Practical vs Statistical Significance
Think of it as
Knocking on someone's front door and asking politely for information they're willing to share.
Lesson 21APIs and Web Scraping
Thinning
means keeping only every *k*th sample (e.
Lesson 1592Burn-in, Thinning, and Convergence Diagnostics
Third batch arrives
Use Beta(17, 25) as prior → and so on.
Lesson 1563Sequential Updating with New Data
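Sequential updating with a Beta prior and binomial data amounts to adding counts: each posterior becomes the prior for the next batch. In this sketch the starting prior Beta(1, 1) and the batch counts are hypothetical, chosen only so that the running belief arrives at the Beta(17, 25) mentioned above.

```python
def update_beta(prior, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial data gives Beta(a + s, b + f)."""
    a, b = prior
    return (a + successes, b + failures)


belief = (1, 1)  # assumed uniform starting prior
history = [belief]
for successes, failures in [(6, 9), (10, 15)]:  # hypothetical batches
    belief = update_beta(belief, successes, failures)
    history.append(belief)

posterior_mean = belief[0] / (belief[0] + belief[1])
```

Processing the batches one at a time gives the same posterior as pooling all the data into a single update, which is what makes this kind of updating convenient.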
Third evidence (alibi confirmed)
Use 85% as the new prior → posterior drops to 30%.
Lesson 114Sequential Updating
Third Normal Form (3NF)
eliminates *transitive dependencies*, where a non-key attribute depends on another non-key attribute, which in turn depends on the primary key.
Lesson 1066Third Normal Form (3NF)
Third-party providers
Companies that sell or license data
Lesson 11Data Collection and Acquisition
This is your default
When in doubt, use two-tailed—it's more conservative and widely accepted.
Lesson 350Choosing the Right Tail Configuration
This uncertainty matters
When we estimate σ from a small sample, our confidence interval needs to be *wider* to account for the extra uncertainty.
Lesson 268Critical Values and the t-Distribution
Thompson Sampling
directly sample from posterior distributions to make allocation decisions—a natural Bayesian approach.
Lesson 1586Multi-Armed Bandit Connections
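A minimal sketch of one Thompson Sampling decision, assuming each arm has a Beta posterior over its conversion rate. The arm names and posterior parameters are hypothetical.

```python
import random


def thompson_choose(posteriors, rng):
    """posteriors maps arm name -> (alpha, beta) of that arm's Beta posterior."""
    # Draw one plausible conversion rate per arm, then play the best draw.
    draws = {arm: rng.betavariate(a, b) for arm, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)


rng = random.Random(42)
posteriors = {"A": (90, 10), "B": (10, 90)}  # hypothetical: A looks far better
picks = [thompson_choose(posteriors, rng) for _ in range(200)]
```

When posteriors overlap more than in this example, the weaker arm still wins some draws, which is how the method keeps exploring while mostly exploiting.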
Threaded Scheduler
(default for single machine)
Lesson 1795Distributed Schedulers and Client Setup
Three columns
with 100 values each → up to 1,000,000 potential groups
Lesson 911Performance Considerations with Multiple Groups
Threshold adjustment
Use different decision thresholds for different groups to equalize outcomes.
Lesson 1894Auditing and Remediation Strategies
Threshold effects
Variables behave differently above/below a certain value
Lesson 1189Detecting Nonlinear Relationships
Tick Marks and Labels
Customize where tick marks appear and what they say using `set_xticks()` and `set_xticklabels()`.
Lesson 1270Customizing Axes: Labels, Limits, and Scales
Tick marks or crosses
Often indicate censored observations
Lesson 815Survival Curve Plots and Interpretation
Tidy data
is a standardized way of organizing datasets that follows three simple rules:
Lesson 1142What is Tidy Data?
Time and resource limits
"We need an answer in two weeks, even if it's rough"
Lesson 2117Defining 'Good Enough' with Stakeholders
Time for Spark
When datasets exceed available RAM or when processing takes hours instead of minutes
Lesson 1783Data Size Thresholds: When Pandas Isn't Enough
Time intervals
span durations: "January 2024 to March 2024" or "Q1 2023"
Lesson 19Temporal Data and Time Series
Time investment explodes
Simple features took hours; the next marginal improvement requires days of engineering
Lesson 2116Diminishing Returns and the 80/20 Rule
Time limitations
Do you have days or months?
Lesson 1169Clarifying Assumptions and Constraints
time origin
is your starting line—the moment when the clock begins for each subject.
Lesson 803Defining the Event and Time OriginLesson 835Customer Churn Prediction with Survival Analysis
Time periods
Sales in months with different numbers of days
Lesson 692Offset Terms for Exposure
Time plot of residuals
Should look randomly scattered around zero with constant variance
Lesson 799Fitting and Diagnosing SARIMA Models
Time series comparisons
multiple metrics over the same time period
Lesson 1276Sharing Axes Between Subplots
Time Series Data
Measurements taken over time (stock prices, daily temperatures) often show autocorrelation: today's value relates to yesterday's value.
Lesson 381Independence Assumption and Its ViolationsLesson 548Independence of Observations
Time series plot
Should show constant mean and variance over time
Lesson 741Testing Stationarity After Transformation
Time since churn
Fresh churners respond better than those gone 6+ months
Lesson 1676Win-Back and Retention Strategies
Time windows
set boundaries—how far back you look for attributable touchpoints.
Lesson 1639Time Windows and Attribution Decay
Time-based rules
Sales of winter coats in July might look like outliers, but they could be legitimate clearance sales or southern hemisphere orders.
Lesson 75Domain-Specific Outlier Rules
Time-based variations
Weekly, monthly, quarterly reports
Lesson 1984Parameterized Reports
Time-bound
Set a clear horizon (quarterly, annually) so urgency is built in
Lesson 1609Setting Effective ObjectivesLesson 1610Defining Measurable Key Results
Time-to-conversion
analysis models the journey from first contact (lead acquisition) to purchase, treating non-converters as **censored observations**: they didn't experience the "event" (conversion) during your observation window.
Lesson 839Time-to-Conversion in Marketing Funnels
Time-to-match
How long until a buyer finds a seller
Lesson 1630Marketplace Metrics: GMV, Take Rate, and Liquidity
Time-varying covariates
allow your survival model to reflect these dynamic changes.
Lesson 833Time-Varying Covariates
Timebox tasks
2 days EDA, 2 days feature prep, 1 day modeling
Lesson 2113Timeboxing and Sprint Planning for Data Projects
Timeboxing
means allocating a fixed duration—say, three days for EDA or one week for initial modeling—and forcing yourself to produce *something* deliverable when time runs out, even if it's imperfect.
Lesson 2113Timeboxing and Sprint Planning for Data ProjectsLesson 2121Timeboxing and Deadlines
Timely insights
Market conditions change; delays reduce relevance
Lesson 2120The Opportunity Cost of Iteration
Timeout Errors
occur when connections take too long to establish or queries run longer than allowed.
Lesson 1093Troubleshooting Connection Issues
Timestamps and Version Fields
Add `created_at` and `updated_at` timestamps to your data.
Lesson 1848Designing Idempotent Operations
Too few bins
You lose detail and may miss important patterns
Lesson 1267Histograms and Distribution Plots
Too few examples
Training a neural network with 50 samples?
Lesson 2124Insufficient or Low-Quality Data
Too large
Each partition takes a long time to process, limiting parallelism.
Lesson 1794Working with Partitions
Too many bins
The plot becomes noisy and hard to interpret
Lesson 1267Histograms and Distribution Plots
Too narrow
Creates noisy, overfit patterns from random variation
Lesson 1245Misleading Aggregations and Binning
Too noisy
Revenue varies wildly day-to-day, drowning out true effects
Lesson 1517Surrogate Metrics: When Direct Measurement is Impractical
Too rare
Conversions on high-ticket items are infrequent
Lesson 1517Surrogate Metrics: When Direct Measurement is Impractical
Too wide
Bins like "0-100" collapse all variation
Lesson 1245Misleading Aggregations and Binning
Top-of-funnel optimization
Where should you invest to grow your audience?
Lesson 1720First-Touch Attribution Model
Total downloads
(without usage or monetization)
Lesson 1612What Are Vanity Metrics?
Total registered users
(without knowing active users or retention)
Lesson 1612What Are Vanity Metrics?
Total Revenue
The sum of all money earned
Lesson 1516Business Metrics: Definition and Examples
Trace plots
Visualize the chain over iterations—it should look like random noise around a stable mean, not trending or stuck
Lesson 1592Burn-in, Thinning, and Convergence Diagnostics
Track improvements
Overlay cohorts from before and after a product change to see if retention improved.
Lesson 1656Visualizing Retention Curves
Track randomization seed
always save the random seed used for reproducibility
Lesson 1492Rerandomization and Practical Implementation
Track step repetition frequency
– identify which steps users commonly revisit
Lesson 1683Multi-Path and Non-Linear Funnels
Tracking Pixels
are tiny, invisible images embedded in emails or third-party sites.
Lesson 1713Tracking Users by Channel
Trade-off
Slightly lower power (5-15% efficiency loss if data *were* normal), and results describe distributions or medians, not means.
Lesson 475Choosing Between Parametric and Non-Parametric TestsLesson 1767Scale-Up vs Scale-Out Architectures
Tradeoffs
Choosing "good enough" over perfection
Lesson 2142Interviewing: Technical and Behavioral Prep
Traditional methods
work beautifully when:
Lesson 305When to Use Bootstrap vs Traditional Methods
Traffic source
Organic search, paid ads, social media, email, direct
Lesson 1682Segmenting Funnels by User Attributes
Trailing moving averages
(also called "backward-looking") use only past data points.
Lesson 753Centered vs Trailing Moving Averages
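A small sketch of a trailing moving average: each smoothed value uses only the current point and the points before it. The function name and series are illustrative, and positions without a full window are left as `None` here, which is one of several reasonable conventions.

```python
def trailing_moving_average(series, window):
    """Average of the current value and the (window - 1) values before it."""
    smoothed = []
    for t in range(len(series)):
        if t + 1 < window:
            smoothed.append(None)  # not enough history yet
        else:
            chunk = series[t - window + 1 : t + 1]
            smoothed.append(sum(chunk) / window)
    return smoothed


# Hypothetical series.
smoothed = trailing_moving_average([10, 12, 14, 16, 18], window=3)
```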
Train on the rest
Build your ARIMA, Holt-Winters, or other model using only the training portion
Lesson 790Out-of-Sample Forecast Evaluation
Training set
Data the model learns from
Lesson 14Model Evaluation and Validation
Transform
it on cheaper servers or ETL tools (like Informatica or DataStage)
Lesson 1817Historical Context: Why ETL Came First
Transform back
to the correlation scale using the inverse transformation
Lesson 503Confidence Intervals for Correlation Coefficients
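A sketch of the full transform-and-back-transform procedure, assuming the transformation in question is Fisher's z (the inverse hyperbolic tangent) with standard error 1 / sqrt(n - 3). The function name and the example values are illustrative.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a Pearson correlation via Fisher's z."""
    z = math.atanh(r)  # transform to the z scale
    se = 1 / math.sqrt(n - 3)  # approximate standard error on that scale
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower_z, upper_z = z - z_crit * se, z + z_crit * se
    # Transform back to the correlation scale with the inverse (tanh).
    return math.tanh(lower_z), math.tanh(upper_z)


# Hypothetical sample: r = 0.5 from n = 28 pairs.
lower, upper = correlation_ci(0.5, 28)
```

The resulting interval is not symmetric around r, because the back-transformation compresses values near ±1.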
Transform within the warehouse
using SQL-based tools like **dbt** (data build tool)
Lesson 1821Hybrid Approaches and Modern Data Stacks
Transform your data
to reflect the null hypothesis being true (e.
Lesson 396Bootstrap Hypothesis Testing
Transformation History
What cleaning or calculations were applied
Lesson 1163Metadata and Data Dictionaries
Transformation layers
like **dbt** that version-control SQL transformations, run tests, and document data models
Lesson 1821Hybrid Approaches and Modern Data Stacks
Transformation logic
Clean, join, aggregate, or enrich data (the "T" in ETL/ELT)
Lesson 1822What is a Data Pipeline?
Transformations are simpler
Operations like filtering, grouping, and summarizing follow predictable patterns
Lesson 1142What is Tidy Data?
Transformed coordinates
apply mathematical transformations to the entire space
Lesson 1344Scales and Coordinate Systems
Transforming
using SQL queries within the warehouse itself
Lesson 1816What is ELT? Extract, Load, Transform Explained
Transforming features
to meet model assumptions or improve performance: scaling numerical features, encoding categorical variables, handling skewed distributions, or creating polynomial terms.
Lesson 2088Stage 4: Feature Engineering and Preparation
Transient
Implement exponential backoff, retry 3-5 times
Lesson 1849Transient vs Permanent Failures
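One way to sketch the retry policy above. The helper `retry_with_backoff`, the doubling schedule starting at one second, and the choice to treat `ConnectionError` and `TimeoutError` as transient are all assumptions made for illustration; a real pipeline would pick these to match its own failure modes.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       transient=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call operation(); on a transient error wait, then try again.

    The wait doubles after each failure: base_delay, 2x, 4x, ...
    The last error is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except transient:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network glitch")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Errors outside the `transient` tuple propagate immediately, which keeps permanent failures from being retried.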
Transitive
`A → B` and `B → C`, so `A → C`
Lesson 1063Functional Dependencies
Transitive dependencies
are the hidden culprit: Package A depends on Package B version 2, but Package C needs Package B version 3.
Lesson 2048The Dependency Hell Problem
Transparency (alpha)
prevents overplotting in dense datasets.
Lesson 1265Scatter Plots: Relationships Between Variables
Transparency/alpha
Let overlapping points blend, showing density through darker areas
Lesson 1310Point Maps and Scatter Plots on Maps
Transportation
Optimizing delivery routes or predicting traffic patterns
Lesson 6Common Data Science Applications
Treated
is a binary indicator (1 if unit is in treatment group, 0 if control)
Lesson 1455DiD with Regression
Treated × Post
is the **interaction term** between the two indicators
Lesson 1455DiD with Regression
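In the simplest two-group, two-period setup, the coefficient on Treated × Post equals the difference in differences of the four group means. The sketch below computes that quantity directly; the function name `did_estimate` and the outcome values are made up for illustration.

```python
def did_estimate(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences from four group means.

    (change in treated group) - (change in control group)
    """
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical average outcomes per group and period.
effect = did_estimate(
    treated_before=100.0,
    treated_after=130.0,
    control_before=90.0,
    control_after=105.0,
)
print(effect)  # 15.0
```

Here the treated group rose by 30 and the control group by 15, so 15 is the part of the change attributed to the treatment under the usual parallel-trends assumption.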
treatment
(version B—a new feature, design, or intervention), while the other receives the **control** (version A—the current state or baseline).
Lesson 1477Core Principles of A/B TestingLesson 1482Control and Treatment Design
Treatment Effect Estimation
calculates the difference in average outcomes between those who received the treatment and those who didn't.
Lesson 1440Treatment Effect Estimation
Treatment group, after intervention
Lesson 1452The Difference-in-Differences Setup
Treatment group, before intervention
(baseline)
Lesson 1452The Difference-in-Differences Setup
Treatment Type
(Drug A vs Drug B) and **Gender** (Male vs Female) on recovery time.
Lesson 653Interpreting Categorical × Categorical Interactions
Tree-based models
(decision trees, random forests): These algorithms don't use the same linear framework as regression and can handle all k variables without issues
Lesson 638One-Hot Encoding Overview
Trend or pattern
"Sales increased steadily from January to December"
Lesson 1250Text Alternatives and Screen Reader Compatibility
Trends in rolling stats
= non-stationary (needs fixing!
Lesson 715Visual Tests for Stationarity
Triggers
allow one pipeline to programmatically start another pipeline upon completion.
Lesson 1845Cross-Pipeline Dependencies
Trimming whitespace
removes leading and trailing spaces that creep in from manual data entry or faulty exports.
Lesson 1138Cleaning and Standardizing Text Fields
Tritanopia
(blue-yellow, rare): difficulty with blue and yellow
Lesson 1248Color Blindness and Color Palette Design
Trivial
`StudentID → StudentID` (always true, not useful)
Lesson 1063Functional Dependencies
Troubleshoot failures
If a downstream task fails, check its upstream dependencies first
Lesson 1841Upstream and Downstream Dependencies
True Positives (TP)
Correctly identified change-points
Lesson 1418Evaluating Change-Point Detection Methods
Truncate trends
End the chart before a reversal occurs
Lesson 1241Cherry-Picking Time Ranges
Trust erosion
with stakeholders when they catch problems before you do
Lesson 2136Monitoring Gaps and Silent Failures
Trustworthiness
Is this data from a reliable source?
Lesson 23Data Provenance and Metadata
Try common encodings explicitly
UTF-8 (most modern), Latin-1 (ISO-8859-1, Western European), or CP1252 (Windows)
Lesson 1135Detecting and Fixing Encoding Issues
Try d=1
If non-stationary, apply first-order differencing (subtracting the previous value from each value).
Lesson 778Determining Differencing Order (d)
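First-order differencing replaces each value with its change from the previous value, so the result is one element shorter than the input. A minimal sketch, with `difference` as an illustrative helper name:

```python
def difference(series):
    """Return the first-order differences y[t] - y[t-1]."""
    return [current - previous for previous, current in zip(series, series[1:])]


trend = [100, 103, 106, 109, 112]  # steady upward trend
print(difference(trend))  # [3, 3, 3, 3]
```

A linear trend becomes a constant series after one round of differencing, which is why d=1 is the usual first thing to try.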
Try different combinations
of alpha, beta, and gamma values (typically between 0 and 1)
Lesson 772Holt-Winters Parameter Optimization
Try multiple reasonable priors
Use informative, weakly informative, and uninformative priors for the same problem
Lesson 1572Sensitivity Analysis and Prior Robustness
Tukey's fences
use the IQR to build "boundary lines" beyond which data points are considered outliers.
Lesson 72IQR Method and Tukey's Fences
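A short sketch of the fences, assuming the conventional multiplier of 1.5 and the "inclusive" quartile method from Python's `statistics` module. Other quartile conventions give slightly different fences, and the helper name `tukey_fences` is illustrative.

```python
import statistics


def tukey_fences(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
low, high = tukey_fences(data)
outliers = [x for x in data if x < low or x > high]
print(low, high, outliers)  # -3.5 14.5 [50]
```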
TV(t), Radio(t), Digital(t)
are your marketing spend amounts in each channel at time *t*
Lesson 1738The Core MMM Regression Model
Two columns
with 100 values each → up to 10,000 potential groups
Lesson 911Performance Considerations with Multiple Groups
Two numerical variables
Does house size relate to price?
Lesson 1181What is Bivariate Analysis?
Two-sided (two-tailed) test
This tests whether the *most extreme value* — either the maximum OR minimum — is an outlier.
Lesson 1393Two-Sided vs One-Sided Grubbs' Test
two-sided test
, you calculate the probability in *both* tails (values as extreme or more extreme in either direction).
Lesson 319Calculating P-Values from Test StatisticsLesson 325The Rejection Region
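For a z statistic, the two-tailed calculation can be sketched with the standard normal CDF from the standard library. The function name `two_sided_p_value` is illustrative, and the sketch assumes the test statistic follows a standard normal distribution under the null hypothesis.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a value at least as extreme as z in either tail."""
    upper_tail = 1.0 - NormalDist().cdf(abs(z))
    return 2.0 * upper_tail  # the normal curve is symmetric


p = two_sided_p_value(1.96)
print(round(p, 3))  # approximately 0.05
```

Taking `abs(z)` first means a statistic of -1.96 gives the same p-value as +1.96.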
Type 1 (Overwrite)
Replace the old value with the new one.
Lesson 1809Dimension Tables and Slowly Changing Dimensions
Type 3 (Add Column)
Store both current and previous values in separate columns (e.
Lesson 1809Dimension Tables and Slowly Changing Dimensions
Type I Error (α)
appears as the shaded area *under the null curve* that falls into the rejection region.
Lesson 336Visualizing Error Types with Sampling Distributions
Type II Error (β)
appears as the shaded area *under the alternative curve* that falls *outside* the rejection region (where you fail to reject H₀).
Lesson 336Visualizing Error Types with Sampling Distributions
Type of phone
(iPhone vs Android) can correlate with socioeconomic status
Lesson 1883Protected Classes and Proxy Variables
Type safety
Your IDE can catch errors before runtime
Lesson 1117What is an ORM and Why Use It?
Types of contributions welcome
Documentation fixes?
Lesson 2083Contributing Guidelines and Contact Information

U

Uber
Rides completed — directly measures successful matching of drivers and riders.
Lesson 1606Examples of North Star Metrics by Industry
Unbounded above
Theoretically no maximum limit (though rare events in practice)
Lesson 689When to Use Poisson Regression
UNBOUNDED FOLLOWING
End at the very last row of the partition
Lesson 1020UNBOUNDED and CURRENT ROW Keywords
UNBOUNDED PRECEDING
Start at the very first row of the partition
Lesson 1020UNBOUNDED and CURRENT ROW Keywords
Unbounded Retention
(also called "Return on or After Day N") measures the percentage of users who come back *any time on or after* Day N.
Lesson 1654Classic vs Unbounded Retention
Uncertainty is present
The relationship between surrogate and business metric is unproven
Lesson 1522Balancing Speed and Accuracy in Metric Selection
Under-controlling
Ignoring confounders because they seem unimportant or weren't measured.
Lesson 1476Common DAG Patterns and Pitfalls
Under-investing
in channels with high incremental value but lower raw volume
Lesson 1717Incrementality and True Channel Impact
Undercoverage
Your sampling frame (the list you sample from) doesn't include part of the population.
Lesson 244Selection Bias and Its CausesLesson 249Coverage Error and Undercoverage
Undermining trust
Stakeholders may feel manipulated rather than informed
Lesson 1927Separating Analysis from Advocacy
Understand
where your model succeeds and fails
Lesson 542Computing Fitted Values and Residuals
Understand complexity
Most conversions aren't one-click decisions; they involve multiple channels and interactions
Lesson 1719The Customer Journey and Touchpoints
Understand conditional relationships
When relationships hold under specific circumstances
Lesson 1190Introduction to Multivariate Analysis
Understand decision-maker constraints
Your stakeholder might need results before quarterly board meetings, end-of-month planning sessions, or annual budget reviews.
Lesson 2099Aligning with Business Timelines and Decision Points
Understand structural changes
in your domain (markets expanding, behaviors shifting)
Lesson 706Trend: Long-Term Direction
Understanding cardinality
Join tables that produce smaller results first when possible
Lesson 951Join Order and Performance
Understanding patterns
Knowing that average temperature is 70°F doesn't tell you if you need both winter coats and shorts
Lesson 46What is Variability?
Understanding the business context
What decision will this analysis inform?
Lesson 2085Stage 1: Problem Definition and Scoping
Understanding the real world
Shape reveals the story behind your numbers.
Lesson 63Understanding Distribution Shape
Undirected graphs
show symmetrical relationships.
Lesson 1316Introduction to Network Graphs and Graph Theory Basics
Unexpected duplicates
The same transaction or observation recorded multiple times
Lesson 1154Uniqueness and Duplication Checks
Unexpected paths
(tasks that shouldn't depend on each other)
Lesson 1846Testing and Validating Dependency Graphs
Unexpected Patterns
Look for broken correlations (height and weight usually relate; if they suddenly don't, check your data), unusual counts (suddenly 200 records instead of the usual 50), or rare category values appearing too frequently.
Lesson 1157Statistical Anomaly Detection in QA
Unicode
is the universal character encoding standard that assigns a unique number to every character across all writing systems.
Lesson 1139Dealing with Special Characters and Unicode
uninformative prior
that assigns equal probability across all plausible values.
Lesson 1543Defining Prior DistributionsLesson 1581Setting Priors for A/B Tests
Unique identifier validation
Verify that ID columns contain no duplicates.
Lesson 1154Uniqueness and Duplication Checks
Uniqueness
Each value in the primary key column must be unique across the entire table.
Lesson 1048What Are Primary Keys?Lesson 1863Data Quality DimensionsLesson 1865Data Quality Checks in Pipelines
Unit tests for dependencies
Write tests that assert specific relationships exist.
Lesson 1846Testing and Validating Dependency Graphs
Uniting columns
is the reverse: combining multiple columns into one when they represent a single logical unit.
Lesson 1147Separating and Uniting Columns
Units
Currency (USD), measurements (kg, meters), percentages
Lesson 1163Metadata and Data DictionariesLesson 2064Creating Data Dictionaries
Unnatural constraints
Sometimes the conjugate form doesn't match your actual prior knowledge
Lesson 1555Advantages and Limitations of Conjugate Priors
Unnecessary legends
Label directly when possible
Lesson 1237Chart Junk and Data-Ink Ratio
Unpooled variance
treats each group's variance as unique.
Lesson 285Pooled vs Unpooled Variance Approaches
Unreliable forecasts
Predictions become meaningless outside your training period
Lesson 734Why Differencing and Detrending Matter
Unreliable predictions
Since the underlying process is changing, our model's parameters, estimated from past data, won't accurately describe future behavior.
Lesson 713Why Stationarity Matters
Unrepresentative samples
If your data doesn't reflect the real-world distribution, predictions will fail in production.
Lesson 2124Insufficient or Low-Quality Data
Unresolved issues
Tickets closed without satisfaction
Lesson 1673Leading Indicators of Churn
Unstable Coefficient Estimates
Lesson 581Symptoms of Multicollinearity
Unstable coefficients
Small changes in your data can lead to large swings in the estimated regression coefficients
Lesson 580What is Multicollinearity?
Unstructured data
doesn't fit neatly into tables.
Lesson 16Structured vs Unstructured Data
Untracked data sources
Multiple teams pull from the same database table, but nobody coordinates when structure or semantics change.
Lesson 2133Undocumented Data Dependencies
Untracked files
Files Git doesn't know about yet (never staged or committed).
Lesson 1997Viewing Repository State with git statusLesson 1998Checking Repository Status
Unused indexes
consume storage and slow down writes (INSERT, UPDATE, DELETE) without providing query benefits.
Lesson 1086Index Maintenance and Monitoring
Update
Apply Bayes' theorem to compute the posterior using that data
Lesson 1582Updating Beliefs with Test Data
UPDATE protection
You cannot change a foreign key to point to a non-existent parent
Lesson 1052Foreign Key Constraints
Update with data
from each group separately to get two posterior distributions: one for μ₁ and one for μ₂
Lesson 1570Comparing Two Means: Bayesian Approach
Updated beliefs
Compare the posterior to your prior.
Lesson 1547Interpreting Posterior Distributions
Updates belief
Strong data can overcome weak priors; strong priors resist contradictory weak data
Lesson 1537The Posterior Distribution
Updates segment membership
as customer behavior evolves
Lesson 1710Operationalizing Segments: Scoring and Deployment
Upper bound only
Mean + t*(SE)
Lesson 275One-Sided Confidence Bounds
Upper threshold (B)
Based on acceptable Type I error (α, false positive rate)
Lesson 1511Sequential Probability Ratio Test (SPRT)
Upserts (Update or Insert)
Instead of blindly inserting records, use operations that update existing records if they're already present.
Lesson 1848Designing Idempotent Operations
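A minimal sketch of an upsert using SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` syntax (available in SQLite 3.24 and later). The `customers` table, its columns, and the helper `upsert_customer` are hypothetical; other databases spell this operation differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")


def upsert_customer(conn, customer_id, email):
    """Insert the row, or update the email if the id already exists."""
    conn.execute(
        """
        INSERT INTO customers (id, email) VALUES (?, ?)
        ON CONFLICT(id) DO UPDATE SET email = excluded.email
        """,
        (customer_id, email),
    )


# Running the same load twice leaves exactly one row (idempotent).
for _ in range(2):
    upsert_customer(conn, 1, "ada@example.com")

# A later run with a changed value updates the row in place.
upsert_customer(conn, 1, "ada@newmail.example")

rows = conn.execute("SELECT id, email FROM customers").fetchall()
print(rows)  # [(1, 'ada@newmail.example')]
```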
Upstream
`clean_data` and `extract_raw_data` (direct and transitive)
Lesson 1841Upstream and Downstream Dependencies
Upstream dependencies
are the tasks that must run *before* your current task.
Lesson 1841Upstream and Downstream Dependencies
Upward (positive) trend
Values generally increase over time (e.
Lesson 706Trend: Long-Term Direction
Upward or downward slope
Warning—variance is changing systematically as fitted values increase
Lesson 560Scale-Location Plot (Spread-Location Plot)
Usage
How to run scripts, notebooks, or generate reports
Lesson 2077The Purpose and Anatomy of a Good README
Use ±2
when missing real anomalies is costly (e.
Lesson 1378Setting Z-Score Thresholds
Use ±3
when false positives are costly (e.
Lesson 1378Setting Z-Score Thresholds
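The two thresholds trade missed anomalies against false alarms. The sketch below flags values whose absolute z-score exceeds a chosen cutoff; the helper `flag_by_z_score`, the use of the population standard deviation, and the sample readings are assumptions for illustration.

```python
import statistics


def flag_by_z_score(data, threshold):
    """Return the values whose |z-score| exceeds the threshold."""
    mean = statistics.mean(data)
    spread = statistics.pstdev(data)  # population standard deviation
    return [x for x in data if abs((x - mean) / spread) > threshold]


readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 25]
print(flag_by_z_score(readings, threshold=2))  # [25]
print(flag_by_z_score(readings, threshold=3))  # []
```

The value 25 sits a little under three standard deviations from the mean, so the looser ±2 rule catches it while the stricter ±3 rule lets it pass.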
Use a random mechanism
to select your sample (random number generator, lottery-style draw)
Lesson 234Simple Random Sampling
Use accessible uncertainty language
.
Lesson 1928Communicating Uncertainty Honestly
Use active voice
"We tested three models" beats "Three models were tested"
Lesson 1967Writing Clear and Concise Analysis Sections
Use binomial logic
Under H₀, positive and negative signs are equally likely (p = 0.
Lesson 391The Sign Test for Medians
Use blocking first
rerandomization works best *after* applying stratification—it fine-tunes balance within strata
Lesson 1492Rerandomization and Practical Implementation
Use CASE when
You need inline conditional logic for 3-10 possible outcomes within a query.
Lesson 1037CASE Best Practices and Performance
Use charset detection libraries
that analyze byte patterns to suggest likely encodings
Lesson 1135Detecting and Fixing Encoding Issues
Use colorblind-friendly palettes
Tools like ColorBrewer, Viridis, and palette simulators help you test combinations.
Lesson 1248Color Blindness and Color Palette Design
Use concrete examples
Instead of explaining regularization abstractly, say "prevents the model from memorizing noise in the training data"
Lesson 2105Translating Between Technical and Business Language
Use concrete units
Always include what you're measuring ("dollars," "pounds," "hours")
Lesson 530Communicating Results to Non-Technical Audiences
Use configuration files
Create a `config.
Lesson 2070Separating Data from Code
Use consistent formatting
APA, IEEE, or your organization's standard
Lesson 1972Citations and References in Data Science Reports
Use explicit JOIN syntax
with `ON` clauses instead of comma-separated table lists
Lesson 955Avoiding Cartesian Products
Use Fisher's Exact Test
as an alternative (for 2×2 tables)
Lesson 426Assumptions and Sample Size Requirements
Use multiple channels strategically
Lesson 2104Communication Cadence and Updates
Use OO interface
for production code, complex layouts, multiple subplots, or when functions need to accept specific axes to plot on
Lesson 1256Two Interfaces: pyplot vs Object-Oriented
Use plain language
Avoid jargon like "feature importance" or "p-values.
Lesson 1944Executive Summary Best Practices
Use pyplot
for quick exploratory visualizations and simple single plots
Lesson 1256Two Interfaces: pyplot vs Object-Oriented
Use rank tests
when: data are skewed, outliers are present, samples are too small to verify normality, or you care about distribution shifts beyond just means
Lesson 397Power and Efficiency of Non-Parametric Tests
Use relative paths
`data/raw/sales.
Lesson 2070Separating Data from Code
Use sequential testing methods
specifically designed for interim analysis (like Group Sequential Testing or Always-Valid Inference from earlier lessons)
Lesson 1523Peeking at Results Early
Use stratified sampling
When you know certain groups are underrepresented, deliberately sample more from those groups to balance things out.
Lesson 250Strategies for Bias Detection and Mitigation
Use t
if you must *estimate* σ from your sample (using sample standard deviation s) — almost always the case
Lesson 272When to Use Z vs t
Use t-tests
when: data are approximately normal, sample sizes are moderate, and you want maximum power from clean data
Lesson 397Power and Efficiency of Non-Parametric Tests
Use table aliases carefully
ensure your `ON` clause references columns from *both* tables, not just one
Lesson 955Avoiding Cartesian Products
Use the appropriate test
for your data structure
Lesson 368Common Pitfalls and Best Practices
Use the bootstrap distribution
to build a confidence interval (percentile method, BCa, etc.
Lesson 306Bootstrap for Non-Standard Problems
Use the CDF
to find the area beyond your test statistic
Lesson 319Calculating P-Values from Test Statistics
Use WHERE subqueries
When filtering data, subqueries in WHERE typically outperform SELECT subqueries
Lesson 969Performance Considerations for SELECT Subqueries
Use z
if you *know* the population standard deviation (σ) — rare in real life
Lesson 272When to Use Z vs t
User behavior shifts
People interact with systems differently over time
Lesson 15Deployment, Monitoring, and Iteration
User confusion
about what to do next
Lesson 1681Time-Based Funnel Analysis
User demographics
Age group, gender, language preference
Lesson 1682Segmenting Funnels by User Attributes
User experience consistency
Randomizing by session means the same user might see different versions on different visits, creating confusion.
Lesson 1481Unit of Randomization
User identifiers
to stitch touchpoints together into coherent journeys
Lesson 1719The Customer Journey and Touchpoints
User input matters
(filtering by date range, region, or product category)
Lesson 1330Introduction to Interactive Dashboards
User support
Answering questions about metrics and functionality
Lesson 1979Maintenance and Sustainability Considerations
User/Customer
Each individual person gets one experience
Lesson 1481Unit of Randomization
Uses the one-sample t-test
on the differences (simpler than two-sample methods)
Lesson 370Differences as the Unit of Analysis
USING
only works when column names match exactly
Lesson 953Join Conditions: ON vs USING
Using linear regression
where residuals should be approximately normal
Lesson 202Why Test for Normality?
Using specific columns
Select only needed columns to reduce memory overhead
Lesson 951Join Order and Performance
UTC (Coordinated Universal Time)
is the universal baseline—think of it as the "source of truth" for time.
Lesson 1042Working with Timestamps and Time Zones
UTM Parameters
are tags appended to URLs that capture campaign details.
Lesson 1713Tracking Users by Channel

V

Vague observation
"Customer behavior looks different between segments.
Lesson 1200Formulating Specific, Testable Hypotheses
Valid Range
Min/max values, allowed categories, or regex patterns
Lesson 1163Metadata and Data Dictionaries
validate
that your leading indicator actually predicts the outcome you care about.
Lesson 1603Common Pitfalls in Indicator SelectionLesson 1692Statistical Significance and Iteration
Validate assumptions early
Does your preliminary analysis match stakeholder intuition?
Lesson 2111Fast Feedback Loops with Stakeholders
Validate with cross-validation
Ensure the model generalizes to unseen data
Lesson 633Practical Model Selection Strategy
Validate with stakeholders
Product, engineering, and analytics teams must agree on definitions.
Lesson 1679Defining Funnel Steps and Events
Validated
against incoming data batches in your pipeline
Lesson 1868Great Expectations Framework
Validating accuracy
Checking that values make sense—for example, ensuring ages aren't negative or dates aren't in the future.
Lesson 12Data Cleaning and Preparation
Validating updates
Changes to foreign key values are checked against the parent table
Lesson 1055What is Referential Integrity?
Validation
Test on held-out data or different time periods
Lesson 1204From Hypothesis to Analysis Plan
Validation becomes complex
What metrics indicate your retrained model is "good"?
Lesson 2128Data Distribution Shifts Frequently
Validation set
Data you use to check performance during development
Lesson 14Model Evaluation and Validation
Validation utilities
(`utils/validation.
Lesson 2075Utility Modules and Helper Functions
Value constraints
Non-null requirements, allowed categories
Lesson 1151Schema Validation
Values near zero
suggest little to no linear relationship at that lag
Lesson 720The Autocorrelation Function (ACF)
Vanity metrics
are measurements that appear impressive at first glance—often large, growing numbers—but don't connect to actionable business outcomes or inform strategic decisions.
Lesson 1612What Are Vanity Metrics?Lesson 1614Growth Without Retention
VARCHAR
or **TEXT**: Text strings (e.
Lesson 846Tables, Schemas, and Data Types
variability
(how much data points differ from each other), let's start with the simplest way to measure it: **range**.
Lesson 47Range: The Simplest MeasureLesson 294Margin of Error and Its ComponentsLesson 296Sample Size for Comparing Two Groups
Variability in the data
More spread (higher standard deviation) → larger standard error → larger margin of error.
Lesson 271Margin of Error
Variable distributions
Shape and spread along the diagonal
Lesson 1191Scatter Plot Matrices and Pairplots
Variable Name
The exact column name as it appears in your data
Lesson 2064Creating Data Dictionaries
Variables to exclude
"Drop `user_id` (high cardinality, no predictive value)"
Lesson 1212EDA Summary Documentation and Next Steps
Variance = 1/λ²
The variance equals the square of the mean (1/λ)
Lesson 166Exponential Distribution: Mean and Variance
Variance inequality
Two-sample t-tests are more sensitive to unequal variances when sample sizes differ between groups.
Lesson 382Robustness of t-Tests to Assumption Violations
Variance Inflation Factor (VIF)
quantifies this problem by measuring how much the variance of a coefficient estimate is "inflated" due to correlation with other predictors.
Lesson 582Variance Inflation Factor (VIF)
Variance inspection
Directly compare the empirical variance and mean of your count variable across groups.
Lesson 693Overdispersion in Count Data
Variety
captures the diversity of data types and sources.
Lesson 1760Defining Big Data: The Three Vs
Vector formats
(like PDF, SVG, EPS) store mathematical descriptions of shapes.
Lesson 1273Saving Figures: Formats and Resolution
Vectorized operations
Modern CPUs process columns of uniform data types far faster than mixed-type rows.
Lesson 1811Columnar Storage and Query Optimization
Velocity
describes the speed at which data arrives and must be processed.
Lesson 1760Defining Big Data: The Three Vs
Verdict
We either reject innocence (guilty) or fail to reject it (not guilty — notice we don't say "innocent")
Lesson 312Hypothesis Testing as a Legal Analogy
Verifiable
– Anyone can check if it was achieved
Lesson 1610Defining Measurable Key Results
Verify balance
across both stratification variables and other covariates
Lesson 1489Stratified Randomization Fundamentals
Verify basics
Can you ping the database host?
Lesson 1093Troubleshooting Connection Issues
Verify data collection
Is this data being captured at all?
Lesson 2098Identifying Data Availability Gaps Early
Verify independence
review your sampling method and data collection
Lesson 290Assumptions and Diagnostics for Difference Intervals
Verify residuals
Check that the sum of residuals equals zero (or very close)
Lesson 522Implementing Least Squares from Scratch
Verify the value
against source data—is it a recording error?
Lesson 1209Outlier Detection and Investigation
Version
If the dataset has explicit versioning (like "v2.
Lesson 2063Essential Metadata to Capture
Version control
for tracking code changes over time
Lesson 29Code and Environment Management
Version control it
alongside your report code
Lesson 1987Environment and Dependency Management
Version drift
means that installing "the latest" packages today gives you a different environment than "the latest" six months ago, breaking reproducibility even when you follow the same steps.
Lesson 2048The Dependency Hell Problem
Version your data
Track which dataset version you used
Lesson 30The Reproducibility Crisis and Solutions
Version-controlled code
that documents every transformation
Lesson 1981What Makes a Report Reproducible?
Vertical bars
Each bar represents the correlation at a specific lag
Lesson 722ACF Plots and Interpretation
Vertical patterns
All cohorts struggling at the same time period (e.
Lesson 1649Visualizing Cohort Data with Heatmaps
Vertical scaling (scale-up)
means upgrading to a more powerful single machine—more RAM, more CPU cores, faster disks.
Lesson 1767Scale-Up vs Scale-Out Architectures
View the reflog
`git reflog` shows recent `HEAD` movements with timestamps and commit hashes
Lesson 2021Recovering from Rebase Mistakes
VIF = 1
No correlation with other predictors (ideal)
Lesson 582Variance Inflation Factor (VIF)
VIF = 1–5
Moderate correlation (usually acceptable)
Lesson 582Variance Inflation Factor (VIF)
VIF = 5–10
High correlation (concerning, investigate further)
Lesson 582Variance Inflation Factor (VIF)
VIF > 10
Severe multicollinearity (action needed)
Lesson 582Variance Inflation Factor (VIF)
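In general, the VIF for a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. With only two predictors, R² is just their squared Pearson correlation, which makes a dependency-free sketch possible. The helper names and the exact handling of the boundary values 5 and 10 are assumptions, since the ranges above overlap at those points.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF for the two-predictor special case: 1 / (1 - r^2)."""
    r_squared = pearson_r(x1, x2) ** 2
    if r_squared >= 1.0:
        return math.inf  # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)


def interpret_vif(vif):
    """Map a VIF to the rule-of-thumb labels listed above."""
    if vif > 10:
        return "severe"
    if vif > 5:
        return "high"
    if math.isclose(vif, 1.0):
        return "none"
    return "moderate"


x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]  # correlated with x1 (r = 0.8)
vif = vif_two_predictors(x1, x2)
print(round(vif, 2), interpret_vif(vif))  # 2.78 moderate
```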
VIF-guided removal
Remove the predictor with highest VIF, recalculate, repeat
Lesson 585Remedies: Variable Selection
Violin plots
go further by showing the **full probability density** of the data.
Lesson 1223Box Plots and Violin PlotsLesson 1268Box Plots and Violin Plots
Virality Coefficient (k)
= Invites Sent per User × Conversion Rate
Lesson 1631Social Media Metrics: DAU/MAU and Content Engagement
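The formula is a single multiplication. A tiny sketch with made-up inputs (the function name and the numbers are illustrative):

```python
def virality_coefficient(invites_per_user, conversion_rate):
    """k = invites sent per user x share of invites that convert."""
    if not 0.0 <= conversion_rate <= 1.0:
        raise ValueError("conversion_rate must be between 0 and 1")
    return invites_per_user * conversion_rate


# Hypothetical product: each user sends 4 invites, 25% of them convert.
k = virality_coefficient(invites_per_user=4, conversion_rate=0.25)
print(k)  # 1.0
```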
Viridis palettes
are perceptually uniform and colorblind-friendly:
Lesson 1368Color Scales and Palettes
Visual Diagnostics
Histograms or density plots overlaying treatment and control distributions make imbalances immediately visible.
Lesson 1491Covariate Balance and Diagnostics
Visual inspection first
Does your plot show obvious trend or changing variance?
Lesson 718Interpreting Stationarity Test Results
Visual methods
(histograms, density plots, Q-Q plots) give you the *intuitive picture*.
Lesson 210Combining Visual and Statistical MethodsLesson 377Testing Normality: Visual Methods
Visual proof
A chart that makes the trend immediately visible
Lesson 1946Supporting Your Claims with Evidence
Visual separation
of confidence bands (non-overlapping suggests real differences)
Lesson 817Comparing Multiple Survival Curves
Visual storytelling
Plots, dashboards, or interactive demos
Lesson 2141Building a Portfolio and Personal Brand
Visualization tools work smoothly
Libraries like Pandas and plotting tools expect tidy structure
Lesson 1142What is Tidy Data?
Visualizations
Use bar charts comparing segment characteristics side-by-side, box plots showing distributions of key metrics within segments, or radar charts displaying multiple dimensions simultaneously.
Lesson 1709Segment Profiling and Interpretation
Visualizations over tables
charts speak louder than numbers
Lesson 2091Stage 7: Communication and Handoff
Visualize
what "sampling variability" really means
Lesson 259Simulating Sampling Distributions
Visualize demographics
Plot key characteristics of your sample against the population.
Lesson 250Strategies for Bias Detection and Mitigation
Volume spike
Multiple complaints in a short window
Lesson 1673Leading Indicators of Churn
Voluntary churn
happens when customers actively choose to leave.
Lesson 1670What is Churn and Why It Matters
Volunteer bias
(also called **self-selection bias**) occurs when people choose whether or not to participate in a study, and those who volunteer differ in important ways from those who don't.
Lesson 246Volunteer and Self-Selection Bias
VP and above
Org-wide vision, resource allocation, executive influence
Lesson 2140Individual Contributor vs Management Tracks
Vulnerability
Over-reliance on power users means losing a few hurts badly
Lesson 1698Power User Curves and Engagement Distribution

W

W-shaped attribution model
recognizes that not all touchpoints are equally important.
Lesson 1730W-Shaped Attribution Model
WAIC
(Widely Applicable Information Criterion) or **LOO** (Leave-One-Out cross-validation) to compare them:
Lesson 1596Posterior Predictive Checks and Model Comparison
Wait for external conditions
before proceeding
Lesson 1836Task Dependencies and Flow Control
Wald tests
with **z-statistics** (because we're using maximum likelihood estimation, not least squares).
Lesson 683Hypothesis Tests for Individual Coefficients
Warning signs
before model fitting goes wrong
Lesson 584Correlation Matrices for Predictors
Warning/Email
Elevated error rate, slower performance, approaching thresholds—investigate during business hours
Lesson 1858Alerting Strategies
Warranty planning
Understanding failure patterns helps set optimal warranty periods
Lesson 188Weibull Distribution: Hazard Function and Reliability
Wasted computational resources
on variables that don't add value
Lesson 1197Identifying Variable Importance and Redundancy
Wasted effort
Including redundant features adds complexity without improving predictions
Lesson 513Applications: Feature Selection and Multicollinearity
Wasted space
Many columns contain `NULL` for half the rows
Lesson 1148Handling Multiple Types in One Table
Watch for hesitation
If someone pauses, squints, or re-reads labels, you've found friction.
Lesson 1964Testing Visualizations with Audiences
Watch Time
(or Listen Time): Total hours users spend consuming content.
Lesson 1635Media and Content Metrics: Watch Time and Content Performance
Weak
"Improve customer satisfaction"
Lesson 1610Defining Measurable Key Results
Weak or no relationships
appear as values near 0, suggesting variables are independent of each other.
Lesson 511Reading and Interpreting Correlation Matrices
Weakly Informative Prior
Use `Beta(2, 20)` or similar if you expect roughly 10% conversion but aren't certain.
Lesson 1581Setting Priors for A/B Tests
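Because the Beta prior is conjugate to binomial conversion data, updating it is simple addition: conversions are added to the first parameter and non-conversions to the second. A minimal sketch starting from `Beta(2, 20)`; the helper names and the test data (30 conversions from 200 visitors) are made up for illustration.

```python
def update_beta(prior_alpha, prior_beta, conversions, visitors):
    """Posterior Beta parameters after observing binomial data."""
    failures = visitors - conversions
    return prior_alpha + conversions, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


prior = (2, 20)
print(round(beta_mean(*prior), 3))  # 0.091, roughly a 10% prior guess

posterior = update_beta(*prior, conversions=30, visitors=200)
print(posterior)  # (32, 190)
print(round(beta_mean(*posterior), 3))  # 0.144
```

The prior carries the weight of only 22 pseudo-observations, so 200 real visitors pull the estimate most of the way toward the observed 15% rate.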
Weakly informative priors
gently guide the analysis away from unrealistic values (like 99% conversion) without imposing strong opinions.
Lesson 1534The Prior DistributionLesson 1559Uninformative and Weakly Informative PriorsLesson 1565Prior Distributions for Normal Means
Weaponization
A facial recognition system built for user authentication could be repurposed for mass surveillance or stalking.
Lesson 1920Anticipating Misuse of Data Products
Web Mercator
What you see in Google Maps and most web applications
Lesson 1308Geographic Data Types and Coordinate Systems
Web sources
Websites, social media, online reviews
Lesson 11Data Collection and Acquisition
Web traffic
Marketing campaigns or service outages
Lesson 1412What is Change-Point Detection?
Web traffic analysis
Detecting unusual spikes beyond typical weekday/weekend patterns or holiday seasons
Lesson 1411Applications and Limitations
Web UI
Visual dashboard for monitoring pipelines
Lesson 1833Introduction to Apache Airflow
Website traffic and sales
Marketing spend might drive both independently
Lesson 1423The Third Variable ProblemLesson 1424Reverse Causality
Week 4 retention
for January cohort: 45%
Lesson 1650Comparing Cohorts Over Time
Weibull
extends exponential by allowing failure rates to change over time (shape parameter).
Lesson 193Choosing Between Distributions in Practice
Weight by predictive power
use churn models or LTV correlations to guide weights
Lesson 1699Engagement Scoring Systems
Weight your data
If you can't get a perfect sample, assign weights to underrepresented groups so they count more in your analysis—this mathematically corrects for imbalance.
Lesson 250Strategies for Bias Detection and Mitigation
weighted average
of past observations, where recent values matter more than older ones.
Lesson 757Introduction to Exponential SmoothingLesson 1566Conjugate Normal-Normal Model
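For the exponential smoothing sense of this term, the recursion can be sketched as below. This assumes simple exponential smoothing initialized at the first observation; the function name, the sample series, and the choice of `alpha` are illustrative, not taken from the lesson.

```python
def simple_exponential_smoothing(values, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    The first smoothed value is set to the first observation (one common
    initialization; others exist).
    """
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


series = [10, 20, 30]
smoothed_series = simple_exponential_smoothing(series, alpha=0.5)
print(smoothed_series)
```

A larger `alpha` puts more weight on recent observations; a smaller one produces a smoother, slower-reacting series.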
Welch's ANOVA
Handles unequal variances without requiring transformations
Lesson 470When Parametric ANOVA Assumptions Fail
What automated decisions
involve their data (if any)
Lesson 1908Data Subject Access Requests (DSARs)
what changed
, **why you changed it**, and **what assumptions you made** at each processing step.
Lesson 1162Documenting TransformationsLesson 1955Framing Insights in Business Language
What data
you hold about them (copy of all personal data)
Lesson 1908Data Subject Access Requests (DSARs)
What did you find
State the key insight in one clear sentence.
Lesson 1944Executive Summary Best Practices
What follow-up analyses
you'll run based on different outcomes
Lesson 1204From Hypothesis to Analysis Plan
What it means
Your residuals have more extreme values (outliers) than a normal distribution would predict.
Lesson 567Common Q-Q Plot Patterns: Heavy Tails and Light Tails
What metric defines success
(e.
Lesson 1167Identifying Success Criteria
What they actually test
Whether two groups have **identical distributions**.
Lesson 394Interpreting Rank-Based Tests: Medians vs Distributions
What this means
Hat values range from `1/n` to 1.
Lesson 573Calculating and Interpreting Hat Values
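A small sketch of the calculation, assuming simple linear regression with an intercept, where the hat value reduces to `1/n + (x_i - mean)^2 / Sxx`. The function name and the data are hypothetical.

```python
def hat_values(x):
    """Leverage of each observation in a one-predictor regression with intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)  # total squared deviation of x
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


x = [1.0, 2.0, 3.0, 4.0, 20.0]  # the last point sits far from the others
h = hat_values(x)

for xi, hi in zip(x, h):
    print(f"x = {xi:5.1f}  hat value = {hi:.3f}")
```

The point far from the mean of X receives a hat value close to 1, while the clustered points stay near the `1/n` floor. In this setting the hat values sum to 2, the number of estimated coefficients.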
What threshold constitutes improvement
(e.
Lesson 1167Identifying Success Criteria
What validation approach
you'll apply
Lesson 1204From Hypothesis to Analysis Plan
What would constitute evidence
for or against your hypothesis
Lesson 1204From Hypothesis to Analysis Plan
What's the impact
Quantify the business outcome.
Lesson 1944Executive Summary Best Practices
WhatsApp
Number of messages sent — directly measures the utility users get from communication.
Lesson 1606Examples of North Star Metrics by Industry
When differences emerge
(curves may start together then diverge)
Lesson 817Comparing Multiple Survival Curves
When duplicates are meaningful
Combining sales records, event logs, or time-series data where each row represents a distinct occurrence
Lesson 1000UNION ALL: Preserving Duplicates
When satisfied
Your β₀ and β₁ estimates are **unbiased**—on average, they hit the true population values.
Lesson 552Zero Conditional Mean of Errors
When to shift focus
Once flattened, optimize retention earlier in the curve rather than fighting churn at the tail
Lesson 1658Flattening and Asymptotic Behavior
When to use which
Report eta-squared for descriptive purposes with your current sample; use omega-squared when making inferences about population-level effects.
Lesson 445Effect Size: Eta-Squared and Omega-Squared
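The two measures can be compared with a short sketch. It assumes a one-way ANOVA and the commonly used formulas `eta² = SS_between / SS_total` and `omega² = (SS_between - df_between * MS_within) / (SS_total + MS_within)`; the function name and the three small groups are made up for illustration.

```python
def effect_sizes(groups):
    """Return (eta_squared, omega_squared) for a one-way ANOVA layout."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_within = ss_total - ss_between

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_within = ss_within / df_within

    eta_sq = ss_between / ss_total
    omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    return eta_sq, omega_sq


groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
eta_sq, omega_sq = effect_sizes(groups)
print(f"eta-squared:   {eta_sq:.3f}")
print(f"omega-squared: {omega_sq:.3f}")
```

Omega-squared comes out smaller than eta-squared here because it subtracts the share of between-group variation expected from sampling noise alone.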
When violated
Predictions are systematically wrong at certain X ranges; coefficient estimates are misleading.
Lesson 552Zero Conditional Mean of Errors
WHERE filters first
It eliminates individual rows from the raw table before any grouping or aggregation happens
Lesson 915Combining WHERE and HAVING
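The ordering can be checked with a small sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", 50, "paid"),
        ("ann", 70, "paid"),
        ("ann", 500, "refunded"),  # removed by WHERE before grouping
        ("bob", 30, "paid"),
        ("bob", 20, "paid"),
    ],
)

rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE status = 'paid'        -- row-level filter, applied first
    GROUP BY customer
    HAVING SUM(amount) > 100     -- group-level filter, applied after aggregation
    ORDER BY customer
    """
).fetchall()

print(rows)
conn.close()
```

The refunded row never reaches `SUM`, so the `HAVING` threshold is evaluated against paid totals only.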
Which specific pairs
are problematic
Lesson 584Correlation Matrices for Predictors
Who bears the cost
Sometimes the aggregate accuracy loss is small, but one subgroup's performance drops significantly.
Lesson 1891Fairness-Accuracy Tradeoffs
Who drives value
Are 20% of users responsible for 70% of activity?
Lesson 1698Power User Curves and Engagement Distribution
Who might challenge this
Peers and auditors need enough detail to validate your rigor.
Lesson 1947Handling Methodology and Technical Details
Why it matters
With finite populations, sampling without replacement affects probabilities as you go.
Lesson 233Populations in PracticeLesson 541Properties of Residuals
Why it works
It handles multiple predictors simultaneously, quantifies each feature's impact, and produces interpretable coefficients.
Lesson 1674Churn Prediction Models
Why this works
The regression "controls for" the intermediate lags, removing their influence and revealing only the direct relationship.
Lesson 729Calculating Partial Autocorrelations
Wide confidence bands
High uncertainty (small risk set)
Lesson 815Survival Curve Plots and Interpretation
Widely understood
The standard language for discussing variability across fields
Lesson 49Standard Deviation: Interpretable Spread
Wilcoxon Signed-Rank Test
improves on this by incorporating the **size** of differences while remaining non-parametric (no normality assumption required).
Lesson 392Wilcoxon Signed-Rank Test
Wilcoxon test
(also called Breslow test) weights earlier time points more heavily because more subjects are at risk early on.
Lesson 823Log-Rank Test vs Other Tests
Win-back
strategies target customers who've already churned, while **retention** strategies aim to prevent at-risk customers from leaving in the first place.
Lesson 1676Win-Back and Retention Strategies
Win-Back Candidates
A subset of churned customers worth targeting for reactivation—perhaps they left for fixable reasons or represent high LTV potential.
Lesson 1704Customer Lifecycle Stages
Wireframe plots
show the underlying grid structure more clearly and reduce visual clutter when you need to see through the surface or understand the data's resolution.
Lesson 13253D Surface and Wireframe Plots
With adjustment
Include age in your regression model.
Lesson 1431Controlling for Confounders: Adjustment
Within Groups
(or "Error"): Variation due to random differences within groups
Lesson 444The ANOVA Table
Within-group variance (denominator)
Measures the average variability within each group (pooled across all groups)
Lesson 440The F-Statistic and Its Distribution
Without adjustment
Exercise appears negatively associated with blood pressure, but is that real, or just because older people exercise less and tend to have higher blood pressure?
Lesson 1431Controlling for Confounders: Adjustment
Without manipulation
means doing so honestly, proportionally, and with full context—not weaponizing emotion to bypass critical thinking or hide inconvenient truths.
Lesson 1941Emotional Connection Without Manipulation
Without manual oversight
(weekends, holidays, overnight)
Lesson 1831What is Job Scheduling?
Without ownership
, metrics suffer:
Lesson 1619What is Metric Ownership?
Word frequency
A few words appear constantly; most are rare
Lesson 190The Pareto Distribution: Heavy Tails and Power Laws
Work with ordinal data
(survey ratings like "good, better, best")
Lesson 486Spearman's Rank Correlation Coefficient
Working Directory
Your desk where you're actively working on documents
Lesson 1993The Three States: Working Directory, Staging, Repository
Working sessions
Bi-weekly meetings to review preliminary findings and get rapid feedback
Lesson 2104Communication Cadence and Updates
Worst-case scenarios
Can you survive the potential losses?
Lesson 152Decision Making Under Uncertainty
Write complexity
Every time underlying data changes, you must update the aggregate
Lesson 1073Storing Computed Values and AggregatesLesson 1075Handling Data Consistency in Denormalized Schemas
Write operation cost
How much slower are inserts and updates?
Lesson 1077Measuring Performance Impact of Denormalization
Wrong data type
Applying `AVG()` to non-numeric columns causes errors
Lesson 884AVG: Computing Averages
Wrong interpretation
"Going to the hospital makes people sick."
Lesson 496Reverse Causality
Wrong period
Seasonal spikes get flagged as false positives, or real anomalies blend into "normal" variation
Lesson 1409Setting Detection Parameters
Wrong summary
Using mean when data has outliers (better: median)
Lesson 1245Misleading Aggregations and Binning

X

X → Y
(causal path) and **Z → X → Y** plus **Z → Y** (a confounder creating a backdoor path **X ← Z → Y**), controlling for **Z** blocks the backdoor while preserving the causal arrow.
Lesson 1472The Backdoor Criterion
X value
but may fit the pattern perfectly.
Lesson 587Identifying Outliers in Regression Context
X-axis
Time periods since the initial event (Day 0, Day 7, Day 30, etc.)
Lesson 1653What are Retention Curves?
X'
The transpose of X (flip rows and columns)
Lesson 598Estimating Coefficients with Least Squares
X(t)
instead of just **X**.
Lesson 833Time-Varying Covariates
X₁, X₂, ..., X
Your predictor variables (independent variables)
Lesson 596The Multiple Regression Equation

Y

Y value
given its X value—it doesn't follow the pattern of the other data points.
Lesson 587Identifying Outliers in Regression Context
Y-axis
Percentage of the original cohort still active (0-100%)
Lesson 1653What are Retention Curves?
Y-units per X-unit
, and this determines how you communicate your findings.
Lesson 525Units and Scale in Interpretation
YAML
Human-readable, great for hierarchical settings
Lesson 2072Configuration Files vs Hard-Coded Values
YAML header
Metadata at the top specifying output format, title, author, and date
Lesson 1983R Markdown for Dynamic Reports
Yearly cycles
Ice cream sales peak every summer, heating costs rise every winter
Lesson 707Seasonality: Regular Periodic Patterns
Years of experience
may serve as an age proxy
Lesson 1883Protected Classes and Proxy Variables
You
stay focused on what actually matters to your audience
Lesson 1942The Pyramid Principle: Starting with the Conclusion
You can say
"Given our data and prior beliefs, there's a 95% probability the true conversion rate is between 45% and 74%."
Lesson 1562Credible Intervals for ProportionsLesson 1578Interpreting Credible Intervals
You have limited data
Conjugacy helps when likelihood is weak
Lesson 1556Choosing Between Conjugate and Non-Conjugate Priors
You miss real effects
– Even if your new feature genuinely improves conversion by 2%, your test might conclude "no significant difference" simply because you didn't collect enough data.
Lesson 1529Running Underpowered Tests
You stay in control
of the narrative instead of improvising
Lesson 1949Anticipating Questions: Building in Appendices
You use NOT IN
with nullable columns (can miss results and run slowly)
Lesson 966Performance Considerations for WHERE Subqueries
You want stable variance
Some transformations stabilize variance across different data ranges, meeting another key assumption
Lesson 211Why Transform Data to Normality?
You waste resources
– Your engineering team built the feature, you split traffic for weeks, analyzed results.
Lesson 1529Running Underpowered Tests
You're building linear models
Transforming the response variable can improve model fit and prediction accuracy
Lesson 211Why Transform Data to Normality?
You're doing exploratory work
The interactive nature of Pandas in Jupyter makes rapid iteration easier than Spark's batch-oriented workflows.
Lesson 1787When to Optimize Pandas Instead
You're exploring
Early-stage analysis where perfect precision isn't critical
Lesson 1556Choosing Between Conjugate and Non-Conjugate Priors
Your branch's latest commit
The tip of your current branch
Lesson 2009Three-Way Merges
Your data is categorical/binary
Each observation falls into one of two categories (success/failure, yes/no, clicked/didn't click)
Lesson 399When to Use the One-Sample Z-Test for Proportions
Your data is skewed
Income data, reaction times, or count data often pile up on one side
Lesson 211Why Transform Data to Normality?
Your fitted model
The GLM you actually built
Lesson 697Deviance: A Measure of Model Fit
Your operations are vectorized
Pandas, built on NumPy, excels at vectorized operations.
Lesson 1787When to Optimize Pandas Instead
Your outcome is categorical
predicting "yes/no" or categories requires logistic regression or classification methods instead
Lesson 555When Regression Is and Isn't Appropriate
Your own experience
Previous projects or work in adjacent fields
Lesson 1201Domain Knowledge as a Hypothesis Source
Your own historical cohorts
to track improvement
Lesson 1657Day-1, Day-7, Day-30 Benchmarks
Your sample size (n)
Larger datasets have different thresholds
Lesson 1392Critical Values and Significance Testing
Your significance level (α)
Typically 0.
Lesson 1392Critical Values and Significance Testing
Your significance level α
(and whether your test is one-tailed or two-tailed)
Lesson 355Finding Critical Values and P-Values

Z

Z → X
(third variable influences X)
Lesson 1423The Third Variable Problem
Z ≈ 0
The value is close to average
Lesson 1376What is the Z-Score Method?
z-score method
uses this to flag outliers: if a data point is *too many* standard deviations away from the mean, it's probably an outlier.
Lesson 71Z-Score Method for Outlier DetectionLesson 1386IQR Method vs Z-Score: When to Use Each
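A minimal sketch of the method, assuming the population standard deviation and a cutoff of 3 standard deviations (a common convention, not a fixed rule). The function name and the data are hypothetical.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / sd) > threshold]


data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 11, 10, 12, 50]
flagged = zscore_outliers(data, threshold=3.0)
print(flagged)
```

Note that the extreme value itself inflates both the mean and the standard deviation, so in very small samples no point can reach a z-score of 3 at all.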
Z-scores
(which you'll learn to calculate soon) tell you how many standard deviations away from the mean you are.
Lesson 62Percentiles vs Z-Scores: Complementary Position MeasuresLesson 1209Outlier Detection and Investigation
Z-table
(also called a standard normal table) is a reference chart that shows cumulative probabilities for the standard normal distribution.
Lesson 198Using Z-Tables for Probability
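The cumulative probabilities a Z-table lists can also be computed directly. The sketch below uses the identity between the standard normal CDF and the error function from Python's `math` module; the function name is illustrative.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


for z in (-1.96, 0.0, 1.0, 1.96):
    print(f"P(Z <= {z:5.2f}) = {standard_normal_cdf(z):.4f}")
```

Looking up `z = 1.96` in a printed table and calling this function should agree to the table's four decimal places.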
Z-test
Used when you have large samples or known population variance
Lesson 1749Measuring Statistical Significance
Z-tests
for proportions determine if selection rate differences are statistically meaningful
Lesson 1890Measuring Disparate Impact
Zero residuals
(`e_i = 0`) mean your prediction was exactly correct (rare in practice!)
Lesson 540The Residual Formula
Zero slope
X has no linear relationship with Y
Lesson 524The Meaning of the Slope
ZIP code
often correlates with race and income due to historical segregation patterns
Lesson 1883Protected Classes and Proxy VariablesLesson 1889Proxy Variables and Redlining
Zip code or address
→ race, income, immigration status
Lesson 1889Proxy Variables and Redlining
Zombie users
Automated scripts or bots inflate counts without real engagement
Lesson 1694Daily Active Users (DAU) and Monthly Active Users (MAU)
Zoom controls
allow users to magnify regions of interest by scrolling or clicking-and-dragging, making dense visualizations navigable.
Lesson 1303Range Sliders and Zoom Controls