Data Science Glossary
Key terms from the Data Science course, linked to the lesson that introduces each one.
5,145 terms.
#
- `element_blank()`
- Removes elements entirely
- Lesson 1365 — Customizing Non-Text Elements; Lesson 1366 — Theme() Function Deep Dive
- `element_line()`
- Controls line elements (grid lines, axis lines)
- Lesson 1365 — Customizing Non-Text Elements; Lesson 1366 — Theme() Function Deep Dive
- `element_rect()`
- Controls rectangular elements (backgrounds, borders)
- Lesson 1365 — Customizing Non-Text Elements; Lesson 1366 — Theme() Function Deep Dive
- ±3 standard deviations
- Captures about 99.7% of values when the data are normally distributed.
- Lesson 1378 — Setting Z-Score Thresholds; Lesson 1397 — Shewhart Control Chart Basics
- 1:1 to 3:1
- Breaking even or marginally profitable, but likely not covering operational overhead, support costs, or providing adequate return on capital.
- Lesson 1667 — LTV:CAC Ratio and Profitability; Lesson 1756 — LTV:CAC Ratio as a Health Metric
- 80/20 rule
- Lesson 191 — Pareto Principle and the 80/20 Rule; Lesson 2116 — Diminishing Returns and the 80/20 Rule
- 80% power
- An 80% chance of detecting a true effect if it exists.
- Lesson 446 — Power and Sample Size for ANOVA; Lesson 1495 — Power Analysis Fundamentals
- 95% confidence
- The procedure captures the true parameter 95% of the time
- Lesson 267 — Interpreting Confidence Levels; Lesson 278 — Confidence Interval Formula for One Proportion
- α (alpha)
- Controls the shape on the left side
- Lesson 184 — Beta Distribution: Bounded Between 0 and 1; Lesson 761 — Double Exponential Smoothing (Holt's Method)
- α close to 0
- Lesson 758 — Simple Exponential Smoothing (SES); Lesson 759 — Choosing the Smoothing Parameter α
- α close to 1
- Lesson 758 — Simple Exponential Smoothing (SES); Lesson 759 — Choosing the Smoothing Parameter α
- β (beta)
- Controls the shape on the right side
- Lesson 184 — Beta Distribution: Bounded Between 0 and 1; Lesson 331 — Understanding Type II Error (False Negative); Lesson 761 — Double Exponential Smoothing (Holt's Method)
- λ (lambda)
- Called the **rate parameter**.
- Lesson 165 — Exponential Distribution: PDF and CDF; Lesson 593 — Box-Cox Transformation
- μ (mu)
- The mean, which determines where the center of the bell curve sits
- Lesson 169 — The Normal Distribution: Definition and Properties; Lesson 170 — Parameters: Mean (μ) and Standard Deviation (σ); Lesson 172 — Probability Density Function for Normal Distribution
- σ (sigma)
- The standard deviation, which controls how spread out or "wide" the bell curve is
- Lesson 169 — The Normal Distribution: Definition and Properties; Lesson 170 — Parameters: Mean (μ) and Standard Deviation (σ); Lesson 172 — Probability Density Function for Normal Distribution
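
As a rough illustration of the μ, σ, and ±3 standard deviations entries above, the sketch below computes how much of a normal distribution lies within k standard deviations of the mean. The function names `normal_coverage` and `z_score` and the sample numbers are illustrative assumptions, not taken from the course lessons.

```python
import math


def normal_coverage(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean."""
    # For a normal variable X, P(|X - mu| <= k * sigma) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"within ±{k} sd: {normal_coverage(k):.4f}")
    # Hypothetical example: a value of 130 when mu = 100 and sigma = 15.
    print(z_score(130, 100, 15))
```

The coverage figures only hold for normally distributed data; on skewed or heavy-tailed data a ±3 threshold can flag far more or far fewer points.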
A
- above
- or **below** the hypothesized median.
- Lesson 391 — The Sign Test for Medians; Lesson 567 — Common Q-Q Plot Patterns: Heavy Tails and Light Tails; Lesson 568 — Skewness in Q-Q Plots: Left and Right Deviations
- Above 5:1
- Excellent margins, but potentially signals underinvestment in growth.
- Lesson 1667 — LTV:CAC Ratio and Profitability
- Above average
- `WHERE value > (SELECT AVG(value) FROM table)`
- Lesson 964 — Subqueries with Aggregate Functions
- absolute difference
- |p₁ - p₂|.
- Lesson 413 — Effect Size and Practical Significance; Lesson 1019 — Comparing Values to Window Aggregates
- Acceleration
- If the variance of your statistic changes across different data values (making the distribution asymmetric), BCa accounts for this skewness
- Lesson 304 — BCa Bootstrap Intervals: Bias Correction
- Accept limitations
- Acknowledge when clean measurement isn't feasible
- Lesson 1527 — Ignoring Network Effects
- Accept or reject
- if balance is good, proceed; if not, generate a new randomization
- Lesson 1492 — Rerandomization and Practical Implementation
- Acceptable error types
- Is a false positive worse than a false negative?
- Lesson 2117 — Defining 'Good Enough' with Stakeholders
- Access controls
- Limit who can use sensitive features (building on GDPR principles you learned)
- Lesson 1925 — Mitigation Strategies and Responsible Disclosure
- Access method
- How you retrieved it (SQL query, API call, manual download, automated script)
- Lesson 1161 — Documenting Data Sources
- Accessibility
- Design for colorblind viewers, provide alt text, and use clear labels.
- Lesson 1247 — The Ethics of Visualization Design; Lesson 2086 — Stage 2: Data Acquisition and Assessment
- Accountability
- Lesson 1905 — Core Principles of GDPR
- Accounting for censored observations
- by including them in the "at-risk" count up until they're censored, then removing them
- Lesson 809 — Introduction to the Kaplan-Meier Estimator
- Accuracy
- What percentage of predictions were correct?
- Lesson 14 — Model Evaluation and Validation; Lesson 243 — Choosing the Right Sampling Method; Lesson 1863 — Data Quality Dimensions; Lesson 1869 — Data Quality Metrics and SLAs; Lesson 1878 — What is Bias in Data?; Lesson 1905 — Core Principles of GDPR; Lesson 1973 — Report Review and Quality Checklist; Lesson 2086 — Stage 2: Data Acquisition and Assessment
- Accuracy varies
- Ambiguous addresses ("Main Street") return less precise results
- Lesson 1315 — Geocoding and Reverse Geocoding
- ACF
- Autocorrelation Function. At lag k, it measures the total correlation between a time series and its own value k periods earlier.
- Lesson 728 — PACF vs ACF: Key Differences; Lesson 733 — Using ACF and PACF Together; Lesson 798 — SARIMA Model Selection
- ACF plot
- Should show rapid decay to zero, not slow tailing off
- Lesson 741 — Testing Stationarity After Transformation; Lesson 779 — The Box-Jenkins Methodology
- ACF plots
- decay very slowly (indicating non-stationarity)
- Lesson 734 — Why Differencing and Detrending Matter
- ACF/PACF clues
- When your first-differenced ACF still shows slow decay
- Lesson 736 — Higher-Order Differencing
- ACF/PACF of residuals
- Should show no significant spikes (all patterns captured)
- Lesson 799 — Fitting and Diagnosing SARIMA Models
- Acknowledge the constraint
- with stakeholders—don't promise ML magic without the ingredients.
- Lesson 2124 — Insufficient or Low-Quality Data
- Acknowledge the limitation
- State that the intercept has no practical interpretation in your context
- Lesson 526 — When the Intercept Has No Meaning
- Acknowledge uncertainty
- Use phrases like "on average" or "typically" rather than absolute claims
- Lesson 530 — Communicating Results to Non-Technical Audiences
- Acknowledging limitations
- Honest interpretation includes caveats—data gaps, model assumptions, confidence intervals.
- Lesson 2090 — Stage 6: Interpretation and Insight Generation
- Acquisition channels
- are the various pathways through which potential users discover and arrive at your product, website, or service.
- Lesson 1711 — What Are Acquisition Channels?
- Actionable
- What decisions will this answer inform?
- Lesson 1166 — Defining the Business Question; Lesson 1605 — Characteristics of Good North Star Metrics
- Actionable across teams
- – Different departments can influence it through their work
- Lesson 1604 — What is a North Star Metric?
- actions
- (triggers that produce results).
- Lesson 1774 — What is Apache Spark and Why Use It?; Lesson 1780 — Transformations vs Actions in Spark
- Activation benchmarks
- Lesson 1697 — Time-to-Value and Activation Metrics
- Active Customers
- Engaged users who regularly use your product or make repeat purchases.
- Lesson 1704 — Customer Lifecycle Stages
- Acyclic
- means no variable can cause itself through any path (no loops)
- Lesson 1468 — Introduction to Directed Acyclic Graphs (DAGs); Lesson 1833 — Introduction to Apache Airflow
- Adapt immediately
- Change a threshold or add a condition in minutes
- Lesson 2128 — Data Distribution Shifts Frequently
- Add a legend
- Always include a size legend so viewers can interpret bubble magnitudes
- Lesson 1229 — Bubble Charts for Three Variables
- Add candidate predictors
- one at a time or in meaningful blocks
- Lesson 703 — Sequential Model Building Strategy
- Add context
- Compare the effect size to something meaningful
- Lesson 530 — Communicating Results to Non-Technical Audiences
- Add Context and Clarity
- Lesson 1217 — The Transition from Explore to Explain
- Add redundant encoding
- Don't rely on color alone.
- Lesson 1248 — Color Blindness and Color Palette Design
- Adding a constant
- If you add a constant *c* to a random variable *X*, the expected value increases by exactly *c*:
- Lesson 149 — Properties of Expectation and Variance
- Adding intervals to dates
- Lesson 1040 — Date Arithmetic and INTERVAL Operations
- Additional detail
- on methodology (statistical tests used, data cleaning steps)
- Lesson 1949 — Anticipating Questions: Building in Appendices
- additive
- Lesson 466 — Visualizing Interactions; Lesson 710 — Additive vs Multiplicative Models; Lesson 744 — Classical Decomposition Methods; Lesson 765 — Introduction to Holt-Winters Method; Lesson 767 — Holt-Winters Additive Model
- Additive changes
- are safest: adding new columns doesn't break existing queries that don't reference them.
- Lesson 1876 — Schema Evolution and Backwards Compatibility
- Additive forecasting formula
- Lesson 771 — Forecasting with Holt-Winters
- Additive model
- `Observed = Trend + Seasonality + Irregular`
- Lesson 710 — Additive vs Multiplicative Models; Lesson 742 — Components of Seasonal Decomposition; Lesson 748 — Seasonally Adjusted Data; Lesson 749 — Using Decomposition for Forecasting; Lesson 770 — Initializing Holt-Winters Components
- Additive seasonality
- means seasonal fluctuations stay roughly constant in size regardless of the data's level.
- Lesson 766 — Additive vs Multiplicative Seasonality
- Adds a penalty
- Includes a penalty term for each additional change-point to avoid over-segmenting (similar to regularization in regression)
- Lesson 1416 — PELT Algorithm: Pruned Exact Linear Time
- Adequate range
- Data should span a reasonable range of values
- Lesson 480 — Scatterplots and Visual Assessment
- Adequate sample size
- Expected frequency ≥ 5 in every category
- Lesson 419 — Assumptions and Minimum Expected Frequencies
- Adjust next sprint
- Based on feedback, decide whether to refine features, try new models, or pivot
- Lesson 2113 — Timeboxing and Sprint Planning for Data Projects
- Adjust the scale
- to emphasize meaningful differences—sometimes a gradient from 0-100% works, other times 20-80% highlights actionable variations
- Lesson 1649 — Visualizing Cohort Data with Heatmaps
- Adjusted fences
- use a skewness coefficient to shift boundaries asymmetrically
- Lesson 1388 — Limitations and Alternatives to IQR Detection
- Adjusted p-values
- corrected for multiple testing (Tukey, Bonferroni, etc.)
- Lesson 462 — Interpreting and Reporting Post-Hoc Results
- Adjusted R-squared
- Quick model comparisons, reporting to non-technical audiences, when interpretability matters most.
- Lesson 616 — Adjusted R-Squared vs Other Criteria; Lesson 626 — Nested vs Non-Nested Models; Lesson 632 — Parsimony and Occam's Razor
- Adjustment
- means including confounders directly in a regression model as additional predictors.
- Lesson 1431 — Controlling for Confounders: Adjustment
- Administrative records
- Lists that don't capture informal workers or undocumented individuals
- Lesson 249 — Coverage Error and Undercoverage
- Administrative selection
- gatekeepers assign treatment based on need or eligibility
- Lesson 1444 — Selection Bias and Treatment Assignment
- Adoption rate
- = (Users who've used feature at least once) / (Total active users)
- Lesson 1696 — Feature Adoption and Usage Frequency
- Adstock
- (also called "advertising stock") is a transformation that captures two key phenomena:
- Lesson 1739 — Adstock and Carryover Effects
- Advanced composition
- Mathematical techniques can provide tighter bounds, so the total might be less than simple addition
- Lesson 1900 — Privacy Budget and Composition
- Adverse Impact Ratio
- extends the 80% rule with confidence intervals
- Lesson 1890 — Measuring Disparate Impact
- Advocacy in analyst's clothing
- Using your technical authority to push personal or organizational agendas
- Lesson 1926 — The Honest Broker Role
- aesthetic mappings
- define *how* your data becomes *visible*.
- Lesson 1341 — Data and Aesthetic Mappings; Lesson 1348 — The Base Layer: ggplot() and Data Mapping
- Aesthetics (aes)
- How variables map to visual properties like x-position, y-position, color, size, or shape
- Lesson 1339 — What is the Grammar of Graphics?; Lesson 1340 — The Seven Layers of Grammar
- Affected by extremes
- One very high or very low value (an outlier) can pull the mean in that direction
- Lesson 39 — The Mean (Arithmetic Average)
- After denormalizing
- Lesson 1077 — Measuring Performance Impact of Denormalization
- After testing
- If your first-differenced series still fails stationarity tests (ADF, KPSS)
- Lesson 736 — Higher-Order Differencing
- Age groups
- A person in the "18-25" bracket cannot also be in the "26-35" bracket
- Lesson 81 — Mutually Exclusive Events
- Age in customer data
- Even if someone's age of 150 falls within 3 standard deviations (passing the Z-score test), you know it's invalid—humans don't live that long.
- Lesson 75 — Domain-Specific Outlier Rules
- Aggregate
- your data
- Lesson 973 — Nested Subqueries in FROM; Lesson 994 — CTEs for Simplifying Complex Joins; Lesson 1827 — Transformation Patterns: Map, Filter, Aggregate
- Aggregate functions calculate
- Totals, averages, counts are computed for each group
- Lesson 915 — Combining WHERE and HAVING
- Aggregated summaries
- Storing `total_order_value` on a customer record instead of calculating it from order lines each time.
- Lesson 1074 — Duplicating Data Across Tables
- Aggregation
- Switch to hexbin maps or heatmaps for dense datasets
- Lesson 1310 — Point Maps and Scatter Plots on Maps
- Aggregation problems
- Lesson 1245 — Misleading Aggregations and Binning
- Aggregations
- Sum, count, mean calculations that don't need all data simultaneously
- Lesson 1800 — Chunked Reading with read_csv
- Agreed upon
- Stakeholders buy in *before* analysis begins
- Lesson 2094 — Defining Success Metrics Upfront
- Agreement is confidence
- When both tests agree (ADF rejects + KPSS doesn't reject), you can confidently call the series stationary.
- Lesson 718 — Interpreting Stationarity Test Results
- Agricultural data
- might follow growing seasons that vary by region
- Lesson 746 — Choosing Seasonal Period
- AIC
- , or **BIC** — but you cannot use the Partial F-Test
- Lesson 626 — Nested vs Non-Nested Models; Lesson 660 — Choosing the Polynomial Degree; Lesson 700 — AIC and BIC for Model Selection; Lesson 781 — Information Criteria: AIC and BIC; Lesson 785 — Information Criteria: AIC and BIC
- AIC (Akaike Information Criterion)
- and **BIC (Bayesian Information Criterion)** are scores that penalize models for using too many parameters while rewarding good fit to the data.
- Lesson 781 — Information Criteria: AIC and BIC
- AIC and BIC
- explicitly trade off fit quality against model size
- Lesson 632 — Parsimony and Occam's Razor; Lesson 791 — Comparing Nested and Non-Nested Models
- AIC/BIC
- Formal model selection procedures, comparing non-nested models, automated selection algorithms.
- Lesson 616 — Adjusted R-Squared vs Other Criteria
- Airbnb
- Nights booked (value = accommodations secured)
- Lesson 1604 — What is a North Star Metric?; Lesson 1606 — Examples of North Star Metrics by Industry
- Airflow
- offers multiple ways to declare dependencies:
- Lesson 1843 — Declaring Dependencies in Orchestration Tools
- Alation
- , and **Apache Atlas** maintain centralized inventories of your data assets.
- Lesson 1164 — Tools for Lineage Tracking
- Alert and Continue
- Lesson 1866 — Handling Failed Quality Checks
- Algorithm initialization
- Neural networks, k-means clustering, random forests all start with random states
- Lesson 2055 — Why Randomness Matters in Data Science
- Algorithmic amplification of harm
- occurs when automated systems take existing problems—bias, misinformation, manipulation, or discrimination—and multiply their impact exponentially.
- Lesson 1923 — Algorithmic Amplification of Harm
- Align with business reality
- Reflect how your sales team actually closes deals
- Lesson 1731 — Custom Rule-Based Attribution
- Aligned
- with long-term business value
- Lesson 1478 — Defining Success Metrics; Lesson 2094 — Defining Success Metrics Upfront
- all
- the uncertainty to one tail instead of splitting it between two tails.
- Lesson 275 — One-Sided Confidence Bounds; Lesson 729 — Calculating Partial Autocorrelations; Lesson 866 — The AND Operator; Lesson 928 — LEFT JOIN vs INNER JOIN: When to Use Each; Lesson 963 — ANY and ALL Operators; Lesson 1407 — The ESD Component; Lesson 1513 — Always-Valid Inference and Confidence Sequences; Lesson 1753 — Customer Acquisition Cost (CAC): Components and Calculation (+1 more)
- All assumptions met
- → Proceed with standard parametric t-test
- Lesson 383 — Diagnostic Workflow: When to Proceed or Switch Tests
- Allocate budget wisely
- Identify which touchpoints assist vs.
- Lesson 1719 — The Customer Journey and Touchpoints
- Allowed Values
- Valid ranges for numeric data or enumerated categories
- Lesson 2064 — Creating Data Dictionaries
- Alpha
- controls how much weight recent observations get when updating the **baseline level** of your series.
- Lesson 769 — Smoothing Parameters: Alpha, Beta, Gamma
- alphabetical order
- to select the reference category.
- Lesson 646 — Reference Categories in Statistical Software; Lesson 1178 — Bar Charts for Categorical Data
- Alt text
- (alternative text) is a brief written description of a visualization that screen readers can announce.
- Lesson 1250 — Text Alternatives and Screen Reader Compatibility
- Alternative
- The interaction coefficient differs from zero (it matters)
- Lesson 654 — Testing Interaction Significance
- Alternative (H₁)
- At least one group has different variance
- Lesson 450 — Homogeneity of Variance (Homoscedasticity); Lesson 683 — Hypothesis Tests for Individual Coefficients; Lesson 787 — Ljung-Box Test for Residual Autocorrelation
- Alternative analyses
- you considered but didn't choose (and why)
- Lesson 1949 — Anticipating Questions: Building in Appendices
- Alternative hypothesis (H₁)
- The data does *not* come from a normal distribution
- Lesson 205 — Shapiro-Wilk Test; Lesson 311 — One-Sided vs Two-Sided Alternatives; Lesson 354 — Setting Up Hypotheses for One-Sample t-Test; Lesson 378 — Testing Normality: Statistical Tests; Lesson 401 — Setting Up Hypotheses for Proportions; Lesson 406 — Two-Sample Proportion Test Setup; Lesson 500 — Hypothesis Testing Framework for Correlation; Lesson 501 — T-Test for Pearson Correlation Significance (+5 more)
- Always include confidence intervals
- , not just point estimates.
- Lesson 1928 — Communicating Uncertainty Honestly
- Always increasing
- As x increases, F(x) never decreases
- Lesson 157 — Cumulative Distribution Functions (CDFs) for Continuous Variables
- Always positive
- Log-normal variables are strictly greater than zero
- Lesson 178 — Log-Normal Distribution: Definition and Properties
- Always qualify columns
- in multi-table queries, even when names don't conflict—it makes your intent crystal clear
- Lesson 922 — Selecting Columns from Joined Tables
- Always specify join conditions
- that relate the tables using foreign key relationships
- Lesson 955 — Avoiding Cartesian Products
- Always state units
- when reporting slopes ("$150 per square foot," not just "150")
- Lesson 525 — Units and Scale in Interpretation
- Always try this first
- Use pandas' built-in operations that work on entire columns at once.
- Lesson 1806 — Parallel Processing with apply() Alternatives
- Always unique
- Unlike `RANK()` or `DENSE_RANK()`, ties receive different numbers based on arbitrary order
- Lesson 1007 — ROW_NUMBER(): Assigning Unique Row Numbers
- Always use parentheses
- when mixing `AND` and `OR`, even if precedence would give the correct result.
- Lesson 870 — Operator Precedence and Parentheses
- Always-valid inference
- provides p-values and confidence intervals that remain statistically valid *no matter when you stop* — whether you check once, continuously, or at random times you didn't plan ahead.
- Lesson 1513 — Always-Valid Inference and Confidence Sequences
- Amazon
- Number of purchases per month — reflects both customer satisfaction and business sustainability.
- Lesson 1606 — Examples of North Star Metrics by Industry
- Ambiguity kills analysis
- If you're studying "time to employee turnover," does the clock start at date of hire, end of training, or first promotion?
- Lesson 803 — Defining the Event and Time Origin
- Amplify historical inequities
- baked into training data
- Lesson 1888 — Protected Classes and Sensitive Attributes
- Analogy
- If your investment grows 10% one year and shrinks 10% the next, the arithmetic mean says 0% change—but you actually lost money!
- Lesson 44 — Geometric and Harmonic Means; Lesson 50 — Population vs Sample Variance; Lesson 57 — Quantiles: Quartiles, Deciles, and Beyond; Lesson 70 — Visual Methods: Box Plots and Scatter Plots; Lesson 74 — Multivariate Outlier Detection; Lesson 106 — Common Misconceptions About Independence; Lesson 126 — From Bernoulli to Binomial: Multiple Trials; Lesson 133 — Expectation and Variance of the Geometric Distribution (+57 more)
- Analysis cells
- Alternate between explaining your approach (markdown) and executing it (code)
- Lesson 1982 — Literate Programming with Notebooks
- Analysis plan
- Statistical test you'll use, significance level (usually α = 0.
- Lesson 1485 — Documentation and Pre-Registration
- Analytical
- "Which customer segments have the highest lifetime value, and what acquisition channels bring us those segments?"
- Lesson 2093 — Translating Business Questions into Analytical Questions
- Analytical goal
- Are you comparing values, showing distribution, revealing relationships, tracking change over time, or displaying composition?
- Lesson 1230 — Choosing the Right Chart Type
- Analytics
- You need to understand trends and make informed decisions
- Lesson 4 — Data Science vs Data Analytics vs Business Intelligence
- Analyze and Test
- Lesson 25 — The Scientific Method in Data Science
- Anchor Member
- The starting point—your initial row(s) with no dependencies.
- Lesson 996 — Recursive CTEs: Introduction
- Anderson-Darling test
- is another statistical test that checks whether your data follows a normal distribution, but with a special feature: it gives **more weight to the tails** (the extreme values at both ends) than the K-S test does.
- Lesson 207 — Anderson-Darling Test; Lesson 449 — Normality of Residuals
- Animation
- Show changes over time or across a third variable sequentially
- Lesson 1329 — Effective Use and Pitfalls of 3D Visualizations
- Annotations
- draw attention to specific data points or regions.
- Lesson 1271 — Adding Legends, Annotations, and Text; Lesson 1355 — Layer Order and Plot Composition
- Anomalies
- Flag anything unusual.
- Lesson 1180 — Documenting Univariate Findings; Lesson 2087 — Stage 3: Exploratory Data Analysis
- Anonymize rather than delete
- where possible for retained data
- Lesson 1909 — Right to Erasure and Data Retention Policies
- Anonymous participation options
- when power dynamics exist
- Lesson 1918 — Special Populations and Vulnerable Groups
- ANOVA framework
- (Analysis of Variance), which decomposes total variation into parts explained by the model versus leftover residuals.
- Lesson 618 — Global F-Test for Overall Model Significance
- Anscombe's quartet
- the famous cautionary tale where four datasets have identical summary statistics but wildly different relationships that only visualization reveals.
- Lesson 1222 — Scatter Plots for Relationships
- Answers to likely questions
- based on past presentations or stakeholder concerns
- Lesson 1949 — Anticipating Questions: Building in Appendices
- Anticipation
- occurs when units change behavior *before* treatment actually occurs.
- Lesson 1458 — Common DiD Pitfalls
- ANY
- Returns `TRUE` if the comparison is true for *at least one* value returned by the subquery
- Lesson 963 — ANY and ALL Operators; Lesson 1506 — Benjamini-Hochberg Procedure
- Any shape
- The original population can be uniform, exponential, Poisson, or anything else.
- Lesson 218 — What the Central Limit Theorem States
- Apache Airflow
- , **Prefect**, and **Dagster** log every execution step.
- Lesson 1164 — Tools for Lineage Tracking
- Apache Atlas
- maintain centralized inventories of your data assets.
- Lesson 1164 — Tools for Lineage Tracking
- Apache Spark
- emerged as a faster alternative, keeping data in memory when possible and supporting iterative algorithms (essential for machine learning).
- Lesson 1764 — The Big Data Technology Landscape
- Aperiodicity
- The chain doesn't get stuck in cycles
- Lesson 1589 — Markov Chains: The Foundation of MCMC
- API (Application Programming Interface)
- is like a restaurant menu for data.
- Lesson 21 — APIs and Web Scraping
- Appendices
- Technical details, additional charts, validation metrics
- Lesson 1966 — Report Structure and Executive Summary
- Appendix or Technical Supplement
- Lesson 1947 — Handling Methodology and Technical Details
- Application Logic Burden
- Unlike foreign key constraints that enforce referential integrity automatically, you must manually keep denormalized data consistent through careful application code or database triggers.
- Lesson 1075 — Handling Data Consistency in Denormalized Schemas
- Apply a color scale
- where higher retention rates get warmer colors (red, orange) and lower rates get cooler colors (blue, green)
- Lesson 1649 — Visualizing Cohort Data with Heatmaps
- Apply conditional logic
- "If the first touch was organic search AND a demo was booked, give search 40%"
- Lesson 1731 — Custom Rule-Based Attribution
- Apply domain knowledge
- could this happen in reality?
- Lesson 1209 — Outlier Detection and Investigation
- Apply information criteria
- Calculate AIC and BIC to balance fit and complexity
- Lesson 633 — Practical Model Selection Strategy
- Apply insights
- Set warranty periods just beyond the steep part of the failure curve; flag high-risk product lines
- Lesson 837 — Product Warranty and Failure Analysis
- Apply intervention
- Only the treatment group sees the new feature, pricing, or campaign
- Lesson 1641 — Isolating Effects with Control Groups
- Apply removal effect
- Remove one channel completely, recalculate conversion probability
- Lesson 1733 — Markov Chain Attribution Models
- Apply the correction factor
- The `n/((n-1)(n-2))` part adjusts for sample size, making the estimate more accurate for smaller datasets.
- Lesson 65 — Calculating Skewness
- AR (AutoRegressive) - p
- Lesson 773 — Introduction to ARIMA: Components and Notation
- AR (autoregressive) processes
- and determining their order.
- Lesson 731 — PACF for AR Process Identification
- AR process
- PACF cuts off sharply; ACF decays gradually
- Lesson 731 — PACF for AR Process Identification
- AR(1)
- Only the first lag is significant; all others fall within the confidence bounds
- Lesson 731 — PACF for AR Process Identification; Lesson 774 — Autoregressive (AR) Models
- AR(2)
- First two lags are significant; lag 3 onward drops off
- Lesson 731 — PACF for AR Process Identification; Lesson 776 — Identifying AR Order (p) Using PACF
- AR(p)
- First *p* lags are significant, then cutoff
- Lesson 731 — PACF for AR Process Identification; Lesson 732 — PACF Patterns for Common Models; Lesson 774 — Autoregressive (AR) Models
- Architectural discussion
- Sharing skeleton code to validate design decisions
- Lesson 2029 — Draft Pull Requests and WIP Workflows
- Area
- Crime counts in neighborhoods of different sizes
- Lesson 692 — Offset Terms for Exposure; Lesson 1232 — Perceptual Accuracy Hierarchy; Lesson 1240 — Area and Volume Distortions
- Area or volume
- (acceptable since ratios are meaningful: "twice as much")
- Lesson 1238 — Matching Encoding to Data Type; Lesson 1240 — Area and Volume Distortions
- ARMA
- models combine both components, so their PACF shows **gradual decay** (influenced by the MA part) rather than a clean cutoff.
- Lesson 732 — PACF Patterns for Common Models
- ARPU
- (Average Revenue Per User) = Monthly Recurring Revenue / Number of Customers
- Lesson 1666 — LTV for Subscription Businesses
- ARR
- is MRR × 12, representing the annualized value of subscriptions.
- Lesson 1628 — SaaS Metrics: MRR, ARR, and Logo Churn
- Artists
- Everything visible on the plot—lines, text, patches, images—are "Artist" objects.
- Lesson 1255 — The Anatomy of a Matplotlib Figure
- Ask
- What happens if I reject H₀ when it's actually true?
- Lesson 334 — Setting Alpha: Choosing Your Significance Level
- Ask "Why" repeatedly
- Use the "Five Whys" technique.
- Lesson 2102 — Understanding Stakeholder Goals and Constraints
- Ask a Question
- Lesson 25 — The Scientific Method in Data Science
- Ask clarifying questions
- When told to "make it more accurate," probe what accuracy means in their context—speed?
- Lesson 2105 — Translating Between Technical and Business Language
- Ask questions, don't demand
- "Have you considered handling NaN values here?"
- Lesson 2024 — Code Review Best Practices
- Ask specific questions
- Lesson 1964 — Testing Visualizations with Audiences
- Assess completeness
- Are there known gaps, missing periods, or quality issues?
- Lesson 2098 — Identifying Data Availability Gaps Early
- Assess variance equality
- compare standard deviations or use Levene's test (not yet covered formally, but intuitive: do the spreads look similar?)
- Lesson 290 — Assumptions and Diagnostics for Difference Intervals
- Assigns new customers
- to the right segment as soon as they arrive
- Lesson 1710 — Operationalizing Segments: Scoring and Deployment
- Assumes Normal Distribution
- Z-scores interpret best when data follows a normal distribution.
- Lesson 201 — Z-Score Applications and Limitations
- Assumption testing
- Early scoping involves assumptions about what matters.
- Lesson 2109 — Why Data Science is Inherently Iterative
- Assumption Validation
- means checking whether your model's prerequisites are met.
- Lesson 2089 — Stage 5: Model Development and Validation
- Assumptions
- "Assumed all temperature readings are in Fahrenheit based on metadata; values outside -50°F to 150°F flagged as suspicious"
- Lesson 1162 — Documenting Transformations; Lesson 2100 — Documenting Assumptions and Open Questions
- Assumptions are severely violated
- extreme outliers dominate, variance explodes as X increases, or observations aren't independent
- Lesson 555 — When Regression Is and Isn't Appropriate
- Assumptions made
- Did you assume missing data was random?
- Lesson 1917 — Transparency in Analysis and Models
- Assumptions matter more
- Violations of homogeneity of variance become more problematic
- Lesson 468 — Balanced vs Unbalanced Designs
- Asymmetric
- Unlike the normal distribution, it's not symmetric around its mean
- Lesson 178 — Log-Normal Distribution: Definition and Properties
- Asymptotic
- (the tails approach but never touch zero—technically possible values extend infinitely in both directions)
- Lesson 169 — The Normal Distribution: Definition and Properties
- Asymptotic p-values
- rely on large-sample approximations (like the Central Limit Theorem).
- Lesson 322 — Exact vs Asymptotic P-Values
- at least one
- of the conditions.
- Lesson 867 — The OR Operator; Lesson 1501 — The Multiple Testing Problem
- at most
- or **greater than** a certain value using cumulative distribution functions.
- Lesson 143 — Cumulative Poisson Probabilities; Lesson 165 — Exponential Distribution: PDF and CDF; Lesson 275 — One-Sided Confidence Bounds
- At-Risk Customers
- Previously active users showing warning signs—declining usage, skipped payments, reduced session frequency, or negative support interactions.
- Lesson 1704 — Customer Lifecycle Stages
- Atomicity
- All operations in a transaction succeed or all fail—no partial completion
- Lesson 1110 — What Are Database Transactions?
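A minimal sketch of atomicity using Python's built-in `sqlite3` module. The `accounts` table, the account names, and the amounts are hypothetical and chosen only for illustration. The transfer credits one account and then tries to debit another; when the debit violates a constraint, the whole transaction is rolled back, so the credit disappears too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception is raised inside it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Alice only has 100, so the debit fails and the credit to Bob is undone.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

After the failed transfer both balances are unchanged, which is the "no partial completion" property described above.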
- ATT
- The Average Treatment effect on the Treated: the average effect of treatment *for those who actually received treatment*.
- Lesson 1451 — Estimating Treatment Effects from Matched Samples
- Attempted invalid insert
- Lesson 1056 — Foreign Key Constraints in Practice
- Attempted problematic delete
- Lesson 1056 — Foreign Key Constraints in Practice
- Attribute credit
- The difference in conversion probability represents that channel's contribution
- Lesson 1733 — Markov Chain Attribution Models
- Attribution
- You connect marketing spend to actual outcomes—which campaign drove that cohort with 60% Day-30 retention?
- Lesson 1711 — What Are Acquisition Channels?; Lesson 1736 — MMM vs Attribution: Key Differences; Lesson 1744 — Incrementality vs Attribution
- Attribution decay
- models how influence weakens over time.
- Lesson 1639 — Time Windows and Attribution Decay
- Audience Engagement
- Lesson 1292 — Introduction to Styling: Why Aesthetics Matter
- Audience-specific reports
- Executive summary vs technical deep-dive
- Lesson 1984 — Parameterized Reports
- Audit backups
- erasure applies there too (eventually)
- Lesson 1909 — Right to Erasure and Data Retention Policies
- Audit trail
- See who changed what, when, and why through commit messages
- Lesson 1990 — What is Version Control and Why Git?
- Audit trails
- Comply with regulations by tracking what data was used when
- Lesson 1871 — Why Version Control for Data?; Lesson 1925 — Mitigation Strategies and Responsible Disclosure
- Auditability
- Each run is logged and traceable
- Lesson 1986 — Automated Report Generation; Lesson 2123 — Simple Rules Beat Complex Models
- Auditing
- When stakeholders question your findings, you need to demonstrate data provenance.
- Lesson 2062 — Why Data Source Documentation Matters
- Augmented Dickey-Fuller (ADF) test
- on your transformed series.
- Lesson 741 — Testing Stationarity After Transformation
- Augmented Dickey-Fuller test
- gives you a rigorous, statistical answer.
- Lesson 716 — Augmented Dickey-Fuller Test
- Authentication Failures
- occur when your credentials are wrong or insufficient.
- Lesson 1093 — Troubleshooting Connection Issues
- Auto-correct
- known issues with logging (caution required)
- Lesson 1826 — Data Validation and Schema Enforcement
- Autocommit mode
- Each SQL statement is automatically committed (saved) immediately after it runs.
- Lesson 1111 — Autocommit Mode vs Explicit Transactions
- Autocorrelation
- (also called serial correlation) is the most common violation.
- Lesson 548 — Independence of Observations; Lesson 562 — Index Plots and Time-Ordered Residuals; Lesson 719 — What is Autocorrelation?; Lesson 720 — The Autocorrelation Function (ACF)
- Autocorrelation Function (ACF)
- takes this idea further by systematically calculating these relationships at multiple different lags.
- Lesson 720 — The Autocorrelation Function (ACF)
- Automate the process
- write scripts that loop through randomizations and check balance
- Lesson 1492 — Rerandomization and Practical Implementation
- Automated collection
- Setting up systems to continuously gather data
- Lesson 11 — Data Collection and Acquisition
- Automated validation frameworks
- solve this by letting you define expectations once and apply them consistently across datasets, pipelines, and time.
- Lesson 1158 — Automated Validation Frameworks
- Automatic Deduplication
- Duplicate rows are removed automatically
- Lesson 999 — UNION: Combining Distinct Results
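A short sketch contrasting `UNION` with `UNION ALL`, run through Python's built-in `sqlite3` module. The two customer tables and the email addresses are hypothetical. One address appears in both tables, so `UNION` returns it once while `UNION ALL` keeps both copies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE online_customers (email TEXT)")
conn.execute("CREATE TABLE store_customers (email TEXT)")
conn.executemany(
    "INSERT INTO online_customers VALUES (?)",
    [("a@example.com",), ("b@example.com",)],
)
conn.executemany(
    "INSERT INTO store_customers VALUES (?)",
    [("b@example.com",), ("c@example.com",)],
)

# UNION removes the duplicate row for b@example.com.
union_rows = conn.execute(
    "SELECT email FROM online_customers "
    "UNION SELECT email FROM store_customers ORDER BY email"
).fetchall()

# UNION ALL keeps every row from both inputs.
union_all_rows = conn.execute(
    "SELECT email FROM online_customers "
    "UNION ALL SELECT email FROM store_customers ORDER BY email"
).fetchall()

print(len(union_rows), len(union_all_rows))  # 3 4
```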
- Automatic derivatives
- Calculating gradients for optimization becomes straightforward
- Lesson 670 — Why Exponential Family Matters for GLMs
- Automating documentation
- means writing scripts that inspect your data and generate complete documentation automatically.
- Lesson 2067 — Automating Documentation with Code
- AutoRegressive Integrated Moving Average
- .
- Lesson 773 — Introduction to ARIMA: Components and Notation
- Availability
- Actual operating time ÷ planned production time (accounting for breakdowns, changeovers)
- Lesson 1636 — Manufacturing Metrics: OEE, Yield, and Cycle Time
- Average balance method
- Use `(Start + End) / 2` to account for growth
- Lesson 1671 — Churn Rate Calculation Methods
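A minimal sketch of the average balance method, assuming churn rate is defined as customers lost during the period divided by the average of the starting and ending customer counts. The function name and the figures are hypothetical.

```python
def churn_rate_average_balance(customers_start, customers_end, churned):
    """Churn rate using the average of start and end customer counts."""
    average_customers = (customers_start + customers_end) / 2
    if average_customers == 0:
        raise ValueError("average customer count must be positive")
    return churned / average_customers


# 1,000 customers at the start, 1,200 at the end, 55 lost during the period.
rate = churn_rate_average_balance(1000, 1200, 55)
print(f"{rate:.1%}")  # 5.0%
```

Dividing by the starting count alone would give 5.5% here; the averaged denominator reflects that the customer base grew during the period.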
- Average Order Value (AOV)
- Revenue divided by number of orders
- Lesson 1516 — Business Metrics: Definition and Examples; Lesson 1625 — Cross-Functional Metric Dependencies
- Average performers
- 25th to 75th percentile
- Lesson 61 — Using Percentiles for Comparison and Benchmarking
- Average Purchase Value
- is the mean revenue per transaction.
- Lesson 1663 — Simple LTV: Average Revenue Per Customer
- Average Treatment Effect (ATE)
- , which answers: "On average, how much did the treatment change the outcome compared to no treatment?"
- Lesson 1440 — Treatment Effect Estimation
- AVG
- , **MIN**, and **MAX**—together with **GROUP BY** to create rich summaries of grouped data.
- Lesson 892 — GROUP BY with Different Aggregate Functions; Lesson 894 — NULL Values in GROUP BY
- Avoid
- computing intermediate results you never use
- Lesson 1780 — Transformations vs Actions in Spark; Lesson 2073 — Naming Conventions for Files and Functions
- Avoid "security through obscurity"
- Don't assume hiding risks makes them disappear
- Lesson 1925 — Mitigation Strategies and Responsible Disclosure
- Avoid conditioning on colliders
- which would create spurious associations
- Lesson 1475 — Using DAGs to Guide Analysis
- Avoid conditioning on mediators
- on the causal path — which would block part of the effect you want to measure
- Lesson 1475 — Using DAGs to Guide Analysis
- Avoid extrapolation
- Don't use your model to predict Y values for X values far from your observed range
- Lesson 526 — When the Intercept Has No Meaning
- Avoid manipulation
- You've learned about truncated axes, area distortions, and cherry-picked ranges—these aren't just technical errors, they're ethical violations when done knowingly.
- Lesson 1247 — The Ethics of Visualization Design
- Avoid problematic pairs
- Red-green, blue-purple, and light green-yellow combinations are particularly troublesome.
- Lesson 1248 — Color Blindness and Color Palette Design
- Avoid redundant evaluations
- Don't call the same function multiple times within different WHEN clauses.
- Lesson 1037 — CASE Best Practices and Performance
- Avoid unnecessary CTEs
- If a simple subquery suffices and is clearer, use it
- Lesson 997 — CTE Best Practices and Performance
- Avoiding double-counting
- When your data has intentional duplicates but you need unique-value statistics
- Lesson 887 — Aggregates with DISTINCT
- Axis
- objects—the x-axis and y-axis with their tick marks, labels, and scales.
- Lesson 1255 — The Anatomy of a Matplotlib Figure
- Axis Limits
- Control what range of data appears using `set_xlim()` and `set_ylim()`.
- Lesson 1270 — Customizing Axes: Labels, Limits, and Scales
- Azimuth
- The horizontal rotation angle around your plot.
- Lesson 1326 — Viewing Angles and Projection Types
B
- b(θ)
- Lesson 665 — Canonical Form of Exponential Family DistributionsLesson 667 — Mean and Variance in the Exponential Family
- Backfilling corrupts data
- Re-processing historical data could add duplicate aggregations
- Lesson 1847 — What is Idempotency?
- Background geoms
- Large shapes, reference regions, or filled areas
- Lesson 1355 — Layer Order and Plot Composition
- Bad (chronological)
- "We collected transaction data from 2020-2024, cleaned 847 outliers, ran correlation analysis, built three models, and found churn is predicted by login frequency."
- Lesson 1942 — The Pyramid Principle: Starting with the Conclusion
- Bad (curved pattern)
- Suggests non-linear relationship; linear regression isn't appropriate
- Lesson 557 — The Residuals vs Fitted Values Plot
- Bad (funnel shape)
- Indicates heteroscedasticity; variance increases or decreases with fitted values
- Lesson 557 — The Residuals vs Fitted Values Plot
- Bad (outliers)
- Points far from the rest may be influential observations
- Lesson 557 — The Residuals vs Fitted Values Plot
- Balance
- means mixing high-volume/low-margin channels with low-volume/high-ROI ones
- Lesson 1716 — Channel Mix and Portfolio Thinking
- Balance Index Overhead
- Every index speeds reads but slows writes.
- Lesson 1086 — Index Maintenance and Monitoring
- Balance inference
- remember that rerandomization changes your p-values slightly (though often negligibly in practice)
- Lesson 1492 — Rerandomization and Practical Implementation
- Balance point
- The mean is the value where positive and negative distances from it cancel out perfectly
- Lesson 39 — The Mean (Arithmetic Average)
- Balance Tables
- Create side-by-side summaries showing mean (or proportion) of each covariate in treatment vs.
- Lesson 1491 — Covariate Balance and Diagnostics
- Balancing groups
- Good matches ensure treatment and control groups look similar *before* treatment
- Lesson 1445 — The Matching Framework
- bar charts
- to see frequency distributions.
- Lesson 1208 — Distribution Checks for All Variables; Lesson 1219 — Bar Charts and Column Charts; Lesson 1343 — Statistical Transformations; Lesson 1959 — Choosing Familiar Chart Types
- Bars
- (`geom_bar` or `geom_col`) showing magnitudes as vertical rectangles
- Lesson 1342 — Geometric Objects (geoms)
- Bartlett's Test
- is more **powerful** when your data is truly normal, but it's very sensitive to non-normality—it might reject equal variances simply because your data isn't perfectly bell-shaped, not because variances actually differ.
- Lesson 380 — Testing Equal Variances: Levene's and Bartlett's Tests
- base layer
- created by the `ggplot()` function.
- Lesson 1348 — The Base Layer: ggplot() and Data Mapping; Lesson 1355 — Layer Order and Plot Composition
- baseline
- to compare our data against.
- Lesson 307 — Defining the Null Hypothesis (H₀); Lesson 636 — The Reference Category; Lesson 642 — What is a Reference Category?
- Baseline variance
- Higher variability requires more data
- Lesson 1692 — Statistical Significance and Iteration
- Basemaps
- solve this by providing pre-rendered background images that give your audience familiar reference points—like roads, rivers, city names, and borders.
- Lesson 1314 — Basemaps and Map Tiles
- Basic execution example
- Lesson 2080 — Usage Examples and Running Your Code
- Batch
- (hours-to-days) permits scheduled ETL/ELT runs during off-peak hours.
- Lesson 1825 — Designing Pipeline Architecture
- Batch is ideal when
- Lesson 1824 — Batch vs Streaming Pipelines
- Batch pipelines
- work like a postal service—collect mail throughout the day, then deliver it all at scheduled times (hourly, daily, nightly).
- Lesson 1824 — Batch vs Streaming Pipelines
- Bayesian
- "There's a 95% probability the true conversion rate is between 12% and 18%."
- Lesson 1564 — Comparing Bayesian and Frequentist Proportion Inference
- Bayesian A/B testing
- treats the conversion rate as a random variable with a probability distribution.
- Lesson 1580 — Bayesian vs Frequentist A/B Testing
- Bayesian inference
- is the extension of this idea into a full statistical methodology.
- Lesson 116 — From Bayes' Theorem to Bayesian Inference
- Bayesian Information Criterion (BIC)
- is a model selection tool that helps you choose between competing regression models.
- Lesson 630 — Bayesian Information Criterion (BIC)
- Bayesian interpretation
- treats probability as a **degree of belief** or **quantification of uncertainty**.
- Lesson 1540 — Comparing Bayesian and Frequentist Interpretations
- Be honest about uncertainty
- "High confidence" or "preliminary estimate" builds trust without undermining your conclusion.
- Lesson 1944 — Executive Summary Best Practices
- Be selective
- Test only coefficients you care about based on theory, not all of them exploratorily.
- Lesson 624 — Multiple Testing Considerations
- Be specific
- Select only columns you need instead of `SELECT *`
- Lesson 880 — Performance Considerations and Best Practices; Lesson 1679 — Defining Funnel Steps and Events
- Be specific and actionable
- Instead of "this is confusing," try "Consider renaming `df2` to `customer_features` to clarify what this dataframe contains."
- Lesson 2024 — Code Review Best Practices
- Be specific and consistent
- Lesson 2073 — Naming Conventions for Files and Functions
- Bed utilization rate
- = (occupied bed-days / available bed-days) measures capacity efficiency.
- Lesson 1633 — Healthcare Metrics: Patient Outcomes and Operational Efficiency
- Before 3NF (redundant)
- Lesson 1066 — Third Normal Form (3NF)
- Before denormalizing
- Lesson 1077 — Measuring Performance Impact of Denormalization
- Before-After Measurements
- Lesson 369 — When to Use a Paired t-Test
- Behavior
- feature usage, purchase frequency, engagement level
- Lesson 1701 — What is Customer Segmentation?
- below
- the hypothesized median.
- Lesson 391 — The Sign Test for Medians; Lesson 567 — Common Q-Q Plot Patterns: Heavy Tails and Light Tails; Lesson 568 — Skewness in Q-Q Plots: Left and Right Deviations
- Below 1:1
- You're losing money on every customer.
- Lesson 1667 — LTV:CAC Ratio and Profitability; Lesson 1756 — LTV:CAC Ratio as a Health Metric
- Below maximum
- `WHERE value < (SELECT MAX(value) FROM table)`
- Lesson 964 — Subqueries with Aggregate Functions
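A small runnable version of the pattern above, using Python's built-in `sqlite3` module. The `measurements` table and its values are hypothetical stand-ins for the placeholder `table`. The scalar subquery computes the maximum once, and the outer query keeps every row strictly below it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value INTEGER)")
conn.executemany(
    "INSERT INTO measurements VALUES (?)", [(3,), (7,), (7,), (5,)]
)

rows = conn.execute(
    "SELECT value FROM measurements "
    "WHERE value < (SELECT MAX(value) FROM measurements) "
    "ORDER BY value"
).fetchall()
below_max = [row[0] for row in rows]
print(below_max)  # [3, 5]
```

Both rows holding the maximum (7) are excluded, since the comparison is strict.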
- Benchmarking salaries
- across companies while maintaining confidentiality
- Lesson 1903 — Secure Multi-Party Computation
- Benchmarks
- Compare against industry standards, baseline models, or competitors.
- Lesson 1939 — Context and Comparison: Making Numbers Meaningful; Lesson 1962 — Contextualizing Numbers
- Benjamini-Hochberg (BH) procedure
- takes a different approach.
- Lesson 1506 — Benjamini-Hochberg Procedure
- Benjamini-Hochberg (FDR)
- When you're exploring metrics and can tolerate some false positives
- Lesson 1507 — Multiple Testing in A/B Test Variations
- Bernoulli trial
- is a single experiment or observation that can result in exactly two outcomes: we call one outcome "success" and the other "failure."
- Lesson 123 — Bernoulli Trial Definition and Properties; Lesson 126 — From Bernoulli to Binomial: Multiple Trials
- Best practice
- Sort only when necessary for your analysis or presentation.
- Lesson 880 — Performance Considerations and Best Practices
- Best practices
- Group only by dimensions you truly need.
- Lesson 911 — Performance Considerations with Multiple Groups; Lesson 1995 — Committing Changes with git commit
- Best use cases
- Lesson 1727 — Linear Attribution Model
- Beta
- controls how quickly the **trend component** (upward or downward direction) updates.
- Lesson 769 — Smoothing Parameters: Alpha, Beta, Gamma
- beta posterior
- no complex integrals required.
- Lesson 1557 — The Beta-Binomial Model; Lesson 1579 — Practical Computation of Credible Intervals
- Beta-Binomial conjugate pair
- , your posterior is a Beta distribution: `Beta(α + successes, β + failures)`.
- Lesson 1562 — Credible Intervals for Proportions
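A minimal sketch of the conjugate update, assuming a `Beta(2, 8)` prior and a hypothetical experiment with 30 conversions out of 100 visitors. The posterior mean uses the standard Beta mean, α / (α + β).

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures


# Prior Beta(2, 8); hypothetical data: 30 successes, 70 failures.
post_alpha, post_beta = beta_binomial_update(2, 8, successes=30, failures=70)
posterior_mean = post_alpha / (post_alpha + post_beta)

print(post_alpha, post_beta)     # 32 78
print(round(posterior_mean, 3))  # 0.291
```

The posterior mean (about 0.291) sits between the prior mean (0.2) and the observed rate (0.3), pulled toward the data because the sample is much larger than the prior's implied sample size.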
- Beta-Binomial model
- (proportion problems), if your posterior is `Beta(α, β)`:
- Lesson 1561 — Posterior Mean and Mode
- Beta(2, 8) prior
- (you think a conversion rate is probably low).
- Lesson 1560 — Computing the Posterior Distribution
- Better analysis
- When controlling for smoking status, the correlation disappeared or even reversed, showing coffee might be protective.
- Lesson 1426 — Real-World Examples: Correlation vs Causation
- Better objective
- "Deliver a seamless first-time user experience by Q2"
- Lesson 1609 — Setting Effective Objectives
- between
- events (union → use addition)?
- Lesson 91 — Combining Rules in Multi-Step Problems; Lesson 441 — Sum of Squares: Total, Between, and Within
- Between Groups
- (or "Treatment"): Variation explained by differences among your group means
- Lesson 444 — The ANOVA Table
- Between-group variance (numerator)
- Measures how spread out the group means are from the overall mean
- Lesson 440 — The F-Statistic and Its Distribution
- Betweenness centrality
- How often a node lies on shortest paths between others (the "bridges")
- Lesson 1320 — Network Metrics and Visual Analysis
- Beware these traps
- Lesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU)
- BI
- You need regular reports on key business metrics
- Lesson 4 — Data Science vs Data Analytics vs Business Intelligence
- Bias
- It ignores everything that happened after the first click—potentially undervaluing nurture campaigns and retargeting that actually closed the deal.
- Lesson 1723 — Comparing Single-Touch Models
- Bias and noise
- Sensor errors, bot traffic, or sampling issues
- Lesson 1762 — Extended Dimensions: Veracity and Value
- Bias Correction
- If your bootstrap distribution is systematically shifted from the sample statistic, BCa corrects for this
- Lesson 304 — BCa Bootstrap Intervals: Bias Correction
- biased
- .
- Lesson 552 — Zero Conditional Mean of Errors; Lesson 553 — Exogeneity: X Must Be Independent of Errors; Lesson 554 — Consequences of Violating Assumptions
- Biased assignment
- Certain user types might be systematically excluded or included
- Lesson 1524 — Sample Ratio Mismatch (SRM)
- BIC
- but you cannot use the Partial F-Test
- Lesson 626 — Nested vs Non-Nested Models; Lesson 660 — Choosing the Polynomial Degree; Lesson 700 — AIC and BIC for Model Selection; Lesson 781 — Information Criteria: AIC and BIC; Lesson 785 — Information Criteria: AIC and BIC
- BIC (Bayesian Information Criterion)
- are scores that penalize models for using too many parameters while rewarding good fit to the data.
- Lesson 781 — Information Criteria: AIC and BIC
- Big Compute
- problems occur when the calculations themselves are expensive, even with modest data sizes.
- Lesson 1765 — Big Data vs Big Compute
- Big Data
- problems arise when you have so much data that it won't fit in memory or takes too long to read/write.
- Lesson 1765 — Big Data vs Big Compute
- Biggest impact
- Which step affects the most users?
- Lesson 1685 — Actionable Insights from Funnel Analysis
- BigQuery
- Serverless model; Google manages all infrastructure automatically
- Lesson 1813 — Modern Cloud Data Warehouses: Snowflake, BigQuery, Redshift
- Bimodal
- Two distinct peaks, suggesting two subgroups (e.
- Lesson 1175 — Histograms for Distribution Shape
- Bin data
- `stat_bin()` aggregates continuous data into intervals
- Lesson 1352 — Statistical Transformations with stat_* Layers
- Binary assets
- that must be versioned with code
- Lesson 2033 — Git Large File Storage (LFS) for Data Assets
- Binary or semi-structured
- Git can't show meaningful diffs, so every change duplicates the entire file
- Lesson 2070 — Separating Data from Code
- Binary outcomes
- Success or failure
- Lesson 131 — Real-World Applications of Binomial Distributions; Lesson 435 — McNemar's Test: Paired Categorical Data; Lesson 678 — Choosing the Right Link Function
- Binning matters
- Too few bins and you miss important details; too many bins and you see noise instead of pattern.
- Lesson 1220 — Histograms for Continuous Distributions
- Binning problems
- Lesson 1245 — Misleading Aggregations and Binning
- Binomial
- tracks "k successes in n trials with probability p each"
- Lesson 142 — Poisson as Limit of Binomial; Lesson 154 — Real-World Use Cases: Customer Behavior and Events; Lesson 664 — What is the Exponential Family of Distributions?
- binomial distribution
- enters the picture.
- Lesson 126 — From Bernoulli to Binomial: Multiple Trials; Lesson 127 — Binomial Distribution PMF; Lesson 153 — Real-World Use Cases: Quality Control and Defects; Lesson 154 — Real-World Use Cases: Customer Behavior and Events; Lesson 669 — The Dispersion Parameter φ
- Biological Gradient
- Is there a dose-response relationship?
- Lesson 498 — Bradford Hill Criteria for Causation
- Blended CAC
- weighted average across all channels
- Lesson 1716 — Channel Mix and Portfolio Thinking; Lesson 1754 — Blended CAC vs Paid CAC
- Block randomization
- divides the assignment process into small "blocks" of fixed size, ensuring balance within each block.
- Lesson 1488 — Block Randomization
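A minimal sketch of block randomization, assuming two arms and a block size of four. The function name, arm labels, and seed are illustrative choices. Each block contains every arm equally often and is shuffled independently, so the groups can never drift far out of balance.

```python
import random


def block_randomize(n_participants, block_size=4,
                    arms=("treatment", "control"), seed=0):
    """Assign participants to arms in shuffled, balanced blocks."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        # Build one balanced block, then shuffle the order within it.
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        assignments.extend(block)
    # The final block may be cut short if n_participants is not a multiple
    # of block_size, which can leave a small imbalance at the end.
    return assignments[:n_participants]


assignments = block_randomize(12)
print(assignments.count("treatment"), assignments.count("control"))  # 6 6
```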
- Bonferroni
- Divide your α by the number of tests (conservative, appropriate for critical decisions)
- Lesson 1507 — Multiple Testing in A/B Test Variations; Lesson 1508 — Pre-Registration and Correction Strategy
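A short sketch of the Bonferroni rule, assuming four hypothetical p-values and an overall α of 0.05. Each test is compared against α divided by the number of tests.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one reject/keep decision per test under Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


p_values = [0.001, 0.012, 0.030, 0.200]
decisions = bonferroni_reject(p_values)
print(decisions)  # [True, True, False, False]
```

The third test (p = 0.030) would pass an uncorrected 0.05 cutoff but fails the corrected threshold of 0.0125, which illustrates why the method is described as conservative.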
- Bonferroni Correction
- is a conservative, straightforward method to control this risk.
- Lesson 458 — Bonferroni Correction; Lesson 512 — Testing Significance in Correlation Matrices; Lesson 624 — Multiple Testing Considerations; Lesson 824 — Multiple Group Comparisons
- Boolean logic
- every condition produces either TRUE or FALSE.
- Lesson 865 — Introduction to Logical Operators in SQL
- bootstrap distribution
- shows you the range and variability of what your estimate might be, forming the foundation for confidence intervals.
- Lesson 299 — How Bootstrap Resampling Works; Lesson 300 — Bootstrap Distribution of a Statistic
- Bootstrap methods
- work by resampling *with replacement* from your observed data thousands of times.
- Lesson 291 — Non-Parametric Alternatives for Difference Intervals
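A minimal sketch of a percentile bootstrap interval using only the standard library. The sample values, the choice of the median as the statistic, the number of resamples, and the seed are all illustrative assumptions. Each resample draws `n` observations with replacement, the statistic is recomputed, and the interval is read from the sorted estimates.

```python
import random
import statistics


def bootstrap_percentile_ci(data, stat=statistics.median,
                            n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # rng.choices samples with replacement, which is what the bootstrap needs.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lower_idx = round((1 - level) / 2 * n_resamples)
    upper_idx = round((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_idx], estimates[upper_idx]


sample = [12, 15, 9, 21, 14, 18, 11, 16, 13, 17]
low, high = bootstrap_percentile_ci(sample)
print(low, high)
```

The percentile method is the simplest variant; it is shown here only as one possible approach, not as the specific procedure the lesson uses.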
- Bootstrapping
- Resampling methods generate different samples
- Lesson 2055 — Why Randomness Matters in Data Science
- Boston's coefficient
- doesn't appear (it's built into the intercept)
- Lesson 643 — Interpreting Coefficients Relative to Reference
- both
- domains.
- Lesson 7 — The Data Science Skill Stack; Lesson 87 — Multiplication Rule for Independent Events; Lesson 276 — Sampling Distribution of a Proportion; Lesson 570 — Q-Q Plots vs Formal Normality Tests: When Visual Checks Matter; Lesson 688 — Effect Size and Practical Significance; Lesson 939 — FULL OUTER JOIN with Multiple Conditions; Lesson 1001 — INTERSECT: Finding Common Rows
- Both say non-stationary
- Apply differencing or detrending.
- Lesson 718 — Interpreting Stationarity Test Results
- Both say stationary
- Proceed with modeling—no transformation needed.
- Lesson 718 — Interpreting Stationarity Test Results
- Both transformed
- With `log(Y) = β₀ + β₁log(X)`, β₁ becomes an *elasticity*—the percent change in Y per 1% change in X.
- Lesson 594 — Interpreting Models After Transformation
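A small sketch of the elasticity reading, assuming synthetic data generated from an exact power law (the constants 3 and 1.5 are made up for illustration). Fitting a least-squares line to `log(Y)` against `log(X)` should recover the exponent as the slope β₁.

```python
import math


def log_log_slope(xs, ys):
    """Least-squares slope of log(y) regressed on log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    s_xx = sum((a - mean_x) ** 2 for a in log_x)
    return s_xy / s_xx


# Synthetic data following Y = 3 * X^1.5 exactly.
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 1.5 for x in xs]
elasticity = log_log_slope(xs, ys)
print(round(elasticity, 6))  # 1.5
```

Under this fit, a 1% increase in X goes with roughly a 1.5% increase in Y.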
- Bottom layers
- Technical details, methodology, data sources—available as appendices if questioned.
- Lesson 1952 — The Pyramid Principle: Leading with Conclusions
- box plot
- (or box-and-whisker plot) turns the five-number summary into a visual.
- Lesson 59 — The Five-Number Summary and Box Plots; Lesson 70 — Visual Methods: Box Plots and Scatter Plots; Lesson 1176 — Box Plots for Spread and Outliers; Lesson 1268 — Box Plots and Violin Plots
- Box plots
- show the distribution through **five key numbers**: the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
- Lesson 1223 — Box Plots and Violin Plots; Lesson 1268 — Box Plots and Violin Plots; Lesson 1343 — Statistical Transformations
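A minimal sketch of the five numbers behind a box plot, using the standard library. Two assumptions are worth flagging: quartile conventions differ between tools (this uses Python's `"inclusive"` method), and many plotting libraries draw whiskers to the most extreme points within 1.5 × IQR rather than to the raw minimum and maximum.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)  # (1, 3.0, 5.0, 7.0, 9)
```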
- Box-Cox transformation
- solves this by testing a *family* of power transformations, controlled by a single parameter called **lambda (λ)**.
- Lesson 214 — Box-Cox Transformation; Lesson 593 — Box-Cox Transformation
- boxplot
- draws a box from the first quartile (Q1) to the third quartile (Q3), with a line at the median.
- Lesson 55 — Visualizing Spread; Lesson 1285 — Categorical Plots: stripplot, swarmplot, boxplot
- Boy Scout Rule
- Leave code slightly cleaner than you found it whenever you touch it
- Lesson 2137 — Refactoring Strategies and Debt Paydown
- Branching logic
- After analyzing a dataset, you might trigger different validation pipelines depending on data quality scores or record counts.
- Lesson 1844 — Dynamic Dependencies
- Brand awareness campaigns
- where every impression matters similarly
- Lesson 1727 — Linear Attribution Model
- Brand awareness efforts
- Which channels are best at introducing new prospects?
- Lesson 1720 — First-Touch Attribution Model
- Breadth
- means splitting into many parallel branches at fewer levels.
- Lesson 1623 — Depth vs Breadth in Metric Trees
- break-even ROAS
- (the minimum ROAS needed to cover all costs) is critical.
- Lesson 1751 — Return on Ad Spend (ROAS): Definition and Calculation; Lesson 1752 — Target ROAS and Break-Even Analysis
- Breaking changes
- (renaming, deleting columns, changing data types) require careful handling:
- Lesson 1876 — Schema Evolution and Backwards Compatibility
- Breaking it down
- Lesson 2017 — Understanding Merge Conflicts
- Breaking point
- Above 10-20 GB (or ~50% of available RAM), Pandas becomes unreliable or crashes
- Lesson 1783 — Data Size Thresholds: When Pandas Isn't Enough
- Brief Mention with Signpost
- Lesson 1947 — Handling Methodology and Technical Details
- Bright spots
- Anomalously high retention cohorts teach you what worked
- Lesson 1649 — Visualizing Cohort Data with Heatmaps
- Broken LTV:CAC ratio
- If churn is too high, you may never recover acquisition costs
- Lesson 1670 — What is Churn and Why It Matters
- Bubble charts
- extend this by encoding a third numeric variable through the **size of each point (bubble)**.
- Lesson 1229 — Bubble Charts for Three Variables
- Budget optimization
- Shift resources to channels with real impact
- Lesson 1718 — Introduction to Marketing Attribution; Lesson 1742 — Budget Optimization Using MMM
- Bug fixes
- Create a `hotfix` branch to quickly patch issues
- Lesson 2005 — What are Branches and Why Use Them?
- Build a bootstrap distribution
- of the test statistic under H₀
- Lesson 396 — Bootstrap Hypothesis Testing
- Build backup slides
- with technical details for deep dives
- Lesson 1956 — Anticipating and Addressing Audience Questions
- Build comprehensive models
- Capture the full story your data tells
- Lesson 1190 — Introduction to Multivariate Analysis
- Build the transition graph
- Map all observed customer journeys as state transitions (e.
- Lesson 1733 — Markov Chain Attribution Models
- Build trust incrementally
- Regular check-ins demonstrate progress and keep stakeholders engaged.
- Lesson 2111 — Fast Feedback Loops with Stakeholders
- Building confidence intervals
- using standard formulas
- Lesson 202 — Why Test for Normality?; Lesson 265 — Using Standard Error in Practice
- Built-in transformations
- (ggplot2, some Seaborn):
- Lesson 1373 — Statistical Transformations: Built-in vs Manual
- Burden of proof
- The prosecution must prove guilt beyond reasonable doubt
- Lesson 312 — Hypothesis Testing as a Legal Analogy
- Burn-in
- refers to discarding the first portion of your MCMC samples—typically the first 10-50% of iterations.
- Lesson 1592 — Burn-in, Thinning, and Convergence Diagnostics
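A minimal sketch of discarding burn-in, assuming the chain is held in a plain Python list and that 20% is an acceptable fraction to drop (the right fraction depends on the sampler and should be checked with convergence diagnostics). The `chain` here is a stand-in sequence, not real MCMC output.

```python
def discard_burn_in(samples, burn_in_fraction=0.2):
    """Drop the first portion of a chain and return the remaining draws."""
    if not 0 <= burn_in_fraction < 1:
        raise ValueError("burn_in_fraction must be in [0, 1)")
    n_discard = int(len(samples) * burn_in_fraction)
    return samples[n_discard:]


chain = list(range(1000))  # placeholder for 1,000 MCMC draws
kept = discard_burn_in(chain, burn_in_fraction=0.2)
print(len(kept), kept[0])  # 800 200
```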
- Business → Technical
- When a stakeholder says "We need to reduce customer churn," you translate this into: "Build a classification model predicting 30-day cancellation probability, optimized for recall since false negatives cost more than false positives, using histori...
- Lesson 2105 — Translating Between Technical and Business Language
- Business decisions can't wait
- Supply chain adjustments based on current demand patterns
- Lesson 1788 — Streaming Data and Real-Time Requirements
- Business documentation
- Process flows, compliance rules, product specs
- Lesson 1201 — Domain Knowledge as a Hypothesis Source
- Business impact
- Would a 0.
- Lesson 1480 — Minimum Detectable Effect (MDE); Lesson 1858 — Alerting Strategies; Lesson 2141 — Building a Portfolio and Personal Brand
- Business implication
- If you're a pool safety company, don't target ice cream shops for partnerships based on this correlation—focus on seasonal weather patterns instead.
- Lesson 1426 — Real-World Examples: Correlation vs Causation
- Business Intelligence (BI) professional
- creates a dashboard showing last quarter's sales by region
- Lesson 4 — Data Science vs Data Analytics vs Business Intelligence
- Business KPIs
- Sales, transactions, or user activity with weekly or monthly seasonality
- Lesson 1411 — Applications and Limitations
- Business logic violations
- Withdrawing more money than available
- Lesson 1109 — Input Validation and Defense in Depth
- Business metrics
- A handful of products generate most revenue
- Lesson 191 — Pareto Principle and the 80/20 Rule; Lesson 1522 — Balancing Speed and Accuracy in Metric Selection
- Business needs evolve
- What mattered last quarter might not matter now
- Lesson 15 — Deployment, Monitoring, and Iteration
- Business processes
- How does data flow through the organization?
- Lesson 1168 — Understanding Domain Context
- Business relevance
- It forces you to ask: "What size of change would actually move the needle for our business?"
- Lesson 1494 — Effect Size: The Minimum Detectable Effect
- Business requirements
- Must results be explainable to non-technical stakeholders?
- Lesson 1169 — Clarifying Assumptions and Constraints
- Business rule checks
- Lesson 1211 — Domain Validation and Sanity Checks
- Business strategy shifts
- If your company pivots from growth-at-all-costs to sustainable profitability, your North Star metric and its supporting branches must change.
- Lesson 1626 — Maintaining and Evolving Metric Trees
- Business Understanding
- Knowing how organizations actually work helps you focus on problems that matter, not just technically interesting puzzles.
- Lesson 7 — The Data Science Skill Stack
- Business Value Side
- Lesson 2118 — Cost-Benefit Analysis for Continued Work
- Business-Friendly Labels
- Instead of "Cluster 3," assign meaningful names like "High-Value Loyalists," "At-Risk Champions," or "New Bargain Hunters."
- Lesson 1709 — Segment Profiling and Interpretation
- Busy executives
- get the answer immediately
- Lesson 1942 — The Pyramid Principle: Starting with the Conclusion
- By callable function
- Lesson 1801 — Column Selection and Usecols
- By cohort/channel
- Some channels may have better LTV:CAC but slower payback, affecting budget allocation
- Lesson 1757 — Payback Period: Definition and Importance
- by how much
- A confidence interval for the difference gives you a range of plausible values for the true difference between two population proportions.
- Lesson 412 — Confidence Interval for Difference; Lesson 1955 — Framing Insights in Business Language
C
- C(n-1, r-1)
- counts arrangements of those *r-1* successes
- Lesson 135 — The Negative Binomial Distribution: Waiting for r Successes
- C(n, k)
- The number of ways to choose k successes from n trials (called "n choose k" or the binomial coefficient)
- Lesson 127 — Binomial Distribution PMF
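A short sketch showing the binomial coefficient and where it sits in the binomial PMF, using `math.comb` from the standard library. The helper name `binomial_pmf` and the example numbers are illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


ways = comb(5, 2)               # ways to place 2 successes among 5 trials
prob = binomial_pmf(2, 5, 0.5)  # exactly 2 heads in 5 fair coin flips

print(ways)  # 10
print(prob)  # 0.3125
```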
- Caching
- means storing the results of expensive computations so you can reuse them instead of recalculating.
- Lesson 1337 — Dashboard Performance and Caching; Lesson 1782 — Spark Performance Basics: Partitions and Caching
- Calculate
- standardized residuals for each cell after your significant Chi-Squared test
- Lesson 428 — Post-Hoc Analysis and Residuals; Lesson 994 — CTEs for Simplifying Complex Joins
- Calculate cumulative contribution
- to the total metric
- Lesson 1698 — Power User Curves and Engagement Distribution
- Calculate difference
- Treatment effect = (Treatment metric) - (Control metric)
- Lesson 1641 — Isolating Effects with Control Groups
- Calculate differences
- Subtract the hypothesized median from each observation
- Lesson 391 — The Sign Test for Medians
- Calculate error metrics
- Compare predictions to actual values
- Lesson 790 — Out-of-Sample Forecast Evaluation
- Calculate incrementality
- Lift = (Test performance - Expected baseline) ÷ Expected baseline
- Lesson 1746 — Geo-Lift Experiments; Lesson 1747 — Ghost Ads and PSA Tests
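A minimal sketch of the lift formula above. The function name and the conversion counts are hypothetical.

```python
def incremental_lift(test_performance, expected_baseline):
    """Relative lift of the test result over the expected baseline."""
    if expected_baseline == 0:
        raise ValueError("expected_baseline must be non-zero")
    return (test_performance - expected_baseline) / expected_baseline


# Test regions produced 1,150 conversions against an expected 1,000.
lift = incremental_lift(1150, 1000)
print(f"{lift:.0%}")  # 15%
```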
- Calculate LTV per cohort
- by summing or projecting total revenue per customer in that group
- Lesson 1664 — Cohort-Based LTV Calculation
- Calculate paired differences
- (just like the paired t-test or Sign Test)
- Lesson 392 — Wilcoxon Signed-Rank Test
- Calculate probabilities
- Convert to the standard normal distribution (from your previous lesson) to find exact probabilities
- Lesson 195 — Z-Score Definition and Interpretation
- Calculate slope β₁
- Use the formula involving sums of products and squared deviations from the mean
- Lesson 522 — Implementing Least Squares from Scratch
- Calculate statistics
- like mean, median, min, max, std, or count for each group
- Lesson 1185 — Grouped Summary Statistics
- Calculate the business impact
- in dollars, time, or customers
- Lesson 1956 — Anticipating and Addressing Audience Questions
- Calculate the expected value
- E(X) = Σ [outcome × probability]
- Lesson 152 — Decision Making Under Uncertainty
- Calculate the F-Statistic
- Lesson 447 — Conducting One-Way ANOVA in Practice
- Calculate the p-value
- as the proportion of permuted statistics as extreme or more extreme than your observed value
- Lesson 395 — Permutation Tests for Means and Beyond
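The p-value step of a permutation test might look like the sketch below. It assumes a two-sided test on the difference in means and a fixed number of random shuffles; the function name, the default of 2,000 permutations, and the seed are illustrative choices rather than anything prescribed by the lesson.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Share of shuffled mean differences at least as extreme as the observed one."""
    rng = random.Random(seed)          # seeded so the sketch is repeatable
    observed = abs(mean(a) - mean(b))  # observed test statistic
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)            # reassign group labels at random
        diff = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:   # small tolerance for float comparison
            extreme += 1
    return extreme / n_permutations


p_separated = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3])
```

With clearly separated groups the proportion is small; with identical groups every shuffle is at least as extreme as the observed difference of zero.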
- Calculate the tail probabilities
- For 95%, that's (1 - 0.
- Lesson 1575 — Computing Equal-Tailed Credible Intervals
- Calculate the treatment effect
- within each stratum
- Lesson 1430 — Controlling for Confounders: Stratification
- Calculate the U statistic
- based on these rank sums
- Lesson 393 — Mann-Whitney U Test (Wilcoxon Rank-Sum)
- Calculate transition probabilities
- For each state, determine the likelihood of moving to the next state
- Lesson 1733 — Markov Chain Attribution Models
- Calculate your statistic
- (median, correlation, ratio, etc.)
- Lesson 306 — Bootstrap for Non-Standard Problems
- Calculate your test statistic
- (you learned this in lesson 316)
- Lesson 319 — Calculating P-Values from Test Statistics
- Calculate your Z-score
- from raw data (you've already learned this!)
- Lesson 198 — Using Z-Tables for Probability
- Calculated fields
- Storing computed values (like `order_total`) instead of recalculating from line items every time.
- Lesson 1071 — When to Denormalize: Performance Trade-offs
- Calculates the average rank
- for each group
- Lesson 471 — Kruskal-Wallis H Test: The Non-Parametric One-Way ANOVA
- Calculating date differences
- Lesson 1040 — Date Arithmetic and INTERVAL Operations
- Calculating differences
- Compare current vs previous values (sales growth, price changes)
- Lesson 1023 — Introduction to Window Functions: LAG and LEAD
- Calculating the posterior distribution
- means applying Bayes' theorem to compute exactly how probable each parameter value is, given both your starting assumptions and the observed data.
- Lesson 1545 — Calculating the Posterior Distribution
- Calculations
- Computing percentages or ratios using aggregates
- Lesson 967 — Subqueries in the SELECT Clause
- Calibration (Predictive Parity)
- Lesson 1887 — Defining Fairness in Data Science
- Caliper Matching
- adds a safety rule: only match if the propensity scores are within a maximum distance (the "caliper").
- Lesson 1448 — Propensity Score Matching Methods
- Call Centers
- A help desk receives 30 calls per day on average.
- Lesson 144 — Poisson Applications: Arrivals and Events
- Call-in polls
- Only passionate viewers with free time participate
- Lesson 246 — Volunteer and Self-Selection Bias
- Campaign A
- 70% chance of $50,000 profit, 30% chance of $0
- Lesson 152 — Decision Making Under Uncertainty
- Campaign B
- 40% chance of $100,000 profit, 60% chance of a $10,000 loss
- Lesson 152 — Decision Making Under Uncertainty
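The Campaign A and Campaign B entries can be compared with the expected-value formula E(X) = Σ [outcome × probability]. The sketch below uses the payoffs and probabilities listed in those two entries; the function and variable names are illustrative.

```python
def expected_value(outcomes):
    """Sum of outcome * probability over (outcome, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)


campaign_a = [(50_000, 0.70), (0, 0.30)]         # 70% chance of $50,000, else $0
campaign_b = [(100_000, 0.40), (-10_000, 0.60)]  # 40% chance of $100,000, else a $10,000 loss

ev_a = expected_value(campaign_a)  # 0.70 * 50,000 = 35,000
ev_b = expected_value(campaign_b)  # 0.40 * 100,000 - 0.60 * 10,000 = 34,000
```

On expected value alone Campaign A comes out slightly ahead, and it also carries no chance of a loss.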
- cannot
- compare unstandardized coefficients across predictors with different units:
- Lesson 605 — Units and Scaling of CoefficientsLesson 899 — HAVING vs WHERE: Key DifferencesLesson 1011 — Filtering on Window Function ResultsLesson 1574 — Credible Intervals vs Confidence IntervalsLesson 1906 — Legal Bases for Processing Personal DataLesson 1932 — Building Trust Through Transparency
- canonical link
- for binomial (binary) outcomes, meaning it naturally pairs with the exponential family representation of the binomial distribution.
- Lesson 673 — The Logit LinkLesson 676 — Canonical vs Non-Canonical LinksLesson 678 — Choosing the Right Link FunctionLesson 690 — The Poisson Distribution as a GLM
- Capture non-linear monotonic patterns
- (a curved upward trend still gets positive correlation)
- Lesson 486 — Spearman's Rank Correlation Coefficient
- Cardinality
- = number of unique values.
- Lesson 1080 — When to Create an IndexLesson 1083 — Index Selectivity and CardinalityLesson 1867 — Data Profiling and Monitoring
- Career advancement
- Publishing sensational findings (even if overstated) could boost your reputation.
- Lesson 1930 — Managing Conflicts of Interest
- Carryover effect
- Advertising impact persists and decays over time, like a drug slowly leaving your bloodstream
- Lesson 1739 — Adstock and Carryover Effects
- Cartesian product
- first—every row from `orders` paired with every row from `customers`—then filters it.
- Lesson 925 — INNER JOIN vs WHERE: Join Order MattersLesson 942 — Understanding CROSS JOIN Syntax and MechanicsLesson 943 — CROSS JOIN Results: Size and StructureLesson 955 — Avoiding Cartesian Products
- CASCADE
- automatically propagates the change to child records:
- Lesson 1054 — Cascading Actions: DELETE and UPDATELesson 1057 — ON DELETE and ON UPDATE Actions
- Case Studies
- simulate real problems: "How would you measure the success of a new feature?"
- Lesson 2142 — Interviewing: Technical and Behavioral Prep
- Cash-constrained companies
- prioritize rapid payback above all, even if it means higher CAC or lower ROAS, because they literally can't afford to wait.
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- categorical
- and **numerical variables**, making your **data cleaning and preparation** work much easier than dealing with messy external sources.
- Lesson 20 — Primary Data Sources: Databases and Data WarehousesLesson 634 — Categorical Variables in Regression
- categorical × categorical
- interaction captures whether the combined effect of two categories differs from their individual additive effects—like whether a specific treatment works differently depending on disease severity level.
- Lesson 687 — Categorical Predictors and Interactions in Logistic ModelsLesson 1182 — Choosing Analysis Methods by Variable Types
- categorical data
- or when testing **variance**.
- Lesson 315 — Common Test Statistics: Z, t, Chi-Square, and FLesson 426 — Assumptions and Sample Size RequirementsLesson 430 — Common Applications and Pitfalls
- Categorical plots
- Compare groups (box plots, violin plots, bar plots)
- Lesson 1281 — Introduction to Seaborn's Statistical Plots
- Categorical-to-Categorical
- Build contingency tables and apply association measures like Cramér's V or chi-square tests.
- Lesson 1210 — Relationship Exploration: Correlation and Association
- Category or product line
- if your queries consistently filter these
- Lesson 1812 — Partitioning and Clustering Strategies
- Causal Chain Mapping
- Lesson 1602 — Identifying Leading Indicators for Your Metrics
- Causal clarity
- Can you tie drop-off to specific friction (forms too long, unclear CTAs)?
- Lesson 1685 — Actionable Insights from Funnel Analysis
- Causal question
- Does traffic *cause* revenue, or do successful companies simply attract both?
- Lesson 1426 — Real-World Examples: Correlation vs Causation
- Causal reasoning
- Ask "what causes this feature to have predictive power?"
- Lesson 1883 — Protected Classes and Proxy Variables
- Causation
- means one variable *directly causes* changes in another.
- Lesson 1420 — Defining Correlation and Causation
- Cause must precede effect
- If A causes B, then A must happen before B.
- Lesson 1425 — Identifying Potential Causal Relationships
- CC-BY
- Requires attribution when data is used
- Lesson 2082 — Choosing a License for Data Science Projects
- CC-BY-SA
- Requires derivatives to use the same license (like GPL for data)
- Lesson 2082 — Choosing a License for Data Science Projects
- CC0 (Public Domain)
- Maximum openness, no restrictions
- Lesson 2082 — Choosing a License for Data Science Projects
- CDF
- For x in [a, b], F(x) = (x - a)/(b - a) — a straight line from 0 to 1
- Lesson 161 — The Continuous Uniform Distribution
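The uniform CDF above is simple enough to write out directly. In this sketch the clamping to 0 below a and to 1 above b is the standard extension outside [a, b]; the function name is an illustrative choice.

```python
def uniform_cdf(x: float, a: float, b: float) -> float:
    """F(x) for a continuous uniform distribution on [a, b]."""
    if b <= a:
        raise ValueError("require a < b")
    if x < a:
        return 0.0  # no probability mass below a
    if x > b:
        return 1.0  # all probability mass lies at or below b
    return (x - a) / (b - a)  # straight line from 0 to 1 across [a, b]


halfway = uniform_cdf(5, 0, 10)
```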
- Cell proportions
- divide each cell by the grand total, giving you joint probabilities like P(A and B).
- Lesson 98 — Conditional Probability with Tables
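As a rough illustration of cell proportions, the sketch below divides every cell of a small contingency table by the grand total. The counts are made up for the example and the function name is hypothetical.

```python
def cell_proportions(table):
    """Divide each cell by the grand total to get joint probabilities."""
    grand_total = sum(sum(row) for row in table)
    return [[cell / grand_total for cell in row] for row in table]


# Hypothetical 2x2 table of counts (rows: A / not A, columns: B / not B).
counts = [[10, 20],
          [30, 40]]
joint = cell_proportions(counts)  # joint[0][0] estimates P(A and B)
```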
- Cells
- Metrics like user count, retention rate, cumulative revenue, or conversion rate
- Lesson 1647 — Building a Cohort Table
- Censored observations
- subjects still "at risk" but whose outcome is unknown (they left the study, were lost to follow-up, or the study ended)
- Lesson 812 — Handling Event Times and CensoringLesson 839 — Time-to-Conversion in Marketing Funnels
- Censored observations contribute
- to the "at risk" count up until their censoring time, then they're removed from the calculation.
- Lesson 812 — Handling Event Times and Censoring
- censoring
- .
- Lesson 802 — What is Survival Analysis?Lesson 835 — Customer Churn Prediction with Survival Analysis
- Census data
- Does your sample reflect regional population proportions?
- Lesson 421 — Applications: Uniform, Genetic Ratios, and Distributions
- Center Line (CL)
- The process mean or target value
- Lesson 1396 — Introduction to Control ChartsLesson 1397 — Shewhart Control Chart BasicsLesson 1398 — Control Charts for Means (X-bar Charts)
- Centered around zero
- positive and negative deviations should balance out
- Lesson 709 — Irregular Component: Random Noise
- Centered moving averages
- use data points from *both* before and after the target time.
- Lesson 753 — Centered vs Trailing Moving Averages
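The difference between centered and trailing windows can be seen in a small sketch. It assumes an odd window length and marks positions without a full window as `None`; both function names are illustrative.

```python
def trailing_moving_average(values, window):
    """Average of the current point and the window-1 points before it."""
    out = [None] * len(values)
    for i in range(window - 1, len(values)):
        out[i] = sum(values[i - window + 1 : i + 1]) / window
    return out


def centered_moving_average(values, window):
    """Average of points on both sides of the target time (odd window only)."""
    if window % 2 == 0:
        raise ValueError("this sketch handles odd windows only")
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half : i + half + 1]) / window
    return out


series = [1, 2, 3, 4, 5]
trailing = trailing_moving_average(series, 3)
centered = centered_moving_average(series, 3)
```

The centered version cannot be computed for the most recent points, because it needs observations that come after the target time.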
- Centering
- solves this by transforming each predictor to have a mean of zero.
- Lesson 656 — Centering Variables in InteractionsLesson 661 — Centering Predictors for Polynomials
- Central Limit Theorem
- (which you'll learn later) shows that averages of many independent observations tend to be approximately normally distributed
- Lesson 169 — The Normal Distribution: Definition and PropertiesLesson 223 — Standard Error and the CLT
- Central Limit Theorem (CLT)
- is one of the most important results in statistics.
- Lesson 218 — What the Central Limit Theorem States
- Central tendency
- is the statistical concept of finding a single representative value that describes the "center" or "typical" value of a dataset.
- Lesson 38 — What is Central Tendency?Lesson 1172 — What is Univariate Analysis?Lesson 1220 — Histograms for Continuous Distributions
- Centralized storage with structure
- ensures documentation lives where everyone can find it.
- Lesson 2068 — Data Provenance Best Practices
- ceteris paribus
- (Latin for "other things being equal").
- Lesson 604 — Marginal Effects and Ceteris ParibusLesson 637 — Interpreting Dummy Variable Coefficients
- Change-point detection
- identifies moments in time when the statistical properties of your data fundamentally shift.
- Lesson 1412 — What is Change-Point Detection?
- Changes to be committed
- (staged): Files you've added to the staging area with `git add` but haven't committed yet.
- Lesson 1997 — Viewing Repository State with git status
- Changing spread
- The fluctuations get wider or narrower over time (violates constant variance)
- Lesson 715 — Visual Tests for Stationarity
- Channel concentration
- percentage of volume from top channel (lower is safer)
- Lesson 1716 — Channel Mix and Portfolio Thinking
- Chartjunk
- refers to anything in a visualization that doesn't represent data or support comprehension:
- Lesson 1246 — Visual Clutter and Chartjunk
- Check access permissions
- Can you actually query these databases or files?
- Lesson 2098 — Identifying Data Availability Gaps Early
- Check associations with outcome
- Does it also correlate with your dependent variable?
- Lesson 1429 — Identifying Confounders in Practice
- Check associations with treatment
- Does the potential confounder correlate with your independent variable?
- Lesson 1429 — Identifying Confounders in Practice
- Check assumptions
- using visual tools (histograms, Q-Q plots) and tests (Shapiro-Wilk, Levene's)
- Lesson 398 — Choosing Between Parametric and Non-Parametric TestsLesson 447 — Conducting One-Way ANOVA in PracticeLesson 542 — Computing Fitted Values and ResidualsLesson 633 — Practical Model Selection Strategy
- Check assumptions first
- Lesson 368 — Common Pitfalls and Best Practices
- Check connection parameters
- Validate host, port, database name, and connection string format
- Lesson 1093 — Troubleshooting Connection Issues
- Check context
- does it appear in a cluster of suspicious records?
- Lesson 1209 — Outlier Detection and Investigation
- Check contrast ratios
- between text/elements and backgrounds
- Lesson 1254 — Testing Visualizations for Accessibility
- Check covariate balance
- against your threshold
- Lesson 1492 — Rerandomization and Practical Implementation
- Check it
- Plot log-odds against each continuous predictor; look for straight-line patterns, not curves.
- Lesson 686 — Assumptions and Diagnostics in Logistic Regression
- Check normality
- Q-Q plots or Shapiro-Wilk test per group
- Lesson 290 — Assumptions and Diagnostics for Difference Intervals
- Check response patterns
- Low response rates (under 50%) often signal nonresponse bias.
- Lesson 250 — Strategies for Bias Detection and Mitigation
- Check result counts
- if you expect hundreds of rows but get millions, investigate immediately
- Lesson 955 — Avoiding Cartesian Products
- Check retention schedules
- what *must* you keep by law?
- Lesson 1909 — Right to Erasure and Data Retention Policies
- Check source documentation
- or file metadata when available
- Lesson 1135 — Detecting and Fixing Encoding Issues
- Check statistical significance
- Use t-tests and F-tests to identify meaningful predictors
- Lesson 633 — Practical Model Selection Strategy
- Check the lineage
- Use your pipeline's metadata to identify which upstream tables, files, or APIs fed into the problematic dataset
- Lesson 1870 — Root Cause Analysis for Quality Issues
- Checkout Started
- Lesson 1679 — Defining Funnel Steps and Events
- Checks each row
- in the main query to see if its column value matches *any* value from the subquery results
- Lesson 961 — IN Operator with Subqueries
- Cherry-picking time ranges
- means deliberately selecting start and end dates that support a preferred narrative while hiding inconvenient context.
- Lesson 1241 — Cherry-Picking Time Ranges
- chi-squared distribution
- is another special case.
- Lesson 182 — Special Cases: Exponential and Chi-SquaredLesson 254 — Sampling Distribution of the Sample VarianceLesson 628 — Likelihood Ratio TestsLesson 684 — Likelihood Ratio Tests for Model ComparisonLesson 699 — The Likelihood Ratio Test
- Chi-Squared test
- uses a large-sample approximation based on the chi-squared distribution, while **Fisher's Exact Test** calculates the exact probability by considering all possible table arrangements with the same row and column totals.
- Lesson 434 — Fisher's Exact vs Chi-Squared: When to Use Each
- Chi-Squared Test of Independence
- helps you answer questions like: "Is there a relationship between gender and product preference?"
- Lesson 422 — Introduction to Chi-Squared Test of Independence
- Children and minors
- They lack legal capacity and cognitive maturity to understand data implications
- Lesson 1918 — Special Populations and Vulnerable Groups
- Choose a meaningful baseline
- Lesson 645 — Changing the Reference Category
- Choose Dagster when
- You're managing complex data transformations, need strong guarantees about data quality, or want asset-centric workflows.
- Lesson 1839 — Alternative Orchestration Tools
- Choose exponential smoothing when
- Lesson 764 — Exponential Smoothing vs Moving Averages
- Choose intensity metrics
- Frequency (daily visits), depth (features used), or duration (session length)
- Lesson 1693 — Defining User Engagement
- Choose Kendall's Tau when
- Lesson 490 — Kendall's Tau vs Spearman's Rho
- Choose Luigi when
- You have simpler pipelines, want minimal infrastructure, or need quick prototyping without heavy tooling.
- Lesson 1839 — Alternative Orchestration Tools
- Choose moving averages when
- Lesson 764 — Exponential Smoothing vs Moving Averages
- Choose natural keys when
- Lesson 1050 — Choosing Effective Primary Keys
- Choose Prefect when
- You want rapid development, need dynamic pipelines, or prefer writing pure Python without Airflow's constraints.
- Lesson 1839 — Alternative Orchestration Tools
- Choose Spearman's Rho when
- Lesson 490 — Kendall's Tau vs Spearman's Rho
- Choose surrogate keys when
- Lesson 1050 — Choosing Effective Primary Keys
- Choosing measures
- Remember comparing mean, median, and mode?
- Lesson 63 — Understanding Distribution Shape
- Choosing references wisely
- pick a meaningful baseline for comparison
- Lesson 643 — Interpreting Coefficients Relative to Reference
- Choosing Weak Leading Indicators
- Lesson 1603 — Common Pitfalls in Indicator Selection
- Choosing α before analysis
- means deciding your threshold for rejecting the null hypothesis—typically 0.
- Lesson 329 — Choosing α Before Analysis
- Churn analysis
- measures the percentage who *stop using* your product in a given period (Week 1: 10% churned, Week 2: 8% churned).
- Lesson 1660 — Retention Curves vs Churn AnalysisLesson 1678 — What is Funnel Analysis?
- Churn prediction
- becomes more accurate when built separately for high-value versus low-value segments
- Lesson 1701 — What is Customer Segmentation?
- Churn Rate
- Percentage of customers who leave
- Lesson 1516 — Business Metrics: Definition and ExamplesLesson 1613 — Raw Counts vs. Rates and Ratios
- Churn reason
- (from attribution analysis): If they left due to missing features, notify them when those ship
- Lesson 1676 — Win-Back and Retention Strategies
- Churned Customers
- Those who've stopped paying, canceled subscriptions, or haven't engaged in your defined inactivity window.
- Lesson 1704 — Customer Lifecycle Stages
- City populations
- A few megacities dwarf most towns
- Lesson 190 — The Pareto Distribution: Heavy Tails and Power Laws
- City sizes
- A few megacities contain most urban population
- Lesson 191 — Pareto Principle and the 80/20 Rule
- Claim
- "Mobile users convert at lower rates than desktop users"
- Lesson 1946 — Supporting Your Claims with Evidence
- clarity
- (easy to understand), and **narrative** (answers "so what?").
- Lesson 1215 — Characteristics of Explanatory VisualizationsLesson 1973 — Report Review and Quality Checklist
- Classic retention
- User was active in that *exact* period
- Lesson 1648 — Cohort Retention RatesLesson 1654 — Classic vs Unbounded Retention
- Clean experimentation
- Test new packages without risking your system-wide installation
- Lesson 2039 — Virtual Environments: Concept and Benefits
- Cleaner pipelines
- Data arrives pre-formatted
- Lesson 1802 — Filtering During Read with dtype and Converters
- Clear
- "Achieve 95% recall on fraud cases while maintaining false positive rate below 2%"
- Lesson 2094 — Defining Success Metrics Upfront
- Clear dependencies
- on packages, data sources, and environments
- Lesson 1981 — What Makes a Report Reproducible?
- Clear metrics
- (not "satisfaction," but "NPS score")
- Lesson 2093 — Translating Business Questions into Analytical Questions
- Clear outputs before committing
- Use "Restart & Clear Output" before staging your notebook.
- Lesson 2030 — Version Control for Notebooks: Challenges and Solutions
- Clear problem statement
- What question did you answer?
- Lesson 2141 — Building a Portfolio and Personal Brand
- Click-through rates
- in digital marketing (proportion of clicks)
- Lesson 184 — Beta Distribution: Bounded Between 0 and 1
- Closeness centrality
- How quickly a node can reach all others (the "efficient communicators")
- Lesson 1320 — Network Metrics and Visual Analysis
- Cloud data warehouses
- (Snowflake, BigQuery, Redshift) providing scalable compute
- Lesson 1821 — Hybrid Approaches and Modern Data Stacks
- Cluster
- or **multistage sampling** concentrates your effort geographically.
- Lesson 243 — Choosing the Right Sampling MethodLesson 1481 — Unit of Randomization
- Cluster sampling
- is a technique where you divide your population into groups (called **clusters**), randomly select some of those clusters, and then survey all or some members within the chosen clusters.
- Lesson 237 — Cluster SamplingLesson 243 — Choosing the Right Sampling Method
- Clustered Data
- Students within the same classroom, patients from the same hospital, or measurements from the same family are often more similar to each other than to observations from different clusters.
- Lesson 381 — Independence Assumption and Its ViolationsLesson 548 — Independence of Observations
- Clustering
- Different groups have noticeably different spreads
- Lesson 559 — Detecting Heteroscedasticity (Non-Constant Variance)Lesson 1812 — Partitioning and Clustering Strategies
- Clustering coefficient
- measures how tightly a node's neighbors are connected to each other—like whether your friends also know each other.
- Lesson 1320 — Network Metrics and Visual Analysis
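One way to compute a local clustering coefficient by hand is sketched below. It assumes an undirected graph stored as a dictionary of neighbor sets; the tiny friendship graph and the function name are invented for illustration.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to check
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return links / (k * (k - 1) / 2)  # connected pairs / possible pairs


# Hypothetical undirected graph: ana, ben, and cy form a triangle; dee only knows ana.
graph = {
    "ana": {"ben", "cy", "dee"},
    "ben": {"ana", "cy"},
    "cy": {"ana", "ben"},
    "dee": {"ana"},
}
ana_score = local_clustering(graph, "ana")  # only the ben-cy pair is connected
ben_score = local_clustering(graph, "ben")  # ben's two friends know each other
```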
- clusters
- ), randomly select some of those clusters, and then survey all or some members within the chosen clusters.
- Lesson 237 — Cluster SamplingLesson 584 — Correlation Matrices for PredictorsLesson 1179 — Identifying Missing Values PatternsLesson 1189 — Detecting Nonlinear RelationshipsLesson 1222 — Scatter Plots for Relationships
- Clusters of high correlations
- reveal groups of variables that measure similar underlying concepts.
- Lesson 511 — Reading and Interpreting Correlation Matrices
- Clusters or trends
- Independence assumption might be violated
- Lesson 556 — What Are Residuals and Why Plot Them?
- Coarsen
- Temporarily bin continuous variables into meaningful categories (e.
- Lesson 1449 — Coarsened Exact Matching (CEM)
- Coarsened Exact Matching
- solves this through a clever three-step process:
- Lesson 1449 — Coarsened Exact Matching (CEM)
- Code and reproducibility
- Lesson 1971 — Appendices and Technical Details
- Code chunks
- Sections of R code enclosed in special delimiters that execute and display results
- Lesson 1983 — R Markdown for Dynamic Reports
- Code clarity
- `src/` contains reusable functions and scripts.
- Lesson 2032 — Organizing Repository Structure for Data Science
- Code contribution process
- Should contributors fork your repo?
- Lesson 2083 — Contributing Guidelines and Contact Information
- Code debt
- Copy-pasting notebook cells instead of writing reusable functions
- Lesson 2131 — What is Technical Debt in Data Science?
- Code Licenses
- (your scripts and algorithms):
- Lesson 2082 — Choosing a License for Data Science Projects
- Code management
- means tracking changes to your scripts and notebooks, usually with tools like version control systems.
- Lesson 29 — Code and Environment Management
- Code references
- Scripts or notebook cells that performed each transformation
- Lesson 2065 — Tracking Data Lineage
- Code review
- Share branches for review before merging into `main`
- Lesson 2005 — What are Branches and Why Use Them?
- Code review happens
- Team members examine your changes, spot bugs, suggest improvements, and ensure standards are met
- Lesson 2022 — Understanding Pull Requests
- Code standards
- What style guide do you follow (PEP 8)?
- Lesson 2083 — Contributing Guidelines and Contact Information
- Code versions
- Git commit hashes, script versions, or package versions
- Lesson 1988 — Embedding Data Lineage and Metadata
- Coefficient of Variation
- is your tool when comparing datasets with different units or scales (e.
- Lesson 54 — When to Use Each Measure
- Coefficient of Variation (CV)
- solves this by expressing variability as a *percentage of the mean*.
- Lesson 53 — Coefficient of Variation
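A minimal sketch of the coefficient of variation, assuming the sample standard deviation is used (the entry does not say whether sample or population is intended). The function name and the example data are illustrative.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m * 100  # assumption: sample standard deviation


cv_percent = coefficient_of_variation([10, 12, 14])  # stdev 2, mean 12
```

Because the result is unit-free, it can be compared across datasets measured on different scales.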
- Coefficient p-values
- Statistical significance of specific dummies shifts because you're testing different comparisons
- Lesson 647 — Impact on Model Results and Reporting
- Coefficient values
- Each dummy variable coefficient represents the difference from the reference, so new reference = new differences
- Lesson 647 — Impact on Model Results and Reporting
- Coffee
- → **Alertness** (coffee directly increases alertness)
- Lesson 1469 — Building a Simple Causal DAG
- Cohen's d
- for t-tests (difference between means in standard deviation units)
- Lesson 384 — What is Effect Size?
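Cohen's d for two independent groups can be sketched as below. The pooled standard deviation is assumed as the denominator, which is one common convention; the function name and the example data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Difference between group means in pooled standard deviation units."""
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


d = cohens_d([2, 4, 6], [1, 3, 5])  # means 4 and 3, pooled SD 2
```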
- Coherence
- Does the causal interpretation align with existing theory and evidence?
- Lesson 498 — Bradford Hill Criteria for CausationLesson 1563 — Sequential Updating with New Data
- Cohort analysis
- is a technique that divides users or customers into groups—called cohorts—based on a shared characteristic or experience within a defined time window.
- Lesson 1644 — What is Cohort Analysis?Lesson 1661 — What is Customer Lifetime Value (LTV)?Lesson 1678 — What is Funnel Analysis?Lesson 1701 — What is Customer Segmentation?Lesson 1715 — Comparing Channel Performance
- Cohort comparison
- Use log-rank tests to compare retention across pricing tiers or customer segments
- Lesson 838 — Subscription and Membership Duration Modeling
- Cohort-based payback analysis
- breaks down payback periods by customer segment (acquisition channel, geography, plan type, etc.)
- Lesson 1758 — Cohort-Based Payback Analysis
- Collaboration
- Multiple team members can work with consistent datasets
- Lesson 1871 — Why Version Control for Data?Lesson 1990 — What is Version Control and Why Git?Lesson 2005 — What are Branches and Why Use Them?Lesson 2047 — What is Dependency Management?Lesson 2062 — Why Data Source Documentation MattersLesson 2074 — Notebooks vs Scripts: When to Use EachLesson 2142 — Interviewing: Technical and Behavioral Prep
- Collaboration-friendly
- Everyone knows where to find things.
- Lesson 2032 — Organizing Repository Structure for Data Science
- Collaborative fraud detection
- across banks without sharing customer data
- Lesson 1903 — Secure Multi-Party Computation
- Collectively exhaustive
- Lesson 82 — Collectively Exhaustive EventsLesson 83 — Partitions of the Sample SpaceLesson 89 — The Complement Rule
- Collectively exhaustive events
- are a group of events whose union contains *every possible outcome* in the sample space; nothing is left out.
- Lesson 82 — Collectively Exhaustive Events
- College attended
- may proxy for race, class, and family wealth
- Lesson 1883 — Protected Classes and Proxy Variables
- Collibra
- , **Alation**, and **Apache Atlas** maintain centralized inventories of your data assets.
- Lesson 1164 — Tools for Lineage Tracking
- collider
- is a variable that sits at the convergence of two causal arrows.
- Lesson 1432 — Colliders and Bad ControlsLesson 1468 — Introduction to Directed Acyclic Graphs (DAGs)Lesson 1471 — Mediators and CollidersLesson 1473 — Conditioning on Colliders: Selection BiasLesson 1476 — Common DAG Patterns and Pitfalls
- Collinearity
- makes models unstable and coefficients hard to interpret
- Lesson 1197 — Identifying Variable Importance and Redundancy
- Color
- determines what color your line appears.
- Lesson 1258 — Customizing Lines: Colors, Styles, and MarkersLesson 1297 — Font Properties and Text StylingLesson 1341 — Data and Aesthetic MappingsLesson 1364 — Customizing Text Elements
- Color (hue)
- Different colors stand out immediately (red among blues)
- Lesson 1235 — Pre-Attentive AttributesLesson 1310 — Point Maps and Scatter Plots on Maps
- Color (intensity)
- Sequential scales for continuous variables (temperature, risk level)
- Lesson 1310 — Point Maps and Scatter Plots on Maps
- Color blindness simulators
- (like Coblis or Chrome DevTools) show your chart through the lens of deuteranopia, protanopia, or other color vision deficiencies
- Lesson 1254 — Testing Visualizations for Accessibility
- Color choices
- Ensure colorblind-friendly palettes and grayscale compatibility.
- Lesson 1369 — Publication-Ready Plot Styling
- Color encoding
- Use color to represent the third dimension on a 2D plot (like heatmaps)
- Lesson 1329 — Effective Use and Pitfalls of 3D VisualizationsLesson 1362 — When to Use Facets vs. Other Approaches
- Color mapping
- adds another dimension, using different hues or intensity to show groupings or continuous scales.
- Lesson 1265 — Scatter Plots: Relationships Between Variables
- Color Scales Matter Immensely
- Lesson 1309 — Choropleth Maps: Basics and Best Practices
- Color-coding
- in scatter plots can reveal when different groups show different trends
- Lesson 1195 — Interaction Effects Between Variables
- ColorBrewer palettes
- offer scientifically-designed color schemes for categorical, sequential, or diverging data:
- Lesson 1368 — Color Scales and Palettes
- column
- The value you want to retrieve from a future row
- Lesson 1025 — LEAD Function: Accessing Next Row ValuesLesson 1358 — facet_grid() for Two Variables
- Column charts
- arrange categories along the horizontal axis with vertical bars extending upward.
- Lesson 1219 — Bar Charts and Column Charts
- Column Count Must Match
- Both queries must return the same number of columns
- Lesson 999 — UNION: Combining Distinct Results
- Column Names
- The result uses column names from the first SELECT
- Lesson 999 — UNION: Combining Distinct ResultsLesson 1151 — Schema Validation
- Column proportions
- divide each cell by its column total.
- Lesson 98 — Conditional Probability with Tables
- Column Types
- map Python objects to SQL data types.
- Lesson 1121 — Column Types, Constraints, and Relationships
- column_name
- Which column's value to retrieve
- Lesson 1023 — Introduction to Window Functions: LAG and LEADLesson 1024 — LAG Function: Accessing Previous Row Values
- Columns (Fields/Attributes)
- Each column represents a specific property or feature.
- Lesson 843 — Relational Database Concepts
- Combine adjacent categories
- to increase expected counts
- Lesson 419 — Assumptions and Minimum Expected Frequencies
- Comfortable zone
- Datasets under 1-2 GB work smoothly in Pandas on typical machines
- Lesson 1783 — Data Size Thresholds: When Pandas Isn't Enough
- Command-line tools
- that accept parameters and integrate with schedulers
- Lesson 2074 — Notebooks vs Scripts: When to Use Each
- commit
- it to save your work.
- Lesson 1112 — Starting and Committing TransactionsLesson 1995 — Committing Changes with git commit
- Commit the merge
- with `git commit` (Git will provide a default merge commit message)
- Lesson 2011 — Resolving Merge Conflicts
- Commit thoughtfully
- Make atomic commits after completing logical units of work, not after every cell execution.
- Lesson 2030 — Version Control for Notebooks: Challenges and Solutions
- Common data sources
- and their quirks in that sector
- Lesson 2145 — Transitioning Between Industries and Domains
- Common Pattern
- Lesson 977 — Correlated Subqueries in WHERE ClausesLesson 978 — Correlated Subqueries in SELECT Clauses
- Common patterns
- Lesson 1017 — Moving Averages with Window FramesLesson 1033 — CASE with Aggregation Functions
- Common Table Expression (CTE)
- is a named temporary result set that you define at the beginning of a query using the `WITH` clause.
- Lesson 989 — What are Common Table Expressions (CTEs)?
- Common Time-to-Value metrics
- Lesson 1697 — Time-to-Value and Activation Metrics
- Common use cases
- Lesson 1838 — XComs and Passing Data Between Tasks
- Common violations
- Lesson 553 — Exogeneity: X Must Be Independent of Errors
- Communicate timeline risks early
- If you discover the analysis will take longer than expected, flag it immediately.
- Lesson 2099 — Aligning with Business Timelines and Decision Points
- Communicating results
- with stakeholders who benefit from narrative + code + visuals in one document
- Lesson 2074 — Notebooks vs Scripts: When to Use Each
- Communication
- You must explain complex findings to people who don't speak "data."
- Lesson 7 — The Data Science Skill Stack
- Communication bridge
- Owner translates technical nuances for business stakeholders
- Lesson 1619 — What is Metric Ownership?
- Communication Protocols
- Lesson 1643 — Building Attribution Frameworks
- Community channels
- Link to Slack, Discord, or discussion forums
- Lesson 2083 — Contributing Guidelines and Contact Information
- Community detection
- algorithms group nodes into clusters based on connection patterns, revealing natural subdivisions in the network.
- Lesson 1320 — Network Metrics and Visual Analysis
- Company Level
- Your North Star Metric becomes the top-level objective.
- Lesson 1608 — Connecting North Star Metrics to OKRs
- Compare
- Does P(A and B) equal P(A) × P(B)?
- Lesson 102 — Testing for IndependenceLesson 395 — Permutation Tests for Means and BeyondLesson 1185 — Grouped Summary StatisticsLesson 1353 — Position Adjustments: Dodge, Stack, and JitterLesson 1590 — The Metropolis-Hastings Algorithm
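The independence comparison can be sketched with exact fractions. The card-deck setup and the helper names (`prob`, `is_heart`, `is_king`) are assumptions chosen for illustration, not taken from the lesson.

```python
from fractions import Fraction

# Hypothetical example: draw one card from a standard 52-card deck.
# Ranks run 1..13 (13 = king); suits are Spades, Hearts, Diamonds, Clubs.
deck = [(rank, suit) for rank in range(1, 14) for suit in "SHDC"]


def prob(event):
    """Exact probability that a uniformly drawn card satisfies `event`."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_heart(card):
    return card[1] == "H"


def is_king(card):
    return card[0] == 13


p_a = prob(is_heart)
p_b = prob(is_king)
p_a_and_b = prob(lambda card: is_heart(card) and is_king(card))

# Independence holds exactly when P(A and B) equals P(A) * P(B).
independent = p_a_and_b == p_a * p_b
```

Using `Fraction` avoids floating-point rounding, so the equality check is exact rather than approximate.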
- Compare across cohorts
- to identify trends, improvements, or degradation
- Lesson 1664 — Cohort-Based LTV Calculation
- Compare across tables
- Filter rows in one table based on criteria from another
- Lesson 959 — Introduction to Subqueries in WHERE
- Compare apples to apples
- Compare January sales to July sales fairly
- Lesson 748 — Seasonally Adjusted Data
- Compare apples to oranges
- Compare test scores from different exams with different scales
- Lesson 195 — Z-Score Definition and Interpretation
- Compare cohorts instantly
- Did the January cohort retain better than February's?
- Lesson 1656 — Visualizing Retention Curves
- Compare costs
- Which error would cause more harm?
- Lesson 334 — Setting Alpha: Choosing Your Significance Level
- Compare effects
- across strata—if the relationship disappears or reverses, the confounder was key
- Lesson 1430 — Controlling for Confounders: Stratification
- Compare nested models
- Use partial F-tests when adding/removing specific variables
- Lesson 633 — Practical Model Selection Strategy
- Compare posteriors
- to see which hypothesis is most supported by the evidence.
- Lesson 113 — Multiple Hypotheses and Total ProbabilityLesson 1572 — Sensitivity Analysis and Prior Robustness
- Compare stratified analyses
- Calculate effects within each confounder level—are they consistent or wildly different?
- Lesson 1429 — Identifying Confounders in Practice
- Compare the smaller sum
- to critical values or compute a p-value
- Lesson 392 — Wilcoxon Signed-Rank Test
- Compare visually and numerically
- Do summary statistics (mean, variance, extreme values) of simulated data match your observed data?
- Lesson 1596 — Posterior Predictive Checks and Model Comparison
- Compare your observed statistic
- to this distribution to get a p-value
- Lesson 396 — Bootstrap Hypothesis Testing
- Comparing datasets
- Detect records in a source system missing from a target
- Lesson 1002 — EXCEPT: Finding Differences
- Comparing groups
- Use SE to gauge if observed differences are substantial
- Lesson 265 — Using Standard Error in Practice
- Comparing means across categories
- (like average sales by quarter)
- Lesson 1288 — Point Plots for Trend Visualization
- Comparing metrics
- Find records where one value exceeds another
- Lesson 947 — Self-Joins for Comparisons Within a Table
- Comparing models
- requires matching units (you can't directly compare slopes from different scales)
- Lesson 525 — Units and Scale in Interpretation
- Comparing multiple curves
- (different cohorts or product versions) reveals which changes improved stickiness
- Lesson 1653 — What are Retention Curves?
- Comparing Values Within Rows
- Lesson 948 — Self-Joins with Inequality Conditions
- compatible data types
- .
- Lesson 998 — Introduction to Set OperationsLesson 999 — UNION: Combining Distinct ResultsLesson 1001 — INTERSECT: Finding Common RowsLesson 1003 — Set Operation Requirements and Rules
- Competence
- Lesson 1913 — Elements of Valid Consent
- Complementary events
- save work when one tail is shorter.
- Lesson 130 — Calculating Binomial Probabilities
- Complementary probabilities
- Using P(A') = 1 - P(A) for efficiency
- Lesson 130 — Calculating Binomial Probabilities
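As a rough illustration of the complement shortcut, the sketch below computes the probability of at least one success both ways. The values `n = 10` and `p = 0.1` and the helper `binom_pmf` are assumptions for the example.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.1  # hypothetical trial count and success probability

# Direct route: add up P(X = 1) through P(X = 10), ten terms.
direct = sum(binom_pmf(k, n, p) for k in range(1, n + 1))

# Complement route: P(X >= 1) = 1 - P(X = 0), a single term.
via_complement = 1 - binom_pmf(0, n, p)
```

Both routes give the same probability; the complement simply replaces ten terms with one.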
- Complete rows
- If entire rows are missing, perhaps certain groups weren't measured
- Lesson 1179 — Identifying Missing Values Patterns
- completeness
- , **consistency**, **timeliness**, **validity**, and **uniqueness**.
- Lesson 1863 — Data Quality DimensionsLesson 1865 — Data Quality Checks in PipelinesLesson 1867 — Data Profiling and MonitoringLesson 1869 — Data Quality Metrics and SLAsLesson 1973 — Report Review and Quality ChecklistLesson 2086 — Stage 2: Data Acquisition and Assessment
- Completeness checks
- are your detective work for finding exactly where data is missing, how much is missing, and whether the missingness follows patterns.
- Lesson 1153 — Completeness Checks
- Completion Rate
- Percentage of content finished by viewers.
- Lesson 1635 — Media and Content Metrics: Watch Time and Content Performance
- Complex aggregations
- Multiple groupBy operations with window functions over large groups
- Lesson 1784 — Computation Complexity: Beyond Data Size
- Complex constraints
- Stan's type system handles parameter boundaries and transformations elegantly
- Lesson 1595 — Stan: High-Performance Bayesian Inference
- Complex models
- Multi-parameter models where conjugacy breaks down anyway
- Lesson 1556 — Choosing Between Conjugate and Non-Conjugate Priors
- Complex queries
- When you need multiple derived tables or nested subqueries
- Lesson 974 — When to Use FROM Subqueries vs CTEs
- Complexity costs
- Adding that tenth feature interaction makes your model unmaintainable
- Lesson 2116 — Diminishing Returns and the 80/20 Rule
- Complexity penalty
- A term that increases with the number of parameters (k)
- Lesson 629 — Akaike Information Criterion (AIC)
- Compliance
- Meet regulations like GDPR while still enabling data-driven work
- Lesson 1901 — Synthetic Data Generation
- Compliance and Legal Teams
- care about:
- Lesson 1951 — Understanding Stakeholder Priorities and Constraints
- Composite keys
- Multiple columns together, like `(order_id, product_id)`
- Lesson 1048 — What Are Primary Keys?
- Compositional changes
- occur when the *makeup* of your treatment or control groups changes over time.
- Lesson 1458 — Common DiD Pitfalls
- Compounding growth drag
- Even with strong acquisition, high churn prevents the compounding effects of a growing base
- Lesson 1670 — What is Churn and Why It Matters
- Comprehension
- Lesson 1913 — Elements of Valid Consent
- Compression
- Parquet (best) > Feather > CSV (gzip) > JSON > Excel
- Lesson 1133 — Performance Considerations Across FormatsLesson 1811 — Columnar Storage and Query Optimization
- Computational complexity
- (the number and cost of operations you perform) can make processing even modest-sized datasets painfully slow on a single machine.
- Lesson 1784 — Computation Complexity: Beyond Data Size
- Computational complexity increases
- Different methods (Type I, II, III sums of squares) can give different results
- Lesson 468 — Balanced vs Unbalanced Designs
- Computational efficiency
- Process data in manageable chunks
- Lesson 1538 — Updating Beliefs with Sequential Data
- Computational resources
- Can you process millions of rows or just thousands?
- Lesson 1169 — Clarifying Assumptions and Constraints
- Computational simplicity
- No need for sampling algorithms or numerical integration
- Lesson 1555 — Advantages and Limitations of Conjugate Priors
- Computationally efficient
- You only need the last forecast and the new observation
- Lesson 757 — Introduction to Exponential Smoothing
- Compute baseline conversion probability
- The chance a random user converts given the current channel mix
- Lesson 1733 — Markov Chain Attribution Models
- Compute means
- Find x̄ (mean of x values) and ȳ (mean of y values)
- Lesson 522 — Implementing Least Squares from Scratch
- Compute on encrypted values
- using special arithmetic that preserves secrecy
- Lesson 1903 — Secure Multi-Party Computation
- Compute summaries
- `stat_summary()` calculates means, medians, or custom functions
- Lesson 1352 — Statistical Transformations with stat_* Layers
- Computer Science & Programming
- Lesson 1 — Defining Data Science
- Concentration of values
- (wider sections = more data points)
- Lesson 1286 — Violin Plots and Distribution Shape
- Conclusion
- The die doesn't appear to follow a uniform distribution; it's likely biased
- Lesson 420 — Interpreting Chi-Squared Test ResultsLesson 733 — Using ACF and PACF Together
- Conclusion cells
- Summarize findings and recommendations
- Lesson 1982 — Literate Programming with Notebooks
- Conditional dependencies
- Some tools support dynamic dependency creation
- Lesson 1843 — Declaring Dependencies in Orchestration Tools
- Conditional distributions
- (e.
- Lesson 1187 — Contingency Tables and Cross-TabulationsLesson 1197 — Identifying Variable Importance and Redundancy
- Conditional probability
- captures exactly this: the probability of event A happening when we *already know* event B has occurred.
- Lesson 92 — Definition and Notation of Conditional ProbabilityLesson 96 — Conditional Probability in Tree Diagrams
- Conditional values
- Different logic per row based on related data
- Lesson 967 — Subqueries in the SELECT Clause
- Confidence
- Higher confidence (e.
- Lesson 295 — Trade-offs: Precision, Confidence, and CostLesson 1158 — Automated Validation Frameworks
- Confidence bands
- Usually shown as blue shaded regions or dashed lines (typically at ±2/√n).
- Lesson 722 — ACF Plots and Interpretation
- Confidence interval
- for the effect size (shows uncertainty)
- Lesson 389 — Reporting Effect Sizes in PracticeLesson 412 — Confidence Interval for DifferenceLesson 607 — Confidence Intervals for CoefficientsLesson 621 — Interpreting t-Statistics and Confidence Intervals
- Confidence intervals
- may be too narrow or too wide
- Lesson 202 — Why Test for Normality?Lesson 227 — Practical Applications of the CLTLesson 300 — Bootstrap Distribution of a StatisticLesson 462 — Interpreting and Reporting Post-Hoc ResultsLesson 625 — Practical Workflow: Testing and Interpreting PredictorsLesson 730 — Interpreting PACF PlotsLesson 800 — Generating Forecasts with SARIMALesson 815 — Survival Curve Plots and Interpretation (+7 more)
- Confidence level
- Higher confidence (e.
- Lesson 271 — Margin of ErrorLesson 289 — Sample Size Requirements for Difference IntervalsLesson 292 — Sample Size for Estimating a MeanLesson 294 — Margin of Error and Its Components
- Confirm Long-Term Trends
- Lesson 1598 — Characteristics of Lagging Indicators
- Confirmation bias
- Analyzing data only until it supports a desired conclusion
- Lesson 1926 — The Honest Broker Role
- Conflict (insight)
- What surprising or important pattern did you discover?
- Lesson 1933 — The Power of Narrative in Data Communication
- confounded
- .
- Lesson 1526 — Selection Bias in Opt-In TestsLesson 1531 — Interference from Concurrent Tests
- confounder
- appears as a node with arrows pointing to both treatment and outcome
- Lesson 1468 — Introduction to Directed Acyclic Graphs (DAGs)Lesson 1470 — Confounders in DAGsLesson 1476 — Common DAG Patterns and Pitfalls
- confounding variable
- (or confounder) is a third variable that influences both your variables of interest, creating a spurious (fake) correlation between them.
- Lesson 509 — Confounding Variables and ControlLesson 1194 — Simpson's Paradox and ConfoundingLesson 1423 — The Third Variable ProblemLesson 1426 — Real-World Examples: Correlation vs CausationLesson 1427 — What is a Confounding Variable?
- Confounding variables
- A hidden third factor causes both (like temperature above)
- Lesson 493 — The Fundamental Difference: Association vs Cause-and-EffectLesson 495 — Confounding VariablesLesson 510 — Correlation Matrices: Construction and DisplayLesson 1201 — Domain Knowledge as a Hypothesis SourceLesson 1487 — Simple Random Assignment
- Confusing logic
- Code must constantly check `type` to interpret what's valid
- Lesson 1148 — Handling Multiple Types in One Table
- Confusion
- New team members (or your future self) waste time trying to understand if old experiments are still relevant
- Lesson 2135 — Dead Experimental Code and Feature Sprawl
- conjugate prior
- is a prior distribution that, when combined with a specific likelihood function, produces a posterior distribution from the same probability family as the prior.
- Lesson 1550 — What Are Conjugate Priors?Lesson 1551 — Beta-Binomial Conjugacy
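One common instance is the Beta prior with a binomial likelihood, where the posterior is again a Beta distribution. The sketch below assumes a Beta(2, 2) prior and 7 successes in 10 trials; the function name and the numbers are illustrative only.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Return posterior Beta parameters after observing binomial data.

    A Beta(alpha, beta) prior combined with `successes` out of `trials`
    gives a Beta(alpha + successes, beta + failures) posterior.
    """
    failures = trials - successes
    return alpha + successes, beta + failures


# Hypothetical prior and data.
post_alpha, post_beta = beta_binomial_update(2, 2, successes=7, trials=10)

# Mean of a Beta(a, b) distribution is a / (a + b).
posterior_mean = post_alpha / (post_alpha + post_beta)
```

Because prior and posterior share a family, the update is just parameter arithmetic, with no sampling or numerical integration needed.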
- Connecting to objectives
- Every insight should tie back to the problem you scoped at the start.
- Lesson 2090 — Stage 6: Interpretation and Insight Generation
- Connection pooling
- is like a parking lot for database connections.
- Lesson 1092 — Connection Pooling Basics
- Connection to normal
- If *ln(X)* ~ Normal(μ, σ²), then *X* ~ Log-Normal
- Lesson 178 — Log-Normal Distribution: Definition and Properties
- Cons
- Stale data between refreshes, storage overhead, refresh time on large datasets
- Lesson 1076 — Materialized Views and Summary Tables
- Consecutive rankings
- for categorization (like price tiers: budget, mid-range, premium)
- Lesson 1009 — DENSE_RANK(): Ranking Without Gaps
- Conservative Estimates
- Lesson 297 — Handling Unknown Population Parameters
- Consider `nbdime` or `jupytext`
- Tools like `nbdime` provide notebook-aware diffs.
- Lesson 2030 — Version Control for Notebooks: Challenges and Solutions
- Consider adversarial users
- Who benefits from gaming your system?
- Lesson 1924 — Red Team Thinking for Data Scientists
- Consider d=2 cautiously
- If d=1 didn't work, try second-order differencing (differencing the already-differenced series).
- Lesson 778 — Determining Differencing Order (d)
- Consider JOINs instead
- Correlated subqueries can often be rewritten as LEFT JOINs with GROUP BY, executing more efficiently
- Lesson 969 — Performance Considerations for SELECT Subqueries
- Consider JOINs instead when
- You have many conditions (10+) or conditions change frequently.
- Lesson 1037 — CASE Best Practices and Performance
- Consider ramp-up periods
- Exclude the first few days from analysis
- Lesson 1525 — Novelty and Primacy Effects
- Consider robustness
- With n > 30-40, t-tests handle mild violations well (Central Limit Theorem)
- Lesson 398 — Choosing Between Parametric and Non-Parametric Tests
- Consider simpler alternatives
- regression with strong domain priors, expert-designed scoring systems, or rule-based logic.
- Lesson 2124 — Insufficient or Low-Quality Data
- Consider Transformation
- Lesson 579 — What to Do with Influential Points
- Consider UUID/GUID
- for distributed systems where different databases generate records independently
- Lesson 1050 — Choosing Effective Primary Keys
- Consider WHERE filters when
- You only need to include/exclude rows, not transform values.
- Lesson 1037 — CASE Best Practices and Performance
- Consistency
- Has the relationship been found repeatedly, across different studies, populations, and settings?
- Lesson 498 — Bradford Hill Criteria for CausationLesson 1110 — What Are Database Transactions?Lesson 1158 — Automated Validation FrameworksLesson 1822 — What is a Data Pipeline?Lesson 1863 — Data Quality DimensionsLesson 1865 — Data Quality Checks in PipelinesLesson 1986 — Automated Report GenerationLesson 2059 — Seeds in Train-Test Splits
- Consistency with benchmarks
- Does your entire interval fall in the "large effect" range, or does it span from "small" to "large"?
- Lesson 387 — Confidence Intervals for Effect Sizes
- Consistent analysis syntax
- Functions like `groupby()`, `pivot_table()`, and aggregation operations work the same way across different datasets.
- Lesson 1149 — Benefits of Tidy Data for Downstream Work
- Consistent spread
- The scatter shouldn't fan out or compress at one end
- Lesson 480 — Scatterplots and Visual Assessment
- constant
- across all trials
- Lesson 126 — From Bernoulli to Binomial: Multiple TrialsLesson 648 — What are Interaction Terms?
- Constant autocorrelation structure
- the relationship between observations at different time lags remains stable
- Lesson 712 — What is Stationarity?
- constant over time
- (proportional hazards)
- Lesson 823 — Log-Rank Test vs Other TestsLesson 825 — What is the Cox Proportional Hazards Model?
- Constant variance (homoscedasticity)
- Do residuals spread evenly across all predicted values, or do they fan out or compress?
- Lesson 544 — The Role of Residuals in Diagnostics
- Constraints
- Time limits, budget, available data, ethical considerations
- Lesson 10 — Problem Definition and ScopingLesson 1121 — Column Types, Constraints, and RelationshipsLesson 1151 — Schema Validation
- Consultation
- involving your Data Protection Officer and potentially data subjects
- Lesson 1910 — Data Protection Impact Assessments (DPIAs)
- Consume massive memory
- your database must store or stream millions of rows
- Lesson 943 — CROSS JOIN Results: Size and Structure
- Consume memory
- holding all unique combinations
- Lesson 911 — Performance Considerations with Multiple Groups
- Consumer Mobile Apps
- Lesson 1657 — Day-1, Day-7, Day-30 Benchmarks
- Contact information
- Who to reach with questions
- Lesson 1989 — Best Practices for Sharing Reproducible ReportsLesson 2063 — Essential Metadata to CaptureLesson 2091 — Stage 7: Communication and Handoff
- Contact/Contribution
- Who maintains this and how to get involved
- Lesson 2077 — The Purpose and Anatomy of a Good README
- Container tools
- that package code *and* environment together
- Lesson 29 — Code and Environment Management
- Content Acquisition Cost (CAC)
- Total spend (licensing or production) divided by content hours.
- Lesson 1635 — Media and Content Metrics: Watch Time and Content Performance
- Content Library Depth
- Number of titles and hours of available content.
- Lesson 1635 — Media and Content Metrics: Watch Time and Content Performance
- Content Platform
- Discover Content → Click → Watch/Read → Like/Share
- Lesson 1678 — What is Funnel Analysis?
- Content platforms
- Account creation or premium upgrade
- Lesson 1686 — Defining Conversions and Conversion Rate
- Context
- "Is this sample mean different from a known population mean?"
- Lesson 315 — Common Test Statistics: Z, t, Chi-Square, and FLesson 342 — Alpha Level Trade-offsLesson 1247 — The Ethics of Visualization Design
- Context expertise
- Owner knows when the metric is actionable vs.
- Lesson 1619 — What is Metric Ownership?
- Context matters
- Remember why you're testing.
- Lesson 210 — Combining Visual and Statistical MethodsLesson 1659 — Comparing Retention Across Cohorts
- Context-aware metrics
- Lesson 1691 — Mobile vs Desktop Conversion Analysis
- Contextual understanding
- Some intersections carry unique historical disadvantages that single-attribute analysis misses entirely
- Lesson 1893 — Intersectionality in Fairness
- contingency table
- (rows = one variable, columns = another)
- Lesson 422 — Introduction to Chi-Squared Test of IndependenceLesson 423 — Contingency Tables and Expected FrequenciesLesson 1187 — Contingency Tables and Cross-Tabulations
- Continue collecting data
- Lesson 1511 — Sequential Probability Ratio Test (SPRT)
- Continue the rebase
- Run `git rebase --continue` to move to the next commit
- Lesson 2018 — Resolving Conflicts During Rebase
- Continue with Warnings
- Lesson 1866 — Handling Failed Quality Checks
- Continuity
- The eye naturally follows smooth, continuous paths.
- Lesson 1236 — Gestalt Principles in Visualization
- Continuity correction
- For small counts (b + c < 25), use the corrected formula: χ² = (|b - c| - 1)² / (b + c)
- Lesson 436 — Conducting McNemar's Test
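The corrected formula above translates directly into a small function. Here `b` and `c` are the two discordant-pair counts; the function name and the example counts are assumptions for illustration.

```python
def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistic: (|b - c| - 1)^2 / (b + c)."""
    if b + c == 0:
        raise ValueError("No discordant pairs: the statistic is undefined.")
    return (abs(b - c) - 1) ** 2 / (b + c)


# Hypothetical discordant counts (b + c = 20, below the threshold of 25).
statistic = mcnemar_corrected(b=15, c=5)
```

The result is compared against a chi-squared distribution with 1 degree of freedom.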
- Continuous (water)
- You might have 250ml, or 250.
- Lesson 18 — Numerical Variables: Discrete and Continuous
- Continuous data
- mapping numeric ranges to positions or gradients
- Lesson 1344 — Scales and Coordinate Systems
- Continuous monitoring required
- IoT sensors tracking equipment failures need instant alerts
- Lesson 1788 — Streaming Data and Real-Time Requirements
- Continuous numerical data
- represents *measurements* that can take any value within a range, including decimals.
- Lesson 18 — Numerical Variables: Discrete and Continuous
- continuous positive values
- that tend to be skewed rather than symmetric.
- Lesson 183 — Applications of the Gamma DistributionLesson 678 — Choosing the Right Link Function
- Continuous predictors
- (like age, blood pressure, or income) take numerical values along a scale, while **categorical predictors** (like treatment group, gender, or risk category) represent distinct groups.
- Lesson 829 — Continuous and Categorical Predictors
- Continuous unbounded data
- The **identity link** (standard linear regression) is appropriate.
- Lesson 678 — Choosing the Right Link Function
- Contour plots
- Display 3D surfaces as 2D contour lines, like topographic maps
- Lesson 1329 — Effective Use and Pitfalls of 3D Visualizations
- Contracting funnel
- The opposite—wide on the left, narrow on the right.
- Lesson 559 — Detecting Heteroscedasticity (Non-Constant Variance)
- Contrast checkers
- verify that your text and visual elements meet minimum visibility standards
- Lesson 1254 — Testing Visualizations for Accessibility
- control
- (version A—the current state or baseline).
- Lesson 1477 — Core Principles of A/B TestingLesson 1482 — Control and Treatment Design
- Control backfills
- Re-running a task may require re-running its entire downstream chain
- Lesson 1841 — Upstream and Downstream Dependencies
- Control charts
- for process stability without strong seasonality
- Lesson 1411 — Applications and Limitations
- Control for confounders
- Isolate true relationships from spurious ones
- Lesson 1190 — Introduction to Multivariate Analysis
- Control for confounding variables
- you learned about in partial correlation
- Lesson 595 — From Simple to Multiple Linear Regression
- control group
- or **standard treatment** as reference.
- Lesson 644 — Choosing a Reference CategoryLesson 1435 — What is a Randomized Controlled Trial?Lesson 1641 — Isolating Effects with Control GroupsLesson 1677 — Measuring Churn Reduction ImpactLesson 1688 — A/B Testing for Conversion Optimization
- Control group, after intervention
- Lesson 1452 — The Difference-in-Differences Setup
- Control or Baseline Group
- Lesson 644 — Choosing a Reference Category
- controlled experiment
- solves this problem through **randomization**.
- Lesson 499 — Why Controlled Experiments Are NeededLesson 1477 — Core Principles of A/B Testing
- Controlling deletes
- The cascading actions you learned about (`ON DELETE` and `ON UPDATE`) help maintain integrity when parent records change
- Lesson 1055 — What is Referential Integrity?
- Controls
- Account for confounding variables (income, membership duration)
- Lesson 1204 — From Hypothesis to Analysis Plan
- Controls for attention effects
- Users still see *something* in the ad slot
- Lesson 1747 — Ghost Ads and PSA Tests
- Convenience
- or **quota sampling** may be pragmatic (but acknowledge the bias risk).
- Lesson 243 — Choosing the Right Sampling Method
- Convenience sampling gone wrong
- Surveying only people easy to reach (like students in your class) when you want to understand all adults.
- Lesson 244 — Selection Bias and Its Causes
- Convenience wins
- Metrics like clicks, page views, or session duration give fast feedback.
- Lesson 1530 — Mismatched Metrics and Goals
- Convention
- Most SQL developers write keywords in UPPERCASE to distinguish them from table/column names, but lowercase works equally well.
- Lesson 847 — Basic SQL Syntax Rules
- conversion
- is any desired action a user completes that moves them closer to delivering value to your business.
- Lesson 1686 — Defining Conversions and Conversion RateLesson 1690 — Landing Page and CTA Optimization
- Conversion probability curves
- What percentage converts by day 7, 30, or 90?
- Lesson 839 — Time-to-Conversion in Marketing Funnels
- Conversion Rate
- Percentage of visitors who complete a desired action
- Lesson 1516 — Business Metrics: Definition and ExamplesLesson 1613 — Raw Counts vs. Rates and RatiosLesson 1625 — Cross-Functional Metric DependenciesLesson 1680 — Measuring Drop-off and Conversion RatesLesson 1714 — Channel-Level Metrics
- conversion rates
- (Did they stay?
- Lesson 1676 — Win-Back and Retention StrategiesLesson 1723 — Comparing Single-Touch Models
- Convert between zones
- Transform a timestamp from one timezone to another (e.
- Lesson 1042 — Working with Timestamps and Time Zones
- Cook's Distance
- .
- Lesson 578 — Visualizing Leverage and InfluenceLesson 587 — Identifying Outliers in Regression ContextLesson 589 — Deciding Whether to Remove Outliers
- Cookie banners
- are designed to extract "consent" through friction and confusion.
- Lesson 1914 — Consent in Digital Contexts
- Cookiecutter
- is a command-line tool that creates projects from templates.
- Lesson 2076 — Code Organization Templates and Cookiecutter
- Cookiecutter Data Science
- , which implements best practices you've already learned: separating raw from processed data, organizing notebooks, keeping configuration separate, and more.
- Lesson 2076 — Code Organization Templates and Cookiecutter
- Coordinated disclosure
- If external disclosure is needed, work with security/ethics experts to time and frame it appropriately
- Lesson 1925 — Mitigation Strategies and Responsible Disclosure
- Coordinates (coord)
- The space where data is plotted (Cartesian, polar, map projections)
- Lesson 1340 — The Seven Layers of Grammar
- Coordinating dependencies
- some tasks must wait for others (e.
- Lesson 1769 — Task Parallelism and Work Distribution
- Copy Elements
- Lesson 1690 — Landing Page and CTA Optimization
- Copyleft licenses
- (like GPL, AGPL) require that derivative works also be open source under the same license.
- Lesson 2081 — Understanding Open Source Licenses
- Core
- is the foundation level, and the **ORM** (Object-Relational Mapper) is built on top of it.
- Lesson 1118 — SQLAlchemy Core vs ORM
- Correct
- "We are 95% confident the population proportion lies between 0.
- Lesson 281 — Interpreting Proportion Confidence Intervals
- Correct interpretation
- "Being sick causes people to go to the hospital."
- Lesson 496 — Reverse Causality
- Correct period
- Algorithm properly separates normal seasonal peaks from true anomalies
- Lesson 1409 — Setting Detection Parameters
- Correctly explaining results
- "NYC homes cost $15k more than Boston homes" (not "NYC homes cost $15k")
- Lesson 643 — Interpreting Coefficients Relative to Reference
- Correlated approach
- Lesson 980 — Converting Correlated to Non-Correlated Subqueries
- Correlated SELECT subqueries
- run repeatedly:
- Lesson 969 — Performance Considerations for SELECT Subqueries
- Correlated subqueries
- reference columns from the outer query.
- Lesson 968 — Correlated vs Non-Correlated Subqueries in SELECT
- correlated subquery
- references the outer query and must run for *every row* being evaluated.
- Lesson 966 — Performance Considerations for WHERE SubqueriesLesson 975 — What is a Correlated Subquery?
- Correlated vs. Uncorrelated Subqueries
- Lesson 966 — Performance Considerations for WHERE Subqueries
- Correlation
- means two variables move together in a statistically observable pattern.
- Lesson 1420 — Defining Correlation and Causation
- Correlation Analysis
- Lesson 1602 — Identifying Leading Indicators for Your MetricsLesson 1883 — Protected Classes and Proxy Variables
- correlation coefficient
- ?
- Lesson 306 — Bootstrap for Non-Standard ProblemsLesson 476 — What is Pearson Correlation?Lesson 719 — What is Autocorrelation?Lesson 721 — Computing ACF Values
- Correlation matrices and heatmaps
- (Lesson 1192) reveal pairs of highly correlated variables—strong candidates for redundancy
- Lesson 1197 — Identifying Variable Importance and Redundancy
- correlation matrix
- solves this by computing correlations between *every pair* of variables and organizing them into a grid.
- Lesson 510 — Correlation Matrices: Construction and DisplayLesson 513 — Applications: Feature Selection and MulticollinearityLesson 1192 — Correlation Matrices and Heatmaps
- Correlation matrix examination
- Drop one variable from pairs with correlation > 0.
- Lesson 585 — Remedies: Variable Selection
- Correlations
- between two variables
- Lesson 306 — Bootstrap for Non-Standard ProblemsLesson 1191 — Scatter Plot Matrices and Pairplots
- cost
- , and **feasibility**.
- Lesson 243 — Choosing the Right Sampling MethodLesson 295 — Trade-offs: Precision, Confidence, and CostLesson 1677 — Measuring Churn Reduction Impact
- Cost forecasting
- Knowing the hazard function helps finance teams predict warranty claim volumes and budget accordingly.
- Lesson 837 — Product Warranty and Failure Analysis
- Cost less to maintain
- No retraining pipelines, drift monitoring, or GPU compute
- Lesson 2128 — Data Distribution Shifts Frequently
- Cost of false positives
- vs false negatives
- Lesson 324 — Common Significance Levels: 0.05, 0.01, and 0.10
- Cost Per Acquisition (CPA)
- by targeting cheaper traffic sources, they might inadvertently decrease **Average Order Value (AOV)** that the sales team tracks.
- Lesson 1625 — Cross-Functional Metric DependenciesLesson 1714 — Channel-Level MetricsLesson 1715 — Comparing Channel Performance
- Cost per patient encounter
- aggregates all expenses divided by patient visits or admissions—a critical profitability metric.
- Lesson 1633 — Healthcare Metrics: Patient Outcomes and Operational Efficiency
- Cost-benefit analysis
- Is the effect large enough to justify intervention costs?
- Lesson 386 — Effect Size Interpretation GuidelinesLesson 2126 — Cost and Complexity Exceed Benefit
- Cost-effective
- Reduces travel and administrative costs by concentrating data collection in selected clusters
- Lesson 238 — Multistage Sampling
- COUNT
- , **SUM**, **AVG**, **MIN**, and **MAX**—together with **GROUP BY** to create rich summaries of grouped data.
- Lesson 892 — GROUP BY with Different Aggregate Functions
- Count data
- (number of events, purchases, visits)
- Lesson 213 — Square Root and Cube Root TransformationsLesson 678 — Choosing the Right Link FunctionLesson 689 — When to Use Poisson RegressionLesson 690 — The Poisson Distribution as a GLMLesson 1552 — Gamma-Poisson Conjugacy
- Count the signs
- Ignore zeros; count how many differences are positive (+) and how many are negative (−)
- Lesson 391 — The Sign Test for Medians
- COUNT(column_name)
- counts only the rows where that specific column has a **non-null value**.
- Lesson 882 — COUNT: Counting Rows and Non-Null ValuesLesson 894 — NULL Values in GROUP BY
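A small sketch of the difference between `COUNT(*)` and `COUNT(column_name)`, run through Python's built-in `sqlite3` module. The `users` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical in-memory table; one row has a NULL email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ana", "ana@example.com"), ("ben", None), ("cy", "cy@example.com")],
)

# COUNT(*) counts every row; COUNT(email) skips rows where email is NULL.
total_rows, emails_present = conn.execute(
    "SELECT COUNT(*), COUNT(email) FROM users"
).fetchone()
conn.close()
```

The gap between the two counts is the number of rows with a missing value in that column.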
- COUNT(right_table_column)
- ignores NULLs → correct for "how many matches"
- Lesson 933 — Aggregating with LEFT JOINs
- Counter-example
- Drawing two cards from a deck *without replacement* creates dependence.
- Lesson 101 — Defining Statistical Independence
- Counter-metrics
- and **guardrails** are defensive metrics designed to catch these problems before they damage your business.
- Lesson 1624 — Counter-Metrics and GuardrailsLesson 1635 — Media and Content Metrics: Watch Time and Content Performance
- Counting distinct performance levels
- rather than absolute positions
- Lesson 1009 — DENSE_RANK(): Ranking Without Gaps
- Counting unique entities
- How many different customers placed orders?
- Lesson 873 — Understanding DISTINCT: Removing Duplicate RowsLesson 887 — Aggregates with DISTINCT
- Course-correct quickly
- Discover that your chosen metric doesn't align with business goals *before* building a complete pipeline.
- Lesson 2111 — Fast Feedback Loops with Stakeholders
- covariance
- as the "raw" measure of how two variables move together.
- Lesson 478 — The Formula for Pearson's rLesson 519 — Computing β₁: The Slope Estimate
- Covariate balance
- means the distribution of baseline characteristics—age, prior purchase behavior, device type, etc.
- Lesson 1491 — Covariate Balance and Diagnostics
- Covariates
- (the variables you believe affect survival)
- Lesson 828 — Fitting the Cox ModelLesson 835 — Customer Churn Prediction with Survival AnalysisLesson 840 — Loan Default Timing and Credit Risk
- Coverage error
- occurs when your **sampling frame** (the actual list or method you use to select your sample) doesn't include everyone in the target population.
- Lesson 249 — Coverage Error and Undercoverage
- Cox models
- to identify which covariates (customer age, past purchases, email open time) predict faster responses
- Lesson 841 — Campaign Response Time Analysis
- Cox Proportional Hazards Model
- (or Cox regression) is a semi-parametric method that lets you predict how covariates (like age, treatment, or risk factors) affect survival time **without** assuming what the baseline hazard distribution looks like.
- Lesson 825 — What is the Cox Proportional Hazards Model?Lesson 835 — Customer Churn Prediction with Survival AnalysisLesson 836 — Employee Turnover and Retention AnalysisLesson 839 — Time-to-Conversion in Marketing FunnelsLesson 840 — Loan Default Timing and Credit Risk
- Cramér's V
- is the standard effect size measure for chi-squared tests of independence.
- Lesson 429 — Effect Size: Cramér's V and Phi
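One common way to compute Cramér's V is V = sqrt(χ² / (n · min(r − 1, c − 1))), where r and c are the numbers of rows and columns. The sketch below assumes that formula with no continuity correction; the function name and the 2×2 table are hypothetical.

```python
from math import sqrt

def cramers_v(table):
    """Cramér's V from a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1   # min(r - 1, c - 1)
    return sqrt(chi2 / (n * k))

v = cramers_v([[30, 10], [20, 40]])
print(round(v, 4))
```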
- Crash queries
- some databases have row limits or timeouts
- Lesson 943 — CROSS JOIN Results: Size and Structure
- Create a narrative flow
- Guide readers from question → exploration → findings → conclusions in a linear, readable format
- Lesson 1982 — Literate Programming with Notebooks
- Create disparate impact
- even without intent (your model systematically disadvantages a protected group)
- Lesson 1888 — Protected Classes and Sensitive Attributes
- Create predictable rhythms
- Schedule recurring meetings at the project's start.
- Lesson 2104 — Communication Cadence and Updates
- Create strata
- by grouping units with identical covariate values
- Lesson 1489 — Stratified Randomization Fundamentals
- Create unexpected results
- accidental CROSS JOINs are a common SQL mistake
- Lesson 943 — CROSS JOIN Results: Size and Structure
- Creating new features
- from existing ones: combining columns, extracting date components (day of week, month), or calculating ratios that encode domain knowledge.
- Lesson 2088 — Stage 4: Feature Engineering and Preparation
- Creative costs
- design, copywriting, video production
- Lesson 1753 — Customer Acquisition Cost (CAC): Components and Calculation
- Credentials and secrets
- Lesson 2031 — Using .gitignore for Data Science Projects
- credible interval
- is the Bayesian alternative to a confidence interval.
- Lesson 1562 — Credible Intervals for ProportionsLesson 1573 — What is a Credible Interval?
- credible intervals
- (e.
- Lesson 1417 — Bayesian Change-Point DetectionLesson 1539 — Interpreting Posterior ProbabilitiesLesson 1574 — Credible Intervals vs Confidence Intervals
- Credit history location
- → historical discrimination effects
- Lesson 1889 — Proxy Variables and Redlining
- Credit scoring
- built on historical lending bias against minorities
- Lesson 1881 — Historical and Societal Bias
- Criminal justice data
- reflecting decades of discriminatory policing practices
- Lesson 1881 — Historical and Societal Bias
- critical
- .
- Lesson 376 — The Assumption of Normality in t-TestsLesson 970 — Subqueries in the FROM Clause (Derived Tables)Lesson 1857 — Logging Best PracticesLesson 1910 — Data Protection Impact Assessments (DPIAs)
- Critical caveat
- Due to Jensen's inequality, `exp(E[log(Y)])` ≠ `E[Y]`.
- Lesson 594 — Interpreting Models After Transformation
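A quick numeric check of the caveat, using made-up values: back-transforming the mean of the logs gives the geometric mean, which sits below the arithmetic mean whenever the values differ.

```python
from math import exp, log

y = [1.0, 10.0, 100.0]                        # hypothetical skewed outcome

mean_of_logs = sum(log(v) for v in y) / len(y)
back_transformed = exp(mean_of_logs)          # exp(E[log(Y)]): the geometric mean
arithmetic_mean = sum(y) / len(y)             # E[Y]

print(back_transformed, arithmetic_mean)
```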
- Critical insight
- Controlling for a collider *opens* a spurious path between X and Y, creating bias where none existed.
- Lesson 1471 — Mediators and Colliders
- Critical pipeline failures
- Page on-call engineer via PagerDuty
- Lesson 1851 — Error Logging and Notifications
- Critical requirement
- MVT demands significantly more traffic than A/B testing because you're splitting visitors across many more variants.
- Lesson 1689 — Multivariate Testing and Personalization
- Critical rule
- Each step must map to at least one clear event.
- Lesson 1679 — Defining Funnel Steps and Events
- critical value
- is the multiplier that determines how wide your confidence interval is.
- Lesson 268 — Critical Values and the t-DistributionLesson 271 — Margin of ErrorLesson 326 — Critical ValuesLesson 607 — Confidence Intervals for Coefficients
- Critical Value Approach
- Lesson 327 — Decision Rules: Reject or Fail to Reject
- critical values
- .
- Lesson 325 — The Rejection RegionLesson 345 — Directionality in Hypothesis TestingLesson 355 — Finding Critical Values and P-Values
- Critical/Page
- Data corruption, complete pipeline failure, SLA breach—requires immediate human intervention
- Lesson 1858 — Alerting Strategies
- Critically
- You *cannot* say "there's a 95% probability the true proportion lies in this interval" under the frequentist interpretation—the parameter either is or isn't in that specific interval.
- Lesson 1564 — Comparing Bayesian and Frequentist Proportion Inference
- Cross-filtering
- Filtering data in one view updates all views
- Lesson 1304 — Subplots and Linked Interactions
- Cross-tabulations
- Visualize frequency patterns across two categorical variables
- Lesson 1224 — Heatmaps and Correlation Matrices
- Cross-validation
- When prediction is paramount, sufficient data exists, and computational resources allow it.
- Lesson 616 — Adjusted R-Squared vs Other CriteriaLesson 632 — Parsimony and Occam's RazorLesson 1463 — RDD Bandwidth Selection and Local EstimationLesson 2055 — Why Randomness Matters in Data Science
- Crossing or converging lines
- → Interaction present; one factor's effect changes depending on the other
- Lesson 466 — Visualizing Interactions
- CRS
- defines exactly how coordinates relate to positions on Earth.
- Lesson 1308 — Geographic Data Types and Coordinate Systems
- CSV
- is human-readable and universal but slow to parse and memory-intensive.
- Lesson 1133 — Performance Considerations Across Formats
- CTA Performance
- Lesson 1690 — Landing Page and CTA Optimization
- CTEs
- are named and defined upfront (we'll cover these in detail soon):
- Lesson 974 — When to Use FROM Subqueries vs CTEsLesson 991 — CTEs vs Subqueries: When to Use Each
- Cube Root Transformation
- (`x^(1/3)`) is useful for:
- Lesson 213 — Square Root and Cube Root Transformations
- Cube the z-scores
- For each value, subtract the mean, divide by standard deviation, then cube it.
- Lesson 65 — Calculating Skewness
- Cubing
- keeps the sign, so values below the mean contribute negatively and values above contribute positively.
- Lesson 65 — Calculating Skewness
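The cube-the-z-scores recipe can be sketched as follows. This version uses the population standard deviation and a plain average of the cubes; the lesson may use a sample-adjusted variant, so treat the helper `skewness` as an assumption.

```python
from statistics import fmean, pstdev

def skewness(values):
    """Average of cubed z-scores (population form)."""
    mu = fmean(values)
    sigma = pstdev(values)
    cubed = [((v - mu) / sigma) ** 3 for v in values]  # cubing keeps the sign
    return fmean(cubed)

right_skewed = skewness([1, 2, 3, 4, 10])   # long right tail gives a positive value
symmetric = skewness([1, 2, 3])             # deviations cancel out
print(right_skewed, symmetric)
```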
- Cultural buy-in
- Stakeholders may question "peeking" at results, requiring education
- Lesson 1515 — Trade-offs: Sample Size, Speed, and Complexity
- Cumulative Distribution Function (CDF)
- tells you the probability of getting *that value or anything smaller*.
- Lesson 120 — Cumulative Distribution Functions (CDF) for Discrete VariablesLesson 157 — Cumulative Distribution Functions (CDFs) for Continuous VariablesLesson 162 — Uniform Distribution: PDF and CDF
- Cumulative metrics
- $12,500 total revenue from Jan cohort by Week 3
- Lesson 1647 — Building a Cohort Table
- Cumulative probabilities
- P(X ≤ k) — at most k successes, or P(X ≥ k) — at least k successes
- Lesson 130 — Calculating Binomial Probabilities
- Cumulative probability
- requires summing multiple exact probabilities.
- Lesson 130 — Calculating Binomial Probabilities
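A minimal sketch of summing exact binomial probabilities into cumulative ones. The helper names and the example parameters (n = 10, p = 0.5) are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def at_most(k, n, p):
    """P(X <= k): sum the exact probabilities for 0..k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def at_least(k, n, p):
    """P(X >= k), computed as 1 - P(X <= k - 1)."""
    return 1 - at_most(k - 1, n, p)

p_at_most_3 = at_most(3, 10, 0.5)
p_at_least_3 = at_least(3, 10, 0.5)
print(p_at_most_3, p_at_least_3)
```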
- CURRENT ROW
- Start or end at the current row being processed
- Lesson 1020 — UNBOUNDED and CURRENT ROW Keywords
- Curved patterns
- Your relationship isn't actually linear (violates linearity assumption)
- Lesson 556 — What Are Residuals and Why Plot Them?Lesson 1189 — Detecting Nonlinear Relationships
- Customer Arrivals
- A coffee shop averages 15 customers per hour.
- Lesson 144 — Poisson Applications: Arrivals and Events
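With an average of 15 customers per hour, Poisson probabilities for a given hour can be sketched like this. The specific questions asked (exactly 10, at most 10) are illustrative choices, not from the lesson.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)

rate = 15                                   # average customers per hour
p_exactly_10 = poisson_pmf(10, rate)
p_at_most_10 = sum(poisson_pmf(k, rate) for k in range(11))
print(p_exactly_10, p_at_most_10)
```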
- Customer behavior
- Whether users stay, leave, or convert
- Lesson 1516 — Business Metrics: Definition and Examples
- Customer demographics
- Do purchases align with market segment sizes?
- Lesson 421 — Applications: Uniform, Genetic Ratios, and Distributions
- Customer Lifespan
- measures how long customers stay active (in the same time unit).
- Lesson 1663 — Simple LTV: Average Revenue Per Customer
- Customer Lifetime Value (CLV)
- Predicted total revenue from a customer
- Lesson 1516 — Business Metrics: Definition and Examples
- Customer Lifetime Value (LTV)
- is the total revenue a customer generates over their entire relationship with a business—from their first purchase to their last interaction before churning.
- Lesson 1661 — What is Customer Lifetime Value (LTV)?
- Customer preferences
- Your marketing theory predicts 40% will choose red, 35% blue, 25% green.
- Lesson 414 — Introduction to Chi-Squared Goodness of Fit Test
- Customer Retention Rate
- Percentage of customers who remain active
- Lesson 1516 — Business Metrics: Definition and Examples
- Customer segments
- A customer classified as "new" cannot simultaneously be "returning"
- Lesson 81 — Mutually Exclusive Events
- Customer service calls arriving
- If no one has called in the last 5 minutes, that doesn't make a call in the next minute more or less likely
- Lesson 167 — Memoryless Property of Exponential
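The memoryless property can be verified numerically: for exponential waiting times, P(T > s + t | T > s) equals P(T > t). The call rate below is a hypothetical value.

```python
from math import exp

def survival(t, rate):
    """P(T > t) for an exponential waiting time with the given rate."""
    return exp(-rate * t)

rate = 0.2   # hypothetical: 0.2 calls per minute on average

p_quiet_next_minute = survival(1, rate)
# Same probability, given that 5 quiet minutes have already passed
p_quiet_given_5_quiet = survival(5 + 1, rate) / survival(5, rate)
print(p_quiet_next_minute, p_quiet_given_5_quiet)
```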
- Customer success failures
- (poor onboarding, lack of support)
- Lesson 1675 — Churn Attribution and Root Cause Analysis
- CUSUM
- tracks the *cumulative sum* of deviations from a target value.
- Lesson 1403 — CUSUM and EWMA ChartsLesson 1415 — CUSUM: Cumulative Sum Control Chart
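A bare-bones sketch of the cumulative sum of deviations from a target. It omits the slack and decision-threshold parameters that a full CUSUM control chart normally adds; the function name and readings are hypothetical.

```python
from itertools import accumulate

def cusum(values, target):
    """Running total of (value - target) for each observation."""
    return list(accumulate(v - target for v in values))

readings = [10, 11, 9, 12, 13, 14]   # hypothetical process measurements
running = cusum(readings, target=10)
print(running)
```

A small, sustained shift above the target makes the running total climb steadily, which is why the cumulative view can surface drifts that individual points hide.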
- cut off sharply
- to near-zero beyond that lag.
- Lesson 731 — PACF for AR Process IdentificationLesson 776 — Identifying AR Order (p) Using PACF
- Cut technical debt
- Features with low adoption *and* low frequency are candidates for removal
- Lesson 1696 — Feature Adoption and Usage Frequency
- Cycle through each parameter
- , sampling from its conditional distribution:
- Lesson 1591 — Gibbs Sampling for Multivariate Posteriors
- Cycle Time
- tracks how long it takes to complete one unit from start to finish, while **throughput** measures actual units produced per time period.
- Lesson 1636 — Manufacturing Metrics: OEE, Yield, and Cycle Time
- Cycles
- Does website traffic peak on weekends?
- Lesson 19 — Temporal Data and Time SeriesLesson 1846 — Testing and Validating Dependency Graphs
- Cyclical
- Variable period (could be 3 years, then 5 years, then 4 years)
- Lesson 708 — Cyclical Patterns: Non-Fixed Fluctuations
- Cyclical patterns
- repeating waves suggesting seasonal or periodic effects
- Lesson 562 — Index Plots and Time-Ordered ResidualsLesson 708 — Cyclical Patterns: Non-Fixed Fluctuations
D
- Daily Active Users (DAU)
- counts how many unique users performed a meaningful action in your product on a given day.
- Lesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU)
- Damage vulnerable populations
- through automated decisions
- Lesson 1888 — Protected Classes and Sensitive Attributes
- Damped trend methods
- add a *damping parameter* (usually denoted φ, pronounced "phi") that gradually flattens the trend over time.
- Lesson 762 — Damped Trend Methods
- Dampens irregular fluctuations
- (the noise component)
- Lesson 755 — Moving Averages for Trend Estimation
- Dark patterns
- are interface designs that deliberately trick users into giving up data or privacy rights.
- Lesson 1914 — Consent in Digital Contexts
- Dash
- (by Plotly) offers **more control and flexibility**.
- Lesson 1330 — Introduction to Interactive Dashboards
- dashboard
- is an interactive, real-time (or near-real-time) monitoring tool that updates automatically as underlying data changes.
- Lesson 1974 — Defining Dashboards and ReportsLesson 1997 — Viewing Repository State with git status
- Dashboards excel at monitoring
- , letting stakeholders track KPIs, spot anomalies, and maintain situational awareness.
- Lesson 1980 — Hybrid Approaches and When to Use Both
- data
- , never as executable SQL commands.
- Lesson 1105 — Parameter Placeholders: Question MarksLesson 1339 — What is the Grammar of Graphics?Lesson 1340 — The Seven Layers of GrammarLesson 1348 — The Base Layer: ggplot() and Data MappingLesson 1552 — Gamma-Poisson ConjugacyLesson 1557 — The Beta-Binomial ModelLesson 2082 — Choosing a License for Data Science Projects
- Data access
- Are there privacy restrictions or missing historical data?
- Lesson 1169 — Clarifying Assumptions and Constraints
- Data Analyst
- investigates *why* the Northeast region underperformed and identifies the key factors
- Lesson 4 — Data Science vs Data Analytics vs Business Intelligence
- Data Analysts
- focus on *understanding the past and present*.
- Lesson 2138 — Data Analyst vs Data Scientist vs ML Engineer
- Data auditing
- Find customers who placed orders in 2023 but not in 2024
- Lesson 1002 — EXCEPT: Finding Differences
- Data augmentation
- Add more representative examples from underrepresented groups to balance your dataset.
- Lesson 1894 — Auditing and Remediation Strategies
- Data availability
- You might have a task that checks which data sources updated today, then spawns processing tasks only for those sources.
- Lesson 1844 — Dynamic Dependencies
- Data Cleaning & Preparation
- Lesson 9 — The Data Science Lifecycle Overview
- Data Cleaning and Preparation
- to extract meaningful insights—for example, analyzing customer reviews (text) to understand sentiment, or processing images to detect patterns.
- Lesson 16 — Structured vs Unstructured DataLesson 20 — Primary Data Sources: Databases and Data WarehousesLesson 38 — What is Central Tendency?
- Data Collection
- Lesson 9 — The Data Science Lifecycle OverviewLesson 1169 — Clarifying Assumptions and ConstraintsLesson 1878 — What is Bias in Data?
- Data Collection and Acquisition
- (which you learned earlier), you'll encounter both types.
- Lesson 16 — Structured vs Unstructured DataLesson 20 — Primary Data Sources: Databases and Data Warehouses
- Data Completeness
- The percentage of expected records that successfully arrived.
- Lesson 1856 — Key Metrics to Monitor
- Data consistency risk
- Aggregates can become stale or incorrect if updates fail
- Lesson 1073 — Storing Computed Values and Aggregates
- Data corruption during export
- Special characters lost when saving to formats that don't support Unicode.
- Lesson 1139 — Dealing with Special Characters and Unicode
- Data debt
- Undocumented preprocessing steps that become "tribal knowledge"
- Lesson 2131 — What is Technical Debt in Data Science?
- Data decays quickly
- Real-time bidding for ads loses value after milliseconds
- Lesson 1788 — Streaming Data and Real-Time Requirements
- Data documentation
- Lesson 1971 — Appendices and Technical Details
- Data drift
- New patterns emerge that your model hasn't seen
- Lesson 15 — Deployment, Monitoring, and Iteration
- Data exploration
- Understanding the range of values in a column
- Lesson 873 — Understanding DISTINCT: Removing Duplicate Rows
- Data Freshness
- The time lag between when data is generated and when it's available for use.
- Lesson 1856 — Key Metrics to Monitor
- Data Freshness SLO
- "Dashboard data will be no more than 4 hours old during business hours"
- Lesson 1860 — SLA and SLO Definitions
- Data Infrastructure
- Lesson 1643 — Building Attribution Frameworks
- data integrity
- you can't have an order pointing to a non-existent customer.
- Lesson 921 — Primary and Foreign Key RelationshipsLesson 1810 — Snowflake Schema and Normalization Trade-offs
- Data isolation
- Keep `data/` and `outputs/` in `.
- Lesson 2032 — Organizing Repository Structure for Data Science
- Data lineage
- is the documented history of data from its original source through every transformation, merge, filter, and calculation until it reaches its final form in a report, model, or dashboard.
- Lesson 1159 — What is Data Lineage?Lesson 1875 — Data Lineage and ProvenanceLesson 1908 — Data Subject Access Requests (DSARs)
- Data locality
- means running your computation where the data already lives.
- Lesson 1772 — Data Locality and Network Bottlenecks
- Data minimization
- Collect only what you actually need (goodbye "vacuum up everything" strategies)
- Lesson 1904 — What is GDPR and Why It MattersLesson 1905 — Core Principles of GDPR
- Data parallelism
- is like having five chefs all chopping vegetables using the same technique.
- Lesson 1769 — Task Parallelism and Work Distribution
- Data pipeline issues
- Some users' data might not be logged correctly
- Lesson 1524 — Sample Ratio Mismatch (SRM)
- Data pipeline maintenance
- Sources change schemas, APIs deprecate, databases get restructured
- Lesson 1979 — Maintenance and Sustainability Considerations
- Data points overlap
- Points in front hide those behind, potentially concealing important patterns
- Lesson 1329 — Effective Use and Pitfalls of 3D Visualizations
- Data Poisoning
- Adversaries might deliberately feed corrupted data into your pipeline to manipulate model outputs (e.
- Lesson 1920 — Anticipating Misuse of Data Products
- Data provenance
- is the documented history of your data: where it came from, who collected it, when it was gathered, and every transformation it underwent.
- Lesson 23 — Data Provenance and MetadataLesson 26 — Reproducibility vs. ReplicabilityLesson 1875 — Data Lineage and Provenance
- Data quality checks
- Comparing `COUNT(*)` vs `COUNT(DISTINCT column)` reveals how many duplicates exist
- Lesson 887 — Aggregates with DISTINCT
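A short check of this comparison against an in-memory SQLite table. The `signups` table and its contents are hypothetical, and the example avoids NULLs because `COUNT(DISTINCT column)` skips them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (email TEXT)")
conn.executemany(
    "INSERT INTO signups VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",), ("c@example.com",)],
)

total, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT email) FROM signups"
).fetchone()
duplicates = total - distinct   # rows beyond the first occurrence of each value
print(total, distinct, duplicates)
```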
- Data quality drift
- Your model expects feature X between 0-100, but upstream changes cause values of 0-1000.
- Lesson 2136 — Monitoring Gaps and Silent Failures
- Data Quality Issues
- Record missing value patterns, unexpected outliers, or encoding problems.
- Lesson 1180 — Documenting Univariate FindingsLesson 1201 — Domain Knowledge as a Hypothesis SourceLesson 1840 — What is Dependency Management in Pipelines?Lesson 1851 — Error Logging and Notifications
- data reconciliation
- finding discrepancies between two datasets, like customers in your CRM but not in your billing system, or vice versa.
- Lesson 938 — Symmetric Difference PatternLesson 941 — Use Cases: Data Reconciliation
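Outside SQL, the same reconciliation idea can be sketched with Python sets. The customer IDs are made up.

```python
crm_ids = {"c1", "c2", "c3"}        # hypothetical customers in the CRM
billing_ids = {"c2", "c3", "c4"}    # hypothetical customers in billing

only_in_crm = crm_ids - billing_ids       # in the CRM but never billed
only_in_billing = billing_ids - crm_ids   # billed but missing from the CRM
mismatched = crm_ids ^ billing_ids        # symmetric difference: either side only
print(only_in_crm, only_in_billing, mismatched)
```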
- Data requirements
- Ensure sufficient sample size in each age group
- Lesson 1204 — From Hypothesis to Analysis Plan
- Data rows
- Each subsequent line represents one record
- Lesson 1125 — CSV Files: Structure and Common Issues
- Data Science Lifecycle
- , lessons 9-15) are so important.
- Lesson 26 — Reproducibility vs. Replicability
- Data science problem
- "Build a binary classification model to predict 30-day churn probability, achieving minimum 80% recall to catch potential churners, using historical customer behavior data from the past 2 years."
- Lesson 2085 — Stage 1: Problem Definition and Scoping
- Data science specifics
- Check for proper handling of missing data, appropriate train-test splits, reproducibility (random seeds), and whether assumptions of statistical methods are met.
- Lesson 2024 — Code Review Best Practices
- Data Scientist
- builds a predictive model to forecast next quarter's sales and recommends which products to promote
- Lesson 4 — Data Science vs Data Analytics vs Business Intelligence
- Data Scientists
- focus on *predicting and prescribing*.
- Lesson 2138 — Data Analyst vs Data Scientist vs ML Engineer
- Data sources
- Database names, API endpoints, file paths, table versions
- Lesson 1988 — Embedding Data Lineage and MetadataLesson 2077 — The Purpose and Anatomy of a Good README
- Data splitting
- `train_test_split` shuffles differently each run
- Lesson 2055 — Why Randomness Matters in Data Science
- Data type
- Ordinal or continuous (but non-normal)
- Lesson 474 — Friedman Test: Non-Parametric Repeated Measures ANOVALesson 846 — Tables, Schemas, and Data TypesLesson 1163 — Metadata and Data DictionariesLesson 1230 — Choosing the Right Chart TypeLesson 2064 — Creating Data Dictionaries
- Data type matching
- `int64`, `float64`, `object`, `datetime64` match expectations
- Lesson 1151 — Schema Validation
- Data updates frequently
- and users want current information
- Lesson 1330 — Introduction to Interactive Dashboards
- Data Version Control (DVC)
- extends Git to handle large data files.
- Lesson 2066 — Version Control for Data Files
- Data versioning issues
- The dataset changes, but nobody tracks which version was used (Data Provenance)
- Lesson 30 — The Reproducibility Crisis and Solutions
- data warehouse
- , on the other hand, is like a massive library that collects copies of information from many different filing cabinets across an entire organization.
- Lesson 20 — Primary Data Sources: Databases and Data WarehousesLesson 1807 — Data Warehouse vs Database: Architecture and Purpose
- Data-driven/algorithmic
- Uses statistical models to weight contributions based on observed patterns and incrementality testing
- Lesson 1637 — What is Metric Attribution?
- data-ink ratio
- the proportion of ink (or pixels) in your chart that actually represents data versus non-data elements.
- Lesson 1237 — Chart Junk and Data-Ink RatioLesson 1246 — Visual Clutter and Chartjunk
- database
- as a digital filing cabinet where organizations store their day-to-day information in an organized way.
- Lesson 20 — Primary Data Sources: Databases and Data WarehousesLesson 842 — What is a Database?
- Database Management System (DBMS)
- is software that sits between you and your database files, handling all the complex operations of storing, organizing, retrieving, and managing data.
- Lesson 845 — Database Management Systems (DBMS)
- Database portability
- Switch from SQLite to PostgreSQL with minimal code changes
- Lesson 1117 — What is an ORM and Why Use It?
- Databases
- (JDBC):
- Lesson 1779 — Reading and Writing Data in SparkLesson 1877 — Versioning Strategies for Different Data Types
- Datadog
- automate this process, offering dashboards that show pipeline status at a glance and trigger alerts when thresholds are breached.
- Lesson 1861 — Monitoring Tools and Dashboards
- DataFrames
- organize your data into named columns—like a spreadsheet or SQL table—but distributed across a cluster.
- Lesson 1778 — DataFrames and Spark SQL Basics
- Dataset
- A collection of typed objects (numbers, strings, custom objects).
- Lesson 1777 — RDDs: Resilient Distributed Datasets Fundamentals
- Datasets
- over ~10 MB that change occasionally
- Lesson 2033 — Git Large File Storage (LFS) for Data Assets
- DATE
- , **DATETIME**, **TIMESTAMP**: Date and time values
- Lesson 846 — Tables, Schemas, and Data TypesLesson 1999 — Viewing Commit History
- Date fields
- (`order_date`, `created_at`) — most queries filter by time ranges
- Lesson 1812 — Partitioning and Clustering Strategies
- Date truncation
- cuts off the precision beyond a certain level, effectively "rounding down" a timestamp to the beginning of that time period.
- Lesson 1043 — Date Truncation and Rounding
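The "rounding down" behaviour can be imitated in Python with `datetime.replace`. The helper `truncate` and its unit names are hypothetical; SQL dialects expose their own truncation functions.

```python
from datetime import datetime

def truncate(ts, unit):
    """Round a timestamp down to the start of the given period."""
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")

ts = datetime(2024, 3, 15, 14, 37, 52)
by_hour = truncate(ts, "hour")
by_day = truncate(ts, "day")
by_month = truncate(ts, "month")
print(by_hour, by_day, by_month)
```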
- Dates
- capture days without precise times: "March 15, 2024"
- Lesson 19 — Temporal Data and Time SeriesLesson 857 — Comparison Operators: Greater and Less Than
- Dating app data
- Attractiveness and personality both lead to getting matches.
- Lesson 1473 — Conditioning on Colliders: Selection Bias
- DAU
- counts unique users engaging with your platform each day, while **MAU** tracks monthly uniques.
- Lesson 1631 — Social Media Metrics: DAU/MAU and Content Engagement
- Days or weeks saved
- in fast-moving business environments
- Lesson 1515 — Trade-offs: Sample Size, Speed, and Complexity
- dbt
- (transforms + documentation), and **OpenLineage** (open standard) embed lineage capture directly into your code.
- Lesson 1164 — Tools for Lineage TrackingLesson 1821 — Hybrid Approaches and Modern Data Stacks
- Dead Letter Queue
- is a separate storage location where permanently failed tasks or messages are routed after exhausting all retry attempts.
- Lesson 1852 — Dead Letter Queues
- Dead Letter Queues
- Verify that permanently failed records actually land in your dead letter queue for later investigation.
- Lesson 1854 — Testing Error Handling
- Debugging
- Identify when data quality issues were introduced
- Lesson 1871 — Why Version Control for Data?Lesson 2047 — What is Dependency Management?
- Decision
- We either reject H₀ (evidence is convincing) or fail to reject H₀ (insufficient evidence)
- Lesson 312 — Hypothesis Testing as a Legal Analogy
- Decision hesitation
- (need to consult others, compare options)
- Lesson 1681 — Time-Based Funnel Analysis
- Decision points
- "If we can explain 60% of the variance, we can make the call"
- Lesson 2117 — Defining 'Good Enough' with Stakeholders
- Decision rule
- If p-value < α (usually 0.
- Lesson 378 — Testing Normality: Statistical TestsLesson 716 — Augmented Dickey-Fuller Test
- Decision-focused
- Compute expected gains or losses for business decisions
- Lesson 1570 — Comparing Two Means: Bayesian Approach
- Decision-making
- Planning inventory based on predicted demand ranges
- Lesson 1571 — Posterior Predictive Distribution for New DataLesson 1580 — Bayesian vs Frequentist A/B Testing
- Decisions Made
- Document any filtering or transformation decisions.
- Lesson 1180 — Documenting Univariate Findings
- Declining Session Frequency
- A user who visited daily but now appears weekly is sending a signal.
- Lesson 1700 — Leading Indicators of Disengagement
- Decomposing Seasonality
- concept you've learned, the technique involves:
- Lesson 1408 — Handling Multiple Seasonal Periods
- Decorative backgrounds
- Solid fills, gradients, or textures add no value
- Lesson 1237 — Chart Junk and Data-Ink RatioLesson 1246 — Visual Clutter and ChartjunkLesson 1963 — Removing Chartjunk
- Decreased Depth of Activity
- Fewer actions per session, less content consumed, or shallow navigation compared to their historical baseline.
- Lesson 1700 — Leading Indicators of Disengagement
- Decreasing adjusted R-squared
- when you add a variable means that variable adds more noise than signal—it's not worth including
- Lesson 614 — Interpreting Adjusted R-Squared Values
- Dedicate regular time blocks
- for learning—even 30 minutes daily beats sporadic weekend marathons.
- Lesson 2143 — Continuous Learning and Skill Development
- Default Alphabetical
- Lesson 644 — Choosing a Reference Category
- default_value
- What to return when there's no previous/next row (defaults to NULL)
- Lesson 1023 — Introduction to Window Functions: LAG and LEADLesson 1024 — LAG Function: Accessing Previous Row Values
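The effect of the third argument to `LAG` can be seen in SQLite (window functions require SQLite 3.25 or newer). The `sales` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day INTEGER, amount INTEGER);
    INSERT INTO sales VALUES (1, 100), (2, 150), (3, 120);
""")

rows = conn.execute("""
    SELECT day,
           amount,
           LAG(amount)       OVER (ORDER BY day) AS prev_default_null,
           LAG(amount, 1, 0) OVER (ORDER BY day) AS prev_default_zero
    FROM sales
    ORDER BY day
""").fetchall()
print(rows)   # the first row has no previous row, so the default applies there
```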
- Defect Rate
- quantifies quality problems, often measured in defects per million opportunities (DPMO) in Six Sigma environments.
- Lesson 1636 — Manufacturing Metrics: OEE, Yield, and Cycle Time
- Defense in depth
- means building multiple independent security layers so that if one fails, others still protect you.
- Lesson 1109 — Input Validation and Defense in Depth
- Defensive Deletes
- If your pipeline needs to "delete and reload," make deletions specific.
- Lesson 1848 — Designing Idempotent Operations
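One way to make a delete-and-reload step idempotent is to scope the delete to the partition being reloaded and run both statements in one transaction. The sketch below assumes a hypothetical `daily_sales` table keyed by date; running the reload twice leaves the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, amount INTEGER)")
conn.execute("INSERT INTO daily_sales VALUES ('2024-03-14', 50)")

def reload_day(conn, sale_date, amounts):
    """Delete only the target date, then insert its rows, atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (sale_date,))
        conn.executemany(
            "INSERT INTO daily_sales VALUES (?, ?)",
            [(sale_date, amount) for amount in amounts],
        )

reload_day(conn, "2024-03-15", [10, 20])
reload_day(conn, "2024-03-15", [10, 20])   # rerun: no duplicates, other dates untouched

rows = conn.execute(
    "SELECT sale_date, amount FROM daily_sales ORDER BY sale_date, amount"
).fetchall()
print(rows)
```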
- Define an error metric
- (like Mean Squared Error or Mean Absolute Error)
- Lesson 772 — Holt-Winters Parameter Optimization
- Define constraints
- minimum spend per channel (contractual obligations), maximum spend (capacity limits), total budget
- Lesson 1742 — Budget Optimization Using MMM
- Define flexible step completion
- – count a step as completed the first time it occurs, regardless of order
- Lesson 1683 — Multi-Path and Non-Linear Funnels
- Define targets
- for LTV:CAC ratio (typically 3:1 minimum)
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- Define terms once
- When you must use jargon, explain it immediately
- Lesson 1967 — Writing Clear and Concise Analysis Sections
- Defined scope
- (timeframe, population, geography)
- Lesson 2093 — Translating Business Questions into Analytical Questions
- Defining success criteria
- What does "good enough" look like?
- Lesson 2085 — Stage 1: Problem Definition and Scoping
- Definition drift
- Changing what "active" means breaks trend comparisons
- Lesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU)
- Deflating r
- Conversely, an outlier that doesn't follow the general pattern (like an extremely tall person who weighs very little due to illness) can weaken an otherwise strong correlation by pulling the line away from the main cluster.
- Lesson 481 — Outliers and Their Impact on r
- Degree centrality
- How many connections a node has (the "popular" nodes)
- Lesson 1320 — Network Metrics and Visual Analysis
- Degrees of freedom
- represent the amount of independent information you have.
- Lesson 352 — The t-Distribution and Degrees of FreedomLesson 355 — Finding Critical Values and P-ValuesLesson 362 — Welch's t-Test for Unequal VariancesLesson 501 — T-Test for Pearson Correlation Significance
- degrees of freedom (df)
- , typically *n - 1* for a mean (where *n* is sample size).
- Lesson 268 — Critical Values and the t-DistributionLesson 270 — Degrees of Freedom in t-Intervals
- Delayed feedback
- Subscription renewals happen after 12 months
- Lesson 1517 — Surrogate Metrics: When Direct Measurement is Impractical
- Delayed response
- You discover issues after substantial damage occurs
- Lesson 1617 — The Danger of Lagging-Only MetricsLesson 1739 — Adstock and Carryover Effects
- DELETE
- With cascading rules, the database must find and handle all child records
- Lesson 1060 — Trade-offs: Performance vs Integrity
- DELETE protection
- You cannot delete a parent record if children still reference it (unless you specify cascading behavior)
- Lesson 1052 — Foreign Key Constraints
- Delete ruthlessly
- If code isn't in production and hasn't been touched in months, remove it.
- Lesson 2135 — Dead Experimental Code and Feature Sprawl
- Deletion Anomalies
- Lesson 1062 — Data Anomalies: Insert, Update, Delete
- Delimiter
- The character separating values (usually a comma, but sometimes tabs, semicolons, or pipes)
- Lesson 1125 — CSV Files: Structure and Common Issues
- Deliver faster
- You can answer the question in hours instead of weeks
- Lesson 2110 — The Minimum Viable Analysis (MVA)
- Deliver incrementally
- Instead of one massive final report, provide preliminary findings early.
- Lesson 2099 — Aligning with Business Timelines and Decision Points
- Demand mechanism
- Can you explain *how* improving this metric causes the outcome?
- Lesson 1615 — Correlation Without Causation
- Demographic parity
- Do groups receive positive outcomes at similar rates?
- Lesson 1884 — Detecting Bias in Your Data
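A simple way to check this is to compare the share of positive outcomes per group. The group labels, outcomes, and helper name below are hypothetical, and what counts as an acceptable gap is a judgment call that this sketch does not settle.

```python
from collections import defaultdict

def positive_rates(records):
    """Share of positive outcomes (1) per group from (group, outcome) pairs."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        positives[group] += outcome
    return {group: positives[group] / totals[group] for group in totals}

records = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),
]
rates = positive_rates(records)
parity_gap = max(rates.values()) - min(rates.values())
print(rates, parity_gap)
```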
- Demographic Parity (Statistical Parity)
- Lesson 1887 — Defining Fairness in Data Science
- Demographic statistics
- Regions with different population counts
- Lesson 43 — Weighted Mean and Its Applications
- Demographics
- Location, device type, referral source
- Lesson 1689 — Multivariate Testing and PersonalizationLesson 1701 — What is Customer Segmentation?
- denominator
- (SE) scales that difference by how much variability you'd expect due to random sampling.
- Lesson 353 — Calculating the t-StatisticLesson 478 — The Formula for Pearson's r
- density
- at each point—how likely that region is.
- Lesson 172 — Probability Density Function for Normal DistributionLesson 1267 — Histograms and Distribution Plots
- Department A (competitive)
- Lesson 1428 — The Simpson's Paradox Example
- Department B (less competitive)
- Lesson 1428 — The Simpson's Paradox Example
- Dependencies
- Input files, datasets, or code it needs
- Lesson 1874 — DVC Pipelines and StagesLesson 1988 — Embedding Data Lineage and MetadataLesson 2100 — Documenting Assumptions and Open Questions
- Dependency hell
- occurs when your project requires specific package versions, but those requirements conflict with each other or with what's installed in different environments.
- Lesson 2048 — The Dependency Hell Problem
- Dependency isolation
- is critical in data science because:
- Lesson 2039 — Virtual Environments: Concept and Benefits
- dependent
- (not independent), we use:
- Lesson 88 — General Multiplication RuleLesson 104 — Dependent Events and Joint ProbabilityLesson 427 — Interpreting Chi-Squared Test ResultsLesson 704 — What Makes Time Series Data Different?
- Dependent (paired) samples
- have a natural one-to-one correspondence between observations.
- Lesson 360 — Independent vs. Dependent Samples
- Dependent samples
- require a **paired t-test**, which analyzes the *differences* within each pair, effectively reducing the problem to a one-sample test on those differences
- Lesson 360 — Independent vs. Dependent Samples
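The reduction to a one-sample test on differences can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name `paired_t_statistic` and the before/after scores are hypothetical, not taken from the lesson.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return the t-statistic and degrees of freedom for paired samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Reduce the two-sample problem to one sample of within-pair differences.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


# Hypothetical before/after scores for five subjects.
before = [72, 65, 80, 58, 90]
after = [75, 70, 82, 63, 91]
t_stat, dof = paired_t_statistic(before, after)
print(f"t = {t_stat:.3f}, df = {dof}")
```

The resulting statistic would then be compared against a t-distribution with `n - 1` degrees of freedom, exactly as in a one-sample t-test.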
- depends on
- how much water is present.
- Lesson 465 — Interaction EffectsLesson 648 — What are Interaction Terms?
- Deployment
- Lesson 9 — The Data Science Lifecycle OverviewLesson 15 — Deployment, Monitoring, and Iteration
- Deployment constraints
- (pure Python libraries install easier in restricted environments)
- Lesson 1087 — Database Drivers and Connection Libraries
- Deployment instructions
- specifying environment requirements
- Lesson 2091 — Stage 7: Communication and Handoff
- Deployment lag
- By the time you retrain and deploy, the distribution may have shifted again
- Lesson 2128 — Data Distribution Shifts Frequently
- description
- .
- Lesson 817 — Comparing Multiple Survival CurvesLesson 1163 — Metadata and Data DictionariesLesson 1910 — Data Protection Impact Assessments (DPIAs)Lesson 2064 — Creating Data Dictionaries
- Descriptive Axis Labels
- Lesson 1960 — Annotation and Labeling Best Practices
- Descriptive problems
- answer "What happened?"
- Lesson 2096 — Distinguishing Descriptive, Diagnostic, and Prescriptive Problems
- Descriptive Statistics
- Calculate summary metrics for each segment—average purchase frequency, mean customer lifetime value, median recency, typical basket size.
- Lesson 1709 — Segment Profiling and Interpretation
- Design an Experiment
- Lesson 25 — The Scientific Method in Data Science
- Destination loaders
- Write processed data to warehouses, lakes, or operational systems
- Lesson 1822 — What is a Data Pipeline?
- Detect interactions
- How one variable's effect depends on another
- Lesson 1190 — Introduction to Multivariate Analysis
- Detecting multiple anomalies simultaneously
- without the masking effect where one outlier hides another
- Lesson 1405 — What is Seasonal Hybrid ESD?
- Detection delay
- measures the time lag between when the change actually occurs and when your algorithm flags it.
- Lesson 1418 — Evaluating Change-Point Detection Methods
- Detection lag
- grows exponentially—the longer it takes to notice, the more data and decisions are affected
- Lesson 2136 — Monitoring Gaps and Silent Failures
- Determine proportions
- Calculate what percentage of the total population each stratum represents
- Lesson 236 — Stratified Sampling
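As a rough sketch of this step, the share of each stratum can be computed and then used to allocate a sample proportionally. The stratum names, sizes, and the helper `proportional_allocation` are illustrative assumptions.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Return each stratum's population share and its share of the sample."""
    total = sum(stratum_sizes.values())
    proportions = {name: size / total for name, size in stratum_sizes.items()}
    # Rounding can make the allocations sum to slightly more or less than
    # sample_size; a real design would reconcile the remainder explicitly.
    allocation = {name: round(p * sample_size) for name, p in proportions.items()}
    return proportions, allocation


# Hypothetical population counts per stratum.
strata = {"urban": 6000, "suburban": 3000, "rural": 1000}
proportions, allocation = proportional_allocation(strata, sample_size=200)
print(proportions)
print(allocation)
```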
- Deterministic with ORDER BY
- The same query produces the same numbering, provided the ORDER BY columns uniquely order the rows within each partition
- Lesson 1007 — ROW_NUMBER(): Assigning Unique Row Numbers
- Detrend the Series
- Subtract the trend from the original data: `Y - T`.
- Lesson 744 — Classical Decomposition Methods
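A small sketch of the `Y - T` step, assuming an additive model and a trend estimated with a centered moving average. The odd window of 3 and the toy series are assumptions chosen to keep the arithmetic easy to follow; real seasonal data would normally use a window equal to the seasonal period.

```python
def centered_moving_average(values, window):
    """Estimate the trend T with a centered moving average (odd window only)."""
    if window % 2 == 0:
        raise ValueError("this sketch supports odd windows only")
    half = window // 2
    trend = [None] * len(values)  # the ends have no full window
    for i in range(half, len(values) - half):
        trend[i] = sum(values[i - half : i + half + 1]) / window
    return trend


def detrend(values, trend):
    """Additive detrending: Y - T wherever the trend is defined."""
    return [None if t is None else y - t for y, t in zip(values, trend)]


# Hypothetical series: a steady upward trend plus a repeating 3-step pattern.
y = [10, 14, 12, 13, 17, 15, 16, 20, 18]
trend = centered_moving_average(y, window=3)
detrended = detrend(y, trend)
print(detrended)
```

What remains after subtraction is the repeating pattern plus noise, which is what the later decomposition steps work with.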
- detrending
- when:
- Lesson 734 — Why Differencing and Detrending MatterLesson 740 — Choosing Between Differencing and Detrending
- Deuteranopia/Deuteranomaly
- (green-weak): difficulty distinguishing red from green
- Lesson 1248 — Color Blindness and Color Palette Design
- Development branch (`develop`)
- The integration point for completed experiments that passed initial validation.
- Lesson 2035 — Branching Strategies for Experiments
- Deviation from mean
- `value - AVG(value) OVER (PARTITION BY category)`
- Lesson 1019 — Comparing Values to Window Aggregates
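To see the window expression in action, the sketch below runs it against an in-memory SQLite database from Python. The `sales` table and its rows are hypothetical, and the example assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 10), ("A", 20), ("A", 30), ("B", 5), ("B", 15)],
)

# Each row keeps its own value and is compared to its category's average.
rows = conn.execute(
    """
    SELECT category,
           value,
           value - AVG(value) OVER (PARTITION BY category) AS deviation
    FROM sales
    ORDER BY category, value
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Unlike `GROUP BY`, the window aggregate does not collapse rows, so every original row appears in the result alongside its deviation.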
- DFFITS
- (pronounced "dee-fits") zooms in on a more specific question: *"How much does the predicted value for observation i change when we remove observation i from the dataset?"*
- Lesson 576 — DFFITS: Influence on Fitted ValuesLesson 589 — Deciding Whether to Remove Outliers
- Diagnostic outcomes
- A test result cannot be both "positive" and "negative" at the same time
- Lesson 81 — Mutually Exclusive Events
- Diagnostic problems
- answer "Why did it happen?"
- Lesson 2096 — Distinguishing Descriptive, Diagnostic, and Prescriptive Problems
- Diagonal patterns
- Gradual retention decline is normal; sudden cliff-drops warrant investigation
- Lesson 1649 — Visualizing Cohort Data with Heatmaps
- DiD estimate
- Treatment effect = Treatment change - Control change
- Lesson 1452 — The Difference-in-Differences Setup
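The arithmetic behind the estimate is simple enough to write out directly. The function name and the pre/post averages below are made up for illustration.

```python
def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences: treatment change minus control change."""
    treatment_change = treat_post - treat_pre
    control_change = control_post - control_pre
    return treatment_change - control_change


# Hypothetical group averages before and after the intervention.
effect = did_estimate(treat_pre=100, treat_post=130, control_pre=90, control_post=105)
print(effect)
```

Here the treated group rose by 30 and the control group by 15, so the change attributed to the treatment is the remaining 15, under the usual assumption that both groups would otherwise have followed parallel trends.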
- difference
- for each pair: subtract one measurement from the other.
- Lesson 371 — Calculating Paired DifferencesLesson 636 — The Reference CategoryLesson 698 — Null and Residual Deviance
- Difference from average
- `sale_amount - AVG(sale_amount) OVER (PARTITION BY region)`
- Lesson 1019 — Comparing Values to Window Aggregates
- Differences
- between two sample means (comparing groups)
- Lesson 225 — CLT for Sums and Other Statistics
- Differences are subtle
- small variations must be detectable
- Lesson 1233 — Position as the Most Effective Channel
- Differences in distribution shape
- between groups, not just center/spread
- Lesson 1286 — Violin Plots and Distribution Shape
- differencing
- or **detrending** when:
- Lesson 734 — Why Differencing and Detrending MatterLesson 740 — Choosing Between Differencing and Detrending
- differential privacy
- (which adds noise to protect individuals), MPC provides exact computation and reveals nothing about the inputs beyond what the final result itself implies, assuming the parties don't collude.
- Lesson 1903 — Secure Multi-Party ComputationLesson 1911 — GDPR Compliance for Data Scientists
- Difficult to change
- Update the same value in multiple places
- Lesson 2072 — Configuration Files vs Hard-Coded Values
- Difficulty interpreting effects
- You can't confidently say "holding all else constant, X increases Y by.
- Lesson 580 — What is Multicollinearity?
- dimension tables
- (containing descriptive attributes like customer names, product details, or dates).
- Lesson 956 — Star Schema JoinsLesson 1808 — Star Schema and Fact TablesLesson 1809 — Dimension Tables and Slowly Changing Dimensions
- Dimensionality reduction
- solves this by mathematically projecting your high-dimensional data into a lower-dimensional space while preserving as much important structure as possible.
- Lesson 1196 — Dimensionality Reduction for Visualization
- Dimensions matter
- Journals often require specific figure sizes (e.
- Lesson 1369 — Publication-Ready Plot Styling
- diminishing returns
- , and we capture it mathematically with **saturation curves**.
- Lesson 1740 — Saturation Curves and Diminishing ReturnsLesson 2116 — Diminishing Returns and the 80/20 Rule
- direct
- relationship between two variables, you must "control for" or "hold constant" potential confounders.
- Lesson 509 — Confounding Variables and ControlLesson 1712 — Common Channel Categories
- Direct attention
- Add a single annotation or callout box pointing to what matters: "Sales dropped 30% here."
- Lesson 1958 — Simplifying Visual Complexity
- Direct identifier removal
- means stripping obvious PII like names, social security numbers, email addresses, and phone numbers.
- Lesson 1895 — Data Anonymization Basics
- Direct interpretation
- Roughly tells you the "typical distance" values fall from the mean
- Lesson 49 — Standard Deviation: Interpretable Spread
- Direct probability statements
- "There's a 92% chance treatment A has a higher mean than control"
- Lesson 1570 — Comparing Two Means: Bayesian Approach
- Directed
- Tasks flow in specific directions (task A → task B)
- Lesson 1833 — Introduction to Apache Airflow
- Directed Acyclic Graph
- is a mathematical structure where nodes (tasks) are connected by directed edges (dependencies) with one ironclad rule: **no cycles allowed**.
- Lesson 1842 — Directed Acyclic Graphs (DAGs)
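One way to check the "no cycles" rule in practice is to attempt a topological sort, which only succeeds on acyclic graphs. The sketch below uses `graphlib` from the Python standard library (3.9+); the task names `extract` and `clean` are hypothetical additions to the pipeline.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "train_model": {"clean"},
    "generate_report": {"train_model"},
}
# A valid execution order exists only if the graph has no cycles.
order = list(TopologicalSorter(pipeline).static_order())
print(order)

# Two tasks that depend on each other form a cycle and are rejected.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```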
- Directed Acyclic Graph (DAG)
- is a visual diagram where:
- Lesson 1468 — Introduction to Directed Acyclic Graphs (DAGs)
- Directed edges
- (arrows) represent causal relationships pointing from cause to effect
- Lesson 1468 — Introduction to Directed Acyclic Graphs (DAGs)
- Directed graphs
- show asymmetrical relationships.
- Lesson 1316 — Introduction to Network Graphs and Graph Theory Basics
- Direction
- Do points trend upward (positive) or downward (negative)?
- Lesson 480 — Scatterplots and Visual AssessmentLesson 2122 — When Uncertainty Is Acceptable
- Directional (one-tailed)
- "The new landing page will *increase* sign-ups by 5%"
- Lesson 1479 — Formulating Hypotheses
- Directional alignment
- When the surrogate goes up, the business metric should go up too (not down!)
- Lesson 1518 — The Relationship Between Surrogate and Business Metrics
- Dirty reads
- Reading uncommitted data that gets rolled back
- Lesson 1116 — Transaction Isolation and Concurrency
- Disability status
- Lesson 1888 — Protected Classes and Sensitive Attributes
- Disagreement requires judgment
- If tests conflict, lean on your visual evidence and domain knowledge.
- Lesson 718 — Interpreting Stationarity Test Results
- Disclose failed model iterations
- and why they didn't work
- Lesson 1929 — Avoiding Cherry-Picking Results
- Disclosure of Purpose
- Lesson 1913 — Elements of Valid Consent
- Discovery-driven iteration
- Your analysis reveals unexpected patterns, missing data, or invalid assumptions.
- Lesson 2092 — Iteration and Feedback Loops in Practice
- Discrete (apples)
- You have exactly 5 apples or 6 apples.
- Lesson 18 — Numerical Variables: Discrete and Continuous
- Discrete data
- assigning categories to distinct colors or positions
- Lesson 1344 — Scales and Coordinate Systems
- Discrete numerical data
- consists of whole numbers that represent *counts* of distinct items.
- Lesson 18 — Numerical Variables: Discrete and Continuous
- Discriminatory Application
- Your fair hiring model might be selectively applied only to certain demographics while others bypass it entirely.
- Lesson 1920 — Anticipating Misuse of Data Products
- Discussion is centralized
- Questions, explanations, and decisions live alongside the code
- Lesson 2022 — Understanding Pull Requests
- dispersion parameter
- controls the spread or variability.
- Lesson 665 — Canonical Form of Exponential Family DistributionsLesson 669 — The Dispersion Parameter φ
- Display issues
- Characters appearing as , boxes, or garbled text—usually an encoding mismatch.
- Lesson 1139 — Dealing with Special Characters and Unicode
- Display outputs inline
- Charts, tables, and statistical results appear right below the code that generated them
- Lesson 1982 — Literate Programming with Notebooks
- Distance from max
- `MAX(value) OVER (PARTITION BY category) - value`
- Lesson 1019 — Comparing Values to Window Aggregates
- Distinguish stakeholder types
- The person requesting may not be the end user.
- Lesson 2102 — Understanding Stakeholder Goals and Constraints
- distribute
- work across many machines.
- Lesson 1764 — The Big Data Technology LandscapeLesson 1768 — Data Parallelism Fundamentals
- Distributed
- Your data is partitioned across multiple machines.
- Lesson 1777 — RDDs: Resilient Distributed Datasets Fundamentals
- Distributed Scheduler
- Lesson 1795 — Distributed Schedulers and Client Setup
- Distributing heterogeneous jobs
- to available workers
- Lesson 1769 — Task Parallelism and Work Distribution
- distribution
- centered around 0.
- Lesson 253 — Sampling Distribution of the Sample ProportionLesson 1172 — What is Univariate Analysis?Lesson 1284 — Pair Plots for Multivariate Exploration
- Distribution Characteristics
- Note shape (skewed, bimodal), spread, and central tendency.
- Lesson 1180 — Documenting Univariate Findings
- Distribution Checks
- Compare your data's distribution against historical baselines.
- Lesson 1157 — Statistical Anomaly Detection in QA
- Distribution comparisons
- ensuring all histograms use the same bin ranges
- Lesson 1276 — Sharing Axes Between Subplots
- Distribution plots
- Show how data is spread (histograms, KDE plots)
- Lesson 1281 — Introduction to Seaborn's Statistical Plots
- Distribution shape
- describes the overall form or silhouette of your data when visualized—whether values cluster symmetrically in the middle, bunch up on one side, or spread out evenly.
- Lesson 63 — Understanding Distribution ShapeLesson 1267 — Histograms and Distribution Plots
- Distributions
- Use `geom_histogram` or `geom_boxplot`
- Lesson 1342 — Geometric Objects (geoms)Lesson 1867 — Data Profiling and MonitoringLesson 2087 — Stage 3: Exploratory Data Analysis
- Diversification
- Relying on a single channel is risky; tracking reveals over-dependence
- Lesson 1711 — What Are Acquisition Channels?Lesson 1716 — Channel Mix and Portfolio Thinking
- Divide
- by the total number of observations
- Lesson 45 — Central Tendency for Grouped DataLesson 237 — Cluster SamplingLesson 744 — Classical Decomposition Methods
- Do missingness patterns correlate
- with other variables?
- Lesson 1207 — Missing Data Assessment and Strategy
- Docker container
- runs an application in a lightweight, isolated environment that bundles its own libraries and system tools while sharing the host machine's kernel.
- Lesson 2045 — Docker for Complete Environment Reproducibility
- Document active experiments
- Maintain a simple tracking file listing which experiments are ongoing, which succeeded, and which are archived.
- Lesson 2135 — Dead Experimental Code and Feature Sprawl
- Document and mitigate
- Create threat models; build monitoring, rate limits, access controls, or kill switches
- Lesson 1924 — Red Team Thinking for Data Scientists
- Document assumptions
- that came from domain experts or literature
- Lesson 1972 — Citations and References in Data Science ReportsLesson 2124 — Insufficient or Low-Quality Data
- Document changes
- so teams understand why the structure evolved.
- Lesson 1626 — Maintaining and Evolving Metric Trees
- Document data sources
- In your README, specify where data lives and how to access it
- Lesson 2070 — Separating Data from Code
- Document everything
- Record every decision, assumption, and step
- Lesson 30 — The Reproducibility Crisis and SolutionsLesson 250 — Strategies for Bias Detection and MitigationLesson 1679 — Defining Funnel Steps and EventsLesson 2046 — Best Practices for Environment Management in Teams
- Document original purpose
- explicitly in your consent forms and data governance policies
- Lesson 1915 — Secondary Use and Scope Creep
- Document sensitivity
- Report when conclusions are stable or when they depend on prior choice
- Lesson 1572 — Sensitivity Analysis and Prior Robustness
- Document trade-offs
- Where do optimizations create tension?
- Lesson 1625 — Cross-Functional Metric Dependencies
- Document your decision
- keep or remove?
- Lesson 1209 — Outlier Detection and InvestigationLesson 1909 — Right to Erasure and Data Retention Policies
- Document your methods
- before seeing results (prevents post-hoc justification)
- Lesson 35 — Conflicts of Interest and Independence
- Documentation
- Your validation rules become living documentation
- Lesson 1158 — Automated Validation FrameworksLesson 1925 — Mitigation Strategies and Responsible DisclosureLesson 2082 — Choosing a License for Data Science Projects
- Documentation debt
- Skipping README updates or data dictionaries
- Lesson 2131 — What is Technical Debt in Data Science?
- Documentation Licenses
- (README, tutorials, papers):
- Lesson 2082 — Choosing a License for Data Science Projects
- Documented
- automatically in human-readable format
- Lesson 1868 — Great Expectations FrameworkLesson 1912 — What is Informed Consent in Data Science?
- Domain context
- is the background knowledge about the field you're analyzing: its terminology, business processes, constraints, typical patterns, and unwritten rules.
- Lesson 1168 — Understanding Domain Context
- Domain Expertise
- Lesson 1 — Defining Data ScienceLesson 386 — Effect Size Interpretation GuidelinesLesson 1429 — Identifying Confounders in PracticeLesson 1534 — The Prior DistributionLesson 1883 — Protected Classes and Proxy Variables
- Domain knowledge
- (understanding the industry, business context, or field you're working in) is crucial.
- Lesson 8 — Misconceptions About Data ScienceLesson 75 — Domain-Specific Outlier RulesLesson 193 — Choosing Between Distributions in PracticeLesson 537 — When R-Squared is Not EnoughLesson 585 — Remedies: Variable SelectionLesson 1201 — Domain Knowledge as a Hypothesis Source
- Domain knowledge suggests
- "Our email campaign goes out Monday evenings—let's check if opens predict Tuesday purchases"
- Lesson 1201 — Domain Knowledge as a Hypothesis Source
- Domain rules
- Values outside physically possible ranges (negative age, 500% growth rate)
- Lesson 1209 — Outlier Detection and Investigation
- Domain validity
- Your model might fit your training data beautifully (high R-squared) but make nonsensical predictions outside the observed range.
- Lesson 537 — When R-Squared is Not Enough
- Domain-specific rules
- when you have expert knowledge about what constitutes "normal"
- Lesson 1411 — Applications and Limitations
- Don't
- This is still vulnerable to SQL injection if the list contains user input.
- Lesson 1108 — Handling IN Clauses Safely
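One common safe pattern, sketched here with Python's `sqlite3`, is to generate one placeholder per list item and pass the values as parameters, so user input is never spliced into the SQL text. The `users` table, its rows, and the helper `fetch_by_ids` are hypothetical.

```python
import sqlite3


def fetch_by_ids(conn, ids):
    """Select rows whose id is in `ids` using bound parameters only."""
    if not ids:
        return []  # "IN ()" is not valid SQL, so short-circuit empty lists
    # Only the "?" markers are interpolated; the values are bound separately.
    placeholders = ", ".join("?" for _ in ids)
    query = f"SELECT id, name FROM users WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(query, list(ids)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "ben"), (3, "cy")])

safe_rows = fetch_by_ids(conn, [1, 3])
# A hostile string is treated as a plain value, not as SQL.
hostile_rows = fetch_by_ids(conn, ["1 OR 1=1"])
conn.close()

print(safe_rows)
print(hostile_rows)
```

Other drivers use different placeholder markers (for example `%s`), so the marker would need to match the library in use.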
- Don't skip the diagnostics
- Check histograms, Q-Q plots, and variance equality tests *before* running your test.
- Lesson 368 — Common Pitfalls and Best Practices
- Don't use SUM for
- Counting rows (use `COUNT`), averaging (use `AVG`), or non-numeric data (it only works with numbers).
- Lesson 883 — SUM: Calculating Totals
- Double funnel
- Variance is small in the middle but large at both extremes
- Lesson 559 — Detecting Heteroscedasticity (Non-Constant Variance)
- Double-counting in partitions
- When using the law of total probability, make sure your conditioning events are mutually exclusive and collectively exhaustive—no overlap, no gaps.
- Lesson 100 — Common Conditional Probability Mistakes
- Doubling your sample size
- doesn't cut the standard error in half; it reduces it by a factor of √2 ≈ 1.41, because the standard error scales with 1/√n.
- Lesson 223 — Standard Error and the CLT
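A quick numerical check of this relationship, assuming the usual formula SE = s / √n. The standard deviation of 10 and the sample sizes are arbitrary illustrative values.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)


se_100 = standard_error(sd=10, n=100)
se_200 = standard_error(sd=10, n=200)  # doubled sample size
se_400 = standard_error(sd=10, n=400)  # quadrupled sample size

print(se_100 / se_200)  # ratio after doubling n
print(se_100 / se_400)  # ratio after quadrupling n
```

Halving the standard error therefore takes four times the data, not twice.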
- Download buttons
- for saving charts as static images
- Lesson 1300 — Creating Basic Interactive Charts with Plotly Express
- Downstream
- `train_model` and `generate_report` (direct and transitive)
- Lesson 1841 — Upstream and Downstream Dependencies
- Downstream dependencies
- are the tasks that rely on *your* task's output.
- Lesson 1841 — Upstream and Downstream Dependencies
- Draft pull requests
- are a special PR state that signals "this is work-in-progress; feedback welcome, but don't merge yet."
- Lesson 2029 — Draft Pull Requests and WIP Workflows
- Draw Conclusions
- Lesson 25 — The Scientific Method in Data Science
- Drop
- If <5% missing and MCAR
- Lesson 1207 — Missing Data Assessment and StrategyLesson 2015 — Interactive Rebase for History Cleanup
- Drop non-significant predictors
- if they don't contribute beyond noise.
- Lesson 703 — Sequential Model Building Strategy
- Dry runs
- Execute the DAG structure logic (declaring dependencies, checking conditions) without running the actual data processing.
- Lesson 1846 — Testing and Validating Dependency Graphs
- Dtype and converter fallbacks
- Lesson 1141 — Recovering from Corrupted or Partially Broken Data
- Dual use
- refers to technology, methods, or data that can be applied for both beneficial and harmful purposes.
- Lesson 1919 — Defining Dual Use in Data ScienceLesson 1920 — Anticipating Misuse of Data ProductsLesson 1931 — When to Push Back on Requests
- Dummy variable encoding
- creates separate binary (0/1) columns for each category.
- Lesson 635 — Dummy Variable Encoding Basics
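A minimal pure-Python sketch of the encoding follows. The `is_` column prefix, the `drop_first` option, and the color data are assumptions for illustration; in regression, one category is typically dropped to serve as the reference category.

```python
def dummy_encode(values, drop_first=False):
    """Turn a list of category labels into rows of 0/1 indicator columns."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped category becomes the implicit reference level.
        categories = categories[1:]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
encoded = dummy_encode(colors)
encoded_with_reference = dummy_encode(colors, drop_first=True)

print(encoded[0])
print(encoded_with_reference[0])
```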
- Dunn's test
- follows Kruskal-Wallis to identify which specific group pairs are significantly different.
- Lesson 473 — Post-Hoc Tests After Kruskal-Wallis: Dunn's Test
- Dunnett's test
- is specialized for situations where you have one control or reference group and several experimental treatments.
- Lesson 460 — Dunnett's Test for Control Comparisons
- Durability
- Once committed, changes persist even if the system crashes
- Lesson 1110 — What Are Database Transactions?
- During deep dives
- (testing relationships, checking distributions): exploration
- Lesson 1216 — Choosing the Right Purpose
- During Training
- Models need an objective function—a mathematical definition of "better."
- Lesson 2130 — No Clear Success Metric or Feedback Loop
- During Validation
- Even if you train something, how do you know it works?
- Lesson 2130 — No Clear Success Metric or Feedback Loop
- Dynamic dependencies
- let your pipeline decide its own structure while running.
- Lesson 1844 — Dynamic Dependencies
- Dynamic filtering
- Filter based on calculated values (like averages, maximums) rather than hardcoded numbers
- Lesson 959 — Introduction to Subqueries in WHERE
E
- E-commerce
- Homepage → Product Page → Add to Cart → Checkout → Purchase
- Lesson 1678 — What is Funnel Analysis?Lesson 1686 — Defining Conversions and Conversion RateLesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU)
- E(X) = λ
- (expected value)
- Lesson 141 — Mean and Variance of Poisson DistributionLesson 151 — Expected Value and Variance for Common Distributions
- E(Y) = b'(θ)
- the first derivative of the cumulant function gives you the expected value
- Lesson 667 — Mean and Variance in the Exponential Family
- Early detection
- Catch bad data before it corrupts downstream analysis
- Lesson 1158 — Automated Validation Frameworks
- Early feedback on approach
- "Before I process these 50 datasets, does this transformation logic look right?"
- Lesson 2029 — Draft Pull Requests and WIP Workflows
- Early in your workflow
- (profiling, hypothesis generation, outlier detection): exploration
- Lesson 1216 — Choosing the Right Purpose
- Early quality detection
- Comparing survival curves (using the log-rank test) between production batches, suppliers, or manufacturing plants reveals if one group has significantly higher failure rates.
- Lesson 837 — Product Warranty and Failure Analysis
- Early research
- Studies showed coffee drinkers had higher rates of heart disease.
- Lesson 1426 — Real-World Examples: Correlation vs Causation
- Early-stage customer discovery
- What sparks initial interest?
- Lesson 1720 — First-Touch Attribution Model
- Easier collaboration
- When your team knows data will always arrive in tidy format, everyone can use the same templates, functions, and workflows without custom adaptations.
- Lesson 1149 — Benefits of Tidy Data for Downstream Work
- Easier dimension updates
- Changing a category name happens in one place
- Lesson 1810 — Snowflake Schema and Normalization Trade-offs
- Easier Maintenance
- Smaller, focused tables are simpler to understand, query, and modify than massive, repetitive tables.
- Lesson 1061 — Introduction to Normalization
- Easiest wins
- Where do top-performing segments reveal best practices?
- Lesson 1685 — Actionable Insights from Funnel Analysis
- Easy to Measure
- Lesson 1598 — Characteristics of Lagging Indicators
- Economic business cycles
- are the classic example.
- Lesson 708 — Cyclical Patterns: Non-Fixed Fluctuations
- Economic growth
- might correlate with **increased coffee consumption**, not because coffee drives the economy, but because both rise with population and urbanization.
- Lesson 1422 — Spurious Correlations
- Edge cases
- Handling tests that never stop, or stopping rules that trigger unexpectedly
- Lesson 1515 — Trade-offs: Sample Size, Speed, and ComplexityLesson 1949 — Anticipating Questions: Building in AppendicesLesson 2024 — Code Review Best PracticesLesson 2129 — Edge Cases Dominate the Problem
- Edge Color/Style
- Differentiate relationship types with color or dashed/solid lines (friend vs.
- Lesson 1319 — Styling Network Visualizations
- Edge Weight
- Make thicker lines represent stronger relationships (more messages, higher correlation).
- Lesson 1319 — Styling Network Visualizations
- Education
- Does the average test score in your classroom differ from the district standard of 75?
- Lesson 351 — When to Use a One-Sample t-Test
- Education and income
- Does higher education lead to higher income, or do wealthier families afford better education?
- Lesson 1424 — Reverse Causality
- Educational records
- reflecting unequal access to opportunities
- Lesson 1881 — Historical and Societal Bias
- Effect
- Compresses large values more than small ones, pulling in the right tail of skewed distributions.
- Lesson 592 — Common Transformations: Log, Square Root, Reciprocal
- Effect size
- How different the true parameter is from the null hypothesis value
- Lesson 335 — Calculating Type II Error Probability (Beta)Lesson 341 — Effect Size and PowerLesson 343 — Calculating Power for Common TestsLesson 344 — Power Analysis in Study DesignLesson 384 — What is Effect Size?Lesson 405 — Sample Size and Power for Proportion TestsLesson 413 — Effect Size and Practical SignificanceLesson 446 — Power and Sample Size for ANOVA (+3 more)
- Effect Size (δ)
- The minimum detectable difference you care about—determined by your MDE (Minimum Detectable Effect).
- Lesson 1496 — The Four Parameters of Sample Size Calculation
- Effective sample size (ESS)
- Estimates how many independent samples you truly have after accounting for autocorrelation
- Lesson 1592 — Burn-in, Thinning, and Convergence Diagnostics
- Ego-network analysis
- Model and measure the spillover explicitly
- Lesson 1527 — Ignoring Network Effects
- elbow method
- on within-cluster variance or **silhouette scores** to quantify segment quality at each cut point.
- Lesson 1706 — Hierarchical Clustering for SegmentationLesson 1708 — Choosing the Number of Segments
- Electricity demand
- might show daily (24-hour) *and* weekly (168-hour) patterns
- Lesson 746 — Choosing Seasonal Period
- Elevation
- How high above (or below) the horizontal plane your camera sits, measured in degrees.
- Lesson 1326 — Viewing Angles and Projection Types
- Eliminates bottlenecks
- Shared resources (like shared memory or a central database) become traffic jams as you scale.
- Lesson 1771 — Shared-Nothing Architecture
- Eliminates sign problems
- Squaring makes all errors positive, so they can't cancel each other out.
- Lesson 517 — The Least Squares Criterion
- ELT flips this order
- Extract, Load, *then* Transform.
- Lesson 1816 — What is ELT? Extract, Load, Transform Explained
- embarrassingly parallel
- (easy to split across machines), while others require extensive data shuffling or coordination.
- Lesson 1786 — Data Processing Patterns Best Suited for SparkLesson 1790 — What is Dask and When to Use It
- Emotional connection
- means helping your audience *feel* the human stakes behind the numbers.
- Lesson 1941 — Emotional Connection Without Manipulation
- Emphasizes larger errors
- A point that's 4 units away contributes 16 to the sum, while one that's 2 units away contributes only 4.
- Lesson 517 — The Least Squares Criterion
- Emphasizes smaller values
- – the square root compresses large values and stretches small ones, making patterns clearer
- Lesson 560 — Scale-Location Plot (Spread-Location Plot)
- Empirical Rule
- is your quick mental map for normal distributions.
- Lesson 171 — The 68-95-99.7 Rule (Empirical Rule)
- Employee-manager relationships
- An `employees` table where each employee has a `manager_id` pointing to another employee in the same table
- Lesson 945 — Introduction to Self-Joins
- Enable comparison
- You can compare typical values across groups ("Team A averages 15 sales per week vs.
- Lesson 38 — What is Central Tendency?
- Enable step-by-step execution
- Others can run each cell independently to verify your work or experiment with modifications
- Lesson 1982 — Literate Programming with Notebooks
- Enables decision-making
- You can calculate probabilities, credible intervals, and expected values directly from it
- Lesson 1537 — The Posterior Distribution
- Enclosure
- Elements surrounded by a boundary are perceived as a group.
- Lesson 1236 — Gestalt Principles in Visualization
- End-to-end integration tests
- Run your pipeline on sample data in a test environment to verify the execution order produces expected results.
- Lesson 1846 — Testing and Validating Dependency Graphs
- Ends at 1
- F(∞) = 1 (all probability is accumulated)
- Lesson 157 — Cumulative Distribution Functions (CDFs) for Continuous Variables
- Enforces Status Checks
- Automated tests, linters, or CI/CD pipelines must pass before merging.
- Lesson 2027 — Protecting Branches and Required Reviews
- Engagement Rate
- = (Likes + Comments + Shares) / Impressions × 100
- Lesson 1631 — Social Media Metrics: DAU/MAU and Content Engagement
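The formula translates directly into a small helper. The function name and the post counts below are hypothetical.

```python
def engagement_rate(likes, comments, shares, impressions):
    """Engagement rate as a percentage of impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return (likes + comments + shares) / impressions * 100


# Hypothetical counts for a single post.
rate = engagement_rate(likes=120, comments=30, shares=50, impressions=4000)
print(f"{rate:.1f}%")
```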
- Engagement scoring
- might show that 20% of users generate 80% of value (power users)
- Lesson 1701 — What is Customer Segmentation?
- Engaging storytelling
- Making trends memorable and intuitive
- Lesson 1306 — Animation and Time-Based Transitions
- Engineers and Technical Implementers
- need:
- Lesson 1951 — Understanding Stakeholder Priorities and Constraints
- Enhancements (additional layers)
- Smoothing lines, confidence bands, annotations
- Lesson 1347 — Understanding Layers in ggplot2
- Enrollments
- (StudentID, CourseID, StudentName, CourseName, Grade)
- Lesson 1065 — Second Normal Form (2NF)
- Ensure immutability
- Changing a primary key causes cascading headaches
- Lesson 1050 — Choosing Effective Primary Keys
- Enter
- or **Space** (to activate), and **arrow keys** (for fine control).
- Lesson 1253 — Interactive Accessibility: Keyboard Navigation
- Environment details
- Which worker, timestamp, resource usage
- Lesson 1851 — Error Logging and Notifications
- Environment differences
- Code runs differently on different machines (Code and Environment Management)
- Lesson 30 — The Reproducibility Crisis and Solutions
- Environment files
- that list all software dependencies and versions
- Lesson 29 — Code and Environment Management
- Environment management
- means recording *exactly* which software versions you used.
- Lesson 29 — Code and Environment Management
- Environment variables
- and configurations
- Lesson 2038 — What is Environment Management and Why It Matters
- Environment-driven iteration
- External changes (new regulations, market shifts, updated systems) force you to revisit earlier decisions.
- Lesson 2092 — Iteration and Feedback Loops in Practice
- Environment-specific
- Your local paths won't work on a colleague's machine or cloud server
- Lesson 2072 — Configuration Files vs Hard-Coded Values
- environment.yml
- Lesson 2043 — Creating and Exporting Environment SpecificationsLesson 2044 — Recreating Environments from Specifications
- Epidemiological data
- that helps track disease spread can reveal individuals' health status or movements, enabling discrimination or persecution.
- Lesson 1919 — Defining Dual Use in Data Science
- Equal information
- Posterior sits roughly halfway between
- Lesson 1567 — Posterior Mean as Weighted Average
- Equal opportunity
- Among qualified individuals, do groups succeed at similar rates?
- Lesson 1884 — Detecting Bias in Your Data
- Equal to minimum
- `WHERE value = (SELECT MIN(value) FROM table)`
- Lesson 964 — Subqueries with Aggregate Functions
- equal variances
- Lesson 361 — Pooled Variance t-TestLesson 398 — Choosing Between Parametric and Non-Parametric TestsLesson 447 — Conducting One-Way ANOVA in Practice
- Equal-area
- Preserves area ratios but distorts shapes
- Lesson 1308 — Geographic Data Types and Coordinate Systems
- Equal-width vs equal-frequency
- Different bin strategies tell different stories
- Lesson 1245 — Misleading Aggregations and Binning
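The contrast is easiest to see on skewed data. The sketch below bins the same values both ways using only the standard library; the helper names and the data are illustrative assumptions.

```python
def equal_width_edges(values, bins):
    """Bin edges that split the range [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]


def count_per_bin(values, edges):
    """Count values per bin; the last bin also includes the maximum."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if v < edges[i + 1] or i == len(counts) - 1:
                counts[i] += 1
                break
    return counts


def equal_frequency_groups(values, bins):
    """Split the sorted values into groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // bins : (i + 1) * n // bins] for i in range(bins)]


# Hypothetical right-skewed data with two large values.
data = [1, 2, 2, 3, 3, 4, 5, 50, 100]
width_counts = count_per_bin(data, equal_width_edges(data, 3))
freq_groups = equal_frequency_groups(data, 3)

print(width_counts)
print(freq_groups)
```

Equal-width bins pile most observations into the first bin, while equal-frequency bins spread them evenly but hide how far apart the largest values are.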
- Equality checks
- Lesson 960 — Subqueries Returning Single Values
- Equality searches
- (`WHERE id = 100`) jump straight to the target
- Lesson 1079 — B-Tree Indexes: Structure and Mechanics
- Erasure isn't absolute
- GDPR includes exemptions when you must keep data:
- Lesson 1909 — Right to Erasure and Data Retention Policies
- Ergodicity
- Long-run averages converge to expectations under the stationary distribution
- Lesson 1589 — Markov Chains: The Foundation of MCMC
- Error bars
- attach vertical or horizontal lines to a point (often a mean) showing ±1 standard deviation, ±2 SE, or confidence intervals.
- Lesson 55 — Visualizing SpreadLesson 1244 — Omitting Uncertainty and Variability
- Error classification
- Transient network issue or data quality problem?
- Lesson 1851 — Error Logging and Notifications
- Error handling
- Logs failures, sends alerts, and implements retry logic
- Lesson 1822 — What is a Data Pipeline?
- Error measures
- For numerical predictions, how far off were your guesses on average?
- Lesson 14 — Model Evaluation and Validation
- Escalation patterns
- Moving from chat to phone to "speak to a manager"
- Lesson 1673 — Leading Indicators of Churn
- Establish coordination protocols
- When do teams need to align before taking action?
- Lesson 1625 — Cross-Functional Metric Dependencies
- estimate
- population parameters.
- Lesson 229 — Defining Samples and StatisticsLesson 607 — Confidence Intervals for CoefficientsLesson 1449 — Coarsened Exact Matching (CEM)
- Estimate densities
- `stat_density()` creates smooth distribution curves
- Lesson 1352 — Statistical Transformations with stat_* Layers
- Eta-squared
- is the most straightforward effect size for ANOVA.
- Lesson 445 — Effect Size: Eta-Squared and Omega-Squared
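Eta-squared is the between-group sum of squares divided by the total sum of squares. The sketch below computes it with the standard library only; the function name and the sample groups are illustrative assumptions.

```python
from statistics import mean

def eta_squared(groups):
    """Return SS_between / SS_total for a list of groups (lists of numbers)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group variation: group size times squared distance of
    # each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Total variation: squared distance of every value from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between / ss_total

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up data
eta2 = eta_squared(groups)
print(round(eta2, 4))
```

Here the between-group sum of squares is 26 and the total is 32, so about 81% of the variation is associated with group membership.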
- Ethical collection
- Was this data gathered with people's informed consent?
- Lesson 36 — Responsible Data Sourcing and Use
- Ethical violations
- if you mishandle sensitive data or misrepresent uncertainty
- Lesson 34 — Recognizing Boundaries of Competence
- ETL
- stands for **Extract, Transform, Load**—a traditional data integration pattern that moves data from source systems into a data warehouse or analytics platform.
- Lesson 1815 — What is ETL? Extract, Transform, Load Explained
- Etsy
- Gross Merchandise Sales (GMS) — captures value for both buyers (finding unique items) and sellers (making sales).
- Lesson 1606 — Examples of North Star Metrics by Industry
- Evaluate trade-offs
- Parametric tests have higher power when assumptions hold; non-parametric tests are safer when assumptions are questionable
- Lesson 398 — Choosing Between Parametric and Non-Parametric Tests
- Evaluates segments
- For each possible segmentation (sets of change-points), calculates a cost based on how well each segment fits the data
- Lesson 1416 — PELT Algorithm: Pruned Exact Linear Time
- event
- is simply a collection of one or more outcomes from your sample space.
- Lesson 78 — Events as Subsets of the Sample SpaceLesson 803 — Defining the Event and Time OriginLesson 835 — Customer Churn Prediction with Survival AnalysisLesson 840 — Loan Default Timing and Credit Risk
- Event logs
- specific actions like "clicked_ad", "opened_email", "viewed_product"
- Lesson 1719 — The Customer Journey and Touchpoints
- Events
- subjects who experienced the event (death, churn, failure) at that exact time
- Lesson 812 — Handling Event Times and CensoringLesson 1679 — Defining Funnel Steps and Events
- Every selection is independent
- picking one individual doesn't affect who else gets picked
- Lesson 234 — Simple Random Sampling
- Everything-to-Target
- Always examine relationships *with* your target variable first.
- Lesson 1210 — Relationship Exploration: Correlation and Association
- evidence
- (your observations), and produce **posterior beliefs** (your updated understanding).
- Lesson 116 — From Bayes' Theorem to Bayesian InferenceLesson 1536 — The Evidence (Marginal Likelihood)Lesson 1546 — The Role of the Normalizing Constant
- Evidence is strong
- Overwhelming data swamps your initial belief
- Lesson 115 — Prior Sensitivity Analysis
- Evidence is weak
- A mildly positive test result won't overcome a very skeptical or very confident prior
- Lesson 115 — Prior Sensitivity Analysis
- Evidence package
- Lesson 1946 — Supporting Your Claims with Evidence
- EWMA
- applies weighted averaging where recent observations matter more than older ones.
- Lesson 1403 — CUSUM and EWMA Charts
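One common way to write the EWMA recursion is z_t = λ·x_t + (1 − λ)·z_{t−1}. The sketch below assumes that form; the parameter name `lam` and the choice to start from the first observation (rather than a process target) are assumptions for illustration.

```python
def ewma(values, lam=0.2, start=None):
    """Exponentially weighted moving average.

    z_t = lam * x_t + (1 - lam) * z_{t-1}
    `start` is the initial z; it defaults to the first observation.
    """
    z = values[0] if start is None else start
    out = []
    for x in values:
        z = lam * x + (1 - lam) * z
        out.append(z)
    return out

smoothed = ewma([10, 10, 10, 20], lam=0.5)
print(smoothed)
```

A larger `lam` makes the statistic react faster to the jump at the end of the series; a smaller one smooths it more heavily.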
- Exact duplicate detection
- Find rows where *all* columns match exactly—these are often accidental copies from data loading errors.
- Lesson 1154 — Uniqueness and Duplication Checks
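A minimal sketch of exact duplicate detection, assuming each row is represented as a tuple so that rows are equal only when all columns match. The sample rows are made up.

```python
from collections import Counter

rows = [
    ("a1", "Ann", 30),
    ("b2", "Bob", 25),
    ("a1", "Ann", 30),  # exact copy of the first row
    ("a1", "Ann", 31),  # same id, different age: not an exact duplicate
]

# Count identical tuples; anything seen more than once is an exact duplicate.
counts = Counter(rows)
exact_duplicates = {row: n for row, n in counts.items() if n > 1}
print(exact_duplicates)
```

The last row shares an ID with the first but differs in one column, so it is not flagged here; catching it would need a separate uniqueness check on the ID column.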
- Exact p-values
- use the true, theoretical probability distribution without approximation.
- Lesson 322 — Exact vs Asymptotic P-Values
- Exact pinning
- guarantees that everyone running your code uses identical package versions, maximizing reproducibility.
- Lesson 2050 — Pinning Versions vs Flexible Ranges
- Exact probability
- uses the binomial PMF directly.
- Lesson 130 — Calculating Binomial ProbabilitiesLesson 431 — When Chi-Squared Assumptions Fail
- exactly
- the same measured characteristics as treated units.
- Lesson 1446 — Exact MatchingLesson 1993 — The Three States: Working Directory, Staging, Repository
- Exactly 4 accept
- Calculate P(X=4) directly with the PMF
- Lesson 130 — Calculating Binomial Probabilities
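The binomial PMF is P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The sketch below evaluates it for k = 4; the values n = 10 and p = 0.5 are assumptions for illustration, since the entry does not state them.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical setting: 10 offers, each accepted with probability 0.5.
p_exactly_4 = binomial_pmf(4, n=10, p=0.5)
print(p_exactly_4)
```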
- exactly the same
- as taking the Pearson correlation coefficient between X and Y and squaring it.
- Lesson 534 — R-Squared vs Correlation SquaredLesson 647 — Impact on Model Results and Reporting
- Exactly two outcomes
- Success or failure, no middle ground
- Lesson 123 — Bernoulli Trial Definition and Properties
- Examine edge cases
- Look for missing groups entirely—this reveals coverage error.
- Lesson 250 — Strategies for Bias Detection and Mitigation
- Examine the coefficient magnitude
- in its real-world units
- Lesson 609 — Practical vs Statistical Significance
- Examine transformations
- Review each transformation step—was a join dropping records?
- Lesson 1870 — Root Cause Analysis for Quality Issues
- Example 1
- In educational research, improving test scores by d = 0.
- Lesson 386 — Effect Size Interpretation Guidelines
- Example 2
- In pharmaceutical trials, a pain medication with d = 0.
- Lesson 386 — Effect Size Interpretation Guidelines
- Example 2: Video Autoplay
- Lesson 1521 — Risks of Optimizing for Surrogates
- Example 3
- In physics experiments measuring fundamental constants, even d = 0.
- Lesson 386 — Effect Size Interpretation Guidelines
- Example analogy
- A company's sales look higher in stores with fewer employees.
- Lesson 1194 — Simpson's Paradox and ConfoundingLesson 1937 — The Hero's Journey: Making Your Audience the Hero
- Example context
- Lesson 265 — Using Standard Error in Practice
- Example handling
- Lesson 1100 — Handling NULL Values and Data Types
- Example interpretation
- Lesson 730 — Interpreting PACF Plots
- Example intuition
- Imagine you're testing if a coin is fair (H₀: p = 0.
- Lesson 335 — Calculating Type II Error Probability (Beta)Lesson 1513 — Always-Valid Inference and Confidence Sequences
- Example output
- Lesson 1008 — RANK(): Handling Ties with Gaps
- Example pattern
- Calculating daily revenue across millions of transactions.
- Lesson 1786 — Data Processing Patterns Best Suited for SparkLesson 1865 — Data Quality Checks in Pipelines
- Example scenario
- You want to find average order values by region, but only for orders placed in 2023.
- Lesson 895 — Combining WHERE and GROUP BYLesson 898 — HAVING Clause FundamentalsLesson 1600 — Business Examples: Revenue vs Pipeline
- Example scenarios
- Lesson 36 — Responsible Data Sourcing and Use
- Example structure
- Lesson 891 — Single Column GroupingLesson 1977 — Design Principles for DashboardsLesson 2071 — Modular Code: Functions and Scripts
- Example use case
- Instead of joining `orders` and `products` and summing totals every time someone checks monthly sales, create a materialized view that stores those monthly totals.
- Lesson 1076 — Materialized Views and Summary Tables
- Example violation
- In a customer churn study, if high-risk customers are more likely to stop using your product *and* more likely to unsubscribe from your tracking emails (causing censoring), your results will be biased.
- Lesson 821 — Assumptions of the Log-Rank Test
- Example with numbers
- Lesson 859 — IN Operator for Multiple Values
- Example with text values
- Lesson 859 — IN Operator for Multiple Values
- Excel
- adds formatting overhead and reads even slower than CSV, especially with multiple sheets.
- Lesson 1133 — Performance Considerations Across Formats
- Excess Kurtosis (Pearson's)
- Sometimes the calculation stops before the final "-3" adjustment, which yields Pearson's (raw) kurtosis rather than excess kurtosis.
- Lesson 67 — Calculating Kurtosis
- Excessive grid lines
- Too many or overly prominent gridlines
- Lesson 1246 — Visual Clutter and Chartjunk
- exchangeability
- if the groups had swapped assignments, we'd expect the same average outcome.
- Lesson 1438 — Ensuring Balance Between GroupsLesson 1443 — Observational Studies vs Randomized Experiments
- Excluding multiple values
- Lesson 868 — The NOT Operator
- Excluding pattern matches
- Lesson 868 — The NOT Operator
- Executes the subquery first
- and gets back multiple rows (each with one value)
- Lesson 961 — IN Operator with Subqueries
- Execution order chaos
- Tasks might run simultaneously when they should be sequential, causing some to fail because required data isn't ready yet.
- Lesson 1840 — What is Dependency Management in Pipelines?
- Execution stage
- Lesson 1857 — Logging Best Practices
- Executive Summary
- (1 page): Key findings, recommendations, business impact
- Lesson 1966 — Report Structure and Executive Summary
- Executive/Business stakeholders
- Lead with directional findings and practical significance.
- Lesson 1953 — Adjusting Statistical Depth by Audience
- Executives
- making strategic decisions need clean, simple charts that communicate the main point at a glance: think bar charts showing three key metrics or a single trend line.
- Lesson 1954 — Tailoring Visualizations to Audience Needs
- Executives and Business Leaders
- focus on:
- Lesson 1951 — Understanding Stakeholder Priorities and Constraints
- Executor
- Determines *how* tasks run (locally, distributed, etc.)
- Lesson 1833 — Introduction to Apache Airflow
- Exercise and health
- Do healthier people exercise more, or does exercise make people healthier?
- Lesson 1424 — Reverse Causality
- Existing knowledge
- What do subject-matter experts already know?
- Lesson 1168 — Understanding Domain Context
- EXISTS
- stops searching as soon as it finds *any* matching row.
- Lesson 985 — EXISTS vs IN: Performance Considerations
- Exogeneity
- means that your predictor variable X is determined *outside* the model and is completely independent of the error term ε.
- Lesson 553 — Exogeneity: X Must Be Independent of Errors
- Expanding funnel
- Residuals start tight on the left and fan out wider on the right.
- Lesson 559 — Detecting Heteroscedasticity (Non-Constant Variance)
- Expectations and Success Criteria
- Document what each stakeholder considers "success."
- Lesson 2101 — Identifying and Mapping Stakeholders
- Expected
- The count you would expect if the null hypothesis were true (which you calculated in the previous lesson)
- Lesson 417 — The Chi-Squared Test Statistic Formula
- expected frequencies
- across all categories.
- Lesson 414 — Introduction to Chi-Squared Goodness of Fit TestLesson 416 — Calculating Expected FrequenciesLesson 423 — Contingency Tables and Expected Frequencies
- Expected Frequency Requirement
- Lesson 426 — Assumptions and Sample Size Requirements
- Expected Impact
- Prevent 20-30 account cancellations/month ($50K-75K MRR saved)
- Lesson 1948 — The Recommendation Slide: Making It Actionable
- Expected loss
- answers: "If I pick this variant and it's wrong, how much will I lose on average?"
- Lesson 1584 — Expected Loss and Decision MakingLesson 1586 — Multi-Armed Bandit ConnectionsLesson 1587 — Bayesian A/B Testing in Practice
- Expected loss threshold
- Stop when the expected loss of choosing variant B is below $X
- Lesson 1585 — Early Stopping in Bayesian Tests
- Expected outputs
- What files or results should appear
- Lesson 1989 — Best Practices for Sharing Reproducible Reports
- Expected uniqueness violated
- An ID column contains repeats
- Lesson 1154 — Uniqueness and Duplication Checks
- expected value
- (often written as E(X) or μ) is the long-run average you'd expect if you could repeat a random process infinitely many times.
- Lesson 121 — Expected Value of Discrete Random VariablesLesson 122 — Variance and Standard Deviation of Discrete Random VariablesLesson 125 — Bernoulli Mean and VarianceLesson 147 — Expected Value of Discrete Random VariablesLesson 152 — Decision Making Under UncertaintyLesson 255 — Expected Value of Sample Statistics
- Experiment
- with different sample sizes, population shapes, or statistics
- Lesson 259 — Simulating Sampling DistributionsLesson 498 — Bradford Hill Criteria for Causation
- Experiment snapshots
- `exp-baseline-xgboost`, `exp-feature-engineering-v3`
- Lesson 2037 — Tagging Releases and Experiment Snapshots
- Experimental tracking nightmare
- Hard to remember which parameter combinations you've tested
- Lesson 2072 — Configuration Files vs Hard-Coded Values
- Experimentation
- Create branches to test new approaches without breaking working code
- Lesson 1990 — What is Version Control and Why Git?Lesson 2005 — What are Branches and Why Use Them?
- Explanatory visualization
- is your public communication tool.
- Lesson 1213 — Exploratory vs Explanatory Visualization
- Explicit transactions
- You manually control when a group of statements is committed or rolled back.
- Lesson 1111 — Autocommit Mode vs Explicit Transactions
- Exploitation
- playing the machine you currently believe is best
- Lesson 1586 — Multi-Armed Bandit Connections
- Exploration
- trying different machines to learn which pays best
- Lesson 1586 — Multi-Armed Bandit Connections
- Exploration & Analysis
- Lesson 9 — The Data Science Lifecycle Overview
- Exploratory Analysis
- means investigating your data to discover patterns, spot anomalies, and understand relationships between different pieces of information.
- Lesson 13 — Exploratory Analysis and ModelingLesson 38 — What is Central Tendency?Lesson 1395 — When to Use Grubbs' TestLesson 1727 — Linear Attribution Model
- Exploratory data analysis
- where you need to see data distributions, plot trends, and test hypotheses interactively
- Lesson 2074 — Notebooks vs Scripts: When to Use Each
- Exploratory visualization
- is your private investigation tool.
- Lesson 1213 — Exploratory vs Explanatory Visualization
- Explore constraints explicitly
- Ask about:
- Lesson 2102 — Understanding Stakeholder Goals and Constraints
- Exploring data
- You're getting familiar with a new table and want to see what's in it
- Lesson 851 — Selecting All Columns with Asterisk
- Exponential
- works when failure or arrival rates are constant over time (memoryless property).
- Lesson 193 — Choosing Between Distributions in PracticeLesson 664 — What is the Exponential Family of Distributions?
- Exponential complexity
- With multiple attributes, the number of subgroups grows quickly
- Lesson 1893 — Intersectionality in Fairness
- Exponential decay
- Sharp initial drop, then gradual decline (common in digital marketing)
- Lesson 1639 — Time Windows and Attribution Decay
- exponential distribution
- flips this around—it models *how long you wait* until the next event occurs.
- Lesson 164 — The Exponential DistributionLesson 182 — Special Cases: Exponential and Chi-Squared
- exponential family
- is a special class of probability distributions that can all be written in the same mathematical form.
- Lesson 664 — What is the Exponential Family of Distributions?Lesson 666 — Natural Parameter and Sufficient StatisticsLesson 690 — The Poisson Distribution as a GLM
- Exponential smoothing
- uses a declining weight scheme controlled by parameter `α`.
- Lesson 764 — Exponential Smoothing vs Moving Averages
- Exponentiate the bounds
- Transform to odds ratio scale:
- Lesson 685 — Confidence Intervals for Odds Ratios
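A minimal sketch of this step, assuming a Wald-style interval built on the log-odds scale from a coefficient and its standard error. The numbers and the function name are hypothetical.

```python
from math import exp

def odds_ratio_ci(beta, se, z=1.96):
    """Return (odds ratio, lower bound, upper bound).

    The interval is formed on the log-odds scale, then each bound
    is exponentiated to move to the odds ratio scale.
    """
    lower, upper = beta - z * se, beta + z * se
    return exp(beta), exp(lower), exp(upper)

# Hypothetical logistic regression coefficient and standard error.
or_hat, or_low, or_high = odds_ratio_ci(beta=0.7, se=0.2)
print(or_hat, or_low, or_high)
```

The resulting interval is not symmetric around the odds ratio, because exponentiation stretches the upper side more than the lower side.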
- Expose data issues early
- Basic analysis quickly reveals data quality problems
- Lesson 2110 — The Minimum Viable Analysis (MVA)
- Extended
- Multiple months if seasonal patterns matter to your metric
- Lesson 1484 — Duration and Timing Considerations
- Extended evidence
- Lesson 1971 — Appendices and Technical Details
- External conditions
- A task queries an API to see which regions have new data, then dynamically creates one downstream task per region.
- Lesson 1844 — Dynamic Dependencies
- External factors
- (business closure, budget cuts)
- Lesson 1675 — Churn Attribution and Root Cause AnalysisLesson 1741 — Controlling for Seasonality and External Factors
- External task dependencies
- explicitly declare that a task in Pipeline A depends on a task in Pipeline B.
- Lesson 1845 — Cross-Pipeline Dependencies
- External validity
- asks: *Do these results apply beyond your study's specific conditions?*
- Lesson 1441 — Internal vs External Validity
- Extra transparency
- about consequences of declining
- Lesson 1918 — Special Populations and Vulnerable Groups
- Extract and lightly transform
- sensitive data (masking PII) before loading to comply with regulations
- Lesson 1821 — Hybrid Approaches and Modern Data Stacks
- Extract the Irregular (I)
- Subtract both trend and seasonal components from the original: `I = Y - T - S`.
- Lesson 744 — Classical Decomposition Methods
- Extract the timezone
- Determine what zone a timestamp uses
- Lesson 1042 — Working with Timestamps and Time Zones
- Extract the Trend (T)
- Apply a moving average to smooth out the data.
- Lesson 744 — Classical Decomposition Methods
- Extract, Transform, Load
- a traditional data integration pattern that moves data from source systems into a data warehouse or analytics platform.
- Lesson 1815 — What is ETL? Extract, Transform, Load Explained
- Extracting
- data from operational databases, APIs, or files
- Lesson 1816 — What is ELT? Extract, Load, Transform Explained
- Extraction tools
- (Fivetran, Airbyte) that load raw data with minimal transformation
- Lesson 1821 — Hybrid Approaches and Modern Data Stacks
- Extreme outliers
- Even with larger samples, severe outliers can distort the mean and inflate the standard error, making your t-statistic misleading.
- Lesson 390 — When Parametric Tests Fail: Violations of Assumptions
- Extreme predictions
- When estimating values far from your data's center
- Lesson 550 — Normality of Residuals
F
- F < 10
- Weak instrument—your second-stage estimates may be severely biased
- Lesson 1467 — Testing Instrument Strength and Validity
- F-statistic
- and **p-value** in the "Between Groups" row tell you whether your groups differ significantly.
- Lesson 444 — The ANOVA TableLesson 464 — Main Effects in Two-Way ANOVALesson 1467 — Testing Instrument Strength and Validity
- F-statistic is large
- and the **p-value is small** (typically < 0.
- Lesson 627 — The F-Test for Model Comparison
- F-test
- (covered in lesson 363), or simply inspect side-by-side boxplots.
- Lesson 379 — The Assumption of Equal Variances (Homoscedasticity)Lesson 654 — Testing Interaction Significance
- F-test for model comparison
- gives you a statistical answer.
- Lesson 627 — The F-Test for Model Comparison
- F(x)
- , answers: "What's P(X ≤ x)?"
- Lesson 120 — Cumulative Distribution Functions (CDF) for Discrete Variables
- Facebook/Meta
- Monthly Active Users (MAU) or Daily Active Users (DAU) — engagement captures the platform's value through connection and content sharing.
- Lesson 1606 — Examples of North Star Metrics by Industry
- Faceted plots
- split your data by a third variable, showing if the pattern changes across groups
- Lesson 1195 — Interaction Effects Between Variables
- Faceting
- means creating multiple small charts—one per category—arranged in a grid.
- Lesson 1193 — Conditional Distributions and FacetingLesson 1289 — Regression Plots: regplot and lmplotLesson 1356 — What Are Facets and Small Multiples?
- Facets
- Small multiples—splitting data into separate subplots
- Lesson 1340 — The Seven Layers of GrammarLesson 1362 — When to Use Facets vs. Other Approaches
- Facets work best when
- Lesson 1362 — When to Use Facets vs. Other Approaches
- fact table
- (containing measurements like sales amounts, quantities, or counts) connects to multiple **dimension tables** (containing descriptive attributes like customer names, product details, or dates).
- Lesson 956 — Star Schema JoinsLesson 1808 — Star Schema and Fact Tables
- Factor impact
- Do email campaigns accelerate conversion compared to ads?
- Lesson 839 — Time-to-Conversion in Marketing Funnels
- Fail to Reject H₀
- Correct decision when H₀ is true; Type II Error (β) when H₀ is false
- Lesson 338 — What is Statistical Power?Lesson 358 — Worked Example: One-Sample t-Test in Practice
- Failed jobs create duplicates
- Rerunning after a crash might insert the same records twice
- Lesson 1847 — What is Idempotency?
- Failure
- Rolls back automatically if an exception occurs
- Lesson 1114 — Transaction Context Managers in Python
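The rollback-on-exception behaviour can be seen with the standard library's `sqlite3` module, where a connection used as a context manager commits on success and rolls back if the block raises. This is a sketch; the `accounts` table and the simulated failure are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'"
        )
        # Simulated crash before the matching credit to bob is written.
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

# The debit was rolled back, so alice's balance is unchanged.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
conn.close()
print(balance)
```

Note that in `sqlite3` the `with conn:` block manages the transaction only; it does not close the connection.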
- Fair dice or spinners
- Are all six faces equally likely?
- Lesson 421 — Applications: Uniform, Genetic Ratios, and Distributions
- fairness
- in how credit flows, enabling better resource allocation and learning.
- Lesson 1643 — Building Attribution FrameworksLesson 1878 — What is Bias in Data?
- Fairness audits
- Test model outcomes across demographic groups
- Lesson 1883 — Protected Classes and Proxy Variables
- Fairness through awareness
- takes the opposite approach: explicitly include sensitive attributes so you can measure and correct for disparate impact.
- Lesson 1892 — Fairness Through Unawareness vs Awareness
- Fairness through unawareness
- sounds intuitive—if the model can't see protected attributes, it can't discriminate, right?
- Lesson 1892 — Fairness Through Unawareness vs Awareness
- False Discovery Rate (FDR)
- Controls the expected proportion of false positives among all significant results
- Lesson 512 — Testing Significance in Correlation MatricesLesson 1505 — False Discovery Rate (FDR)Lesson 1506 — Benjamini-Hochberg Procedure
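One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure: sort the p-values, find the largest rank k with p_(k) ≤ (k / m) · q, and reject the k smallest. The sketch below implements that rule; the function name and the p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if p_values[i] <= rank / m * q:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(p_values, q=0.05)
print(rejected)
```

With these made-up p-values only the two smallest pass their thresholds, even though five of them are below 0.05 individually.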
- false negative
- or Type II error.
- Lesson 1495 — Power Analysis FundamentalsLesson 1529 — Running Underpowered Tests
- False Negatives (FN)
- Missed actual change-points
- Lesson 1418 — Evaluating Change-Point Detection Methods
- False Positive Rate
- You should see "significant" results only at your chosen alpha level (e.
- Lesson 1483 — Pre-Experiment Validation
- false positives
- for stationarity.
- Lesson 717 — KPSS TestLesson 1518 — The Relationship Between Surrogate and Business Metrics
- False Positives (FP)
- Flagged changes where none exist (false alarms)
- Lesson 1418 — Evaluating Change-Point Detection Methods
- False precision
- The mathematical convenience might tempt you to use an inappropriate prior
- Lesson 1555 — Advantages and Limitations of Conjugate Priors
- Familiar API
- If you know SQL or pandas, DataFrames feel natural
- Lesson 1778 — DataFrames and Spark SQL Basics
- family-wise error rate
- (the probability of making *any* Type I error across all tests) balloons.
- Lesson 337 — Error Rates in Practice: Multiple TestingLesson 1501 — The Multiple Testing Problem
- Family-Wise Error Rate (FWER)
- is the probability of making **at least one false discovery** (Type I error) across a "family" of hypothesis tests conducted simultaneously.
- Lesson 1502 — Family-Wise Error Rate (FWER)Lesson 1505 — False Discovery Rate (FDR)
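When the tests are independent and each uses the same α, the FWER is 1 − (1 − α)^m for m tests. The sketch below evaluates that formula; independence is an assumption, and the function name is illustrative.

```python
def fwer(alpha, m):
    """Probability of at least one Type I error across m independent tests."""
    return 1 - (1 - alpha) ** m

fwer_1 = fwer(0.05, 1)    # a single test: just alpha
fwer_20 = fwer(0.05, 20)  # twenty tests at alpha = 0.05
print(round(fwer_1, 4), round(fwer_20, 4))
```

Twenty independent tests at α = 0.05 give roughly a 64% chance of at least one false discovery.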
- Fan-in
- Multiple tasks must complete before one starts
- Lesson 1843 — Declaring Dependencies in Orchestration Tools
- Fan-out
- One task triggers multiple parallel tasks
- Lesson 1843 — Declaring Dependencies in Orchestration Tools
- Fast-moving funnel
- Users zip through in minutes (smooth experience)
- Lesson 1681 — Time-Based Funnel Analysis
- Faster payback
- means more cash to reinvest in growth
- Lesson 1757 — Payback Period: Definition and Importance
- Favor parsimony
- When models perform similarly, choose the simpler one (Occam's Razor)
- Lesson 633 — Practical Model Selection Strategy
- Feather
- is a lightweight columnar format optimized for speed.
- Lesson 1129 — Parquet and Feather: Columnar Formats
- Feature adoption
- without connecting to retention or expansion revenue
- Lesson 1616 — Metrics Divorced from RevenueLesson 1646 — Defining Cohort Start EventsLesson 1696 — Feature Adoption and Usage Frequency
- Feature bloat
- Models train slower and become harder to explain when filled with irrelevant features
- Lesson 2135 — Dead Experimental Code and Feature Sprawl
- Feature development
- Build new model features without disrupting production code
- Lesson 2005 — What are Branches and Why Use Them?
- Feature engineering needs
- "Create interaction term between X and Y"
- Lesson 1212 — EDA Summary Documentation and Next Steps
- Feature engineering repeats
- New behaviors may require new features entirely
- Lesson 2128 — Data Distribution Shifts Frequently
- Feature requests
- Stakeholders inevitably want new views or filters
- Lesson 1979 — Maintenance and Sustainability Considerations
- Feature selection discipline
- After adding features, measure their importance and remove low-contributors before the next iteration.
- Lesson 2135 — Dead Experimental Code and Feature Sprawl
- Feature sprawl
- happens when you keep accumulating features for models without ever pruning the ones that don't contribute value.
- Lesson 2135 — Dead Experimental Code and Feature Sprawl
- Features
- Early behavior metrics like days to first purchase, first-order value, login frequency in week one, number of products viewed, engagement with onboarding emails
- Lesson 1668 — Predictive LTV Models
- Feedback loops
- A biased model's decisions become tomorrow's training data (e.
- Lesson 1882 — Algorithmic Amplification of BiasLesson 1923 — Algorithmic Amplification of Harm
- fewer samples
- you need.
- Lesson 405 — Sample Size and Power for Proportion TestsLesson 1515 — Trade-offs: Sample Size, Speed, and Complexity
- Fewer Type I errors
- You're less likely to reject H₀ when it's actually true
- Lesson 342 — Alpha Level Trade-offs
- Field conventions
- (what does your discipline expect?)
- Lesson 324 — Common Significance Levels: 0.05, 0.01, and 0.10
- Field standards
- Psychology and social sciences commonly use α = 0.
- Lesson 342 — Alpha Level Trade-offs
- Figure
- is the entire building—the blank canvas or container that holds everything.
- Lesson 1255 — The Anatomy of a Matplotlib Figure
- Fill rate
- % of buyer requests successfully matched
- Lesson 1630 — Marketplace Metrics: GMV, Take Rate, and Liquidity
- Filling gaps
- Use LEAD to preview the next non-null value
- Lesson 1023 — Introduction to Window Functions: LAG and LEAD
- Filter
- raw tables to relevant rows
- Lesson 994 — CTEs for Simplifying Complex JoinsLesson 1827 — Transformation Patterns: Map, Filter, Aggregate
- Filter conditions
- WHERE clauses applied at different stages.
- Lesson 1084 — Reading and Interpreting Query Execution Plans
- Filter early
- Use `WHERE` before `DISTINCT` or `ORDER BY` to reduce the data volume
- Lesson 880 — Performance Considerations and Best PracticesLesson 997 — CTE Best Practices and Performance
- Filtering conditions
- Applying `WHERE` clauses early reduces row counts before joining
- Lesson 951 — Join Order and Performance
- Filtering early
- Place restrictive `WHERE` conditions before or with early joins
- Lesson 951 — Join Order and Performance
- Finance
- R² = 0.
- Lesson 533 — Interpreting R-Squared ValuesLesson 1412 — What is Change-Point Detection?
- Financial analysis
- (comparing stocks with vastly different price ranges)
- Lesson 200 — Comparing Values Across Different Distributions
- Financial conflicts
- You're analyzing sales data for a product your company desperately needs to succeed.
- Lesson 35 — Conflicts of Interest and IndependenceLesson 1930 — Managing Conflicts of Interest
- Financial portfolios
- Assets with different investment amounts
- Lesson 43 — Weighted Mean and Its Applications
- Financing decisions
- Investors scrutinize this metric to assess capital efficiency
- Lesson 1757 — Payback Period: Definition and Importance
- Find minimal adjustment sets
- the smallest set of variables that, when conditioned on, blocks all backdoor paths
- Lesson 1475 — Using DAGs to Guide Analysis
- Find shared drivers
- Do two metrics both depend on the same underlying factor?
- Lesson 1625 — Cross-Functional Metric Dependencies
- Find the p-value
- from the chi-squared distribution (df = 1)
- Lesson 436 — Conducting McNemar's TestLesson 447 — Conducting One-Way ANOVA in Practice
- Find the quantiles
- of your posterior distribution that capture that probability mass
- Lesson 1562 — Credible Intervals for ProportionsLesson 1575 — Computing Equal-Tailed Credible Intervals
- Find the row
- corresponding to the first two digits of your Z-score (e.
- Lesson 198 — Using Z-Tables for Probability
- Finding duplicates
- Match rows where key fields are identical
- Lesson 947 — Self-Joins for Comparisons Within a Table
- Finding Overlapping Date Ranges
- Lesson 948 — Self-Joins with Inequality Conditions
- Finding probabilities
- P(a < X ≤ b) = F(b) - F(a)
- Lesson 157 — Cumulative Distribution Functions (CDFs) for Continuous Variables
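The identity P(a < X ≤ b) = F(b) − F(a) can be checked with `statistics.NormalDist` from the standard library. The standard normal distribution and the interval endpoints are assumptions chosen for illustration.

```python
from statistics import NormalDist

X = NormalDist(mu=0, sigma=1)  # standard normal, chosen as an example
a, b = -1.0, 1.0

# Probability of landing in (a, b] is the difference of two CDF values.
prob = X.cdf(b) - X.cdf(a)
print(round(prob, 4))
```

The result is the familiar figure of about 68% of values within one standard deviation of the mean.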
- Finding Sequential Records
- Lesson 948 — Self-Joins with Inequality Conditions
- Finding unique categories
- What products do we sell?
- Lesson 873 — Understanding DISTINCT: Removing Duplicate Rows
- Findings
- Your main insights with supporting visualizations
- Lesson 1966 — Report Structure and Executive Summary
- Finite Population Correction (FPC)
- factor adjusts for this by *shrinking* the standard error to reflect the extra precision:
- Lesson 264 — Finite Population Correction
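The usual form of the correction factor is sqrt((N − n) / (N − 1)), where N is the population size and n is the sample size; it multiplies the ordinary standard error. The sketch below assumes that form, with made-up numbers.

```python
from math import sqrt

def fpc(N, n):
    """Finite population correction factor for a sample of n from N."""
    return sqrt((N - n) / (N - 1))

def corrected_se(s, n, N):
    """Standard error of the mean, shrunk by the correction factor."""
    return (s / sqrt(n)) * fpc(N, n)

factor = fpc(N=1000, n=100)
se = corrected_se(s=10, n=100, N=1000)
print(round(factor, 4), round(se, 4))
```

When n is a tiny fraction of N the factor is close to 1 and makes little difference; it matters once the sample covers a noticeable share of the population.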
- Firewall Issues
- block traffic between your application and database.
- Lesson 1093 — Troubleshooting Connection Issues
- First batch of data
- Start with Beta(2, 2) prior → observe 10 successes, 15 failures → get Beta(12, 17) posterior
- Lesson 1563 — Sequential Updating with New Data
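With a Beta prior and binomial data, updating is just addition: successes are added to the first parameter and failures to the second. The sketch below reproduces the numbers in this entry; the second batch is a hypothetical continuation added for illustration.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures

# First batch, as in the entry: Beta(2, 2) prior, 10 successes, 15 failures.
posterior = update_beta(2, 2, successes=10, failures=15)

# Hypothetical second batch: yesterday's posterior is today's prior.
posterior_2 = update_beta(*posterior, successes=8, failures=4)

posterior_mean = posterior[0] / sum(posterior)
print(posterior, posterior_2, round(posterior_mean, 3))
```

Updating batch by batch gives the same result as processing all the data at once.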
- First difference
- Treatment group's change = (After - Before)
- Lesson 1452 — The Difference-in-Differences Setup
- First difference (Control group)
- Calculate the change in the control group over the same period: `(Y_control_after - Y_control_before)`
- Lesson 1454 — Calculating the DiD Estimator
- First difference (Treatment group)
- Calculate the change in the treatment group from before to after the intervention: `(Y_treatment_after - Y_treatment_before)`
- Lesson 1454 — Calculating the DiD Estimator
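The two first differences combine into the DiD estimate by subtracting the control group's change from the treatment group's change. A minimal sketch, with a hypothetical function name and made-up group averages:

```python
def did_estimate(treat_before, treat_after, control_before, control_after):
    """Difference-in-differences from four group averages."""
    treatment_change = treat_after - treat_before    # first difference, treatment
    control_change = control_after - control_before  # first difference, control
    return treatment_change - control_change         # second difference

effect = did_estimate(
    treat_before=100, treat_after=130,
    control_before=90, control_after=105,
)
print(effect)
```

The treatment group rose by 30 and the control group by 15, so the estimated effect is 15, provided the two groups would otherwise have followed parallel trends.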
- First evidence (fingerprint found)
- Apply Bayes' Theorem → posterior becomes 60%.
- Lesson 114 — Sequential Updating
- First join
- (LEFT): All customers appear, even those without orders (orders columns show NULL)
- Lesson 952 — Mixing Join Types
- First meaningful action
- When a user performs a core action that indicates true engagement
- Lesson 1646 — Defining Cohort Start Events
- First Quartile (Q1)
- the 25th percentile
- Lesson 59 — The Five-Number Summary and Box PlotsLesson 1383 — Understanding the Interquartile Range (IQR)
- First touch
- (30%) – The initial interaction that brings awareness
- Lesson 1730 — W-Shaped Attribution Model
- First touch matters
- Someone discovered you somehow—that channel deserves significant credit
- Lesson 1729 — Position-Based (U-Shaped) Attribution
- First-Pass Yield
- measures the percentage of units that pass quality checks without rework on the first attempt.
- Lesson 1636 — Manufacturing Metrics: OEE, Yield, and Cycle Time
- First-touch
- Credit goes to the initial interaction
- Lesson 1637 — What is Metric Attribution?Lesson 1724 — Limitations of Single-Touch Attribution
- First-touch attribution
- credits the initial discovery channel.
- Lesson 1723 — Comparing Single-Touch ModelsLesson 1725 — Implementing Single-Touch Attribution
- Fisher's exact test
- (for small samples)
- Lesson 419 — Assumptions and Minimum Expected FrequenciesLesson 434 — Fisher's Exact vs Chi-Squared: When to Use EachLesson 437 — Applications: Clinical Trials and Market Research
- Fisher's z-transformation
, which converts *r* into a value *z'* that *is* approximately normally distributed: *z'* = ½ ln((1 + *r*) / (1 − *r*))
- Lesson 503 — Confidence Intervals for Correlation Coefficients
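One way to use the transformation for a confidence interval is sketched below. It assumes the usual large-sample standard error of 1/√(n − 3) on the *z'* scale and a normal critical value; the function name `fisher_z_interval` and the inputs `r=0.6`, `n=50` are illustrative only.

```python
import math
from statistics import NormalDist


def fisher_z_interval(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)                   # atanh(r) equals 0.5 * ln((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)           # assumed standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower_z, upper_z = z - crit * se, z + crit * se
    return math.tanh(lower_z), math.tanh(upper_z)  # back-transform to the r scale


low, high = fisher_z_interval(r=0.6, n=50)
print(round(low, 3), round(high, 3))
```

Because the interval is built on the *z'* scale and then back-transformed, it is not symmetric around *r* and always stays inside (−1, 1).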
- Fit models
- Kaplan-Meier for overall survival curves; Cox models to test effects of predictors like manufacturing date, component supplier, or usage intensity
- Lesson 837 — Product Warranty and Failure Analysis
- Fitness trackers
- revealing military base locations through jogging patterns
- Lesson 1922 — Surveillance and Secondary Data Uses
- Fitted value (ŷ)
- "Here's what the model *predicts* based on the linear relationship"
- Lesson 543 — Residuals as Unexplained Variation
- fitted values
- (group means) on the x-axis and **residuals** (observed minus predicted) on the y-axis.
- Lesson 451 — Diagnostic Plots for ANOVALesson 538 — What Are Fitted Values?
- Fixed aspect ratios
- ensure equal spacing (crucial for maps)
- Lesson 1344 — Scales and Coordinate Systems
- Fixed n
- Known number of trials (products, patients, voters, visitors)
- Lesson 131 — Real-World Applications of Binomial Distributions
- fixed number of trials
- (n)
- Lesson 146 — When to Use Poisson vs Other DistributionsLesson 154 — Real-World Use Cases: Customer Behavior and Events
- Fixed probability
- The probability p stays the same each time
- Lesson 123 — Bernoulli Trial Definition and Properties
- Fixing inconsistencies
- Standardizing formats (like dates, phone numbers, or categories) so everything follows the same pattern.
- Lesson 12 — Data Cleaning and Preparation
- flexibility
- to capture these non-linear patterns without abandoning regression entirely.
- Lesson 657 — What Are Polynomial Features?Lesson 662 — Polynomial Features vs SplinesLesson 1816 — What is ELT? Extract, Load, Transform Explained
- Flexible
- Combines the strengths of multiple sampling techniques you've already learned
- Lesson 238 — Multistage SamplingLesson 1557 — The Beta-Binomial Model
- Flexible ranges
- allow newer patch or minor versions, enabling automatic security fixes and bug patches without manual intervention.
- Lesson 2050 — Pinning Versions vs Flexible Ranges
- Flipped coordinates
- swap x and y axes for horizontal layouts
- Lesson 1344 — Scales and Coordinate Systems
- Flipping two coins
- Lesson 78 — Events as Subsets of the Sample Space
- Focus
- your conclusions on these specific relationships
- Lesson 428 — Post-Hoc Analysis and ResidualsLesson 1215 — Characteristics of Explanatory Visualizations
- Focus indicators
- are critical: users must see *where* they are in the interface at all times.
- Lesson 1253 — Interactive Accessibility: Keyboard Navigation
- Focus on large-data scenarios
- With abundant data, the likelihood dominates and priors matter less (robustness naturally increases)
- Lesson 1572 — Sensitivity Analysis and Prior Robustness
- Focus on the slope
- The slope still tells you about the relationship *within your data range*
- Lesson 526 — When the Intercept Has No Meaning
- Folium
- and **Plotly** transform your geographic data into engaging web visualizations.
- Lesson 1313 — Interactive Maps with Folium and Plotly
- Follow ethical guidelines
- established by your organization or profession
- Lesson 35 — Conflicts of Interest and Independence
- Follow multiple channels
- academic papers for cutting-edge research, industry blogs for practical applications, documentation for tool updates, and community forums for real-world problem-solving patterns.
- Lesson 2143 — Continuous Learning and Skill Development
- Follow up with nonrespondents
- Send reminders, offer incentives, or use different contact methods to reduce nonresponse bias.
- Lesson 250 — Strategies for Bias Detection and Mitigation
- Follow-ups
- If confirmed, design experiment to optimize marketing for that segment
- Lesson 1204 — From Hypothesis to Analysis Plan
- Font face
- Make titles bold with `face = "bold"` or italicize annotations
- Lesson 1364 — Customizing Text Elements
- Font family
- The typeface itself (e.
- Lesson 1297 — Font Properties and Text StylingLesson 1364 — Customizing Text Elements
- Font size
- Measured in points; larger for titles, smaller for tick labels
- Lesson 1297 — Font Properties and Text StylingLesson 1364 — Customizing Text Elements
- Font sizing
- At publication dimensions, default fonts become tiny.
- Lesson 1369 — Publication-Ready Plot Styling
- Font weight
- 'normal', 'bold', 'light', or numeric values (100-900)
- Lesson 1297 — Font Properties and Text Styling
- Foot traffic
(customers entering stores) acts as a leading indicator.
- Lesson 1634 — Retail Metrics: Same-Store Sales and Inventory Turnover
- For `random` module
- Lesson 2057 — Setting Seeds in Python and R
- For comparisons
- between categories → use bar charts or column charts
- Lesson 1230 — Choosing the Right Chart Type
- For distributions
- of continuous data → use histograms or box plots
- Lesson 1230 — Choosing the Right Chart Type
- For each p-value
- , calculate its threshold: (i/m) × α, where i is its rank, m is the total number of tests, and α is your target FDR level (e.
- Lesson 1506 — Benjamini-Hochberg Procedure
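The threshold rule above can be sketched as follows. This is a minimal version of the step-up procedure, assuming a target FDR of α = 0.05 (the entry's own example value is cut off); the function name and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a reject/keep flag per p-value using the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value

    # Find the largest rank i whose p-value is at or below (i / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


p_values = [0.039, 0.001, 0.27, 0.008, 0.035]  # hypothetical test results
decisions = benjamini_hochberg(p_values, alpha=0.05)
print(decisions)  # [True, True, False, True, True]
```

In this example 0.035 sits at rank 3 and misses its own threshold of 0.03, yet it is still rejected because 0.039 at rank 4 clears 0.04. That is the step-up behaviour that distinguishes the procedure from checking each threshold in isolation.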
- For executives
- Lesson 1954 — Tailoring Visualizations to Audience Needs
- For expensive aggregations
- approximate when possible, or use `.
- Lesson 1796 — Limitations and Differences from Pandas
- For floats
- If precision beyond ~7 significant digits isn't critical for your analysis, `float32` cuts memory in half with minimal impact on most calculations.
- Lesson 1799 — Optimal Data Types and Downcasting
- For lag 1
- Lesson 721 — Computing ACF Values
- For lag 2
- Lesson 721 — Computing ACF Values
- For nested models
- Use the **Partial F-Test** (which you learned in lesson 623) to formally test whether the extra predictors significantly improve the model
- Lesson 626 — Nested vs Non-Nested Models
- For non-nested models
Compare using **Adjusted R-Squared**, **AIC**, or **BIC**, but you cannot use the Partial F-Test
- Lesson 626 — Nested vs Non-Nested Models
- For part-to-whole composition
- → use stacked bar charts (or reluctantly, pie charts)
- Lesson 1230 — Choosing the Right Chart Type
- For performance
- `UNION ALL` skips the expensive duplicate-checking step, making it significantly faster on large datasets
- Lesson 1000 — UNION ALL: Preserving Duplicates
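The difference in output between the two operators can be seen in a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table and column names are hypothetical, and the example only demonstrates the duplicate handling, not the speed difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE online_orders (customer TEXT);
    CREATE TABLE store_orders (customer TEXT);
    INSERT INTO online_orders VALUES ('Ana'), ('Ben');
    INSERT INTO store_orders VALUES ('Ben'), ('Cy');
    """
)

# UNION removes duplicate rows across the two result sets.
union_rows = conn.execute(
    "SELECT customer FROM online_orders UNION SELECT customer FROM store_orders"
).fetchall()

# UNION ALL keeps every row, so 'Ben' appears twice.
union_all_rows = conn.execute(
    "SELECT customer FROM online_orders UNION ALL SELECT customer FROM store_orders"
).fetchall()
conn.close()

print(len(union_rows), len(union_all_rows))  # 3 4
```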
- For positive values
- It applies a transformation similar to Box-Cox
- Lesson 215 — Yeo-Johnson Transformation
- For relationships
- between two numeric variables → use scatter plots
- Lesson 1230 — Choosing the Right Chart Type
- For sorting
- minimize sorts or do them after filtering to smaller datasets.
- Lesson 1796 — Limitations and Differences from Pandas
- For strings
- The `categorical` dtype stores each unique value once plus integer codes—massive savings when cardinality is low relative to row count.
- Lesson 1799 — Optimal Data Types and Downcasting
- For technical stakeholders
- Lesson 1954 — Tailoring Visualizations to Audience Needs
- For the intercept (b₀)
- Lesson 518 — Deriving the Least Squares Estimators
- For the slope (b₁)
- Lesson 518 — Deriving the Least Squares Estimators
- Forecast ahead
- Generate predictions for the held-out period
- Lesson 790 — Out-of-Sample Forecast Evaluation
- Forecast future churn
- more accurately using cohort-specific curves
- Lesson 1672 — Cohort-Based Churn Analysis
- forecasting
- because they only use information available *at that moment*.
- Lesson 753 — Centered vs Trailing Moving AveragesLesson 1571 — Posterior Predictive Distribution for New Data
- Foreign Key
- A column that references the primary key in another table (like `customer_id` in an orders table)
- Lesson 843 — Relational Database ConceptsLesson 921 — Primary and Foreign Key RelationshipsLesson 1051 — Introduction to Foreign Keys
- foreign keys
- (concepts you've already learned).
- Lesson 1061 — Introduction to NormalizationLesson 1148 — Handling Multiple Types in One TableLesson 1808 — Star Schema and Fact Tables
- Form
- Is the relationship linear (straight-line pattern) or curved?
- Lesson 480 — Scatterplots and Visual Assessment
- Form a Hypothesis
- Lesson 25 — The Scientific Method in Data Science
- Formal definition
- A **sampling distribution** is the probability distribution of a statistic (like the mean, median, or proportion) computed from *all possible samples* of a fixed size drawn from the same population.
- Lesson 251 — What is a Sampling Distribution?
- formal hypothesis test
- that helps you determine whether your data is normally distributed.
- Lesson 205 — Shapiro-Wilk TestLesson 1389 — What is Grubbs' Test?
- Formal reviews
- Monthly presentations for key milestones and decision points
- Lesson 2104 — Communication Cadence and Updates
- Formal test second
- Does it confirm major concerns, or is it just picking up minor noise?
- Lesson 570 — Q-Q Plots vs Formal Normality Tests: When Visual Checks Matter
- Format errors
- Malformed email addresses or phone numbers
- Lesson 1109 — Input Validation and Defense in Depth
- Format expectations
- You assume dates come in one format, numeric codes have certain meanings, or null values are handled consistently—until they're not.
- Lesson 2133 — Undocumented Data Dependencies
- Format inconsistencies
- Dates in different formats, mixed capitalizations
- Lesson 1150 — What is Data Validation?
- Format selection
- Vector formats (PDF, SVG) scale perfectly for print; PNG works for web at 300+ DPI.
- Lesson 1369 — Publication-Ready Plot Styling
- Formula
- Lesson 1383 — Understanding the Interquartile Range (IQR)Lesson 1451 — Estimating Treatment Effects from Matched SamplesLesson 1627 — E-commerce Metrics: AOV, Cart Abandonment, and RPVLesson 1890 — Measuring Disparate Impact
- Formula connection
- Lesson 182 — Special Cases: Exponential and Chi-Squared
- Foundation (base layer)
- Your data and aesthetic mappings using `ggplot()`
- Lesson 1347 — Understanding Layers in ggplot2
- Foundation for inference
- Understanding the sampling distribution lets us say things like "we're 95% confident the true population mean is between X and Y"—which we'll explore in future lessons.
- Lesson 251 — What is a Sampling Distribution?
- Fragmentation
- occurs when data pages become scattered physically on disk due to inserts, updates, and deletes.
- Lesson 1086 — Index Maintenance and Monitoring
- frame
- the subset of rows within your partition that the function operates on.
- Lesson 1015 — ROWS vs RANGE Frame SpecificationsLesson 1016 — Cumulative Sums and Running Totals
- Framing the technical problem
- Is this supervised learning?
- Lesson 2085 — Stage 1: Problem Definition and Scoping
- Frequency order (descending)
- Best for spotting the most/least common categories at a glance—this is usually recommended for EDA
- Lesson 1178 — Bar Charts for Categorical Data
- Frequentist
"If we ran this test repeatedly, 95% of intervals constructed this way would capture the true rate."
- Lesson 1564 — Comparing Bayesian and Frequentist Proportion Inference
- Frequentist A/B testing
- treats the true conversion rate as a fixed (but unknown) parameter.
- Lesson 1580 — Bayesian vs Frequentist A/B Testing
- Frequentist interpretation
- treats probability as a **long-run frequency**.
- Lesson 1540 — Comparing Bayesian and Frequentist Interpretations
- Frequently accessed lookup values
- Product categories, customer names, or status labels that rarely change but are queried constantly.
- Lesson 1074 — Duplicating Data Across Tables
- Friction zone
- 2-10 GB datasets may work but cause slowdowns and memory pressure
- Lesson 1783 — Data Size Thresholds: When Pandas Isn't Enough
- FROM
- SQL identifies which table(s) to use
- Lesson 896 — GROUP BY Execution OrderLesson 912 — Fundamental Difference: Filter Timing
- From domain expertise
- If historical data suggests 100 conversions from 500 trials, you could use α = 100, β = 400 as an informative prior that reflects actual experience.
- Lesson 1558 — Choosing Informative Priors for Proportions
- From z-score to percentile
- Look up your z-score in a z-table.
- Lesson 199 — Finding Percentiles with Z-Scores
- Full model
- Adds more predictors (e.
- Lesson 623 — Partial F-Tests for Nested ModelsLesson 654 — Testing Interaction SignificanceLesson 684 — Likelihood Ratio Tests for Model Comparison
- FULL OUTER JOIN
- (also called FULL JOIN) returns all rows from both tables, regardless of whether there's a matching row in the other table.
- Lesson 935 — What is a FULL OUTER JOIN?Lesson 936 — FULL OUTER JOIN Syntax
- Funnel analysis
- is a method that visualizes and measures how users move through a defined sequence of steps (a "funnel") toward completing a desired action—like making a purchase, signing up, or subscribing.
- Lesson 1678 — What is Funnel Analysis?
- Funnel or cone shape
- Variance increases or decreases with predictions
- Lesson 560 — Scale-Location Plot (Spread-Location Plot)
- Funnel shapes
- Variance changes as predictions increase (violates homoscedasticity)
- Lesson 556 — What Are Residuals and Why Plot Them?
- Future-proofing
- You can resurrect old projects years later
- Lesson 2047 — What is Dependency Management?
- Fuzzy
- A job training program *offered* to all unemployed workers over age 55.
- Lesson 1461 — Sharp vs Fuzzy RDD
G
- Gaming the System
- A credit scoring model could be reverse-engineered by fraudsters who game input features to appear creditworthy.
- Lesson 1920 — Anticipating Misuse of Data Products
- Gamma
- (for positive continuous values)
- Lesson 664 — What is the Exponential Family of Distributions?Lesson 669 — The Dispersion Parameter φLesson 676 — Canonical vs Non-Canonical LinksLesson 769 — Smoothing Parameters: Alpha, Beta, Gamma
- Gamma distribution
- is a continuous probability distribution that describes positive real numbers (values greater than zero).
- Lesson 181 — Gamma Distribution: Shape and Rate ParametersLesson 1552 — Gamma-Poisson Conjugacy
- Gap analysis
- Find when values changed from the prior period
- Lesson 1024 — LAG Function: Accessing Previous Row Values
- gaps
- (empty bins) that might signal data collection issues or natural separations, and **outliers** (isolated bars far from the main cluster).
- Lesson 1175 — Histograms for Distribution ShapeLesson 1220 — Histograms for Continuous Distributions
- Gaussian mechanism
- adds noise from a normal (Gaussian) distribution.
- Lesson 1899 — Adding Noise for Privacy
- GDPR principles
- , your organization's **conflicts of interest** policy, or industry standards rather than personal objections.
- Lesson 1931 — When to Push Back on Requests
- Gender
- (Male vs Female) on recovery time.
- Lesson 653 — Interpreting Categorical × Categorical Interactions
- Gender and sex
- Lesson 1888 — Protected Classes and Sensitive Attributes
- General Multiplication Rule
- , and it works for *any* two events—dependent or independent.
- Lesson 88 — General Multiplication Rule
- Generalization
- replaces specific values with broader categories: exact ages become age ranges (25-30), precise locations become regions, exact salaries become brackets.
- Lesson 1895 — Data Anonymization BasicsLesson 1896 — K-Anonymity
- Generalized Linear Models
- that handle non-normal outcomes.
- Lesson 664 — What is the Exponential Family of Distributions?
- Generalized Linear Models (GLMs)
- , which extend regression to non-normal outcomes.
- Lesson 668 — Common Distributions as Exponential Family Members
- Generate a randomization
- using your chosen method
- Lesson 1492 — Rerandomization and Practical Implementation
- Generate replicated data
- For each posterior sample of parameters, simulate a new dataset
- Lesson 1596 — Posterior Predictive Checks and Model Comparison
- Generate the file automatically
- from your current environment
- Lesson 1987 — Environment and Dependency Management
- Genetics
- You expect a 9:3:3:1 ratio of phenotypes.
- Lesson 414 — Introduction to Chi-Squared Goodness of Fit Test
- GeoDataFrame
- like a pandas DataFrame, but with a special `geometry` column containing the actual shapes.
- Lesson 1311 — Working with Shapefiles and GeoJSON
- Geographic limitations
- Missing homeless populations or remote areas
- Lesson 249 — Coverage Error and Undercoverage
- Geographic regions
- (`country`, `region`) — when analysis is region-specific
- Lesson 1812 — Partitioning and Clustering Strategies
- GeoJSON
- is a newer, web-friendly format built on JSON.
- Lesson 1311 — Working with Shapefiles and GeoJSON
- Geometric
"How many random calls until I reach someone who owns an electric vehicle?"
- Lesson 138 — Real-World Applications: Quality Control and SurveysLesson 154 — Real-World Use Cases: Customer Behavior and Events
- geometric distribution
- tells you the probability of waiting exactly *k* tries before your first success happens.
- Lesson 132 — The Geometric Distribution: Waiting for the First SuccessLesson 137 — Geometric vs Negative Binomial: Key DifferencesLesson 138 — Real-World Applications: Quality Control and SurveysLesson 154 — Real-World Use Cases: Customer Behavior and Events
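A short sketch of the probability described in this entry follows. It assumes the convention in which *k* counts the trial on which the first success occurs (k = 1, 2, ...); some texts instead count the failures before the first success, which shifts *k* by one. The function name and the value `p = 0.2` are illustrative.

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k), assuming independent trials with success probability p."""
    if k < 1 or not 0 < p <= 1:
        raise ValueError("need k >= 1 and 0 < p <= 1")
    return (1 - p) ** (k - 1) * p  # k - 1 failures followed by one success


p = 0.2  # hypothetical chance of success on each try
prob_third = geometric_pmf(3, p)
print(round(prob_third, 3))  # 0.128
```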
- Geometric objects (geoms)
- The actual visual marks—points, lines, bars, polygons—that represent your data
- Lesson 1339 — What is the Grammar of Graphics?Lesson 1342 — Geometric Objects (geoms)
- Geometries (geom)
- The visual marks representing data (points, lines, bars, boxes)
- Lesson 1340 — The Seven Layers of Grammar
- ggplot2
- ships with a distinctive gray background with white gridlines—a deliberate choice to reduce visual clutter while maintaining reference lines.
- Lesson 1371 — Default Aesthetics and Design ChoicesLesson 1373 — Statistical Transformations: Built-in vs Manual
- Ghost ads
- and **PSA (Public Service Announcement) tests** are incrementality testing techniques where you show *neutral content* instead of your real ads to a control group.
- Lesson 1747 — Ghost Ads and PSA Tests
- Git
- tracks changes to code, including transformation scripts.
- Lesson 1164 — Tools for Lineage TrackingLesson 1990 — What is Version Control and Why Git?
- Global F-test
Asks "Does *at least one* predictor help explain the outcome?"
- Lesson 622 — Relationship Between F-Test and t-Tests
- Go deeper
- when you need to diagnose problems in a specific area (e.
- Lesson 1623 — Depth vs Breadth in Metric Trees
- Goal
- fit the line `y = a + b*time`, then subtract it
- Lesson 738 — Linear DetrendingLesson 1400 — Control Limits vs Specification Limits
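For the detrending goal above, a minimal sketch is shown below. It assumes ordinary least squares with time indexed as 0, 1, 2, ... and uses only the standard library; the function name `linear_detrend` and the series values are hypothetical.

```python
def linear_detrend(y):
    """Fit y = a + b*time by least squares and return (residuals, a, b)."""
    n = len(y)
    t = list(range(n))  # assumed time index: 0, 1, 2, ...
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    b = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y)) / sum(
        (ti - t_mean) ** 2 for ti in t
    )
    a = y_mean - b * t_mean
    residuals = [yi - (a + b * ti) for ti, yi in zip(t, y)]  # series minus fitted line
    return residuals, a, b


series = [3.1, 4.9, 7.2, 8.8, 11.0]  # made-up upward-trending data
detrended, intercept, slope = linear_detrend(series)
print(round(slope, 2))
```

The detrended values sum to approximately zero, which is a quick sanity check that the fitted line was removed correctly.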
- gold standard
- for creating representative samples.
- Lesson 234 — Simple Random SamplingLesson 1442 — Limitations and Practical Constraints
- Good
- Random cloud of points scattered evenly around the horizontal line at y=0, with consistent spread
- Lesson 557 — The Residuals vs Fitted Values PlotLesson 562 — Index Plots and Time-Ordered ResidualsLesson 1857 — Logging Best Practices
- Good (pyramid)
- "Reducing churn requires increasing early-user engagement.
- Lesson 1942 — The Pyramid Principle: Starting with the Conclusion
- Good example
- "Bar chart comparing Q4 sales across five regions.
- Lesson 1250 — Text Alternatives and Screen Reader Compatibility
- Good hypothesis
"Changing the checkout button from blue to green will increase conversion rate by at least 2 percentage points."
- Lesson 1479 — Formulating Hypotheses
- Good independence
- Lesson 448 — Independence of Observations
- Goodhart's Law
*"When a measure becomes a target, it ceases to be a good measure."*
- Lesson 1521 — Risks of Optimizing for Surrogates
- Goodness-of-fit
- How well the model explains the data (measured via likelihood)
- Lesson 629 — Akaike Information Criterion (AIC)Lesson 700 — AIC and BIC for Model Selection
- Goodness-of-fit tests
- Formal tests compare observed versus expected frequencies.
- Lesson 693 — Overdispersion in Count Data
- Governance and Approval
- Lesson 1643 — Building Attribution Frameworks
- GPL
- Requires derivative works to also be open source (more restrictive)
- Lesson 2082 — Choosing a License for Data Science Projects
- graceful degradation
- the pipeline remains healthy and productive while you handle edge cases systematically.
- Lesson 1852 — Dead Letter QueuesLesson 1854 — Testing Error Handling
- Grafana
- , and **Datadog** automate this process, offering dashboards that show pipeline status at a glance and trigger alerts when thresholds are breached.
- Lesson 1861 — Monitoring Tools and Dashboards
- Grammar of Graphics
- is a systematic approach to creating visualizations by combining independent building blocks, rather than selecting from a fixed menu of chart types.
- Lesson 1339 — What is the Grammar of Graphics?
- Graph algorithms
- Computing connected components or PageRank involves recursive traversals
- Lesson 1784 — Computation Complexity: Beyond Data Size
- Great Expectations
- is the leading Python library for this purpose.
- Lesson 1158 — Automated Validation FrameworksLesson 1164 — Tools for Lineage Tracking
- Greater sensitivity to effects
- With less noise in your estimate, even small differences between your null hypothesis and reality become detectable.
- Lesson 340 — Power and Sample Size Relationship
- greater than
- a certain value using cumulative distribution functions.
- Lesson 143 — Cumulative Poisson ProbabilitiesLesson 857 — Comparison Operators: Greater and Less Than
- Greenwood's formula
- gives us the standard error (SE) of the Kaplan-Meier estimator at any time t.
- Lesson 814 — Standard Errors and Confidence Intervals
- GridSpec
- lets you treat your figure like a flexible grid where subplots can span multiple cells, much like merging cells in a spreadsheet.
- Lesson 1278 — GridSpec for Complex Layouts
- Gross Profit Margin
- Revenue minus cost of goods sold
- Lesson 1516 — Business Metrics: Definition and Examples
- Group A
- Values tightly clustered around 50 (median = 50)
- Lesson 394 — Interpreting Rank-Based Tests: Medians vs Distributions
- Group B
- Values spread widely from 20 to 80 (median = 50)
- Lesson 394 — Interpreting Rank-Based Tests: Medians vs Distributions
- GROUP BY
- to create rich summaries of grouped data.
- Lesson 892 — GROUP BY with Different Aggregate FunctionsLesson 896 — GROUP BY Execution OrderLesson 903 — Combining WHERE and HAVINGLesson 912 — Fundamental Difference: Filter Timing
- Group customers into cohorts
- by their acquisition date (e.
- Lesson 1664 — Cohort-Based LTV CalculationLesson 1758 — Cohort-Based Payback Analysis
- Grouped (side-by-side) bars
- excel when you want to compare specific values across categories.
- Lesson 1188 — Stacked and Grouped Bar Charts
- Grouped analyses
- compare correlation coefficients or slopes between subgroups
- Lesson 1195 — Interaction Effects Between Variables
- Grouped bar charts
- place bars side-by-side for easy direct comparison between groups.
- Lesson 1188 — Stacked and Grouped Bar Charts
- grouped bars
- when precise, side-by-side comparison of subcategories is your priority.
- Lesson 1226 — Stacked and Grouped Bar ChartsLesson 1266 — Bar Plots: Categorical Comparisons
- Growth stage startups
- often tolerate higher CAC and longer payback because they're prioritizing market capture.
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- Grubbs' test tables
(organized by sample size and α level) or calculate them using formulas involving the t-distribution.
- Lesson 1392 — Critical Values and Significance Testing
- guardrails
- are defensive metrics designed to catch these problems before they damage your business.
- Lesson 1624 — Counter-Metrics and GuardrailsLesson 1925 — Mitigation Strategies and Responsible Disclosure
- Guide onboarding
- Focus new users on high-value features first
- Lesson 1696 — Feature Adoption and Usage Frequency
H
- Hₐ
- At least one method produces a different average score
- Lesson 439 — ANOVA Hypotheses and Research Questions
- H statistic
- that measures how much the rank sums vary between groups.
- Lesson 471 — Kruskal-Wallis H Test: The Non-Parametric One-Way ANOVA
- H₀
- μ = some specific value
- Lesson 310 — Writing Hypotheses for Different ParametersLesson 347 — One-Tailed Tests: Testing for a Specific DirectionLesson 354 — Setting Up Hypotheses for One-Sample t-TestLesson 358 — Worked Example: One-Sample t-Test in PracticeLesson 363 — Testing Equality of VariancesLesson 415 — Setting Up Hypotheses for Goodness of FitLesson 439 — ANOVA Hypotheses and Research QuestionsLesson 447 — Conducting One-Way ANOVA in Practice
- H₀ (Null Hypothesis)
- All groups have equal variances
- Lesson 380 — Testing Equal Variances: Levene's and Bartlett's Tests
- H₀: μ_d = 0
- Lesson 373 — Hypotheses for Paired t-Tests
- H₁
- μ ≠ some value (two-sided) *or* μ > value *or* μ < value (one-sided)
- Lesson 310 — Writing Hypotheses for Different ParametersLesson 347 — One-Tailed Tests: Testing for a Specific DirectionLesson 354 — Setting Up Hypotheses for One-Sample t-TestLesson 358 — Worked Example: One-Sample t-Test in PracticeLesson 363 — Testing Equality of VariancesLesson 415 — Setting Up Hypotheses for Goodness of FitLesson 447 — Conducting One-Way ANOVA in PracticeLesson 1511 — Sequential Probability Ratio Test (SPRT)
- H₁ (Alternative Hypothesis)
- At least one group has a different variance
- Lesson 380 — Testing Equal Variances: Levene's and Bartlett's Tests
- Hadoop MapReduce
- was the original distributed processing engine—breaking jobs into "map" (process chunks independently) and "reduce" (combine results) phases.
- Lesson 1764 — The Big Data Technology Landscape
- Halt (Fail Fast)
- Lesson 1866 — Handling Failed Quality Checks
- Hamiltonian Monte Carlo (HMC)
- borrows physics concepts to guide sampling intelligently.
- Lesson 1593 — Hamiltonian Monte Carlo and NUTS
- Handle staggered timing
- Each unit's treatment effect is estimated relative to its own adoption date
- Lesson 1457 — Multiple Time Periods and Staggered Adoption
- Handling missing values
- Deciding what to do when data points are absent—should you fill them in, remove those rows, or use another strategy?
- Lesson 12 — Data Cleaning and Preparation
- Hard to enforce rules
- You can't easily prevent invalid combinations (like a refund with no related sale)
- Lesson 1148 — Handling Multiple Types in One Table
- Harder interpretation
- when you're drowning in similar variables
- Lesson 1197 — Identifying Variable Importance and Redundancy
- Harder to judge accurately
- this is why pie charts are often criticized.
- Lesson 1231 — Channels of Visual Encoding
- HARKing
- (Hypothesizing After Results are Known), where you retrofit explanations to unexpected patterns.
- Lesson 1485 — Documentation and Pre-Registration
- Hash joins
- excel with large tables that fit in memory—they're fast but require space to build the hash structure.
- Lesson 957 — Join Strategies: Nested Loop, Hash, Merge
- HAVING
- Filters groups after aggregation
- Lesson 898 — HAVING Clause FundamentalsLesson 899 — HAVING vs WHERE: Key DifferencesLesson 903 — Combining WHERE and HAVINGLesson 912 — Fundamental Difference: Filter Timing
- HAVING filters last
- It removes entire groups based on their aggregated values
- Lesson 915 — Combining WHERE and HAVING
- HDI
- always includes the most probable values—the densest region.
- Lesson 1577 — When HDI and Equal-Tailed Intervals DifferLesson 1579 — Practical Computation of Credible Intervals
- Header row
- The first line often contains column names (`name`, `age`, `city`)
- Lesson 1125 — CSV Files: Structure and Common Issues
- Health monitoring
- Disease onset or treatment effectiveness
- Lesson 1412 — What is Change-Point Detection?
- Health standards
- Is the average blood pressure of patients in a clinic different from the national average of 120 mmHg?
- Lesson 351 — When to Use a One-Sample t-Test
- Health studies
- Volunteers may be more health-conscious than average
- Lesson 246 — Volunteer and Self-Selection Bias
- Healthcare
- Predicting disease outbreaks or personalizing treatment plans
- Lesson 6 — Common Data Science Applications
- Heatmaps
- solve this by color-coding correlation strengths:
- Lesson 510 — Correlation Matrices: Construction and DisplayLesson 1192 — Correlation Matrices and Heatmaps
- heavier tails
- meaning more probability in the extremes.
- Lesson 268 — Critical Values and the t-DistributionLesson 352 — The t-Distribution and Degrees of Freedom
- Heavy gridlines
- Use subtle, minimal guides only when necessary
- Lesson 1237 — Chart Junk and Data-Ink Ratio
- heavy tails
- , meaning a small number of extreme values dominate the total.
- Lesson 191 — Pareto Principle and the 80/20 RuleLesson 567 — Common Q-Q Plot Patterns: Heavy Tails and Light Tails
- Heavy-tailed distributions
- (with extreme outliers): even larger samples required
- Lesson 220 — Sample Size Requirements for the CLTLesson 1379 — Assumptions and Limitations
- Hedging
- protects against channel-specific risks (platform bans, seasonal dips)
- Lesson 1716 — Channel Mix and Portfolio Thinking
- Height
- Taller bars (positive or negative) indicate stronger correlation
- Lesson 722 — ACF Plots and Interpretation
- Height and Weight
- If you're predicting adult weight from height, the intercept represents the predicted weight when height = 0 inches.
- Lesson 526 — When the Intercept Has No Meaning
- Heroku
- is a general-purpose cloud platform that works with both Streamlit and Dash.
- Lesson 1338 — Deployment and Sharing Dashboards
- heteroscedasticity
- (non-constant variance) — a violation of this assumption.
- Lesson 549 — Homoscedasticity: Constant Variance of ResidualsLesson 559 — Detecting Heteroscedasticity (Non-Constant Variance)Lesson 591 — When and Why to Transform Variables
- Hidden randomness
- Random processes without fixed seeds produce varying results (Random Seeds)
- Lesson 30 — The Reproducibility Crisis and Solutions
- Hidden subgroups
- Averaging diverse populations together (Simpson's Paradox territory)
- Lesson 1245 — Misleading Aggregations and Binning
- Hide cyclicality
- Show only the upswing of a seasonal pattern while ignoring the inevitable downturn
- Lesson 1241 — Cherry-Picking Time Ranges
- Hide technical depth strategically
- Methodology, statistical tests, and data quality checks belong in appendices or backup slides (lesson 1949).
- Lesson 1965 — Progressive Disclosure Techniques
- Hiding data
- Adding a dense geom (like `geom_ribbon()`) last can hide points underneath
- Lesson 1355 — Layer Order and Plot Composition
- Hiding trade-offs
- Your values might mask important considerations others prioritize differently
- Lesson 1927 — Separating Analysis from Advocacy
- Hierarchical
- Clear parent-child or directed flow relationships
- Lesson 1318 — Network Layout Algorithms
- Hierarchical relationships
- If you have `city`, `state`, and `country` columns, does "Boston" really belong to "Texas" or "Canada"?
- Lesson 1155 — Consistency Checks Across Fields
- High correlations between predictors
- (e.
- Lesson 513 — Applications: Feature Selection and Multicollinearity
- High influence
- = actually changes the fitted model (unusual X *and* unusual Y given that X)
- Lesson 574 — Influence: Impact on Fitted Model
- high leverage
- not because their score is unusual, but because their study time is far from the typical range.
- Lesson 572 — Leverage: Distance in X-SpaceLesson 574 — Influence: Impact on Fitted Model
- High noise-to-signal ratio
- When errors dominate true patterns, models learn randomness instead of relationships.
- Lesson 2124 — Insufficient or Low-Quality Data
- High p-value (≥ α)
- The observed frequencies are reasonably close to expected frequencies.
- Lesson 420 — Interpreting Chi-Squared Test Results
- High power
- to detect non-normality makes it a go-to choice.
- Lesson 378 — Testing Normality: Statistical Tests
- High-risk, engaged recently
- In-product interventions (tooltips, feature prompts) based on usage gaps
- Lesson 1676 — Win-Back and Retention Strategies
- High-risk, high-value
- Personalized outreach, account manager check-ins, or special loyalty offers
- Lesson 1676 — Win-Back and Retention Strategies
- Higher adjusted R-squared
- suggests a better balance of fit and simplicity
- Lesson 615 — Comparing Models with Adjusted R-Squared
- Higher alpha (0.10)
- Like a more-sensitive detector.
- Lesson 334 — Setting Alpha: Choosing Your Significance Level
- Higher confidence (e.g., 99%)
- More reliable method, but wider intervals
- Lesson 267 — Interpreting Confidence Levels
- Higher confidence level
- → Wider margin (you cast a wider net to be more certain)
- Lesson 294 — Margin of Error and Its Components
- Higher evidence bar
- – Only stronger signals will be deemed "significant"
- Lesson 342 — Alpha Level Trade-offs
- Higher variance/standard deviation
- = outcomes are more spread out, more unpredictable
- Lesson 148 — Variance and Standard Deviation of Discrete Distributions
- Higher λ
- means events happen more frequently → shorter waiting times
- Lesson 164 — The Exponential Distribution
- Highest Density Interval (HDI)
- takes a smarter approach: it finds the *shortest possible* interval that still contains your desired probability mass (say, 95%).
- Lesson 1576 — Highest Density Intervals (HDI)
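The "shortest interval" idea can be sketched directly on a set of posterior draws. This is a minimal illustration, assuming the draws are plain numbers and the distribution is unimodal; the function name `hdi` and the toy values are hypothetical and not taken from the lesson.

```python
import math

def hdi(samples, mass=0.95):
    """Return the shortest interval (low, high) covering about `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    # Number of consecutive sorted points each candidate interval must cover.
    k = max(1, math.ceil(mass * n))
    # Slide a window of k points across the sorted draws and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Right-skewed toy sample (hypothetical values).
draws = [1, 1, 2, 2, 2, 3, 3, 4, 6, 20]
low, high = hdi(draws, mass=0.8)
print(low, high)
```

Because the window is chosen by width rather than by equal tail areas, a skewed sample pulls the interval toward the dense region instead of stretching it toward the long tail.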
- Highlight, don't decorate
- Use bold or saturated colors for the 1-3 most important data points you want your audience to notice first.
- Lesson 1961 — Color as Communication Tool
- Highly Objective
- Lesson 1598 — Characteristics of Lagging Indicators
- Highly skewed distributions
- (like income data): you may need n = 50, 100, or more
- Lesson 220 — Sample Size Requirements for the CLT
- Hill function
- (also called logistic or S-curve):
- Lesson 1740 — Saturation Curves and Diminishing Returns
- histogram
- shows the frequency of values in bins.
- Lesson 377 — Testing Normality: Visual MethodsLesson 788 — Checking Residual NormalityLesson 1267 — Histograms and Distribution Plots
- Histograms
- divide your data into bins and show the frequency of observations in each bin as bars.
- Lesson 203 — Visual Assessment: Histograms and Density PlotsLesson 290 — Assumptions and Diagnostics for Difference IntervalsLesson 1208 — Distribution Checks for All VariablesLesson 1343 — Statistical Transformations
- Historical Data
- Lesson 297 — Handling Unknown Population ParametersLesson 1534 — The Prior Distribution
- Historical patterns
- If a job normally takes 10-15 minutes, alert at 30+ minutes, not 16
- Lesson 1858 — Alerting StrategiesLesson 1878 — What is Bias in Data?
- Historical performance
- "This is our best quarter in three years"
- Lesson 1962 — Contextualizing Numbers
- Historical snapshots
- Copying current product prices into order records preserves what the customer actually paid, even if prices change later.
- Lesson 1074 — Duplicating Data Across Tables
- Historical Trends
- Show how values change over time.
- Lesson 1939 — Context and Comparison: Making Numbers Meaningful
- Holm-Bonferroni
- (also called "step-down Bonferroni") is a sequential method:
- Lesson 459 — Holm-Bonferroni and Šidák MethodsLesson 512 — Testing Significance in Correlation MatricesLesson 1507 — Multiple Testing in A/B Test Variations
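The sequential procedure can be sketched as follows: sort the p-values, compare the smallest to α/m, the next to α/(m−1), and so on, stopping at the first one that fails. The function name `holm_bonferroni` and the example p-values are illustrative assumptions.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step-down threshold: alpha/m, then alpha/(m-1), ..., then alpha/1.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # first failure: this and all larger p-values are retained
    return reject

decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
print(decisions)
```

In this example, 0.03 would pass an unadjusted 0.05 cutoff but fails its step-down threshold of 0.025, so it and the larger p-value are not rejected.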
- Holt-Winters Multiplicative Model
- is designed for time series where seasonal fluctuations change in size as the overall level of the series changes.
- Lesson 768 — Holt-Winters Multiplicative Model
- Holt's Method
- adds a second equation to track the trend separately.
- Lesson 761 — Double Exponential Smoothing (Holt's Method)
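A minimal sketch of the two-equation update, assuming the common additive form with smoothing parameters α (level) and β (trend). The initialization (first value as the level, first difference as the trend) is one conventional choice rather than the only one, and `holt_forecast` is a hypothetical name.

```python
def holt_forecast(series, alpha, beta, horizon=1):
    """Forecast `horizon` steps ahead with double exponential smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # Level equation: blend the new observation with the previous one-step forecast.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # Trend equation: blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

forecasts = holt_forecast([10, 12, 14, 16], alpha=0.5, beta=0.5, horizon=2)
print(forecasts)
```

On a perfectly linear series the level and trend never need correcting, so the forecasts simply continue the line.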
- homoscedasticity
- (homo = same, scedasticity = scatter).
- Lesson 379 — The Assumption of Equal Variances (Homoscedasticity)Lesson 450 — Homogeneity of Variance (Homoscedasticity)Lesson 546 — The Five Core Assumptions of Linear RegressionLesson 557 — The Residuals vs Fitted Values PlotLesson 601 — Assumptions for Multiple Linear RegressionLesson 782 — Residual Diagnostics for ARIMA
- Horizontal patterns
- One cohort performing differently across all periods indicates something unique about that acquisition group
- Lesson 1649 — Visualizing Cohort Data with Heatmaps
- Horizontal scaling (scale-out)
- means distributing work across multiple machines working in parallel.
- Lesson 1767 — Scale-Up vs Scale-Out Architectures
- Horizontal trend line
- Good news—variance is roughly constant (homoscedasticity)
- Lesson 560 — Scale-Location Plot (Spread-Location Plot)
- Hospital studies
- Disease severity and access to care both lead to hospitalization.
- Lesson 1473 — Conditioning on Colliders: Selection Bias
- Hover tooltips
- displaying data values when you move your mouse over points
- Lesson 1300 — Creating Basic Interactive Charts with Plotly Express
- how
- to calculate it, the critical question becomes: *what does the number actually mean?
- Lesson 533 — Interpreting R-Squared ValuesLesson 920 — Understanding Join Conditions with ONLesson 1346 — The Grammar vs Traditional PlottingLesson 1830 — Documentation and Metadata ManagementLesson 1850 — Retry StrategiesLesson 2023 — Creating a Pull Request
- How it works
- Lesson 1457 — Multiple Time Periods and Staggered AdoptionLesson 1828 — Incremental vs Full Load Strategies
- How many extra parameters
- you added (degrees of freedom cost)
- Lesson 627 — The F-Test for Model Comparison
- How much
- is missing per column?
- Lesson 1207 — Missing Data Assessment and StrategyLesson 2137 — Refactoring Strategies and Debt Paydown
- How much better
- the full model fits the data (lower RSS—residual sum of squares)
- Lesson 627 — The F-Test for Model Comparison
- How strongly
- it would need to relate to the treatment (exposure)
- Lesson 1434 — Sensitivity Analysis for Confounding
- How to check
- Lesson 374 — Assumptions of the Paired t-TestLesson 552 — Zero Conditional Mean of Errors
- How to report bugs
- Where should users file issues?
- Lesson 2083 — Contributing Guidelines and Contact Information
- How to suggest features
- Is there a template or discussion forum?
- Lesson 2083 — Contributing Guidelines and Contact Information
- HubSpot
- Weekly active teams using the platform: activation and ongoing engagement signal product-market fit.
- Lesson 1606 — Examples of North Star Metrics by Industry
- Hue
- is what we typically call "color": red, green, blue, purple, etc.
- Lesson 1234 — Color: Hue, Saturation, and LuminanceLesson 1238 — Matching Encoding to Data Type
- Human-readability
- CSV > JSON > Excel > Parquet/Feather
- Lesson 1133 — Performance Considerations Across Formats
- Human-readable units
- Same units as your original data
- Lesson 49 — Standard Deviation: Interpretable Spread
- Hybrid approach
- – handling both global and local anomalies in one framework
- Lesson 1405 — What is Seasonal Hybrid ESD?
- Hypotheses
- Lesson 471 — Kruskal-Wallis H Test: The Non-Parametric One-Way ANOVALesson 787 — Ljung-Box Test for Residual AutocorrelationLesson 1508 — Pre-Registration and Correction Strategy
- Hypothesis
- The specific change and expected directional effect
- Lesson 1485 — Documentation and Pre-Registration
- Hypothesis Testing
- Z-scores help us ask "Is this result surprising?
- Lesson 201 — Z-Score Applications and Limitations
- Hypothesize
- Based on funnel analysis, identify a bottleneck
- Lesson 1692 — Statistical Significance and Iteration
I
- I (Integrated) - d
- Lesson 773 — Introduction to ARIMA: Components and Notation
- I Chart (Individuals Chart)
- Plots each single measurement and tracks whether the process mean is stable.
- Lesson 1404 — Control Charts for Individual Observations
- I-MR charts
- (Individual and Moving Range charts) come in.
- Lesson 1404 — Control Charts for Individual Observations
- idempotency
- so rerunning doesn't corrupt data, **checkpointing** to resume mid-pipeline, and **monitoring/alerts** for quick detection.
- Lesson 1825 — Designing Pipeline ArchitectureLesson 1847 — What is Idempotency?Lesson 1850 — Retry StrategiesLesson 1853 — Partial Failure Recovery
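One way to picture idempotency is a load step that upserts by a unique key, so a retry overwrites rather than duplicates. The sketch below uses an in-memory dict as a stand-in for a target table; `load_batch`, the `id` field, and the sample records are all hypothetical.

```python
def load_batch(store, records):
    """Idempotent load: upsert each record by its unique key."""
    for rec in records:
        # Writing to the same key replaces the earlier copy instead of adding a row.
        store[rec["id"]] = rec
    return store

batch = [{"id": 1, "amount": 50}, {"id": 2, "amount": 75}]
warehouse = {}
load_batch(warehouse, batch)
load_batch(warehouse, batch)  # simulated rerun after a failure

total = sum(rec["amount"] for rec in warehouse.values())
print(len(warehouse), total)
```

An append-only load would have doubled the total on the second run; keying on `id` leaves the result unchanged no matter how many times the batch is replayed.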
- Identify
- cells with residuals beyond ±2 (moderate) or ±3 (strong)
- Lesson 428 — Post-Hoc Analysis and Residuals
- Identify "flattening"
- When curves level off, you've found your core retained users, the ones likely to stick around long-term.
- Lesson 1656 — Visualizing Retention Curves
- Identify all relevant periods
- hourly (24), daily (7), weekly, etc.
- Lesson 1408 — Handling Multiple Seasonal Periods
- Identify all systems
- holding that person's data (data lineage helps here!
- Lesson 1909 — Right to Erasure and Data Retention Policies
- Identify conflicts
- Run `git status` to see which files have conflicts (marked as "both modified")
- Lesson 2018 — Resolving Conflicts During Rebase
- Identify core value actions
- What behaviors indicate someone is getting value?
- Lesson 1693 — Defining User Engagement
- Identify data sources
- Where should each variable come from?
- Lesson 2098 — Identifying Data Availability Gaps Early
- Identify direct causal relationships
- (does X directly cause Y?
- Lesson 1469 — Building a Simple Causal DAG
- Identify direct links
- Does your metric directly influence another team's metric?
- Lesson 1625 — Cross-Functional Metric Dependencies
- Identify meaningful strata
- Divide your population into non-overlapping groups based on important characteristics (age, income, region, education level, etc.
- Lesson 236 — Stratified Sampling
- Identify outliers
- Values with |z| > 3 are typically considered unusual
- Lesson 195 — Z-Score Definition and InterpretationLesson 542 — Computing Fitted Values and Residuals
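The |z| > 3 rule can be sketched with the standard library. This assumes the sample standard deviation is the intended scale; `zscore_outliers` and the readings are illustrative.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [x for x in values if abs((x - mean) / sd) > threshold]

# Twenty readings near 50 plus one extreme value (hypothetical data).
readings = [48, 49, 50, 51, 52] * 4 + [90]
flagged = zscore_outliers(readings)
print(flagged)
```

Note that an extreme value also inflates the mean and standard deviation it is measured against, so in very small samples no point may ever reach |z| > 3.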
- Identify patterns
- A single representative value helps you spot trends over time or differences between categories
- Lesson 38 — What is Central Tendency?
- Identify power features
- High adoption + high frequency = core value drivers
- Lesson 1696 — Feature Adoption and Usage Frequency
- Identify stratification variables
- (usually 1-3 key covariates)
- Lesson 1489 — Stratified Randomization Fundamentals
- Identify the confounder
- (from your previous analysis)
- Lesson 1430 — Controlling for Confounders: Stratification
- Identify the pre-rebase state
- Look for the entry just before you started the problematic rebase
- Lesson 2021 — Recovering from Rebase Mistakes
- Identify the reference distribution
- (standard normal for Z, t-distribution for t, etc.
- Lesson 319 — Calculating P-Values from Test Statistics
- Identify Unused Indexes
- Query your database's system catalogs to find indexes that are never or rarely used.
- Lesson 1086 — Index Maintenance and Monitoring
- Identifying actionable next steps
- What should the business *do* differently?
- Lesson 2090 — Stage 6: Interpretation and Insight Generation
- Identifying trends
- Detect consecutive increases or decreases
- Lesson 1023 — Introduction to Window Functions: LAG and LEAD
- Identity
- Coefficients are direct additive effects (simplest interpretation).
- Lesson 678 — Choosing the Right Link Function
- identity link
- is the simplest possible link function: it does absolutely nothing!
- Lesson 672 — The Identity LinkLesson 677 — Interpreting Coefficients Under Different LinksLesson 678 — Choosing the Right Link Function
- If it changes direction
- The control variable was suppressing the true relationship (a suppressor effect)
- Lesson 508 — Interpreting Partial Correlations
- If it remains strong
- The relationship between your two variables is genuine, independent of the control variable(s)
- Lesson 508 — Interpreting Partial Correlations
- If p-value < α
- Reject H₀ (the result is "statistically significant")
- Lesson 323 — What is a Significance Level (α)?
- If p-value > α
- Fail to reject H₀ (insufficient evidence against the null)
- Lesson 327 — Decision Rules: Reject or Fail to RejectLesson 356 — Making Decisions and Stating ConclusionsLesson 404 — Making Decisions and Drawing Conclusions
- If p-value ≤ α
- Reject H₀ (the data are unlikely under the null hypothesis)
- Lesson 327 — Decision Rules: Reject or Fail to RejectLesson 356 — Making Decisions and Stating Conclusions
- If p-value ≥ α
- Fail to reject H₀ (insufficient evidence)
- Lesson 323 — What is a Significance Level (α)?
- If you reject H₀
- You have sufficient evidence to support the alternative hypothesis.
- Lesson 404 — Making Decisions and Drawing Conclusions
- Ignore baseline context
- Start your chart at an unusual low point to make normal recovery look exceptional
- Lesson 1241 — Cherry-Picking Time Ranges
- Ignoring geographic size bias
- Large empty regions dominate visually even with low values.
- Lesson 1309 — Choropleth Maps: Basics and Best Practices
- Ignoring the base rates
- When calculating P(A|B), people forget that the prior probability P(A) matters enormously.
- Lesson 100 — Common Conditional Probability Mistakes
- Ignoring the clock
- You're two weeks past the deadline chasing marginal improvements while stakeholders have moved on or made decisions without you.
- Lesson 2119 — Signs You're Over-Engineering
- Immutable Data Patterns
- Rather than updating records in place, append new versions with timestamps or version numbers.
- Lesson 1848 — Designing Idempotent Operations
- Impact
- Number of rows affected, distribution changes, or new/dropped columns
- Lesson 1162 — Documenting TransformationsLesson 1883 — Protected Classes and Proxy VariablesLesson 1966 — Report Structure and Executive Summary
- Imperfect measurement instruments
- A broken thermometer that reads 2°C high introduces systematic error.
- Lesson 1880 — Measurement and Label Bias
- Implement access controls
- that enforce purpose-based restrictions
- Lesson 1915 — Secondary Use and Scope Creep
- Implement pagination
- (showing results in batches, like "page 1 of 100")
- Lesson 877 — LIMIT: Restricting the Number of Rows Returned
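Pagination with `LIMIT` is usually paired with `OFFSET` and a stable `ORDER BY`. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `items` table; other databases may use different syntax for the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [(f"item-{i}",) for i in range(1, 26)],  # 25 rows
)

def fetch_page(conn, page, page_size=10):
    """Return one page of rows; pages are numbered from 1."""
    offset = (page - 1) * page_size
    # ORDER BY keeps the page boundaries stable between calls.
    return conn.execute(
        "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()

page_3 = fetch_page(conn, page=3)
print(page_3)
```

The last page is simply shorter: with 25 rows and a page size of 10, page 3 holds the remaining five.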
- Implementation bugs
- Maybe your randomization code has an off-by-one error or timestamp issues
- Lesson 1524 — Sample Ratio Mismatch (SRM)
- Implementing Safeguards
- Lesson 1925 — Mitigation Strategies and Responsible Disclosure
- Implicit transformations
- You depend on data that's already been filtered or aggregated upstream, but that logic changes without notice.
- Lesson 2133 — Undocumented Data Dependencies
- Important nuances
- Lesson 1451 — Estimating Treatment Effects from Matched Samples
- Important requirements
- Lesson 1001 — INTERSECT: Finding Common Rows
- Impractical test duration
- – If your required sample size is large but your daily traffic is small, you'll need to run the test for weeks or months.
- Lesson 1493 — Why Sample Size Matters in A/B Tests
- Improve Data Integrity
- When that customer moves, you update one row in one table, not dozens of scattered records.
- Lesson 1061 — Introduction to Normalization
- Improve interpretability
- Clearer story about what drives your outcome
- Lesson 585 — Remedies: Variable Selection
- Improve your measurement precision
- Lesson 332 — The Trade-off Between Type I and Type II Errors
- Improving trends
- Later cohorts retain better than earlier ones.
- Lesson 1650 — Comparing Cohorts Over Time
- Impute
- Replace with mean, median, mode, or modeled values
- Lesson 1207 — Missing Data Assessment and Strategy
- IN
- typically builds a complete list of values first, then checks membership.
- Lesson 985 — EXISTS vs IN: Performance Considerations
- In business
- Lesson 802 — What is Survival Analysis?
- In final deliverables
- (reports, presentations, dashboards): explanation
- Lesson 1216 — Choosing the Right Purpose
- In Production
- ML systems degrade over time as data distributions shift.
- Lesson 2130 — No Clear Success Metric or Feedback Loop
- In science
- Lesson 802 — What is Survival Analysis?
- In-place modifications
- aren't supported—methods like `df.
- Lesson 1796 — Limitations and Differences from Pandas
- Incapacitated individuals
- People with cognitive impairments, dementia, or mental health conditions may not fully comprehend what they're consenting to
- Lesson 1918 — Special Populations and Vulnerable Groups
- Include a quick-start section
- that gets someone from zero to a working result in under five minutes—this builds confidence and engagement.
- Lesson 2080 — Usage Examples and Running Your Code
- Include notebook workflows
- Lesson 2080 — Usage Examples and Running Your Code
- Include null results
- that show no effect or relationship
- Lesson 1929 — Avoiding Cherry-Picking Results
- including
- items at exactly $10 and exactly $50.
- Lesson 860 — BETWEEN Operator for RangesLesson 1892 — Fairness Through Unawareness vs Awareness
- inclusive
- , meaning they include the boundary value itself.
- Lesson 857 — Comparison Operators: Greater and Less ThanLesson 860 — BETWEEN Operator for Ranges
- Inconsistent definitions
- across teams (is "active user" last 7 or 30 days?
- Lesson 1619 — What is Metric Ownership?
- Inconsistent formats
- User-entered data with typos, duplicates, or conflicting values
- Lesson 1762 — Extended Dimensions: Veracity and Value
- Inconsistent standards
- If your data collection team changes definitions midway (e.
- Lesson 1880 — Measurement and Label Bias
- Incorporate offline touchpoints
- Sales calls, conferences, or direct mail that standard models ignore
- Lesson 1731 — Custom Rule-Based Attribution
- Incorporates uncertainty
- It's a full distribution, not just a point estimate
- Lesson 1537 — The Posterior Distribution
- Incorporates uncertainty naturally
- The width of your posterior reflects how confident you are
- Lesson 1570 — Comparing Two Means: Bayesian Approach
- Increase I/O
- transferring massive result sets
- Lesson 911 — Performance Considerations with Multiple Groups
- Increase your sample size
- (collect more data)
- Lesson 332 — The Trade-off Between Type I and Type II ErrorsLesson 340 — Power and Sample Size Relationship
- Increased CAC pressure
- You need constant acquisition just to maintain size, let alone grow
- Lesson 1670 — What is Churn and Why It Matters
- Increases statistical power
- by focusing only on within-pair changes
- Lesson 370 — Differences as the Unit of Analysis
- Incremental collaboration
- Breaking large features into reviewable chunks while still working
- Lesson 2029 — Draft Pull Requests and WIP Workflows
- Incremental efficiency
- does adding channel X improve overall LTV:CAC?
- Lesson 1716 — Channel Mix and Portfolio Thinking
- Incremental testing
- runs initiatives sequentially or uses holdout groups to isolate each team's effect.
- Lesson 1640 — Attribution in Multi-Team Environments
- Incrementality
- asks: "What would have happened *without* this channel?
- Lesson 1717 — Incrementality and True Channel ImpactLesson 1718 — Introduction to Marketing AttributionLesson 1743 — What is Incrementality?Lesson 1744 — Incrementality vs Attribution
- Incrementality correlation
- Do the model's channel credits align with incrementality tests (like the control group experiments covered earlier)?
- Lesson 1734 — Comparing and Validating Attribution Models
- Independence
- means making decisions based solely on data and sound methodology—not on what others want to hear.
- Lesson 35 — Conflicts of Interest and IndependenceLesson 131 — Real-World Applications of Binomial DistributionsLesson 218 — What the Central Limit Theorem StatesLesson 382 — Robustness of t-Tests to Assumption ViolationsLesson 398 — Choosing Between Parametric and Non-Parametric TestsLesson 400 — Assumptions and Conditions for Proportion TestsLesson 419 — Assumptions and Minimum Expected FrequenciesLesson 447 — Conducting One-Way ANOVA in Practice (+5 more)
- Independence of Observations
- Lesson 426 — Assumptions and Sample Size RequirementsLesson 448 — Independence of ObservationsLesson 470 — When Parametric ANOVA Assumptions Fail
- Independence of Paired Differences
- Lesson 374 — Assumptions of the Paired t-Test
- Independence violated
- → Reconsider your analysis approach entirely
- Lesson 383 — Diagnostic Workflow: When to Proceed or Switch Tests
- independent
- when the outcome of one doesn't affect the outcome of the other.
- Lesson 87 — Multiplication Rule for Independent EventsLesson 88 — General Multiplication RuleLesson 111 — Spam Filtering with Naive BayesLesson 126 — From Bernoulli to Binomial: Multiple TrialsLesson 144 — Poisson Applications: Arrivals and EventsLesson 176 — Sum of Independent Normal VariablesLesson 359 — Two-Sample t-Test OverviewLesson 361 — Pooled Variance t-Test (+4 more)
- Independent advocates
- for incapacitated individuals
- Lesson 1918 — Special Populations and Vulnerable Groups
- Independent groups
- different subjects in each group, not repeated measures
- Lesson 438 — When to Use One-Way ANOVA
- independent observations
- and a **sufficiently large sample size** (typically n ≥ 30, though this depends on the population distribution).
- Lesson 225 — CLT for Sums and Other StatisticsLesson 1389 — What is Grubbs' Test?
- Independent samples
- come from two different, unrelated groups.
- Lesson 360 — Independent vs. Dependent Samples
- index
- is a separate data structure that the database maintains to help find rows quickly without scanning the entire table.
- Lesson 1078 — What Are Indexes and Why They MatterLesson 1804 — Index Optimization and Reset Strategies
- Index bloat
- happens when deleted records leave empty space that isn't automatically reclaimed, making indexes larger than necessary.
- Lesson 1086 — Index Maintenance and Monitoring
- Index plots
- of residuals to spot specific observation numbers
- Lesson 587 — Identifying Outliers in Regression Context
- Index supporting columns
- Ensure columns referenced in correlated conditions are indexed
- Lesson 969 — Performance Considerations for SELECT Subqueries
- Index usage
- Are your indexes still effective?
- Lesson 1077 — Measuring Performance Impact of Denormalization
- Indexes
- Using indexed columns can make certain join orders faster
- Lesson 951 — Join Order and Performance
- Individual t-tests
- Ask "Does *this specific* predictor add value?
- Lesson 622 — Relationship Between F-Test and t-Tests
- Industry research
- Published studies, competitor analyses, domain blogs
- Lesson 1201 — Domain Knowledge as a Hypothesis Source
- Inference
- When you need trustworthy hypothesis tests and prediction intervals
- Lesson 550 — Normality of ResidualsLesson 1594 — PyMC: Probabilistic Programming in Python
- Inflated standard errors
- The uncertainty around coefficient estimates increases dramatically
- Lesson 580 — What is Multicollinearity?
- Influence
- is about *actual impact*—how much the regression line would change if you removed that observation.
- Lesson 571 — What Are Leverage and Influence?Lesson 574 — Influence: Impact on Fitted ModelLesson 2101 — Identifying and Mapping Stakeholders
- Influence vs. Interest Matrix
- Plot stakeholders on two axes:
- Lesson 2101 — Identifying and Mapping Stakeholders
- Influenced by time trends
- Both variables increase over time independently
- Lesson 494 — Spurious Correlations and Coincidence
- Info/Log only
- Minor retries succeeded, small delays—for forensic review later
- Lesson 1858 — Alerting Strategies
- informative prior
- .
- Lesson 1543 — Defining Prior DistributionsLesson 1581 — Setting Priors for A/B Tests
- Informative priors
- reflect strong beliefs.
- Lesson 1534 — The Prior DistributionLesson 1544 — Informative vs Uninformative Priors
- Infrastructure costs
- Hosting, computing resources, and database connections
- Lesson 1979 — Maintenance and Sustainability Considerations
- Infrastructure debt
- Manual processes that should be automated
- Lesson 2131 — What is Technical Debt in Data Science?
- Initial belief (prior)
- Maybe there's a 20% chance the suspect is guilty based on background.
- Lesson 114 — Sequential Updating
- INNER JOIN
- is SQL's way of bringing together information from two separate tables based on a relationship between them.
- Lesson 918 — What is an INNER JOIN?Lesson 928 — LEFT JOIN vs INNER JOIN: When to Use Each
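A small runnable illustration, using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace"), (3, "Linus")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0)],
)

# Only rows with a match in both tables survive the join.
rows = conn.execute(
    """
    SELECT c.name, o.total
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
    """
).fetchall()
print(rows)
```

Linus has no orders, so he does not appear in the result; a `LEFT JOIN` would keep him with a `NULL` total.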
- INNER JOIN table2
- The table you're joining to (the "right" table)
- Lesson 919 — Basic INNER JOIN Syntax
- Inner query alias
- (`inner`): identifies columns from the subquery
- Lesson 976 — Basic Correlated Subquery Syntax
- Input data context
- What data was being processed when it broke?
- Lesson 1851 — Error Logging and Notifications
- Input(s)
- The component property you're monitoring (e.
- Lesson 1335 — Dash Callbacks: Adding Interactivity
- INSERT protection
- You cannot add a child record unless the referenced parent exists
- Lesson 1052 — Foreign Key Constraints
- INSERT/UPDATE
- The database verifies the foreign key value exists in the parent table
- Lesson 1060 — Trade-offs: Performance vs Integrity
- Inserting
- new records
- Lesson 844 — What is SQL?Lesson 1124 — Insert, Update, Delete, and Bulk Operations
- Insertion Anomalies
- Lesson 1062 — Data Anomalies: Insert, Update, Delete
- Inside the bounds
- (between the dashed lines): The autocorrelation is **not statistically significant**—it could easily be random noise
- Lesson 723 — Significance Bounds in ACF Plots
- Inspect source data
- Go to the original data source.
- Lesson 1870 — Root Cause Analysis for Quality Issues
- Inspect your data first
- If an integer column only contains values between 0 and 100, you don't need `int64`—`int8` (range: -128 to 127) suffices.
- Lesson 1799 — Optimal Data Types and Downcasting
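The "inspect first, then pick the smallest type" step can be sketched without any libraries. The helper `smallest_int_dtype` is hypothetical and only returns a dtype name; in pandas itself, `pd.to_numeric(series, downcast="integer")` performs a comparable selection.

```python
# Signed integer types and their inclusive ranges, smallest first.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]

def smallest_int_dtype(values):
    """Return the name of the narrowest signed integer type that holds every value."""
    lo, hi = min(values), max(values)
    for name, type_min, type_max in INT_RANGES:
        if type_min <= lo and hi <= type_max:
            return name
    raise OverflowError("values do not fit in int64")

percent_dtype = smallest_int_dtype(range(0, 101))  # values 0..100
wider_dtype = smallest_int_dtype([0, 200])         # 200 exceeds the int8 maximum of 127
print(percent_dtype, wider_dtype)
```

The check uses the observed minimum and maximum, so it is only safe if future data stays within the same range.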
- Installation Instructions
- Step-by-step commands to set up the environment
- Lesson 2077 — The Purpose and Anatomy of a Good README
- Instead of
- "The slope coefficient β₁ = 2.
- Lesson 530 — Communicating Results to Non-Technical AudiencesLesson 1955 — Framing Insights in Business Language
- Institutional review
- (like ethics boards) before data collection
- Lesson 1918 — Special Populations and Vulnerable Groups
- Instrumentation Issues
- Logging errors, tracking bugs, or data pipeline problems often surface during A/A tests
- Lesson 1483 — Pre-Experiment Validation
- Insurance claims
- Total claim amounts in a period
- Lesson 181 — Gamma Distribution: Shape and Rate Parameters
- Integer division
- Some databases may truncate decimal places if the column is an integer type
- Lesson 884 — AVG: Computing Averages
- Integrated (Main Body)
- Lesson 1947 — Handling Methodology and Technical Details
- Integrates segments
- into downstream workflows like marketing automation, pricing engines, or customer support tools
- Lesson 1710 — Operationalizing Segments: Scoring and Deployment
- Integrity and Confidentiality
- Lesson 1905 — Core Principles of GDPR
- Intent matters
- Ask yourself: "Am I creating this visualization to inform or to persuade dishonestly?
- Lesson 1247 — The Ethics of Visualization Design
- Intent-to-Treat (ITT)
- means you analyze every participant in the group they were *originally randomized to*, regardless of what they actually did.
- Lesson 1439 — Intent-to-Treat AnalysisLesson 1748 — Intent-to-Treat Analysis
- interaction
- where the effect of one factor depends on the level of the other.
- Lesson 463 — Introduction to Two-Way ANOVALesson 466 — Visualizing InteractionsLesson 561 — Residuals vs Predictor Plots
- Interaction analysis
- (do color and size amplify each other's effects?
- Lesson 1482 — Control and Treatment Design
- interaction effect
- .
- Lesson 465 — Interaction EffectsLesson 1195 — Interaction Effects Between Variables
- interaction effects
- worth exploring (e.
- Lesson 1201 — Domain Knowledge as a Hypothesis SourceLesson 1531 — Interference from Concurrent TestsLesson 1689 — Multivariate Testing and Personalization
- Interaction effects analysis
- (Lesson 1195) shows whether variables work together or independently
- Lesson 1197 — Identifying Variable Importance and Redundancy
- Interaction is available
- (rotation helps overcome perspective distortion)
- Lesson 1323 — Introduction to 3D Plotting in Matplotlib
- Interaction plots
- make these non-additive effects visible at a glance.
- Lesson 466 — Visualizing Interactions
- interaction term
- represents a relationship where the effect of one predictor on your outcome variable *depends on* the level or value of another predictor.
- Lesson 648 — What are Interaction Terms?Lesson 653 — Interpreting Categorical × Categorical InteractionsLesson 1455 — DiD with Regression
- Interactive 2D plots
- Let users filter and explore without perspective distortion
- Lesson 1329 — Effective Use and Pitfalls of 3D Visualizations
- Interactive dashboards are primary
- While R has Shiny, Python's Streamlit and Dash often integrate more naturally into broader Python ecosystems.
- Lesson 1375 — Choosing Tools: When to Use R vs Python for Visualization
- Interactive zoom
- Let users explore crowded areas at different scales
- Lesson 1310 — Point Maps and Scatter Plots on Maps
- Interleave explanation with code
- Write markdown cells that introduce your analysis approach, then show the actual code that implements it
- Lesson 1982 — Literate Programming with Notebooks
- Intermediate outputs
- Cleaned datasets, feature engineering results
- Lesson 2065 — Tracking Data Lineage
- Internal databases
- Your organization's own records (sales, customer info, logs)
- Lesson 11 — Data Collection and Acquisition
- Internal first
- Alert your organization's leadership and legal/ethics teams
- Lesson 1925 — Mitigation Strategies and Responsible Disclosure
- Internal validity
- asks: *Are the results truly caused by what you think caused them?
- Lesson 1441 — Internal vs External Validity
- Interpret
- Positive residuals mean more observations than expected; negative means fewer
- Lesson 428 — Post-Hoc Analysis and ResidualsLesson 436 — Conducting McNemar's TestLesson 685 — Confidence Intervals for Odds RatiosLesson 740 — Choosing Between Differencing and Detrending
- Interpret in Context
- Lesson 447 — Conducting One-Way ANOVA in Practice
- Interpretability
- Results speak directly about means—easier to communicate and understand in most contexts.
- Lesson 475 — Choosing Between Parametric and Non-Parametric TestsLesson 1555 — Advantages and Limitations of Conjugate PriorsLesson 2102 — Understanding Stakeholder Goals and ConstraintsLesson 2123 — Simple Rules Beat Complex Models
- Interpretation
- What does "revenue" mean—gross or net?
- Lesson 23 — Data Provenance and MetadataLesson 443 — Mean Squares and the F-RatioLesson 533 — Interpreting R-Squared ValuesLesson 647 — Impact on Model Results and ReportingLesson 691 — Interpreting Poisson CoefficientsLesson 827 — Hazard Ratios and InterpretationLesson 1580 — Bayesian vs Frequentist A/B TestingLesson 1629 — SaaS Growth Metrics: Quick Ratio and Net Revenue Retention
- Interpretation cells
- Discuss what results mean (markdown referencing outputs above)
- Lesson 1982 — Literate Programming with Notebooks
- Interpretation guideline
- context (small/medium/large, or domain-specific benchmarks)
- Lesson 389 — Reporting Effect Sizes in Practice
- Interpretation guidelines
- Lesson 445 — Effect Size: Eta-Squared and Omega-SquaredLesson 472 — Interpreting Kruskal-Wallis Results and Effect Size
- Interpreting the condition number
- Lesson 583 — Condition Number and Eigenvalues
- Interpreting variability
- Standard deviation and variance assume certain shapes.
- Lesson 63 — Understanding Distribution Shape
- Interquartile Range (IQR)
- is a measure of variability that tells you how spread out the middle half of your data is.
- Lesson 51 — Interquartile Range (IQR)Lesson 56 — Understanding Percentiles and Their InterpretationLesson 1176 — Box Plots for Spread and OutliersLesson 1383 — Understanding the Interquartile Range (IQR)Lesson 1384 — The IQR Outlier Detection Rule
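The "middle half" is the span between the first and third quartiles, so IQR = Q3 − Q1. A minimal sketch using the standard library follows; the choice of `method="inclusive"` is an assumption, and other quartile conventions can give slightly different values on small samples.

```python
import statistics

def iqr(values):
    """Return (q1, q3, iqr) using the inclusive quartile convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3, q3 - q1

q1, q3, spread = iqr([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(q1, q3, spread)
```

Because it ignores the lowest and highest quarters of the data, the IQR barely moves when a single extreme value is added, unlike the range or standard deviation.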
- Intersectionality
- recognizes that a Black woman's experience isn't just "being Black" plus "being a woman"—it's a unique combined experience.
- Lesson 1893 — Intersectionality in Fairness
- Interval censoring
- means you know the event occurred within a specific time window, but not the precise moment.
- Lesson 805 — Left and Interval Censoring
- Interval data
- (numeric with no true zero: temperature in Celsius, dates) suits:
- Lesson 1238 — Matching Encoding to Data Type
- Interval/ratio data
- Meaningful numeric measurements
- Lesson 398 — Choosing Between Parametric and Non-Parametric Tests
- Introduces scope creep
- that derails core objectives
- Lesson 2107 — Saying No and Pushing Back Constructively
- Introduction
- Problem statement, objectives, context
- Lesson 1966 — Report Structure and Executive Summary
- Introduction cells
- State the question and context (markdown)
- Lesson 1982 — Literate Programming with Notebooks
- Intuition
- If you expect 3 heads in 10 fair coin flips (10 × 0.
- Lesson 129 — Binomial Mean and VarianceLesson 136 — Expectation and Variance of the Negative Binomial
- Intuitive interpretation
- The Beta parameters have natural meanings:
- Lesson 1551 — Beta-Binomial Conjugacy
- Invalid inference
- Hypothesis tests and confidence intervals are incorrect
- Lesson 734 — Why Differencing and Detrending Matter
- Invalid statistical inference
- Standard errors, confidence intervals, and hypothesis tests become meaningless because they assume stability that isn't there.
- Lesson 713 — Why Stationarity Matters
- Inventory turnover ratio
- measures how many times you sell and replace stock annually:
- Lesson 1634 — Retail Metrics: Same-Store Sales and Inventory Turnover
- Inverse-Gamma
- part models uncertainty about σ²
- Lesson 1568 — Unknown Variance: Normal-Inverse-Gamma Model
- Inverted S-shape
- Light-tailed distribution (fewer extreme values)
- Lesson 565 — What Q-Q Plots Show: Comparing Residual Distribution to NormalLesson 566 — Reading Q-Q Plots: Interpreting Points Along the Reference Line
- Invest in data collection
- if possible, but accept that sometimes you need to deliver value *now* with what you have.
- Lesson 2124 — Insufficient or Low-Quality Data
- Investigate First
- Lesson 579 — What to Do with Influential Points
- Investment advice
- based only on winning stocks ignores all the losers that went to zero
- Lesson 247 — Survivorship Bias
- Involuntary churn
- occurs without customer intent—usually from failed payments, expired credit cards, or technical issues.
- Lesson 1670 — What is Churn and Why It MattersLesson 1671 — Churn Rate Calculation Methods
- IoT sensor data
- Temperature, energy consumption, or manufacturing metrics with predictable rhythms
- Lesson 1411 — Applications and Limitations
- IQR method
- makes no such assumption—it relies on quartiles and is robust to skewed or non-normal distributions.
- Lesson 1386 — IQR Method vs Z-Score: When to Use Each
- IQR methods
- give you rules of thumb for flagging outliers, whereas Grubbs' Test takes a more rigorous approach.
- Lesson 1389 — What is Grubbs' Test?
- Irreducibility
- The chain can eventually reach any state from any other state
- Lesson 1589 — Markov Chains: The Foundation of MCMC
- Irregular
- components, you face a fundamental choice: do these pieces combine by adding or by multiplying?
- Lesson 710 — Additive vs Multiplicative ModelsLesson 744 — Classical Decomposition Methods
- Irreversibility matters
- Decisions are costly to reverse
- Lesson 1522 — Balancing Speed and Accuracy in Metric Selection
- Isolate Seasonality (S)
- Average the detrended values for each season (e.
- Lesson 744 — Classical Decomposition Methods
- Isolate the problem
- Test from different machines or networks to rule out local issues
- Lesson 1093 — Troubleshooting Connection Issues
- Isolates brand impact
- The PSA has no commercial intent, so any lift from your real ad is truly incremental
- Lesson 1747 — Ghost Ads and PSA Tests
- Isolating the treatment effect
- Differences in outcomes are more likely due to treatment, not pre-existing differences
- Lesson 1445 — The Matching Framework
- Isolation
- Concurrent transactions don't interfere with each other
- Lesson 1110 — What Are Database Transactions?
- Issue tracker
- Direct link to GitHub Issues or your bug tracking system
- Lesson 2083 — Contributing Guidelines and Contact Information
- It's poorly documented
- ("I'll remember what this does")
- Lesson 2132 — Pipeline Glue Code and Complexity Creep
- It's tightly coupled
- to specific data formats or versions
- Lesson 2132 — Pipeline Glue Code and Complexity Creep
- iterate
- .
- Lesson 15 — Deployment, Monitoring, and IterationLesson 25 — The Scientific Method in Data Science
- Iteration
- means deliberately refining your approach based on what you learned—testing a new feature, adjusting model complexity, or exploring a different angle after stakeholder feedback.
- Lesson 2112 — Iteration vs Rework: Learning from Each CycleLesson 2142 — Interviewing: Technical and Behavioral Prep
- Iteration is critical
- You need to test dozens of variants quickly
- Lesson 1522 — Balancing Speed and Accuracy in Metric Selection
- Iterative algorithms
- Machine learning models that require hundreds of passes over the data
- Lesson 1784 — Computation Complexity: Beyond Data Size
J
- Jarque-Bera test
- takes a unique approach: it specifically looks at two shape characteristics—**skewness** and **kurtosis**—and combines them into a single test statistic.
- Lesson 208 — Jarque-Bera Test
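A minimal sketch of how skewness and kurtosis combine into the Jarque-Bera statistic, using the usual textbook form JB = n/6 · (S² + (K − 3)²/4) with moment-based estimates. The function name and sample data are illustrative assumptions; in practice a statistics library would also return a p-value.

```python
def jarque_bera_statistic(values):
    """Return the Jarque-Bera statistic from sample skewness and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments (dividing by n, as in the standard JB definition).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


# Hypothetical, perfectly symmetric sample: skewness is 0, so only
# the kurtosis term contributes.
jb = jarque_bera_statistic([1, 2, 3, 4, 5])
print(f"JB statistic: {jb:.4f}")
```

Under the null hypothesis of normality the statistic is compared against a chi-squared distribution with 2 degrees of freedom (5% critical value of about 5.99). That approximation is unreliable for a sample as small as this one.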
- Jitter
- them (each person shifts slightly so all faces are visible)
- Lesson 1353 — Position Adjustments: Dodge, Stack, and Jitter
- Jittering
- Slightly randomize positions (when exact location isn't critical)
- Lesson 1310 — Point Maps and Scatter Plots on Maps
- Job performance studies
- Both competence and charisma can lead to promotion.
- Lesson 1473 — Conditioning on Colliders: Selection Bias
- Joining tables
- Multiple tables might have overlapping column names, causing confusion
- Lesson 851 — Selecting All Columns with Asterisk
- JSON
- handles nested data well but creates significant memory overhead with all its bracket and quote characters.
- Lesson 1133 — Performance Considerations Across FormatsLesson 1779 — Reading and Writing Data in SparkLesson 2072 — Configuration Files vs Hard-Coded Values
- Just right
- Reveals the true distribution shape clearly
- Lesson 1267 — Histograms and Distribution Plots
- Justified Removal (Last Resort)
- Lesson 579 — What to Do with Influential Points
K
- k < 1
- Decreasing failure rate (infant mortality—defects fail early)
- Lesson 187 — The Weibull Distribution: Shape, Scale, and SurvivalLesson 188 — Weibull Distribution: Hazard Function and ReliabilityLesson 189 — Fitting Weibull Models to Lifetime Data
- k = 1
- Constant failure rate (becomes the exponential distribution—random failures)
- Lesson 187 — The Weibull Distribution: Shape, Scale, and SurvivalLesson 188 — Weibull Distribution: Hazard Function and ReliabilityLesson 189 — Fitting Weibull Models to Lifetime Data
- k > 1
- Increasing failure rate (wear-out phase—things break down over time)
- Lesson 187 — The Weibull Distribution: Shape, Scale, and SurvivalLesson 188 — Weibull Distribution: Hazard Function and ReliabilityLesson 189 — Fitting Weibull Models to Lifetime Data
- K-anonymity
- ensures that each record is indistinguishable from at least *k-1* other records when considering quasi-identifiers (age, ZIP code, gender).
- Lesson 1895 — Data Anonymization BasicsLesson 1896 — K-AnonymityLesson 1897 — L-Diversity and T-ClosenessLesson 1911 — GDPR Compliance for Data Scientists
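The K-anonymity definition above can be checked mechanically: group records by their quasi-identifier values and confirm every group has at least k members. The sketch below assumes records are plain dictionaries; the field names (`age_band`, `zip3`, `diagnosis`) are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears at least k times."""
    group_sizes = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return all(size >= k for size in group_sizes.values())


# Hypothetical, already-generalized records.
records = [
    {"age_band": "30-39", "zip3": "021", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "021", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "flu"},
]

# The third record is alone in its group, so the data is not 2-anonymous.
print(is_k_anonymous(records, ["age_band", "zip3"], k=2))
```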
- Kaplan-Meier
- to estimate response probability curves by segment
- Lesson 841 — Campaign Response Time Analysis
- Kaplan-Meier curves
- Lesson 836 — Employee Turnover and Retention Analysis
- Kaplan-Meier estimator
- , you can plot conversion curves that account for censoring (prospects still "alive" but not yet converted).
- Lesson 839 — Time-to-Conversion in Marketing Funnels
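To make the censoring adjustment concrete, here is a small sketch of the Kaplan-Meier product-limit calculation in plain Python. The function name, the input layout (one duration and one event flag per prospect), and the sample data are assumptions for illustration. For a conversion curve, "survival" is the share of prospects not yet converted.

```python
def kaplan_meier(durations, event_observed):
    """Return [(time, survival_probability)] at each time with an event.

    durations: time until conversion, or until observation stopped
    event_observed: True if the conversion happened, False if censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, e in zip(durations, event_observed) if d == t and e)
        censored = sum(1 for d, e in zip(durations, event_observed) if d == t and not e)
        if events:
            # Multiply by the share of at-risk subjects that did not convert at t.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Converted and censored subjects both leave the risk set after t.
        at_risk -= events + censored
    return curve


# Hypothetical data: days until conversion; False marks prospects still unconverted.
durations = [1, 2, 2, 3, 4]
converted = [True, True, False, True, False]
curve = kaplan_meier(durations, converted)
for day, share_unconverted in curve:
    print(day, round(share_unconverted, 3))
```

Censored prospects never trigger a drop in the curve, but they do shrink the risk set, which is how the estimator uses their partial information.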
- KDE (Kernel Density Estimate)
- adds a smooth curve that estimates the underlying probability distribution, helping you see trends the blocky bins might obscure.
- Lesson 1267 — Histograms and Distribution Plots
- Keep conditions simple
- Complex nested CASE statements are hard to maintain and slower to execute.
- Lesson 1037 — CASE Best Practices and Performance
- Keep CTEs focused
- Each CTE should represent one logical step
- Lesson 997 — CTE Best Practices and Performance
- Keep it simple
- Single-column integer keys perform best for joins and indexing
- Lesson 1050 — Choosing Effective Primary KeysLesson 1679 — Defining Funnel Steps and Events
- Keep separate
- Analyze complete vs incomplete groups
- Lesson 1207 — Missing Data Assessment and Strategy
- Kendall
- when you have outliers, skewed distributions, or ordinal (ranked) data.
- Lesson 1184 — Correlation Coefficients in Bivariate Analysis
- Kendall correlation
- also uses ranks but counts how often pairs of observations agree in their ordering.
- Lesson 1184 — Correlation Coefficients in Bivariate Analysis
- Kendall's Tau
- counts *concordant and discordant pairs*—comparing every possible pair of observations to see if they agree in direction.
- Lesson 490 — Kendall's Tau vs Spearman's Rho
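The pair-counting idea behind Kendall's Tau can be sketched directly. This example computes the tau-a variant, which has no correction for ties; the function name and data are illustrative assumptions.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables order the pair the same way
        elif direction < 0:
            discordant += 1   # the variables disagree on the ordering
    total_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical rankings that disagree on exactly one pair.
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(round(tau, 3))
```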
- Kernel Density Estimation
- is the mathematical technique behind these visualizations.
- Lesson 1312 — Heatmaps and Density Maps for Spatial Data
- Kernel Density Estimation (KDE)
- is a technique that creates a smooth curve approximating your data's probability distribution.
- Lesson 1177 — Density Plots and KDE
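A minimal sketch of what a KDE computes, assuming a Gaussian kernel and a hand-picked bandwidth (both are choices made for this example, not something the glossary specifies): the density at a point is the average of small bell curves centered on each observation.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function estimating the density at x with a Gaussian kernel."""
    normalizer = len(data) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, evaluated at x.
        bumps = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return bumps / normalizer

    return density


# Hypothetical observations; the bandwidth controls how smooth the curve is.
density = gaussian_kde([1.0, 1.5, 2.0, 4.0], bandwidth=0.5)
for x in (1.5, 3.0, 4.0):
    print(x, round(density(x), 4))
```

A smaller bandwidth follows the data more closely and gives a bumpier curve; a larger one smooths more aggressively.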
- Kernel Density Plots
- (or density curves) smooth out the histogram into a continuous curve.
- Lesson 203 — Visual Assessment: Histograms and Density Plots
- Kernel Matching
- uses a weighted average of *all* control units, with weights based on distance from each treated unit's propensity score.
- Lesson 1448 — Propensity Score Matching Methods
- Key advantage
- Simple correction; coefficients remain interpretable as log-rate ratios.
- Lesson 694 — Quasi-Poisson and Negative Binomial Models
- Key characteristic
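- Unlike the sampling distribution of the mean (which becomes approximately normal thanks to the CLT), the sampling distribution of the variance is tied to the **chi-squared distribution**: when the population is normal, the scaled quantity (n − 1)s²/σ² follows a chi-squared distribution with n − 1 degrees of freedom.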
- Lesson 254 — Sampling Distribution of the Sample Variance
- Key conditions for convergence
- Lesson 1589 — Markov Chains: The Foundation of MCMC
- key difference
- is simply what you're waiting for:
- Lesson 137 — Geometric vs Negative Binomial: Key DifferencesLesson 229 — Defining Samples and Statistics
- Key factors affecting significance
- Lesson 1692 — Statistical Significance and Iteration
- Key lesson
- Always ask "what else might explain this pattern?"
- Lesson 1426 — Real-World Examples: Correlation vs Causation
- Key partitioning principles
- Lesson 1782 — Spark Performance Basics: Partitions and Caching
- Key properties
- Lesson 572 — Leverage: Distance in X-Space
- Key Results
- are 2-5 specific, measurable outcomes that define *how* you'll know you've succeeded.
- Lesson 1607 — Introduction to OKRs (Objectives and Key Results)
- Kill the jargon
- Replace technical variable names like "churn_propensity_score_v2" with "Customer Risk Level."
- Lesson 1958 — Simplifying Visual Complexity
- Know your exit options
- Sometimes you must escalate to leadership or, in extreme cases, consider **responsible disclosure** or changing roles.
- Lesson 1931 — When to Push Back on Requests
- Knowledge spreads
- Reviewers learn about changes they didn't write; contributors get feedback that improves their skills
- Lesson 2022 — Understanding Pull Requests
- Known Unknowns
- "We don't yet know if historical patterns hold post-merger"
- Lesson 2100 — Documenting Assumptions and Open Questions
- Known variance structure
- The variance function follows directly from the exponential family form
- Lesson 670 — Why Exponential Family Matters for GLMs
- Kolmogorov-Smirnov
- to check if numeric distributions match theoretical ones (like normal distribution).
- Lesson 1208 — Distribution Checks for All Variables
- Kolmogorov-Smirnov (K-S) test
- takes a slightly different approach.
- Lesson 206 — Kolmogorov-Smirnov Test
- KPSS test
- High p-value (> 0.
- Lesson 718 — Interpreting Stationarity Test ResultsLesson 741 — Testing Stationarity After Transformation
- Kruskal-Wallis test
- The non-parametric cousin of one-way ANOVA, comparing groups using ranks (often read as a comparison of medians when the groups' distributions have similar shapes)
- Lesson 470 — When Parametric ANOVA Assumptions Fail
- Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test
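- is a statistical test for stationarity whose hypotheses are reversed relative to the Augmented Dickey-Fuller test: KPSS takes stationarity as the null hypothesis, whereas ADF takes a unit root (non-stationarity) as the null.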
- Lesson 717 — KPSS Test
L
- L-Diversity
- addresses this by requiring that each equivalence class (group of indistinguishable records) contains at least **L well-represented values** for sensitive attributes.
- Lesson 1897 — L-Diversity and T-Closeness
- LA's coefficient
- (say, -5): means LA is 5 units lower than Boston
- Lesson 643 — Interpreting Coefficients Relative to Reference
- Label bias
- happens when subjective human judgment creates inconsistent or skewed labels for supervised learning.
- Lesson 1880 — Measurement and Label Bias
- Labels + Color
- Add direct text labels or annotations to clarify what colors represent
- Lesson 1251 — Avoiding Reliance on Color Alone
- Labels and Titles
- Use `set_xlabel()`, `set_ylabel()`, and `set_title()` to give context.
- Lesson 1270 — Customizing Axes: Labels, Limits, and Scales
- LAG
- and **LEAD** window functions make this trivial by letting you "peek" at other rows directly.
- Lesson 1023 — Introduction to Window Functions: LAG and LEAD
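The "peek at other rows" behavior of LAG and LEAD can be tried with Python's built-in `sqlite3` module, assuming the bundled SQLite is version 3.25 or newer (the first release with window functions). The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [(1, 100), (2, 130), (3, 120)],
)

# LAG looks one row back and LEAD one row ahead, in the order given by OVER.
rows = conn.execute(
    """
    SELECT day,
           revenue,
           LAG(revenue)  OVER (ORDER BY day) AS previous_revenue,
           LEAD(revenue) OVER (ORDER BY day) AS next_revenue
    FROM daily_sales
    ORDER BY day
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The first row has no predecessor and the last row has no successor, so those positions come back as NULL (`None` in Python).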
- Lag 1
- Correlation between consecutive observations (today vs.
- Lesson 719 — What is Autocorrelation?Lesson 720 — The Autocorrelation Function (ACF)
- Lag 2
- Correlation between observations two steps apart (today vs.
- Lesson 719 — What is Autocorrelation?Lesson 720 — The Autocorrelation Function (ACF)
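The lag-1 and lag-2 entries above use the same formula with a different shift. The sketch below uses the standard sample autocorrelation estimator (deviations from the overall mean, divided by the total sum of squares); the function name and series are illustrative assumptions.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    denominator = sum(d * d for d in deviations)
    # Pair each observation with the one `lag` steps earlier.
    numerator = sum(deviations[t] * deviations[t - lag] for t in range(lag, n))
    return numerator / denominator


# Hypothetical short series with a steady upward trend.
series = [1, 2, 3, 4, 5]
print(autocorrelation(series, 1), autocorrelation(series, 2))
```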
- Lagging
- Ultimate business metrics (revenue per user, customer lifetime value)
- Lesson 1601 — Balancing Leading and Lagging Metrics
- Lagging indicators
- are outcome-focused metrics that tell you *what has already happened*.
- Lesson 1597 — What Are Leading and Lagging Indicators?
- Lagging metrics provide accountability
- They're your scoreboard, measuring final outcomes.
- Lesson 1601 — Balancing Leading and Lagging Metrics
- lambda (λ)
- , called the **rate parameter**.
- Lesson 139 — The Poisson Process and Rate ParameterLesson 214 — Box-Cox Transformation
- Language-agnostic
- The same DataFrame operations work in Python, Scala, R, and Java
- Lesson 1778 — DataFrames and Spark SQL Basics
- Laplace mechanism
- adds noise drawn from a Laplace distribution.
- Lesson 1899 — Adding Noise for Privacy
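A hedged sketch of the Laplace mechanism for a counting query. It assumes the common calibration in which the noise scale equals sensitivity divided by epsilon; the function names, the epsilon value, and the count are all illustrative, and a production system would use a vetted differential-privacy library instead.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    # Two independent exponentials with mean `scale` differ by a Laplace variate.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return the count plus Laplace noise with scale sensitivity / epsilon."""
    if rng is None:
        rng = random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical query: smaller epsilon means more noise and stronger privacy.
noisy = private_count(true_count=100, epsilon=0.5, rng=random.Random(42))
print(round(noisy, 2))
```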
- large
- , you fail to reject—the equal variance assumption seems reasonable.
- Lesson 380 — Testing Equal Variances: Levene's and Bartlett's TestsLesson 1871 — Why Version Control for Data?Lesson 2070 — Separating Data from Code
- Large Cook's Distance
- Point substantially changes the regression line
- Lesson 578 — Visualizing Leverage and Influence
- Large effect
- d ≈ 0.
- Lesson 385 — Cohen's d for Standardized Mean DifferencesLesson 386 — Effect Size Interpretation GuidelinesLesson 429 — Effect Size: Cramér's V and Phi
- Large magnitude values
- (typically |residual| > 2 or 3): potential outliers or poorly-fit observations
- Lesson 701 — Deviance Residuals
- large sample sizes
- , even tiny, meaningless slopes become statistically significant.
- Lesson 529 — Practical vs Statistical SignificanceLesson 1386 — IQR Method vs Z-Score: When to Use Each
- large samples
- (n > 5000): Tests often reject normality due to trivial deviations that won't affect your downstream analyses.
- Lesson 209 — Sample Size Considerations in Normality TestsLesson 550 — Normality of Residuals
- Large Standard Errors
- Lesson 581 — Symptoms of Multicollinearity
- Large tables
- Retrieving unnecessary columns wastes bandwidth and memory
- Lesson 851 — Selecting All Columns with Asterisk
- Larger sample size
- → Narrower margin (more data gives better estimates)
- Lesson 294 — Margin of Error and Its Components
- larger sample sizes
- and is particularly sensitive to differences in the middle of the distribution.
- Lesson 206 — Kolmogorov-Smirnov TestLesson 630 — Bayesian Information Criterion (BIC)Lesson 1482 — Control and Treatment Design
- Last Non-Direct Click
- would credit: LinkedIn Ad
- Lesson 1722 — Last Non-Direct Click AttributionLesson 1723 — Comparing Single-Touch Models
- Last Non-Direct Click Attribution
- credits the last touchpoint in the customer journey *before* conversion, **excluding any direct traffic**.
- Lesson 1722 — Last Non-Direct Click Attribution
- Last touch matters
- Something convinced them to finally convert
- Lesson 1729 — Position-Based (U-Shaped) Attribution
- Last-touch
- Credit goes entirely to the final action before conversion
- Lesson 1637 — What is Metric Attribution?Lesson 1722 — Last Non-Direct Click AttributionLesson 1724 — Limitations of Single-Touch Attribution
- last-touch attribution
- .
- Lesson 1721 — Last-Touch Attribution ModelLesson 1723 — Comparing Single-Touch ModelsLesson 1725 — Implementing Single-Touch Attribution
- Latency matters
- Fraud detection must happen in seconds, not overnight
- Lesson 1788 — Streaming Data and Real-Time Requirements
- Latitude
- measures north-south position from the equator (0°) to the poles (±90°).
- Lesson 1308 — Geographic Data Types and Coordinate Systems
- Law of Total Probability
- lets you split the sample space into separate, non-overlapping scenarios (a partition), calculate the probability within each scenario, and then add them up to get your answer.
- Lesson 90 — The Law of Total ProbabilityLesson 97 — Law of Total Probability
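As a small worked example of the Law of Total Probability, suppose (hypothetically) two machines make all of a factory's parts: machine A makes 60% with a 1% defect rate and machine B makes 40% with a 5% defect rate. The function name and numbers below are assumptions for illustration.

```python
def total_probability(partition):
    """Sum P(scenario) * P(event | scenario) over a partition.

    partition: list of (P(scenario), P(event | scenario)) pairs
    """
    scenario_total = sum(p_scenario for p_scenario, _ in partition)
    if abs(scenario_total - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(p_scenario * p_event for p_scenario, p_event in partition)


# P(defect) = 0.6 * 0.01 + 0.4 * 0.05
p_defect = total_probability([(0.6, 0.01), (0.4, 0.05)])
print(round(p_defect, 3))
```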
- Lawful basis for processing
- You can't just collect data because it's convenient—you need explicit consent or a legitimate legal reason
- Lesson 1904 — What is GDPR and Why It Matters
- Lawfulness, Fairness, and Transparency
- Lesson 1905 — Core Principles of GDPR
- Layer in supporting evidence
- Once the headline lands, show the *why*—perhaps a single clear visualization with strong annotation (lessons 1960-1961).
- Lesson 1965 — Progressive Disclosure Techniques
- Lazy loading
- means only computing what's currently visible, deferring expensive operations until absolutely necessary.
- Lesson 1337 — Dashboard Performance and Caching
- LEAD
- window functions make this trivial by letting you "peek" at other rows directly.
- Lesson 1023 — Introduction to Window Functions: LAG and LEAD
- Lead conversion
- (30%) – The moment a prospect becomes a qualified lead (e.
- Lesson 1730 — W-Shaped Attribution Model
- Lead generation
- Form submission or demo request
- Lesson 1686 — Defining Conversions and Conversion Rate
- Lead with findings
- Put results before methodology when possible
- Lesson 1967 — Writing Clear and Concise Analysis Sections
- Leading
- Surrogate metrics (click-through rate, engagement time, sign-up rate)
- Lesson 1601 — Balancing Leading and Lagging Metrics
- leading indicator
- that correlates strongly with your actual business goal but can be measured sooner, more frequently, or with less noise.
- Lesson 1517 — Surrogate Metrics: When Direct Measurement is ImpracticalLesson 1604 — What is a North Star Metric?Lesson 1605 — Characteristics of Good North Star MetricsLesson 1628 — SaaS Metrics: MRR, ARR, and Logo ChurnLesson 1632 — Financial Services Metrics: AUM, NIM, and Credit Metrics
- Leading indicators
- are predictive, forward-looking metrics that signal *what is likely to happen in the future*.
- Lesson 1597 — What Are Leading and Lagging Indicators?
- Leading indicators of disengagement
- are behavioral signals that precede actual churn—like smoke before fire.
- Lesson 1700 — Leading Indicators of Disengagement
- Learn & Repeat
- Use insights to generate next hypothesis
- Lesson 1692 — Statistical Significance and Iteration
- Learning the structure
- Analyze the original data's distributions, correlations, and statistical properties
- Lesson 1901 — Synthetic Data Generation
- least squares criterion
- says: choose the line that minimizes the **sum of squared residuals**.
- Lesson 517 — The Least Squares CriterionLesson 518 — Deriving the Least Squares Estimators
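The least squares criterion can be illustrated for a single predictor with the closed-form estimators (slope = Sxy / Sxx, intercept = ȳ − slope · x̄). The function names and data points are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data points.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]
intercept, slope = fit_line(x, y)
print(intercept, slope, sum_squared_residuals(x, y, intercept, slope))
```

Any other line through the same points, for example one with a slightly different slope, produces a larger sum of squared residuals.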
- Left censoring
- occurs when you know an event *has already occurred* before your observation period began, but you don't know exactly when.
- Lesson 805 — Left and Interval Censoring
- LEFT JOIN
- Returns **all** rows from the left table, plus matching rows from the right (or NULL if no match)
- Lesson 928 — LEFT JOIN vs INNER JOIN: When to Use EachLesson 936 — FULL OUTER JOIN SyntaxLesson 946 — Self-Joins for Hierarchical Data
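The difference between LEFT JOIN and INNER JOIN shows up as soon as one row has no match. The sketch below uses Python's built-in `sqlite3` module with hypothetical `customers` and `orders` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0);
    """
)

# LEFT JOIN keeps every customer; unmatched order columns come back as NULL.
left_rows = conn.execute(
    """
    SELECT c.name, o.total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()

# INNER JOIN keeps only customers that have at least one order.
inner_rows = conn.execute(
    """
    SELECT c.name, o.total
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()
conn.close()

print(left_rows)
print(inner_rows)
```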
- left to right
- (though the optimizer may reorder them internally).
- Lesson 950 — Chaining Multiple JoinsLesson 952 — Mixing Join Types
- Left-skewed (negative skew)
- A long tail to the left; most values cluster high (e.
- Lesson 1175 — Histograms for Distribution Shape
- Legacy systems
- Some older SQL environments don't support CTEs
- Lesson 974 — When to Use FROM Subqueries vs CTEs
- Legal compliance
- – Are you following laws like GDPR, CCPA, or HIPAA that govern data use in different regions and industries?
- Lesson 36 — Responsible Data Sourcing and UseLesson 2062 — Why Data Source Documentation Matters
- Legal Obligation
- Lesson 1906 — Legal Bases for Processing Personal Data
- Legend interactivity
- to show/hide data series by clicking
- Lesson 1300 — Creating Basic Interactive Charts with Plotly Express
- Legends
- identify what different visual elements represent—especially crucial when you have multiple lines, colors, or groups.
- Lesson 1271 — Adding Legends, Annotations, and Text
- Legitimate Interests
- Lesson 1906 — Legal Bases for Processing Personal Data
- LENGTH
- measures how many characters are in a string
- Lesson 1044 — String Manipulation: CONCAT, LENGTH, and SUBSTRINGLesson 1232 — Perceptual Accuracy HierarchyLesson 1238 — Matching Encoding to Data Type
- Length of stay (LOS)
- tracks average days hospitalized.
- Lesson 1633 — Healthcare Metrics: Patient Outcomes and Operational Efficiency
- Lengthening Time-to-Return
- The gap between visits grows longer.
- Lesson 1700 — Leading Indicators of Disengagement
- Leptokurtic
- (kurtosis > 3 or excess kurtosis > 0): Heavy tails and a sharp peak.
- Lesson 66 — Kurtosis: Definition and Interpretation
- Less effective when
- Lesson 1727 — Linear Attribution Model
- Less SQL boilerplate
- You write Python code, not SQL strings
- Lesson 1117 — What is an ORM and Why Use It?
- Less typing
- You save keystrokes, reducing errors and speeding up query writing.
- Lesson 924 — Using Table Aliases in Joins
- Lesson 804
- , you learned about right censoring (when someone drops out before the event happens).
- Lesson 805 — Left and Interval Censoring
- Let supporting details orbit
- around these three points, but never introduce a fourth major message
- Lesson 1940 — The Rule of Three in Data Storytelling
- level
- of the series
- Lesson 740 — Choosing Between Differencing and DetrendingLesson 765 — Introduction to Holt-Winters MethodLesson 767 — Holt-Winters Additive ModelLesson 770 — Initializing Holt-Winters ComponentsLesson 771 — Forecasting with Holt-Winters
- Level equation
- The current baseline value, adjusted for trend
- Lesson 761 — Double Exponential Smoothing (Holt's Method)Lesson 767 — Holt-Winters Additive ModelLesson 768 — Holt-Winters Multiplicative Model
- Levene's test
- or the **F-test** (covered in lesson 363), or simply inspect side-by-side boxplots.
- Lesson 379 — The Assumption of Equal Variances (Homoscedasticity)Lesson 380 — Testing Equal Variances: Levene's and Bartlett's Tests
- Leverage
- refers to an observation's *position* in the predictor space—specifically, how far its X-value is from the mean of all X-values.
- Lesson 571 — What Are Leverage and Influence?Lesson 573 — Calculating and Interpreting Hat ValuesLesson 574 — Influence: Impact on Fitted ModelLesson 575 — Cook's Distance
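For simple linear regression with one predictor, leverage (the hat value) has a closed form: h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The sketch below applies it to a hypothetical set of X-values in which one point sits far from the rest.

```python
def hat_values(x):
    """Leverage of each observation in a one-predictor regression with intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Hypothetical predictor values; 10 is far from the mean of 4.
x_values = [1, 2, 3, 4, 10]
leverages = hat_values(x_values)
for xi, h in zip(x_values, leverages):
    print(xi, round(h, 2))
```

The hat values sum to the number of fitted parameters (2 here: intercept and slope), and the point farthest from the mean has by far the largest leverage.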
- Leverage associations
- Use culturally familiar color meanings: red for danger/stop/negative, green for go/positive, blue for neutral/calm.
- Lesson 1961 — Color as Communication Tool
- License
- The legal terms under which you can use and share the data.
- Lesson 2063 — Essential Metadata to Capture
- Lightweight artifacts
- Small CSVs of feature importance, confusion matrices, or performance metrics belong in version control
- Lesson 2034 — Committing Data Artifacts and Model Outputs
- Likelihood
- If someone has the disease, how likely is a positive test?
- Lesson 107 — Bayes' Theorem Formula and ComponentsLesson 682 — Maximum Likelihood Estimation in Logistic RegressionLesson 697 — Deviance: A Measure of Model FitLesson 1417 — Bayesian Change-Point DetectionLesson 1550 — What Are Conjugate Priors?Lesson 1566 — Conjugate Normal-Normal ModelLesson 1594 — PyMC: Probabilistic Programming in Python
- Likelihood P(Evidence | Guilty)
- probability of seeing this evidence if guilty
- Lesson 112 — Legal Evidence and Jury Reasoning
- Likelihood Ratio Test
- compares two **nested models**—where one model (the simpler one) is a special case of the other (the more complex one).
- Lesson 699 — The Likelihood Ratio TestLesson 791 — Comparing Nested and Non-Nested ModelsLesson 830 — Testing Coefficient Significance
- likelihood ratio test (LRT)
- compares two nested models by examining how well each explains the data.
- Lesson 628 — Likelihood Ratio TestsLesson 684 — Likelihood Ratio Tests for Model Comparison
- Likelihood: P(B|A)
- How probable the evidence B is *if* A is true
- Lesson 107 — Bayes' Theorem Formula and Components
- Limit CTE reuse
- If you reference a CTE many times, consider a temp table instead
- Lesson 997 — CTE Best Practices and Performance
- Limit result set size
- Filter rows in the outer query before applying expensive subqueries
- Lesson 969 — Performance Considerations for SELECT Subqueries
- Limit your palette
- Too many colors create cognitive overload—your audience spends mental energy decoding the legend instead of understanding your insight.
- Lesson 1961 — Color as Communication Tool
- Limitation
- It doesn't give you a true likelihood, so some model comparison tools (like AIC) won't work.
- Lesson 694 — Quasi-Poisson and Negative Binomial ModelsLesson 1226 — Stacked and Grouped Bar ChartsLesson 1744 — Incrementality vs AttributionLesson 1767 — Scale-Up vs Scale-Out Architectures
- Limitations and confidence levels
- honesty builds trust
- Lesson 2091 — Stage 7: Communication and Handoff
- Limitations and uncertainties
- Where might the model fail?
- Lesson 1917 — Transparency in Analysis and Models
- Limited control
- You can't influence outcomes that already crystallized
- Lesson 1617 — The Danger of Lagging-Only Metrics
- Limited flexibility
- Your prior beliefs must fit the conjugate family's shape, even if reality suggests otherwise
- Lesson 1555 — Advantages and Limitations of Conjugate Priors
- Limits
- the total number of concurrent connections to prevent overwhelming the database
- Lesson 1092 — Connection Pooling Basics
- Limits and breaks
- controlling what range displays and where tick marks appear
- Lesson 1344 — Scales and Coordinate Systems
- Line charts
- Showing trends over time (monthly revenue, daily user counts)
- Lesson 1959 — Choosing Familiar Chart Types
- Line style + Color
- Vary dashed, dotted, and solid lines in addition to color
- Lesson 1251 — Avoiding Reliance on Color Alone
- lineage
- information, so if a partition fails, it can rebuild just that piece—providing fault tolerance without constant replication overhead.
- Lesson 1774 — What is Apache Spark and Why Use It?Lesson 1871 — Why Version Control for Data?
- linear
- relationships.
- Lesson 476 — What is Pearson Correlation?Lesson 477 — Interpreting the Correlation CoefficientLesson 680 — The Logit Link Function and OddsLesson 1196 — Dimensionality Reduction for Visualization
- Linear decay
- Attribution drops steadily over time (e.
- Lesson 1639 — Time Windows and Attribution Decay
- Linear interpolation
- draws an imaginary line between the 7th and 8th values and picks the point halfway between them.
- Lesson 58 — Calculating Percentiles: Methods and Algorithms
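A sketch of percentile calculation by linear interpolation, assuming one common convention in which the p-th percentile sits at fractional position (n − 1) · p / 100 in the sorted data. Other conventions exist and give slightly different answers. The function name and data are illustrative.

```python
def percentile_linear(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100  # fractional, 0-based
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= len(ordered):
        return float(ordered[-1])
    # Move `fraction` of the way from the lower neighbor to the upper one.
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])


# With 14 hypothetical values, the median falls halfway between the 7th and 8th.
median = percentile_linear(range(1, 15), 50)
print(median)
```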
- Linear pattern
- Points should roughly follow a straight line, not a curve
- Lesson 480 — Scatterplots and Visual Assessment
- Linear scalability
- Add more nodes, get proportionally more capacity.
- Lesson 1771 — Shared-Nothing Architecture
- Linearity
- Lesson 546 — The Five Core Assumptions of Linear RegressionLesson 557 — The Residuals vs Fitted Values PlotLesson 601 — Assumptions for Multiple Linear Regression
- Linearity assumption
- Are patterns randomly scattered, or do residuals show curves that suggest a non-linear relationship?
- Lesson 544 — The Role of Residuals in DiagnosticsLesson 547 — Linearity: The Relationship Must Be LinearLesson 558 — Identifying Non-Linearity in Residual Plots
- Linestyle
- controls whether your line is solid, dashed, dotted, or dash-dotted.
- Lesson 1258 — Customizing Lines: Colors, Styles, and Markers
- link function
- transforms the expected value of your response variable so it can be modeled with a linear predictor.
- Lesson 671 — What is a Link Function?Lesson 672 — The Identity LinkLesson 690 — The Poisson Distribution as a GLM
- Link to business costs
- "Type I vs Type II errors" becomes "cost of investigating false alarms vs cost of missing real problems"
- Lesson 2105 — Translating Between Technical and Business Language
- Linked selections
- Selecting points in one plot highlights them in others
- Lesson 1304 — Subplots and Linked Interactions
- List required features
- What specific variables does your analysis need?
- Lesson 2098 — Identifying Data Availability Gaps Early
- Ljung-Box test
- is a formal hypothesis test that checks whether residuals show significant autocorrelation at multiple lags at once.
- Lesson 783 — Ljung-Box Test for Residual IndependenceLesson 799 — Fitting and Diagnosing SARIMA Models
- Load
- only clean, aggregated, ready-to-query data into the warehouse
- Lesson 1817 — Historical Context: Why ETL Came First
- Load balancing
- assigning work so no worker sits idle while others are overloaded
- Lesson 1769 — Task Parallelism and Work Distribution
- Load raw data
- into your cloud warehouse for most sources
- Lesson 1821 — Hybrid Approaches and Modern Data Stacks
- Loading
- raw data directly into staging tables in the warehouse
- Lesson 1816 — What is ELT? Extract, Load, Transform Explained
- Local control
- Changes in one region don't affect distant regions
- Lesson 662 — Polynomial Features vs Splines
- Locks
- Constraints can hold locks longer, blocking concurrent operations
- Lesson 1060 — Trade-offs: Performance vs Integrity
- log link
- solves this by connecting the linear predictor to the expected outcome through a logarithm.
- Lesson 675 — The Log LinkLesson 677 — Interpreting Coefficients Under Different LinksLesson 678 — Choosing the Right Link FunctionLesson 690 — The Poisson Distribution as a GLM
- Log transformation
- If log-transformed data looks normal, consider log-normal.
- Lesson 193 — Choosing Between Distributions in PracticeLesson 212 — Log TransformationsLesson 591 — When and Why to Transform Variables
- Log transformation of X
- If you modeled `Y = β₀ + β₁log(X)` with the natural log, then β₁ represents the change in Y when log(X) increases by one unit, that is, when X is *multiplied* by e ≈ 2.718; doubling X changes Y by β₁ × ln 2 ≈ 0.693 β₁.
- Lesson 594 — Interpreting Models After Transformation
- Log transformation of Y
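- If you modeled `log(Y) = β₀ + β₁X` with the natural log, each one-unit increase in X multiplies Y by e^β₁, which is approximately a 100 × β₁ percent change in Y when β₁ is small, so the coefficient describes a *proportional* change in Y.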
- Lesson 594 — Interpreting Models After Transformation
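The two log-transformation interpretations can be checked with a few lines of arithmetic. This sketch assumes natural logarithms; the function names and coefficient values are hypothetical.

```python
import math


def change_in_y_when_x_scales(beta1, factor):
    """For Y = b0 + b1*ln(X): change in Y when X is multiplied by `factor`."""
    return beta1 * math.log(factor)


def percent_change_in_y_per_unit_x(beta1):
    """For ln(Y) = b0 + b1*X: percent change in Y per one-unit increase in X."""
    return (math.exp(beta1) - 1) * 100


# Hypothetical coefficients.
print(round(change_in_y_when_x_scales(3.0, 2), 3))      # effect of doubling X
print(round(percent_change_in_y_per_unit_x(0.05), 3))   # close to 5 percent
```

For small coefficients the exact percent change is close to 100 × β₁, which is why a coefficient of 0.05 is often read as "about 5%".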
- Log-normal
- suits variables that are products of many small multiplicative factors—like incomes, stock prices, or city sizes.
- Lesson 193 — Choosing Between Distributions in Practice
- log-odds
- Lesson 673 — The Logit LinkLesson 677 — Interpreting Coefficients Under Different LinksLesson 680 — The Logit Link Function and OddsLesson 681 — Interpreting Logistic Regression CoefficientsLesson 686 — Assumptions and Diagnostics in Logistic Regression
- log-rank test
- is the most common statistical test for answering this question.
- Lesson 818 — What is the Log-Rank Test?Lesson 823 — Log-Rank Test vs Other TestsLesson 836 — Employee Turnover and Retention Analysis
- Log-rank tests
- to compare response timing across different campaign variants
- Lesson 841 — Campaign Response Time Analysis
- Logical consistency
- Lesson 1211 — Domain Validation and Sanity Checks
- Logically Connected
- Every recommendation must flow directly from your analysis.
- Lesson 1970 — Recommendations and Next Steps
- Logistic regression
- is designed specifically for binary outcomes.
- Lesson 679 — Logistic Regression Setup and the Binary ResponseLesson 1447 — Propensity Score: Concept and EstimationLesson 1674 — Churn Prediction Models
- logit
- link)
- Lesson 671 — What is a Link Function?Lesson 674 — The Probit LinkLesson 678 — Choosing the Right Link Function
- logit link
- bridges this gap.
- Lesson 673 — The Logit LinkLesson 674 — The Probit LinkLesson 677 — Interpreting Coefficients Under Different Links
- logit link function
- you just learned transforms these probabilities into a scale where linear modeling works, then transforms back to give valid probabilities.
- Lesson 679 — Logistic Regression Setup and the Binary ResponseLesson 680 — The Logit Link Function and Odds
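The round trip described in the logit entries (probability to a linear scale and back) might look like the sketch below. The coefficients `beta0` and `beta1` and the input `x` are hypothetical values, not fitted results.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log-odds on the whole real line."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Map a linear predictor (log-odds) back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -2.0, 0.8
x = 3.0

eta = beta0 + beta1 * x   # linear predictor, on the log-odds scale
p = inv_logit(eta)        # back-transformed, always a valid probability

print(f"log-odds = {eta:.2f}, probability = {p:.3f}")
```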
- Logo churn
- counts *how many customers* cancel (e.
- Lesson 1628 — SaaS Metrics: MRR, ARR, and Logo Churn
- Long flat sections
- Periods with no events or only censored observations
- Lesson 815 — Survival Curve Plots and Interpretation
- Long format
- (the tidy version) stacks these observations vertically, using one column for the variable name (`month`) and another for its value (`sales`).
- Lesson 1144 — Common Violations: Wide vs Long Format
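As a rough illustration of the wide-to-long reshaping described above, here is a plain-Python sketch. The `store` identifier column and the sample numbers are assumptions made for the example; only the `month` and `sales` column names come from the entry.

```python
# Wide format: one column per month (sample data is hypothetical).
wide_rows = [
    {"store": "A", "jan": 100, "feb": 120},
    {"store": "B", "jan": 90, "feb": 95},
]

def to_long(rows, id_col, var_name, value_name):
    """Stack every non-identifier column into (name, value) pairs."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue
            long_rows.append(
                {id_col: row[id_col], var_name: column, value_name: value}
            )
    return long_rows

long_rows = to_long(wide_rows, id_col="store", var_name="month", value_name="sales")
for row in long_rows:
    print(row)
```

In practice a data-frame library would usually do this reshaping; the loop is spelled out here only to show what the operation does.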
- Long-term business viability
- A 2% floor vs 20% changes unit economics dramatically
- Lesson 1658 — Flattening and Asymptotic Behavior
- Longer-term fluctuations
- tied to economic or business cycles, but *without* a fixed period.
- Lesson 705 — The Four Classical Components
- Longitude
- measures east-west position from the Prime Meridian (0°) through ±180°.
- Lesson 1308 — Geographic Data Types and Coordinate Systems
- LOO
- (Leave-One-Out cross-validation) to compare them:
- Lesson 1596 — Posterior Predictive Checks and Model Comparison
- Look for
- Lesson 701 — Deviance Residuals
- Look for confounders
- What third factor might drive both metrics?
- Lesson 1615 — Correlation Without Causation
- Look for subgroup patterns
- Split your data by the suspected confounder—does the treatment-outcome relationship change or reverse?
- Lesson 1429 — Identifying Confounders in Practice
- Losing credibility
- Decision-makers can't tell where facts end and opinions begin
- Lesson 1927 — Separating Analysis from Advocacy
- Love plots
- (or balance plots) display SMDs before and after matching, making it easy to see which covariates improved and which remain problematic.
- Lesson 1450 — Assessing Balance After Matching
- Low baseline rate example
- If your current conversion is 2%, improving to 3% (a 50% relative lift!
- Lesson 1499 — Adjusting for Baseline Conversion Rates
- Low p-value (< α)
- Your observed frequencies differ significantly from expected frequencies.
- Lesson 420 — Interpreting Chi-Squared Test Results
- Low statistical power
- – You might miss real effects (false negatives).
- Lesson 1493 — Why Sample Size Matters in A/B Tests
- Lower alpha (0.01)
- Like a less-sensitive detector.
- Lesson 334 — Setting Alpha: Choosing Your Significance Level
- Lower confidence
- to 95% (slightly riskier, smaller sample needed)
- Lesson 295 — Trade-offs: Precision, Confidence, and Cost
- Lower confidence (e.g., 90%)
- Narrower intervals, but less reliable method
- Lesson 267 — Interpreting Confidence Levels
- Lower Control Limit (LCL)
- Typically 3 standard deviations below the mean
- Lesson 1396 — Introduction to Control ChartsLesson 1397 — Shewhart Control Chart BasicsLesson 1398 — Control Charts for Means (X-bar Charts)
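A simplified sketch of the "3 standard deviations" rule follows. It assumes the limits are estimated from a baseline sample using the ordinary sample standard deviation; real control charts often estimate sigma from subgroup ranges or moving ranges instead. The function name and data are hypothetical.

```python
import statistics

def shewhart_limits(baseline, k=3.0):
    """Return (LCL, center line, UCL) as mean -/+ k sample standard deviations."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return center - k * sigma, center, center + k * sigma

# Hypothetical in-control measurements.
baseline = [10, 12, 11, 9, 13]
lcl, center, ucl = shewhart_limits(baseline)

new_value = 5.0
print(f"LCL={lcl:.2f}, center={center:.2f}, UCL={ucl:.2f}")
print("out of control" if not lcl <= new_value <= ucl else "in control")
```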
- Lower fence
- = Q1 - (1.
- Lesson 72 — IQR Method and Tukey's FencesLesson 1385 — Calculating IQR Fences in Practice
- Lower is better
- The model with the smallest AIC or BIC is preferred
- Lesson 781 — Information Criteria: AIC and BIC
- Lower peak
- The center is slightly flatter than a normal curve
- Lesson 352 — The t-Distribution and Degrees of Freedom
- Lower threshold (A)
- Based on acceptable Type II error (β, false negative rate)
- Lesson 1511 — Sequential Probability Ratio Test (SPRT)
- Lower values are better
- they indicate a superior balance of fit and simplicity.
- Lesson 785 — Information Criteria: AIC and BIC
- Lower variance/standard deviation
- = outcomes cluster tightly around the expected value, more predictable
- Lesson 148 — Variance and Standard Deviation of Discrete Distributions
- Lower λ
- means events happen less frequently → longer waiting times
- Lesson 164 — The Exponential Distribution
- LTV:CAC ratio
- divides lifetime value by customer acquisition cost (CAC) to reveal whether you're spending wisely.
- Lesson 1667 — LTV:CAC Ratio and ProfitabilityLesson 1669 — LTV Segmentation and TargetingLesson 1756 — LTV:CAC Ratio as a Health Metric
- Luminance
- (or lightness/value) is how bright or dark the color appears, from near-black to near-white.
- Lesson 1234 — Color: Hue, Saturation, and Luminance
- lurking variable
- or **hidden confounder**.
- Lesson 497 — The Third Variable ProblemLesson 1423 — The Third Variable Problem
M
- M_X(t) = E[e^(tX)]
- Lesson 150 — Moment Generating Functions
- MA process
- ACF cuts off sharply; PACF decays gradually
- Lesson 731 — PACF for AR Process Identification
- MA(1)
- Uses only the most recent error
- Lesson 775 — Moving Average (MA) ModelsLesson 777 — Identifying MA Order (q) Using ACF
- MA(2)
- Uses the two most recent errors
- Lesson 775 — Moving Average (MA) ModelsLesson 777 — Identifying MA Order (q) Using ACF
- MA(q)
- process shows **gradual exponential decay** or a damped sinusoidal pattern in the PACF—no clean cutoff.
- Lesson 732 — PACF Patterns for Common ModelsLesson 775 — Moving Average (MA) ModelsLesson 777 — Identifying MA Order (q) Using ACF
- Machine failures
- For certain systems, past survival time doesn't reduce future failure risk
- Lesson 167 — Memoryless Property of Exponential
- Machine Learning (ML)
- These are techniques that let computers find patterns and make predictions automatically.
- Lesson 7 — The Data Science Skill Stack
- Machine Learning Feature Scaling
- Many algorithms (like k-nearest neighbors or neural networks) perform better when features are standardized to similar ranges.
- Lesson 201 — Z-Score Applications and Limitations
- Machine learning methods
- (Isolation Forest, autoencoders) for complex multivariate patterns
- Lesson 1411 — Applications and Limitations
- MAD
- (Mean Absolute Deviation) is useful when you want interpretability similar to standard deviation but with less sensitivity to outliers.
- Lesson 54 — When to Use Each Measure
- MAD (Median Absolute Deviation)
- instead of standard deviation (a robust measure of spread you learned earlier)
- Lesson 73 — Modified Z-Score Using MAD
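One common form of the MAD-based modified z-score is sketched below. The scaling constant 0.6745 and the 3.5 cutoff follow the widely used Iglewicz and Hoaglin convention; they are assumptions here, since the entry itself does not state them.

```python
import statistics

def modified_z_scores(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical data with one obvious outlier.
values = [10, 11, 12, 11, 10, 50]
scores = modified_z_scores(values)
outliers = [v for v, z in zip(values, scores) if abs(z) > 3.5]
print(outliers)
```

Because both the center (median) and the spread (MAD) ignore extreme values, the outlier cannot inflate its own yardstick the way it does with the mean and standard deviation.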
- MAE (Mean Absolute Error)
- Average of absolute differences; easy to interpret in original units
- Lesson 790 — Out-of-Sample Forecast Evaluation
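For a concrete feel, MAE and its squared-error counterpart MSE can be computed as follows. The actual and forecast values are made up for the example.

```python
def mae(actual, forecast):
    """Mean Absolute Error: average of |actual - forecast|, in original units."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mse(actual, forecast):
    """Mean Squared Error: average of squared errors, penalizing large misses."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical held-out observations and forecasts.
actual = [100, 110, 120]
forecast = [98, 113, 120]

print(f"MAE = {mae(actual, forecast):.3f}")
print(f"MSE = {mse(actual, forecast):.3f}")
```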
- magnitude
- of the difference between proportions.
- Lesson 413 — Effect Size and Practical SignificanceLesson 637 — Interpreting Dummy Variable Coefficients
- Mahalanobis distance
- measures how far a point is from the center of a multivariate distribution, accounting for correlations between variables.
- Lesson 74 — Multivariate Outlier DetectionLesson 1381 — Multivariate Z-Score Methods
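The effect of "accounting for correlations" can be seen in a two-variable sketch. It is restricted to two dimensions so the covariance matrix can be inverted by hand; the function name and data are hypothetical.

```python
import math

def mahalanobis_2d(point, data):
    """Mahalanobis distance of `point` from the center of 2-D `data`."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Sample variances and covariance.
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] S^-1 [dx dy]^T, with S^-1 = (1/det) [[syy, -sxy], [-sxy, sxx]]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical, strongly positively correlated data centered at (3, 3).
data = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8)]

along = mahalanobis_2d((4, 4), data)    # follows the correlation
against = mahalanobis_2d((4, 2), data)  # same Euclidean distance, breaks it
print(f"along trend: {along:.2f}, against trend: {against:.2f}")
```

Both test points are equally far from the center in Euclidean terms, yet the one that contradicts the correlation gets a much larger distance, which is why this measure is used for multivariate outlier detection.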
- Main branch (`main`)
- Your production-ready, validated code.
- Lesson 2035 — Branching Strategies for Experiments
- Main effect of degree
- the intercept difference between groups
- Lesson 652 — Interpreting Categorical × Continuous Interactions
- Main effect of experience
- the baseline slope (for the reference group, no degree)
- Lesson 652 — Interpreting Categorical × Continuous Interactions
- Main effects
- test whether each factor matters *on its own*, averaging across all levels of the other factor.
- Lesson 464 — Main Effects in Two-Way ANOVALesson 465 — Interaction EffectsLesson 1689 — Multivariate Testing and Personalization
- Main ingredients (geom layers)
- Points, lines, bars added with `geom_*()` functions
- Lesson 1347 — Understanding Layers in ggplot2
- Maintain integrity
- by being transparent about what you can and cannot deliver
- Lesson 34 — Recognizing Boundaries of Competence
- Maintain Specification Files
- Lesson 2046 — Best Practices for Environment Management in Teams
- Maintainability
- Adding or reordering parameters won't break your code
- Lesson 1106 — Parameter Placeholders: Named Parameters
- Maintenance burden
- Outdated code may break when dependencies change, triggering false alarms
- Lesson 2135 — Dead Experimental Code and Feature Sprawl
- Maintenance scheduling
- When k > 1, rising hazard rates signal when preventive maintenance is cost-effective
- Lesson 188 — Weibull Distribution: Hazard Function and Reliability
- Make a Decision
- Lesson 447 — Conducting One-Way ANOVA in Practice
- Make better decisions
- React to actual changes, not predictable cycles
- Lesson 748 — Seasonally Adjusted Data
- Make decisions
- Knowing the typical outcome helps guide business choices and predictions
- Lesson 38 — What is Central Tendency?Lesson 2121 — Timeboxing and Deadlines
- Make probabilistic statements
- Calculate P(μ₁ > μ₂ | data) or credible intervals for δ
- Lesson 1570 — Comparing Two Means: Bayesian Approach
- Makes everything positive
- – no more negative values, so you're only looking at spread magnitude
- Lesson 560 — Scale-Location Plot (Spread-Location Plot)
- Making findings accessible
- Translate technical metrics into business language.
- Lesson 2090 — Stage 6: Interpretation and Insight Generation
- Making multiplicative relationships additive
- – Easier to model and interpret
- Lesson 212 — Log Transformations
- Manager
- Lead a team (4-8 people), conduct 1-on-1s, handle performance reviews, remove blockers
- Lesson 2140 — Individual Contributor vs Management Tracks
- Manages memory
- by processing chunks sequentially when needed
- Lesson 1790 — What is Dask and When to Use It
- Managing execution order
- through task graphs or workflow engines
- Lesson 1769 — Task Parallelism and Work Distribution
- Mann-Whitney U test
- (also called the Wilcoxon Rank-Sum test) offers a robust alternative to the two-sample t-test.
- Lesson 393 — Mann-Whitney U Test (Wilcoxon Rank-Sum)
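The U statistic itself can be computed by counting pairwise wins, as in the sketch below. This covers only the statistic, not the p-value, and the sample values are hypothetical.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b): pairwise wins for each sample, ties counting 0.5."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b

# Hypothetical samples with no overlap at all.
group_a = [1, 2, 3]
group_b = [4, 5, 6]
u_a, u_b = mann_whitney_u(group_a, group_b)
print(u_a, u_b)
```

Because only the ordering of values matters, replacing 6 with 6000 leaves the result unchanged, which is the source of the test's robustness to outliers.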
- Manual line-by-line parsing
- Lesson 1141 — Recovering from Corrupted or Partially Broken Data
- Manual transformations
- (typical Python workflow):
- Lesson 1373 — Statistical Transformations: Built-in vs Manual
- Manually edit
- Choose which changes to keep, or combine them, then remove the conflict markers
- Lesson 2018 — Resolving Conflicts During Rebase
- Manufacturing
- Predicting when machines need maintenance before they break
- Lesson 6 — Common Data Science ApplicationsLesson 1412 — What is Change-Point Detection?
- Manufacturing Defects
- A production line averages 2 defects per 1,000 units.
- Lesson 144 — Poisson Applications: Arrivals and Events
- Map
- and **Reduce**.
- Lesson 1770 — The MapReduce Programming ModelLesson 1827 — Transformation Patterns: Map, Filter, Aggregate
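A single-machine sketch of the two phases, using a word count as the example task, is shown below. It only mimics the programming model with Python built-ins; a real MapReduce system distributes these steps across workers. The documents are made up.

```python
from collections import Counter
from functools import reduce

# Hypothetical input split into independent pieces.
documents = ["spark map reduce", "map map filter"]

# Map: turn each document into its own partial word count.
mapped = map(lambda doc: Counter(doc.split()), documents)

# Reduce: merge the partial counts into one total.
totals = reduce(lambda left, right: left + right, mapped, Counter())

print(dict(totals))
```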
- Map common path variations
- – use path analysis or Sankey diagrams to visualize popular alternate routes
- Lesson 1683 — Multi-Path and Non-Linear Funnels
- Map projections
- are mathematical transformations that flatten the globe onto a plane—like peeling an orange and trying to lay the peel flat.
- Lesson 1308 — Geographic Data Types and Coordinate Systems
- Map secondary applications
- What could someone build *on top of* your work that causes harm?
- Lesson 1924 — Red Team Thinking for Data Scientists
- MAR (Missing at Random)
- Missingness relates to *observed* data (e.
- Lesson 1207 — Missing Data Assessment and Strategy
- marginal effect
- the change in your outcome variable when that predictor increases by one unit, *while holding all other predictors constant*.
- Lesson 604 — Marginal Effects and Ceteris ParibusLesson 659 — Interpreting Polynomial Regression Coefficients
- marginal likelihood
- ) is the denominator that normalizes the posterior distribution.
- Lesson 1536 — The Evidence (Marginal Likelihood)Lesson 1546 — The Role of the Normalizing Constant
- Markdown text
- Plain text with simple formatting (headers, lists, bold, italics)
- Lesson 1983 — R Markdown for Dynamic Reports
- Marker size reduction
- Smaller points reduce overlap
- Lesson 1310 — Point Maps and Scatter Plots on Maps
- Marker styles
- change the shape of each point (circles, squares, triangles, etc.
- Lesson 1265 — Scatter Plots: Relationships Between Variables
- Markers
- are symbols that appear at each data point along your line.
- Lesson 1258 — Customizing Lines: Colors, Styles, and MarkersLesson 1272 — Colors, Markers, and Line Styles
- Market or product changes
- Launching in new markets, adding product lines, or facing competitive threats requires rethinking which metrics matter most and how they interconnect.
- Lesson 1626 — Maintaining and Evolving Metric Trees
- Marketing
- Identifying which customers are likely to cancel subscriptions
- Lesson 6 — Common Data Science Applications
- Marketing effectiveness
- Compare cohorts from different channels
- Lesson 1644 — What is Cohort Analysis?
- Marketing expenses
- ad spend across all channels, content creation, marketing tools and software
- Lesson 1753 — Customer Acquisition Cost (CAC): Components and Calculation
- Marketplace/Transactional
- Lesson 1657 — Day-1, Day-7, Day-30 Benchmarks
- Marketplaces
- More sellers (treatment) improve selection for all buyers
- Lesson 1527 — Ignoring Network Effects
- Markov chain
- is a sequence of states where the next state depends *only* on the current state, not on how you got there.
- Lesson 1589 — Markov Chains: The Foundation of MCMCLesson 1733 — Markov Chain Attribution Models
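The "depends only on the current state" property can be sketched with a small simulation. The weather states and transition probabilities are hypothetical.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[current][next].
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state, rng):
    """Draw the next state using only the current state."""
    options = TRANSITIONS[state]
    return rng.choices(list(options), weights=list(options.values()))[0]

def simulate(start, n_steps, rng):
    """Return the visited states, starting from `start`."""
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path

path = simulate("sunny", 10, random.Random(0))
print(path)
```

Note that `step` never looks at earlier entries of `path`; that restriction is exactly what makes the sequence a Markov chain.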
- Mask volatility
- Show a smooth period while hiding the chaos before and after
- Lesson 1241 — Cherry-Picking Time Ranges
- Massively scalable compute engines
- (BigQuery, Snowflake, Redshift): These could query petabytes of data in seconds, separating storage from compute power
- Lesson 1818 — The Rise of ELT: Cloud Storage and Compute
- Match
- Apply exact matching on these coarsened bins—much easier now!
- Lesson 1449 — Coarsened Exact Matching (CEM)
- Match regions
- Pair test and control geographies with similar historical sales, demographics, and seasonality
- Lesson 1746 — Geo-Lift Experiments
- Match the question
- If you care about *means*, try parametric first.
- Lesson 398 — Choosing Between Parametric and Non-Parametric Tests
- Match the Response Distribution
- Lesson 678 — Choosing the Right Link Function
- Match user intent
- Steps should reflect meaningful progress, not just technical page loads.
- Lesson 1679 — Defining Funnel Steps and Events
- Matched Pairs
- Lesson 369 — When to Use a Paired t-TestLesson 435 — McNemar's Test: Paired Categorical Data
- Matched rows
- Both sides have real values (no NULLs in key columns)
- Lesson 937 — Identifying Matched vs Unmatched Rows
- Mathematical convenience
- Some priors (called *conjugate priors*) make calculations simpler
- Lesson 1534 — The Prior Distribution
- Mathematical dependencies
- Subtotals should equal their parts.
- Lesson 1155 — Consistency Checks Across Fields
- Mathematical elegance
- Squared terms have smooth derivatives, making it possible to solve for the optimal slope and intercept using calculus.
- Lesson 517 — The Least Squares Criterion
- Mathematically elegant
- If your prior is Beta(α, β) and you observe `k` successes in `n` trials (Binomial likelihood), your posterior is simply Beta(α + k, β + n - k).
- Lesson 1551 — Beta-Binomial Conjugacy
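The update rule above amounts to two additions. A minimal sketch follows, with a hypothetical Beta(2, 2) prior and made-up data.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n Binomial trials."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    return alpha + k, beta + (n - k)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical prior and data: Beta(2, 2) prior, 7 successes in 10 trials.
post_alpha, post_beta = beta_binomial_update(2, 2, k=7, n=10)
print(post_alpha, post_beta, round(beta_mean(post_alpha, post_beta), 3))
```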
- Mathematics & Statistics
- Lesson 1 — Defining Data Science
- Matplotlib
- historically had a more technical, MATLAB-inspired look with white backgrounds and primary colors (blue, orange, green).
- Lesson 1371 — Default Aesthetics and Design ChoicesLesson 1373 — Statistical Transformations: Built-in vs Manual
- Matplotlib subplots
- Best when you need completely different plot types, custom layouts, or fine-grained control over individual panels
- Lesson 1372 — Faceting: ggplot2 vs Seaborn and Matplotlib Subplots
- Matplotlib's Object-Oriented Interface
- treats plotting like building with objects.
- Lesson 1370 — Syntax Philosophy: Grammar of Graphics vs Object-Oriented
- Matplotlib's subplots
- , however, are more imperative and manual.
- Lesson 1372 — Faceting: ggplot2 vs Seaborn and Matplotlib Subplots
- Matrix plots
- Visualize data tables (heatmaps, cluster maps)
- Lesson 1281 — Introduction to Seaborn's Statistical Plots
- Mature businesses
- optimize for efficiency—targeting shorter payback (under 12 months), higher ROAS (3x+), and stable CAC.
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- MAX
- together with **GROUP BY** to create rich summaries of grouped data.
- Lesson 892 — GROUP BY with Different Aggregate FunctionsLesson 894 — NULL Values in GROUP BY
- Maximize ROAS
- You'll likely reduce spend, raise CAC (fewer efficient channels), and shorten payback (only safest bets)
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- Maximizing external validity
- You recruit diverse students from multiple schools, allow natural variation in implementation, and study real-world conditions.
- Lesson 1441 — Internal vs External Validity
- Maximizing internal validity
- You recruit a highly homogeneous group of students, control every aspect of the environment, use strict protocols, and carefully monitor compliance.
- Lesson 1441 — Internal vs External Validity
- maximum likelihood estimation
- , which:
- Lesson 628 — Likelihood Ratio TestsLesson 780 — ARIMA Model Estimation
- Maximum test
- Only checks if the *largest* value is an outlier
- Lesson 1393 — Two-Sided vs One-Sided Grubbs' Test
- McFadden's R²
- Lesson 702 — Pseudo R-Squared Measures
- McNemar's Test
- is specifically designed for situations where:
- Lesson 435 — McNemar's Test: Paired Categorical DataLesson 437 — Applications: Clinical Trials and Market Research
- MDE
- = minimum detectable effect (absolute difference in means)
- Lesson 1498 — Sample Size Formulas for Continuous Metrics
- mean
- (or arithmetic average) is a single number that represents the "center" of a dataset by distributing the total value equally across all observations.
- Lesson 39 — The Mean (Arithmetic Average)Lesson 40 — The Median: Middle ValueLesson 42 — Comparing Mean, Median, and ModeLesson 52 — Mean Absolute Deviation (MAD)Lesson 141 — Mean and Variance of Poisson DistributionLesson 147 — Expected Value of Discrete Random VariablesLesson 163 — Uniform Distribution: Mean and VarianceLesson 174 — Symmetry and the Mode, Median, Mean (+5 more)
- Mean (Expected Value)
- E(X) = *np*
- Lesson 129 — Binomial Mean and VarianceLesson 161 — The Continuous Uniform Distribution
- Mean = 1/λ
- The average waiting time is simply the inverse of the rate
- Lesson 166 — Exponential Distribution: Mean and Variance
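A quick simulation can confirm the relationship between the rate and the average waiting time. The rate `lam = 2.0`, the seed, and the sample size are arbitrary choices for the sketch.

```python
import random

rng = random.Random(42)
lam = 2.0  # hypothetical rate: 2 events per unit of time

# Draw many exponential waiting times and average them.
waits = [rng.expovariate(lam) for _ in range(100_000)]
sample_mean = sum(waits) / len(waits)
theoretical_mean = 1.0 / lam

print(f"sample mean = {sample_mean:.4f}, 1/lambda = {theoretical_mean:.4f}")
```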
- Mean Absolute Error (MAE)
- on historical data.
- Lesson 759 — Choosing the Smoothing Parameter αLesson 763 — Evaluating Exponential Smoothing Models
- Mean Square Within (MSW)
- Variance *within* groups (also called Mean Square Error, MSE)
- Lesson 443 — Mean Squares and the F-Ratio
- Mean Squared Error (MSE)
- or **Mean Absolute Error (MAE)** on historical data.
- Lesson 759 — Choosing the Smoothing Parameter αLesson 763 — Evaluating Exponential Smoothing Models
- mean squares
- (variance estimates) by dividing sum of squares by their respective df.
- Lesson 442 — Degrees of Freedom in ANOVALesson 443 — Mean Squares and the F-Ratio
- Mean-variance relationship
- For Poisson, `Var(Y) = μ` (variance equals the mean)
- Lesson 690 — The Poisson Distribution as a GLM
- Measurable
- What metrics define success?
- Lesson 1166 — Defining the Business QuestionLesson 1478 — Defining Success MetricsLesson 1605 — Characteristics of Good North Star MetricsLesson 2094 — Defining Success Metrics Upfront
- Measurable Success Criteria
- "We need 70% accuracy" or "Reduce customer churn by 15%"
- Lesson 10 — Problem Definition and Scoping
- Measure impact
- Did that feature launch in March improve Day-30 retention for the March cohort compared to February?
- Lesson 1659 — Comparing Retention Across Cohorts
- Measure lift
- Compare actual outcomes in test regions vs predicted outcomes (based on control region trends)
- Lesson 1746 — Geo-Lift Experiments
- Measure outcomes
- Track your KPI for both groups over the same time window
- Lesson 1641 — Isolating Effects with Control Groups
- Measure the outcome
- (purchases, sign-ups, visits) for both groups
- Lesson 1747 — Ghost Ads and PSA Tests
- Measurement
- Salary data that systematically underreports gig economy earnings
- Lesson 1878 — What is Bias in Data?
- Measurement bias
- occurs when your data collection instruments, procedures, or definitions consistently produce inaccurate values.
- Lesson 1880 — Measurement and Label Bias
- measurement error
- ?
- Lesson 589 — Deciding Whether to Remove OutliersLesson 1464 — Instrumental Variables: The Endogeneity Problem
- Measurement error in X
- Recording mistakes correlate with unpredictable noise
- Lesson 553 — Exogeneity: X Must Be Independent of Errors
- Measures customer value directly
- – It reflects how much value customers extract from your product
- Lesson 1604 — What is a North Star Metric?
- Medcouple-based detection
- measures skewness more robustly than traditional methods
- Lesson 1388 — Limitations and Alternatives to IQR Detection
- Media Mix Modeling (MMM)
- and **attribution modeling** help marketers understand marketing effectiveness, they examine the problem from fundamentally different angles—like comparing satellite imagery to street-level photography.
- Lesson 1736 — MMM vs Attribution: Key Differences
- median
- income is $35,000 (the middle value—much more representative of the typical person)
- Lesson 40 — The Median: Middle ValueLesson 42 — Comparing Mean, Median, and ModeLesson 56 — Understanding Percentiles and Their InterpretationLesson 73 — Modified Z-Score Using MADLesson 174 — Symmetry and the Mode, Median, MeanLesson 306 — Bootstrap for Non-Standard ProblemsLesson 1173 — Numerical Variable Summary StatisticsLesson 1380 — Modified Z-Score Using Median
- Median Absolute Deviation (MAD)
- (a robust measure of spread)
- Lesson 1380 — Modified Z-Score Using Median
- Median survival times
- for each group (from lesson 816)
- Lesson 817 — Comparing Multiple Survival Curves
- Median time-to-conversion
- How long until half your prospects convert?
- Lesson 839 — Time-to-Conversion in Marketing Funnels
- mediator
- sits *on* the causal path between treatment and outcome.
- Lesson 1471 — Mediators and CollidersLesson 1476 — Common DAG Patterns and Pitfalls
- Medical datasets
- excluding or underrepresenting certain populations
- Lesson 1881 — Historical and Societal Bias
- Medical measurements
- (comparing height, weight, and blood pressure on the same scale)
- Lesson 200 — Comparing Values Across Different Distributions
- Medical trials
- that only track patients who completed treatment miss those who got too sick to continue
- Lesson 247 — Survivorship Bias
- Medium effect
- d ≈ 0.
- Lesson 385 — Cohen's d for Standardized Mean DifferencesLesson 386 — Effect Size Interpretation GuidelinesLesson 429 — Effect Size: Cramér's V and Phi
- Medium-risk
- Automated email campaigns highlighting underused features that correlate with retention
- Lesson 1676 — Win-Back and Retention Strategies
- Meetups and conferences
- offer face-to-face learning and relationship building.
- Lesson 2144 — Networking and Community Engagement
- Memory efficiency
- Parquet/Feather > CSV > JSON > Excel
- Lesson 1133 — Performance Considerations Across FormatsLesson 1802 — Filtering During Read with dtype and Converters
- memoryless property
- when r = 1, but only the geometric distribution is truly memoryless in the classical sense.
- Lesson 137 — Geometric vs Negative Binomial: Key DifferencesLesson 167 — Memoryless Property of Exponential
- Mental Model
- Humans naturally think "group by region, then by product" differently than "group by product, then by region"
- Lesson 906 — Order Matters: Column Sequence in GROUP BY
- Mercator
- Preserves angles (useful for navigation) but distorts area dramatically near poles
- Lesson 1308 — Geographic Data Types and Coordinate Systems
- Merge
- creates a new commit that combines two branches, preserving the complete history of both branches.
- Lesson 2014 — Understanding Git Rebase vs MergeLesson 2026 — Merge Strategies: Merge vs Squash vs Rebase
- merge conflict
- (covered in upcoming lessons).
- Lesson 2009 — Three-Way MergesLesson 2010 — Merge Conflicts: What They AreLesson 2017 — Understanding Merge Conflicts
- Merge joins
- are efficient for pre-sorted data or when the database can sort cheaply, advancing through both tables in lockstep.
- Lesson 957 — Join Strategies: Nested Loop, Hash, Merge
- Mesokurtic
- (kurtosis ≈ 3 or excess kurtosis ≈ 0): Matches the normal distribution.
- Lesson 66 — Kurtosis: Definition and Interpretation
- Message passing coordination
- Nodes use network protocols to exchange data.
- Lesson 1771 — Shared-Nothing Architecture
- Messaging apps
- Features that change how one person messages affect the recipient
- Lesson 1527 — Ignoring Network Effects
- Metadata
- is "data about data"—the descriptive information that explains what each piece of data means.
- Lesson 23 — Data Provenance and MetadataLesson 1163 — Metadata and Data DictionariesLesson 1871 — Why Version Control for Data?
- Methodological transparency
- Show your process, not just results
- Lesson 2141 — Building a Portfolio and Personal Brand
- Methodology
- (brief): High-level approach without technical minutiae
- Lesson 1966 — Report Structure and Executive Summary
- Metric attribution
- is the process of assigning credit to the specific drivers that caused a metric to move.
- Lesson 1637 — What is Metric Attribution?
- Metric Stability
- Your success metrics shouldn't show statistically significant differences between identical groups
- Lesson 1483 — Pre-Experiment Validation
- metric tree
- (also called a "metric hierarchy" or "decomposition tree") is a visual framework that breaks a single top-level metric—often your North Star Metric—into the sub-metrics that mathematically drive it.
- Lesson 1621 — Metric Trees: Structure and PurposeLesson 1632 — Financial Services Metrics: AUM, NIM, and Credit Metrics
- Metrics/Measures
- Numerical values you aggregate (sum, average, count)
- Lesson 1808 — Star Schema and Fact Tables
- Mid-period acquisitions
- Should new customers acquired *during* the period count in the denominator?
- Lesson 1671 — Churn Rate Calculation Methods
- Middle touches aren't irrelevant
- They nurtured the relationship and kept your brand top-of-mind
- Lesson 1729 — Position-Based (U-Shaped) Attribution
- MIN
- , and **MAX**—together with **GROUP BY** to create rich summaries of grouped data.
- Lesson 892 — GROUP BY with Different Aggregate FunctionsLesson 894 — NULL Values in GROUP BY
- Minimize CAC
- ROAS might drop (reaching less-qualified audiences), and payback could extend
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- Minimize color
- Use color purposefully to highlight the key finding, not to decorate every category.
- Lesson 1958 — Simplifying Visual Complexity
- Minimize payback
- You may sacrifice ROAS (focusing on quick wins, not best returns) and accept higher CAC initially
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- Minimum Detectable Effect (MDE)
- is the smallest effect size your A/B test is designed to reliably detect.
- Lesson 1480 — Minimum Detectable Effect (MDE)Lesson 1493 — Why Sample Size Matters in A/B TestsLesson 1494 — Effect Size: The Minimum Detectable Effect
- Minimum sizes matter
- For digital displays, axis labels should typically be at least 10–12 points; titles 14–18 points.
- Lesson 1252 — Font Size, Typeface, and Readability
- Minimum test
- Only checks if the *smallest* value is an outlier
- Lesson 1393 — Two-Sided vs One-Sided Grubbs' Test
- Minimum Wage Changes
- One of the most famous DiD studies compared New Jersey (which raised minimum wage) to Pennsylvania (which didn't).
- Lesson 1459 — Real-World DiD Applications
- Minor departures matter more
- – slight skewness or a few outliers become "statistically significant"
- Lesson 209 — Sample Size Considerations in Normality Tests
- Misaligned incentives
- The team running the test is measured on engagement, not profitability, so they optimize for engagement.
- Lesson 1530 — Mismatched Metrics and Goals
- Misallocating budget
- based on attribution models that reward proximity, not causation
- Lesson 1717 — Incrementality and True Channel Impact
- Missed opportunities
- Early signals of success or failure go unnoticed
- Lesson 1617 — The Danger of Lagging-Only Metrics
- Missing context
- Always include a clear legend, data source, and what the metric represents.
- Lesson 1309 — Choropleth Maps: Basics and Best Practices
- Missing details
- Analysis steps aren't fully documented (remember Documentation Standards?
- Lesson 30 — The Reproducibility Crisis and Solutions
- Missing ON Clause
- Lesson 955 — Avoiding Cartesian Products
- Missing Value Codes
- How nulls or missing data are represented (NA, -999, blank)
- Lesson 2064 — Creating Data Dictionaries
- Missing values
- more gracefully than classical methods
- Lesson 745 — STL Decomposition (Seasonal-Trend Loess)Lesson 1762 — Extended Dimensions: Veracity and Value
- Mistaking correlation patterns
- Seeing A and Y correlate and assuming A → Y, when really both are caused by unmeasured C.
- Lesson 1476 — Common DAG Patterns and Pitfalls
- MIT or Apache 2.0
- Permissive licenses allowing commercial use with minimal restrictions
- Lesson 2082 — Choosing a License for Data Science Projects
- Mitigation measures
- technical safeguards (encryption, access controls) and organizational policies (training, audits)
- Lesson 1910 — Data Protection Impact Assessments (DPIAs)
- Mixed (Numeric × Categorical)
- When one variable is numeric and the other categorical (like "salary" across "department"), you're comparing distributions of the numeric variable across groups.
- Lesson 1182 — Choosing Analysis Methods by Variable Types
- Mixed ARMA Process
- Lesson 733 — Using ACF and PACF Together
- ML Engineers
- focus on *production systems and scale*.
- Lesson 2138 — Data Analyst vs Data Scientist vs ML Engineer
- MLlib
- provides scalable machine learning algorithms that work on distributed data:
- Lesson 1775 — Spark Components: Core, SQL, MLlib, Streaming
- MMM
- works at the aggregate level, analyzing total spend and performance across channels over time (typically weeks or months).
- Lesson 1736 — MMM vs Attribution: Key Differences
- mode
- is the value that appears most frequently in your data.
- Lesson 41 — The Mode: Most Frequent ValueLesson 42 — Comparing Mean, Median, and ModeLesson 174 — Symmetry and the Mode, Median, MeanLesson 1173 — Numerical Variable Summary Statistics
- Model 1 (Binary)
- Predicts whether an observation is a "certain zero" (structural) using logistic regression.
- Lesson 695 — Zero-Inflated Models
- Model 2 (Count)
- For those who *can* have the event, predicts the count using Poisson or negative binomial regression.
- Lesson 695 — Zero-Inflated Models
- Model A
- Predicts house price using square footage (R² = 0.
- Lesson 612 — Why R-Squared Alone Is Misleading
- model artifacts
- , version everything: the serialized model file, training code commit hash, hyperparameters, training data version, and evaluation metrics.
- Lesson 1877 — Versioning Strategies for Different Data Types; Lesson 2091 — Stage 7: Communication and Handoff
- Model artifacts and cache
- Lesson 2031 — Using .gitignore for Data Science Projects
- Model B
- Predicts house price using square footage + owner's favorite color (R² = 0.
- Lesson 612 — Why R-Squared Alone Is Misleading; Lesson 630 — Bayesian Information Criterion (BIC)
- Model checking
- Does your fitted model generate realistic fake data?
- Lesson 1571 — Posterior Predictive Distribution for New Data
- Model comparison
- Test models with different subsets and compare fit metrics
- Lesson 585 — Remedies: Variable Selection
- Model debt
- Quick fixes to model performance without understanding why they work
- Lesson 2131 — What is Technical Debt in Data Science?
- Model Development
- involves selecting appropriate statistical methods or machine learning algorithms, training them on your prepared data, and tuning parameters.
- Lesson 2089 — Stage 5: Model Development and Validation
- Model insights
- A baseline model might show your problem is too easy (100% accuracy suggests data leakage) or impossibly hard (random performance suggests the question can't be answered with available data).
- Lesson 2109 — Why Data Science is Inherently Iterative
- Model metadata
- Training parameters, hyperparameters, dataset versions, and evaluation scores (stored in JSON or YAML)
- Lesson 2034 — Committing Data Artifacts and Model Outputs
- Model mismatch
- Real problems often involve non-conjugate likelihoods or complex dependencies that conjugates can't handle
- Lesson 1555 — Advantages and Limitations of Conjugate Priors
- Model performance metrics
- "This model is 85% accurate on the test set"
- Lesson 2122 — When Uncertainty Is Acceptable
- Model performance plateaus
- Your validation accuracy improves from 0.
- Lesson 2116 — Diminishing Returns and the 80/20 Rule
- Model selection
- (certain algorithms handle discrete vs continuous differently)
- Lesson 18 — Numerical Variables: Discrete and Continuous
- Model staleness
- Your model was trained on 2022 data.
- Lesson 2136 — Monitoring Gaps and Silent Failures
- Model versioning gaps
- occur when you can't reliably reproduce a model's results because critical information wasn't tracked: which exact code version, which data snapshot, which hyperparameters, which library versions, or even which random seed was used.
- Lesson 2134 — Model Versioning and Reproducibility Gaps
- Model versions
- `model-churn-v2`, `fraud-detector-deployed`
- Lesson 2037 — Tagging Releases and Experiment Snapshots
- Model-ready format
- Machine learning libraries expect features in columns and samples in rows.
- Lesson 1149 — Benefits of Tidy Data for Downstream Work
- Modeling considerations
- "Imbalanced classes—use stratified sampling"
- Lesson 1212 — EDA Summary Documentation and Next Steps
- Modeling relationships
- Capture how variables relate to each other (e.
- Lesson 1901 — Synthetic Data Generation
- Moderate baseline rate example
- If your baseline is 50%, improving to 51% has much higher variance (0.
- Lesson 1499 — Adjusting for Baseline Conversion Rates
- Modern best practice
- Use Welch's t-test by default for two independent samples.
- Lesson 362 — Welch's t-Test for Unequal Variances
- Modern tools available
- MCMC samplers and probabilistic programming libraries handle non-conjugate cases well
- Lesson 1556 — Choosing Between Conjugate and Non-Conjugate Priors
- Modern, Python-first orchestration
- focused on developer experience.
- Lesson 1839 — Alternative Orchestration Tools
- Modes
- `overwrite`, `append`, `ignore`, `errorIfExists`
- Lesson 1779 — Reading and Writing Data in Spark
- Modified box plots
- adjust the fence calculations to account for skewness:
- Lesson 1388 — Limitations and Alternatives to IQR Detection
- Modified files
- Files you've changed since your last commit but haven't staged yet
- Lesson 1998 — Checking Repository Status
- modified Z-score
- replaces the vulnerable mean and standard deviation with robust alternatives:
- Lesson 73 — Modified Z-Score Using MAD; Lesson 1380 — Modified Z-Score Using Median
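A minimal sketch of the modified Z-score, assuming the common formulation in which the median replaces the mean and the median absolute deviation (MAD) replaces the standard deviation. The 0.6745 scaling constant and the 3.5 cutoff are widely used conventions, not values taken from this glossary, and the function and variable names are illustrative.

```python
from statistics import median


def modified_z_scores(values):
    """Return the modified Z-score of each value, based on median and MAD."""
    med = median(values)
    # Median absolute deviation: the median distance from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified Z-scores are undefined")
    # 0.6745 rescales MAD so scores are comparable to ordinary Z-scores
    # when the data are normally distributed.
    return [0.6745 * (v - med) / mad for v in values]


readings = [10, 12, 11, 13, 12, 11, 95]
scores = modified_z_scores(readings)
# Flag points beyond the conventional cutoff of 3.5 (an assumption here).
outliers = [v for v, s in zip(readings, scores) if abs(s) > 3.5]
print(outliers)
```

Because the median and MAD barely move when one extreme value is added, the extreme reading stands out clearly instead of inflating the scale it is measured against.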
- Monitor cohort-level performance
- to detect when trade-offs shift
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- Monitor Index Statistics
- Track metrics like index size, scan counts vs seek counts, and fragmentation percentage.
- Lesson 1086 — Index Maintenance and Monitoring
- Monitoring & Maintenance
- Lesson 9 — The Data Science Lifecycle Overview
- Monitoring infrastructure
- You need systems to check stopping conditions regularly (daily, hourly, or continuously)
- Lesson 1515 — Trade-offs: Sample Size, Speed, and Complexity
- Monitoring overhead
- You need infrastructure to detect drift before it damages performance
- Lesson 2128 — Data Distribution Shifts Frequently
- Monitoring recommendations
- what to watch when it's live
- Lesson 2091 — Stage 7: Communication and Handoff
- Monotonic
- S(t) never increases; it either stays flat or decreases
- Lesson 810 — The Survival Function S(t)
- monotonic relationships
- where one variable consistently increases as the other increases (positive monotonic) or consistently decreases (negative monotonic)—even if the relationship isn't a straight line.
- Lesson 486 — Spearman's Rank Correlation Coefficient; Lesson 490 — Kendall's Tau vs Spearman's Rho
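To illustrate why a monotonic but curved relationship is captured better by a rank-based measure, the sketch below compares Pearson's r with Spearman's rho on data where y is the cube of x. It assumes there are no tied values (the simple ranking helper does not average ties), and all names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank values from 1..n. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # strictly increasing, but far from a straight line

print(round(pearson(x, y), 3))   # below 1 because the relationship is curved
print(round(spearman(x, y), 3))  # 1.0 because the ordering is perfectly preserved
```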
- Monotonic vs Linear Relationships
- Lesson 487 — When to Use Spearman vs Pearson
- Monthly Active Users (MAU)
- does the same over a 30-day window.
- Lesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU)
- Monthly cycles
- Credit card spending spikes at month-end
- Lesson 707 — Seasonality: Regular Periodic Patterns
- Monthly Recurring Revenue (MRR)
- , a metric tree might decompose it as:
- Lesson 1621 — Metric Trees: Structure and Purpose
- Monthly vs. annual churn
- A 5% monthly churn rate doesn't equal 60% annual churn.
- Lesson 1671 — Churn Rate Calculation Methods
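The reason 5% monthly churn is not 60% annual churn is compounding: each month's churn applies only to the customers who are still left. A minimal sketch, assuming a constant monthly churn rate and no new customers entering the cohort (the function name is illustrative):

```python
def annual_churn(monthly_churn, months=12):
    """Convert a constant monthly churn rate into churn over `months` months."""
    # Fraction of the original cohort still active after `months` months.
    retained = (1 - monthly_churn) ** months
    return 1 - retained


rate = annual_churn(0.05)
print(f"{rate:.1%}")  # roughly 46%, well below the naive 12 x 5% = 60%
```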
- More complex write logic
- to keep redundant data synchronized
- Lesson 1071 — When to Denormalize: Performance Trade-offs
- More Type II errors
- – You're more likely to miss real effects (false negatives increase)
- Lesson 342 — Alpha Level Trade-offs
- More variability
- → Wider margin (unpredictable data means less precision)
- Lesson 294 — Margin of Error and Its Components
- Most accurate
- channel for quantitative data—humans excel at comparing positions along a common scale.
- Lesson 1231 — Channels of Visual Encoding
- Most Common Category
- Lesson 644 — Choosing a Reference Category
- most critical
- assumption for chi-squared tests.
- Lesson 426 — Assumptions and Sample Size Requirements; Lesson 550 — Normality of Residuals
- Most Likely Values
- Lesson 1539 — Interpreting Posterior Probabilities
- moving average
- structure of order q.
- Lesson 726 — Using ACF for Model Identification; Lesson 750 — What is a Moving Average?; Lesson 753 — Centered vs Trailing Moving Averages; Lesson 1017 — Moving Averages with Window Frames
- Moving Average (MA) models
- use previous *forecast errors* (also called residuals or shocks).
- Lesson 775 — Moving Average (MA) Models
- Moving averages
- give equal weight to all observations in the window.
- Lesson 764 — Exponential Smoothing vs Moving Averages
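The "equal weight" property can be seen in a short sketch of a trailing moving average, where every observation in the window contributes exactly 1/window to the result. The function name and sample values are illustrative.

```python
def trailing_moving_average(series, window):
    """Average each value with the (window - 1) values before it."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window  # equal weights
        for i in range(window - 1, len(series))
    ]


sales = [10, 20, 30, 40, 50]
smoothed = trailing_moving_average(sales, window=3)
print(smoothed)
```

The output is shorter than the input by `window - 1` points, because the first full window only becomes available at the third observation.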
- MPP (massively parallel processing)
- .
- Lesson 1813 — Modern Cloud Data Warehouses: Snowflake, BigQuery, Redshift
- MRR
- is the normalized monthly value of all active subscriptions.
- Lesson 1628 — SaaS Metrics: MRR, ARR, and Logo Churn
- MS (Mean Square)
- SS divided by df—the average variation per degree of freedom
- Lesson 444 — The ANOVA Table
- Multi-panel dashboards
- where consistent scales prevent visual confusion
- Lesson 1276 — Sharing Axes Between Subplots
- Multi-path funnels
- recognize that users can reach the same endpoint through different sequences of events.
- Lesson 1683 — Multi-Path and Non-Linear Funnels
- Multi-step ahead forecasting
- projects multiple periods into the future (e.
- Lesson 794 — Forecasting Concepts and Horizons
- Multi-touch
- Credit is distributed across multiple touchpoints (linear, time-decay, position-based)
- Lesson 1637 — What is Metric Attribution?
- multicollinearity
- (highly correlated predictors) before modeling
- Lesson 510 — Correlation Matrices: Construction and Display; Lesson 511 — Reading and Interpreting Correlation Matrices; Lesson 513 — Applications: Feature Selection and Multicollinearity; Lesson 622 — Relationship Between F-Test and t-Tests; Lesson 661 — Centering Predictors for Polynomials; Lesson 1192 — Correlation Matrices and Heatmaps
- multimodal
- distribution.
- Lesson 41 — The Mode: Most Frequent Value; Lesson 1175 — Histograms for Distribution Shape
- Multimodal data
- Multiple clusters make the mean and standard deviation misleading
- Lesson 1379 — Assumptions and Limitations
- Multiple columns
- "What's the average salary for each job title *within* each department?"
- Lesson 905 — Grouping by Multiple Columns: Basics
- Multiple linear regression
- extends the same least squares framework to include several predictor variables simultaneously.
- Lesson 595 — From Simple to Multiple Linear Regression
- Multiple lines
- Often overlay several cohorts or segments for comparison
- Lesson 1653 — What are Retention Curves?
- Multiple outliers expected
- Use robust methods like Modified Z-score or clustering techniques
- Lesson 1395 — When to Use Grubbs' Test
- Multiple subqueries
- execute independently instead of sharing work
- Lesson 966 — Performance Considerations for WHERE Subqueries
- Multiple Testing
- Lesson 430 — Common Applications and Pitfalls
- Multiple testing correction method
- (Bonferroni, Holm-Bonferroni, Benjamini-Hochberg, etc.
- Lesson 1508 — Pre-Registration and Correction Strategy
- Multiple views
- of the same data are needed (charts, tables, maps together)
- Lesson 1330 — Introduction to Interactive Dashboards
- multiplication rule
- you just learned works beautifully for two events: P(A ∩ B) = P(A) × P(B|A).
- Lesson 95 — Chain Rule for Multiple Events; Lesson 107 — Bayes' Theorem Formula and Components
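A small worked example of the multiplication rule, and its extension to three events, using exact fractions. The card-drawing scenario (aces drawn without replacement from a standard 52-card deck) is an illustration chosen here, not an example from the lessons.

```python
from fractions import Fraction

# P(A): the first card drawn is an ace.
p_first_ace = Fraction(4, 52)
# P(B|A): the second card is an ace, given the first was (3 aces left in 51 cards).
p_second_given_first = Fraction(3, 51)
# P(C|A and B): the third card is an ace, given the first two were.
p_third_given_first_two = Fraction(2, 50)

# Two events: P(A and B) = P(A) * P(B|A)
p_two_aces = p_first_ace * p_second_given_first

# Three events (chain rule): P(A and B and C) = P(A) * P(B|A) * P(C|A and B)
p_three_aces = p_two_aces * p_third_given_first_two

print(p_two_aces, p_three_aces)
```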
- multiplicative
- world, each decoration stretches proportionally—higher rungs get proportionally bigger decorations.
- Lesson 710 — Additive vs Multiplicative Models; Lesson 744 — Classical Decomposition Methods; Lesson 765 — Introduction to Holt-Winters Method; Lesson 825 — What is the Cox Proportional Hazards Model?
- Multiplicative forecasting formula
- Lesson 771 — Forecasting with Holt-Winters
- Multiplicative model
- `Observed = Trend × Seasonality × Irregular`
- Lesson 710 — Additive vs Multiplicative Models; Lesson 742 — Components of Seasonal Decomposition; Lesson 748 — Seasonally Adjusted Data; Lesson 749 — Using Decomposition for Forecasting; Lesson 770 — Initializing Holt-Winters Components
- Multiplicative models
- assume components are multiplied:
- Lesson 743 — Additive vs Multiplicative Models
- Multiplicative seasonality
- means seasonal swings grow or shrink proportionally with the trend level.
- Lesson 766 — Additive vs Multiplicative Seasonality
- Multiply
- each midpoint by its frequency (how many values fall in that range)
- Lesson 45 — Central Tendency for Grouped Data; Lesson 96 — Conditional Probability in Tree Diagrams; Lesson 178 — Log-Normal Distribution: Definition and Properties
- Multiply by the likelihood
- `P(data|θ)` — how probable the observed data is for each possible θ value
- Lesson 1545 — Calculating the Posterior Distribution
- Multiplying by a constant
- If you multiply *X* by constant *a*, the expectation scales proportionally:
- Lesson 149 — Properties of Expectation and Variance
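For reference, the standard scaling identities for a random variable *X* and a constant *a* can be written as follows. The variance identity is included for completeness as a well-known companion result; only the expectation statement appears in the entry above.

```latex
\mathbb{E}[aX] = a\,\mathbb{E}[X],
\qquad
\operatorname{Var}(aX) = a^{2}\,\operatorname{Var}(X)
```

For example, converting a measurement from metres to centimetres (a = 100) multiplies the mean by 100 and the variance by 10,000.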
- Multiprocessing Scheduler
- Lesson 1795 — Distributed Schedulers and Client Setup
- must
- be a success (that's your *r*th success)
- Lesson 135 — The Negative Binomial Distribution: Waiting for r Successes; Lesson 274 — Confidence Intervals for Small Samples; Lesson 799 — Fitting and Diagnosing SARIMA Models; Lesson 971 — Aliasing Derived Tables; Lesson 1056 — Foreign Key Constraints in Practice; Lesson 1841 — Upstream and Downstream Dependencies
- Mutual independence
- (also called *joint independence*) is stronger.
- Lesson 103 — Mutual Independence vs Pairwise Independence
- mutually exclusive
- .
- Lesson 80 — Set Operations: Union, Intersection, and Complement; Lesson 83 — Partitions of the Sample Space; Lesson 86 — General Addition Rule for Overlapping Events; Lesson 89 — The Complement Rule; Lesson 106 — Common Misconceptions About Independence; Lesson 309 — Complementary Nature of Hypotheses
- MySQL
- is another open-source option, popular for web applications and simpler projects.
- Lesson 845 — Database Management Systems (DBMS); Lesson 862 — Case Sensitivity in Text Filtering; Lesson 940 — Database Support and Alternatives; Lesson 1041 — Formatting and Parsing Dates
N
- Named parameters
- give each placeholder a meaningful name using the `:name` syntax, making your queries self-documenting and easier to modify.
- Lesson 1106 — Parameter Placeholders: Named Parameters; Lesson 1108 — Handling IN Clauses Safely
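A short sketch of named placeholders using Python's built-in `sqlite3` module, which accepts the `:name` style with a dictionary of values. The table, column names, and rows are made up for illustration; other database drivers may use a different placeholder style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")

# Each dictionary key matches a :name placeholder in the statement.
conn.executemany(
    "INSERT INTO employees VALUES (:name, :department, :salary)",
    [
        {"name": "Ada", "department": "Engineering", "salary": 120000},
        {"name": "Grace", "department": "Engineering", "salary": 95000},
        {"name": "Linus", "department": "Engineering", "salary": 105000},
        {"name": "Mary", "department": "Sales", "salary": 110000},
    ],
)

# The query reads like documentation: it is clear what each value means,
# and the values are passed separately rather than pasted into the SQL string.
rows = conn.execute(
    """
    SELECT name
    FROM employees
    WHERE department = :dept AND salary >= :min_salary
    ORDER BY name
    """,
    {"dept": "Engineering", "min_salary": 100000},
).fetchall()

conn.close()
print(rows)
```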
- Naming conventions prevent chaos
- Establish patterns like `YYYY-MM-DD_project_dataset_version.
- Lesson 2068 — Data Provenance Best Practices
- Narrower
- than the original population distribution
- Lesson 252 — Sampling Distribution of the Sample Mean
- National origin
- Lesson 1888 — Protected Classes and Sensitive Attributes
- Natural keys
- Existing unique values like `email` or `ssn`
- Lesson 1048 — What Are Primary Keys?; Lesson 1050 — Choosing Effective Primary Keys
- Natural order
- For ordinal categories like "Small, Medium, Large" or months, preserve logical sequence
- Lesson 1178 — Bar Charts for Categorical Data
- Natural workflow
- Mirrors how we actually learn—bit by bit, not all at once
- Lesson 1538 — Updating Beliefs with Sequential Data
- Navigate with keyboard only
- can you reach all interactive features?
- Lesson 1254 — Testing Visualizations for Accessibility
- Near-duplicates
- Similar records that might represent the same entity (e.
- Lesson 1154 — Uniqueness and Duplication Checks
- Nearest-Neighbor Matching
- pairs each treated unit with the control unit(s) having the closest propensity score.
- Lesson 1448 — Propensity Score Matching Methods
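A minimal sketch of one-to-one nearest-neighbor matching on propensity scores. It assumes the scores have already been estimated, and it matches with replacement (a control unit can be reused) without any caliper; both are simplifying assumptions, and the unit IDs and scores are invented for illustration.

```python
def nearest_neighbor_match(treated, control):
    """Map each treated unit to the control unit with the closest score.

    `treated` and `control` are dicts of unit_id -> propensity score.
    Matching is done with replacement.
    """
    matches = {}
    for treated_id, treated_score in treated.items():
        matches[treated_id] = min(
            control,
            key=lambda control_id: abs(control[control_id] - treated_score),
        )
    return matches


treated_scores = {"t1": 0.80, "t2": 0.35}
control_scores = {"c1": 0.30, "c2": 0.78, "c3": 0.55}

pairs = nearest_neighbor_match(treated_scores, control_scores)
print(pairs)
```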
- Necessity assessment
- why this processing is needed, what legal basis applies
- Lesson 1910 — Data Protection Impact Assessments (DPIAs)
- Negative
- Your model overpredicted (observed < fitted)
- Lesson 539 — What Are Residuals?; Lesson 652 — Interpreting Categorical × Continuous Interactions
- Negative (Left) Skewness
- Lesson 64 — Skewness: Definition and Interpretation
- Negative binomial
- "How many calls until I get *10* EV owners for my focus group?"
- Lesson 138 — Real-World Applications: Quality Control and Surveys; Lesson 694 — Quasi-Poisson and Negative Binomial Models
- negative binomial distribution
- answers this question: "How many trials will I need to get exactly *r* successes?"
- Lesson 135 — The Negative Binomial Distribution: Waiting for r Successes; Lesson 136 — Expectation and Variance of the Negative Binomial; Lesson 137 — Geometric vs Negative Binomial: Key Differences; Lesson 138 — Real-World Applications: Quality Control and Surveys
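The "trials until the *r*th success" question can be sketched with the standard probability mass function: the last trial must be a success, and the other *r* − 1 successes can fall anywhere among the first *k* − 1 trials. This uses the "number of trials" parameterization; some libraries count failures instead, so the function below is an illustration rather than a drop-in match for any particular package.

```python
from math import comb


def neg_binomial_pmf(k, r, p):
    """P(the r-th success occurs exactly on trial k), success probability p."""
    if k < r:
        return 0.0  # cannot have r successes in fewer than r trials
    # Choose which of the first k-1 trials hold the other r-1 successes,
    # then multiply by r successes and k-r failures.
    return comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)


# Probability that the 3rd success arrives on exactly the 5th trial, p = 0.5.
prob = neg_binomial_pmf(k=5, r=3, p=0.5)
print(prob)

# Sanity check: the probabilities over all possible k should sum to about 1.
total = sum(neg_binomial_pmf(k, 3, 0.5) for k in range(3, 200))
```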
- Negative coefficient
- → that category has a lower average outcome than the reference
- Lesson 637 — Interpreting Dummy Variable Coefficients
- Negative correlation
- Points trend downward (as one increases, the other decreases)
- Lesson 1222 — Scatter Plots for Relationships
- Negative r
- Variables move in opposite directions (car age and resale value, temperature and heating bills)
- Lesson 477 — Interpreting the Correlation Coefficient
- Negative residuals
- (`e_i < 0`) occur when the actual value is *below* the fitted line.
- Lesson 540 — The Residual Formula
- Negative values
- = lighter tails than normal (platykurtic)
- Lesson 67 — Calculating Kurtosis; Lesson 720 — The Autocorrelation Function (ACF)
- Neglecting the complement
- When you know P(A|B), don't assume you automatically know P(A|not B).
- Lesson 100 — Common Conditional Probability Mistakes
- Neither
- recognizes the critical middle steps that move customers down the funnel
- Lesson 1724 — Limitations of Single-Touch Attribution
- nested
- when one model is a special case of the other.
- Lesson 626 — Nested vs Non-Nested Models; Lesson 1334 — Dash Basics: App Layout with Components
- nested models
- a smaller model (with fewer predictors) and a larger model (with additional predictors).
- Lesson 623 — Partial F-Tests for Nested Models; Lesson 627 — The F-Test for Model Comparison; Lesson 699 — The Likelihood Ratio Test; Lesson 791 — Comparing Nested and Non-Nested Models
- Net Revenue Retention
- measures revenue retention *including* expansion from existing customers:
- Lesson 1629 — SaaS Growth Metrics: Quick Ratio and Net Revenue Retention
- Netflix
- Hours watched — reflects content value and reduces churn risk.
- Lesson 1606 — Examples of North Star Metrics by Industry
- Network effects
- In social features, randomizing by user may "leak" treatment effects to control users who interact with treated users.
- Lesson 1481 — Unit of Randomization; Lesson 1923 — Algorithmic Amplification of Harm
- Network Errors
- happen when Python can't reach the database server.
- Lesson 1093 — Troubleshooting Connection Issues
- Neural networks
- Many deep learning frameworks expect one-hot encoded inputs
- Lesson 638 — One-Hot Encoding Overview
- Never
- choose your tail configuration after seeing your results.
- Lesson 350 — Choosing the Right Tail Configuration
- Never hardcode credentials
- Store them in environment variables or configuration files:
- Lesson 1090 — Establishing a Connection with psycopg2 (PostgreSQL)
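One way to keep credentials out of source code is to read them from environment variables at startup. In this sketch the variable names (`DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`, `DB_PORT`) and the helper `load_db_config` are hypothetical choices, not names required by psycopg2; the returned keys are chosen to line up with the keyword arguments that `psycopg2.connect` accepts.

```python
import os


def load_db_config(env=os.environ):
    """Build connection settings from environment variables.

    The variable names used here are illustrative assumptions.
    """
    required = ["DB_HOST", "DB_NAME", "DB_USER", "DB_PASSWORD"]
    missing = [key for key in required if key not in env]
    if missing:
        # Fail early with a clear message instead of a confusing login error.
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env["DB_HOST"],
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "port": int(env.get("DB_PORT", "5432")),  # PostgreSQL's default port
    }


# Demo only: a plain dict stands in for the real environment.
demo_env = {
    "DB_HOST": "localhost",
    "DB_NAME": "analytics",
    "DB_USER": "reporter",
    "DB_PASSWORD": "placeholder-not-a-real-secret",
}
config = load_db_config(demo_env)
# With psycopg2 installed, the settings could be passed as:
#   conn = psycopg2.connect(**config)
```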
- New Customers
- Recent converters who've made their first purchase or subscription.
- Lesson 1704 — Customer Lifecycle Stages
- New hypothesis
- Email engagement is the driver, not day-of-week effects
- Lesson 1201 — Domain Knowledge as a Hypothesis Source
- New insights from data
- Analysis might reveal that what you thought was a key driver (a branch metric) actually has minimal impact on your North Star.
- Lesson 1626 — Maintaining and Evolving Metric Trees
- No arbitrary cutoff
- All past data contributes (just with declining weight)
- Lesson 757 — Introduction to Exponential Smoothing
- No autocorrelation
- values shouldn't predict future values
- Lesson 709 — Irregular Component: Random Noise
- No change
- "Customer satisfaction hasn't changed after the redesign" (before = after)
- Lesson 307 — Defining the Null Hypothesis (H₀)
- no correlation
- between the two variables in the population.
- Lesson 500 — Hypothesis Testing Framework for Correlation; Lesson 1222 — Scatter Plots for Relationships
- No decay (flat)
- Equal weight throughout the window—simpler but less realistic
- Lesson 1639 — Time Windows and Attribution Decay
- No difference
- "The mean weight of Group A equals the mean weight of Group B" (μ₁ = μ₂)
- Lesson 307 — Defining the Null Hypothesis (H₀)
- No effect
- "This new drug has no effect on blood pressure" (effect = 0)
- Lesson 307 — Defining the Null Hypothesis (H₀)
- No extreme outliers
- A single outlier can dramatically distort r
- Lesson 480 — Scatterplots and Visual Assessment
- No Multicollinearity
- Lesson 546 — The Five Core Assumptions of Linear Regression
- No one person understands
- the entire chain anymore
- Lesson 2132 — Pipeline Glue Code and Complexity Creep
- No outliers
- Mean provides more information because it uses all data points.
- Lesson 42 — Comparing Mean, Median, and Mode
- No patterns remaining
- if you see structure in the residuals, you've missed something
- Lesson 709 — Irregular Component: Random Noise
- No Perfect Multicollinearity
- Lesson 601 — Assumptions for Multiple Linear Regression
- No relationship
- "There's no correlation between study hours and test scores" (correlation = 0)
- Lesson 307 — Defining the Null Hypothesis (H₀)
- No repeating groups
- there are no columns like "Phone1", "Phone2", "Phone3" storing similar data
- Lesson 1064 — First Normal Form (1NF)
- No selection bias
- (nothing observable or unobservable influences assignment)
- Lesson 1487 — Simple Random Assignment
- no trend
- and **constant seasonality** can be stationary
- Lesson 712 — What is Stationarity?; Lesson 758 — Simple Exponential Smoothing (SES)
- No trend (flat)
- Values fluctuate around a stable mean with no persistent direction
- Lesson 706 — Trend: Long-Term Direction
- Node Color
- Use color to encode categories (communities, types) or continuous values (temperature scales for metrics).
- Lesson 1319 — Styling Network Visualizations
- Node Size
- Scale nodes by importance metrics (degree centrality, betweenness) or attributes (population, budget).
- Lesson 1319 — Styling Network Visualizations
- Nodes
- represent variables (like treatment, outcome, confounders)
- Lesson 1468 — Introduction to Directed Acyclic Graphs (DAGs)
- Noise addition
- introduces random perturbation to numerical data, making exact values uncertain while preserving statistical properties for aggregate analysis.
- Lesson 1895 — Data Anonymization Basics
- Nominal
- variables are categories without any inherent ranking or order.
- Lesson 17 — Categorical Variables: Nominal and Ordinal
- Nominal data
- (categories with no order: fruit types, countries, product names) pairs best with:
- Lesson 1238 — Matching Encoding to Data Type
- Non-canonical links
- are any other valid link functions you might choose for that distribution.
- Lesson 676 — Canonical vs Non-Canonical Links
- Non-correlated SELECT subqueries
- run once:
- Lesson 969 — Performance Considerations for SELECT Subqueries
- Non-correlated subqueries
- are completely independent.
- Lesson 968 — Correlated vs Non-Correlated Subqueries in SELECT
- Non-correlated with JOIN
- Lesson 980 — Converting Correlated to Non-Correlated Subqueries
- non-directional
- .
- Lesson 345 — Directionality in Hypothesis Testing; Lesson 415 — Setting Up Hypotheses for Goodness of Fit
- Non-directional (two-tailed)
- "The new landing page will *change* sign-ups"
- Lesson 1479 — Formulating Hypotheses
- Non-independence
- If observations are related in ways you haven't accounted for (clustered data, time series correlations), the independence assumption fails completely, and your p-values become meaningless.
- Lesson 390 — When Parametric Tests Fail: Violations of Assumptions
- Non-independent pairs
- Results are unreliable; reconsider your study design
- Lesson 374 — Assumptions of the Paired t-Test
- Non-informative (flat) priors
- essentially say "I know nothing" — they let the data dominate the analysis completely.
- Lesson 1534 — The Prior Distribution
- Non-linear funnels
- acknowledge that users don't always move forward.
- Lesson 1683 — Multi-Path and Non-Linear Funnels
- Non-linearity
- The relationship between X and Y curves rather than forming a straight line
- Lesson 591 — When and Why to Transform Variables
- non-nested
- when neither is a special case of the other.
- Lesson 626 — Nested vs Non-Nested Models; Lesson 629 — Akaike Information Criterion (AIC)
- Non-nested models
- are competitors that can't be simplified into one another.
- Lesson 791 — Comparing Nested and Non-Nested Models
- Non-normal data
- Switch to IQR-based detection; it's distribution-agnostic
- Lesson 1395 — When to Use Grubbs' Test
- Non-normal differences (large n)
- Paired t-test is usually still robust
- Lesson 374 — Assumptions of the Paired t-Test
- Non-normal differences (small n)
- Consider the Wilcoxon signed-rank test (a non-parametric alternative)
- Lesson 374 — Assumptions of the Paired t-Test
- Non-Normal Distributions
- Lesson 487 — When to Use Spearman vs Pearson; Lesson 1379 — Assumptions and Limitations
- Non-normal residuals
- The Q-Q plot shows heavy tails, skewness, or other departures from normality
- Lesson 591 — When and Why to Transform Variables
- Non-normality with small samples
- If your sample size is small (typically n < 30) and your data show strong skewness, heavy outliers, or non-normal distributions (confirmed through visual checks or tests like Shapiro-Wilk), the t-test's results become unreliable.
- Lesson 390 — When Parametric Tests Fail: Violations of Assumptions
- Non-parametric part
- The baseline hazard function (risk over time for someone with all covariates = 0) is **not assumed** to follow any distribution—it's left flexible.
- Lesson 825 — What is the Cox Proportional Hazards Model?
- non-probability sampling
- method where you select individuals or items simply because they're easy to reach.
- Lesson 239 — Convenience Sampling; Lesson 242 — Probability vs Non-Probability Sampling; Lesson 247 — Survivorship Bias
- Non-probability sampling advantages
- Lesson 242 — Probability vs Non-Probability Sampling
- Non-probability sampling limitations
- Lesson 242 — Probability vs Non-Probability Sampling
- Non-regression contexts
- Classification tasks, clustering, or similarity calculations
- Lesson 638 — One-Hot Encoding Overview
- Non-repeatable reads
- Reading the same row twice and getting different values
- Lesson 1116 — Transaction Isolation and Concurrency
- Non-response bias
- When certain groups don't respond to your survey.
- Lesson 244 — Selection Bias and Its Causes
- Non-stationarity
- The statistical properties (mean, variance) often change over time—seasonal patterns, trends, and structural breaks are common
- Lesson 704 — What Makes Time Series Data Different?
- non-stationary
- .
- Lesson 716 — Augmented Dickey-Fuller Test; Lesson 725 — Decay Rates in ACF; Lesson 726 — Using ACF for Model Identification
- Non-technical audiences
- (executives, stakeholders, general public) typically:
- Lesson 1950 — Identifying Your Audience: Technical vs Non-Technical
- Non-trivial
- `StudentID → StudentName` (actually tells us something)
- Lesson 1063 — Functional Dependencies
- Nonlinear relationships
- Curves, U-shapes, or other patterns that aren't straight lines
- Lesson 1222 — Scatter Plots for Relationships
- Normal
- (for continuous outcomes)
- Lesson 664 — What is the Exponential Family of Distributions?; Lesson 669 — The Dispersion Parameter φ; Lesson 1568 — Unknown Variance: Normal-Inverse-Gamma Model
- normal distribution
- (also called the Gaussian distribution) is a continuous probability distribution that creates a distinctive **bell curve** shape when graphed.
- Lesson 169 — The Normal Distribution: Definition and Properties; Lesson 180 — Parameters and Moments of the Log-Normal; Lesson 676 — Canonical vs Non-Canonical Links; Lesson 1395 — When to Use Grubbs' Test; Lesson 1566 — Conjugate Normal-Normal Model
- Normal posterior
- Use the mean ± (z-score × standard deviation) where z-score comes from the normal distribution.
- Lesson 1579 — Practical Computation of Credible Intervals
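The mean ± z × standard deviation recipe can be sketched with the standard library's `NormalDist`, which supplies the z-score for any credibility level. The posterior mean and standard deviation below are made-up inputs, and the function name is illustrative.

```python
from statistics import NormalDist


def normal_credible_interval(mean, sd, level=0.95):
    """Equal-tailed credible interval for a normal posterior."""
    # For a 95% interval this is the 97.5th percentile, about 1.96.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * sd, mean + z * sd


low, high = normal_credible_interval(mean=10.0, sd=2.0, level=0.95)
print(round(low, 2), round(high, 2))
```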
- normal prior
- for means because:
- Lesson 1565 — Prior Distributions for Normal Means; Lesson 1566 — Conjugate Normal-Normal Model
- Normal-Inverse-Gamma (NIG)
- distribution is a conjugate prior for the normal likelihood when both mean (μ) and variance (σ²) are unknown.
- Lesson 1568 — Unknown Variance: Normal-Inverse-Gamma Model
- Normal-Normal conjugacy
- Normal prior + Normal likelihood = Normal posterior.
- Lesson 1553 — Normal-Normal Conjugacy
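A minimal sketch of the Normal-Normal update for an unknown mean, assuming the data variance is known. In that standard setup the posterior precision is the sum of the prior precision and the data precision, and the posterior mean is a precision-weighted average of the prior mean and the sample mean. Function and argument names are illustrative.

```python
def normal_normal_update(prior_mean, prior_var, data, data_var):
    """Posterior mean and variance for a normal mean with known data variance."""
    n = len(data)
    sample_mean = sum(data) / n
    prior_precision = 1 / prior_var   # how confident the prior is
    data_precision = n / data_var     # how much information the data carry
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (
        prior_precision * prior_mean + data_precision * sample_mean
    )
    return post_mean, post_var


# Prior centered at 0 with variance 1; four observations of 2.0 with variance 1.
post_mean, post_var = normal_normal_update(0.0, 1.0, [2.0, 2.0, 2.0, 2.0], 1.0)
print(post_mean, post_var)
```

With four observations against a prior worth one observation, the posterior mean lands four-fifths of the way from the prior mean to the sample mean.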
- Normality
- Each group's data is roughly normally distributed (or sample size is large)
- Lesson 447 — Conducting One-Way ANOVA in Practice; Lesson 544 — The Role of Residuals in Diagnostics; Lesson 546 — The Five Core Assumptions of Linear Regression; Lesson 601 — Assumptions for Multiple Linear Regression; Lesson 782 — Residual Diagnostics for ARIMA
- Normality checks
- Residuals should be roughly normally distributed (histogram, Q-Q plot)
- Lesson 799 — Fitting and Diagnosing SARIMA Models
- Normality holds
- Your data (or sampling distribution) is approximately normal, especially with small samples (n < 30)
- Lesson 398 — Choosing Between Parametric and Non-Parametric Tests
- Normality of Differences
- Lesson 374 — Assumptions of the Paired t-Test
- Normality required
- Your data must be approximately normally distributed (Grubbs' is parametric)
- Lesson 1389 — What is Grubbs' Test?
- Normality violated
- Severe skewness, outliers, or small samples from non-normal populations
- Lesson 398 — Choosing Between Parametric and Non-Parametric Tests
- Normality violated, small sample
- → Switch to non-parametric alternative (Mann-Whitney, Wilcoxon signed-rank)
- Lesson 383 — Diagnostic Workflow: When to Proceed or Switch Tests
- Normality violations
- The Central Limit Theorem saves us here.
- Lesson 382 — Robustness of t-Tests to Assumption Violations
- Normalization
- splitting tables so each stores one logical entity with proper relationships maintained through primary and foreign keys.
- Lesson 1062 — Data Anomalies: Insert, Update, Delete
- Normalize by the evidence
- `P(data)` — a scaling constant that ensures probabilities sum to 1
- Lesson 1545 — Calculating the Posterior Distribution
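The role of the evidence as a normalizing constant can be seen in a small grid approximation. The sketch assumes a coin-flip style problem (7 successes in 10 trials), a uniform prior over eleven candidate values of θ, and a binomial likelihood; all of these are illustrative choices.

```python
from math import comb

# Candidate parameter values and a uniform prior over them.
thetas = [i / 10 for i in range(11)]
prior = [1 / len(thetas)] * len(thetas)

successes, trials = 7, 10

# Likelihood P(data | theta) for each candidate theta.
likelihood = [
    comb(trials, successes) * t**successes * (1 - t) ** (trials - successes)
    for t in thetas
]

# Prior times likelihood gives the unnormalized posterior.
unnormalized = [p * l for p, l in zip(prior, likelihood)]

# The evidence P(data) is the total, and dividing by it makes
# the posterior probabilities sum to 1.
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

most_probable_theta = thetas[posterior.index(max(posterior))]
print(most_probable_theta, round(sum(posterior), 6))
```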
- Normalize to UTC
- Convert any timezone-aware timestamp to UTC for storage
- Lesson 1042 — Working with Timestamps and Time Zones
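A brief sketch of normalizing a timezone-aware timestamp to UTC before storage, using only the standard library. A fixed UTC−5 offset stands in for a real named time zone to keep the example dependency-free, and the helper name `to_utc` is hypothetical. Naive timestamps are rejected because their zone cannot be known.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts):
    """Convert a timezone-aware datetime to UTC; reject naive datetimes."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: its time zone is unknown")
    return ts.astimezone(timezone.utc)


# Fixed offset of UTC-5 used as a stand-in for a local time zone.
utc_minus_5 = timezone(timedelta(hours=-5))
local_time = datetime(2024, 1, 15, 9, 30, tzinfo=utc_minus_5)

stored = to_utc(local_time)
print(stored.isoformat())
```

The converted value represents the same instant as the original, so comparisons and durations computed from stored timestamps stay correct regardless of where each event was recorded.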
- Normalize Your Data
- Lesson 1309 — Choropleth Maps: Basics and Best Practices
- Normalizing values
- means replacing variations with a single standard form.
- Lesson 1138 — Cleaning and Standardizing Text Fields
- normally distributed
- (bell-shaped).
- Lesson 71 — Z-Score Method for Outlier Detection; Lesson 224 — CLT for Proportions
- North Star Metric
- (NSM) is the one metric that best captures the core value your product or service delivers to customers.
- Lesson 1604 — What is a North Star Metric?
- not
- A"
- Lesson 80 — Set Operations: Union, Intersection, and Complement; Lesson 81 — Mutually Exclusive Events; Lesson 267 — Interpreting Confidence Levels; Lesson 548 — Independence of Observations; Lesson 865 — Introduction to Logical Operators in SQL; Lesson 868 — The NOT Operator; Lesson 870 — Operator Precedence and Parentheses; Lesson 884 — AVG: Computing Averages (+4 more)
- Not a partition
- Lesson 83 — Partitions of the Sample Space
- Not quite
- Different libraries often maintain **separate random number generators** with independent states.
- Lesson 2058 — Seed Scope and Multiple Libraries
- Not reproducible
- Others can't easily adapt your paths and settings
- Lesson 2072 — Configuration Files vs Hard-Coded Values
- Not Robust
- For non-normal data, percentiles or IQR-based methods often work better than z-scores.
- Lesson 201 — Z-Score Applications and Limitations
- Not so fast
- You've ignored the thousands of failed startups that *also* took big risks but went bankrupt.
- Lesson 247 — Survivorship Bias
- Not sure
- Try multiple window sizes and compare how well they balance smoothness with responsiveness for your specific problem
- Lesson 752 — Choosing the Window Size
- Notebooks vs code
- Use notebooks (`notebooks/`) for exploration and communication.
- Lesson 2069 — Project Directory Structure
- Novelty bias
- Users often react differently to changes initially—either with excitement (novelty effect) or resistance (change aversion).
- Lesson 1484 — Duration and Timing Considerations
- Novelty Effect
- Users interact more with something *because it's new and different*, not because it's actually better.
- Lesson 1525 — Novelty and Primacy Effects
- NPS surveys
- that don't correlate with renewal rates in your specific business
- Lesson 1616 — Metrics Divorced from Revenue
- Null (H₀)
- The two variables are independent (no association)
- Lesson 433 — Conducting Fisher's Exact Test; Lesson 787 — Ljung-Box Test for Residual Autocorrelation
- Null deviance
- measures how poorly an intercept-only model (just predicting the overall mean/rate) fits your data.
- Lesson 698 — Null and Residual Deviance
- Null hypothesis
- The factor has no effect on the outcome (all group means for that factor are equal)
- Lesson 464 — Main Effects in Two-Way ANOVALesson 474 — Friedman Test: Non-Parametric Repeated Measures ANOVALesson 654 — Testing Interaction SignificanceLesson 819 — Null Hypothesis in the Log-Rank TestLesson 1467 — Testing Instrument Strength and Validity
- Null hypothesis (H₀)
- The data comes from a normal distribution
- Lesson 205 — Shapiro-Wilk TestLesson 207 — Anderson-Darling TestLesson 307 — Defining the Null Hypothesis (H₀)Lesson 354 — Setting Up Hypotheses for One-Sample t-TestLesson 378 — Testing Normality: Statistical TestsLesson 401 — Setting Up Hypotheses for ProportionsLesson 406 — Two-Sample Proportion Test SetupLesson 415 — Setting Up Hypotheses for Goodness of Fit (+9 more)
- NULL values
- from unmatched rows can affect your counts differently than expected
- Lesson 933 — Aggregating with LEFT JOINs
- Number at risk
- = all subjects who haven't yet had the event *and* haven't been censored before time *t*
- Lesson 812 — Handling Event Times and Censoring
- Number of categories
- How many distinct groups or bins your data falls into
- Lesson 418 — Degrees of Freedom in Goodness of Fit
- Number of events
- = only those who actually experienced the event at time *t*
- Lesson 812 — Handling Event Times and Censoring
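The "Number at risk" and "Number of events" entries describe the two counts tallied at each event time in a Kaplan-Meier table. A minimal sketch of that bookkeeping follows; the function name `risk_table` and the sample data are illustrative assumptions, not taken from the lesson.

```python
def risk_table(durations, observed):
    """Return (time, n_at_risk, n_events) for each distinct event time.

    durations: follow-up time for each subject
    observed:  True if the event occurred, False if the subject was censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    table = []
    for t in event_times:
        # At risk: no event and no censoring before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events: subjects who experienced the event exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        table.append((t, at_risk, events))
    return table


# Hypothetical data: five subjects, two of them censored.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
table = risk_table(durations, observed)
```

A subject censored at time *t* still counts as at risk at *t* here, which matches the "not censored before time *t*" wording.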
- Number of Groups
- Lesson 446 — Power and Sample Size for ANOVA
- Number of variables
- Are you showing one variable, comparing two, or exploring relationships among three or more?
- Lesson 1230 — Choosing the Right Chart Type
- Numeric × Numeric
- When both variables are continuous or discrete numbers, you want to assess linear relationships, strength, and direction.
- Lesson 1182 — Choosing Analysis Methods by Variable Types
- Numeric-to-Categorical
- Compare distributions using grouped summary statistics and visualizations (box plots by group).
- Lesson 1210 — Relationship Exploration: Correlation and Association
- Numeric-to-Numeric
- Use correlation coefficients (Pearson, Spearman) and correlation matrices to spot linear and monotonic relationships.
- Lesson 1210 — Relationship Exploration: Correlation and Association
- Numerical data
- (continuous or discrete)
- Lesson 39 — The Mean (Arithmetic Average)Lesson 41 — The Mode: Most Frequent Value
- Numerical stability
- Stan's implementation of Hamiltonian Monte Carlo (NUTS) includes automatic differentiation and careful numerical engineering
- Lesson 1595 — Stan: High-Performance Bayesian Inference
- NUTS (No-U-Turn Sampler)
- is an advanced version of HMC that automatically tunes a critical parameter: how long to let the "ball" roll.
- Lesson 1593 — Hamiltonian Monte Carlo and NUTS
- NYC's coefficient
- (say, +15): means NYC is 15 units higher than Boston
- Lesson 643 — Interpreting Coefficients Relative to Reference
O
- O'Brien-Fleming
- Spends very little alpha early (conservative early looks), saving most for the final analysis
- Lesson 1512 — Group Sequential Testing
- Objective
- is a clear, inspiring goal that describes *what* you want to achieve.
- Lesson 1607 — Introduction to OKRs (Objectives and Key Results)
- Objective Example
- Lesson 1607 — Introduction to OKRs (Objectives and Key Results)
- Objectives
- are the qualitative, inspirational statements that describe *what* you want to achieve.
- Lesson 1609 — Setting Effective Objectives
- Observations are independent
- Each observation doesn't influence others
- Lesson 399 — When to Use the One-Sample Z-Test for Proportions
- Observe
- Collect data for a period (say, one day's worth of conversions)
- Lesson 1582 — Updating Beliefs with Test Data
- Observed
- The actual count you got in each category
- Lesson 417 — The Chi-Squared Test Statistic Formula
- observed frequencies
- are from your **expected frequencies** across all categories.
- Lesson 414 — Introduction to Chi-Squared Goodness of Fit TestLesson 416 — Calculating Expected FrequenciesLesson 423 — Contingency Tables and Expected Frequencies
- Odds ratio
- for proportions
- Lesson 384 — What is Effect Size?Lesson 677 — Interpreting Coefficients Under Different Links
- Offline
- Can use computationally intensive methods; you can look both forward *and* backward from any point
- Lesson 1414 — Offline vs Online Change-Point Detection
- Offline (batch) change-point detection
- works like a detective reviewing cold cases.
- Lesson 1414 — Offline vs Online Change-Point Detection
- offset
- is a predictor whose coefficient is fixed at 1.
- Lesson 692 — Offset Terms for ExposureLesson 1023 — Introduction to Window Functions: LAG and LEADLesson 1024 — LAG Function: Accessing Previous Row ValuesLesson 1025 — LEAD Function: Accessing Next Row Values
- Omega-squared
- provides a less biased, more conservative estimate by adjusting for sample size.
- Lesson 445 — Effect Size: Eta-Squared and Omega-Squared
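To make the sample-size adjustment concrete, here is a rough sketch that computes eta-squared and omega-squared side by side for a one-way design. The function name and the sample groups are assumptions for illustration; the formulas are the standard one-way ANOVA versions.

```python
def anova_effect_sizes(groups):
    """Return (eta_squared, omega_squared) for a one-way ANOVA layout."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_within = ss_total - ss_between

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_within = ss_within / df_within

    eta_sq = ss_between / ss_total
    # Omega-squared subtracts the between-group variation expected from noise alone.
    omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    return eta_sq, omega_sq


# Hypothetical data: three groups of three observations each.
eta_sq, omega_sq = anova_effect_sizes([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
```

For these groups omega-squared comes out lower than eta-squared, which is the "more conservative" behaviour the entry describes.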
- Omitted variable bias
- A third variable influences both X and Y, creating a spurious relationship
- Lesson 553 — Exogeneity: X Must Be Independent of Errors
- Omitted variables
- Important confounders are missing from your model and hide in the error term
- Lesson 1464 — Instrumental Variables: The Endogeneity Problem
- ON
- is required for complex conditions (inequalities, multiple different columns)
- Lesson 953 — Join Conditions: ON vs USING
- ON condition
- The matching rule, usually comparing a column from each table
- Lesson 919 — Basic INNER JOIN Syntax
- On macOS
- Open Terminal and type `git --version`.
- Lesson 1991 — Installing Git and Initial Configuration
- On Windows
- Download the installer from [git-scm.
- Lesson 1991 — Installing Git and Initial ConfigurationLesson 2040 — Creating and Activating Virtual Environments with venv
- once
- in the output, not twice.
- Lesson 953 — Join Conditions: ON vs USINGLesson 1003 — Set Operation Requirements and RulesLesson 2057 — Setting Seeds in Python and R
- once per row
- in the outer query
- Lesson 967 — Subqueries in the SELECT ClauseLesson 969 — Performance Considerations for SELECT SubqueriesLesson 978 — Correlated Subqueries in SELECT Clauses
- Once spent, it's gone
- You cannot query indefinitely—eventually you exhaust your budget and must stop
- Lesson 1900 — Privacy Budget and Composition
- One categorical independent variable
- (the "factor") with **three or more levels/groups**
- Lesson 438 — When to Use One-Way ANOVA
- One continuous dependent variable
- (the outcome you're measuring)
- Lesson 438 — When to Use One-Way ANOVA
- One idea per paragraph
- Don't mix method explanation with result interpretation
- Lesson 1967 — Writing Clear and Concise Analysis Sections
- One numerical, one categorical
- Do salaries differ by department?
- Lesson 1181 — What is Bivariate Analysis?
- One sample
- of data
- Lesson 351 — When to Use a One-Sample t-TestLesson 370 — Differences as the Unit of Analysis
- One-hot encoding
- takes a different approach: it creates k dummy variables for k categories—one for *every* level, with no reference category left out.
- Lesson 638 — One-Hot Encoding Overview
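A small sketch of the idea, using only the standard library: each of the k levels gets its own 0/1 column, with no reference level dropped. The helper name `one_hot` and the city labels are illustrative assumptions.

```python
def one_hot(values):
    """Return (levels, rows) where each row has one 0/1 indicator per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


levels, encoded = one_hot(["NYC", "Boston", "LA", "Boston"])
# Each row contains exactly one 1, in the column of its own category.
```

Dummy coding with a reference category would drop one of these columns and produce k - 1 indicators instead.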
- One-sided
- When theory, cost, or practical concerns make only one direction meaningful
- Lesson 311 — One-Sided vs Two-Sided AlternativesLesson 345 — Directionality in Hypothesis TestingLesson 401 — Setting Up Hypotheses for Proportions
- One-sided (greater)
- "The parameter is *greater* than the null value" (>)
- Lesson 308 — Defining the Alternative Hypothesis (H₁ or Hₐ)Lesson 373 — Hypotheses for Paired t-Tests
- One-sided (less)
- "The parameter is *less* than the null value" (<)
- Lesson 308 — Defining the Alternative Hypothesis (H₁ or Hₐ)Lesson 373 — Hypotheses for Paired t-Tests
- One-sided (maximum)
- You're testing product dimensions where oversized items break downstream machinery, but undersized items are fine.
- Lesson 1393 — Two-Sided vs One-Sided Grubbs' Test
- One-sided (minimum)
- You're checking server response times where slow responses matter, but faster-than-expected times are welcomed.
- Lesson 1393 — Two-Sided vs One-Sided Grubbs' Test
- One-sided (one-tailed) tests
- These focus on a specific direction:
- Lesson 1393 — Two-Sided vs One-Sided Grubbs' Test
- one-sided test
- , you only care about one tail direction.
- Lesson 319 — Calculating P-Values from Test StatisticsLesson 325 — The Rejection Region
- One-size-fits-all
- A news app *should* have high DAU/MAU; tax software shouldn't
- Lesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU)
- One-step ahead forecasting
- predicts just the next immediate time period (e.
- Lesson 794 — Forecasting Concepts and Horizons
- One-Step-Ahead Forecasting Only
- Lesson 756 — Limitations of Moving Averages
- One-tailed
- H₁: p₁ > p₂ or H₁: p₁ < p₂ (testing for a specific direction)
- Lesson 406 — Two-Sample Proportion Test Setup
- one-tailed test
- (also called a one-sided test).
- Lesson 347 — One-Tailed Tests: Testing for a Specific DirectionLesson 348 — P-Value Calculation DifferencesLesson 349 — Power Advantages and Trade-offsLesson 354 — Setting Up Hypotheses for One-Sample t-TestLesson 433 — Conducting Fisher's Exact Test
- ongoing monitoring
- of processes over time.
- Lesson 1397 — Shewhart Control Chart BasicsLesson 1975 — When to Build a Dashboard
- Online
- Must be fast enough to keep pace with incoming data; can only look backward at history
- Lesson 1414 — Offline vs Online Change-Point Detection
- Online (real-time) change-point detection
- is like a security guard monitoring live camera feeds.
- Lesson 1414 — Offline vs Online Change-Point Detection
- Online communities
- (Reddit's r/datascience, Twitter/X, LinkedIn, Discord servers, Stack Overflow) provide daily touchpoints.
- Lesson 2144 — Networking and Community Engagement
- Online reviews
- Only people with strong opinions (very happy or very angry) typically write reviews
- Lesson 246 — Volunteer and Self-Selection Bias
- Online-only surveys
- Excluding people without internet access
- Lesson 249 — Coverage Error and Undercoverage
- only
- applies when events are independent.
- Lesson 87 — Multiplication Rule for Independent EventsLesson 866 — The AND OperatorLesson 928 — LEFT JOIN vs INNER JOIN: When to Use EachLesson 1821 — Hybrid Approaches and Modern Data Stacks
- Opacity
- De-emphasize less important points or show density
- Lesson 1310 — Point Maps and Scatter Plots on MapsLesson 1923 — Algorithmic Amplification of Harm
- Open conflicting files
- Look for conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`) showing both versions
- Lesson 2018 — Resolving Conflicts During Rebase
- Open Questions
- "Does the marketing team track promo codes consistently?
- Lesson 2100 — Documenting Assumptions and Open Questions
- Open-source contributions
- demonstrate your skills publicly while improving tools others use.
- Lesson 2144 — Networking and Community Engagement
- Opening balance method
- Use only customers at period start (simpler, more conservative)
- Lesson 1671 — Churn Rate Calculation Methods
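As a rough sketch, the opening balance method divides the customers lost during the period by the customers present at the period start. The function name and the figures are hypothetical.

```python
def opening_balance_churn(customers_at_start, customers_lost):
    """Churn rate using only the period-start customer count as the denominator."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical month: 1,000 customers at the start, 50 of them cancel.
churn = opening_balance_churn(1000, 50)
```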
- OpenLineage
- (open standard) embed lineage capture directly into your code.
- Lesson 1164 — Tools for Lineage Tracking
- OpenStreetMap
- is the most popular open-source tile provider, offering street-level detail perfect for urban data visualization.
- Lesson 1314 — Basemaps and Map Tiles
- Operational alignment
- Does the model match what your marketing team observes qualitatively?
- Lesson 1734 — Comparing and Validating Attribution Models
- Operational databases
- Lesson 1807 — Data Warehouse vs Database: Architecture and Purpose
- operators
- (templates for tasks like PythonOperator, BashOperator, or SQLOperator).
- Lesson 1833 — Introduction to Apache AirflowLesson 1835 — Airflow Operators and Tasks
- Opportunity
- Can you convert core users to power users?
- Lesson 1698 — Power User Curves and Engagement Distribution
- Opportunity cost
- What are you giving up by choosing one option?
- Lesson 152 — Decision Making Under UncertaintyLesson 1586 — Multi-Armed Bandit ConnectionsLesson 2118 — Cost-Benefit Analysis for Continued Work
- Optimal bandwidth selectors
- Methods like Imbens-Kalyanaraman or Calonico-Cattaneo-Titiunik that balance bias and variance
- Lesson 1463 — RDD Bandwidth Selection and Local Estimation
- Optimal intervention timing
- Reach out *before* the high-risk window
- Lesson 835 — Customer Churn Prediction with Survival Analysis
- Optimization
- Each channel has unique conversion funnels and drop-off patterns you can improve
- Lesson 1711 — What Are Acquisition Channels?Lesson 1716 — Channel Mix and Portfolio Thinking
- Optimization pressure
- Algorithms optimize for accuracy on biased data, which means they get *better* at replicating and intensifying discriminatory patterns
- Lesson 1882 — Algorithmic Amplification of Bias
- Optimization traps
- You improve the surrogate at the expense of the business metric (e.
- Lesson 1518 — The Relationship Between Surrogate and Business Metrics
- Optimize
- your query by rearranging or combining operations
- Lesson 1780 — Transformations vs Actions in Spark
- Optimize execution
- Tasks with no mutual dependencies can run in parallel
- Lesson 1841 — Upstream and Downstream Dependencies
- Optimize timing
- See how much time passes between interactions
- Lesson 1719 — The Customer Journey and Touchpoints
- Optimize within bounds
- for maximum revenue or profit, not individual metrics
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- Optimized execution
- Spark's Catalyst optimizer rewrites your queries for performance
- Lesson 1778 — DataFrames and Spark SQL Basics
- Optimizes computation
- by eliminating redundant operations
- Lesson 1790 — What is Dask and When to Use It
- Optimizing warranty periods
- By fitting a Cox model or Kaplan-Meier curve to historical failure data, you can estimate what percentage of products will fail within 1 year, 2 years, etc.
- Lesson 837 — Product Warranty and Failure Analysis
- Orchestration
- to schedule and monitor the entire pipeline
- Lesson 1821 — Hybrid Approaches and Modern Data StacksLesson 1832 — Orchestration vs Scheduling
- Orchestration layer
- Manages task scheduling, dependencies, retries, and monitoring
- Lesson 1822 — What is a Data Pipeline?
- order
- (q) tells you how many previous error terms to include.
- Lesson 777 — Identifying MA Order (q) Using ACFLesson 951 — Join Order and Performance
- ORDER BY
- Sorts the final result
- Lesson 896 — GROUP BY Execution OrderLesson 912 — Fundamental Difference: Filter Timing
- Order conditions by likelihood
- Place the most frequently matched conditions first to minimize unnecessary evaluations.
- Lesson 1037 — CASE Best Practices and Performance
- order matters
- .
- Lesson 789 — Overfitting and Cross-Validation for Time SeriesLesson 1355 — Layer Order and Plot Composition
- Order reversal
- The reciprocal transformation **reverses the order** of your values (for values that share the same sign).
- Lesson 216 — Reciprocal and Inverse Transformations
- Order your p-values
- from smallest to largest: p₁ ≤ p₂ ≤ .
- Lesson 1504 — Holm-Bonferroni MethodLesson 1506 — Benjamini-Hochberg Procedure
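Both lessons linked here start by sorting the p-values and then compare each one to a rank-dependent threshold. The sketch below shows the Holm-Bonferroni step-down rule and the Benjamini-Hochberg step-up rule side by side; the function names and example p-values are assumptions for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down: compare the i-th smallest p-value to alpha / (m - i), stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank = 0 for the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up: find the largest rank k with p_(k) <= (k / m) * alpha, reject ranks 1..k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            largest = rank
    reject = [False] * m
    for i in order[:largest]:
        reject[i] = True
    return reject


# Hypothetical p-values from four tests.
p_values = [0.01, 0.04, 0.03, 0.005]
holm = holm_bonferroni(p_values)
bh = benjamini_hochberg(p_values)
```

On this input Holm rejects only the two smallest p-values while Benjamini-Hochberg rejects all four, reflecting that the first controls the family-wise error rate and the second controls the false discovery rate.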
- Ordering all event times
- from earliest to latest
- Lesson 809 — Introduction to the Kaplan-Meier Estimator
- Orders
- is a foreign key that must match a `customer_id` in **Customers**.
- Lesson 1051 — Introduction to Foreign Keys
- Ordinal
- variables have categories with a natural, meaningful order or ranking.
- Lesson 17 — Categorical Variables: Nominal and OrdinalLesson 392 — Wilcoxon Signed-Rank Test
- Ordinal data
- Ranks matter, but exact distances don't (e.
- Lesson 398 — Choosing Between Parametric and Non-Parametric TestsLesson 487 — When to Use Spearman vs PearsonLesson 1238 — Matching Encoding to Data Type
- Organizational conflicts
- Your employer wants data to support a predetermined decision.
- Lesson 35 — Conflicts of Interest and Independence
- Organizational pressure
- Your employer wants a particular conclusion to justify a strategy they've already committed to publicly.
- Lesson 1930 — Managing Conflicts of Interest
- Organize your 2×2 table
- and identify b and c (the off-diagonal counts)
- Lesson 436 — Conducting McNemar's Test
- Orientation matters
- Rotating the view can completely change the story your data tells—a sign the visualization isn't robust
- Lesson 1329 — Effective Use and Pitfalls of 3D Visualizations
- Origin
- Database name, URL, file path, API endpoint, or vendor name
- Lesson 1161 — Documenting Data Sources
- Original data
- (the combined signal)
- Lesson 711 — Visualizing Components with Decomposition PlotsLesson 747 — Interpreting Decomposition Plots
- ORM (Object-Relational Mapper)
- is a tool that lets you interact with database tables using Python objects and classes instead of writing raw SQL queries.
- Lesson 1117 — What is an ORM and Why Use It?
- Ornamental illustrations
- (like pictures of coins on financial charts)
- Lesson 1963 — Removing Chartjunk
- Ornate borders and frames
- Decorative elements around the chart
- Lesson 1246 — Visual Clutter and Chartjunk
- Orthographic projection
- All objects maintain their size regardless of distance from the camera.
- Lesson 1326 — Viewing Angles and Projection Types
- Other cloud platforms
- like AWS, Google Cloud, or Azure offer the most flexibility and scalability but demand more technical knowledge around servers, containers, and networking.
- Lesson 1338 — Deployment and Sharing Dashboards
- Other Western Electric Rules
- Lesson 1401 — Detecting Out-of-Control Signals
- out of control
- (something changed).
- Lesson 1396 — Introduction to Control ChartsLesson 1400 — Control Limits vs Specification Limits
- Outcomes are mutually exclusive
- You can't have both success and failure simultaneously
- Lesson 123 — Bernoulli Trial Definition and Properties
- Outdated lists
- Phone directories missing new residents or unlisted numbers
- Lesson 249 — Coverage Error and Undercoverage
- Outer query alias
- (`outer`): identifies columns from the main query
- Lesson 976 — Basic Correlated Subquery Syntax
- outlier
- in regression is an observation with an unusual **Y value** given its X value—it doesn't follow the pattern of the other data points.
- Lesson 587 — Identifying Outliers in Regression ContextLesson 1389 — What is Grubbs' Test?
- Outlier Detection
- If a data point has a z-score beyond ±3, it's unusual enough to investigate.
- Lesson 201 — Z-Score Applications and LimitationsLesson 1157 — Statistical Anomaly Detection in QA
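A minimal sketch of the ±3 rule, using the sample mean and standard deviation from the standard library. The function name, threshold default, and data are illustrative assumptions.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs((v - mu) / sigma) > threshold]


# Hypothetical readings: nineteen values near 10 and one extreme value.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 50]
flagged = zscore_outliers(readings)
```

Because the extreme value inflates both the mean and the standard deviation, this check can miss outliers in small or heavily contaminated samples, which is the non-robustness the glossary warns about elsewhere.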
- Outliers
- Are there unusual points far from the main pattern?
- Lesson 480 — Scatterplots and Visual AssessmentLesson 487 — When to Use Spearman vs PearsonLesson 537 — When R-Squared is Not EnoughLesson 556 — What Are Residuals and Why Plot Them?Lesson 745 — STL Decomposition (Seasonal-Trend Loess)Lesson 1175 — Histograms for Distribution ShapeLesson 1176 — Box Plots for Spread and OutliersLesson 1183 — Scatter Plots for Two Numeric Variables (+5 more)
- Outliers and influential points
- Which observations have unusually large residuals that might distort your model?
- Lesson 544 — The Role of Residuals in Diagnostics
- Outliers present
- The median is *robust*: extreme values have little effect on it.
- Lesson 42 — Comparing Mean, Median, and Mode
- Output
- A chi-squared-like test statistic with degrees of freedom = (k - 1), where k = number of conditions
- Lesson 474 — Friedman Test: Non-Parametric Repeated Measures ANOVALesson 1580 — Bayesian vs Frequentist A/B Testing
- Outputs are generated
- Everything in `reports/` and `models/` should be reproducible from code—don't edit these files manually.
- Lesson 2069 — Project Directory Structure
- Outside the bounds
- The autocorrelation is **statistically significant**—there's likely a real pattern at that lag
- Lesson 723 — Significance Bounds in ACF Plots
- Over-controlling
- Adding every available variable to your model without checking the DAG.
- Lesson 1476 — Common DAG Patterns and Pitfalls
- Over-crediting vanity metrics
- that didn't drive real outcomes
- Lesson 1637 — What is Metric Attribution?
- Over-differencing
- can introduce unnecessary complexity and make patterns harder to model.
- Lesson 736 — Higher-Order Differencing
- Over-investing
- in channels that capture existing demand rather than create it
- Lesson 1717 — Incrementality and True Channel Impact
- Over-Optimizing Proxies
- Lesson 1603 — Common Pitfalls in Indicator Selection
- Over-smoothing
- Mean income by state masks extreme inequality within states
- Lesson 1245 — Misleading Aggregations and Binning
- Overall Equipment Effectiveness (OEE)
- is the gold standard for measuring production efficiency.
- Lesson 1636 — Manufacturing Metrics: OEE, Yield, and Cycle Time
- Overall model F-test
- The global significance doesn't change
- Lesson 647 — Impact on Model Results and Reporting
- Overdispersion
- occurs when the actual variance in your data significantly exceeds the mean, violating the Poisson assumption that the variance equals the mean.
- Lesson 693 — Overdispersion in Count Data
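A quick first check is the variance-to-mean ratio of the raw counts, which sits near 1 for Poisson-like data. This is only a rough sketch with made-up counts; it ignores covariates, so a fitted model's dispersion statistic is the more careful diagnostic.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; values well above 1 hint at overdispersion."""
    return variance(counts) / mean(counts)


# Hypothetical counts with many zeros and a few large values.
counts = [0, 1, 0, 2, 9, 0, 1, 12, 0, 3]
ratio = dispersion_ratio(counts)
```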
- Overfit
- to your specific dataset's noise
- Lesson 632 — Parsimony and Occam's RazorLesson 2124 — Insufficient or Low-Quality Data
- overfitting
- .
- Lesson 14 — Model Evaluation and ValidationLesson 785 — Information Criteria: AIC and BICLesson 1938 — Using Metaphors and Analogies
- Overfitting risk
- increases with unnecessary predictors
- Lesson 1197 — Identifying Variable Importance and Redundancy
- Overhead allocation
- portion of office space, utilities for marketing/sales teams
- Lesson 1753 — Customer Acquisition Cost (CAC): Components and Calculation
- Overlap
- Too many bubbles or extreme size differences can create clutter—consider transparency or interactive tooltips
- Lesson 1229 — Bubble Charts for Three Variables
P
- p̂
- (p-hat) is your sample proportion
- Lesson 278 — Confidence Interval Formula for One ProportionLesson 402 — Calculating the Test Statistic for Proportions
- p < 0.05
- (common threshold): Statistically significant evidence that the survival curves differ.
- Lesson 822 — Interpreting Log-Rank Test ResultsLesson 1692 — Statistical Significance and Iteration
- p = 0.5
- Perfectly symmetric (like a fair coin).
- Lesson 128 — Binomial Parameters n and pLesson 293 — Sample Size for Estimating a Proportion
- p-hacking
- manipulating the analysis until they get p < 0.
- Lesson 329 — Choosing α Before AnalysisLesson 1485 — Documentation and Pre-RegistrationLesson 1508 — Pre-Registration and Correction StrategyLesson 1926 — The Honest Broker Role
- p-value
- Lesson 205 — Shapiro-Wilk TestLesson 206 — Kolmogorov-Smirnov TestLesson 207 — Anderson-Darling TestLesson 208 — Jarque-Bera TestLesson 317 — Sampling Distribution of the Test StatisticLesson 318 — What is a P-Value?Lesson 319 — Calculating P-Values from Test StatisticsLesson 348 — P-Value Calculation Differences (+18 more)
- P-value < 0.05
- (or your chosen α): Reject the null → series is **stationary**
- Lesson 716 — Augmented Dickey-Fuller Test
- P-value ≥ 0.05
- Fail to reject → a unit root cannot be ruled out, so treat the series as **non-stationary**
- Lesson 716 — Augmented Dickey-Fuller Test
- P-Value Approach
- Lesson 327 — Decision Rules: Reject or Fail to Reject
- p-values
- may be unreliable (too high or too low)
- Lesson 202 — Why Test for Normality?Lesson 345 — Directionality in Hypothesis TestingLesson 355 — Finding Critical Values and P-ValuesLesson 1938 — Using Metaphors and Analogies
- P(A ∩ B)
- is the probability that *both* A and B occur (the intersection)
- Lesson 92 — Definition and Notation of Conditional Probability
- P(A)
- = your **prior belief** (what you thought before seeing evidence)
- Lesson 108 — Updating Beliefs with New Evidence
- P(A) = Σ P(A|Bᵢ)×P(Bᵢ)
- Lesson 97 — Law of Total Probability
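The law of total probability sums P(A|Bᵢ)×P(Bᵢ) over a partition B₁, …, Bₙ. A short sketch, with a hypothetical three-supplier defect example:

```python
def total_probability(priors, likelihoods):
    """Return P(A) = sum of P(A|B_i) * P(B_i) over a partition B_1..B_n."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("P(B_i) must sum to 1 for the B_i to form a partition")
    return sum(l * p for l, p in zip(likelihoods, priors))


# Hypothetical: share of parts from each supplier, and each supplier's defect rate.
p_b = [0.5, 0.3, 0.2]            # P(B_i)
p_a_given_b = [0.01, 0.02, 0.05]  # P(A | B_i)
p_a = total_probability(p_b, p_a_given_b)
```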
- P(A|B)
- , read as "the probability of A *given* B.
- Lesson 92 — Definition and Notation of Conditional ProbabilityLesson 108 — Updating Beliefs with New Evidence
- P(B)
- is the probability that B occurs (and must be greater than 0)
- Lesson 92 — Definition and Notation of Conditional ProbabilityLesson 108 — Updating Beliefs with New Evidence
- P(data | θ)
- Lesson 1535 — The Likelihood Function
- P(Evidence | Innocent)
- probability of seeing this evidence if innocent
- Lesson 112 — Legal Evidence and Jury Reasoning
- P(positive test | event)
- Lesson 110 — Base Rate Fallacy
- P(X < a)
- Probability that X is less than some value *a* — the area to the *left* of *a*
- Lesson 173 — Calculating Probabilities with the Normal Distribution
- P(X = k)
- The probability that random variable X equals exactly k successes
- Lesson 127 — Binomial Distribution PMF
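For reference, the binomial PMF is P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ. A minimal sketch using `math.comb`; the function name and the example values are illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical: exactly 3 heads in 10 fair coin flips.
prob = binomial_pmf(3, 10, 0.5)
```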
- P(X > b)
- Probability that X is greater than *b* — the area to the *right* of *b*
- Lesson 173 — Calculating Probabilities with the Normal Distribution
- P(X > k)
- "More than k events" (complement of cumulative)
- Lesson 143 — Cumulative Poisson Probabilities
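The complement rule gives P(X > k) = 1 − P(X ≤ k). A small sketch for the Poisson case, with function names chosen for illustration:

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return exp(-lam) * lam**k / factorial(k)


def poisson_more_than(k, lam):
    """P(X > k), computed as one minus the cumulative probability P(X <= k)."""
    return 1 - sum(poisson_pmf(i, lam) for i in range(k + 1))


# Hypothetical: chance of more than 0 events when the average rate is 2.
p_any = poisson_more_than(0, 2)
```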
- p̂₁
- and **p̂₂** are your two sample proportions
- Lesson 287 — Confidence Intervals for the Difference Between Two ProportionsLesson 409 — Z-Test Statistic for Two Proportions
- p̂₂
- are your two sample proportions
- Lesson 287 — Confidence Intervals for the Difference Between Two ProportionsLesson 409 — Z-Test Statistic for Two Proportions
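One common way to use p̂₁ and p̂₂ is the pooled two-proportion z-statistic, sketched below. The function name and the counts are hypothetical, and the pooled standard error shown is the usual choice for testing equality of proportions (a confidence interval for the difference typically uses the unpooled version instead).

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z-statistic for H0: p1 == p2, given successes x and sample sizes n."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se


# Hypothetical: 60 of 200 conversions in group 1 versus 45 of 200 in group 2.
z = two_proportion_z(60, 200, 45, 200)
```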
- PACF
- (Partial Autocorrelation Function), however, measures **only the direct relationship** at lag k, controlling for all intermediate lags.
- Lesson 728 — PACF vs ACF: Key DifferencesLesson 733 — Using ACF and PACF TogetherLesson 798 — SARIMA Model Selection
- Package versions
- (exact versions of every library you import)
- Lesson 2038 — What is Environment Management and Why It Matters
- Page views
- (without engagement depth or conversion)
- Lesson 1612 — What Are Vanity Metrics?Lesson 1616 — Metrics Divorced from Revenue
- Paid
- Any channel where you pay for placement (Google Ads, Facebook Ads, display networks, sponsored content)
- Lesson 1712 — Common Channel Categories
- Paid advertising
- (Google Ads, Facebook, display networks)
- Lesson 1711 — What Are Acquisition Channels?
- Paid CAC
- isolates only the costs and customers from *paid advertising channels*:
- Lesson 1754 — Blended CAC vs Paid CAC
- Paid Search
- has a 4-month payback, while **Referral** pays back in 6 months.
- Lesson 1758 — Cohort-Based Payback Analysis
- Paired or Repeated Measurements
- If you measure the same subjects twice (before/after treatment), those measurements aren't independent—they're linked to the same person.
- Lesson 381 — Independence Assumption and Its Violations
- paired t-test
- , which analyzes the *differences* within each pair, effectively reducing the problem to a one-sample test on those differences
- Lesson 360 — Independent vs. Dependent SamplesLesson 369 — When to Use a Paired t-TestLesson 375 — Paired t-Test vs Two-Sample t-Test
- Paired t-tests
- Remember, it's the *differences* that need to be normally distributed, not the original paired observations.
- Lesson 376 — The Assumption of Normality in t-Tests
- Pairing comparable units
- Each treated unit gets matched with one or more control units based on observed characteristics (covariates)
- Lesson 1445 — The Matching Framework
- Pairing related items
- Identify rows that share common attributes but differ in others
- Lesson 947 — Self-Joins for Comparisons Within a Table
- Pairwise comparisons
- which groups are being compared
- Lesson 462 — Interpreting and Reporting Post-Hoc ResultsLesson 469 — Follow-Up Tests for Two-Way ANOVA
- Pairwise independence
- means every *pair* of events is independent.
- Lesson 103 — Mutual Independence vs Pairwise Independence
- Pan and zoom
- capabilities for exploring dense datasets
- Lesson 1300 — Creating Basic Interactive Charts with Plotly Express
- Paper submissions
- `paper-submission-neurips2024`
- Lesson 2037 — Tagging Releases and Experiment Snapshots
- parallel trends
- without treatment, both groups would have changed similarly over time—a critical assumption you'll need to verify in practice.
- Lesson 1452 — The Difference-in-Differences SetupLesson 1453 — The Parallel Trends AssumptionLesson 1746 — Geo-Lift Experiments
- Parameter uncertainty
- The spread of the distribution shows how confident you are.
- Lesson 1547 — Interpreting Posterior Distributions
- Parameters
- are numerical characteristics that describe a population.
- Lesson 228 — Defining Populations and ParametersLesson 229 — Defining Samples and Statistics
- Parametric part
- The model assumes covariates affect hazard through a mathematical formula with parameters (coefficients) you estimate.
- Lesson 825 — What is the Cox Proportional Hazards Model?
- Parental/guardian consent
- for children, plus age-appropriate explanations
- Lesson 1918 — Special Populations and Vulnerable Groups
- Pareto
- describes heavy-tailed phenomena where extreme values are common—wealth distributions, file sizes on servers, or social network connections.
- Lesson 193 — Choosing Between Distributions in Practice
- Pareto distribution
- , which you learned about in the previous lesson.
- Lesson 191 — Pareto Principle and the 80/20 Rule
- Parquet
- is a compressed, column-oriented format designed for efficiency.
- Lesson 1129 — Parquet and Feather: Columnar FormatsLesson 1779 — Reading and Writing Data in Spark
- Parquet and Feather
- are columnar formats optimized for analytics.
- Lesson 1133 — Performance Considerations Across Formats
- Partial autocorrelation
- (PACF) solves this by measuring the *direct* correlation between observations separated by k time steps, *after removing* the influence of all the intermediate lags.
- Lesson 727 — What is Partial Autocorrelation?
- Partial correlation
- measures the relationship between two variables *after removing the influence of one or more other variables*.
- Lesson 506 — Introduction to Partial CorrelationLesson 508 — Interpreting Partial CorrelationsLesson 509 — Confounding Variables and ControlLesson 513 — Applications: Feature Selection and Multicollinearity
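For a single control variable z, the first-order partial correlation can be computed from the three pairwise correlations: r_xy·z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)). The sketch below uses made-up data in which x and y are both driven by z; all names are illustrative.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear influence of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Hypothetical data: x and y each equal z plus a small, unrelated wobble.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5]
y = [1.5, 2.5, 2.5, 3.5, 5.5, 6.5, 6.5, 7.5]
r_raw = pearson(x, y)
r_partial = partial_correlation(x, y, z)
```

Here the raw correlation between x and y is strong, but it largely disappears once z is controlled for.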
- Partial duplicate detection
- Identify rows that match on key fields (like name and birthdate) but differ elsewhere—these might represent the same entity entered multiple ways.
- Lesson 1154 — Uniqueness and Duplication Checks
- Partial F-Test
- (which you learned in lesson 623) to formally test whether the extra predictors significantly improve the model
- Lesson 626 — Nested vs Non-Nested Models
- Partial failure recovery
- uses **checkpoints** and **transaction boundaries** to save progress at strategic points.
- Lesson 1853 — Partial Failure Recovery
- Partial Failure Risk
- What if update #1 succeeds but update #2 fails?
- Lesson 1075 — Handling Data Consistency in Denormalized Schemas
- Partial reads and chunking
- Lesson 1141 — Recovering from Corrupted or Partially Broken Data
- partition
- of a sample space is a special collection of events that satisfies two critical properties simultaneously:
- Lesson 83 — Partitions of the Sample SpaceLesson 97 — Law of Total ProbabilityLesson 1782 — Spark Performance Basics: Partitions and Caching
- Partitioning
- and **clustering** tell the warehouse how to physically organize your data so queries can skip entire chunks of irrelevant data.
- Lesson 1812 — Partitioning and Clustering Strategies
- partitions
- and processes them in parallel.
- Lesson 1791 — Dask DataFrame BasicsLesson 1794 — Working with Partitions
- Past interactions
- Previous purchases, feature usage patterns
- Lesson 1689 — Multivariate Testing and Personalization
- Patient satisfaction scores
- capture experience quality through surveys, such as Net Promoter Score (NPS) or HCAHPS scores, serving as leading indicators for loyalty and reputation.
- Lesson 1633 — Healthcare Metrics: Patient Outcomes and Operational Efficiency
- Pattern
- Points curve **above** the line at the upper-right end and **below** the line at the lower-left end, like a gentle S-curve.
- Lesson 567 — Common Q-Q Plot Patterns: Heavy Tails and Light TailsLesson 722 — ACF Plots and InterpretationLesson 726 — Using ACF for Model Identification
- Pattern + Color
- In bar charts or area plots, add hatching, dots, or line patterns alongside color fills
- Lesson 1251 — Avoiding Reliance on Color Alone
- Pattern over-generalization
- Models find and exploit subtle correlations in biased data that humans might overlook (e.
- Lesson 1882 — Algorithmic Amplification of Bias
- patterns
- univariate analysis misses
- Lesson 1181 — What is Bivariate Analysis?Lesson 1191 — Scatter Plot Matrices and PairplotsLesson 1222 — Scatter Plots for RelationshipsLesson 1867 — Data Profiling and MonitoringLesson 2087 — Stage 3: Exploratory Data Analysis
- Patterns in plots
- non-random patterns suggest model misspecification (e.
- Lesson 701 — Deviance Residuals
- Payment Submitted
- Lesson 1679 — Defining Funnel Steps and Events
- PCA
- when you want speed and interpretability, and you care about global structure.
- Lesson 1196 — Dimensionality Reduction for Visualization
- PDF acts like weights
- , telling you which regions contribute more to the average.
- Lesson 159 — Expected Value and Variance for Continuous Variables
- Pearson
- detects *linear* relationships: as X increases by a constant amount, Y changes by a constant amount
- Lesson 487 — When to Use Spearman vs PearsonLesson 1184 — Correlation Coefficients in Bivariate Analysis
- Pearson correlation
- is your go-to for linear relationships between normally distributed variables.
- Lesson 1184 — Correlation Coefficients in Bivariate Analysis
- Pearson's r
- measures the strength and direction of the linear relationship between two variables
- Lesson 534 — R-Squared vs Correlation Squared
- Peer groups
- "Among similar-sized companies, we rank in the top 10%"
- Lesson 1962 — Contextualizing Numbers
- Pennies-per-terabyte storage
- (Amazon S3, Google Cloud Storage): Storing raw data became so cheap that the cost of keeping everything in its original form was negligible
- Lesson 1818 — The Rise of ELT: Cloud Storage and Compute
- Percentage contribution
- `value / SUM(value) OVER (PARTITION BY category)`
- Lesson 1019 — Comparing Values to Window Aggregates
- Percentage of total
- `sale_amount / regional_total * 100`
- Lesson 1019 — Comparing Values to Window Aggregates
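
A minimal sketch of the window expression above, run through Python's built-in `sqlite3` module (this assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The `sales` table and its rows are invented for illustration; `value` is stored as `REAL` so the division is not truncated to an integer.

```python
import sqlite3

# Hypothetical sales table, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, item TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("toys", "kite", 30.0), ("toys", "ball", 10.0), ("books", "atlas", 50.0)],
)

# Each row's share of its own category total.
# Unlike GROUP BY, the window keeps every original row.
rows = conn.execute(
    """
    SELECT category, item,
           value / SUM(value) OVER (PARTITION BY category) AS share
    FROM sales
    ORDER BY category, item
    """
).fetchall()
conn.close()

for category, item, share in rows:
    print(f"{category:6} {item:6} {share:.2f}")
```

Multiplying `share` by 100 gives the percentage form.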
- percentile
- tells you what percentage of the data falls *below* a specific value.
- Lesson 56 — Understanding Percentiles and Their InterpretationLesson 199 — Finding Percentiles with Z-Scores
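
As a rough illustration of the "percentage below" reading, the sketch below computes a percentile rank by counting. The helper name `percentile_rank` and the exam scores are made up, and ties are handled with a strict "less than"; other conventions exist.

```python
def percentile_rank(data, value):
    """Percentage of observations strictly below `value`."""
    below = sum(1 for x in data if x < value)
    return 100.0 * below / len(data)


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # made-up exam scores
rank_of_85 = percentile_rank(scores, 85)
print(rank_of_85)  # six of the ten scores are below 85
```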
- percentiles
- (100 groups).
- Lesson 57 — Quantiles: Quartiles, Deciles, and BeyondLesson 62 — Percentiles vs Z-Scores: Complementary Position MeasuresLesson 1173 — Numerical Variable Summary Statistics
- Perfect
- multicollinearity means two or more predictors are perfectly linearly related—one can be expressed as an exact linear combination of the others.
- Lesson 551 — No Perfect Multicollinearity in Simple Regression
- Performance
- Each JOIN operation has a computational cost.
- Lesson 1070 — When to Stop NormalizingLesson 1092 — Connection Pooling BasicsLesson 1636 — Manufacturing Metrics: OEE, Yield, and Cycle Time
- Performance bottlenecks
- When specific queries consistently time out or slow down the user experience despite indexing and optimization.
- Lesson 1071 — When to Denormalize: Performance Trade-offs
- Performance monitoring
- Query optimization as data volume grows
- Lesson 1979 — Maintenance and Sustainability Considerations
- Performance needs
- (C-based drivers like psycopg2 are faster than pure-Python alternatives)
- Lesson 1087 — Database Drivers and Connection Libraries
- Performance thresholds
- "The model must achieve at least 85% accuracy" or "reduce processing time by 30%"
- Lesson 2117 — Defining 'Good Enough' with Stakeholders
- Performance-driven iteration
- Your model doesn't meet accuracy thresholds, prompting cycles through feature engineering, data collection, or even problem rescoping.
- Lesson 2092 — Iteration and Feedback Loops in Practice
- Period selection
- Choose periods that match your business cycle.
- Lesson 1671 — Churn Rate Calculation Methods
- Permanent
- Fail fast, log the issue, alert immediately, possibly route to a dead-letter queue for investigation
- Lesson 1849 — Transient vs Permanent Failures
- Permutation methods
- are useful when testing whether two groups differ.
- Lesson 291 — Non-Parametric Alternatives for Difference Intervals
- Permutation or bootstrap approaches
- Distribution-free methods that don't assume normality
- Lesson 470 — When Parametric ANOVA Assumptions Fail
- Permutation tests
- offer a clever alternative: they use resampling to build a reference distribution from your own data.
- Lesson 502 — Permutation Tests for Correlation
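
One way to picture the resampling idea for a correlation: shuffle one variable many times, recompute the correlation each time, and see how often the shuffled value is at least as extreme as the observed one. This is a sketch under assumptions, not a reference implementation. The helper names, the data, and the choice of 5,000 shuffles are illustrative.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_perm=5000, seed=0):
    """Two-sided p-value: share of shuffles with |r| >= observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if abs(pearson_r(xs, shuffled)) >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_perm + 1)


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.3, 8.8, 10.2, 10.9]
p_value = permutation_p_value(x, y)
print(round(pearson_r(x, y), 3), p_value)
```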
- Person-time
- Modeling disease incidence rates with different follow-up durations
- Lesson 692 — Offset Terms for Exposure
- Personal conflicts
- A friend asks you to help prove their startup idea will work.
- Lesson 35 — Conflicts of Interest and Independence
- Personal relationships
- You're analyzing data about a friend's project or a competitor of someone close to you.
- Lesson 1930 — Managing Conflicts of Interest
- Personalization
- High-value segments might receive premium support, exclusive offers, or early access to features.
- Lesson 1669 — LTV Segmentation and Targeting
- Personalize experiences
- Tailor messaging based on where users are in their journey
- Lesson 1719 — The Customer Journey and Touchpoints
- Perspective projection
- (default): Objects farther away appear smaller, mimicking how human eyes see the world.
- Lesson 1326 — Viewing Angles and Projection Types
- Peto test
- (Peto-Peto modification) uses a weighting scheme between log-rank and Wilcoxon.
- Lesson 823 — Log-Rank Test vs Other Tests
- Phantom reads
- A query returns different rows on repeated execution because another transaction inserted or deleted data
- Lesson 1116 — Transaction Isolation and Concurrency
- Phi
- is a special case used exclusively for 2×2 contingency tables.
- Lesson 429 — Effect Size: Cramér's V and Phi
- Physical constraints
- (negative ages, impossible dates)
- Lesson 75 — Domain-Specific Outlier RulesLesson 1211 — Domain Validation and Sanity Checks
- Pick a population distribution
- (any shape—uniform, exponential, skewed, bimodal, doesn't matter)
- Lesson 222 — Visualizing the CLT with Simulations
- Pie charts
- Displaying parts of a whole (market share, budget allocation) — use sparingly and only with 2-5 slices
- Lesson 1959 — Choosing Familiar Chart Types
- Pilot Studies
- Lesson 297 — Handling Unknown Population Parameters
- Pipeline delays
- Data usually arrives at 6 AM but starts arriving at 9 AM.
- Lesson 2136 — Monitoring Gaps and Silent Failures
- Pipeline validation
- Detect unexpected changes in upstream data sources
- Lesson 1871 — Why Version Control for Data?
- Pipeline/job identifier
- Lesson 1857 — Logging Best Practices
- Pipenv
- (and similarly, **Poetry**) are modern dependency managers that treat your project like a publishable package from day one.
- Lesson 2051 — Poetry and Modern Python Tools
- Pitfall
- Conditioning on M blocks the path from A to Y, hiding the causal effect you want to measure.
- Lesson 1476 — Common DAG Patterns and Pitfalls
- Plan changes
- Modifying a task affects everything downstream
- Lesson 1841 — Upstream and Downstream Dependencies
- Planned vs. exploratory comparisons
- Pre-specified contrasts vs.
- Lesson 824 — Multiple Group Comparisons
- Platform differences
- compound the problem: a package compiled for Windows may behave differently from its macOS version, and the ARM architecture on newer Macs requires different binaries than Intel chips.
- Lesson 2048 — The Dependency Hell Problem
- Platform-friendly
- Ad networks often require the slot to be filled
- Lesson 1747 — Ghost Ads and PSA Tests
- Platform-specific optimizations
- Lesson 1691 — Mobile vs Desktop Conversion Analysis
- Platykurtic
- (kurtosis < 3 or excess kurtosis < 0): Light tails and a flatter peak.
- Lesson 66 — Kurtosis: Definition and Interpretation
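
To make the "kurtosis < 3" threshold concrete, here is a small sketch using the population (moment-based) definition of kurtosis. The evenly spaced data stands in for a flat, uniform-like distribution and is chosen purely for illustration. Sample-corrected estimators in statistics libraries may give slightly different numbers.

```python
def kurtosis(data):
    """Population kurtosis: fourth central moment over squared variance."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2


flat = list(range(1, 101))  # evenly spread values, no heavy tails
k = kurtosis(flat)
excess = k - 3  # negative excess kurtosis means platykurtic
print(round(k, 3), round(excess, 3))
```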
- Plausibility
- Does a reasonable mechanism explain *how* X could cause Y, given current scientific knowledge?
- Lesson 498 — Bradford Hill Criteria for Causation
- Plot your data
- boxplots and histograms for each group
- Lesson 290 — Assumptions and Diagnostics for Difference Intervals
- Plotly
- transform your geographic data into engaging web visualizations.
- Lesson 1313 — Interactive Maps with Folium and PlotlyLesson 1321 — Interactive Network Graphs with Plotly and PyvisLesson 1371 — Default Aesthetics and Design Choices
- Plotly Express
- by specifying an `animation_frame` parameter pointing to your time or category column.
- Lesson 1306 — Animation and Time-Based Transitions
- Poetry
- (and similarly, **Pipenv**) are modern dependency managers that treat your project like a publishable package from day one.
- Lesson 2051 — Poetry and Modern Python Tools
- point estimate
- of the difference is simply p̂₁ - p̂₂.
- Lesson 280 — Confidence Intervals for Difference in ProportionsLesson 412 — Confidence Interval for DifferenceLesson 607 — Confidence Intervals for Coefficients
- Pointers
- to storage locations rather than storing full copies in version control
- Lesson 1871 — Why Version Control for Data?
- Pointers, not files
- For large models and datasets, commit references (like file hashes or DVC tracking files) rather than the actual binaries
- Lesson 2034 — Committing Data Artifacts and Model Outputs
- Points
- (`geom_point`) for scatter plots showing individual observations
- Lesson 1342 — Geometric Objects (geoms)
- Points along the line
- Perfect or near-perfect normality.
- Lesson 566 — Reading Q-Q Plots: Interpreting Points Along the Reference Line
- Points Beyond Control Limits
- Lesson 1401 — Detecting Out-of-Control Signals
- Points on the diagonal
- Your data matches the normal distribution well
- Lesson 204 — Q-Q Plots: Theory and Interpretation
- Poisson
- tracks "k events occurring at average rate λ"
- Lesson 142 — Poisson as Limit of BinomialLesson 154 — Real-World Use Cases: Customer Behavior and EventsLesson 664 — What is the Exponential Family of Distributions?Lesson 676 — Canonical vs Non-Canonical Links
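
The "k events at average rate λ" idea corresponds to the Poisson probability mass function, P(X = k) = λ^k e^(−λ) / k!. The sketch below evaluates it directly; the rate of 3 events per interval is a hypothetical value.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


rate = 3.0  # hypothetical average number of events per interval
p_exactly_two = poisson_pmf(2, rate)

# Probabilities over all k should sum to 1; 50 terms is ample for lam = 3.
p_total = sum(poisson_pmf(k, rate) for k in range(50))
print(round(p_exactly_two, 4), round(p_total, 6))
```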
- Poisson distribution
- when:
- Lesson 153 — Real-World Use Cases: Quality Control and DefectsLesson 154 — Real-World Use Cases: Customer Behavior and EventsLesson 669 — The Dispersion Parameter φ
- Poisson process
- .
- Lesson 139 — The Poisson Process and Rate ParameterLesson 140 — Poisson Probability Mass Function
- Poisson-distributed variables
- (events occurring at a constant rate)
- Lesson 213 — Square Root and Cube Root Transformations
- Polar coordinates
- transform bar charts into pie charts or create radial plots
- Lesson 1344 — Scales and Coordinate Systems
- Polish the Presentation
- Lesson 1217 — The Transition from Explore to Explain
- Polynomial features
- let you capture these curves *within* a linear regression framework by adding powers of your existing variables.
- Lesson 657 — What Are Polynomial Features?Lesson 662 — Polynomial Features vs Splines
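
A minimal sketch of "adding powers of your existing variables": each input value x is expanded into the columns x, x², ..., x^degree, which a linear regression can then treat as ordinary predictors. The function name `polynomial_features` is illustrative and not tied to any particular library.

```python
def polynomial_features(x, degree):
    """Expand one value into [x, x**2, ..., x**degree]."""
    return [x ** p for p in range(1, degree + 1)]


# Each row becomes one observation's predictors in a cubic model.
design_rows = [polynomial_features(x, 3) for x in [1.0, 2.0, 3.0]]
for row in design_rows:
    print(row)
```

The model stays linear in its coefficients; only the inputs are curved.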
- Pool all observations
- and randomly reassign them to groups
- Lesson 395 — Permutation Tests for Means and Beyond
- Pooled variance
- assumes both groups have the same underlying population variance.
- Lesson 285 — Pooled vs Unpooled Variance Approaches
- pooled variance t-test
- is specifically designed for situations where you can reasonably assume both populations have the **same variance** (even if their means differ).
- Lesson 361 — Pooled Variance t-TestLesson 362 — Welch's t-Test for Unequal VariancesLesson 379 — The Assumption of Equal Variances (Homoscedasticity)
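
Under the equal-variance assumption, the two sample variances are combined into one pooled estimate, s_p² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2), which then feeds the t statistic. A sketch with invented data:

```python
import statistics


def pooled_variance(a, b):
    """Degrees-of-freedom-weighted average of the two sample variances."""
    n1, n2 = len(a), len(b)
    s1, s2 = statistics.variance(a), statistics.variance(b)
    return ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal population variances."""
    n1, n2 = len(a), len(b)
    se = (pooled_variance(a, b) * (1 / n1 + 1 / n2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


group_a = [12, 14, 16, 18, 20]  # made-up measurements
group_b = [10, 12, 14, 16, 18]
sp2 = pooled_variance(group_a, group_b)
t_stat = pooled_t_statistic(group_a, group_b)
print(sp2, t_stat)  # compare t_stat to a t distribution with n1 + n2 - 2 df
```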
- Poor decision-making
- based on correlation rather than causation
- Lesson 1637 — What is Metric Attribution?
- Poor interpretation
- (stakeholders misread what the metric actually measures)
- Lesson 1619 — What is Metric Ownership?
- Poor model fit
- Standard models assume stable variance and mean
- Lesson 734 — Why Differencing and Detrending Matter
- population
- ), you have complete information.
- Lesson 50 — Population vs Sample VarianceLesson 228 — Defining Populations and ParametersLesson 229 — Defining Samples and StatisticsLesson 232 — Notation ConventionsLesson 261 — Standard Error vs Standard DeviationLesson 692 — Offset Terms for Exposure
- Population Characteristics
- Lesson 243 — Choosing the Right Sampling Method
- Population distribution
- is the complete album of everyone's heights in a country — every single person.
- Lesson 258 — Comparing Population, Sample, and Sampling Distributions
- Population mean (μ)
- The average of all values in the population
- Lesson 228 — Defining Populations and Parameters
- Population proportion (p)
- The fraction of the population with a certain characteristic
- Lesson 228 — Defining Populations and Parameters
- Population standard deviation (σ)
- How spread out the population values are
- Lesson 228 — Defining Populations and ParametersLesson 292 — Sample Size for Estimating a Mean
- Population variability
- (σ): More spread in the population → larger SE
- Lesson 260 — Defining Standard Error
- Population variance
- Divide by **N** (total count of all values)
- Lesson 50 — Population vs Sample Variance
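
The divisor is the whole difference between the two variance formulas. Python's standard `statistics` module exposes both, so a quick comparison on made-up values shows the effect:

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # treat as a full population or as a sample

population_var = statistics.pvariance(values)  # divides by N
sample_var = statistics.variance(values)       # divides by n - 1

print(population_var, sample_var)
```

The sample version is always a little larger because it divides the same sum of squared deviations by a smaller number.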
- Population variance (σ²)
- Expected variability in each group
- Lesson 289 — Sample Size Requirements for Difference Intervals
- Portfolio thinking
- means treating your channels like investments:
- Lesson 1716 — Channel Mix and Portfolio Thinking
- position
- , viewers can judge "this is twice that" with about 5% error.
- Lesson 1232 — Perceptual Accuracy HierarchyLesson 1238 — Matching Encoding to Data TypeLesson 1242 — Inappropriate Chart Types for DataLesson 1341 — Data and Aesthetic Mappings
- Position + Color
- Use spatial separation or faceting along with color coding
- Lesson 1251 — Avoiding Reliance on Color Alone
- positive
- (you can't divide by zero or use negatives without complications)
- Lesson 216 — Reciprocal and Inverse TransformationsLesson 539 — What Are Residuals?Lesson 652 — Interpreting Categorical × Continuous Interactions
- Positive (Right) Skewness
- Lesson 64 — Skewness: Definition and Interpretation
- Positive coefficient
- → that category has a higher average outcome than the reference
- Lesson 637 — Interpreting Dummy Variable Coefficients
- Positive correlation
- Points trend upward from left to right (as one variable increases, so does the other)
- Lesson 1222 — Scatter Plots for Relationships
- Positive r
- Variables move together (height and weight, study time and test scores)
- Lesson 477 — Interpreting the Correlation Coefficient
- Positive residuals
- (`e_i > 0`) occur when the actual value is *above* the fitted line.
- Lesson 540 — The Residual Formula
- Positive values
- = heavier tails than normal (leptokurtic)
- Lesson 67 — Calculating KurtosisLesson 212 — Log TransformationsLesson 720 — The Autocorrelation Function (ACF)
- Post
- is a binary indicator (1 if observation is from post-treatment period, 0 if pre-treatment)
- Lesson 1455 — DiD with Regression
- Post-hoc tests
- (meaning "after this") are designed to make pairwise comparisons *after* finding a significant ANOVA result.
- Lesson 455 — Why Post-Hoc Tests Are Needed After ANOVA
- Posterior
- Given a positive test, what's the probability they actually have the disease?
- Lesson 107 — Bayes' Theorem Formula and ComponentsLesson 115 — Prior Sensitivity AnalysisLesson 1417 — Bayesian Change-Point DetectionLesson 1550 — What Are Conjugate Priors?Lesson 1552 — Gamma-Poisson ConjugacyLesson 1557 — The Beta-Binomial Model
- posterior distribution
- is the end result of Bayesian inference—it's what you *actually care about*.
- Lesson 1537 — The Posterior DistributionLesson 1539 — Interpreting Posterior ProbabilitiesLesson 1563 — Sequential Updating with New Data
- Posterior distributions
- tell you not just "which variant is winning?
- Lesson 1586 — Multi-Armed Bandit ConnectionsLesson 1587 — Bayesian A/B Testing in Practice
- Posterior mean
- a weighted average of your prior mean and the sample mean, weighted by their precisions (inverse variances)
- Lesson 1553 — Normal-Normal ConjugacyLesson 1561 — Posterior Mean and Mode
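
The precision-weighted average can be sketched for the normal-normal case, assuming the data standard deviation `sigma` is known. The function name and the numbers below are illustrative only.

```python
def normal_posterior(prior_mean, prior_sd, sample_mean, sigma, n):
    """Posterior mean and variance for a normal mean with known sigma."""
    prior_precision = 1 / prior_sd ** 2   # inverse of the prior variance
    data_precision = n / sigma ** 2       # inverse of the sample mean's variance
    post_precision = prior_precision + data_precision
    post_mean = (
        prior_precision * prior_mean + data_precision * sample_mean
    ) / post_precision
    return post_mean, 1 / post_precision


# Prior: mean 100, sd 10. Data: 4 observations averaging 110, sigma = 20.
post_mean, post_var = normal_posterior(100, 10, 110, 20, 4)
print(post_mean, post_var)
```

Here the prior and the data carry equal precision, so the posterior mean lands halfway between 100 and 110.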
- Posterior Mode
- The peak of the posterior distribution, also called the Maximum A Posteriori (MAP) estimate — the single most probable value.
- Lesson 1561 — Posterior Mean and Mode
- Posterior predictive checks
- answer this by simulating new datasets from your posterior distribution and comparing them to your observed data.
- Lesson 1596 — Posterior Predictive Checks and Model Comparison
- Posterior variance
- combines information from both the prior and the data
- Lesson 1553 — Normal-Normal Conjugacy
- Posterior: P(A|B)
- Your *updated* belief about A *after* observing evidence B
- Lesson 107 — Bayes' Theorem Formula and Components
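
A worked sketch of the update for the diagnostic-test case, using Bayes' theorem P(A|B) = P(B|A)·P(A) / P(B). The prevalence, sensitivity, and false positive rate are hypothetical numbers chosen for illustration.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                 # P(B|A) * P(A)
    false_pos = false_positive_rate * (1 - prevalence)  # P(B|not A) * P(not A)
    return true_pos / (true_pos + false_pos)            # divide by P(B)


# Hypothetical test: 1% prevalence, 95% sensitivity, 5% false positive rate.
p_disease = posterior_given_positive(0.01, 0.95, 0.05)
print(round(p_disease, 3))
```

Even with a fairly accurate test, the low prior keeps the posterior well below 50%.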
- PostgreSQL
- is an enterprise-grade, open-source DBMS known for handling complex queries and large datasets.
- Lesson 845 — Database Management Systems (DBMS)Lesson 862 — Case Sensitivity in Text FilteringLesson 940 — Database Support and AlternativesLesson 1041 — Formatting and Parsing Dates
- power
- to detect effects in the predicted direction
- Lesson 345 — Directionality in Hypothesis TestingLesson 397 — Power and Efficiency of Non-Parametric TestsLesson 475 — Choosing Between Parametric and Non-Parametric TestsLesson 1495 — Power Analysis Fundamentals
- Power (1 - β)
- , typically 0.
- Lesson 296 — Sample Size for Comparing Two GroupsLesson 343 — Calculating Power for Common TestsLesson 344 — Power Analysis in Study Design
- Power analysis
- is the process of determining the minimum sample size required to detect an effect of a given size with adequate statistical power, all while controlling your Type I error rate (alpha).
- Lesson 344 — Power Analysis in Study Design
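
For a rough sense of scale, the sketch below uses the common normal-approximation formula for a two-sided comparison of two means with equal group sizes: n per group ≈ 2(z₁₋α/₂ + z₁₋β)² / d², where d is the standardized effect size. This is an approximation, stated here as an assumption. Exact t-based calculations give slightly larger answers.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-sample test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # controls Type I error
    z_beta = NormalDist().inv_cdf(power)           # controls Type II error
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


medium = n_per_group(0.5)  # a "medium" standardized effect
small = n_per_group(0.2)   # a "small" standardized effect
print(medium, small)
```

Halving the effect size roughly quadruples the required sample.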
- Power-imbalanced contexts
- Employees consenting to employer tracking, students in research studies, prisoners, or patients in medical settings
- Lesson 1918 — Special Populations and Vulnerable Groups
- powerful
- when your data is truly normal, but it's very sensitive to non-normality—it might reject equal variances simply because your data isn't perfectly bell-shaped, not because variances actually differ.
- Lesson 380 — Testing Equal Variances: Levene's and Bartlett's TestsLesson 450 — Homogeneity of Variance (Homoscedasticity)
- Practical implementation
- Lesson 2070 — Separating Data from Code
- Practical limit
- Most effective trees are 3-5 levels deep with 3-7 branches per node
- Lesson 1623 — Depth vs Breadth in Metric Trees
- practical significance
- is the difference large enough to matter in context?
- Lesson 367 — Interpreting Two-Sample Test ResultsLesson 386 — Effect Size Interpretation GuidelinesLesson 387 — Confidence Intervals for Effect SizesLesson 389 — Reporting Effect Sizes in PracticeLesson 529 — Practical vs Statistical SignificanceLesson 609 — Practical vs Statistical SignificanceLesson 1480 — Minimum Detectable Effect (MDE)
- Praise good work
- When you see clever solutions or clear code, say so!
- Lesson 2024 — Code Review Best Practices
- Pre-creates
- a set number of connections when your application starts
- Lesson 1092 — Connection Pooling Basics
- Pre-experiment validation
- means running tests to ensure your randomization works properly and your metrics behave as expected *before* you expose users to actual treatment differences.
- Lesson 1483 — Pre-Experiment Validation
- Pre-filtering problems
- Different groups experiencing different dropout rates during assignment
- Lesson 1524 — Sample Ratio Mismatch (SRM)
- Pre-register analyses
- Decide your approach *before* seeing results
- Lesson 30 — The Reproducibility Crisis and Solutions
- Pre-registration
- means writing down your hypotheses, metrics, sample size, stopping rules, and correction methods *before* you peek at any results.
- Lesson 1508 — Pre-Registration and Correction Strategy
- precise control
- , especially when creating multiple subplots or building complex visualizations.
- Lesson 1256 — Two Interfaces: pyplot vs Object-OrientedLesson 1277 — Adjusting Subplot Spacing and Layout
- precision
- of your sample statistic as an estimate of the true population parameter.
- Lesson 265 — Using Standard Error in PracticeLesson 295 — Trade-offs: Precision, Confidence, and CostLesson 387 — Confidence Intervals for Effect SizesLesson 389 — Reporting Effect Sizes in PracticeLesson 1418 — Evaluating Change-Point Detection MethodsLesson 1567 — Posterior Mean as Weighted Average
- Precision & Recall
- For classification problems, how many relevant items did you catch, and how many false alarms did you trigger?
- Lesson 14 — Model Evaluation and Validation
- Precision is needed
- Reading exact values from 3D axes is significantly harder than 2D
- Lesson 1329 — Effective Use and Pitfalls of 3D Visualizations
- Precision matters
- viewers need to read exact values or make close comparisons
- Lesson 1233 — Position as the Most Effective Channel
- Predicting product lifespans
- Manufacturers use k < 1 to model defects caught in early testing
- Lesson 188 — Weibull Distribution: Hazard Function and Reliability
- prediction intervals
- a range where we expect the true value to fall with a certain confidence level (often 80% or 95%).
- Lesson 794 — Forecasting Concepts and HorizonsLesson 800 — Generating Forecasts with SARIMA
- Prediction intervals grow wider
- as the forecast horizon extends.
- Lesson 800 — Generating Forecasts with SARIMA
- Predictions
- Every observation gets the identical predicted value
- Lesson 647 — Impact on Model Results and Reporting
- Predictive parity
- When the model predicts success, is it equally accurate across groups?
- Lesson 1884 — Detecting Bias in Your Data
- Predictive problems
- answer "What will happen?
- Lesson 2096 — Distinguishing Descriptive, Diagnostic, and Prescriptive Problems
- Predicts sustainable growth
- When it improves, revenue and retention typically follow
- Lesson 1604 — What is a North Star Metric?
- Prefect
- , and **Dagster** log every execution step.
- Lesson 1164 — Tools for Lineage TrackingLesson 1843 — Declaring Dependencies in Orchestration Tools
- Pregnancy status
- Lesson 1888 — Protected Classes and Sensitive Attributes
- Preliminary evidence
- correlation coefficients, group comparisons, or statistical summaries that suggest the hypothesis may hold (e.
- Lesson 1203 — Documenting Hypotheses and Evidence
- Prepare Your Data
- Lesson 447 — Conducting One-Way ANOVA in Practice
- Prerequisites
- Required software, packages, and versions
- Lesson 1989 — Best Practices for Sharing Reproducible Reports
- Prescriptive problems
- answer "What should we do?
- Lesson 2096 — Distinguishing Descriptive, Diagnostic, and Prescriptive Problems
- Present contradictory evidence
- that challenges your hypothesis
- Lesson 1929 — Avoiding Cherry-Picking Results
- Preserves slower-moving patterns
- (the trend component)
- Lesson 755 — Moving Averages for Trend Estimation
- Prevalence
- the base rate of the disease in the population
- Lesson 109 — Medical Diagnostic TestingLesson 116 — From Bayes' Theorem to Bayesian Inference
- Prevention
- Use additive decomposition or agreed-upon attribution rules *before* initiatives launch.
- Lesson 1642 — Attribution Pitfalls and Common Errors
- Prevents Direct Pushes
- No one can use `git push` directly to protected branches—all changes must go through pull requests.
- Lesson 2027 — Protecting Branches and Required Reviews
- Prevents Force Pushes
- Protects against accidental history rewrites that could break reproducibility.
- Lesson 2027 — Protecting Branches and Required Reviews
- Preview data structure
- without downloading entire tables
- Lesson 877 — LIMIT: Restricting the Number of Rows Returned
- Price sensitivity
- (competitor pricing, perceived value)
- Lesson 1675 — Churn Attribution and Root Cause Analysis
- Pricing optimization
- Test whether annual plans reduce hazard rates compared to monthly
- Lesson 838 — Subscription and Membership Duration Modeling
- Primacy Effect
- Conversely, existing users are *already comfortable with the old version*.
- Lesson 1525 — Novelty and Primacy Effects
- Primary contact
- Your email or project maintainer's handle
- Lesson 2083 — Contributing Guidelines and Contact Information
- Primary data geoms
- The main visual elements (points, lines, bars)
- Lesson 1355 — Layer Order and Plot Composition
- Primary Key
- A unique identifier for each row (like `customer_id`)
- Lesson 843 — Relational Database ConceptsLesson 921 — Primary and Foreign Key RelationshipsLesson 1048 — What Are Primary Keys?Lesson 1051 — Introduction to Foreign Keys
- primary metric
- (or success metric) must directly align with your business goal.
- Lesson 1478 — Defining Success MetricsLesson 1485 — Documentation and Pre-Registration
- Primary test
- Compare mean purchase frequency between age groups using appropriate statistical tests
- Lesson 1204 — From Hypothesis to Analysis Plan
- Principal Data Scientist
- Strategic technical direction, influence company-wide architecture, recognized external expert
- Lesson 2140 — Individual Contributor vs Management Tracks
- Prior
- How common is the disease in the population?
- Lesson 107 — Bayes' Theorem Formula and ComponentsLesson 1417 — Bayesian Change-Point DetectionLesson 1550 — What Are Conjugate Priors?Lesson 1552 — Gamma-Poisson Conjugacy
- prior belief
- (what you thought before seeing evidence)
- Lesson 108 — Updating Beliefs with New EvidenceLesson 112 — Legal Evidence and Jury ReasoningLesson 115 — Prior Sensitivity AnalysisLesson 1417 — Bayesian Change-Point DetectionLesson 1557 — The Beta-Binomial ModelLesson 1566 — Conjugate Normal-Normal Model
- prior distribution
- quantifies your beliefs about a parameter *before* you observe any data.
- Lesson 1534 — The Prior DistributionLesson 1543 — Defining Prior DistributionsLesson 1544 — Informative vs Uninformative PriorsLesson 1563 — Sequential Updating with New DataLesson 1565 — Prior Distributions for Normal MeansLesson 1581 — Setting Priors for A/B Tests
- Prior knowledge
- Frequentist: ignored; Bayesian: incorporated explicitly
- Lesson 1580 — Bayesian vs Frequentist A/B Testing
- Prior mean (μ₀)
- Your best guess for the population mean before seeing data
- Lesson 1565 — Prior Distributions for Normal Means
- Prior precision
- How concentrated is your prior distribution?
- Lesson 1549 — Prior-Likelihood Trade-offs
- Prior probability P(Guilty)
- base rate of guilt before evidence
- Lesson 112 — Legal Evidence and Jury Reasoning
- Prior standard deviation (σ₀)
- How uncertain you are about that guess
- Lesson 1565 — Prior Distributions for Normal Means
- Prior: P(A)
- Your initial belief about A *before* seeing evidence B
- Lesson 107 — Bayes' Theorem Formula and Components
- Prioritize by pain
- Refactor the parts of your pipeline that cause the most frequent issues or slow you down most
- Lesson 2137 — Refactoring Strategies and Debt Paydown
- Prioritize Interpretability
- Lesson 678 — Choosing the Right Link Function
- Prioritized
- Rank recommendations by impact, feasibility, or urgency.
- Lesson 1970 — Recommendations and Next Steps
- Priority level
- based on business impact and testability, which hypotheses deserve formal testing first?
- Lesson 1203 — Documenting Hypotheses and Evidence
- Priors
- Specify distributions for unknown parameters
- Lesson 1594 — PyMC: Probabilistic Programming in Python
- Priors are similar
- Starting at 30% versus 35% won't create huge differences
- Lesson 115 — Prior Sensitivity Analysis
- Priors matter less when
- Lesson 115 — Prior Sensitivity Analysis
- Privacy Attacks
- Models trained on sensitive data might leak information through inference attacks, even if you've applied privacy techniques.
- Lesson 1920 — Anticipating Misuse of Data Products
- Privacy budget (ε)
- How much privacy you're willing to "spend" (smaller ε = more noise = more privacy)
- Lesson 1899 — Adding Noise for Privacy
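
The "smaller ε = more noise" relationship can be sketched with the Laplace mechanism, where the noise scale is sensitivity / ε. This is a toy illustration, not a vetted privacy implementation: the function names and the count query are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return the count plus Laplace noise scaled by sensitivity / epsilon."""
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
loose = [abs(private_count(100, 1.0, rng=rng) - 100) for _ in range(2000)]
strict = [abs(private_count(100, 0.1, rng=rng) - 100) for _ in range(2000)]
mean_loose = sum(loose) / len(loose)     # typical error at epsilon = 1.0
mean_strict = sum(strict) / len(strict)  # typical error at epsilon = 0.1
print(round(mean_loose, 2), round(mean_strict, 2))
```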
- Privacy-preserving machine learning
- where training data stays encrypted throughout
- Lesson 1903 — Secure Multi-Party Computation
- Proactive monitoring
- Owner spots anomalies and drives root-cause analysis
- Lesson 1619 — What is Metric Ownership?
- Probability Density Function (PDF)
- .
- Lesson 155 — Definition and Properties of Continuous Random VariablesLesson 156 — Probability Density Functions (PDFs)Lesson 162 — Uniform Distribution: PDF and CDF
- Probability Mass Function (PMF)
- comes in.
- Lesson 118 — Probability Mass Functions (PMF)Lesson 119 — Properties of Valid PMFsLesson 120 — Cumulative Distribution Functions (CDF) for Discrete VariablesLesson 124 — Bernoulli Distribution PMF and Parameters
- Probability of Being Best
- directly answers this question by computing the probability that a given variant has the highest true conversion rate (or other metric) compared to all other variants.
- Lesson 1583 — Probability of Being BestLesson 1586 — Multi-Armed Bandit Connections
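
One common way to estimate this probability is Monte Carlo: draw a conversion rate for each variant from its posterior, note which draw is highest, and repeat. The sketch below assumes a uniform Beta(1, 1) prior on each rate and uses invented counts; both are assumptions for illustration.

```python
import random


def prob_each_is_best(successes, trials, draws=20000, seed=0):
    """Share of posterior draws in which each variant has the highest rate."""
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        # Beta(1 + successes, 1 + failures) is the posterior under a uniform prior.
        samples = [
            rng.betavariate(1 + s, 1 + (n - s))
            for s, n in zip(successes, trials)
        ]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Hypothetical test: variant A converts 100/1000, variant B converts 130/1000.
probs = prob_each_is_best(successes=[100, 130], trials=[1000, 1000])
print(probs)
```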
- Probability sampling
- means every member of the population has a *known, non-zero chance* of being selected.
- Lesson 242 — Probability vs Non-Probability Sampling
- Probability sampling advantages
- Lesson 242 — Probability vs Non-Probability Sampling
- Probability sampling challenges
- Lesson 242 — Probability vs Non-Probability Sampling
- Probability sampling methods
- give you statistical validity.
- Lesson 243 — Choosing the Right Sampling Method
- Probability statements
- You can make direct claims like "There's a 95% probability the conversion rate is between 0.
- Lesson 1547 — Interpreting Posterior Distributions
- Probability threshold
- Stop when P(B better than A | data) > 0.
- Lesson 1585 — Early Stopping in Bayesian Tests
- Probe edge cases
- "What happens if the model is wrong?
- Lesson 2102 — Understanding Stakeholder Goals and Constraints
- Probit
- has thinner tails (based on the normal distribution)
- Lesson 674 — The Probit LinkLesson 678 — Choosing the Right Link Function
- probit link
- does the same job but uses the cumulative distribution function (CDF) of the standard normal distribution instead.
- Lesson 674 — The Probit LinkLesson 676 — Canonical vs Non-Canonical LinksLesson 677 — Interpreting Coefficients Under Different Links
- Problem Definition
- Lesson 9 — The Data Science Lifecycle OverviewLesson 10 — Problem Definition and Scoping
- Process
- Each worker applies the same operation to its chunk independently
- Lesson 1768 — Data Parallelism Fundamentals
- Process everything, every time
- Lesson 1828 — Incremental vs Full Load Strategies
- Processing speed
- One-pass transformation instead of read-then-transform
- Lesson 1802 — Filtering During Read with dtype and Converters
- Product A
- 10,000 new users/month, 10% retention → 1,000 active users
- Lesson 1614 — Growth Without Retention
- Product B
- 2,000 new users/month, 70% retention → 1,400 active users
- Lesson 1614 — Growth Without Retention
- Product categories
- An item categorized as "electronics" cannot also be "clothing" (assuming mutually exclusive classification)
- Lesson 81 — Mutually Exclusive Events
- Product changes
- Measure impact on cohorts before vs after a launch
- Lesson 1644 — What is Cohort Analysis?
- Product feedback
- Early adopters who volunteer feedback aren't typical users
- Lesson 246 — Volunteer and Self-Selection Bias
- Product focus
- Should you optimize for retention of casual users or delight of power users?
- Lesson 1698 — Power User Curves and Engagement Distribution
- Product gaps
- (missing features, usability issues)
- Lesson 1675 — Churn Attribution and Root Cause Analysis
- Product Launches
- Companies use DiD when rolling out features to some markets first.
- Lesson 1459 — Real-World DiD Applications
- Product Page View
- Lesson 1679 — Defining Funnel Steps and Events
- Product recommendations
- Finding pairs of products from the same `products` table
- Lesson 945 — Introduction to Self-Joins
- Product reviews
- skew positive when only satisfied customers bother to write them
- Lesson 247 — Survivorship Bias
- Product Team Objective
- Improve discovery experience
- Lesson 1608 — Connecting North Star Metrics to OKRs
- Product-market fit quality
- Higher floors suggest stronger fit
- Lesson 1658 — Flattening and Asymptotic Behavior
- Production code
- Applications should specify exactly which columns they need for clarity and performance
- Lesson 851 — Selecting All Columns with Asterisk
- Production deployment
- Compiled models are easier to integrate into non-Python systems
- Lesson 1595 — Stan: High-Performance Bayesian Inference
- Production pipelines
- that run automatically (ETL, model training, inference)
- Lesson 2074 — Notebooks vs Scripts: When to Use Each
- Production pipelines dominate
- Python integrates better with web services, APIs, and deployment infrastructure.
- Lesson 1375 — Choosing Tools: When to Use R vs Python for Visualization
- Productivity
- Focus on business logic, not query construction
- Lesson 1117 — What is an ORM and Why Use It?Lesson 1469 — Building a Simple Causal DAG
- Professionalism
- Lesson 1292 — Introduction to Styling: Why Aesthetics Matter
- Profiling reports
- go deeper: statistics for numeric columns (mean, min, max), cardinality for categorical fields, missing value percentages, and distribution summaries.
- Lesson 2067 — Automating Documentation with Code
- Profitability focus
- By identifying unprofitable segments (LTV < CAC), you can adjust targeting criteria, reduce spend, or experiment with lower-cost channels.
- Lesson 1669 — LTV Segmentation and Targeting
- Programming
- You'll need to write code to clean, analyze, and visualize data.
- Lesson 7 — The Data Science Skill Stack
- Project portability
- Each project carries its own dependency specification, making deployment predictable
- Lesson 2039 — Virtual Environments: Concept and Benefits
- Project Structure
- Brief overview of directory organization
- Lesson 2077 — The Purpose and Anatomy of a Good README
- Project templates
- solve this by providing a blueprint—a cookie cutter, if you will—that stamps out a consistent structure every time you start fresh.
- Lesson 2076 — Code Organization Templates and Cookiecutter
- Project Title and Description
- One-line summary and brief explanation of the project's purpose
- Lesson 2077 — The Purpose and Anatomy of a Good README
- Project-Join Normal Form
- (5NF) eliminates **join dependencies**.
- Lesson 1068 — Higher Normal Forms: 4NF and 5NF
- Prometheus
- , **Grafana**, and **Datadog** automate pipeline monitoring, offering dashboards that show pipeline status at a glance and trigger alerts when thresholds are breached.
- Lesson 1861 — Monitoring Tools and Dashboards
- Proportion test
- When your metric is a conversion rate or percentage
- Lesson 1749 — Measuring Statistical Significance
- Proportional allocation
- assigns credit based on estimated contribution size (e.
- Lesson 1640 — Attribution in Multi-Team Environments
- proportions
- like the percentage of customers who click an ad, or the fraction of defective products?
- Lesson 224 — CLT for ProportionsLesson 253 — Sampling Distribution of the Sample ProportionLesson 297 — Handling Unknown Population ParametersLesson 315 — Common Test Statistics: Z, t, Chi-Square, and FLesson 1187 — Contingency Tables and Cross-Tabulations
- Propose
- a new location nearby (a candidate parameter value)
- Lesson 1590 — The Metropolis-Hastings Algorithm
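The propose step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the target is a standard normal density (chosen only for the example), and the proposal is a symmetric Gaussian step, which reduces Metropolis-Hastings to the plain Metropolis acceptance rule. The names `log_target` and `metropolis` are hypothetical.

```python
import math
import random

def log_target(x: float) -> float:
    """Log density of the example target (standard normal, up to a constant)."""
    return -0.5 * x * x

def metropolis(n_steps: int, step_size: float = 1.0, start: float = 0.0, seed: int = 0):
    rng = random.Random(seed)
    current = start
    samples = []
    for _ in range(n_steps):
        # Propose a new location near the current one.
        candidate = current + rng.gauss(0.0, step_size)
        # Accept with probability min(1, target(candidate) / target(current)).
        log_ratio = log_target(candidate) - log_target(current)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = candidate
        # A rejected proposal repeats the current value in the chain.
        samples.append(current)
    return samples

samples = metropolis(20_000)
print(sum(samples) / len(samples))
```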
- Prospects
- People who've shown interest but haven't purchased yet.
- Lesson 1704 — Customer Lifecycle Stages
- Protanopia/Protanomaly
- (red-blind/red-weak): similar red-green confusion
- Lesson 1248 — Color Blindness and Color Palette Design
- Protected classes
- are groups of people shielded by law from discrimination.
- Lesson 1888 — Protected Classes and Sensitive Attributes
- Protection from SQL injection
- Parameterization is automatic
- Lesson 1117 — What is an ORM and Why Use It?
- Prototyping models
- and experimenting with different approaches
- Lesson 2074 — Notebooks vs Scripts: When to Use Each
- Provenance questions
- Can you trust data from third-party APIs or scraped sources?
- Lesson 1762 — Extended Dimensions: Veracity and Value
- Proximity
- Elements placed close together are perceived as related.
- Lesson 1236 — Gestalt Principles in Visualization
- Proxy validation is skipped
- Teams assume a surrogate metric correlates with the real goal without validating that relationship (remember lesson 1520: Validating Surrogate Metrics).
- Lesson 1530 — Mismatched Metrics and Goals
- proxy variable
- is a feature that correlates strongly with a protected attribute, allowing a model to infer sensitive information indirectly.
- Lesson 1883 — Protected Classes and Proxy VariablesLesson 1889 — Proxy Variables and Redlining
- Prunes intelligently
- Eliminates candidate change-points that can never be part of the optimal solution, based on proven mathematical conditions
- Lesson 1416 — PELT Algorithm: Pruned Exact Linear Time
- Pseudonymization
- replaces identifiers with artificial labels—Patient A, Patient B—allowing you to track the same individual across records without knowing their real identity.
- Lesson 1895 — Data Anonymization Basics
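A small Python sketch of the idea, assuming records are dictionaries with a hypothetical `patient_name` field. Each real identifier gets one stable artificial label, so repeat visits by the same person stay linked. Numbered labels are used here instead of letters purely for convenience.

```python
def pseudonymize(records, id_field="patient_name"):
    """Replace the identifier in each record with a stable artificial label."""
    labels = {}  # real identifier -> artificial label
    output = []
    for record in records:
        real_id = record[id_field]
        if real_id not in labels:
            labels[real_id] = f"Patient {len(labels) + 1}"
        output.append({**record, id_field: labels[real_id]})
    return output, labels

visits = [
    {"patient_name": "Alice", "systolic": 120},
    {"patient_name": "Bob", "systolic": 135},
    {"patient_name": "Alice", "systolic": 118},
]
safe_visits, lookup = pseudonymize(visits)
print(safe_visits)
```

The `lookup` table can re-identify everyone, so in practice it would need to be stored separately and protected.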
- Public datasets
- Government databases, research repositories, open data portals
- Lesson 11 — Data Collection and Acquisition
- Purchase Frequency
- counts how many purchases the typical customer makes in a given period (say, per year).
- Lesson 1663 — Simple LTV: Average Revenue Per Customer
- Pure AR (Autoregressive) Process
- Lesson 733 — Using ACF and PACF Together
- Pure coincidence
- Random chance, especially with small samples or cherry-picked data
- Lesson 493 — The Fundamental Difference: Association vs Cause-and-EffectLesson 494 — Spurious Correlations and Coincidence
- Purpose
- What question does this report answer?
- Lesson 1989 — Best Practices for Sharing Reproducible ReportsLesson 2007 — Branch Naming Conventions
- Purpose limitation
- Data collected for one purpose can't be repurposed for unrelated analytics without new consent
- Lesson 1904 — What is GDPR and Why It MattersLesson 1905 — Core Principles of GDPR
- put it back
- , shake the bag, and draw again.
- Lesson 298 — The Bootstrap Method: Resampling Your DataLesson 299 — How Bootstrap Resampling Works
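Drawing with replacement is what `random.choices` does, so the bag analogy translates directly into Python. The sketch below resamples a small made-up dataset and builds a percentile interval for the mean. The data, the number of resamples, and the function names are illustrative assumptions.

```python
import random
from statistics import mean

def bootstrap_means(data, n_resamples=2000, seed=42):
    """Resample the data with replacement and record the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    # choices() puts each drawn value "back in the bag" before the next draw.
    return [mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, level=0.95):
    """Simple percentile interval from the sorted bootstrap statistics."""
    ordered = sorted(values)
    lower_index = int((1 - level) / 2 * len(ordered))
    upper_index = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lower_index], ordered[upper_index]

data = [2, 4, 4, 5, 7, 9, 10, 12]
low, high = percentile_interval(bootstrap_means(data))
print(low, high)
```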
- Pyramid Principle
- , developed by Barbara Minto at McKinsey, flips the traditional "journey" narrative on its head.
- Lesson 1942 — The Pyramid Principle: Starting with the ConclusionLesson 1944 — Executive Summary Best PracticesLesson 1945 — Logical Flow: From Question to AnswerLesson 1952 — The Pyramid Principle: Leading with Conclusions
- Python
- , `plotly.
- Lesson 1374 — Interactivity: plotly in R vs Python and Integration PatternsLesson 1987 — Environment and Dependency ManagementLesson 2073 — Naming Conventions for Files and Functions
- Python (pandas/statsmodels)
- Lesson 646 — Reference Categories in Statistical Software
- Python with NumPy
- Lesson 482 — Calculating Pearson Correlation in Practice
- Python with Pandas
- Lesson 482 — Calculating Pearson Correlation in Practice
- Python with SciPy
- Lesson 482 — Calculating Pearson Correlation in Practice
- Python's approach
- is like learning the second language natively from the start.
- Lesson 1374 — Interactivity: plotly in R vs Python and Integration Patterns
- Pyvis
- is purpose-built for network visualization.
- Lesson 1321 — Interactive Network Graphs with Plotly and Pyvis
Q
- Q-Q linearity
- Points hugging the diagonal reference line
- Lesson 377 — Testing Normality: Visual Methods
- Q-Q plot
- compares your data's quantiles against a theoretical normal distribution.
- Lesson 377 — Testing Normality: Visual MethodsLesson 565 — What Q-Q Plots Show: Comparing Residual Distribution to Normal
- Q-Q plot (quantile-quantile plot)
- Residuals should fall along a straight diagonal line
- Lesson 449 — Normality of ResidualsLesson 788 — Checking Residual Normality
- Q-Q plot first
- Does the pattern look problematic for your purposes?
- Lesson 570 — Q-Q Plots vs Formal Normality Tests: When Visual Checks Matter
- Q-Q plots
- , **histograms**, and tests like **Shapiro-Wilk**.
- Lesson 290 — Assumptions and Diagnostics for Difference IntervalsLesson 587 — Identifying Outliers in Regression Context
- Q1
- (25th percentile): 25% of data falls below this value
- Lesson 1383 — Understanding the Interquartile Range (IQR)
- Q1 (First Quartile)
- The value at the 25% mark — one quarter of your data falls below this point
- Lesson 51 — Interquartile Range (IQR)
- Q3
- (75th percentile): 75% of data falls below this value
- Lesson 1383 — Understanding the Interquartile Range (IQR)
- Q3 (Third Quartile)
- The value at the 75% mark — three quarters of your data falls below this point
- Lesson 51 — Interquartile Range (IQR)
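For a concrete calculation, Python's standard library can compute quartiles directly. Quartile conventions differ between tools, so the `method="inclusive"` choice below is an assumption and other software may give slightly different values on the same data.

```python
from statistics import quantiles

def quartiles(data):
    """Return (Q1, median, Q3) using the inclusive interpolation method."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return q1, q2, q3

values = [2, 4, 6, 8, 10, 12, 14, 16, 18]
q1, median, q3 = quartiles(values)
iqr = q3 - q1
print(q1, median, q3, iqr)  # 6.0 10.0 14.0 8.0
```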
- Quadratic or polynomial trends
- When your data curves upward or downward in an accelerating pattern
- Lesson 736 — Higher-Order Differencing
- Qualitative and aspirational
- "Transform user onboarding" beats "Improve metrics"
- Lesson 1609 — Setting Effective Objectives
- Quality
- Good units ÷ total units produced (capturing defects)
- Lesson 1636 — Manufacturing Metrics: OEE, Yield, and Cycle Time
- Quality control
- A manufacturing process with high variability produces inconsistent products
- Lesson 46 — What is Variability?Lesson 351 — When to Use a One-Sample t-Test
- Quality control pass rates
- (proportion of acceptable products)
- Lesson 184 — Beta Distribution: Bounded Between 0 and 1
- Quality gates exist
- Automated tests can run, and approval requirements can block poor code from merging
- Lesson 2022 — Understanding Pull Requests
- Quality metrics
- Products manufactured in different batch sizes
- Lesson 43 — Weighted Mean and Its Applications
- Quantify business outcomes
- Lesson 1969 — Translating Technical Findings for Business Audiences
- Quantiles
- are the general family of cut-points that divide ranked data into *any* number of equal-sized groups.
- Lesson 57 — Quantiles: Quartiles, Deciles, and BeyondLesson 306 — Bootstrap for Non-Standard Problems
- Quarantine
- bad records for review (flexible approach)
- Lesson 1826 — Data Validation and Schema EnforcementLesson 1866 — Handling Failed Quality Checks
- Quarantine new work
- Apply strict standards to new features while gradually improving old ones
- Lesson 2137 — Refactoring Strategies and Debt Paydown
- Quarterly cycles
- Business revenues influenced by fiscal quarters
- Lesson 707 — Seasonality: Regular Periodic Patterns
- Quartiles
- (4 groups): Cut your data into quarters.
- Lesson 57 — Quantiles: Quartiles, Deciles, and Beyond
- Quartiles (4 groups)
- Lesson 1010 — NTILE(): Dividing Rows into Buckets
- Quartiles or deciles
- Divide customers into equal-sized groups (top 10%, next 10%, etc.)
- Lesson 1669 — LTV Segmentation and Targeting
- Query execution time
- The obvious metric, but run queries multiple times to account for caching
- Lesson 1077 — Measuring Performance Impact of Denormalization
- Queue theory
- Time until multiple service completions
- Lesson 181 — Gamma Distribution: Shape and Rate Parameters
- Quick ad-hoc queries
- You're doing temporary analysis and speed matters more than precision
- Lesson 851 — Selecting All Columns with Asterisk
- Quick Ratio
- measures how much new and expansion revenue you gain versus how much you lose:
- Lesson 1629 — SaaS Growth Metrics: Quick Ratio and Net Revenue Retention
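One common way to write this ratio is gained recurring revenue divided by lost recurring revenue. The sketch below assumes that "lost" means churned plus contraction revenue, which is a widely used convention but is not spelled out in the entry. The figures are made up.

```python
def quick_ratio(new_mrr, expansion_mrr, churned_mrr, contraction_mrr=0.0):
    """(new + expansion revenue) / (churned + contraction revenue)."""
    lost = churned_mrr + contraction_mrr
    if lost == 0:
        raise ValueError("Quick ratio is undefined when no revenue was lost")
    return (new_mrr + expansion_mrr) / lost

# 50,000 gained against 12,500 lost
print(quick_ratio(40_000, 10_000, 10_000, 2_500))  # 4.0
```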
- Quick updates
- Brief email summaries or Slack messages for "no blockers, progressing as planned"
- Lesson 2104 — Communication Cadence and Updates
- Quintiles
- (5 groups): Split data into fifths, useful in economic studies and portfolio analysis.
- Lesson 57 — Quantiles: Quartiles, Deciles, and Beyond
R
- r = -1
- Perfect negative linear relationship (as one variable increases, the other decreases proportionally)
- Lesson 476 — What is Pearson Correlation?Lesson 477 — Interpreting the Correlation Coefficient
- r = +1
- Perfect positive linear relationship (as one variable increases, the other increases proportionally)
- Lesson 476 — What is Pearson Correlation?Lesson 477 — Interpreting the Correlation Coefficient
- r = 0
- No linear relationship (the variables don't follow a straight-line pattern together)
- Lesson 476 — What is Pearson Correlation?Lesson 477 — Interpreting the Correlation Coefficient
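The three reference values can be reproduced with a short hand-rolled Pearson calculation. The datasets are made up for illustration; the last one is a perfect curve (y = x²) that still gives r = 0 because the relationship is not linear.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2 * v + 1 for v in x]))   # perfect positive line
print(pearson_r(x, [-3 * v for v in x]))      # perfect negative line
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # symmetric curve, no linear trend
```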
- R Charts
- Best for small subgroups (n ≤ 10).
- Lesson 1399 — Control Charts for Variability (R and S Charts)
- R Charts (Range Charts)
- track the difference between the highest and lowest values in each sample group.
- Lesson 1399 — Control Charts for Variability (R and S Charts)
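As a rough sketch, the plotted quantity is just max minus min per subgroup, and the chart's center line is the average of those ranges. The measurements are invented for the example; control limits are left out because they depend on tabulated constants for the subgroup size.

```python
def subgroup_ranges(subgroups):
    """Range (highest minus lowest value) of each sample group."""
    return [max(group) - min(group) for group in subgroups]

def r_chart_center_line(subgroups):
    """Average range across subgroups, used as the R chart center line."""
    ranges = subgroup_ranges(subgroups)
    return sum(ranges) / len(ranges)

samples = [
    [10, 12, 11, 13, 12],
    [11, 11, 12, 10, 14],
    [12, 14, 11, 12, 13],
]
print(subgroup_ranges(samples))       # [3, 4, 3]
print(r_chart_center_line(samples))
```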
- R-hat statistic
- Compares variance within and between multiple chains; values near 1.
- Lesson 1592 — Burn-in, Thinning, and Convergence Diagnostics
- R-squared
- (written as R² or r²) tells you the **proportion of variance in Y that is explained by X**.
- Lesson 531 — What is R-Squared?Lesson 543 — Residuals as Unexplained Variation
- R-squared and adjusted R-squared
- Model fit is unchanged
- Lesson 647 — Impact on Model Results and Reporting
- R's approach
- is like having an interpreter who translates your speech (ggplot2 code) into another language (plotly).
- Lesson 1374 — Interactivity: plotly in R vs Python and Integration Patterns
- R²
- measures the proportion of variance in Y explained by your regression model
- Lesson 534 — R-Squared vs Correlation SquaredLesson 613 — The Adjusted R-Squared Formula
- R² = 0
- Your model explains none of the variance; you might as well use the mean of Y as your prediction
- Lesson 531 — What is R-Squared?Lesson 533 — Interpreting R-Squared Values
- R² = 0.15
- Only 15% of variance is explained; 85% remains unexplained.
- Lesson 533 — Interpreting R-Squared Values
- R² = 0.85
- Your model explains 85% of the variance—most of the variation is captured by your regression line.
- Lesson 533 — Interpreting R-Squared Values
- R² = 1
- Your model perfectly predicts every Y value (rare in real life!)
- Lesson 531 — What is R-Squared?
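The two endpoints, R² = 0 and R² = 1, fall straight out of the usual definition, 1 minus the ratio of residual to total sum of squares. The sketch below uses made-up values and a hypothetical `r_squared` helper.

```python
def r_squared(y_true, y_pred):
    """1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((actual - pred) ** 2 for actual, pred in zip(y_true, y_pred))
    ss_tot = sum((actual - mean_y) ** 2 for actual in y_true)
    return 1 - ss_res / ss_tot

y = [1, 2, 3, 4, 5]
print(r_squared(y, y))          # predictions match exactly: 1.0
print(r_squared(y, [3] * 5))    # always predicting the mean of Y: 0.0
```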
- Race and ethnicity
- Lesson 1888 — Protected Classes and Sensitive Attributes
- Radio silence after complaints
- Sometimes the absence of follow-up signals they've given up
- Lesson 1673 — Leading Indicators of Churn
- Radioactive decay
- An atom that hasn't decayed for an hour is no more "due" to decay than a fresh atom
- Lesson 167 — Memoryless Property of Exponential
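The claim can be checked numerically with the exponential survival function S(t) = exp(-rate · t). The rate and times below are arbitrary illustrative values.

```python
import math

def survival(t, rate):
    """P(T > t) for an exponential waiting time."""
    return math.exp(-rate * t)

def conditional_survival(extra, elapsed, rate):
    """P(T > elapsed + extra | T > elapsed)."""
    return survival(elapsed + extra, rate) / survival(elapsed, rate)

rate = 0.5
# Surviving 2 more hours after already lasting 1 hour...
print(conditional_survival(2.0, 1.0, rate))
# ...has the same probability as a fresh atom surviving 2 hours.
print(survival(2.0, rate))
```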
- Rainbow palettes
- They suggest order where none exists and aren't colorblind-friendly.
- Lesson 1309 — Choropleth Maps: Basics and Best Practices
- Random Assignment
- Each participant has an equal chance of being assigned to either group
- Lesson 1435 — What is a Randomized Controlled Trial?Lesson 1486 — Why Randomization Matters in A/B Tests
- Random number generators
- Does each digit appear with equal frequency?
- Lesson 421 — Applications: Uniform, Genetic Ratios, and Distributions
- Random sampling
- Your data comes from a random process
- Lesson 419 — Assumptions and Minimum Expected Frequencies
- random seeds
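- fix the starting state of the pseudorandom number generator, so the same code produces the same "random" numbers on every run.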
- Lesson 28 — Random Seeds and Deterministic ComputationLesson 29 — Code and Environment Management
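A short demonstration with Python's standard `random` module: seeding a generator with the same value reproduces the same sequence. The helper name `draw_sample` is illustrative.

```python
import random

def draw_sample(seed, k=5):
    """Draw k pseudorandom integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # a private generator avoids touching global state
    return [rng.randint(1, 100) for _ in range(k)]

first = draw_sample(seed=123)
second = draw_sample(seed=123)
print(first == second)  # True: same seed, same sequence
```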
- Randomization
- Lesson 400 — Assumptions and Conditions for Proportion TestsLesson 499 — Why Controlled Experiments Are NeededLesson 1436 — The Gold Standard for Causality
- Randomization Quality
- Both groups should have similar characteristics (demographics, behavior patterns) if randomization works correctly
- Lesson 1483 — Pre-Experiment Validation
- Randomization unit
- User, session, or other unit you defined
- Lesson 1485 — Documentation and Pre-Registration
- Randomize assignment
- Split users randomly into control and treatment groups (e.
- Lesson 1641 — Isolating Effects with Control Groups
- Randomized Controlled Trial (RCT)
- is an experimental method where participants are randomly assigned to either a **treatment group** (receives the intervention) or a **control group** (does not receive the intervention).
- Lesson 1435 — What is a Randomized Controlled Trial?Lesson 1677 — Measuring Churn Reduction Impact
- Randomizing by session
- gives you more experimental units (higher power), but risks violating independence assumptions and creates inconsistent experiences.
- Lesson 1481 — Unit of Randomization
- Randomizing by user
- gives cleaner results and consistent experience, but requires more users to detect effects.
- Lesson 1481 — Unit of Randomization
- Randomly assign users
- to treatment (real ad) or control (PSA/ghost ad)
- Lesson 1747 — Ghost Ads and PSA Tests
- randomly assigned
- to treatment or control groups.
- Lesson 1436 — The Gold Standard for CausalityLesson 1526 — Selection Bias in Opt-In Tests
- range
- .
- Lesson 47 — Range: The Simplest MeasureLesson 54 — When to Use Each MeasureLesson 266 — What is a Confidence Interval?
- Range queries
- (`WHERE age BETWEEN 25 AND 35`) find the starting point, then scan consecutive sorted leaves
- Lesson 1079 — B-Tree Indexes: Structure and Mechanics
- Range retention
- User was active *at any point* from start through that period (cumulative)
- Lesson 1648 — Cohort Retention Rates
- Range sliders
- excel with time-series data or any ordered sequence where users need to examine specific intervals (e.
- Lesson 1303 — Range Sliders and Zoom Controls
- Range violations
- Negative ages or dates in the future
- Lesson 1109 — Input Validation and Defense in DepthLesson 1150 — What is Data Validation?
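A minimal range check might look like the sketch below. The field names `age` and `signup_date` are hypothetical, and "today" is passed in so the check is reproducible.

```python
from datetime import date

def find_range_violations(rows, today):
    """Return (row_index, field, value) for every out-of-range value."""
    violations = []
    for index, row in enumerate(rows):
        if row["age"] < 0:
            violations.append((index, "age", row["age"]))
        if row["signup_date"] > today:
            violations.append((index, "signup_date", row["signup_date"]))
    return violations

rows = [
    {"age": 34, "signup_date": date(2023, 5, 1)},
    {"age": -2, "signup_date": date(2023, 6, 1)},
    {"age": 51, "signup_date": date(2999, 1, 1)},
]
print(find_range_violations(rows, today=date(2024, 1, 1)))
```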
- Rank the absolute values
- of differences from smallest to largest
- Lesson 392 — Wilcoxon Signed-Rank Test
- Rank them
- from 1 (smallest) to n (largest), averaging tied ranks
- Lesson 393 — Mann-Whitney U Test (Wilcoxon Rank-Sum)
- Rank users
- by their activity level (highest to lowest)
- Lesson 1698 — Power User Curves and Engagement Distribution
- ranking
- you pool all observations from both groups, assign ranks from smallest to largest (ignoring which group they came from), then sum the ranks for each group.
- Lesson 393 — Mann-Whitney U Test (Wilcoxon Rank-Sum)Lesson 474 — Friedman Test: Non-Parametric Repeated Measures ANOVALesson 488 — Computing Spearman Correlation
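The pooling, ranking, and summing steps can be sketched in plain Python, including the averaging of tied ranks. The function names and data are illustrative.

```python
def average_ranks(values):
    """Rank from 1 (smallest) to n (largest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks

def rank_sums(group_a, group_b):
    """Pool both groups, rank everything together, then sum ranks per group."""
    ranks = average_ranks(list(group_a) + list(group_b))
    return sum(ranks[:len(group_a)]), sum(ranks[len(group_a):])

print(average_ranks([10, 20, 20, 30]))   # [1.0, 2.5, 2.5, 4.0]
print(rank_sums([1, 3, 5], [2, 4, 6]))   # (9.0, 12.0)
```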
- Rankings
- (1st, 2nd, 3rd) alongside the actual data values
- Lesson 1005 — Introduction to Window Functions
- Rankings are important
- ordering items from high to low
- Lesson 1233 — Position as the Most Effective Channel
- Ranks all observations
- from smallest to largest across *all* groups combined (ignoring group membership temporarily)
- Lesson 471 — Kruskal-Wallis H Test: The Non-Parametric One-Way ANOVA
- Rare Events
- Earthquakes per year, typos per page, or accidents per month—anything that happens occasionally but at a predictable average rate.
- Lesson 144 — Poisson Applications: Arrivals and Events
- rate
- is stable but individual occurrences are unpredictable
- Lesson 153 — Real-World Use Cases: Quality Control and DefectsLesson 692 — Offset Terms for ExposureLesson 1552 — Gamma-Poisson Conjugacy
- Rate data
- Counts per unit of time, space, or population (e.
- Lesson 689 — When to Use Poisson Regression
- Rate limiting
- Prevent bulk misuse of APIs
- Lesson 1925 — Mitigation Strategies and Responsible Disclosure
- rate parameter
- .
- Lesson 139 — The Poisson Process and Rate ParameterLesson 165 — Exponential Distribution: PDF and CDFLesson 1552 — Gamma-Poisson Conjugacy
- Rate parameter (β, "beta")
- Controls how quickly probability "decays" or spreads out.
- Lesson 181 — Gamma Distribution: Shape and Rate Parameters
- rate parameter λ
- (lambda), you can calculate the probability of observing *exactly* k events in your interval.
- Lesson 140 — Poisson Probability Mass FunctionLesson 166 — Exponential Distribution: Mean and Variance
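The calculation uses the Poisson probability mass function, P(X = k) = λ^k · e^(−λ) / k!. A direct Python translation, with an arbitrary example rate:

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0  # e.g. an average of 3 events per interval
for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))
```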
- rates
- when your observations have unequal exposure times or denominators.
- Lesson 692 — Offset Terms for ExposureLesson 1613 — Raw Counts vs. Rates and Ratios
- Ratio data
- (numeric with meaningful zero: height, count, salary) leverages:
- Lesson 1238 — Matching Encoding to Data Type
- Ratio to partition average
- `value / AVG(value) OVER (PARTITION BY category)`
- Lesson 1019 — Comparing Values to Window Aggregates
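To see the expression in action, the sketch below runs it against an in-memory SQLite database from Python. The `sales` table and its rows are made up, and window functions require SQLite 3.25 or newer, which is an assumption about the local installation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 10), ("a", 30), ("b", 5), ("b", 5), ("b", 20)],
)

# Each row keeps its own value and is divided by its category's average.
rows = conn.execute(
    """
    SELECT category,
           value,
           value / AVG(value) OVER (PARTITION BY category) AS ratio_to_avg
    FROM sales
    ORDER BY rowid
    """
).fetchall()

for row in rows:
    print(row)
```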
- Ratios
- (like revenue per customer)
- Lesson 306 — Bootstrap for Non-Standard ProblemsLesson 1613 — Raw Counts vs. Rates and Ratios
- Raw Kurtosis (Pearson's)
- The formula above without the final subtraction of 3, so a normal distribution scores 3; Fisher's (excess) kurtosis subtracts 3 at the end.
- Lesson 67 — Calculating Kurtosis
- RDD (Resilient Distributed Dataset)
- is Spark's core data structure—a collection of objects distributed across the nodes in your cluster.
- Lesson 1777 — RDDs: Resilient Distributed Datasets Fundamentals
- React Slowly to Changes
- Lesson 1598 — Characteristics of Lagging Indicators
- Reactivations
- If churned customers return, do you subtract them from "customers lost"?
- Lesson 1671 — Churn Rate Calculation Methods
- Reactive mode
- You're constantly firefighting instead of preventing fires
- Lesson 1617 — The Danger of Lagging-Only Metrics
- Read both versions carefully
- to understand what each branch changed
- Lesson 2011 — Resolving Merge Conflicts
- Read Committed
- You only see committed data, but values can change during your transaction
- Lesson 1116 — Transaction Isolation and Concurrency
- Read Uncommitted
- You can see other transactions' uncommitted changes (risky!)
- Lesson 1116 — Transaction Isolation and Concurrency
- Read-heavy workloads
- If a table is queried 10,000 times daily but updated once, duplicating data to avoid joins is worthwhile.
- Lesson 1071 — When to Denormalize: Performance Trade-offsLesson 1073 — Storing Computed Values and Aggregates
- Readability
- Listing columns in a logical hierarchy makes your query easier to understand
- Lesson 906 — Order Matters: Column Sequence in GROUP BYLesson 924 — Using Table Aliases in JoinsLesson 974 — When to Use FROM Subqueries vs CTEsLesson 1106 — Parameter Placeholders: Named ParametersLesson 1292 — Introduction to Styling: Why Aesthetics Matter
- Readmission rate
- measures the percentage of patients returning within 30 days—a lagging indicator of both care quality and discharge planning effectiveness.
- Lesson 1633 — Healthcare Metrics: Patient Outcomes and Operational Efficiency
- Real-time
- (milliseconds-to-seconds) demands streaming pipelines with immediate processing.
- Lesson 1825 — Designing Pipeline Architecture
- Real-time learning
- Update beliefs as information arrives rather than waiting
- Lesson 1538 — Updating Beliefs with Sequential Data
- Real-world examples
- Lesson 805 — Left and Interval Censoring
- Real-World Needs
- If you're building an analytics dashboard that constantly needs customer names with their order totals, joining `customers` and `orders` thousands of times per hour might waste resources.
- Lesson 1070 — When to Stop Normalizing
- Realism matters more
- Your domain knowledge doesn't fit standard conjugate families
- Lesson 1556 — Choosing Between Conjugate and Non-Conjugate Priors
- Reassess consent
- before any new use case—even internal ones
- Lesson 1915 — Secondary Use and Scope Creep
- Rebalance quarterly
- as business conditions evolve
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- Rebase
- rewrites history by moving your branch's commits to start from a different point.
- Lesson 2014 — Understanding Git Rebase vs MergeLesson 2016 — Rebasing Feature Branches
- Rebuild Fragmented Indexes
- When fragmentation exceeds 30-40%, rebuild the index to reorganize data pages.
- Lesson 1086 — Index Maintenance and Monitoring
- Recalculate the test statistic
- for this permuted dataset
- Lesson 395 — Permutation Tests for Means and Beyond
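A compact sketch of the whole loop for a difference in means: shuffle the pooled data, recalculate the statistic, and count how often it is at least as extreme as the observed one. The "+1" in the p-value is a common convention that keeps the estimate above zero; it, the function name, and the data are assumptions for illustration.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        # Recalculate the test statistic for this permuted dataset.
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)

print(permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
```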
- Recalculates centers
- based on the customers assigned to them
- Lesson 1705 — K-Means Clustering for Segmentation
- Recall
- TP / (TP + FN) — of all real changes, how many did you catch?
- Lesson 1418 — Evaluating Change-Point Detection Methods
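As a quick worked example of the formula, with made-up counts:

```python
def recall(true_positives, false_negatives):
    """TP / (TP + FN): the share of real changes that were detected."""
    total_real = true_positives + false_negatives
    if total_real == 0:
        raise ValueError("Recall is undefined when there are no real changes")
    return true_positives / total_real

# 8 real change-points detected, 2 missed
print(recall(8, 2))  # 0.8
```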
- Recency
- How recently did they make a purchase?
- Lesson 1703 — RFM Analysis: Recency, Frequency, Monetary Value
- Reciprocal
- (`1/Y`) can handle extreme heteroscedasticity but changes interpretation dramatically.
- Lesson 591 — When and Why to Transform Variables
- Recognizing boundaries of competence
- means honestly assessing what you know versus what a problem requires, and making responsible decisions about whether to proceed alone, seek help, or decline the work entirely.
- Lesson 34 — Recognizing Boundaries of Competence
- Recommendation
- Launch automated alerts for at-risk accounts
- Lesson 1948 — The Recommendation Slide: Making It Actionable
- Recommendations
- Actionable next steps tied to findings
- Lesson 1966 — Report Structure and Executive Summary
- Recommendations backed by evidence
- , not just observations
- Lesson 2091 — Stage 7: Communication and Handoff
- Reconcile findings
- Lesson 210 — Combining Visual and Statistical Methods
- Record time-to-event
- Days/months until failure (or censoring if still working at study end)
- Lesson 837 — Product Warranty and Failure Analysis
- Recovery from encoding issues
- Lesson 1141 — Recovering from Corrupted or Partially Broken Data
- Recovery is risky
- Fixing an error by rerunning might make things worse
- Lesson 1847 — What is Idempotency?
- Recursive Member
- The self-referencing query that adds the next "layer" by joining back to what you've already found.
- Lesson 996 — Recursive CTEs: Introduction
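The sketch below shows a recursive member at work, run against an in-memory SQLite database from Python. The `employees` table is a made-up example: the first `SELECT` finds the top of the hierarchy, and the second joins back to the rows already found to add one reporting layer per pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL), (2, 'Ben', 1), (3, 'Cy', 2), (4, 'Dee', 1);
    """
)

rows = conn.execute(
    """
    WITH RECURSIVE chain(id, name, depth) AS (
        -- Starting rows: employees with no manager
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: join back to what has already been found
        SELECT e.id, e.name, c.depth + 1
        FROM employees AS e
        JOIN chain AS c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
    """
).fetchall()

print(rows)  # [('Ada', 0), ('Ben', 1), ('Dee', 1), ('Cy', 2)]
```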
- Recursive operations
- CTEs support recursion; subqueries don't
- Lesson 974 — When to Use FROM Subqueries vs CTEs
- Recuse yourself
- from projects where you can't be objective
- Lesson 35 — Conflicts of Interest and Independence
- Recycles
- the connection back to the pool when you're done (via `close()` or context manager)
- Lesson 1092 — Connection Pooling Basics
- Red flags
- include:
- Lesson 562 — Index Plots and Time-Ordered ResidualsLesson 584 — Correlation Matrices for Predictors
- Red flags for non-stationarity
- Lesson 715 — Visual Tests for Stationarity
- Redshift
- Offers both traditional nodes and newer "Spectrum" for separated storage
- Lesson 1813 — Modern Cloud Data Warehouses: Snowflake, BigQuery, Redshift
- Reduce multicollinearity
- VIF values drop for remaining predictors
- Lesson 585 — Remedies: Variable Selection
- Reduce Redundancy
- Instead of storing a customer's address in every order record, you store it once in a `customers` table and reference it using a foreign key.
- Lesson 1061 — Introduction to Normalization
- Reduce wasted effort
- If the simple answer settles the question, you saved days of work
- Lesson 2110 — The Minimum Viable Analysis (MVA)Lesson 2111 — Fast Feedback Loops with Stakeholders
- Reduced data redundancy
- Category names aren't repeated for every product
- Lesson 1810 — Snowflake Schema and Normalization Trade-offs
- Reduced Feature Adoption
- When active users stop exploring new features or abandon key workflows they once used regularly, disengagement may be brewing.
- Lesson 1700 — Leading Indicators of Disengagement
- Reduced human error
- No forgotten runs or copy-paste mistakes
- Lesson 1986 — Automated Report Generation
- Reduced LTV
- Shorter customer lifespans mean less total revenue per customer
- Lesson 1670 — What is Churn and Why It Matters
- Reduced model
- Uses only your baseline predictors (e.
- Lesson 623 — Partial F-Tests for Nested ModelsLesson 654 — Testing Interaction Significance
- Reduced opportunity cost
- of running inferior variants
- Lesson 1515 — Trade-offs: Sample Size, Speed, and Complexity
- Reduced peak memory
- Never materialize the "wrong" version
- Lesson 1802 — Filtering During Read with dtype and Converters
- Reduced power per test
- With the same overall sample size, each pairwise comparison has less data and thus less ability to detect real effects
- Lesson 1528 — Testing Too Many Variants
- Reduced sampling variability
- Larger samples produce statistics (like means) that cluster more tightly around the true population value.
- Lesson 340 — Power and Sample Size Relationship
- Reduced statistical significance
- Even though your overall model might fit well (good R-squared), individual predictors may appear non-significant
- Lesson 580 — What is Multicollinearity?
- Reducing skewness
- Converting the stretched-out tail into a more symmetric bell shape
- Lesson 212 — Log Transformations
- Redundant labels
- If the axis already shows values, don't repeat them on every bar
- Lesson 1237 — Chart Junk and Data-Ink RatioLesson 1246 — Visual Clutter and ChartjunkLesson 1963 — Removing Chartjunk
- Redundant variables
- highly correlated features that provide similar information
- Lesson 1192 — Correlation Matrices and Heatmaps
- reference category
- or **baseline**.
- Lesson 636 — The Reference CategoryLesson 642 — What is a Reference Category?Lesson 644 — Choosing a Reference Category
- Reference lines
- Add `geom_hline()` or `geom_vline()` early so data appears over them, or late to emphasize thresholds
- Lesson 1355 — Layer Order and Plot CompositionLesson 1962 — Contextualizing Numbers
- referential integrity
- they guarantee that relationships between tables remain valid.
- Lesson 1051 — Introduction to Foreign KeysLesson 1055 — What is Referential Integrity?Lesson 1150 — What is Data Validation?
- Referral
- Traffic from links on other websites (blogs, news articles, partner sites)
- Lesson 1712 — Common Channel CategoriesLesson 1758 — Cohort-Based Payback Analysis
- Referrer Headers
- are automatically sent by browsers, telling your server which website the user came from.
- Lesson 1713 — Tracking Users by Channel
- Reflects Customer Value
- Lesson 1605 — Characteristics of Good North Star Metrics
- Reframe, don't just refuse
- Instead of "I can't do that," try:
- Lesson 1931 — When to Push Back on Requests
- Regression and Feature Importance
- Lesson 1602 — Identifying Leading Indicators for Your Metrics
- Regression plots
- Fit and display linear models
- Lesson 1281 — Introduction to Seaborn's Statistical Plots
- Regular aggregate (collapses rows)
- Lesson 1014 — Introduction to Window Aggregation Functions
- Regular audits
- Schedule quarterly reviews to identify unused notebooks, deprecated feature columns, and abandoned model variants.
- Lesson 2135 — Dead Experimental Code and Feature Sprawl
- Regular Sync Points
- Lesson 2046 — Best Practices for Environment Management in Teams
- Regular, predictable patterns
- that repeat at fixed intervals—daily, weekly, monthly, or yearly.
- Lesson 705 — The Four Classical Components
- Regularization
- adds a penalty to the model that discourages large coefficient values, stabilizing estimates even when predictors overlap.
- Lesson 586 — Remedies: Regularization PreviewLesson 1569 — Shrinkage and Regularization Effects
- Regulatory constraints
- Are there legal requirements (HIPAA, GDPR) or industry standards that limit what you can analyze or recommend?
- Lesson 1168 — Understanding Domain Context
- Regulatory context
- (HIPAA for healthcare, SOX for finance)
- Lesson 2145 — Transitioning Between Industries and Domains
- Reject all hypotheses
- up to (but not including) that stopping point
- Lesson 1504 — Holm-Bonferroni MethodLesson 1506 — Benjamini-Hochberg Procedure
- reject the null hypothesis
- .
- Lesson 327 — Decision Rules: Reject or Fail to RejectLesson 427 — Interpreting Chi-Squared Test Results
- Rejecting invalid inserts
- You can't add a row with a foreign key value that doesn't exist in the parent table
- Lesson 1055 — What is Referential Integrity?
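This behavior is easy to observe with SQLite from Python. The tables are hypothetical, and the example assumes a standard SQLite build, where foreign key enforcement has to be switched on per connection with `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id)
    )
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # parent row exists: accepted

try:
    conn.execute("INSERT INTO orders VALUES (101, 999)")  # no customer 999
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, order_count)
```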
- rejection region
- the specific zone in your test statistic's distribution where the evidence is strong enough to reject the null hypothesis.
- Lesson 325 — The Rejection RegionLesson 336 — Visualizing Error Types with Sampling DistributionsLesson 345 — Directionality in Hypothesis Testing
- Rejection region shrinks
- Fewer test statistics will fall in the "reject H₀" zone
- Lesson 342 — Alpha Level Trade-offs
- Relational plots
- Explore relationships between variables (scatter, line plots with confidence intervals)
- Lesson 1281 — Introduction to Seaborn's Statistical Plots
- relationships
- between tables.
- Lesson 843 — Relational Database ConceptsLesson 1121 — Column Types, Constraints, and RelationshipsLesson 1316 — Introduction to Network Graphs and Graph Theory BasicsLesson 2087 — Stage 3: Exploratory Data Analysis
- Relevant Scales
- Help audiences grasp magnitude.
- Lesson 1939 — Context and Comparison: Making Numbers Meaningful
- Reliability
- is the probability a system survives beyond time *t*.
- Lesson 188 — Weibull Distribution: Hazard Function and ReliabilityLesson 1822 — What is a Data Pipeline?
- Remainder
- (or residual): Everything left over (like improvisations)—the noise and potential anomalies
- Lesson 1406 — Decomposing Seasonality
- Remove chart junk
- Delete unnecessary gridlines (keep only what's needed for reading values), drop borders, eliminate 3D effects, and ditch decorative fills.
- Lesson 1958 — Simplifying Visual Complexity
- Removes between-subject variability
- (some people naturally weigh more)
- Lesson 370 — Differences as the Unit of Analysis
- Removing duplicates
- Identifying and eliminating repeated entries that could skew your analysis.
- Lesson 12 — Data Cleaning and Preparation
- Removing outliers
- Identifying unusual values that might be errors or genuinely extreme cases requiring special handling.
- Lesson 12 — Data Cleaning and Preparation
- Removing redundancy
- Cleaning up result sets with unwanted duplicates
- Lesson 873 — Understanding DISTINCT: Removing Duplicate Rows
- Repeat
- steps 2-3 thousands of times (e.
- Lesson 395 — Permutation Tests for Means and BeyondLesson 703 — Sequential Model Building StrategyLesson 1492 — Rerandomization and Practical ImplementationLesson 1582 — Updating Beliefs with Test DataLesson 1590 — The Metropolis-Hastings AlgorithmLesson 1591 — Gibbs Sampling for Multivariate Posteriors
- Repeatable Read
- Once you read a value, it stays the same in your transaction
- Lesson 1116 — Transaction Isolation and Concurrency
- Repeated Measures
- Lesson 369 — When to Use a Paired t-Test
- Replace metrics with outcomes
- "5% improvement in precision" becomes "prevents 50 wasted sales calls per month"
- Lesson 2105 — Translating Between Technical and Business Language
- Report all preregistered analyses
- , not just "successful" ones
- Lesson 1929 — Avoiding Cherry-Picking Results
- Report Effect Size
- Lesson 447 — Conducting One-Way ANOVA in Practice
- Reporting and analytics
- Dashboards often aggregate data from many tables.
- Lesson 1071 — When to Denormalize: Performance Trade-offs
- Reports excel at explanation
- , providing the context, methodology, and recommendations that dashboards can't accommodate.
- Lesson 1980 — Hybrid Approaches and When to Use Both
- Repository
- The sealed, labeled package that's been officially sent and recorded
- Lesson 1993 — The Three States: Working Directory, Staging, Repository
- Representativeness
- Does your dataset reflect the full population or just a subset?
- Lesson 1169 — Clarifying Assumptions and Constraints
- reproducibility
- and **replicability** sound similar but mean different things—and both are essential for trustworthy science.
- Lesson 26 — Reproducibility vs. ReplicabilityLesson 29 — Code and Environment ManagementLesson 33 — Transparency and ExplainabilityLesson 1643 — Building Attribution FrameworksLesson 1871 — Why Version Control for Data?Lesson 1990 — What is Version Control and Why Git?Lesson 2039 — Virtual Environments: Concept and BenefitsLesson 2047 — What is Dependency Management? (+1 more)
- reproducible
- when someone else (or future-you) can take the same raw data and the same code, run it again, and get *exactly* the same results, tables, figures, and conclusions.
- Lesson 1981 — What Makes a Report Reproducible?Lesson 2036 — Code Review Practices for Data Science
- Reproducible code
- Clean GitHub repos with proper READMEs (as you've learned)
- Lesson 2141 — Building a Portfolio and Personal Brand
- Required for self-joins
- When joining a table to itself (covered later), aliases become essential.
- Lesson 924 — Using Table Aliases in Joins
- Required transformations
- "Log-transform `income` to reduce skewness"
- Lesson 1212 — EDA Summary Documentation and Next Steps
- Requirements
- Python/R version, key dependencies or link to `requirements.
- Lesson 2077 — The Purpose and Anatomy of a Good README
- requirements.txt
- file:
- Lesson 2043 — Creating and Exporting Environment SpecificationsLesson 2044 — Recreating Environments from Specifications
- Requires Reviews
- You can mandate that 1, 2, or more team members approve a pull request before it can merge.
- Lesson 2027 — Protecting Branches and Required Reviews
- Rerandomization
- is a technique where you check covariate balance *before* starting your experiment, and if balance is poor, you rerandomize until you get acceptable balance.
- Lesson 1492 — Rerandomization and Practical Implementation
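A minimal sketch of the rerandomization loop described above, assuming a single numeric covariate and an equal split into two arms. The function name `rerandomize`, the `max_abs_diff` balance threshold, and the example ages are hypothetical choices for illustration, not something the lesson prescribes.

```python
import random
import statistics

def rerandomize(covariate, max_abs_diff, seed=0, max_tries=10_000):
    """Shuffle unit indices until the arms' covariate means differ by at most max_abs_diff."""
    rng = random.Random(seed)          # fixed seed so the assignment can be reproduced
    n = len(covariate)
    indices = list(range(n))
    for _ in range(max_tries):
        rng.shuffle(indices)
        treat, control = indices[: n // 2], indices[n // 2 :]
        diff = (statistics.mean([covariate[i] for i in treat])
                - statistics.mean([covariate[i] for i in control]))
        if abs(diff) <= max_abs_diff:  # balance is acceptable: keep this assignment
            return sorted(treat), sorted(control), diff
    raise RuntimeError("no acceptable assignment found; loosen the threshold")

# Hypothetical covariate: participant ages recorded before the experiment starts.
ages = [23, 35, 41, 29, 52, 47, 31, 38, 26, 44, 33, 50]
treat, control, diff = rerandomize(ages, max_abs_diff=2.0)
```

The balance rule should be fixed before any outcome data is collected, since it changes which assignments are possible.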
- Resample your data
- with replacement many times (typically 1,000–10,000 times)
- Lesson 306 — Bootstrap for Non-Standard Problems
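The resampling step can be sketched with the standard library alone. This is an illustrative percentile bootstrap, assuming the statistic of interest is the median; the helper name `bootstrap_ci`, the 2,000 resamples, and the sample values are assumptions made for the example.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample draws n values from the data WITH replacement.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical measurements; the median has no simple textbook interval formula.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]
ci_low, ci_high = bootstrap_ci(sample, stat=statistics.median)
```

Passing a different function as `stat` reuses the same procedure for other non-standard statistics.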
- Research Goals
- Lesson 243 — Choosing the Right Sampling Method
- Research sharing
- Publish datasets for reproducibility without exposing participants
- Lesson 1901 — Synthetic Data Generation
- Reset to that state
- `git reset --hard <commit-hash>` restores your branch to that exact point
- Lesson 2021 — Recovering from Rebase Mistakes
- residual
- (or error).
- Lesson 515 — What Makes a 'Best Fit' Line?Lesson 539 — What Are Residuals?Lesson 542 — Computing Fitted Values and ResidualsLesson 711 — Visualizing Components with Decomposition PlotsLesson 742 — Components of Seasonal Decomposition
- Residual (eᵢ)
- = Yᵢ - Ŷᵢ (the difference you learned about earlier)
- Lesson 538 — What Are Fitted Values?
- Residual (e)
- "Here's how much the *actual* value differs from that prediction"
- Lesson 543 — Residuals as Unexplained Variation
- Residual Autocorrelation
- Lesson 782 — Residual Diagnostics for ARIMA
- Residual component
- (leftover random noise)
- Lesson 711 — Visualizing Components with Decomposition PlotsLesson 742 — Components of Seasonal Decomposition
- Residual deviance
- measures how poorly your *fitted* model (with all predictors) fits.
- Lesson 698 — Null and Residual Deviance
- Residual patterns
- A high R-squared can coexist with systematic patterns in your residuals—violations of the core assumptions that make your predictions unreliable.
- Lesson 537 — When R-Squared is Not EnoughLesson 1189 — Detecting Nonlinear Relationships
- Residual plots
- to check for patterns and violations
- Lesson 537 — When R-Squared is Not EnoughLesson 657 — What Are Polynomial Features?
- Residual Standard Error (RSE)
- comes in.
- Lesson 536 — Residual Standard Error (RSE)Lesson 537 — When R-Squared is Not Enough
- residuals
- the differences between each observation and its group mean—follow a normal distribution.
- Lesson 449 — Normality of ResidualsLesson 451 — Diagnostic Plots for ANOVALesson 516 — Residuals: The Distance from PredictionLesson 543 — Residuals as Unexplained VariationLesson 550 — Normality of ResidualsLesson 556 — What Are Residuals and Why Plot Them?Lesson 575 — Cook's DistanceLesson 593 — Box-Cox Transformation (+4 more)
- Resilient
- RDDs automatically recover from node failures.
- Lesson 1777 — RDDs: Resilient Distributed Datasets Fundamentals
- Resilient Distributed Dataset (RDD)
- a fault-tolerant collection partitioned across nodes.
- Lesson 1774 — What is Apache Spark and Why Use It?
- Resilient Distributed Datasets (RDDs)
- Lesson 1775 — Spark Components: Core, SQL, MLlib, Streaming
- resistant to outliers
- and extreme values.
- Lesson 51 — Interquartile Range (IQR)Lesson 54 — When to Use Each MeasureLesson 1383 — Understanding the Interquartile Range (IQR)
- Resolution (action)
- What specific decision should stakeholders make based on this evidence?
- Lesson 1933 — The Power of Narrative in Data Communication
- Resource allocation
- High-LTV customers justify higher acquisition costs (CAC) and more personalized outreach.
- Lesson 1669 — LTV Segmentation and TargetingLesson 1711 — What Are Acquisition Channels?
- Resource Constraints
- Lesson 243 — Choosing the Right Sampling Method
- Resource management
- Prevents exhausting database connection limits
- Lesson 1092 — Connection Pooling Basics
- Resource Utilization
- CPU, memory, disk I/O, and network usage during pipeline execution.
- Lesson 1856 — Key Metrics to Monitor
- Resource waste
- Running tasks that depend on failed upstream tasks wastes compute resources and makes debugging harder.
- Lesson 1840 — What is Dependency Management in Pipelines?
- Resourced
- Include rough estimates of time, cost, or personnel needed.
- Lesson 1970 — Recommendations and Next Steps
- Response expectations
- "We typically respond within 48 hours"
- Lesson 2083 — Contributing Guidelines and Contact Information
- Responsible Disclosure
- Lesson 1925 — Mitigation Strategies and Responsible DisclosureLesson 1931 — When to Push Back on Requests
- RESTRICT
- (or NO ACTION) prevents the parent operation if children exist:
- Lesson 1054 — Cascading Actions: DELETE and UPDATELesson 1057 — ON DELETE and ON UPDATE Actions
- Result
- All quotas filled, but the sample includes only shoppers willing to stop and talk
- Lesson 240 — Quota SamplingLesson 1566 — Conjugate Normal-Normal Model
- Results/Output
- What the project produces and where to find it
- Lesson 2077 — The Purpose and Anatomy of a Good README
- Retailer loyalty programs
- data sold to data brokers who build detailed consumer profiles
- Lesson 1922 — Surveillance and Secondary Data Uses
- retention
- strategies aim to prevent at-risk customers from leaving in the first place.
- Lesson 1676 — Win-Back and Retention StrategiesLesson 1696 — Feature Adoption and Usage Frequency
- Retention curves
- plot the percentage of users who *remain active* over time (Day-1: 60%, Day-7: 40%, Day-30: 25%).
- Lesson 1660 — Retention Curves vs Churn AnalysisLesson 1661 — What is Customer Lifetime Value (LTV)?Lesson 1678 — What is Funnel Analysis?
- Retention insights
- See if customers stick around longer over time
- Lesson 1644 — What is Cohort Analysis?
- Retraining is constant
- You must retrain models regularly to capture new patterns
- Lesson 2128 — Data Distribution Shifts Frequently
- retry logic
- for transient errors, **idempotency** so rerunning doesn't corrupt data, **checkpointing** to resume mid-pipeline, and **monitoring/alerts** for quick detection.
- Lesson 1825 — Designing Pipeline ArchitectureLesson 1854 — Testing Error Handling
- Reusability
- When you'll reference the same result set multiple times
- Lesson 974 — When to Use FROM Subqueries vs CTEsLesson 1106 — Parameter Placeholders: Named Parameters
- Reusable functions and modules
- that multiple projects import
- Lesson 2074 — Notebooks vs Scripts: When to Use Each
- Revenue
- is the quintessential lagging indicator—it tells you what already happened.
- Lesson 1600 — Business Examples: Revenue vs Pipeline
- Revenue accuracy
- Which model's channel weights best predict revenue when you shift budget?
- Lesson 1734 — Comparing and Validating Attribution Models
- Revenue churn
- measures *how much MRR* you lost from cancellations.
- Lesson 1628 — SaaS Metrics: MRR, ARR, and Logo Churn
- Revenue forecasting
- Estimate lifetime value by modeling expected subscription duration
- Lesson 838 — Subscription and Membership Duration ModelingLesson 1644 — What is Cohort Analysis?
- Revenue per user
- = total revenue / users (not just "made $50k!
- Lesson 1613 — Raw Counts vs. Rates and Ratios
- Revenue-focused
- Lesson 1516 — Business Metrics: Definition and Examples
- reverse
- conditional probabilities—it lets you flip P(A|B) into P(B|A).
- Lesson 107 — Bayes' Theorem Formula and ComponentsLesson 430 — Common Applications and Pitfalls
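As a rough illustration of the flip from P(A|B) to P(B|A), the sketch below applies Bayes' theorem to a screening-test scenario. The function name `posterior` and the numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) are hypothetical.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(B|A) given P(B), P(A|B), and P(A|not B)."""
    # Total probability of observing A, summed over B and not-B.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# A = positive test, B = has the condition (illustrative numbers only).
p_condition_given_positive = posterior(
    prior=0.01, likelihood=0.95, false_positive_rate=0.05
)
print(round(p_condition_given_positive, 3))
```

With these inputs the posterior is far smaller than the 95% sensitivity, which is the usual reason for not confusing the two conditional probabilities.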
- Reverse causality
- occurs when two variables are correlated, but the direction of influence is the reverse of what you thought.
- Lesson 496 — Reverse CausalityLesson 553 — Exogeneity: X Must Be Independent of ErrorsLesson 1424 — Reverse CausalityLesson 1464 — Instrumental Variables: The Endogeneity Problem
- Reverse causation
- Maybe Y causes X, not X causes Y
- Lesson 493 — The Fundamental Difference: Association vs Cause-and-Effect
- Reverse geocoding
- works the opposite direction: you have coordinates (42.
- Lesson 1315 — Geocoding and Reverse Geocoding
- Reversibility is high
- Changes can be rolled back easily if problems emerge later
- Lesson 1522 — Balancing Speed and Accuracy in Metric Selection
- Reversing range logic
- Lesson 868 — The NOT Operator
- Reversing the Hypotheses
- Lesson 313 — Common Pitfalls in Hypothesis Formulation
- Review against WCAG checklist
- document what passes and what needs fixing
- Lesson 1254 — Testing Visualizations for Accessibility
- Review checkpoint
- Show results to stakeholders at sprint end
- Lesson 2113 — Timeboxing and Sprint Planning for Data Projects
- Review logs
- Examine both application logs and database server logs for detailed error messages
- Lesson 1093 — Troubleshooting Connection Issues
- Review notebook-specific PRs carefully
- Understand that diffs may still be noisy even with best practices.
- Lesson 2030 — Version Control for Notebooks: Challenges and Solutions
- Review promptly
- Respect the author's time by reviewing within a day or two.
- Lesson 2024 — Code Review Best Practices
- Review recent changes
- Check pipeline code commits, configuration changes, or dependency updates around when the issue started
- Lesson 1870 — Root Cause Analysis for Quality Issues
- Reweighting
- Adjust training data by giving higher weight to underrepresented or historically disadvantaged groups.
- Lesson 1894 — Auditing and Remediation Strategies
- Rework
- means repeating work because something was missed, misunderstood, or poorly executed the first time—rerunning analysis because you forgot to document your seed, rebuilding features because requirements weren't clarified, or re-validating a model b...
- Lesson 2112 — Iteration vs Rework: Learning from Each Cycle
- Rideshare apps
- Drivers in treatment might reduce wait times for riders in control
- Lesson 1527 — Ignoring Network Effects
- Ridge regression
- modifies least squares by adding a penalty proportional to the *squared* coefficient values.
- Lesson 586 — Remedies: Regularization Preview
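To make the squared penalty concrete, here is a sketch for the simplest case: one predictor with an unpenalized intercept, where the ridge slope has the closed form Sxy / (Sxx + λ). The function name `ridge_slope` and the data are illustrative assumptions; real workflows would normally rely on a library implementation.

```python
def ridge_slope(x, y, lam):
    """Ridge slope for one predictor: minimizes sum((y - a - b*x)^2) + lam * b^2."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / (sxx + lam)  # lam = 0 recovers ordinary least squares

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
slopes = {lam: ridge_slope(x, y, lam) for lam in (0.0, 1.0, 10.0)}
```

Larger values of `lam` pull the slope toward zero, which is the shrinkage effect the penalty is meant to produce.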
- Right
- H₀: The drug has no effect (μ = 0), H₁: The drug works (μ > 0)
- Lesson 313 — Common Pitfalls in Hypothesis Formulation
- RIGHT JOIN
- returns *every row from the right (second) table*, along with matching data from the left (first) table where available.
- Lesson 929 — RIGHT JOIN Syntax and SemanticsLesson 936 — FULL OUTER JOIN Syntax
- Right to erasure
- ("right to be forgotten"): People can request deletion of their data, impacting training datasets and model retraining
- Lesson 1904 — What is GDPR and Why It MattersLesson 1909 — Right to Erasure and Data Retention PoliciesLesson 1911 — GDPR Compliance for Data Scientists
- Right to explanation
- Individuals can demand to understand automated decisions affecting them—black-box models become problematic
- Lesson 1904 — What is GDPR and Why It Matters
- Right to Withdraw
- Lesson 1913 — Elements of Valid Consent
- Right-continuous
- It's continuous from the right side at jump points
- Lesson 810 — The Survival Function S(t)
- Right-skewed
- The distribution has a long tail extending to the right (high values)
- Lesson 178 — Log-Normal Distribution: Definition and Properties
- Right-skewed (positive skew)
- A long tail stretches to the right; most values cluster at the lower end (e.
- Lesson 1175 — Histograms for Distribution Shape
- Risk
- With small samples, you might accidentally get imbalanced groups (e.
- Lesson 1437 — Randomization Mechanisms
- Risk assessment
- Two investments with the same average return might have wildly different risks
- Lesson 46 — What is Variability?
- Risk of gaming exists
- Surrogates might improve while harming long-term value
- Lesson 1522 — Balancing Speed and Accuracy in Metric Selection
- Risk tolerance
- High variance might be unacceptable even with better expected value
- Lesson 152 — Decision Making Under Uncertainty
- Risk-adjusted returns
- balancing profitability with stability
- Lesson 1716 — Channel Mix and Portfolio Thinking
- River One: Statistics (1800s–1900s)
- Lesson 5 — The Evolution of Data Science
- River Two: Computing (1950s–1990s)
- Lesson 5 — The Evolution of Data Science
- ROAS < 1
- You're losing money directly on ad spend (spending more than you earn)
- Lesson 1751 — Return on Ad Spend (ROAS): Definition and Calculation
- ROAS = 1
- Breaking even on ad spend (but likely unprofitable after other costs)
- Lesson 1751 — Return on Ad Spend (ROAS): Definition and Calculation
- ROAS > 1
- Generating positive return, but profitability depends on margins
- Lesson 1751 — Return on Ad Spend (ROAS): Definition and Calculation
- robust
- to extreme values than standard deviation because it doesn't square deviations (which amplifies outliers).
- Lesson 52 — Mean Absolute Deviation (MAD)Lesson 115 — Prior Sensitivity AnalysisLesson 363 — Testing Equality of VariancesLesson 380 — Testing Equal Variances: Levene's and Bartlett's TestsLesson 450 — Homogeneity of Variance (Homoscedasticity)Lesson 1572 — Sensitivity Analysis and Prior Robustness
- Robust regression
- techniques offer an alternative: they fit models that automatically downweight or ignore outliers during estimation, so extreme points don't drag your fitted line off course.
- Lesson 590 — Robust Regression Techniques
- robustness
- Lesson 397 — Power and Efficiency of Non-Parametric TestsLesson 452 — Consequences of Assumption ViolationsLesson 475 — Choosing Between Parametric and Non-Parametric Tests
- Robustness Testing
- ensures your model performs consistently.
- Lesson 2089 — Stage 5: Model Development and Validation
- ROI measurement
- Understand true return on marketing investment
- Lesson 1718 — Introduction to Marketing Attribution
- Role-play each audience type
- with a colleague
- Lesson 1956 — Anticipating and Addressing Audience Questions
- Rollback Mechanisms
- Simulate a mid-pipeline failure during a database write or transformation.
- Lesson 1854 — Testing Error Handling
- Rolling a die
- Lesson 78 — Events as Subsets of the Sample SpaceLesson 82 — Collectively Exhaustive Events
- Rolling statistics
- Mean and variance shouldn't drift systematically
- Lesson 741 — Testing Stationarity After Transformation
- Rolling window
- Train on a fixed-size window (e.
- Lesson 789 — Overfitting and Cross-Validation for Time Series
- Root cause analysis
- becomes nearly impossible when you discover issues weeks later
- Lesson 2136 — Monitoring Gaps and Silent Failures
- Rotating 3D views
- Spin a 3D plot to reveal all angles
- Lesson 1327 — Creating Animations with FuncAnimation
- Roughly constant variance
- the noise level should be stable
- Lesson 709 — Irregular Component: Random Noise
- row
- in the table
- Lesson 1117 — What is an ORM and Why Use It?Lesson 1358 — facet_grid() for Two Variables
- Row-level aggregations
- Comparing individual values to group statistics
- Lesson 967 — Subqueries in the SELECT Clause
- Row-level analytics
- that require context from other rows without losing detail
- Lesson 1005 — Introduction to Window Functions
- Rows
- Estimated vs actual row counts.
- Lesson 1084 — Reading and Interpreting Query Execution PlansLesson 1647 — Building a Cohort Table
- Rows (Records)
- Each row represents a single instance or observation.
- Lesson 843 — Relational Database Concepts
- RSS
- = Residual Sum of Squares (the sum of all squared residuals)
- Lesson 536 — Residual Standard Error (RSE)
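A small sketch of how RSS feeds into the residual standard error, assuming the usual degrees-of-freedom correction n - p - 1 (n - 2 for simple regression). The function names and the actual and fitted values are hypothetical.

```python
import math

def rss(actual, fitted):
    """Residual sum of squares: the sum of squared (actual - fitted) differences."""
    return sum((y - f) ** 2 for y, f in zip(actual, fitted))

def rse(actual, fitted, n_predictors=1):
    """Residual standard error: sqrt(RSS / (n - p - 1))."""
    dof = len(actual) - n_predictors - 1
    return math.sqrt(rss(actual, fitted) / dof)

# Illustrative values only; fitted values would come from a regression model.
actual = [3.0, 5.0, 7.5, 9.0]
fitted = [3.2, 5.1, 7.0, 9.2]
model_rss = rss(actual, fitted)
model_rse = rse(actual, fitted)
```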
- Rule
- Check that both np̂ ≥ 10 *and* n(1-p̂) ≥ 10, where n is your sample size and p̂ is your sample proportion.
- Lesson 282 — Checking Assumptions for Proportion Intervals
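The check above amounts to requiring at least 10 observed successes and at least 10 observed failures. A minimal sketch, with a hypothetical helper name and example counts:

```python
def large_sample_ok(successes, n, threshold=10):
    """True when both the success count and the failure count reach the threshold."""
    failures = n - successes
    # n * p_hat equals the success count; n * (1 - p_hat) equals the failure count.
    return successes >= threshold and failures >= threshold

balanced = large_sample_ok(successes=30, n=100)   # 30 successes, 70 failures
lopsided = large_sample_ok(successes=45, n=50)    # only 5 failures
```

Working with counts rather than multiplying by the proportion avoids floating-point rounding at the boundary.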
- Rule 1
- Each drawer label describes one type of information (not "Age&Address").
- Lesson 1143 — The Three Rules of Tidy DataLesson 1402 — Western Electric Rules
- Rule 2
- Each folder holds one person's complete record (not scattered pieces).
- Lesson 1143 — The Three Rules of Tidy DataLesson 1402 — Western Electric Rules
- Rule 3
- Employee files and project files live in separate cabinets (not jumbled together).
- Lesson 1143 — The Three Rules of Tidy DataLesson 1402 — Western Electric Rules
- Rule 4
- Eight consecutive points on one side of the centerline (even if within 1σ)
- Lesson 1402 — Western Electric Rules
- Rule of thumb
- When sampling without replacement, your sample size should be less than 10% of the population to maintain approximate independence.
- Lesson 282 — Checking Assumptions for Proportion IntervalsLesson 577 — DFBETAS: Influence on Individual CoefficientsLesson 1467 — Testing Instrument Strength and ValidityLesson 1481 — Unit of Randomization
- Rule-of-thumb approaches
- Use formulas based on sample size and variance
- Lesson 1463 — RDD Bandwidth Selection and Local Estimation
- Run optimization
- using constrained optimization algorithms (like scipy's `minimize` with bounds)
- Lesson 1742 — Budget Optimization Using MMM
- Run Robustness Checks
- Lesson 579 — What to Do with Influential Points
- Run statistical tests
- Apply Shapiro-Wilk (for smaller samples) or Anderson-Darling (for general use).
- Lesson 210 — Combining Visual and Statistical Methods
- Run tests longer
- Allow time for behaviors to stabilize (typically 2-4 weeks minimum for behavioral changes)
- Lesson 1525 — Novelty and Primacy Effects
- Run the experiment
- Increase marketing in test regions for a fixed period
- Lesson 1746 — Geo-Lift Experiments
- Running hypothesis tests
- (t-tests, z-tests) that rely on normal theory
- Lesson 202 — Why Test for Normality?
- Running totals
- or moving averages while preserving individual transactions
- Lesson 1005 — Introduction to Window Functions
- Runs in linear time
- Under typical conditions, achieves O(n) complexity instead of O(n²)—a massive speedup for large datasets
- Lesson 1416 — PELT Algorithm: Pruned Exact Linear Time
- Russian nesting dolls
- the innermost subquery runs first, its result becomes a table for the next level up, and so on.
- Lesson 973 — Nested Subqueries in FROM
S
- S Charts
- Preferred for larger subgroups (n > 10) where range becomes less efficient at capturing true variability.
- Lesson 1399 — Control Charts for Variability (R and S Charts)
- S-shaped curve
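- Points bend away from the reference line at both ends, which usually indicates tails heavier or lighter than normal; a one-sided curve instead suggests skew (right skew = curve bends up on right; left skew = bends down on left)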
- Lesson 204 — Q-Q Plots: Theory and InterpretationLesson 565 — What Q-Q Plots Show: Comparing Residual Distribution to NormalLesson 566 — Reading Q-Q Plots: Interpreting Points Along the Reference Line
- S(∞) = 0
- Eventually, everyone experiences the event (in theory)
- Lesson 810 — The Survival Function S(t)
- SaaS Products
- Lesson 1657 — Day-1, Day-7, Day-30 Benchmarks
- SaaS Sign-up
- Landing Page → Sign-up Form → Email Verification → Onboarding → First Use
- Lesson 1678 — What is Funnel Analysis?
- SaaS tools
- 10-30% (depends on use case)
- Lesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU)
- Sales Analysis
- Lesson 908 — Multi-Level Grouping in Business Analytics
- Sales expenses
- sales team salaries and commissions, sales software (CRM, outreach tools), travel and entertainment
- Lesson 1753 — Customer Acquisition Cost (CAC): Components and Calculation
- Sales pipeline metrics
- , on the other hand, are leading indicators.
- Lesson 1600 — Business Examples: Revenue vs Pipeline
- Sales(t)
- is your outcome variable at time *t* (weekly sales, conversions, etc.
- Lesson 1738 — The Core MMM Regression Model
- same number of columns
- with **compatible data types**.
- Lesson 998 — Introduction to Set OperationsLesson 1001 — INTERSECT: Finding Common Rows
- same variance
- (even if their means differ).
- Lesson 361 — Pooled Variance t-TestLesson 379 — The Assumption of Equal Variances (Homoscedasticity)
- Same-store sales (SSS)
- , also called "comparable store sales" or "comps," isolates growth from stores open at least 12-13 months, revealing organic performance by controlling for expansion.
- Lesson 1634 — Retail Metrics: Same-Store Sales and Inventory Turnover
- sample
- a subset meant to represent the population.
- Lesson 50 — Population vs Sample VarianceLesson 229 — Defining Samples and StatisticsLesson 230 — Why We Sample Instead of CensusLesson 232 — Notation ConventionsLesson 237 — Cluster SamplingLesson 261 — Standard Error vs Standard Deviation
- Sample distribution
- is one snapshot from that album — maybe 100 randomly selected people.
- Lesson 258 — Comparing Population, Sample, and Sampling Distributions
- Sample from each stratum
- Use simple random sampling *within* each stratum, maintaining the correct proportions
- Lesson 236 — Stratified Sampling
- Sample Mean (x̄)
- The expected value of the sample mean equals the population mean (μ).
- Lesson 255 — Expected Value of Sample Statistics
- Sample Proportion (p̂)
- The expected value equals the true population proportion (p).
- Lesson 255 — Expected Value of Sample Statistics
- Sample quantiles
- (your actual residual values, sorted) on the y-axis
- Lesson 565 — What Q-Q Plots Show: Comparing Residual Distribution to Normal
- Sample size
- (n): Larger samples → smaller SE
- Lesson 260 — Defining Standard ErrorLesson 294 — Margin of Error and Its ComponentsLesson 324 — Common Significance Levels: 0.05, 0.01, and 0.10Lesson 389 — Reporting Effect Sizes in PracticeLesson 1549 — Prior-Likelihood Trade-offsLesson 1692 — Statistical Significance and IterationLesson 1749 — Measuring Statistical Significance
- Sample size (n)
- Larger samples → smaller standard error → smaller margin of error.
- Lesson 271 — Margin of ErrorLesson 335 — Calculating Type II Error Probability (Beta)Lesson 343 — Calculating Power for Common TestsLesson 344 — Power Analysis in Study DesignLesson 1496 — The Four Parameters of Sample Size Calculation
- Sample size calculation
- Based on your Minimum Detectable Effect and power
- Lesson 1485 — Documentation and Pre-RegistrationLesson 1494 — Effect Size: The Minimum Detectable EffectLesson 1508 — Pre-Registration and Correction Strategy
- Sample size challenges
- Intersectional groups may be small, making statistical analysis harder
- Lesson 1893 — Intersectionality in Fairness
- Sample size is large
- More observations make the data speak louder than assumptions
- Lesson 115 — Prior Sensitivity Analysis
- Sample size is small
- With little data, your starting belief dominates
- Lesson 115 — Prior Sensitivity Analysis
- Sample size limitations
- "Based on 500 customers, we're confident in the direction but not precise magnitude"
- Lesson 2122 — When Uncertainty Is Acceptable
- Sample size matters
- Typically, n ≥ 30 is considered sufficient for the CLT to "kick in," though it depends on how non-normal the original population is.
- Lesson 218 — What the Central Limit Theorem States
- Sample Size Per Group
- Lesson 446 — Power and Sample Size for ANOVA
- Sample variance
- Divide by **N-1** (one less than your sample size)
- Lesson 50 — Population vs Sample VarianceLesson 255 — Expected Value of Sample Statistics
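The effect of the divisor can be seen in a short sketch. The helper `variance` and the data are made up for illustration; the standard library's `statistics.variance` and `statistics.pvariance` are used only as a cross-check.

```python
import statistics

def variance(values, sample=True):
    """Divide the sum of squared deviations by n - 1 (sample) or n (population)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return squared_deviations / (n - 1 if sample else n)

data = [4, 8, 6, 5, 7]
sample_var = variance(data, sample=True)        # divides by 4
population_var = variance(data, sample=False)   # divides by 5
```

The sample version is always the larger of the two, and the gap shrinks as the number of observations grows.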
- Sampling
- Training a facial recognition model primarily on one demographic
- Lesson 1878 — What is Bias in Data?Lesson 2055 — Why Randomness Matters in Data Science
- Sampling bias
- is a systematic error in how you collect your sample that pushes your results in one direction, away from the truth.
- Lesson 248 — Sampling Error vs Sampling BiasLesson 249 — Coverage Error and UndercoverageLesson 1879 — Selection Bias and Sampling Bias
- sampling distribution
- is the probability distribution of a statistic (like the mean, median, or proportion) computed from *all possible samples* of a fixed size drawn from the same population.
- Lesson 251 — What is a Sampling Distribution?Lesson 257 — Shape of Sampling DistributionsLesson 258 — Comparing Population, Sample, and Sampling Distributions
- Sampling error
- is the natural, random variation you get just because you didn't measure everyone.
- Lesson 248 — Sampling Error vs Sampling Bias
- Sampling new records
- Generate fresh rows that follow the learned patterns but represent no actual person
- Lesson 1901 — Synthetic Data Generation
- Sampling zeros
- People who *could* experience it but happened not to (e.
- Lesson 695 — Zero-Inflated Models
- SARIMA
- (Seasonal ARIMA) adds a second layer of similar components that operate specifically on the seasonal lags.
- Lesson 795 — Seasonal ARIMA (SARIMA) Structure
- Saturated model
- Perfect fit with one parameter per observation
- Lesson 697 — Deviance: A Measure of Model Fit
- Saturation
- is the intensity or purity of the color, ranging from vivid/vibrant to dull/grayish.
- Lesson 1234 — Color: Hue, Saturation, and Luminance
- Say
- "For every additional hour of study time, we expect students' test scores to increase by about 2.
- Lesson 530 — Communicating Results to Non-Technical AudiencesLesson 1955 — Framing Insights in Business Language
- Scalability
- Handles concurrent requests efficiently in multi-threaded or async applications
- Lesson 1092 — Connection Pooling BasicsLesson 1816 — What is ELT? Extract, Load, Transform ExplainedLesson 1822 — What is a Data Pipeline?
- Scale parameter (λ)
- Stretches or compresses the distribution along the time axis
- Lesson 187 — The Weibull Distribution: Shape, Scale, and SurvivalLesson 189 — Fitting Weibull Models to Lifetime Data
- Scale Transformations
- Switch to logarithmic scales with `set_xscale('log')` when data spans multiple orders of magnitude (think: population sizes from villages to countries).
- Lesson 1270 — Customizing Axes: Labels, Limits, and Scales
- Scale-Location plot
- solves this by plotting the *square root* of the *absolute value* of standardized residuals against fitted values.
- Lesson 560 — Scale-Location Plot (Spread-Location Plot)
- Scatter plot matrices
- (Lesson 1191) visually show near-perfect linear relationships
- Lesson 1197 — Identifying Variable Importance and Redundancy
- Scatter plots
- remain your most powerful tool here.
- Lesson 1189 — Detecting Nonlinear RelationshipsLesson 1284 — Pair Plots for Multivariate Exploration
- Schedule quarterly reviews
- with stakeholders to assess whether the tree still represents reality and strategy.
- Lesson 1626 — Maintaining and Evolving Metric Trees
- Scheduler
- Monitors DAGs and triggers tasks when dependencies are met
- Lesson 1833 — Introduction to Apache Airflow
- Scheduling
- is like setting alarm clocks: "Run this job every day at 2 AM.
- Lesson 1832 — Orchestration vs Scheduling
- schema
- is an organizational container that groups related tables together.
- Lesson 846 — Tables, Schemas, and Data TypesLesson 1151 — Schema Validation
- Schema assumptions
- Your code expects a column named `user_id`, but upstream decides to rename it to `customer_id`.
- Lesson 2133 — Undocumented Data Dependencies
- Schema awareness
- Spark knows your column names and data types
- Lesson 1778 — DataFrames and Spark SQL Basics
- Schema Changes
- Tracking modifications to data structure (new columns, type changes, renamed fields).
- Lesson 1856 — Key Metrics to MonitorLesson 2136 — Monitoring Gaps and Silent Failures
- Schema extraction
- pulls structural information: column names, data types, primary keys, constraints.
- Lesson 2067 — Automating Documentation with Code
- Schema validation
- checks structural requirements:
- Lesson 1826 — Data Validation and Schema Enforcement
- scikit-learn
- for prediction-focused workflows and machine learning pipelines.
- Lesson 545 — Extracting Residuals and Fitted Values in PythonLesson 2058 — Seed Scope and Multiple Libraries
- Scoped
- "Identify the top 3 pages where users abandon our checkout process, so we can redesign them to increase completed purchases by 10%"
- Lesson 10 — Problem Definition and ScopingLesson 1166 — Defining the Business Question
- Scoping
- means setting clear boundaries: What will you measure?
- Lesson 10 — Problem Definition and Scoping
- Screen reader testing
- with tools like NVDA, JAWS, or VoiceOver reveals whether your alternative text and data tables are actually helpful
- Lesson 1254 — Testing Visualizations for Accessibility
- Scripts
- are executable files that run a complete workflow — useful for automation and reproducibility.
- Lesson 2071 — Modular Code: Functions and Scripts
- SE
- is the standard error of the mean
- Lesson 269 — Confidence Interval Formula for One MeanLesson 287 — Confidence Intervals for the Difference Between Two ProportionsLesson 353 — Calculating the t-StatisticLesson 402 — Calculating the Test Statistic for ProportionsLesson 409 — Z-Test Statistic for Two Proportions
- SE(p̂)
- is the standard error of the proportion: √(p̂(1-p̂)/n)
- Lesson 278 — Confidence Interval Formula for One Proportion
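The formula above can be evaluated directly. This is a minimal sketch; the function name `standard_error_proportion` and the example values are illustrative assumptions.

```python
import math


def standard_error_proportion(p_hat, n):
    """Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= p_hat <= 1:
        raise ValueError("p_hat must be between 0 and 1")
    return math.sqrt(p_hat * (1 - p_hat) / n)


# Hypothetical sample: 50 successes out of 100 observations.
se = standard_error_proportion(0.5, 100)  # approximately 0.05
```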
- Seaborn
- was built specifically to improve on Matplotlib's defaults.
- Lesson 1371 — Default Aesthetics and Design ChoicesLesson 1373 — Statistical Transformations: Built-in vs Manual
- Seaborn FacetGrid
- Similar benefits to ggplot2, with convenient statistical plotting functions built in
- Lesson 1372 — Faceting: ggplot2 vs Seaborn and Matplotlib Subplots
- Seaborn's FacetGrid
- follows a similar declarative philosophy.
- Lesson 1372 — Faceting: ggplot2 vs Seaborn and Matplotlib Subplots
- Seamless visualization
- Plotting libraries expect data in predictable formats.
- Lesson 1149 — Benefits of Tidy Data for Downstream Work
- Search and matching failures
- "café" might not match "cafe" in pattern searches.
- Lesson 1139 — Dealing with Special Characters and Unicode
- Searched CASE
- is more flexible—each WHEN clause can contain any boolean condition.
- Lesson 1031 — Simple CASE vs Searched CASE
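A small sketch of a searched CASE, run through Python's built-in `sqlite3` module so it is self-contained. The `orders` table, its columns, and the labels are hypothetical. Note that each WHEN clause tests a different column, which a simple CASE cannot do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, is_rush INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 20.0, 0), (2, 250.0, 0), (3, 40.0, 1)],
)

# Conditions are evaluated top to bottom; the first true one wins.
rows = conn.execute(
    """
    SELECT id,
           CASE
               WHEN is_rush = 1 THEN 'priority'
               WHEN amount >= 100 THEN 'large'
               ELSE 'standard'
           END AS label
    FROM orders
    ORDER BY id
    """
).fetchall()
```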
- Seasonal
- Fixed period (always 365 days for annual patterns)
- Lesson 708 — Cyclical Patterns: Non-Fixed FluctuationsLesson 711 — Visualizing Components with Decomposition PlotsLesson 742 — Components of Seasonal DecompositionLesson 744 — Classical Decomposition MethodsLesson 747 — Interpreting Decomposition PlotsLesson 767 — Holt-Winters Additive ModelLesson 795 — Seasonal ARIMA (SARIMA) Structure
- Seasonal AR terms
- appear as significant spikes in the PACF at seasonal lags that cut off, while the ACF shows a gradual decay at those seasonal intervals.
- Lesson 796 — Identifying Seasonal Patterns
- seasonal component
- , you might misinterpret normal variation as something special (or vice versa).
- Lesson 707 — Seasonality: Regular Periodic PatternsLesson 711 — Visualizing Components with Decomposition PlotsLesson 742 — Components of Seasonal Decomposition
- Seasonal decomposition
- – separating the data into trend, seasonal, and residual components
- Lesson 1405 — What is Seasonal Hybrid ESD?
- Seasonal differencing
- works the same way, but instead of subtracting adjacent points, you subtract observations that are *one full season apart*.
- Lesson 737 — Seasonal DifferencingLesson 797 — Seasonal Differencing
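To make the "one full season apart" idea concrete, here is a minimal sketch. The function name `seasonal_difference` and the quarterly figures are hypothetical.

```python
def seasonal_difference(series, period):
    """Subtract the observation one full season earlier from each value."""
    return [series[i] - series[i - period] for i in range(period, len(series))]


# Hypothetical quarterly data covering two years (period = 4).
quarterly = [10, 20, 30, 40, 12, 23, 33, 44]
differenced = seasonal_difference(quarterly, 4)  # [2, 3, 3, 4]
```

The result is shorter than the input by one season, because the first `period` observations have nothing to be compared against.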
- Seasonal effects
- If your business has monthly billing cycles, holiday shopping patterns, or fiscal calendar impacts, your test duration should span these periods.
- Lesson 1484 — Duration and Timing Considerations
- Seasonal equation
- Updates the seasonal pattern for each period
- Lesson 767 — Holt-Winters Additive ModelLesson 768 — Holt-Winters Multiplicative Model
- Seasonal fluctuations remain constant
- in absolute size regardless of the trend level
- Lesson 743 — Additive vs Multiplicative Models
- Seasonal Hybrid ESD
- approach you've learned extends to multiple periods by iteratively or simultaneously accounting for each cycle.
- Lesson 1408 — Handling Multiple Seasonal Periods
- Seasonal MA terms
- show up as significant spikes in the ACF at seasonal lags (12, 24, 36) while cutting off after a certain seasonal lag.
- Lesson 796 — Identifying Seasonal Patterns
- seasonal pattern
- evolves over time.
- Lesson 769 — Smoothing Parameters: Alpha, Beta, GammaLesson 771 — Forecasting with Holt-Winters
- seasonal patterns
- that need specialized modeling.
- Lesson 726 — Using ACF for Model IdentificationLesson 760 — Forecasting with Simple Exponential Smoothing
- Seasonality
- Do ice cream sales spike every summer?
- Lesson 19 — Temporal Data and Time SeriesLesson 708 — Cyclical Patterns: Non-Fixed FluctuationsLesson 710 — Additive vs Multiplicative ModelsLesson 711 — Visualizing Components with Decomposition PlotsLesson 765 — Introduction to Holt-Winters MethodLesson 1406 — Decomposing SeasonalityLesson 1412 — What is Change-Point Detection?Lesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU) (+1 more)
- Seasonally adjusted data
- is your original time series with the seasonal component removed, leaving you with just the trend and irregular components.
- Lesson 748 — Seasonally Adjusted Data
- Second batch arrives
- Use Beta(12, 17) as your new prior → observe 5 successes, 8 failures → get Beta(17, 25) posterior
- Lesson 1563 — Sequential Updating with New Data
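The arithmetic in this entry follows the Beta-Binomial conjugate update, where successes are added to the first parameter and failures to the second. A minimal sketch, with `update_beta` as a hypothetical helper name:

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update of a Beta(alpha, beta) prior with binomial data."""
    return alpha + successes, beta + failures


# Prior Beta(12, 17), then 5 successes and 8 failures are observed.
posterior = update_beta(12, 17, successes=5, failures=8)  # (17, 25)

# Posterior mean of a Beta(a, b) distribution is a / (a + b).
posterior_mean = posterior[0] / sum(posterior)
```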
- Second difference
- Control group's change = (After - Before)
- Lesson 1452 — The Difference-in-Differences Setup
- Second difference (DiD)
- Subtract the control group's change from the treatment group's change:
- Lesson 1454 — Calculating the DiD Estimator
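The subtraction described here can be sketched as follows. The function name `did_estimate` and the group averages are made-up values for illustration.

```python
def did_estimate(treat_before, treat_after, control_before, control_after):
    """Difference-in-differences: treatment change minus control change."""
    treatment_change = treat_after - treat_before
    control_change = control_after - control_before
    return treatment_change - control_change


# Hypothetical group averages before and after the intervention.
effect = did_estimate(
    treat_before=100, treat_after=130,
    control_before=90, control_after=105,
)  # (130 - 100) - (105 - 90) = 15
```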
- Second evidence (witness testimony)
- Use that 60% as your new prior → apply Bayes' Theorem again → posterior becomes 85%.
- Lesson 114 — Sequential Updating
- Second layer
- Three supporting pillars: "Customer surveys show strong demand," "A/B test validated the prediction," "Risk analysis shows minimal downside."
- Lesson 1952 — The Pyramid Principle: Leading with Conclusions
- Second-order differencing
- means you difference the already-differenced data:
- Lesson 736 — Higher-Order Differencing
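A minimal sketch of differencing twice, using a hypothetical helper `difference` and a quadratic series chosen so that the second difference comes out constant:

```python
def difference(series):
    """First difference: each value minus the one before it."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical series with a quadratic trend.
series = [1, 4, 9, 16, 25]
first = difference(series)   # [3, 5, 7, 9]  (still trending)
second = difference(first)   # [2, 2, 2]     (constant)
```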
- Secondary metrics
- protect you from winning the battle but losing the war.
- Lesson 1478 — Defining Success MetricsLesson 1485 — Documentation and Pre-Registration
- Secondary use
- occurs when data collected for one specific purpose gets repurposed for something else—often without obtaining fresh consent from the individuals involved.
- Lesson 1915 — Secondary Use and Scope Creep
- Secure auctions
- where bids remain secret until the winner is determined
- Lesson 1903 — Secure Multi-Party Computation
- Security updates
- Credentials refresh, access control adjustments
- Lesson 1979 — Maintenance and Sustainability Considerations
- See dynamic effects
- Does the policy effect grow or fade over time?
- Lesson 1457 — Multiple Time Periods and Staggered Adoption
- Seek peer review
- from colleagues with no stake in the outcome
- Lesson 35 — Conflicts of Interest and Independence
- Segment by path type
- – compare conversion rates across different journey patterns
- Lesson 1683 — Multi-Path and Non-Linear Funnels
- Segment by user tenure
- Compare new users (no primacy effect) separately from existing users
- Lesson 1525 — Novelty and Primacy Effects
- Segment by user type
- Power users, casual users, and at-risk users have different engagement profiles
- Lesson 1693 — Defining User Engagement
- Segment differences
- Compare curves using the log-rank test to see which groups need different retention strategies
- Lesson 835 — Customer Churn Prediction with Survival Analysis
- Segment insights
- Do paid users stick around longer than free users?
- Lesson 1659 — Comparing Retention Across Cohorts
- SELECT
- Applies aggregate functions to each group and projects columns
- Lesson 896 — GROUP BY Execution OrderLesson 909 — Combining Multiple Groups with SELECTLesson 912 — Fundamental Difference: Filter Timing
- SELECT columns
- Choose which columns to display from either or both tables
- Lesson 919 — Basic INNER JOIN Syntax
- Select every kth element
- Starting from position 3, select every 10th element: the 3rd, 13th, 23rd, 33rd.
- Lesson 235 — Systematic Sampling
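The selection rule in this entry maps neatly onto Python slicing. In this sketch, `systematic_sample` is a hypothetical helper and the population is simply the numbers 1 to 40.

```python
def systematic_sample(items, start, k):
    """Select every k-th element, beginning at the 1-based position `start`."""
    if start < 1 or k < 1:
        raise ValueError("start and k must be at least 1")
    return items[start - 1::k]


# Hypothetical population labelled 1..40.
population = list(range(1, 41))
sample = systematic_sample(population, start=3, k=10)  # [3, 13, 23, 33]
```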
- Select the parameters
- that minimize the chosen error metric
- Lesson 772 — Holt-Winters Parameter Optimization
- Select What Matters
- Lesson 1217 — The Transition from Explore to Explain
- Selectboxes
- provide dropdown menus for choosing from predefined options:
- Lesson 1332 — Streamlit Widgets: Inputs and Controls
- Selecting features
- that matter most: removing redundant or irrelevant variables that add noise without signal, reducing dimensionality while preserving information.
- Lesson 2088 — Stage 4: Feature Engineering and Preparation
- selection bias
- and **nonresponse bias** you've already learned—survivorship bias is a specific type where the "non-survivors" physically can't be in your dataset.
- Lesson 247 — Survivorship BiasLesson 1432 — Colliders and Bad ControlsLesson 1473 — Conditioning on Colliders: Selection BiasLesson 1526 — Selection Bias in Opt-In TestsLesson 1879 — Selection Bias and Sampling BiasLesson 1938 — Using Metaphors and Analogies
- Selectivity
- is how well a query condition narrows down the result set.
- Lesson 1083 — Index Selectivity and Cardinality
- Self-contained logic
- Keep related calculations within a single query instead of multiple separate queries
- Lesson 959 — Introduction to Subqueries in WHERE
- Self-selection
- When people choose whether to participate.
- Lesson 244 — Selection Bias and Its CausesLesson 1444 — Selection Bias and Treatment Assignment
- Seller utilization
- % of available supply actually transacted
- Lesson 1630 — Marketplace Metrics: GMV, Take Rate, and Liquidity
- Senior Data Scientist
- Own complex projects end-to-end, mentor juniors informally
- Lesson 2140 — Individual Contributor vs Management Tracks
- Senior Manager/Director
- Manage multiple teams or managers, set team strategy, align with business
- Lesson 2140 — Individual Contributor vs Management Tracks
- sensitive
- `WHERE name = 'John'` only matches "John"
- Lesson 862 — Case Sensitivity in Text FilteringLesson 1478 — Defining Success Metrics
- Sensitive attributes
- leak through proxy variables—attributes correlated with protected classes.
- Lesson 1888 — Protected Classes and Sensitive Attributes
- Sensitive to all values
- Every number in your dataset affects the mean—change one value, and the mean changes
- Lesson 39 — The Mean (Arithmetic Average)
- Sensitivity
- probability the test is positive *given* you have the disease (true positive rate)
- Lesson 109 — Medical Diagnostic TestingLesson 216 — Reciprocal and Inverse TransformationsLesson 1534 — The Prior DistributionLesson 1899 — Adding Noise for Privacy
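In the diagnostic-testing sense used here, sensitivity can be computed from counts of true positives and false negatives. A minimal sketch with made-up counts:

```python
def sensitivity(true_positives, false_negatives):
    """True positive rate: P(test positive | condition present)."""
    with_condition = true_positives + false_negatives
    if with_condition == 0:
        raise ValueError("no cases with the condition")
    return true_positives / with_condition


# Hypothetical study: 100 people with the disease, 90 of whom test positive.
rate = sensitivity(true_positives=90, false_negatives=10)  # 0.9
```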
- Sensitivity analyses
- showing how results change under different assumptions
- Lesson 1949 — Anticipating Questions: Building in Appendices
- Sensitivity analysis
- is the practice of deliberately varying your prior choices and observing how the posterior distribution responds.
- Lesson 1572 — Sensitivity Analysis and Prior Robustness
- Sensors
- are specialized operators that continuously check for specific conditions—like whether another pipeline has completed or if a particular file exists in storage.
- Lesson 1845 — Cross-Pipeline Dependencies
- Sensors and IoT devices
- Real-time measurements from physical equipment
- Lesson 11 — Data Collection and Acquisition
- Separate must-fix from suggestions
- Use tags like "critical" vs "nit" or "optional."
- Lesson 2024 — Code Review Best Practices
- Separate signal from noise
- by isolating the long-term pattern from short-term variability
- Lesson 706 — Trend: Long-Term Direction
- Separate when
- Lesson 1147 — Separating and Uniting Columns
- Separating columns
- means splitting one column containing compound data (like "Smith, John" or "2024-01-15 14:30:00") into multiple columns ("LastName", "FirstName" or "Date", "Time").
- Lesson 1147 — Separating and Uniting Columns
- separation of concerns
- the statistical calculation is independent of how you choose to visualize it.
- Lesson 1352 — Statistical Transformations with stat_* LayersLesson 2069 — Project Directory Structure
- Sequence validation
- Check if values follow expected patterns
- Lesson 1024 — LAG Function: Accessing Previous Row Values
- Sequential
- No gaps in numbering (always 1, 2, 3, ...)
- Lesson 1007 — ROW_NUMBER(): Assigning Unique Row Numbers
- Sequential analysis
- Analyze patterns across adjacent time periods
- Lesson 1023 — Introduction to Window Functions: LAG and LEAD
- Sequential chains
- `task_a >> task_b >> task_c`
- Lesson 1843 — Declaring Dependencies in Orchestration Tools
- Sequential decomposition
- Remove the strongest seasonal component first, then detect weaker ones in the residuals
- Lesson 1408 — Handling Multiple Seasonal Periods
- Sequential events
- Comparing different timestamps or events within a single `events` table
- Lesson 945 — Introduction to Self-Joins
- Sequential ordering
- Time flows in one direction; past observations may predict future ones, but not vice versa
- Lesson 704 — What Makes Time Series Data Different?
- Sequential testing
- (also called *sequential analysis* or *continuous monitoring*) provides statistical methods that account for continuous or repeated looks at accumulating data.
- Lesson 1510 — Sequential Testing Overview
- Sequential updating
- means applying Bayes' Theorem iteratively: your **posterior probability after one update becomes the prior probability for the next update**.
- Lesson 114 — Sequential UpdatingLesson 116 — From Bayes' Theorem to Bayesian InferenceLesson 1555 — Advantages and Limitations of Conjugate PriorsLesson 1570 — Comparing Two Means: Bayesian ApproachLesson 1586 — Multi-Armed Bandit Connections
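The loop below sketches the idea of feeding each posterior back in as the next prior. The likelihood values are invented for illustration, and the example assumes the pieces of evidence are conditionally independent given the hypothesis.

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Posterior P(H | evidence) from Bayes' Theorem for a binary hypothesis."""
    numerator = p_evidence_given_h * prior
    return numerator / (numerator + p_evidence_given_not_h * (1 - prior))


# Hypothetical evidence: (P(E | H), P(E | not H)) for each observation.
evidence = [(0.8, 0.4), (0.9, 0.3)]

belief = 0.5  # starting prior
history = [belief]
for likelihood_h, likelihood_not_h in evidence:
    # The posterior from the previous step becomes the prior for this one.
    belief = bayes_update(belief, likelihood_h, likelihood_not_h)
    history.append(belief)
```

After both updates, `belief` is higher than after the first alone, since each piece of evidence here favors the hypothesis.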
- sequentially
- each new lag builds on previous calculations.
- Lesson 729 — Calculating Partial AutocorrelationsLesson 1037 — CASE Best Practices and PerformanceLesson 1531 — Interference from Concurrent Tests
- Serializable
- Transactions run as if they're completely alone (safest but slowest)
- Lesson 1116 — Transaction Isolation and Concurrency
- Server metrics monitoring
- CPU, memory, or network traffic that follows daily business cycles
- Lesson 1411 — Applications and Limitations
- Service Level Agreement (SLA)
- is a formal promise made to stakeholders or customers about minimum service levels, often with consequences if broken.
- Lesson 1860 — SLA and SLO Definitions
- Service Level Objective (SLO)
- is a specific, measurable target for a service's performance—think of it as your internal goal.
- Lesson 1860 — SLA and SLO Definitions
- Session data
- timestamps, referral sources, pages visited
- Lesson 1719 — The Customer Journey and Touchpoints
- Session Depth
- counts the number of actions or page views within a session.
- Lesson 1695 — Session-Based Engagement Metrics
- session duration
- ?
- Lesson 1624 — Counter-Metrics and GuardrailsLesson 1695 — Session-Based Engagement Metrics
- Session Frequency
- measures how often a user starts new sessions over a given period (e.
- Lesson 1695 — Session-Based Engagement Metrics
- Session Recency
- measures the time since a user's last session.
- Lesson 1695 — Session-Based Engagement Metrics
- Sessions
- are your workspace for database operations.
- Lesson 1122 — Creating Tables and Session Management
- Set alpha accordingly
- Lower if Type I is costly; higher if Type II is costly.
- Lesson 334 — Setting Alpha: Choosing Your Significance Level
- Set constraints
- based on cash position (max payback acceptable)
- Lesson 1759 — Optimizing ROAS, CAC, and Payback Together
- SET DEFAULT
- Similar to SET NULL, but sets the foreign key to a predefined default value instead.
- Lesson 1057 — ON DELETE and ON UPDATE Actions
- SET NULL
- clears the foreign key in child records:
- Lesson 1054 — Cascading Actions: DELETE and UPDATELesson 1057 — ON DELETE and ON UPDATE Actions
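A self-contained sketch of this behavior using Python's built-in `sqlite3` module. The `departments` and `employees` tables are hypothetical, and SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been issued.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        department_id INTEGER REFERENCES departments(id) ON DELETE SET NULL
    )
    """
)
conn.execute("INSERT INTO departments VALUES (1, 'Analytics')")
conn.execute("INSERT INTO employees VALUES (10, 'Ada', 1)")

# Deleting the parent row keeps the child row but clears its foreign key.
conn.execute("DELETE FROM departments WHERE id = 1")
row = conn.execute("SELECT name, department_id FROM employees").fetchone()
```

After the delete, the employee row still exists and its `department_id` is `None`.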
- Set priors
- for each group's mean (often using the Normal-Inverse-Gamma or Normal-Normal models you've learned)
- Lesson 1570 — Comparing Two Means: Bayesian Approach
- Set thresholds in advance
- document your acceptance criteria before seeing data
- Lesson 1492 — Rerandomization and Practical Implementation
- Set time windows carefully
- – decide if a 30-day journey with loops still counts as a single funnel attempt
- Lesson 1683 — Multi-Path and Non-Linear Funnels
- Set up hypotheses
- H₀: The probability of switching in either direction is equal
- Lesson 436 — Conducting McNemar's Test
- Set your objective
- maximize total conversions, revenue, or profit
- Lesson 1742 — Budget Optimization Using MMM
- Setup (context)
- What problem motivated this analysis?
- Lesson 1933 — The Power of Narrative in Data Communication
- Setup cells
- Import libraries and load data (code + output)
- Lesson 1982 — Literate Programming with Notebooks
- shape
- of continuous data: where values cluster, how spread out they are, and whether the distribution is symmetric or lopsided.
- Lesson 1175 — Histograms for Distribution ShapeLesson 1183 — Scatter Plots for Two Numeric VariablesLesson 1208 — Distribution Checks for All VariablesLesson 1220 — Histograms for Continuous DistributionsLesson 1238 — Matching Encoding to Data TypeLesson 1341 — Data and Aesthetic Mappings
- Shape + Color
- In scatter plots, use different point shapes (circles, triangles, squares) in addition to different colors for categories
- Lesson 1251 — Avoiding Reliance on Color Alone
- Shape parameter (k)
- Controls how the failure rate changes over time
- Lesson 187 — The Weibull Distribution: Shape, Scale, and SurvivalLesson 189 — Fitting Weibull Models to Lifetime Data
- Shape parameter (α, "alpha")
- Controls the shape of the curve.
- Lesson 181 — Gamma Distribution: Shape and Rate Parameters
- Shapiro-Wilk
- .
- Lesson 290 — Assumptions and Diagnostics for Difference IntervalsLesson 1208 — Distribution Checks for All Variables
- Shapiro-Wilk test
- Tests the null hypothesis that residuals are normally distributed
- Lesson 449 — Normality of ResidualsLesson 570 — Q-Q Plots vs Formal Normality Tests: When Visual Checks Matter
- Share of Voice
- tracks your brand's mentions versus competitors—critical for measuring platform or brand dominance within a category.
- Lesson 1631 — Social Media Metrics: DAU/MAU and Content Engagement
- Share your environment
- Specify exact software versions
- Lesson 30 — The Reproducibility Crisis and Solutions
- Shared credit models
- treat the outcome as jointly owned, rewarding collaboration.
- Lesson 1640 — Attribution in Multi-Team Environments
- Sharp
- A scholarship given to *all* students scoring ≥70 on an entrance exam.
- Lesson 1461 — Sharp vs Fuzzy RDD
- Shift+Tab
- (to move backward), **Enter** or **Space** (to activate), and **arrow keys** (for fine control).
- Lesson 1253 — Interactive Accessibility: Keyboard Navigation
- Shortened Session Duration
- Sessions getting briefer over time suggest decreasing value extraction: users aren't finding what they need or are losing interest.
- Lesson 1700 — Leading Indicators of Disengagement
- Show sensitivity analyses
- that reveal how fragile findings are
- Lesson 1929 — Avoiding Cherry-Picking Results
- Show, don't tell
- Give viewers the chart without explaining it.
- Lesson 1964 — Testing Visualizations with Audiences
- Showing temporal change
- Population growth, stock prices, disease spread
- Lesson 1306 — Animation and Time-Based Transitions
- Shrinkage
- means your posterior estimate gets "pulled" away from extreme sample values toward your prior belief.
- Lesson 1569 — Shrinkage and Regularization Effects
- sign test
- offers a simple, robust alternative.
- Lesson 391 — The Sign Test for MediansLesson 392 — Wilcoxon Signed-Rank Test
- Significance bounds
- (also called confidence intervals) help you answer this question.
- Lesson 723 — Significance Bounds in ACF Plots
- Significance indicators
- often asterisks or yes/no flags
- Lesson 462 — Interpreting and Reporting Post-Hoc Results
- significance level
- , denoted by the Greek letter **α** (alpha), is a predetermined probability threshold you set *before* conducting a hypothesis test.
- Lesson 323 — What is a Significance Level (α)?Lesson 388 — Effect Size in Sample Size Planning
- Significance level (α)
- , typically 0.
- Lesson 296 — Sample Size for Comparing Two GroupsLesson 328 — The Relationship Between α and Confidence LevelLesson 335 — Calculating Type II Error Probability (Beta)Lesson 343 — Calculating Power for Common TestsLesson 405 — Sample Size and Power for Proportion TestsLesson 446 — Power and Sample Size for ANOVALesson 1496 — The Four Parameters of Sample Size Calculation
- Signs of productive iteration
- Lesson 2112 — Iteration vs Rework: Learning from Each Cycle
- Signs of wasteful rework
- Lesson 2112 — Iteration vs Rework: Learning from Each Cycle
- Signup date
- When a user creates an account (classic acquisition cohort)
- Lesson 1646 — Defining Cohort Start Events
- Silent failures
- If Task A fails but Task B runs anyway (because it doesn't know to wait), you'll process incomplete or corrupted data without realizing it.
- Lesson 1840 — What is Dependency Management in Pipelines?
- silhouette scores
- to quantify segment quality at each cut point.
- Lesson 1706 — Hierarchical Clustering for SegmentationLesson 1708 — Choosing the Number of Segments
- Similarity
- Objects sharing visual properties (color, shape, size) are seen as belonging together.
- Lesson 1236 — Gestalt Principles in Visualization
- Simple area chart
- When you want to emphasize cumulative growth or magnitude over time
- Lesson 1227 — Area Charts and Stacked Area Charts
- Simple Exponential Smoothing
- for level-only data and **Double Exponential Smoothing (Holt's Method)** for data with trend.
- Lesson 765 — Introduction to Holt-Winters Method
- simple linear regression
- (one predictor).
- Lesson 534 — R-Squared vs Correlation SquaredLesson 595 — From Simple to Multiple Linear RegressionLesson 622 — Relationship Between F-Test and t-Tests
- Simple Moving Average (SMA)
- smooths out short-term fluctuations in your time series by averaging the most recent *n* data points.
- Lesson 751 — Simple Moving Average (SMA)
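A minimal sketch of the calculation, with a hypothetical helper `simple_moving_average` and made-up data. Each output value averages the current point and the `n - 1` points before it, so the result has `n - 1` fewer entries than the input.

```python
def simple_moving_average(series, n):
    """Average of the most recent n points, for each position with a full window."""
    if n < 1 or n > len(series):
        raise ValueError("n must be between 1 and len(series)")
    return [sum(series[i - n + 1:i + 1]) / n for i in range(n - 1, len(series))]


# Hypothetical series, smoothed with a 3-point window.
smoothed = simple_moving_average([2, 4, 6, 8, 10], n=3)  # [4.0, 6.0, 8.0]
```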
- Simple ratio check
- Calculate the ratio of residual deviance to degrees of freedom from your fitted Poisson model.
- Lesson 693 — Overdispersion in Count Data
- Simple, one-time transformations
- When you need a quick intermediate step and won't reference it again
- Lesson 974 — When to Use FROM Subqueries vs CTEs
- Simplify and Focus
- Lesson 1217 — The Transition from Explore to Explain
- Simplify communication
- "Our average customer is 34 years old" is clearer than showing a spreadsheet of 10,000 ages
- Lesson 38 — What is Central Tendency?
- Simpson's Paradox
- Lesson 430 — Common Applications and PitfallsLesson 1194 — Simpson's Paradox and ConfoundingLesson 1893 — Intersectionality in Fairness
- Simulate color blindness
- on your chart—can distinctions still be seen?
- Lesson 1254 — Testing Visualizations for Accessibility
- Simulation tools
- let you preview how your visualizations appear under different accessibility conditions:
- Lesson 1254 — Testing Visualizations for Accessibility
- Simulation visualization
- Animate particle movements or algorithm steps
- Lesson 1327 — Creating Animations with FuncAnimation
- Simultaneity
- X and Y determine each other simultaneously
- Lesson 553 — Exogeneity: X Must Be Independent of Errors
- Simultaneous decomposition
- Use methods like STL (Seasonal-Trend decomposition using Loess) with multiple seasonal periods specified
- Lesson 1408 — Handling Multiple Seasonal Periods
- Single column
- "What's the average salary per department?"
- Lesson 905 — Grouping by Multiple Columns: Basics
- Single samples vary
- Your one sample mean might be 170 cm, but someone else's might be 168 cm.
- Lesson 251 — What is a Sampling Distribution?
- Single source of truth
- One person ensures consistent calculation and definition
- Lesson 1619 — What is Metric Ownership?
- Single trial
- One Bernoulli trial = one observation
- Lesson 123 — Bernoulli Trial Definition and Properties
- size
- of differences while remaining non-parametric (no normality assumption required).
- Lesson 392 — Wilcoxon Signed-Rank TestLesson 1229 — Bubble Charts for Three VariablesLesson 1235 — Pre-Attentive AttributesLesson 1238 — Matching Encoding to Data TypeLesson 1310 — Point Maps and Scatter Plots on MapsLesson 1341 — Data and Aesthetic Mappings
- Size perception
- Humans judge area imperfectly, so don't encode critical comparisons in bubble size alone
- Lesson 1229 — Bubble Charts for Three Variables
- Size variation
- can represent a third numeric variable—larger bubbles for higher values create a "bubble chart" effect.
- Lesson 1265 — Scatter Plots: Relationships Between Variables
- Skeptical stakeholders
- Meet them where they are.
- Lesson 1953 — Adjusting Statistical Depth by Audience
- skewed
- bootstrap distributions or have systematic bias.
- Lesson 304 — BCa Bootstrap Intervals: Bias CorrectionLesson 503 — Confidence Intervals for Correlation CoefficientsLesson 568 — Skewness in Q-Q Plots: Left and Right Deviations
- Skewed distributions
- (lopsided): The mean gets "pulled" toward extreme values.
- Lesson 42 — Comparing Mean, Median, and ModeLesson 221 — CLT for Different Population Distributions
- Skewness
- Does one tail stretch longer?
- Lesson 63 — Understanding Distribution ShapeLesson 208 — Jarque-Bera Test
- Skip the jargon
- No one outside your team needs to hear "coefficient" or "residuals"
- Lesson 530 — Communicating Results to Non-Technical Audiences
- Slack
- Messages sent by teams (value = collaboration enabled)
- Lesson 1604 — What is a North Star Metric?Lesson 1606 — Examples of North Star Metrics by Industry
- Sleep Quality
- → **Alertness** (poor sleep reduces alertness)
- Lesson 1469 — Building a Simple Causal DAG
- Sliders
- let users select numeric values within a range—perfect for filtering years, adjusting thresholds, or setting parameters:
- Lesson 1332 — Streamlit Widgets: Inputs and Controls
- Slow down dramatically
- processing scales with the product, not the sum
- Lesson 943 — CROSS JOIN Results: Size and Structure
- Slow onboarding
- It's unclear which files or features represent the "real" solution
- Lesson 2135 — Dead Experimental Code and Feature Sprawl
- Slow sorting
- operations (especially with `ORDER BY`)
- Lesson 911 — Performance Considerations with Multiple Groups
- Slow-moving funnel
- Users take days between steps (friction, confusion, or decision paralysis)
- Lesson 1681 — Time-Based Funnel Analysis
- Slower payback
- = need more capital or slower scaling
- Lesson 1757 — Payback Period: Definition and Importance
- Slowly decaying ACF
- Bars decrease gradually → suggests a trend or non-stationarity
- Lesson 722 — ACF Plots and Interpretation
- small
- (< 100 rows typically)
- Lesson 943 — CROSS JOIN Results: Size and StructureLesson 1356 — What Are Facets and Small Multiples?Lesson 2034 — Committing Data Artifacts and Model Outputs
- Small drop
- Your predictors may not be useful—the intercept-only model was nearly as good.
- Lesson 698 — Null and Residual Deviance
- Small effect
- d ≈ 0.
- Lesson 385 — Cohen's d for Standardized Mean DifferencesLesson 386 — Effect Size Interpretation GuidelinesLesson 429 — Effect Size: Cramér's V and Phi
- Small Expected Frequencies
- Lesson 430 — Common Applications and Pitfalls
- Small multiples
- Show different "slices" of your data in separate 2D panels
- Lesson 1329 — Effective Use and Pitfalls of 3D Visualizations
- small p-value
- (typically < 0.
- Lesson 380 — Testing Equal Variances: Levene's and Bartlett's TestsLesson 606 — Statistical Significance of Individual CoefficientsLesson 717 — KPSS Test
- Small p-value (e.g., 0.01)
- Your observed data would be very rare if H₀ were true.
- Lesson 318 — What is a P-Value?
- Small sample sizes
- (n < 30): Your confidence intervals and p-values rely heavily on the normality assumption
- Lesson 550 — Normality of Residuals
- small samples
- (n < 50): Tests may fail to detect real non-normality (low power—you might miss problems).
- Lesson 209 — Sample Size Considerations in Normality TestsLesson 265 — Using Standard Error in PracticeLesson 398 — Choosing Between Parametric and Non-Parametric TestsLesson 554 — Consequences of Violating AssumptionsLesson 1379 — Assumptions and Limitations
- Small tables
- The table has few columns and you genuinely need all of them
- Lesson 851 — Selecting All Columns with Asterisk
- Smaller storage footprint
- Less duplication means less disk space
- Lesson 1810 — Snowflake Schema and Normalization Trade-offs
- Smart home devices
- recording conversations used for product development (and sometimes reviewed by humans)
- Lesson 1922 — Surveillance and Secondary Data Uses
- Smooth lines
- use `stat_smooth()` to fit regression or loess curves
- Lesson 1343 — Statistical Transformations
- Smooth trends
- `stat_smooth()` fits regression lines or curves
- Lesson 1352 — Statistical Transformations with stat_* Layers
- Snapshots
- rather than patches (you can't "diff" binary files meaningfully)
- Lesson 1871 — Why Version Control for Data?Lesson 2044 — Recreating Environments from Specifications
- Snowflake
- Pure separation; pause compute clusters without affecting data
- Lesson 1813 — Modern Cloud Data Warehouses: Snowflake, BigQuery, Redshift
- Social
- Unpaid clicks from social media platforms (Facebook, Twitter, LinkedIn, Instagram)
- Lesson 1712 — Common Channel Categories
- Social media
- 50-60% (daily habit)
- Lesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU)Lesson 1711 — What Are Acquisition Channels?
- Social media likes
- without measuring conversion or brand lift
- Lesson 1616 — Metrics Divorced from Revenue
- Software-defined assets
- approach treats data assets as first-class citizens.
- Lesson 1839 — Alternative Orchestration Tools
- Solution
- Filter out NULLs in the subquery:
- Lesson 962 — NOT IN with SubqueriesLesson 1068 — Higher Normal Forms: 4NF and 5NFLesson 1765 — Big Data vs Big Compute
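For the NOT IN case, the sketch below shows why the filter matters, using Python's built-in `sqlite3` module. The `customers` and `orders` tables are hypothetical. A single NULL in the subquery result makes every NOT IN comparison evaluate to unknown, so no rows come back until the NULLs are excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (None,)])

# Without the filter, the NULL in the subquery suppresses every row.
unfiltered = conn.execute(
    "SELECT id FROM customers "
    "WHERE id NOT IN (SELECT customer_id FROM orders)"
).fetchall()

# Filtering out NULLs restores the expected result.
filtered = conn.execute(
    "SELECT id FROM customers "
    "WHERE id NOT IN (SELECT customer_id FROM orders "
    "                 WHERE customer_id IS NOT NULL) "
    "ORDER BY id"
).fetchall()
```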
- Some aggregations
- that require full dataset knowledge are unavailable or slow.
- Lesson 1796 — Limitations and Differences from Pandas
- Sorted retrieval
- (`ORDER BY`) comes nearly free since data is already ordered
- Lesson 1079 — B-Tree Indexes: Structure and Mechanics
- Source
- Which table, file, or API it came from
- Lesson 1163 — Metadata and Data DictionariesLesson 1823 — Pipeline Components: Sources, Transformations, Sinks
- Source connectors
- Extract data from databases, APIs, cloud storage, or streaming services
- Lesson 1822 — What is a Data Pipeline?
- Source information
- Original data location, collection date, version
- Lesson 2065 — Tracking Data Lineage
- Source URL or Location
- The exact web address, API endpoint, database connection string, or file path where you obtained the data.
- Lesson 2063 — Essential Metadata to Capture
- Source/Derivation
- Where the data came from or how it was calculated
- Lesson 2064 — Creating Data Dictionaries
- Space
- (to activate), and **arrow keys** (for fine control).
- Lesson 1253 — Interactive Accessibility: Keyboard Navigation
- Spark Core
- is the foundation of the entire framework.
- Lesson 1775 — Spark Components: Core, SQL, MLlib, Streaming
- Spark SQL
- brings structured data processing to Spark.
- Lesson 1775 — Spark Components: Core, SQL, MLlib, StreamingLesson 1778 — DataFrames and Spark SQL Basics
- Spark Streaming
- enables real-time data processing through micro-batching:
- Lesson 1775 — Spark Components: Core, SQL, MLlib, Streaming
- Spatial Correlation
- Geographic data points near each other (neighboring counties, adjacent plots of land) tend to be more similar than distant ones.
- Lesson 381 — Independence Assumption and Its Violations
- Spatial data
- Neighboring geographic areas influence each other
- Lesson 548 — Independence of Observations
- Spatial heatmaps
- and **density maps** solve this by showing *where* activity is most concentrated, creating smooth gradients that reveal patterns invisible in raw point data.
- Lesson 1312 — Heatmaps and Density Maps for Spatial Data
- Spearman
- .
- Lesson 487 — When to Use Spearman vs PearsonLesson 1184 — Correlation Coefficients in Bivariate Analysis
- Spearman correlation
- works with ranked data instead of raw values.
- Lesson 1184 — Correlation Coefficients in Bivariate Analysis
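One way to see "works with ranked data" concretely: Spearman correlation can be computed as Pearson correlation applied to ranks. The sketch below is a plain-Python illustration; the helper names (`average_ranks`, `pearson`, `spearman`) and the sample data are assumptions made for the example.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x**3: monotonic but not linear

print(round(pearson(x, y), 3))   # below 1, because the relationship is curved
print(round(spearman(x, y), 3))  # 1.0, because the ranks agree perfectly
```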
- Spearman's Rho
- correlates the *ranks* of your data, essentially asking "how well does a linear relationship fit the ranked data?"
- Lesson 490 — Kendall's Tau vs Spearman's Rho
- specific
- .
- Lesson 228 — Defining Populations and ParametersLesson 1166 — Defining the Business QuestionLesson 1912 — What is Informed Consent in Data Science?Lesson 2094 — Defining Success Metrics Upfront
- Specific and Actionable
- Avoid vague advice like "improve customer retention."
- Lesson 1970 — Recommendations and Next Steps
- Specific and quantifiable
- – Uses numbers, percentages, or binary outcomes
- Lesson 1610 — Defining Measurable Key Results
- Specification Limits
- are the "voice of the customer."
- Lesson 1400 — Control Limits vs Specification Limits
- Specificity
- "Increase sales" becomes "Predict which existing customers are likely to purchase Product X in the next 30 days"
- Lesson 10 — Problem Definition and ScopingLesson 109 — Medical Diagnostic TestingLesson 498 — Bradford Hill Criteria for CausationLesson 1200 — Formulating Specific, Testable Hypotheses
- Speed
- Parquet/Feather > CSV > JSON > Excel
- Lesson 1133 — Performance Considerations Across FormatsLesson 2123 — Simple Rules Beat Complex Models
- Speed and Scale
- A biased recommendation algorithm can expose millions to harmful content in hours, far beyond what human curation could achieve.
- Lesson 1923 — Algorithmic Amplification of Harm
- Speed and simplicity
- No transformation bottleneck during load—get data in fast, ask questions later.
- Lesson 1816 — What is ELT? Extract, Load, Transform Explained
- Speed matters
- You need rapid inference or real-time updates
- Lesson 1556 — Choosing Between Conjugate and Non-Conjugate PriorsLesson 1595 — Stan: High-Performance Bayesian Inference
- Speed up development
- by working with small, fast result sets
- Lesson 877 — LIMIT: Restricting the Number of Rows Returned
- Spillovers
- happen when the treatment affects the control group indirectly.
- Lesson 1458 — Common DiD Pitfalls
- Splines
- and **piecewise methods** offer an alternative approach with some key advantages.
- Lesson 662 — Polynomial Features vs Splines
- Split
- Partition your data into independent chunks (often by rows)
- Lesson 1768 — Data Parallelism Fundamentals
- Split each party's data
- into encrypted "shares" distributed among participants
- Lesson 1903 — Secure Multi-Party Computation
- Split your dataset
- into strata based on confounder values (e.
- Lesson 1430 — Controlling for Confounders: Stratification
- Spot early warning signs
- when new cohorts show unusual churn patterns
- Lesson 1672 — Cohort-Based Churn Analysis
- Spot real trends
- See if unemployment is genuinely rising or just following seasonal patterns
- Lesson 748 — Seasonally Adjusted Data
- Spot trends over time
- Are newer cohorts retaining better than older ones?
- Lesson 1659 — Comparing Retention Across Cohorts
- Spot underutilized gems
- Low adoption but high frequency among adopters suggests poor discoverability
- Lesson 1696 — Feature Adoption and Usage Frequency
- Spotify
- Time spent listening (value = entertainment delivered)
- Lesson 1604 — What is a North Star Metric?Lesson 1606 — Examples of North Star Metrics by Industry
- Spotify's lightweight framework
- that emphasizes simplicity and file-based targets.
- Lesson 1839 — Alternative Orchestration Tools
- Spread
- Lesson 1172 — What is Univariate Analysis?Lesson 1176 — Box Plots for Spread and OutliersLesson 1208 — Distribution Checks for All VariablesLesson 1220 — Histograms for Continuous Distributions
- Spreads
- Which group shows more variability (wider IQR)?
- Lesson 1186 — Box Plots and Violin Plots by Group
- Sprint goal
- "Deliver initial churn prediction baseline with three features"
- Lesson 2113 — Timeboxing and Sprint Planning for Data Projects
- spurious correlation
- occurs when two variables appear statistically related but have no genuine cause-and-effect relationship.
- Lesson 494 — Spurious Correlations and CoincidenceLesson 1422 — Spurious Correlations
- Spurious relationships
- We might detect patterns or correlations that don't actually exist, leading to false confidence in our forecasts.
- Lesson 713 — Why Stationarity MattersLesson 734 — Why Differencing and Detrending Matter
- SQL and Stats Tests
- often come first as screeners.
- Lesson 2142 — Interviewing: Technical and Behavioral Prep
- SQL Server
- Often case-insensitive, but depends on collation settings
- Lesson 862 — Case Sensitivity in Text FilteringLesson 940 — Database Support and Alternatives
- SQLAlchemy Core
- provides a *SQL Expression Language*—a Pythonic way to write SQL queries using functions and methods instead of raw strings.
- Lesson 1118 — SQLAlchemy Core vs ORM
- SQLAlchemy ORM
- provides a higher-level abstraction where you work with *Python classes and objects* instead of tables and rows.
- Lesson 1118 — SQLAlchemy Core vs ORM
- SQLite
- is a lightweight DBMS that stores your entire database in a single file.
- Lesson 845 — Database Management Systems (DBMS)Lesson 940 — Database Support and AlternativesLesson 1041 — Formatting and Parsing Dates
- Square root
- (`sqrt(Y)`) is gentler than log and works well for count data.
- Lesson 591 — When and Why to Transform Variables
- Square Root Transformation
- (`sqrt(x)`) works particularly well for:
- Lesson 213 — Square Root and Cube Root Transformations
- Stability
- Less erratic behavior at data boundaries
- Lesson 662 — Polynomial Features vs SplinesLesson 1734 — Comparing and Validating Attribution Models
- Stability over time
- The relationship shouldn't suddenly shift
- Lesson 1518 — The Relationship Between Surrogate and Business Metrics
- Stabilize coefficient estimates
- Less wobbling between models
- Lesson 585 — Remedies: Variable Selection
- Stabilizing variance
- – Making the spread of data more consistent across different ranges
- Lesson 212 — Log Transformations
- stack
- the bars on top of each other, or **group** them side-by-side.
- Lesson 1226 — Stacked and Grouped Bar ChartsLesson 1353 — Position Adjustments: Dodge, Stack, and Jitter
- Stack traces
- The full path of execution leading to the failure
- Lesson 1851 — Error Logging and Notifications
- Stacked bar charts
- pile segments on top of each other to show both part-to-whole relationships and totals.
- Lesson 1188 — Stacked and Grouped Bar Charts
- Stacked bars
- work best for showing composition and totals simultaneously.
- Lesson 1188 — Stacked and Grouped Bar ChartsLesson 1226 — Stacked and Grouped Bar ChartsLesson 1266 — Bar Plots: Categorical Comparisons
- Staff Data Scientist
- Technical leadership across multiple projects, set standards, solve org-wide problems
- Lesson 2140 — Individual Contributor vs Management Tracks
- Stage
- new users vs.
- Lesson 1701 — What is Customer Segmentation?Lesson 1874 — DVC Pipelines and Stages
- Stage 2
- Use stratified sampling to select universities within those states (ensuring you get different types: public, private, large, small)
- Lesson 238 — Multistage Sampling
- Stage 3
- Use simple random sampling to select individual students from each chosen university
- Lesson 238 — Multistage Sampling
- Stage a single file
- Lesson 1994 — Staging Changes with git add
- Stage multiple files
- Lesson 1994 — Staging Changes with git add
- Stage the resolved files
- with `git add <filename>` (or `git add .
- Lesson 2011 — Resolving Merge ConflictsLesson 2018 — Resolving Conflicts During Rebase
- Staged files
- Files you've added to the staging area with `git add`, ready for the next commit
- Lesson 1998 — Checking Repository Status
- Staging Area
- (Index): The box where you arrange items you've decided to ship
- Lesson 1993 — The Three States: Working Directory, Staging, Repository
- Stakeholder Alignment
- Everyone agrees on what "success" looks like before you start
- Lesson 10 — Problem Definition and ScopingLesson 1973 — Report Review and Quality Checklist
- Stakeholder communication
- Inform affected communities before public release when possible
- Lesson 1925 — Mitigation Strategies and Responsible Disclosure
- Stakeholder confidence
- They wonder why you're still working instead of moving forward
- Lesson 2120 — The Opportunity Cost of Iteration
- Stakeholder indifference
- Additional precision doesn't change the business decision
- Lesson 2116 — Diminishing Returns and the 80/20 Rule
- Stakeholder learning
- Non-technical partners often don't fully understand what they need until they see something concrete.
- Lesson 2109 — Why Data Science is Inherently Iterative
- Stakeholder management
- Translating technical work into business impact
- Lesson 2142 — Interviewing: Technical and Behavioral Prep
- Stakeholder-driven iteration
- Business users see preliminary results and refine requirements.
- Lesson 2092 — Iteration and Feedback Loops in Practice
- Stakeholders need self-service analytics
- (executives checking KPIs, analysts exploring trends)
- Lesson 1330 — Introduction to Interactive Dashboards
- Stakes are high
- Major feature launches, pricing changes, or algorithm overhauls
- Lesson 1522 — Balancing Speed and Accuracy in Metric SelectionLesson 1556 — Choosing Between Conjugate and Non-Conjugate Priors
- Stakes are low
- Minor UI tweaks, button colors, or copy changes
- Lesson 1522 — Balancing Speed and Accuracy in Metric Selection
- Stale tracking
- (data pipeline breaks, no one notices for weeks)
- Lesson 1619 — What is Metric Ownership?
- Standard Attribution Logic
- Lesson 1643 — Building Attribution Frameworks
- Standard deviation
- solves this by taking the square root of the variance, returning the measure to the original units:
- Lesson 49 — Standard Deviation: Interpretable SpreadLesson 52 — Mean Absolute Deviation (MAD)Lesson 54 — When to Use Each MeasureLesson 122 — Variance and Standard Deviation of Discrete Random VariablesLesson 136 — Expectation and Variance of the Negative BinomialLesson 141 — Mean and Variance of Poisson DistributionLesson 148 — Variance and Standard Deviation of Discrete DistributionsLesson 166 — Exponential Distribution: Mean and Variance (+5 more)
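A short illustration of the variance-to-standard-deviation step, using Python's `statistics` module on a made-up dataset (the numbers are assumptions chosen so the arithmetic is easy to check by hand).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

# Population variance is in squared units of the data.
variance = statistics.pvariance(data)

# The standard deviation is its square root, back in the original units.
std_dev = statistics.pstdev(data)

print(variance)  # 4
print(std_dev)   # 2.0
```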
- Standard Deviation (SD)
- measures how spread out the *individual values* in your dataset are from the mean.
- Lesson 261 — Standard Error vs Standard Deviation
- Standard deviation = 1
- One unit on the horizontal axis equals one standard deviation
- Lesson 194 — The Standard Normal Distribution
- standard error
- ) is:
- Lesson 223 — Standard Error and the CLTLesson 224 — CLT for ProportionsLesson 256 — Variability of Sample StatisticsLesson 260 — Defining Standard ErrorLesson 271 — Margin of ErrorLesson 276 — Sampling Distribution of a ProportionLesson 277 — Standard Error for ProportionsLesson 300 — Bootstrap Distribution of a Statistic (+3 more)
- Standard Error (SE)
- measures how spread out the *sample means* would be if you took many samples from the same population.
- Lesson 261 — Standard Error vs Standard Deviation
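To make the SD versus SE contrast concrete, the sketch below computes both from one sample, using the usual estimate for the standard error of the mean, SE = SD / √n. The sample values are hypothetical.

```python
import math
import statistics

sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

# SD: spread of the individual values around the sample mean.
sd = statistics.stdev(sample)

# SE of the mean: estimated spread of sample means across repeated samples.
se = sd / math.sqrt(len(sample))

print(f"SD = {sd:.3f}, SE = {se:.3f}")
```

Because of the √n in the denominator, the SE shrinks as the sample grows, while the SD settles toward the population value.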
- Standard error (unpooled)
- Lesson 412 — Confidence Interval for Difference
- standard normal distribution
- is a special case of the normal distribution with a **mean (μ) of 0** and a **standard deviation (σ) of 1**.
- Lesson 194 — The Standard Normal DistributionLesson 403 — Finding P-Values for Proportion Tests
- Standard normal tables
- (Z-tables) after converting to Z-scores
- Lesson 173 — Calculating Probabilities with the Normal Distribution
- Standardization for Comparison
- Comparing SAT scores (mean 1050, SD 200) to ACT scores (mean 21, SD 5) directly is meaningless.
- Lesson 201 — Z-Score Applications and Limitations
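A quick sketch of the SAT/ACT comparison using z = (x − μ) / σ. The means and standard deviations come from the entry above; the two individual scores (1250 and 28) are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mean) / sd

sat_z = z_score(1250, mean=1050, sd=200)  # hypothetical SAT score
act_z = z_score(28, mean=21, sd=5)        # hypothetical ACT score

print(sat_z)  # 1.0
print(act_z)  # 1.4
```

On the standardized scale the ACT score is the stronger result, even though the raw numbers cannot be compared directly.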
- Standardize the Approach
- Lesson 2046 — Best Practices for Environment Management in Teams
- Standardized
- divided by the standard error of that cell
- Lesson 428 — Post-Hoc Analysis and ResidualsLesson 588 — Standardized and Studentized Residuals
- Standardized coefficients
- (also called **beta weights** or **β weights**) put all predictors on the same scale by expressing them in standard deviation units.
- Lesson 608 — Standardized Coefficients (Beta Weights)
- Standardized residuals
- divide each residual by an estimate of its standard deviation:
- Lesson 563 — Standardized and Studentized ResidualsLesson 588 — Standardized and Studentized Residuals
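For reference, a common textbook form of the standardized residual is shown below. The notation is an assumption made here, not taken from the lesson: $e_i$ is the raw residual, $\hat{\sigma}$ is the residual standard error, and $h_{ii}$ is the leverage of observation $i$.

```latex
r_i = \frac{e_i}{\hat{\sigma}\,\sqrt{1 - h_{ii}}}
```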
- Standardizing capitalization
- ensures "Apple", "APPLE", and "apple" are recognized as the same.
- Lesson 1138 — Cleaning and Standardizing Text Fields
- star schema
- is a common data warehouse design where one central **fact table** (containing measurements like sales amounts, quantities, or counts) connects to multiple **dimension tables** (containing descriptive attributes like customer names, product detail...
- Lesson 956 — Star Schema JoinsLesson 1808 — Star Schema and Fact Tables
- start
- a transaction block and how to **commit** it to save your work.
- Lesson 1112 — Starting and Committing TransactionsLesson 1582 — Updating Beliefs with Test Data
- Start small and targeted
- Don't attempt to rewrite everything at once.
- Lesson 2137 — Refactoring Strategies and Debt Paydown
- Start with d=0
- Check if your original series is already stationary using visual inspection and the Augmented Dickey-Fuller or KPSS tests you learned earlier.
- Lesson 778 — Determining Differencing Order (d)
- Start with domain knowledge
- Which predictors make theoretical sense?
- Lesson 633 — Practical Model Selection Strategy
- Start with initial guesses
- for all parameters (θ₁, θ₂, .
- Lesson 1591 — Gibbs Sampling for Multivariate Posteriors
- Start with the answer
- Lead with your key finding or recommendation (remember the Pyramid Principle from lesson 1952).
- Lesson 1965 — Progressive Disclosure Techniques
- Start with your prior
- `P(θ)` — your belief about parameter θ before seeing data
- Lesson 1545 — Calculating the Posterior Distribution
- Starts at 0
- F(-∞) = 0 (no probability accumulated yet)
- Lesson 157 — Cumulative Distribution Functions (CDFs) for Continuous Variables
- State conclusions in context
- , not just statistical jargon
- Lesson 368 — Common Pitfalls and Best Practices
- State your hypotheses
- For example, H₀: median = 50 vs H₁: median ≠ 50
- Lesson 391 — The Sign Test for MediansLesson 396 — Bootstrap Hypothesis TestingLesson 447 — Conducting One-Way ANOVA in Practice
- Static validation
- Parse your DAG definition without executing it.
- Lesson 1846 — Testing and Validating Dependency Graphs
- Stationarity
- means that a time series has **constant statistical properties over time**.
- Lesson 712 — What is Stationarity?Lesson 740 — Choosing Between Differencing and DetrendingLesson 1169 — Clarifying Assumptions and Constraints
- stationary
- Lesson 716 — Augmented Dickey-Fuller TestLesson 725 — Decay Rates in ACFLesson 734 — Why Differencing and Detrending Matter
- Statistical confirmation
- Run stationarity tests after differencing.
- Lesson 778 — Determining Differencing Order (d)
- Statistical exploration is central
- R's grammar of graphics makes iterative statistical visualization seamless.
- Lesson 1375 — Choosing Tools: When to Use R vs Python for Visualization
- Statistical hypothesis testing
- lets you quantify whether observed differences are likely real effects or just sampling noise.
- Lesson 1684 — Statistical Significance in Funnel Comparisons
- Statistical independence
- Sessions from the same user aren't independent—they're correlated.
- Lesson 1481 — Unit of Randomization
- Statistical methods
- you choose (some tests only work for continuous data)
- Lesson 18 — Numerical Variables: Discrete and ContinuousLesson 1209 — Outlier Detection and Investigation
- Statistical power
- is the probability that your hypothesis test will correctly reject a false null hypothesis.
- Lesson 338 — What is Statistical Power?Lesson 375 — Paired t-Test vs Two-Sample t-TestLesson 397 — Power and Efficiency of Non-Parametric TestsLesson 405 — Sample Size and Power for Proportion TestsLesson 446 — Power and Sample Size for ANOVALesson 505 — Sample Size and Power for Correlation TestsLesson 1493 — Why Sample Size Matters in A/B TestsLesson 1529 — Running Underpowered Tests
- Statistical power increases
- – the test becomes better at detecting *any* deviation
- Lesson 209 — Sample Size Considerations in Normality TestsLesson 341 — Effect Size and Power
- Statistical power varies
- Some comparisons have more precision than others
- Lesson 468 — Balanced vs Unbalanced Designs
- Statistical significance
- (p = 0.
- Lesson 389 — Reporting Effect Sizes in PracticeLesson 529 — Practical vs Statistical SignificanceLesson 609 — Practical vs Statistical SignificanceLesson 1858 — Alerting Strategies
- Statistical significance testing
- answers the question: "Is this predictor's coefficient reliably different from zero, or could I have gotten this result just from random variation?"
- Lesson 606 — Statistical Significance of Individual Coefficients
- Statistical sophistication
- Teams must understand alpha spending functions, confidence sequences, or group boundaries, not just basic t-tests
- Lesson 1515 — Trade-offs: Sample Size, Speed, and Complexity
- Statistical test
- Run a DiD-style regression using only pre-treatment data, with placebo "treatment" dates.
- Lesson 1456 — Testing Parallel Trends
- Statistical testing
- you're testing whether categories differ from the reference, not whether they differ from zero
- Lesson 643 — Interpreting Coefficients Relative to Reference
- Statistical tests
- (Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, Jarque-Bera) give you *objective numbers* with p-values.
- Lesson 210 — Combining Visual and Statistical MethodsLesson 217 — Evaluating Transformation EffectivenessLesson 734 — Why Differencing and Detrending MatterLesson 788 — Checking Residual NormalityLesson 1491 — Covariate Balance and Diagnostics
- statistical transformations
- (or "stats").
- Lesson 1343 — Statistical TransformationsLesson 1352 — Statistical Transformations with stat_* Layers
- Statistical validation
- A test confirming the change isn't random noise
- Lesson 1946 — Supporting Your Claims with Evidence
- statistically significant
- doesn't mean it's **practically meaningful**.
- Lesson 609 — Practical vs Statistical SignificanceLesson 723 — Significance Bounds in ACF Plots
- Statistics
- focuses on testing hypotheses and understanding uncertainty with mathematical rigor.
- Lesson 1 — Defining Data ScienceLesson 7 — The Data Science Skill StackLesson 229 — Defining Samples and Statistics
- Statistics (stat)
- Transformations applied to data (means, counts, smoothing)
- Lesson 1340 — The Seven Layers of Grammar
- Status dependencies
- Certain field combinations are impossible.
- Lesson 1155 — Consistency Checks Across Fields
- Stay interpretable
- Stakeholders understand exactly what changed and why
- Lesson 2128 — Data Distribution Shifts Frequently
- Steep drops
- Many events happening at specific times
- Lesson 815 — Survival Curve Plots and Interpretation
- Step 1: Check Independence
- Lesson 383 — Diagnostic Workflow: When to Proceed or Switch Tests
- Step 1: Decompose
- your historical data into trend, seasonal, and remainder components using your chosen method (classical or STL).
- Lesson 749 — Using Decomposition for Forecasting
- Step 2: Achieve 1NF
- Lesson 1069 — Normalization Process Step-by-Step
- Step 2: Assess Normality
- Lesson 383 — Diagnostic Workflow: When to Proceed or Switch Tests
- Step 2: Calculate IQR
- Lesson 1385 — Calculating IQR Fences in Practice
- Step 4: Flag Outliers
- Lesson 1385 — Calculating IQR Fences in Practice
- Step 4: Reach 3NF
- Lesson 1069 — Normalization Process Step-by-Step
- step function
- that drops at each event time, creating the characteristic "survival curve" you'll visualize.
- Lesson 809 — Introduction to the Kaplan-Meier EstimatorLesson 815 — Survival Curve Plots and InterpretationLesson 1639 — Time Windows and Attribution Decay
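The step-function behavior can be sketched with a small, plain-Python version of the Kaplan-Meier product-limit calculation. The function name `kaplan_meier` and the toy data are assumptions for illustration; a real analysis would normally use a dedicated survival library.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival)] with one step per distinct event time.

    durations: follow-up time for each subject
    observed:  True if the event occurred, False if the subject was censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if events:
            # The curve drops only at times where an event actually occurs.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Subjects with an event or censored at t leave the risk set.
        at_risk -= leaving
    return curve

durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]

for t, s in kaplan_meier(durations, observed):
    print(t, round(s, 3))
```

Censored subjects (here at times 3 and 8) never cause a drop; they only reduce the number at risk for later steps.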
- Step-by-step instructions
- How to run the analysis from start to finish
- Lesson 1989 — Best Practices for Sharing Reproducible Reports
- STL
- stands for **S**easonal-**T**rend decomposition using **L**oess.
- Lesson 745 — STL Decomposition (Seasonal-Trend Loess)
- Stop
- at the first p-value that fails to reject; all subsequent tests are also not rejected
- Lesson 1504 — Holm-Bonferroni Method
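The stopping rule is easiest to see in code. Below is a minimal sketch of the Holm-Bonferroni step-down procedure; the function name and the example p-values are assumptions for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # with alpha / (m - k + 1), which equals alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop: this test and all larger p-values are not rejected
    return reject

p_values = [0.01, 0.04, 0.03, 0.005]
decisions = holm_bonferroni(p_values)
print(decisions)  # [True, False, False, True]
```

Note that 0.04 would pass its own threshold (0.05) in isolation, but it is not rejected because the procedure already stopped at 0.03.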
- Stop early
- when evidence is strong (saving time and resources)
- Lesson 1510 — Sequential Testing Overview
- Stop when stationary
- Don't difference more than necessary—if your tests confirm stationarity, stop there.
- Lesson 778 — Determining Differencing Order (d)
- Stopping
- | Fixed sample size or sequential correction needed | Natural sequential updating, stop anytime |
- Lesson 1580 — Bayesian vs Frequentist A/B Testing
- Storage Limitation
- Lesson 1905 — Core Principles of GDPR
- Storage space
- You're duplicating information that could be derived
- Lesson 1073 — Storing Computed Values and AggregatesLesson 1074 — Duplicating Data Across TablesLesson 1077 — Measuring Performance Impact of Denormalization
- Store data separately
- Use cloud storage (S3, Google Cloud), shared drives, or dedicated data warehouses
- Lesson 2070 — Separating Data from Code
- Straight line
- Normality assumption holds
- Lesson 565 — What Q-Q Plots Show: Comparing Residual Distribution to Normal
- Strain applications
- displaying or processing thousands of rows
- Lesson 911 — Performance Considerations with Multiple Groups
- strata
- (homogeneous subgroups) and then sampling proportionally from each stratum.
- Lesson 236 — Stratified SamplingLesson 817 — Comparing Multiple Survival Curves
- Strategic boundaries
- Choosing cutoffs that produce desired patterns rather than natural ones
- Lesson 1245 — Misleading Aggregations and Binning
- Strategic Callouts
- Lesson 1960 — Annotation and Labeling Best Practices
- Strategic planning
- Identify which touchpoints work best at different customer journey stages
- Lesson 1718 — Introduction to Marketing Attribution
- Strategically aligned
- Connect directly to your North Star Metric or broader business priorities
- Lesson 1609 — Setting Effective Objectives
- Strategy 4: Column-by-column parsing
- Lesson 1136 — Handling Mixed Encodings in a Single Dataset
- Stratified Cox models
- allow you to account for a variable's effect on survival *without* assuming proportional hazards for that variable.
- Lesson 832 — Stratified Cox Models
- Stratified or adjusted approaches
- More sophisticated corrections that balance power and error control
- Lesson 824 — Multiple Group Comparisons
- Stratified randomization
- solves this by first dividing your sample into homogeneous subgroups (strata) based on key covariates, then randomizing *within* each stratum.
- Lesson 1489 — Stratified Randomization Fundamentals
- Stratified sampling
- solves this by dividing your population into **strata** (homogeneous subgroups) and then sampling proportionally from each stratum.
- Lesson 236 — Stratified SamplingLesson 237 — Cluster SamplingLesson 240 — Quota SamplingLesson 243 — Choosing the Right Sampling MethodLesson 1885 — Mitigation Strategies: Data Collection
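A minimal sketch of proportional stratified sampling, assuming a hypothetical population of universities tagged as public or private. The function name, the `fraction` parameter, and the data are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))  # simple random sample per stratum
    return sample

# Hypothetical population: 80 public and 20 private universities.
population = [("public", i) for i in range(80)] + [("private", i) for i in range(20)]
sample = stratified_sample(population, lambda r: r[0], fraction=0.1)

counts = {kind: sum(1 for s in sample if s[0] == kind) for kind in ("public", "private")}
print(counts)  # {'public': 8, 'private': 2}
```

The sample preserves the 80/20 split of the population, which a simple random sample of 10 would not guarantee.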
- Streaming is essential when
- Lesson 1824 — Batch vs Streaming Pipelines
- Streaming pipelines
- work like a phone call—process information instantly as it flows through.
- Lesson 1824 — Batch vs Streaming Pipelines
- Streamlit Cloud
- is the easiest option for Streamlit apps—simply connect your GitHub repository, and it deploys automatically.
- Lesson 1338 — Deployment and Sharing Dashboards
- Strength
- Are points tightly clustered along a line, or scattered widely?
- Lesson 480 — Scatterplots and Visual AssessmentLesson 498 — Bradford Hill Criteria for CausationLesson 1183 — Scatter Plots for Two Numeric Variables
- strong
- when it's much more likely to appear if the person is guilty than if innocent.
- Lesson 112 — Legal Evidence and Jury ReasoningLesson 1610 — Defining Measurable Key Results
- Strong correlation
- Changes in the surrogate should consistently predict changes in the business metric
- Lesson 1518 — The Relationship Between Surrogate and Business Metrics
- Strong relationships
- jump out as values near +1 or -1.
- Lesson 511 — Reading and Interpreting Correlation Matrices
- Strong validation exists
- Your surrogate has proven correlation with business outcomes
- Lesson 1522 — Balancing Speed and Accuracy in Metric Selection
- Structure your narrative
- around these three pillars—each becomes a mini-story within your larger presentation
- Lesson 1940 — The Rule of Three in Data Storytelling
- Structured data
- is information organized into rows and columns, like a spreadsheet or database table.
- Lesson 16 — Structured vs Unstructured DataLesson 20 — Primary Data Sources: Databases and Data WarehousesLesson 22 — File Formats: CSV, JSON, and Beyond
- Student's t distributions
- for heavier tails (more robust to outliers)
- Lesson 1565 — Prior Distributions for Normal Means
- Studentized residuals
- go further: they refit the model *without* that specific observation and see how much it differs:
- Lesson 563 — Standardized and Studentized ResidualsLesson 588 — Standardized and Studentized Residuals
- Subgroup analysis
- Always disaggregate your fairness metrics across combinations of protected attributes (gender × race, age × disability status, etc.).
- Lesson 1893 — Intersectionality in Fairness
- Subject Matter Expertise
- Lesson 1602 — Identifying Leading Indicators for Your Metrics
- Subject matter experts
- Talk to salespeople, operations staff, customers
- Lesson 1201 — Domain Knowledge as a Hypothesis Source
- Subjective labeling
- When humans label training data (tagging images, rating sentiment, or classifying documents), their personal biases, cultural backgrounds, and varying interpretations create inconsistency.
- Lesson 1880 — Measurement and Label Bias
- Subscriber Acquisition Cost
- Marketing spend divided by new subscribers, but media-specific: track which content drives sign-ups.
- Lesson 1635 — Media and Content Metrics: Watch Time and Content Performance
- Subscription duration modeling
- treats cancellation as the "event" and subscription length as the "time" variable, letting you predict when customers are most likely to churn and what drives retention.
- Lesson 838 — Subscription and Membership Duration Modeling
- SUBSTRING
- extracts a specific portion of a string
- Lesson 1044 — String Manipulation: CONCAT, LENGTH, and SUBSTRING
- Subtract 1
- Because once you know the counts for all but one category, the last one is determined (they must sum to your total sample size)
- Lesson 418 — Degrees of Freedom in Goodness of Fit
- Subtract estimated parameters
- If you had to estimate any population parameters from your data (like a mean or proportion), you lose additional degrees of freedom
- Lesson 418 — Degrees of Freedom in Goodness of Fit
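The two "subtract" rules combine into df = k − 1 − m, where k is the number of categories and m is the number of parameters estimated from the data. The helper below is a small sketch of that arithmetic; its name and the two scenarios are assumptions for illustration.

```python
def goodness_of_fit_df(n_categories, n_estimated_params=0):
    """Degrees of freedom for a chi-square goodness-of-fit test."""
    df = n_categories - 1 - n_estimated_params
    if df < 1:
        raise ValueError("not enough categories for the parameters estimated")
    return df

# Fair-die test: 6 faces, nothing estimated from the data.
die_df = goodness_of_fit_df(6)

# Fitting a Poisson to 8 count bins, with the mean estimated from the data.
poisson_df = goodness_of_fit_df(8, n_estimated_params=1)

print(die_df)      # 5
print(poisson_df)  # 6
```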
- Subtract the mean
- (x - μ): This centers your data point.
- Lesson 196 — Calculating Z-Scores from Raw Data
- Subtracting intervals
- Lesson 1040 — Date Arithmetic and INTERVAL Operations
- Success
- Commits automatically when the block completes
- Lesson 1114 — Transaction Context Managers in Python
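As a hedged illustration of commit-on-success, the sketch below uses the standard library's `sqlite3` module, where a connection used as a context manager commits when the block finishes and rolls back if an exception escapes. The lesson may use a different driver; the `accounts` table is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

# Success: the transaction commits when the block completes.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'a'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'b'")

# Failure: an exception inside the block rolls the transaction back.
try:
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 500 WHERE name = 'a'")
        raise ValueError("insufficient funds")
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'a': 70, 'b': 30}
```

In `sqlite3`, the `with` block manages the transaction only; it does not close the connection.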
- success criteria
- the numbers that tell you whether your experimental change actually improved things.
- Lesson 1516 — Business Metrics: Definition and ExamplesLesson 2093 — Translating Business Questions into Analytical QuestionsLesson 2103 — Managing Expectations and Defining Success
- Success-Failure Condition
- Lesson 400 — Assumptions and Conditions for Proportion TestsLesson 411 — Sample Size Requirements
- Sudden shifts
- equipment calibration changes, policy updates, or batch effects
- Lesson 562 — Index Plots and Time-Ordered Residuals
- Sum
- all those products
- Lesson 45 — Central Tendency for Grouped DataLesson 225 — CLT for Sums and Other StatisticsLesson 892 — GROUP BY with Different Aggregate FunctionsLesson 894 — NULL Values in GROUP BY
- Sum of absolute residuals
- Better, but mathematically difficult to work with (no smooth derivative).
- Lesson 517 — The Least Squares Criterion
- Sum of raw residuals
- No—positive and negative errors cancel out.
- Lesson 517 — The Least Squares Criterion
- SUM()
- naturally ignores NULLs, so unmatched rows contribute nothing (which is usually what you want)
- Lesson 933 — Aggregating with LEFT JOINs
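One subtlety worth seeing in a runnable form: when a group contains only unmatched rows, `SUM()` returns NULL rather than 0, so reports often wrap it in `COALESCE`. The sketch below uses `sqlite3` with hypothetical `customers` and `orders` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (1, 20.0), (1, 5.0);
""")

rows = conn.execute("""
    SELECT c.name,
           SUM(o.amount)              AS raw_total,  -- NULL when nothing matched
           COALESCE(SUM(o.amount), 0) AS total       -- 0 when nothing matched
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

print(rows)  # [('Ana', 25.0, 25.0), ('Ben', None, 0)]
```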
- Sums of measurements
- (total wait time, cumulative sales)
- Lesson 225 — CLT for Sums and Other Statistics
- Support complex relationships
- between different types of information (customers → orders → products)
- Lesson 842 — What is a Database?
- Supporting observations
- the patterns, anomalies, or visualizations that sparked the hypothesis (e.
- Lesson 1203 — Documenting Hypotheses and Evidence
- Suppression
- removes certain values entirely when they're too identifying—like removing ZIP codes for rural areas where few people live.
- Lesson 1895 — Data Anonymization BasicsLesson 1896 — K-Anonymity
- Surface plots
- provide a solid, colored representation that emphasizes the overall shape and makes valleys and peaks immediately visible.
- Lesson 1325 — 3D Surface and Wireframe Plots
- Surrogate
- 30-day engagement score or feature adoption rate
- Lesson 1517 — Surrogate Metrics: When Direct Measurement is Impractical
- Surrogate keys
- are artificial identifiers created solely for database purposes—typically auto-incrementing integers or UUIDs.
- Lesson 1050 — Choosing Effective Primary Keys
- surrogate metrics
- come in.
- Lesson 1517 — Surrogate Metrics: When Direct Measurement is ImpracticalLesson 1519 — Common Surrogate Metrics in A/B TestingLesson 1522 — Balancing Speed and Accuracy in Metric Selection
- Survey data
- When respondents represent different population sizes
- Lesson 43 — Weighted Mean and Its Applications
- Survey response rates
- (proportion who respond)
- Lesson 184 — Beta Distribution: Bounded Between 0 and 1
- Surveys and questionnaires
- Directly asking people for information
- Lesson 11 — Data Collection and Acquisition
- Survival analysis
- Time to failure after multiple stresses
- Lesson 181 — Gamma Distribution: Shape and Rate ParametersLesson 1674 — Churn Prediction Models
- Survival bias
- only certain types complete treatment and remain observable
- Lesson 1444 — Selection Bias and Treatment Assignment
- Survival models
- predict both *how long* a customer will remain active and *how much* they'll spend during that time.
- Lesson 1668 — Predictive LTV Models
- Survival times
- in medical studies
- Lesson 179 — When Variables Are Log-Normally DistributedLesson 187 — The Weibull Distribution: Shape, Scale, and Survival
- Survivorship bias
- Only studying "survivors" or successes.
- Lesson 244 — Selection Bias and Its CausesLesson 1532 — Survivorship Bias and Attrition
- Swamping
- A valid point gets falsely flagged because outliers distort the statistics
- Lesson 1407 — The ESD Component
- Symmetric
- around the mean (left side mirrors the right)
- Lesson 169 — The Normal Distribution: Definition and PropertiesLesson 194 — The Standard Normal DistributionLesson 1175 — Histograms for Distribution Shape
- Symmetric distributions
- (bell-shaped): Mean, median, and mode are roughly equal—use any, though mean is most common.
- Lesson 42 — Comparing Mean, Median, and ModeLesson 220 — Sample Size Requirements for the CLTLesson 221 — CLT for Different Population Distributions
- Symmetrical (No Skew)
- Lesson 64 — Skewness: Definition and Interpretation
- Symmetry
- Does the left mirror the right?
- Lesson 63 — Understanding Distribution ShapeLesson 174 — Symmetry and the Mode, Median, MeanLesson 377 — Testing Normality: Visual MethodsLesson 1176 — Box Plots for Spread and Outliers
- Symmetry around zero
- well-specified models should show roughly symmetric deviance residuals
- Lesson 701 — Deviance Residuals
- System dependencies
- (compilers, system libraries)
- Lesson 2038 — What is Environment Management and Why It Matters
T
- T-Closeness
- goes further: the distribution of sensitive attributes in each group must be **close to the overall distribution** in the dataset (within threshold T).
- Lesson 1897 — L-Diversity and T-Closeness
- t-distribution
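- replaces the normal distribution when σ must be estimated from a small sample; its heavier tails give larger critical values and therefore wider intervals.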
- Lesson 268 — Critical Values and the t-DistributionLesson 272 — When to Use Z vs tLesson 351 — When to Use a One-Sample t-TestLesson 352 — The t-Distribution and Degrees of Freedom
- t-statistic
- is the core calculation in a one-sample t-test.
- Lesson 353 — Calculating the t-StatisticLesson 606 — Statistical Significance of Individual CoefficientsLesson 621 — Interpreting t-Statistics and Confidence IntervalsLesson 654 — Testing Interaction Significance
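As a rough illustration of that calculation, the sketch below computes the standard one-sample statistic t = (x̄ − μ₀) / (s / √n) using only the Python standard library. The sample values and the hypothesized mean `mu0` are made up for the example.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the one-sample t-statistic for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)        # sample SD (n - 1 in the denominator)
    standard_error = sd / math.sqrt(n)   # estimated SD of the sample mean
    return (mean - mu0) / standard_error

# Hypothetical measurements, tested against a hypothesized mean of 5.0
t = one_sample_t([5.1, 4.9, 5.3, 5.2, 5.0], mu0=5.0)
```

The result is compared against a t-distribution with n − 1 degrees of freedom.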
- t-test
- with these hypotheses:
- Lesson 606 — Statistical Significance of Individual CoefficientsLesson 1749 — Measuring Statistical Significance
- T2D3
- = Triple, Triple, Double, Double, Double.
- Lesson 1629 — SaaS Growth Metrics: Quick Ratio and Net Revenue Retention
- Tab
- key (to move forward), **Shift+Tab** (to move backward), **Enter** or **Space** (to activate), and **arrow keys** (for fine control).
- Lesson 1253 — Interactive Accessibility: Keyboard Navigation
- table
- is like a spreadsheet in a database—it stores data in rows and columns.
- Lesson 846 — Tables, Schemas, and Data TypesLesson 1117 — What is an ORM and Why Use It?
- Table name qualification
- means prefixing column names with their table name using dot notation:
- Lesson 922 — Selecting Columns from Joined Tables
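A minimal sketch of dot-notation qualification, run through Python's built-in `sqlite3` module. The `customers` and `orders` tables and their columns are hypothetical, invented only to show two tables that share a column name.

```python
import sqlite3

# Hypothetical tables: both have a column called "id", so an unqualified
# "id" in the SELECT list would be ambiguous after the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0);
""")

# Each column is written as table.column, so its source is unambiguous.
rows = conn.execute("""
    SELECT customers.name, orders.id, orders.total
    FROM customers
    JOIN orders ON orders.customer_id = customers.id
    ORDER BY orders.id
""").fetchall()
conn.close()
```

Qualifying every column, not just the ambiguous ones, also makes a joined query easier to read later.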
- Table sizes
- Joining smaller tables first reduces intermediate result sets
- Lesson 951 — Join Order and Performance
- tables
- with rows and columns, making it easy to store large volumes of information efficiently and access it reliably.
- Lesson 842 — What is a Database?Lesson 843 — Relational Database Concepts
- Tail behavior
- Are extremes rare or common?
- Lesson 63 — Understanding Distribution ShapeLesson 193 — Choosing Between Distributions in Practice
- Take-Home Projects
- test end-to-end skills: EDA, feature engineering, modeling, and communication.
- Lesson 2142 — Interviewing: Technical and Behavioral Prep
- Target ROAS
- Varies by industry and margins, but often 3-4+ for healthy profitability
- Lesson 1751 — Return on Ad Spend (ROAS): Definition and CalculationLesson 1752 — Target ROAS and Break-Even Analysis
- Target variable
- Actual LTV (from mature cohorts where you've observed full lifecycles)
- Lesson 1668 — Predictive LTV Models
- Targeted interventions
- Identify high-risk periods
- Lesson 838 — Subscription and Membership Duration Modeling
- Task-level
- "Spend 2 hours exploring correlations, then move on"
- Lesson 2121 — Timeboxing and Deadlines
- tasks
- (individual units of work) and **operators** (templates for tasks like PythonOperator, BashOperator, or SQLOperator).
- Lesson 1833 — Introduction to Apache AirflowLesson 1835 — Airflow Operators and Tasks
- Tau-b
- Adjusts for ties in both variables (most common)
- Lesson 491 — Handling Ties in Rank Correlations
- Tax Reforms
- When a city or state changes tax policy, neighboring regions serve as control groups.
- Lesson 1459 — Real-World DiD Applications
- Teaching and documentation
- where the process matters as much as the result
- Lesson 2074 — Notebooks vs Scripts: When to Use Each
- Teaching and prototyping
- Perfect for learning Bayesian concepts or quickly testing ideas
- Lesson 1555 — Advantages and Limitations of Conjugate Priors
- Team alignment
- Give marketing, product, and leadership a shared view of what's working
- Lesson 1718 — Introduction to Marketing AttributionLesson 1727 — Linear Attribution Model
- Team capacity
- Your colleagues who depend on your work are blocked
- Lesson 2120 — The Opportunity Cost of Iteration
- Team-Level Key Results
- Each team then defines 3-5 measurable Key Results that directly influence the North Star.
- Lesson 1608 — Connecting North Star Metrics to OKRs
- Technical
- What systems must the solution integrate with?
- Lesson 2102 — Understanding Stakeholder Goals and Constraints
- Technical → Business
- When stakeholders ask "How accurate is the model?"
- Lesson 2105 — Translating Between Technical and Business Language
- Technical attributes
- Browser, operating system, connection speed
- Lesson 1682 — Segmenting Funnels by User Attributes
- Technical audiences
- (data scientists, engineers, analysts) typically:
- Lesson 1950 — Identifying Your Audience: Technical vs Non-Technical
- Technical costs
- landing pages, tracking infrastructure, A/B testing tools
- Lesson 1753 — Customer Acquisition Cost (CAC): Components and Calculation
- Technical deep-dives
- for the data-savvy audience members
- Lesson 1949 — Anticipating Questions: Building in Appendices
- Technical methodology details
- Lesson 1971 — Appendices and Technical Details
- Technical peers
- , on the other hand, often need diagnostic depth: distributions, error bars, residual plots, correlation matrices.
- Lesson 1954 — Tailoring Visualizations to Audience Needs
- Technical reviewers
- can evaluate your conclusion before diving into methods
- Lesson 1942 — The Pyramid Principle: Starting with the Conclusion
- temperature
- (or season).
- Lesson 495 — Confounding VariablesLesson 509 — Confounding Variables and ControlLesson 1427 — What is a Confounding Variable?
- Temperature readings
- If you're monitoring a freezer that must stay below 0°C, a reading of 5°C is an outlier *by definition*, even if it's close to the mean due to equipment malfunction.
- Lesson 75 — Domain-Specific Outlier Rules
- Templates and Tooling
- Lesson 1643 — Building Attribution Frameworks
- Templates are your foundation
- Create standardized templates for data documentation that include:
- Lesson 2068 — Data Provenance Best Practices
- Temporal dependence
- Values at time *t* depend on values at *t-1*, *t-2*, etc.
- Lesson 704 — What Makes Time Series Data Different?
- Temporality
- The cause must come *before* the effect—this is the only non-negotiable criterion.
- Lesson 498 — Bradford Hill Criteria for Causation
- Tenure and LTV
- High-LTV churners warrant more personalized, generous offers
- Lesson 1676 — Win-Back and Retention Strategies
- Terms of service respect
- – If you're scraping a website or using an API, are you honoring the platform's rules?
- Lesson 36 — Responsible Data Sourcing and Use
- Test
- Run controlled experiment until statistical significance
- Lesson 1692 — Statistical Significance and Iteration
- Test before replacing
- When proposing a new branch, validate that it truly influences parent metrics before permanently adding it to the tree.
- Lesson 1626 — Maintaining and Evolving Metric Trees
- Test causality
- Run experiments where you deliberately move the metric and observe effects.
- Lesson 1615 — Correlation Without Causation
- Test credentials
- Try connecting with a database client tool (like `psql` or SQLite browser) using the same credentials
- Lesson 1093 — Troubleshooting Connection Issues
- Test parallel trends visually
- Pre-treatment coefficients should be near zero
- Lesson 1457 — Multiple Time Periods and Staggered Adoption
- Test queries safely
- by previewing just a handful of rows
- Lesson 877 — LIMIT: Restricting the Number of Rows Returned
- Test restoration
- by recreating the environment on a fresh machine
- Lesson 1987 — Environment and Dependency Management
- Test segments
- scores should separate retained vs churned cohorts clearly
- Lesson 1699 — Engagement Scoring Systems
- Test set
- Fresh data held back until the very end for a final, unbiased evaluation
- Lesson 14 — Model Evaluation and Validation
- Test significance
- using likelihood ratio tests, Wald tests, or AIC/BIC comparisons—tools you've already learned.
- Lesson 703 — Sequential Model Building Strategy
- Test small first
- Use `LIMIT` while developing queries to avoid long waits
- Lesson 880 — Performance Considerations and Best Practices
- Test statistic
- A calculated value measuring deviation from normality (larger = less normal)
- Lesson 207 — Anderson-Darling TestLesson 314 — What is a Test Statistic?Lesson 319 — Calculating P-Values from Test StatisticsLesson 716 — Augmented Dickey-Fuller TestLesson 818 — What is the Log-Rank Test?
- Test with multiple people
- One person's confusion might be unique; three people struggling with the same element reveals a design problem.
- Lesson 1964 — Testing Visualizations with Audiences
- Test your work
- Use CVD simulation tools to preview your visualizations as colorblind viewers see them.
- Lesson 1248 — Color Blindness and Color Palette Design
- Testable hypothesis
- "Customers in Segment A have an average purchase frequency at least 20% higher than Segment B customers."
- Lesson 1200 — Formulating Specific, Testable Hypotheses
- Testing and development
- Engineers can work with realistic data without privacy concerns
- Lesson 1901 — Synthetic Data Generation
- Testing becomes impossible
- You can't validate pipeline logic if each run changes the outcome
- Lesson 1847 — What is Idempotency?
- Testing Multiple Claims Simultaneously
- Lesson 313 — Common Pitfalls in Hypothesis Formulation
- Text columns
- Find alphabetically first and last values (based on sorting order)
- Lesson 885 — MIN and MAX: Finding Extremes
- Text inputs
- capture free-form text for searches or custom labels:
- Lesson 1332 — Streamlit Widgets: Inputs and Controls
- Text processing
- Regular expressions, tokenization, or NLP on millions of documents
- Lesson 1784 — Computation Complexity: Beyond Data Size
- That's analysis
- Recommending Policy A because "efficiency matters most" is **advocacy**—it injects your (or your organization's) values into the decision.
- Lesson 1927 — Separating Analysis from Advocacy
- themes
- (overall aesthetic) and **contexts** (size scaling).
- Lesson 1294 — Seaborn Themes and Context SettingsLesson 1340 — The Seven Layers of Grammar
- Theoretical quantiles
- (what we'd expect from a perfect normal distribution) on the x-axis
- Lesson 565 — What Q-Q Plots Show: Comparing Residual Distribution to Normal
- Theory
- Does domain knowledge suggest this predictor matters?
- Lesson 625 — Practical Workflow: Testing and Interpreting Predictors
- They all still apply
- when you add more predictors.
- Lesson 601 — Assumptions for Multiple Linear Regression
- They generate moments
- Taking derivatives at t=0 gives you the "raw moments" of the distribution.
- Lesson 150 — Moment Generating Functions
- They penalize complexity
- Adding unnecessary parameters increases the score
- Lesson 781 — Information Criteria: AIC and BIC
- They simplify algebra
- MGFs make it easier to prove properties about sums of independent random variables (like that the sum of independent Poisson variables is also Poisson).
- Lesson 150 — Moment Generating Functions
- They uniquely identify distributions
- If two random variables have the same MGF, they have the same probability distribution—no other function needed!
- Lesson 150 — Moment Generating Functions
- They're correlated
- and run against large tables (thousands of executions)
- Lesson 966 — Performance Considerations for WHERE Subqueries
- They're reasonable
- The conjugate family genuinely captures your prior knowledge
- Lesson 1556 — Choosing Between Conjugate and Non-Conjugate Priors
- Think of it as
- Knocking on someone's front door and asking politely for information they're willing to share.
- Lesson 21 — APIs and Web Scraping
- Think of it like
- A visual IQR calculator that also flags unusual values.
- Lesson 55 — Visualizing SpreadLesson 567 — Common Q-Q Plot Patterns: Heavy Tails and Light TailsLesson 1786 — Data Processing Patterns Best Suited for Spark
- Thinning
- means keeping only every *k*th sample.
- Lesson 1592 — Burn-in, Thinning, and Convergence Diagnostics
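Thinning amounts to a strided slice over the stored draws. The sketch below assumes the chain is held in a plain Python list; the helper name `thin` and the stand-in chain are illustrative, not taken from any particular MCMC library.

```python
def thin(samples, k):
    """Keep every k-th sample (the k-th, 2k-th, ...) and discard the rest."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return samples[k - 1::k]

chain = list(range(1, 21))   # stand-in for 20 successive MCMC draws
thinned = thin(chain, 5)     # keeps draws 5, 10, 15 and 20
```

Thinning lowers the autocorrelation between retained draws and saves storage, but it also throws samples away.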
- Third batch arrives
- Use Beta(17, 25) as prior → and so on.
- Lesson 1563 — Sequential Updating with New Data
- Third evidence (alibi confirmed)
- Use 85% as the new prior → posterior drops to 30%.
- Lesson 114 — Sequential Updating
- Third Normal Form (3NF)
- eliminates *transitive dependencies*, where a non-key attribute depends on another non-key attribute, which in turn depends on the primary key.
- Lesson 1066 — Third Normal Form (3NF)
- Third Quartile (Q3)
- the 75th percentile
- Lesson 59 — The Five-Number Summary and Box PlotsLesson 1383 — Understanding the Interquartile Range (IQR)
- third variable
- here is temperature (or summer season).
- Lesson 497 — The Third Variable ProblemLesson 506 — Introduction to Partial CorrelationLesson 1423 — The Third Variable ProblemLesson 1426 — Real-World Examples: Correlation vs Causation
- This is backwards
- Lesson 106 — Common Misconceptions About Independence
- This is your default
- When in doubt, use two-tailed—it's more conservative and widely accepted.
- Lesson 350 — Choosing the Right Tail Configuration
- This uncertainty matters
- When we estimate σ from a small sample, our confidence interval needs to be *wider* to account for the extra uncertainty.
- Lesson 268 — Critical Values and the t-Distribution
- Thompson Sampling
- directly sample from posterior distributions to make allocation decisions—a natural Bayesian approach.
- Lesson 1586 — Multi-Armed Bandit Connections
- Three columns
- with 100 values each → up to 1,000,000 potential groups
- Lesson 911 — Performance Considerations with Multiple Groups
- Threshold adjustment
- Use different decision thresholds for different groups to equalize outcomes.
- Lesson 1894 — Auditing and Remediation Strategies
- Threshold effects
- Variables behave differently above/below a certain value
- Lesson 1189 — Detecting Nonlinear Relationships
- Tick Marks and Labels
- Customize where tick marks appear and what they say using `set_xticks()` and `set_xticklabels()`.
- Lesson 1270 — Customizing Axes: Labels, Limits, and Scales
- Tick marks or crosses
- Often indicate censored observations
- Lesson 815 — Survival Curve Plots and Interpretation
- Tidy data
- is a standardized way of organizing datasets that follows three simple rules:
- Lesson 1142 — What is Tidy Data?
- tidy data principles
- and creates maintenance nightmares.
- Lesson 1148 — Handling Multiple Types in One TableLesson 1151 — Schema Validation
- Time
- Months (or days) since loan origination
- Lesson 840 — Loan Default Timing and Credit RiskLesson 2102 — Understanding Stakeholder Goals and Constraints
- Time and resource limits
- "We need an answer in two weeks, even if it's rough"
- Lesson 2117 — Defining 'Good Enough' with Stakeholders
- Time for Spark
- When datasets exceed available RAM or when processing takes hours instead of minutes
- Lesson 1783 — Data Size Thresholds: When Pandas Isn't Enough
- Time intervals
- span durations: "January 2024 to March 2024" or "Q1 2023"
- Lesson 19 — Temporal Data and Time Series
- Time investment explodes
- Simple features took hours; the next marginal improvement requires days of engineering
- Lesson 2116 — Diminishing Returns and the 80/20 Rule
- time origin
- is your starting line—the moment when the clock begins for each subject.
- Lesson 803 — Defining the Event and Time OriginLesson 835 — Customer Churn Prediction with Survival Analysis
- Time plot of residuals
- Should look randomly scattered around zero with constant variance
- Lesson 799 — Fitting and Diagnosing SARIMA Models
- Time series comparisons
- multiple metrics over the same time period
- Lesson 1276 — Sharing Axes Between Subplots
- Time Series Data
- Measurements taken over time (stock prices, daily temperatures) often show autocorrelation: today's value relates to yesterday's value.
- Lesson 381 — Independence Assumption and Its ViolationsLesson 548 — Independence of Observations
- Time series plot
- Should show constant mean and variance over time
- Lesson 741 — Testing Stationarity After Transformation
- Time since churn
- Fresh churners respond better than those gone 6+ months
- Lesson 1676 — Win-Back and Retention Strategies
- Time trends and seasonality
- Lesson 1741 — Controlling for Seasonality and External Factors
- Time windows
- set boundaries—how far back you look for attributable touchpoints.
- Lesson 1639 — Time Windows and Attribution Decay
- Time-based rules
- Sales of winter coats in July might look like outliers, but they could be legitimate clearance sales or southern hemisphere orders.
- Lesson 75 — Domain-Specific Outlier Rules
- Time-bound
- Set a clear horizon (quarterly, annually) so urgency is built in
- Lesson 1609 — Setting Effective ObjectivesLesson 1610 — Defining Measurable Key Results
- Time-Lagged Analysis
- Lesson 1602 — Identifying Leading Indicators for Your Metrics
- Time-to-conversion
- analysis models the journey from first contact (lead acquisition) to purchase, treating non-converters as **censored observations**: they didn't experience the "event" (conversion) during your observation window.
- Lesson 839 — Time-to-Conversion in Marketing Funnels
- Time-to-match
- How long until a buyer finds a seller
- Lesson 1630 — Marketplace Metrics: GMV, Take Rate, and Liquidity
- Time-varying covariates
- allow your survival model to reflect these dynamic changes.
- Lesson 833 — Time-Varying Covariates
- Timebox tasks
- 2 days EDA, 2 days feature prep, 1 day modeling
- Lesson 2113 — Timeboxing and Sprint Planning for Data Projects
- Timeboxing
- means allocating a fixed duration—say, three days for EDA or one week for initial modeling—and forcing yourself to produce *something* deliverable when time runs out, even if it's imperfect.
- Lesson 2113 — Timeboxing and Sprint Planning for Data ProjectsLesson 2121 — Timeboxing and Deadlines
- Timeline
- Prototype in 3 weeks, deploy in 6 weeks
- Lesson 1948 — The Recommendation Slide: Making It ActionableLesson 2103 — Managing Expectations and Defining Success
- timeliness
- , **validity**, and **uniqueness**.
- Lesson 1863 — Data Quality DimensionsLesson 1867 — Data Profiling and MonitoringLesson 1869 — Data Quality Metrics and SLAsLesson 1986 — Automated Report GenerationLesson 2086 — Stage 2: Data Acquisition and Assessment
- Timely insights
- Market conditions change; delays reduce relevance
- Lesson 2120 — The Opportunity Cost of Iteration
- Timeout Errors
- occur when connections take too long to establish or queries run longer than allowed.
- Lesson 1093 — Troubleshooting Connection Issues
- Timestamps
- mark exact moments: "2024-03-15 14:32:05" (year-month-day hour:minute:second)
- Lesson 19 — Temporal Data and Time SeriesLesson 1857 — Logging Best PracticesLesson 1988 — Embedding Data Lineage and MetadataLesson 2065 — Tracking Data Lineage
- Timestamps and Version Fields
- Add `created_at` and `updated_at` timestamps to your data.
- Lesson 1848 — Designing Idempotent Operations
- Too few bins
- You lose detail and may miss important patterns
- Lesson 1267 — Histograms and Distribution Plots
- Too few examples
- Training a neural network with 50 samples?
- Lesson 2124 — Insufficient or Low-Quality Data
- Too large
- Each partition takes a long time to process, limiting parallelism.
- Lesson 1794 — Working with Partitions
- Too many bins
- The plot becomes noisy and hard to interpret
- Lesson 1267 — Histograms and Distribution Plots
- Too narrow
- Creates noisy, overfit patterns from random variation
- Lesson 1245 — Misleading Aggregations and Binning
- Too noisy
- Revenue varies wildly day-to-day, drowning out true effects
- Lesson 1517 — Surrogate Metrics: When Direct Measurement is Impractical
- Too rare
- Conversions on high-ticket items are infrequent
- Lesson 1517 — Surrogate Metrics: When Direct Measurement is Impractical
- Top performers
- 90th percentile and above
- Lesson 61 — Using Percentiles for Comparison and Benchmarking
- Top-of-funnel optimization
- Where should you invest to grow your audience?
- Lesson 1720 — First-Touch Attribution Model
- Total registered users
- (without knowing active users or retention)
- Lesson 1612 — What Are Vanity Metrics?
- Trace plots
- Visualize the chain over iterations—it should look like random noise around a stable mean, not trending or stuck
- Lesson 1592 — Burn-in, Thinning, and Convergence Diagnostics
- Track improvements
- Overlay cohorts from before and after a product change to see if retention improved.
- Lesson 1656 — Visualizing Retention Curves
- Track randomization seed
- always save the random seed used for reproducibility
- Lesson 1492 — Rerandomization and Practical Implementation
- Track step repetition frequency
- – identify which steps users commonly revisit
- Lesson 1683 — Multi-Path and Non-Linear Funnels
- Tracking Only Lagging Metrics
- Lesson 1603 — Common Pitfalls in Indicator Selection
- Tracking Pixels
- are tiny, invisible images embedded in emails or third-party sites.
- Lesson 1713 — Tracking Users by Channel
- Trade-off
- Slightly lower power (5-15% efficiency loss if data *were* normal), and results describe distributions or medians, not means.
- Lesson 475 — Choosing Between Parametric and Non-Parametric TestsLesson 1767 — Scale-Up vs Scale-Out Architectures
- Tradeoffs
- Choosing "good enough" over perfection
- Lesson 2142 — Interviewing: Technical and Behavioral Prep
- Traffic source
- Organic search, paid ads, social media, email, direct
- Lesson 1682 — Segmenting Funnels by User Attributes
- Traffic volume
- is often your biggest limitation.
- Lesson 1500 — Practical Considerations and Trade-offsLesson 1714 — Channel-Level Metrics
- Trailing moving averages
- (also called "backward-looking") use only past data points.
- Lesson 753 — Centered vs Trailing Moving Averages
- Train on the rest
- Build your ARIMA, Holt-Winters, or other model using only the training portion
- Lesson 790 — Out-of-Sample Forecast Evaluation
- Transform
- it on cheaper servers or ETL tools (like Informatica or DataStage)
- Lesson 1817 — Historical Context: Why ETL Came First
- Transform back
- to the correlation scale using the inverse transformation
- Lesson 503 — Confidence Intervals for Correlation Coefficients
- Transform within the warehouse
- using SQL-based tools like **dbt** (data build tool)
- Lesson 1821 — Hybrid Approaches and Modern Data Stacks
- Transform your data
- to reflect the null hypothesis being true
- Lesson 396 — Bootstrap Hypothesis Testing
- Transform Your Variables
- Lesson 564 — What to Do When Residual Plots Show Problems
- Transformation History
- What cleaning or calculations were applied
- Lesson 1163 — Metadata and Data Dictionaries
- Transformation layers
- like **dbt** that version-control SQL transformations, run tests, and document data models
- Lesson 1821 — Hybrid Approaches and Modern Data Stacks
- Transformation logic
- Clean, join, aggregate, or enrich data (the "T" in ETL/ELT)
- Lesson 1822 — What is a Data Pipeline?
- Transformations
- Has someone already cleaned or filtered it?
- Lesson 23 — Data Provenance and MetadataLesson 1189 — Detecting Nonlinear RelationshipsLesson 1344 — Scales and Coordinate SystemsLesson 1774 — What is Apache Spark and Why Use It?Lesson 1780 — Transformations vs Actions in SparkLesson 1800 — Chunked Reading with read_csvLesson 1823 — Pipeline Components: Sources, Transformations, SinksLesson 2065 — Tracking Data Lineage
- Transformations are simpler
- Operations like filtering, grouping, and summarizing follow predictable patterns
- Lesson 1142 — What is Tidy Data?
- Transformed coordinates
- apply mathematical transformations to the entire space
- Lesson 1344 — Scales and Coordinate Systems
- Transforming
- using SQL queries within the warehouse itself
- Lesson 1816 — What is ELT? Extract, Load, Transform Explained
- Transforming features
- to meet model assumptions or improve performance: scaling numerical features, encoding categorical variables, handling skewed distributions, or creating polynomial terms.
- Lesson 2088 — Stage 4: Feature Engineering and Preparation
- Transient
- Implement exponential backoff, retry 3-5 times
- Lesson 1849 — Transient vs Permanent Failures
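One way to sketch "exponential backoff, retry 3-5 times" with the standard library alone. `TransientError`, `retry_with_backoff`, and the default of four attempts with a 0.5 s base delay are assumptions made for this example; a real pipeline would catch whatever exceptions its client library raises for timeouts and similar failures.

```python
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""

def retry_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(); after each TransientError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...

# Usage: a flaky call that succeeds on its third attempt.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary glitch")
    return "ok"

delays = []
# Record the delays instead of actually sleeping, so the demo runs instantly.
result = retry_with_backoff(flaky, sleep=delays.append)
```

Any other exception type passes straight through without a retry, which is the behavior you would want for permanent failures.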
- Transient failures
- typically include:
- Lesson 1849 — Transient vs Permanent FailuresLesson 1850 — Retry Strategies
- Transitive dependencies
- are the hidden culprit: Package A depends on Package B version 2, but Package C needs Package B version 3.
- Lesson 2048 — The Dependency Hell Problem
- Transparency
- Don't hide limitations.
- Lesson 1247 — The Ethics of Visualization DesignLesson 1341 — Data and Aesthetic MappingsLesson 1643 — Building Attribution FrameworksLesson 1816 — What is ELT? Extract, Load, Transform ExplainedLesson 1931 — When to Push Back on RequestsLesson 2029 — Draft Pull Requests and WIP Workflows
- Transparency (alpha)
- prevents overplotting in dense datasets.
- Lesson 1265 — Scatter Plots: Relationships Between Variables
- Transparency/alpha
- Let overlapping points blend, showing density through darker areas
- Lesson 1310 — Point Maps and Scatter Plots on Maps
- Transportation
- Optimizing delivery routes or predicting traffic patterns
- Lesson 6 — Common Data Science Applications
- Treated
- is a binary indicator (1 if unit is in treatment group, 0 if control)
- Lesson 1455 — DiD with Regression
- Treated × Post
- is the **interaction term** between the two indicators
- Lesson 1455 — DiD with Regression
- treatment
- (version B—a new feature, design, or intervention), while the other receives the **control** (version A—the current state or baseline).
- Lesson 1477 — Core Principles of A/B TestingLesson 1482 — Control and Treatment Design
- Treatment Effect Estimation
- calculates the difference in average outcomes between those who received the treatment and those who didn't.
- Lesson 1440 — Treatment Effect Estimation
- treatment group
- (receives the intervention) or a **control group** (does not receive the intervention).
- Lesson 1435 — What is a Randomized Controlled Trial?Lesson 1641 — Isolating Effects with Control GroupsLesson 1677 — Measuring Churn Reduction ImpactLesson 1688 — A/B Testing for Conversion OptimizationLesson 1745 — Holdout Groups and Test Design
- Treatment group, after intervention
- Lesson 1452 — The Difference-in-Differences Setup
- Treatment Type
- (Drug A vs Drug B) and **Gender** (Male vs Female) on recovery time.
- Lesson 653 — Interpreting Categorical × Categorical Interactions
- Tree-based models
- (decision trees, random forests): These algorithms don't use the same linear framework as regression and can handle all k variables without issues
- Lesson 638 — One-Hot Encoding Overview
- trend
- a general direction the data is moving.
- Lesson 706 — Trend: Long-Term DirectionLesson 710 — Additive vs Multiplicative ModelsLesson 711 — Visualizing Components with Decomposition PlotsLesson 715 — Visual Tests for StationarityLesson 742 — Components of Seasonal DecompositionLesson 744 — Classical Decomposition MethodsLesson 747 — Interpreting Decomposition PlotsLesson 761 — Double Exponential Smoothing (Holt's Method) (+6 more)
- Trend component
- (the long-term direction)
- Lesson 711 — Visualizing Components with Decomposition PlotsLesson 742 — Components of Seasonal DecompositionLesson 769 — Smoothing Parameters: Alpha, Beta, Gamma
- Trend equation
- The current rate of change, smoothed over time
- Lesson 761 — Double Exponential Smoothing (Holt's Method)Lesson 767 — Holt-Winters Additive ModelLesson 768 — Holt-Winters Multiplicative Model
- Trend or pattern
- "Sales increased steadily from January to December"
- Lesson 1250 — Text Alternatives and Screen Reader Compatibility
- Trend Signals
- Lesson 1401 — Detecting Out-of-Control Signals
- Trends
- Are sales climbing over the year?
- Lesson 19 — Temporal Data and Time SeriesLesson 562 — Index Plots and Time-Ordered ResidualsLesson 760 — Forecasting with Simple Exponential SmoothingLesson 1183 — Scatter Plots for Two Numeric Variables
- Triggers
- allow one pipeline to programmatically start another pipeline upon completion.
- Lesson 1845 — Cross-Pipeline Dependencies
- Trimming whitespace
- removes leading and trailing spaces that creep in from manual data entry or faulty exports.
- Lesson 1138 — Cleaning and Standardizing Text Fields
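In Python, the built-in `str.strip()` does this. A small sketch with made-up values:

```python
# Hypothetical raw values with stray spaces, a tab and a newline
raw_names = ["  Alice", "Bob  ", "\tCarol\n", "New  York "]

# strip() removes leading and trailing whitespace only;
# the double space inside "New  York" is left untouched.
cleaned = [name.strip() for name in raw_names]
```

Collapsing repeated internal whitespace is a separate step, for example `" ".join(name.split())`.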
- Tritanopia
- (blue-yellow, rare): difficulty with blue and yellow
- Lesson 1248 — Color Blindness and Color Palette Design
- Troubleshoot failures
- If a downstream task fails, check its upstream dependencies first
- Lesson 1841 — Upstream and Downstream Dependencies
- true
- probability without relying on large-sample approximations.
- Lesson 432 — Fisher's Exact Test: The LogicLesson 871 — NULL Handling with Logical Operators
- True metric
- Annual subscription renewal rate
- Lesson 1517 — Surrogate Metrics: When Direct Measurement is Impractical
- True Positives (TP)
- Correctly identified change-points
- Lesson 1418 — Evaluating Change-Point Detection Methods
- Trust erosion
- with stakeholders when they catch problems before you do
- Lesson 2136 — Monitoring Gaps and Silent Failures
- Try common encodings explicitly
- UTF-8 (most modern), Latin-1 (ISO-8859-1, Western European), or CP1252 (Windows)
- Lesson 1135 — Detecting and Fixing Encoding Issues
- Try d=1
- If non-stationary, apply first-order differencing (subtracting the previous value from each value).
- Lesson 778 — Determining Differencing Order (d)
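First-order differencing can be sketched in a few lines of plain Python. The series below is invented for illustration, and the helper name `difference` is an assumption rather than a library function.

```python
def difference(series):
    """First-order differencing: y[t] - y[t-1] for each t after the first."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

trending = [10, 12, 15, 19, 24]        # hypothetical series with an upward trend
diffed = difference(trending)          # d = 1
diffed_twice = difference(diffed)      # d = 2, only if d = 1 is not enough
```

The differenced series is one observation shorter than the original, and it would then be re-checked for stationarity before settling on a value of d.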
- Try different combinations
- of alpha, beta, and gamma values (typically between 0 and 1)
- Lesson 772 — Holt-Winters Parameter Optimization
- Try multiple reasonable priors
- Use informative, weakly informative, and uninformative priors for the same problem
- Lesson 1572 — Sensitivity Analysis and Prior Robustness
- Tukey's fences
- use the IQR to build "boundary lines" beyond which data points are considered outliers.
- Lesson 72 — IQR Method and Tukey's Fences
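A minimal sketch of the fences, assuming the conventional multiplier of 1.5 and the `"inclusive"` quartile method from Python's `statistics` module. Neither choice is stated in the entry above, and other quartile conventions give slightly different fences. The data are made up.

```python
import statistics

def tukey_fences(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]   # hypothetical data with one extreme value
lower, upper = tukey_fences(values)
outliers = [v for v in values if v < lower or v > upper]
```

Points outside the fences are flagged for inspection, not removed automatically.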
- TV(t), Radio(t), Digital(t)
- are your marketing spend amounts in each channel at time *t*
- Lesson 1738 — The Core MMM Regression Model
- two categorical variables
- (like color preference and age group)
- Lesson 422 — Introduction to Chi-Squared Test of IndependenceLesson 1181 — What is Bivariate Analysis?
- Two columns
- with 100 values each → up to 10,000 potential groups
- Lesson 911 — Performance Considerations with Multiple Groups
- two-sample t-test
- , the process is similar but accounts for both group sizes and their combined variability.
- Lesson 343 — Calculating Power for Common TestsLesson 359 — Two-Sample t-Test OverviewLesson 360 — Independent vs. Dependent SamplesLesson 375 — Paired t-Test vs Two-Sample t-Test
- Two-sided
- "The parameter is *different* from the null value" (≠)
- Lesson 308 — Defining the Alternative Hypothesis (H₁ or Hₐ)Lesson 311 — One-Sided vs Two-Sided AlternativesLesson 345 — Directionality in Hypothesis TestingLesson 373 — Hypotheses for Paired t-TestsLesson 401 — Setting Up Hypotheses for ProportionsLesson 1393 — Two-Sided vs One-Sided Grubbs' Test
- Two-sided (two-tailed) test
- This tests whether the *most extreme value* — either the maximum OR minimum — is an outlier.
- Lesson 1393 — Two-Sided vs One-Sided Grubbs' Test
- two-sided test
- , you calculate the probability in *both* tails (values as extreme or more extreme in either direction).
- Lesson 319 — Calculating P-Values from Test StatisticsLesson 325 — The Rejection Region
- Two-tailed
- H₁: p₁ ≠ p₂ (testing for *any* difference)
- Lesson 406 — Two-Sample Proportion Test SetupLesson 433 — Conducting Fisher's Exact Test
- Two-tailed test
- You care about differences in *either* direction (bigger or smaller).
- Lesson 348 — P-Value Calculation DifferencesLesson 354 — Setting Up Hypotheses for One-Sample t-TestLesson 410 — P-Value Calculation and InterpretationLesson 433 — Conducting Fisher's Exact Test
- Type 1 (Overwrite)
- Replace the old value with the new one.
- Lesson 1809 — Dimension Tables and Slowly Changing Dimensions
- Type 3 (Add Column)
- Store both current and previous values in separate columns (e.
- Lesson 1809 — Dimension Tables and Slowly Changing Dimensions
- Type I error
- occurs when you **reject a true null hypothesis**.
- Lesson 330 — Understanding Type I Error (False Positive)Lesson 333 — Consequences of Type I and Type II ErrorsLesson 624 — Multiple Testing Considerations
- Type I Error (α)
- appears as the shaded area *under the null curve* that falls into the rejection region.
- Lesson 336 — Visualizing Error Types with Sampling Distributions
- Type II error
- occurs when you **fail to reject a false null hypothesis**.
- Lesson 331 — Understanding Type II Error (False Negative)Lesson 333 — Consequences of Type I and Type II Errors
- Type II Error (β)
- appears as the shaded area *under the alternative curve* that falls *outside* the rejection region (where you fail to reject H₀).
- Lesson 336 — Visualizing Error Types with Sampling Distributions
- Type mismatches
- Passing `"hello"` when you expect an integer
- Lesson 1109 — Input Validation and Defense in DepthLesson 1150 — What is Data Validation?
- Type of phone
- (iPhone vs Android) can correlate with socioeconomic status
- Lesson 1883 — Protected Classes and Proxy Variables
- Types of contributions welcome
- Documentation fixes?
- Lesson 2083 — Contributing Guidelines and Contact Information
- Typical pattern
- Lesson 1113 — Rolling Back Transactions
U
- Uber
- Rides completed — directly measures successful matching of drivers and riders.
- Lesson 1606 — Examples of North Star Metrics by Industry
- unbiased
- .
- Lesson 255 — Expected Value of Sample StatisticsLesson 521 — Properties of Least Squares EstimatorsLesson 552 — Zero Conditional Mean of ErrorsLesson 554 — Consequences of Violating Assumptions
- Unbounded above
- Theoretically no maximum count (though very large counts are rare in practice)
- Lesson 689 — When to Use Poisson Regression
- UNBOUNDED FOLLOWING
- End at the very last row of the partition
- Lesson 1020 — UNBOUNDED and CURRENT ROW Keywords
- UNBOUNDED PRECEDING
- Start at the very first row of the partition
- Lesson 1020 — UNBOUNDED and CURRENT ROW Keywords
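A short sketch of both frame keywords, run through Python's built-in `sqlite3` module. It assumes the bundled SQLite is version 3.25 or later (needed for window functions); the `sales` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

rows = conn.execute("""
    SELECT day,
           -- frame starts at the first row of the partition
           SUM(amount) OVER (ORDER BY day
                             ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               AS running_total,
           -- frame ends at the last row of the partition
           SUM(amount) OVER (ORDER BY day
                             ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
               AS remaining_total
    FROM sales
    ORDER BY day
""").fetchall()
```

With no `PARTITION BY`, the whole result set is treated as a single partition.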
- Unbounded Retention
- (also called "Return on or After Day N") measures the percentage of users who come back *any time on or after* Day N.
- Lesson 1654 — Classic vs Unbounded Retention
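The difference between the two retention definitions can be sketched as follows. This assumes classic retention counts users active exactly on Day N; the activity data and function names are hypothetical.

```python
def classic_retention(activity, day):
    """Share of users active exactly on `day`."""
    return sum(day in days for days in activity.values()) / len(activity)

def unbounded_retention(activity, day):
    """Share of users active on `day` or on any later day."""
    return sum(any(d >= day for d in days) for days in activity.values()) / len(activity)

# Days since signup on which each user was active (hypothetical).
activity = {
    "u1": {0, 7},
    "u2": {0, 9},
    "u3": {0, 3},
    "u4": {0},
}

classic_d7 = classic_retention(activity, 7)
unbounded_d7 = unbounded_retention(activity, 7)
```

Unbounded retention can never be lower than classic retention for the same day, since every user counted by the classic definition is also counted by the unbounded one.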
- Uncertainty is present
- The relationship between surrogate and business metric is unproven
- Lesson 1522 — Balancing Speed and Accuracy in Metric Selection
- Uncertainty Quantification
- Lesson 1539 — Interpreting Posterior Probabilities
- Under-controlling
- Ignoring confounders because they seem unimportant or weren't measured.
- Lesson 1476 — Common DAG Patterns and Pitfalls
- Under-investing
- in channels with high incremental value but lower raw volume
- Lesson 1717 — Incrementality and True Channel Impact
- Undercoverage
- Your sampling frame (the list you sample from) doesn't include part of the population.
- Lesson 244 — Selection Bias and Its CausesLesson 249 — Coverage Error and Undercoverage
- Undermining trust
- Stakeholders may feel manipulated rather than informed
- Lesson 1927 — Separating Analysis from Advocacy
- Understand complexity
- Most conversions aren't one-click decisions; they involve multiple channels and interactions
- Lesson 1719 — The Customer Journey and Touchpoints
- Understand conditional relationships
- When relationships hold under specific circumstances
- Lesson 1190 — Introduction to Multivariate Analysis
- Understand decision-maker constraints
- Your stakeholder might need results before quarterly board meetings, end-of-month planning sessions, or annual budget reviews.
- Lesson 2099 — Aligning with Business Timelines and Decision Points
- Understand structural changes
- in your domain (markets expanding, behaviors shifting)
- Lesson 706 — Trend: Long-Term Direction
- Understanding cardinality
- Join tables that produce smaller results first when possible
- Lesson 951 — Join Order and Performance
- Understanding patterns
- Knowing that average temperature is 70°F doesn't tell you if you need both winter coats and shorts
- Lesson 46 — What is Variability?
- Understanding the business context
- What decision will this analysis inform?
- Lesson 2085 — Stage 1: Problem Definition and Scoping
- Understanding the real world
- Shape reveals the story behind your numbers.
- Lesson 63 — Understanding Distribution Shape
- Undirected graphs
- show symmetrical relationships.
- Lesson 1316 — Introduction to Network Graphs and Graph Theory Basics
- Unequal variances
- → Use Welch's t-test
- Lesson 383 — Diagnostic Workflow: When to Proceed or Switch TestsLesson 390 — When Parametric Tests Fail: Violations of AssumptionsLesson 398 — Choosing Between Parametric and Non-Parametric TestsLesson 461 — Games-Howell Test for Unequal Variances
- Unexpected duplicates
- The same transaction or observation recorded multiple times
- Lesson 1154 — Uniqueness and Duplication Checks
- Unexpected paths
- (tasks that shouldn't depend on each other)
- Lesson 1846 — Testing and Validating Dependency Graphs
- Unexpected Patterns
- Look for broken correlations (height and weight usually relate; if they suddenly don't, check your data), unusual counts (suddenly 200 records instead of the usual 50), or rare category values appearing too frequently.
- Lesson 1157 — Statistical Anomaly Detection in QA
- unexpected relationships
- in your data
- Lesson 1181 — What is Bivariate Analysis?Lesson 1192 — Correlation Matrices and Heatmaps
- Unicode
- is the universal character encoding standard that assigns a unique number to every character across all writing systems.
- Lesson 1139 — Dealing with Special Characters and Unicode
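A brief illustration of the distinction between a character's Unicode code point and its encoded bytes. UTF-8 is used here as one common encoding; the sample string is arbitrary.

```python
text = "caf\u00e9"  # "café", with é written as an explicit code point

# Each character maps to one Unicode code point (an integer).
code_points = [ord(ch) for ch in text]

# An encoding such as UTF-8 turns code points into bytes;
# non-ASCII characters may take more than one byte.
utf8_bytes = text.encode("utf-8")

# Decoding with the same encoding recovers the original text.
round_trip = utf8_bytes.decode("utf-8")
```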
- Unimodal
- (has one peak at the mean)
- Lesson 169 — The Normal Distribution: Definition and PropertiesLesson 1175 — Histograms for Distribution Shape
- uninformative prior
- that assigns equal probability across all plausible values.
- Lesson 1543 — Defining Prior DistributionsLesson 1581 — Setting Priors for A/B Tests
- Unique identifier validation
- Verify that ID columns contain no duplicates.
- Lesson 1154 — Uniqueness and Duplication Checks
- Uniqueness
- Each value in the primary key column must be unique across the entire table.
- Lesson 1048 — What Are Primary Keys?Lesson 1863 — Data Quality DimensionsLesson 1865 — Data Quality Checks in Pipelines
- Unit tests for dependencies
- Write tests that assert specific relationships exist.
- Lesson 1846 — Testing and Validating Dependency Graphs
- Uniting columns
- is the reverse: combining multiple columns into one when they represent a single logical unit.
- Lesson 1147 — Separating and Uniting Columns
- Units
- Currency (USD), measurements (kg, meters), percentages
- Lesson 1163 — Metadata and Data DictionariesLesson 2064 — Creating Data Dictionaries
- Unnatural constraints
- Sometimes the conjugate form doesn't match your actual prior knowledge
- Lesson 1555 — Advantages and Limitations of Conjugate Priors
- Unpooled variance
- treats each group's variance as unique.
- Lesson 285 — Pooled vs Unpooled Variance Approaches
- Unreliable forecasts
- Predictions become meaningless outside your training period
- Lesson 734 — Why Differencing and Detrending Matter
- Unreliable predictions
- Since the underlying process is changing, our model's parameters, estimated from past data, won't accurately describe future behavior.
- Lesson 713 — Why Stationarity Matters
- Unrepresentative samples
- If your data doesn't reflect the real-world distribution, predictions will fail in production.
- Lesson 2124 — Insufficient or Low-Quality Data
- Unstable Coefficient Estimates
- Lesson 581 — Symptoms of Multicollinearity
- Unstable coefficients
- Small changes in your data can lead to large swings in the estimated regression coefficients
- Lesson 580 — What is Multicollinearity?
- Untracked data sources
- Multiple teams pull from the same database table, but nobody coordinates when structure or semantics change.
- Lesson 2133 — Undocumented Data Dependencies
- Untracked files
- Files Git doesn't know about yet (never staged or committed).
- Lesson 1997 — Viewing Repository State with git statusLesson 1998 — Checking Repository Status
- Unused indexes
- consume storage and slow down writes (INSERT, UPDATE, DELETE) without providing query benefits.
- Lesson 1086 — Index Maintenance and Monitoring
- Update
- Apply Bayes' theorem to compute the posterior using that data
- Lesson 1582 — Updating Beliefs with Test Data
- Update Anomalies
- Lesson 1062 — Data Anomalies: Insert, Update, Delete
- Update complexity
- requiring changes in multiple places
- Lesson 1071 — When to Denormalize: Performance Trade-offsLesson 1074 — Duplicating Data Across Tables
- UPDATE protection
- You cannot change a foreign key to point to a non-existent parent
- Lesson 1052 — Foreign Key Constraints
- Update with data
- from each group separately to get two posterior distributions: one for μ₁ and one for μ₂
- Lesson 1570 — Comparing Two Means: Bayesian Approach
- Updated beliefs
- Compare the posterior to your prior.
- Lesson 1547 — Interpreting Posterior Distributions
- Updates belief
- Strong data can overcome weak priors; strong priors resist contradictory weak data
- Lesson 1537 — The Posterior Distribution
- Updates segment membership
- as customer behavior evolves
- Lesson 1710 — Operationalizing Segments: Scoring and Deployment
- Updating
- existing information
- Lesson 844 — What is SQL?Lesson 1124 — Insert, Update, Delete, and Bulk Operations
- Upper Control Limit (UCL)
- Typically 3 standard deviations above the mean
- Lesson 1396 — Introduction to Control ChartsLesson 1397 — Shewhart Control Chart BasicsLesson 1398 — Control Charts for Means (X-bar Charts)
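A rough sketch of 3-sigma control limits, assuming the limits are estimated from an in-control baseline using the sample mean and sample standard deviation. Real Shewhart charts often estimate sigma differently (for example from moving ranges or subgroup ranges), and the data below is hypothetical.

```python
import statistics

def control_limits(baseline, n_sigma=3):
    """Return (lower_limit, center_line, upper_limit) from baseline data."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation
    return center - n_sigma * sigma, center, center + n_sigma * sigma

baseline = [10, 11, 9, 10, 12, 10, 9, 11]  # hypothetical in-control measurements
lcl, center, ucl = control_limits(baseline)

# New observations outside the limits signal a possible out-of-control process.
new_points = [10, 14, 9]
out_of_control = [x for x in new_points if x < lcl or x > ucl]
```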
- Upper fence
- = Q3 + (1.5 × IQR)
- Lesson 72 — IQR Method and Tukey's FencesLesson 1385 — Calculating IQR Fences in Practice
- Upper threshold (B)
- Based on acceptable Type I error (α, false positive rate)
- Lesson 1511 — Sequential Probability Ratio Test (SPRT)
- Upserts (Update or Insert)
- Instead of blindly inserting records, use operations that update existing records if they're already present.
- Lesson 1848 — Designing Idempotent Operations
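One way to sketch an upsert is with SQLite's `INSERT ... ON CONFLICT ... DO UPDATE`, shown here through Python's `sqlite3` module. This assumes SQLite 3.24 or later; the `users` table and `upsert_user` helper are hypothetical, and other databases use different syntax (such as `MERGE`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def upsert_user(conn, user_id, email):
    """Insert the row, or update the email if the id already exists."""
    conn.execute(
        "INSERT INTO users (id, email) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
        (user_id, email),
    )

upsert_user(conn, 1, "a@example.com")
upsert_user(conn, 1, "a@example.com")    # rerun: no duplicate row, same state
upsert_user(conn, 1, "new@example.com")  # existing row is updated in place

rows = conn.execute("SELECT id, email FROM users").fetchall()
```

Because rerunning the same call leaves the table unchanged, a pipeline step built this way can be retried safely.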
- Upstream
- `clean_data` and `extract_raw_data` (direct and transitive)
- Lesson 1841 — Upstream and Downstream Dependencies
- Upstream dependencies
- are the tasks that must run *before* your current task.
- Lesson 1841 — Upstream and Downstream Dependencies
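To make the direct versus transitive distinction concrete, here is a small sketch that walks a dependency map. The task `train_model` and the `depends_on` structure are hypothetical; `clean_data` and `extract_raw_data` echo the example above.

```python
def upstream(task, depends_on):
    """Return every task that must run before `task`, direct or transitive."""
    seen = set()
    stack = list(depends_on.get(task, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            # Follow the dependencies of each dependency.
            stack.extend(depends_on.get(current, []))
    return seen

# Maps each task to its direct upstream dependencies (hypothetical pipeline).
depends_on = {
    "extract_raw_data": [],
    "clean_data": ["extract_raw_data"],
    "train_model": ["clean_data"],
}

all_upstream = upstream("train_model", depends_on)
```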
- Upward (positive) trend
- Values generally increase over time (e.
- Lesson 706 — Trend: Long-Term Direction
- Upward or downward slope
- Warning—variance is changing systematically as fitted values increase
- Lesson 560 — Scale-Location Plot (Spread-Location Plot)
- Usage
- How to run scripts, notebooks, or generate reports
- Lesson 2077 — The Purpose and Anatomy of a Good README
- Use a random mechanism
- to select your sample (random number generator, lottery-style draw)
- Lesson 234 — Simple Random Sampling
- Use active voice
- "We tested three models" beats "Three models were tested"
- Lesson 1967 — Writing Clear and Concise Analysis Sections
- Use additive when
- Lesson 766 — Additive vs Multiplicative Seasonality
- Use asymptotic p-values when
- Lesson 322 — Exact vs Asymptotic P-Values
- Use binomial logic
- Under H₀, positive and negative signs are equally likely (p = 0.5)
- Lesson 391 — The Sign Test for Medians
- Use Binomial when
- Lesson 146 — When to Use Poisson vs Other Distributions
- Use blocking first
- rerandomization works best *after* applying stratification—it fine-tunes balance within strata
- Lesson 1492 — Rerandomization and Practical Implementation
- Use case
- Three or more related groups (repeated measures)
- Lesson 474 — Friedman Test: Non-Parametric Repeated Measures ANOVALesson 1437 — Randomization Mechanisms
- Use CASE when
- You need inline conditional logic for 3-10 possible outcomes within a query.
- Lesson 1037 — CASE Best Practices and Performance
- Use charset detection libraries
- that analyze byte patterns to suggest likely encodings
- Lesson 1135 — Detecting and Fixing Encoding Issues
- Use colorblind-friendly palettes
- Tools like ColorBrewer, Viridis, and palette simulators help you test combinations.
- Lesson 1248 — Color Blindness and Color Palette Design
- Use concrete examples
- Instead of explaining regularization abstractly, say "prevents the model from memorizing noise in the training data"
- Lesson 2105 — Translating Between Technical and Business Language
- Use concrete units
- Always include what you're measuring ("dollars," "pounds," "hours")
- Lesson 530 — Communicating Results to Non-Technical Audiences
- Use consistent formatting
- APA, IEEE, or your organization's standard
- Lesson 1972 — Citations and References in Data Science Reports
- Use descriptive, hierarchical patterns
- Lesson 2073 — Naming Conventions for Files and Functions
- Use exact p-values when
- Lesson 322 — Exact vs Asymptotic P-Values
- Use Exact Versions
- Lesson 2046 — Best Practices for Environment Management in Teams
- Use explicit JOIN syntax
- with `ON` clauses instead of comma-separated table lists
- Lesson 955 — Avoiding Cartesian Products
- Use Fisher's Exact Test
- as an alternative (for 2×2 tables)
- Lesson 426 — Assumptions and Sample Size Requirements
- Use Geometric when
- Lesson 137 — Geometric vs Negative Binomial: Key DifferencesLesson 146 — When to Use Poisson vs Other Distributions
- Use informative priors when
- Lesson 1544 — Informative vs Uninformative Priors
- Use merge when
- Lesson 2014 — Understanding Git Rebase vs Merge
- Use multiple channels strategically
- Lesson 2104 — Communication Cadence and Updates
- Use multiplicative when
- Lesson 766 — Additive vs Multiplicative Seasonality
- Use Negative Binomial when
- Lesson 137 — Geometric vs Negative Binomial: Key DifferencesLesson 146 — When to Use Poisson vs Other Distributions
- Use OO interface
- for production code, complex layouts, multiple subplots, or when functions need to accept specific axes to plot on
- Lesson 1256 — Two Interfaces: pyplot vs Object-Oriented
- Use Paired t-Test when
- Lesson 375 — Paired t-Test vs Two-Sample t-Test
- Use percentiles when
- Lesson 62 — Percentiles vs Z-Scores: Complementary Position Measures
- Use plain language
- Avoid jargon like "feature importance" or "p-values."
- Lesson 1944 — Executive Summary Best Practices
- Use Poisson when
- Lesson 146 — When to Use Poisson vs Other Distributions
- Use pooled variance when
- Lesson 285 — Pooled vs Unpooled Variance Approaches
- Use pyplot
- for quick exploratory visualizations and simple single plots
- Lesson 1256 — Two Interfaces: pyplot vs Object-Oriented
- Use rank tests
- when: data are skewed, outliers present, small samples where you can't verify normality, or you care about distribution shifts beyond just means
- Lesson 397 — Power and Efficiency of Non-Parametric Tests
- Use rebase when
- Lesson 2014 — Understanding Git Rebase vs Merge
- Use Robust Methods
- Lesson 564 — What to Do When Residual Plots Show Problems
- Use sequential testing methods
- specifically designed for interim analysis (like Group Sequential Testing or Always-Valid Inference from earlier lessons)
- Lesson 1523 — Peeking at Results Early
- Use standardized coefficients when
- Lesson 528 — Standardized vs Unstandardized Coefficients
- Use stratified sampling
- When you know certain groups are underrepresented, deliberately sample more from those groups to balance things out.
- Lesson 250 — Strategies for Bias Detection and Mitigation
- Use t
- if you must *estimate* σ from your sample (using sample standard deviation s) — almost always the case
- Lesson 272 — When to Use Z vs t
- Use t-tests
- when: data are approximately normal, moderate sample sizes, you want maximum power from clean data
- Lesson 397 — Power and Efficiency of Non-Parametric Tests
- Use table aliases carefully
- ensure your `ON` clause references columns from *both* tables, not just one
- Lesson 955 — Avoiding Cartesian Products
- Use the bootstrap distribution
- to build a confidence interval (percentile method, BCa, etc.)
- Lesson 306 — Bootstrap for Non-Standard Problems
- Use the CDF
- to find the area beyond your test statistic
- Lesson 319 — Calculating P-Values from Test Statistics
- Use Two-Sample t-Test when
- Lesson 375 — Paired t-Test vs Two-Sample t-Test
- Use uninformative priors when
- Lesson 1544 — Informative vs Uninformative Priors
- Use unpooled variance when
- Lesson 285 — Pooled vs Unpooled Variance Approaches
- Use unstandardized coefficients when
- Lesson 528 — Standardized vs Unstandardized Coefficients
- Use when
- Your research question is "Is there a difference?"
- Lesson 345 — Directionality in Hypothesis TestingLesson 475 — Choosing Between Parametric and Non-Parametric TestsLesson 2026 — Merge Strategies: Merge vs Squash vs Rebase
- Use WHERE subqueries
- When filtering data, subqueries in WHERE typically outperform SELECT subqueries
- Lesson 969 — Performance Considerations for SELECT Subqueries
- Use z
- if you *know* the population standard deviation (σ) — rare in real life
- Lesson 272 — When to Use Z vs t
- Use z-scores when
- Lesson 62 — Percentiles vs Z-Scores: Complementary Position Measures
- User behavior shifts
- People interact with systems differently over time
- Lesson 15 — Deployment, Monitoring, and Iteration
- User demographics
- Age group, gender, language preference
- Lesson 1682 — Segmenting Funnels by User Attributes
- User Engagement
- Lesson 908 — Multi-Level Grouping in Business Analytics
- User experience consistency
- Randomizing by session means the same user might see different versions on different visits, creating confusion.
- Lesson 1481 — Unit of Randomization
- User identifiers
- to stitch touchpoints together into coherent journeys
- Lesson 1719 — The Customer Journey and Touchpoints
- User input matters
- (filtering by date range, region, or product category)
- Lesson 1330 — Introduction to Interactive Dashboards
- User support
- Answering questions about metrics and functionality
- Lesson 1979 — Maintenance and Sustainability Considerations
- Uses the one-sample t-test
- on the differences (simpler than two-sample methods)
- Lesson 370 — Differences as the Unit of Analysis
- Using linear regression
- where residuals should be approximately normal
- Lesson 202 — Why Test for Normality?
- Using specific columns
- Select only needed columns to reduce memory overhead
- Lesson 951 — Join Order and Performance
- Using statsmodels
- Lesson 569 — Creating Q-Q Plots: Tools in Python and R
- UTC (Coordinated Universal Time)
- is the universal baseline—think of it as the "source of truth" for time.
- Lesson 1042 — Working with Timestamps and Time Zones
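A small sketch of storing a timestamp in UTC and converting it only for display. The fixed UTC-5 offset is a simplifying assumption that ignores daylight saving time; production code would normally use a named time zone instead.

```python
from datetime import datetime, timedelta, timezone

# Store the event as a timezone-aware UTC timestamp.
event_utc = datetime(2024, 1, 15, 18, 30, tzinfo=timezone.utc)

# Convert to a local zone only when presenting it (fixed offset, no DST).
utc_minus_5 = timezone(timedelta(hours=-5))
event_local = event_utc.astimezone(utc_minus_5)

# Both objects describe the same instant, so they compare equal.
same_instant = event_local == event_utc
```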
- UTM Parameters
- are tags appended to URLs that capture campaign details.
- Lesson 1713 — Tracking Users by Channel
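As a sketch, UTM tags can be pulled out of a URL with the standard library. The URL and the `utm_params` helper are hypothetical, and the code assumes each tag appears once in the query string.

```python
from urllib.parse import parse_qs, urlsplit

def utm_params(url):
    """Return the utm_* query parameters of a URL as a dict."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; keep the first value of each UTM tag.
    return {key: values[0] for key, values in query.items() if key.startswith("utm_")}

url = (
    "https://example.com/landing"
    "?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale&ref=abc"
)
tags = utm_params(url)
```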
V
- Vague
- "Our website isn't doing well"
- Lesson 10 — Problem Definition and ScopingLesson 2093 — Translating Business Questions into Analytical QuestionsLesson 2094 — Defining Success Metrics Upfront
- Vague observation
- "Customer behavior looks different between segments."
- Lesson 1200 — Formulating Specific, Testable Hypotheses
- Vague or Undefined Parameters
- Lesson 313 — Common Pitfalls in Hypothesis Formulation
- Valid partition
- Lesson 83 — Partitions of the Sample Space
- Valid Range
- Min/max values, allowed categories, or regex patterns
- Lesson 1163 — Metadata and Data Dictionaries
- validate
- that your leading indicator actually predicts the outcome you care about.
- Lesson 1603 — Common Pitfalls in Indicator SelectionLesson 1692 — Statistical Significance and Iteration
- Validate assumptions early
- Does your preliminary analysis match stakeholder intuition?
- Lesson 2111 — Fast Feedback Loops with Stakeholders
- Validate with cross-validation
- Ensure the model generalizes to unseen data
- Lesson 633 — Practical Model Selection Strategy
- Validate with stakeholders
- Product, engineering, and analytics teams must agree on definitions.
- Lesson 1679 — Defining Funnel Steps and Events
- Validating accuracy
- Checking that values make sense—for example, ensuring ages aren't negative or dates aren't in the future.
- Lesson 12 — Data Cleaning and Preparation
- Validating updates
- Changes to foreign key values are checked against the parent table
- Lesson 1055 — What is Referential Integrity?
- Validation
- Test on held-out data or different time periods
- Lesson 1204 — From Hypothesis to Analysis Plan
- Validation becomes complex
- What metrics indicate your retrained model is "good"?
- Lesson 2128 — Data Distribution Shifts Frequently
- Validation set
- Data you use to check performance during development
- Lesson 14 — Model Evaluation and Validation
- validity
- , and **uniqueness**.
- Lesson 1863 — Data Quality DimensionsLesson 1865 — Data Quality Checks in Pipelines
- value
- of the ordering column, not physical position.
- Lesson 1015 — ROWS vs RANGE Frame SpecificationsLesson 1701 — What is Customer Segmentation?Lesson 1762 — Extended Dimensions: Veracity and Value
- Values near zero
- suggest little to no linear relationship at that lag
- Lesson 720 — The Autocorrelation Function (ACF)
- Vanity metrics
- are measurements that appear impressive at first glance—often large, growing numbers—but don't connect to actionable business outcomes or inform strategic decisions.
- Lesson 1612 — What Are Vanity Metrics?Lesson 1614 — Growth Without Retention
- Var(X) = (1-p)/p²
- Lesson 151 — Expected Value and Variance for Common Distributions
- Var(X) = r(1-p)/p²
- Lesson 136 — Expectation and Variance of the Negative Binomial
- Var(X) = λ
- (variance)
- Lesson 141 — Mean and Variance of Poisson DistributionLesson 151 — Expected Value and Variance for Common Distributions
- variability
- (how much data points differ from each other), let's start with the simplest way to measure it: **range**.
- Lesson 47 — Range: The Simplest MeasureLesson 294 — Margin of Error and Its ComponentsLesson 296 — Sample Size for Comparing Two Groups
- Variability in the data
- More spread (higher standard deviation) → larger standard error → larger margin of error.
- Lesson 271 — Margin of Error
- Variable distributions
- Shape and spread along the diagonal
- Lesson 1191 — Scatter Plot Matrices and Pairplots
- Variable Name
- The exact column name as it appears in your data
- Lesson 2064 — Creating Data Dictionaries
- Variables to exclude
- "Drop `user_id` (high cardinality, no predictive value)"
- Lesson 1212 — EDA Summary Documentation and Next Steps
- variance
- (which squares deviations) and **standard deviation** (which takes the square root of variance), MAD works directly with the actual distances.
- Lesson 52 — Mean Absolute Deviation (MAD)Lesson 54 — When to Use Each MeasureLesson 122 — Variance and Standard Deviation of Discrete Random VariablesLesson 125 — Bernoulli Mean and VarianceLesson 129 — Binomial Mean and VarianceLesson 133 — Expectation and Variance of the Geometric DistributionLesson 136 — Expectation and Variance of the Negative BinomialLesson 141 — Mean and Variance of Poisson Distribution (+6 more)
- Variance = 1/λ²
- The variance is the square of the mean (the mean is 1/λ)
- Lesson 166 — Exponential Distribution: Mean and Variance
- Variance inequality
- Two-sample t-tests are more sensitive to unequal variances when sample sizes differ between groups.
- Lesson 382 — Robustness of t-Tests to Assumption Violations
- Variance Inflation Factor (VIF)
- quantifies this problem by measuring how much the variance of a coefficient estimate is "inflated" due to correlation with other predictors.
- Lesson 582 — Variance Inflation Factor (VIF)
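A minimal sketch of the VIF calculation, using the standard definition VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the others. To stay dependency-free it assumes a model with exactly two predictors, in which case R² is simply the squared Pearson correlation between them. The data is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]  # correlated with x1 (r = 0.8)
vif = vif_two_predictors(x1, x2)
```

With more than two predictors, each R² requires a full regression of that predictor on all the others.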
- Variance inspection
- Directly compare the empirical variance and mean of your count variable across groups.
- Lesson 693 — Overdispersion in Count Data
- Variance of X
- Lesson 180 — Parameters and Moments of the Log-NormalLesson 519 — Computing β₁: The Slope Estimate
- Variety
- captures the diversity of data types and sources.
- Lesson 1760 — Defining Big Data: The Three Vs
- Vector formats
- (like PDF, SVG, EPS) store mathematical descriptions of shapes.
- Lesson 1273 — Saving Figures: Formats and Resolution
- Vectorized operations
- Modern CPUs process columns of uniform data types far faster than mixed-type rows.
- Lesson 1811 — Columnar Storage and Query Optimization
- Velocity
- describes the speed at which data arrives and must be processed.
- Lesson 1760 — Defining Big Data: The Three Vs
- Verdict
- We either reject innocence (guilty) or fail to reject it (not guilty — notice we don't say "innocent")
- Lesson 312 — Hypothesis Testing as a Legal Analogy
- verify
- using the multiplication rule: P(A and B) = P(A) × P(B)
- Lesson 106 — Common Misconceptions About IndependenceLesson 259 — Simulating Sampling DistributionsLesson 542 — Computing Fitted Values and ResidualsLesson 741 — Testing Stationarity After Transformation
- Verify balance
- across both stratification variables and other covariates
- Lesson 1489 — Stratified Randomization Fundamentals
- Verify data collection
- Is this data being captured at all?
- Lesson 2098 — Identifying Data Availability Gaps Early
- Verify independence
- review your sampling method and data collection
- Lesson 290 — Assumptions and Diagnostics for Difference Intervals
- Verify residuals
- Check that the sum of residuals equals zero (or very close)
- Lesson 522 — Implementing Least Squares from Scratch
- Verify the value
- against source data—is it a recording error?
- Lesson 1209 — Outlier Detection and Investigation
- Verify your configuration
- Lesson 1991 — Installing Git and Initial Configuration
- Version drift
- means that installing "the latest" packages today gives you a different environment than "the latest" six months ago, breaking reproducibility even when you follow the same steps.
- Lesson 2048 — The Dependency Hell Problem
- Version your data
- Track which dataset version you used
- Lesson 30 — The Reproducibility Crisis and Solutions
- Version-controlled code
- that documents every transformation
- Lesson 1981 — What Makes a Report Reproducible?
- Vertical bars
- Each bar represents the correlation at a specific lag
- Lesson 722 — ACF Plots and Interpretation
- Vertical patterns
- All cohorts struggling at the same time period (e.
- Lesson 1649 — Visualizing Cohort Data with Heatmaps
- Vertical scaling (scale-up)
- means upgrading to a more powerful single machine—more RAM, more CPU cores, faster disks.
- Lesson 1767 — Scale-Up vs Scale-Out Architectures
- View the reflog
- `git reflog` shows recent `HEAD` movements with timestamps and commit hashes
- Lesson 2021 — Recovering from Rebase Mistakes
- VIF = 5–10
- High correlation (concerning, investigate further)
- Lesson 582 — Variance Inflation Factor (VIF)
- VIF-guided removal
- Remove the predictor with highest VIF, recalculate, repeat
- Lesson 585 — Remedies: Variable Selection
- Violation examples
- Lesson 448 — Independence of Observations
- violin plot
- combines a boxplot with a smoothed density curve mirrored on both sides.
- Lesson 55 — Visualizing SpreadLesson 1268 — Box Plots and Violin PlotsLesson 1286 — Violin Plots and Distribution Shape
- Violin plots
- go further by showing the **full probability density** of the data.
- Lesson 1223 — Box Plots and Violin PlotsLesson 1268 — Box Plots and Violin Plots
- Virality Coefficient (k)
- = Invites Sent per User × Conversion Rate
- Lesson 1631 — Social Media Metrics: DAU/MAU and Content Engagement
- Viridis palettes
- are perceptually uniform and colorblind-friendly:
- Lesson 1368 — Color Scales and Palettes
- Visual check
- Plot the series after each differencing step.
- Lesson 778 — Determining Differencing Order (d)Lesson 1456 — Testing Parallel Trends
- Visual checks
- Lesson 217 — Evaluating Transformation Effectiveness
- Visual Diagnostics
- Histograms or density plots overlaying treatment and control distributions make imbalances immediately visible.
- Lesson 1491 — Covariate Balance and Diagnostics
- Visual inspection
- Plot a histogram.
- Lesson 193 — Choosing Between Distributions in PracticeLesson 734 — Why Differencing and Detrending MatterLesson 1209 — Outlier Detection and Investigation
- Visual inspection first
- Does your plot show obvious trend or changing variance?
- Lesson 718 — Interpreting Stationarity Test Results
- Visual methods
- (histograms, density plots, Q-Q plots) give you the *intuitive picture*.
- Lesson 210 — Combining Visual and Statistical MethodsLesson 377 — Testing Normality: Visual Methods
- Visual proof
- A chart that makes the trend immediately visible
- Lesson 1946 — Supporting Your Claims with Evidence
- Visual separation
- of confidence bands (non-overlapping suggests real differences)
- Lesson 817 — Comparing Multiple Survival Curves
- Visual storytelling
- Plots, dashboards, or interactive demos
- Lesson 2141 — Building a Portfolio and Personal Brand
- visualization
- and **description**.
- Lesson 817 — Comparing Multiple Survival CurvesLesson 1656 — Visualizing Retention Curves
- Visualization tools work smoothly
- Libraries like Pandas and plotting tools expect tidy structure
- Lesson 1142 — What is Tidy Data?
- Visualizations
- Use bar charts comparing segment characteristics side-by-side, box plots showing distributions of key metrics within segments, or radar charts displaying multiple dimensions simultaneously.
- Lesson 1709 — Segment Profiling and Interpretation
- Visualizations over tables
- charts speak louder than numbers
- Lesson 2091 — Stage 7: Communication and Handoff
- Visualize demographics
- Plot key characteristics of your sample against the population.
- Lesson 250 — Strategies for Bias Detection and Mitigation
- Vital Interests
- Lesson 1906 — Legal Bases for Processing Personal Data
- Volume
- refers to the sheer scale of data.
- Lesson 1760 — Defining Big Data: The Three VsLesson 2086 — Stage 2: Data Acquisition and Assessment
- Voluntariness
- Lesson 1913 — Elements of Valid Consent
- Voluntary churn
- happens when customers actively choose to leave.
- Lesson 1670 — What is Churn and Why It Matters
- Volunteer bias
- (also called **self-selection bias**) occurs when people choose whether or not to participate in a study, and those who volunteer differ in important ways from those who don't.
- Lesson 246 — Volunteer and Self-Selection Bias
- VP and above
- Org-wide vision, resource allocation, executive influence
- Lesson 2140 — Individual Contributor vs Management Tracks
- Vulnerability
- Over-reliance on power users means losing a few hurts badly
- Lesson 1698 — Power User Curves and Engagement Distribution
W
- W-shaped attribution model
- recognizes that not all touchpoints are equally important.
- Lesson 1730 — W-Shaped Attribution Model
- WAIC
- (Widely Applicable Information Criterion) or **LOO** (Leave-One-Out cross-validation) to compare them:
- Lesson 1596 — Posterior Predictive Checks and Model Comparison
- Wald tests
- with **z-statistics** (because we're using maximum likelihood estimation, not least squares).
- Lesson 683 — Hypothesis Tests for Individual Coefficients
- Warning/Email
- Elevated error rate, slower performance, approaching thresholds—investigate during business hours
- Lesson 1858 — Alerting Strategies
- Warranty planning
- Understanding failure patterns helps set optimal warranty periods
- Lesson 188 — Weibull Distribution: Hazard Function and Reliability
- Wasted computational resources
- on variables that don't add value
- Lesson 1197 — Identifying Variable Importance and Redundancy
- Wasted effort
- Including redundant features adds complexity without improving predictions
- Lesson 513 — Applications: Feature Selection and Multicollinearity
- Wasted resources
- on flawed approaches
- Lesson 34 — Recognizing Boundaries of CompetenceLesson 1518 — The Relationship Between Surrogate and Business Metrics
- Wasted space
- Many columns contain `NULL` for half the rows
- Lesson 1148 — Handling Multiple Types in One Table
- Watch for hesitation
- If someone pauses, squints, or re-reads labels, you've found friction.
- Lesson 1964 — Testing Visualizations with Audiences
- Watch Time
- (or Listen Time): Total hours users spend consuming content.
- Lesson 1635 — Media and Content Metrics: Watch Time and Content Performance
- Weak or no relationships
- appear as values near 0, suggesting variables are independent of each other.
- Lesson 511 — Reading and Interpreting Correlation Matrices
- Weakly Informative Prior
- Use `Beta(2, 20)` or similar if you expect roughly 10% conversion but aren't certain.
- Lesson 1581 — Setting Priors for A/B Tests
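To see what `Beta(2, 20)` implies, the sketch below computes its mean and shows how the prior shifts after some data arrives. The helper names and the test counts (30 conversions from 200 visitors) are hypothetical, chosen only for illustration; the update rule is the standard conjugate Beta-Binomial one.

```python
def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def update_beta(a, b, conversions, visitors):
    """Conjugate Beta-Binomial update: add successes to a, failures to b."""
    return a + conversions, b + (visitors - conversions)


prior_a, prior_b = 2, 20                     # the Beta(2, 20) prior from the entry
prior_mean = beta_mean(prior_a, prior_b)     # 2 / 22, a little over 9%

# Hypothetical test data (an assumption): 30 conversions from 200 visitors.
post_a, post_b = update_beta(prior_a, prior_b, conversions=30, visitors=200)
post_mean = beta_mean(post_a, post_b)        # 32 / 222, pulled toward the observed 15%

print(f"prior mean = {prior_mean:.3f}, posterior mean = {post_mean:.3f}")
```

Because the prior carries the weight of only 22 pseudo-observations, 200 real visitors dominate it quickly, which is the sense in which it is only weakly informative.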
- Weakly informative priors
- gently guide the analysis away from unrealistic values (like 99% conversion) without imposing strong opinions.
- Lesson 1534 — The Prior DistributionLesson 1559 — Uninformative and Weakly Informative PriorsLesson 1565 — Prior Distributions for Normal Means
- Wealth distribution
- A few people hold most wealth
- Lesson 190 — The Pareto Distribution: Heavy Tails and Power LawsLesson 191 — Pareto Principle and the 80/20 Rule
- Weaponization
- A facial recognition system built for user authentication could be repurposed for mass surveillance or stalking.
- Lesson 1920 — Anticipating Misuse of Data Products
- Web Mercator
- What you see in Google Maps and most web applications
- Lesson 1308 — Geographic Data Types and Coordinate Systems
- Web scraping
- Extracting information from web pages
- Lesson 11 — Data Collection and AcquisitionLesson 21 — APIs and Web Scraping
- Web traffic analysis
- Detecting unusual spikes beyond typical weekday/weekend patterns or holiday seasons
- Lesson 1411 — Applications and Limitations
- Website Traffic
- Your blog gets an average of 8 visits per hour.
- Lesson 144 — Poisson Applications: Arrivals and EventsLesson 190 — The Pareto Distribution: Heavy Tails and Power LawsLesson 191 — Pareto Principle and the 80/20 RuleLesson 421 — Applications: Uniform, Genetic Ratios, and DistributionsLesson 746 — Choosing Seasonal Period
- Website traffic and sales
- Marketing spend might drive both independently
- Lesson 1423 — The Third Variable ProblemLesson 1424 — Reverse Causality
- Weekly cycles
- Retail sales peak on weekends, drop on Mondays
- Lesson 707 — Seasonality: Regular Periodic PatternsLesson 1484 — Duration and Timing Considerations
- Weibull
- extends the exponential distribution by allowing failure rates to change over time (controlled by the shape parameter).
- Lesson 193 — Choosing Between Distributions in Practice
- Weight by predictive power
- use churn models or LTV correlations to guide weights
- Lesson 1699 — Engagement Scoring Systems
- Weight your data
- If you can't get a perfect sample, assign weights to underrepresented groups so they count more in your analysis—this mathematically corrects for imbalance.
- Lesson 250 — Strategies for Bias Detection and Mitigation
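One common way to do this weighting is post-stratification, where each group's weight is its population share divided by its sample share. The sketch below assumes that approach; the group names, counts, and group means are made up for illustration.

```python
def poststrat_weights(sample_counts, population_shares):
    """Weight per group = population share / sample share."""
    n = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / n)
        for group, count in sample_counts.items()
    }


# Hypothetical sample: 80 men and 20 women, drawn from a 50/50 population.
sample_counts = {"men": 80, "women": 20}
population_shares = {"men": 0.5, "women": 0.5}
group_means = {"men": 3.0, "women": 4.0}   # made-up survey scores

weights = poststrat_weights(sample_counts, population_shares)

# Unweighted mean over-represents men; the weighted mean corrects for that.
n = sum(sample_counts.values())
unweighted_mean = sum(sample_counts[g] * group_means[g] for g in sample_counts) / n
weighted_total = sum(sample_counts[g] * weights[g] * group_means[g] for g in sample_counts)
weighted_n = sum(sample_counts[g] * weights[g] for g in sample_counts)
weighted_mean = weighted_total / weighted_n

print(weights, unweighted_mean, weighted_mean)
```

Weights only correct for the characteristics you weight on; they cannot fix imbalance in variables you did not measure.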
- weighted average
- of past observations, where recent values matter more than older ones.
- Lesson 757 — Introduction to Exponential SmoothingLesson 1566 — Conjugate Normal-Normal Model
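A minimal sketch of how simple exponential smoothing turns into a weighted average, assuming the level is initialised at the first observation (one of several common choices). The function names and the sample series are illustrative only.

```python
def ses_weights(alpha, n_obs):
    """Weights SES gives the n_obs most recent observations, newest first."""
    return [alpha * (1 - alpha) ** k for k in range(n_obs)]


def ses_forecast(series, alpha):
    """One-step-ahead SES forecast; the level starts at the first value."""
    level = series[0]
    for y in series[1:]:
        # New level = alpha * latest observation + (1 - alpha) * previous level.
        level = alpha * y + (1 - alpha) * level
    return level


weights = ses_weights(alpha=0.5, n_obs=4)          # each weight is half the previous one
forecast = ses_forecast([10.0, 12.0, 11.0, 13.0], alpha=0.5)

print(weights, forecast)
```

The weights decay geometrically, so with a larger `alpha` the forecast leans harder on the most recent values.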
- Welch's ANOVA
- Handles unequal variances without requiring transformations
- Lesson 470 — When Parametric ANOVA Assumptions Fail
- Welch's t-test
- method and doesn't assume equal variances.
- Lesson 285 — Pooled vs Unpooled Variance ApproachesLesson 362 — Welch's t-Test for Unequal VariancesLesson 363 — Testing Equality of VariancesLesson 379 — The Assumption of Equal Variances (Homoscedasticity)Lesson 380 — Testing Equal Variances: Levene's and Bartlett's Tests
- what
- R-squared is and **how** to calculate it, the critical question becomes: *what does the number actually mean?*
- Lesson 533 — Interpreting R-Squared ValuesLesson 1346 — The Grammar vs Traditional PlottingLesson 1830 — Documentation and Metadata ManagementLesson 1861 — Monitoring Tools and DashboardsLesson 1948 — The Recommendation Slide: Making It ActionableLesson 2023 — Creating a Pull Request
- What automated decisions
- involve their data (if any)
- Lesson 1908 — Data Subject Access Requests (DSARs)
- what changed
- , **why you changed it**, and **what assumptions you made** at each processing step.
- Lesson 1162 — Documenting TransformationsLesson 1955 — Framing Insights in Business Language
- What data
- you hold about them (copy of all personal data)
- Lesson 1908 — Data Subject Access Requests (DSARs)
- What did you find
- State the key insight in one clear sentence.
- Lesson 1944 — Executive Summary Best Practices
- What follow-up analyses
- you'll run based on different outcomes
- Lesson 1204 — From Hypothesis to Analysis Plan
- what happened
- (facts) from **the context** (dimensions).
- Lesson 956 — Star Schema JoinsLesson 1675 — Churn Attribution and Root Cause Analysis
- What it means
- Your residuals have more extreme values (outliers) than a normal distribution would predict.
- Lesson 567 — Common Q-Q Plot Patterns: Heavy Tails and Light Tails
- What should we do
- Give the top 1–3 recommendations.
- Lesson 1944 — Executive Summary Best PracticesLesson 1955 — Framing Insights in Business Language
- What they actually test
- Whether two groups have **identical distributions**.
- Lesson 394 — Interpreting Rank-Based Tests: Medians vs Distributions
- What this means
- Hat values range from `1/n` to 1.
- Lesson 573 — Calculating and Interpreting Hat Values
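For simple linear regression with an intercept, hat values have a closed form, which makes the `1/n` to 1 range easy to check. The sketch below assumes that one-predictor case; the function name and data are illustrative.

```python
def hat_values(x):
    """Leverage for simple linear regression with an intercept:
    h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
    """
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Hypothetical predictor values; the last point sits far from the others.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
h = hat_values(x)

for xi, hi in zip(x, h):
    print(f"x = {xi:>4}: h = {hi:.2f}")
```

The point at `x = 10` gets by far the highest leverage, the point at the mean of `x` gets exactly `1/n`, and the hat values sum to 2, the number of fitted coefficients (intercept and slope).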
- What would constitute evidence
- for or against your hypothesis
- Lesson 1204 — From Hypothesis to Analysis Plan
- Number of messages sent — directly measures the utility users get from communication.
- Lesson 1606 — Examples of North Star Metrics by Industry
- when
- it happened.
- Lesson 19 — Temporal Data and Time SeriesLesson 838 — Subscription and Membership Duration ModelingLesson 840 — Loan Default Timing and Credit RiskLesson 841 — Campaign Response Time AnalysisLesson 1111 — Autocommit Mode vs Explicit TransactionsLesson 1162 — Documenting TransformationsLesson 1850 — Retry StrategiesLesson 1948 — The Recommendation Slide: Making It Actionable
- When differences emerge
- (curves may start together then diverge)
- Lesson 817 — Comparing Multiple Survival Curves
- When duplicates are meaningful
- Combining sales records, event logs, or time-series data where each row represents a distinct occurrence
- Lesson 1000 — UNION ALL: Preserving Duplicates
- When satisfied
- Your β₀ and β₁ estimates are **unbiased**—on average, they hit the true population values.
- Lesson 552 — Zero Conditional Mean of Errors
- When to pin exactly
- Lesson 2050 — Pinning Versions vs Flexible Ranges
- When to shift focus
- Once flattened, optimize retention earlier in the curve rather than fighting churn at the tail
- Lesson 1658 — Flattening and Asymptotic Behavior
- When to use
- Simple tabular data, easy human readability, compatibility with almost any tool.
- Lesson 22 — File Formats: CSV, JSON, and BeyondLesson 453 — Transformations to Meet AssumptionsLesson 1645 — Types of Cohorts: Acquisition vs BehavioralLesson 1828 — Incremental vs Full Load Strategies
- When to use it
- Lesson 44 — Geometric and Harmonic Means
- When to use ranges
- Lesson 2050 — Pinning Versions vs Flexible Ranges
- When to use which
- Report eta-squared for descriptive purposes with your current sample; use omega-squared when making inferences about population-level effects.
- Lesson 445 — Effect Size: Eta-Squared and Omega-Squared
- When violated
- Predictions are systematically wrong at certain X ranges; coefficient estimates are misleading.
- Lesson 552 — Zero Conditional Mean of Errors
- When you reject H₀
- Lesson 356 — Making Decisions and Stating Conclusions
- where
- you are along the X-axis.
- Lesson 659 — Interpreting Polynomial Regression CoefficientsLesson 896 — GROUP BY Execution OrderLesson 898 — HAVING Clause FundamentalsLesson 899 — HAVING vs WHERE: Key DifferencesLesson 903 — Combining WHERE and HAVINGLesson 912 — Fundamental Difference: Filter TimingLesson 1908 — Data Subject Access Requests (DSARs)Lesson 2137 — Refactoring Strategies and Debt Paydown
- WHERE filters first
- It eliminates individual rows from the raw table before any grouping or aggregation happens
- Lesson 915 — Combining WHERE and HAVING
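The filter order can be checked with a small in-memory SQLite database. The `sales` table, its columns, and its rows are hypothetical; the sketch uses Python's built-in `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL, status TEXT);
    INSERT INTO sales VALUES
        ('north', 100, 'paid'), ('north', 50, 'refunded'),
        ('south', 30, 'paid'),  ('south', 20, 'paid'),
        ('east', 500, 'refunded');
""")

rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE status = 'paid'        -- row-level filter, applied before grouping
    GROUP BY region
    HAVING SUM(amount) >= 60     -- group-level filter, applied after aggregation
    ORDER BY region
""").fetchall()
conn.close()

print(rows)
```

`east` never forms a group because its only row is removed by `WHERE`; `south` forms a group totalling 50 and is then dropped by `HAVING`, leaving only `north`.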
- White noise
- is purely random data with no temporal structure.
- Lesson 724 — ACF Patterns for Different ProcessesLesson 786 — ACF and PACF of ResidualsLesson 799 — Fitting and Diagnosing SARIMA Models
- Who
- you've shared it with (recipients or categories)
- Lesson 1908 — Data Subject Access Requests (DSARs)Lesson 1948 — The Recommendation Slide: Making It Actionable
- Who bears the cost
- Sometimes the aggregate accuracy loss is small, but one subgroup's performance drops significantly.
- Lesson 1891 — Fairness-Accuracy Tradeoffs
- Who drives value
- Are 20% of users responsible for 70% of activity?
- Lesson 1698 — Power User Curves and Engagement Distribution
- Who might challenge this
- Peers and auditors need enough detail to validate your rigor.
- Lesson 1947 — Handling Methodology and Technical Details
- Why
- The chi-squared distribution is an *approximation* that only works well when expected counts are sufficiently large.
- Lesson 426 — Assumptions and Sample Size RequirementsLesson 1162 — Documenting TransformationsLesson 1675 — Churn Attribution and Root Cause AnalysisLesson 1830 — Documentation and Metadata ManagementLesson 1908 — Data Subject Access Requests (DSARs)Lesson 2023 — Creating a Pull RequestLesson 2137 — Refactoring Strategies and Debt Paydown
- Why it matters
- With finite populations, sampling without replacement affects probabilities as you go.
- Lesson 233 — Populations in PracticeLesson 541 — Properties of Residuals
- Why it works
- It handles multiple predictors simultaneously, quantifies each feature's impact, and produces interpretable coefficients.
- Lesson 1674 — Churn Prediction Models
- Why it's powerful
- Lesson 1079 — B-Tree Indexes: Structure and Mechanics
- Why this works
- The regression "controls for" the intermediate lags, removing their influence and revealing only the direct relationship.
- Lesson 729 — Calculating Partial Autocorrelations
- Wide confidence bands
- High uncertainty (small risk set)
- Lesson 815 — Survival Curve Plots and Interpretation
- Wide format
- spreads observations across multiple columns.
- Lesson 1144 — Common Violations: Wide vs Long FormatLesson 1145 — Pivoting Data Longer (Melt)
- Widely understood
- The standard language for discussing variability across fields
- Lesson 49 — Standard Deviation: Interpretable Spread
- Wilcoxon Signed-Rank Test
- improves on this by incorporating the **size** of differences while remaining non-parametric (no normality assumption required).
- Lesson 392 — Wilcoxon Signed-Rank Test
- Wilcoxon test
- (also called Breslow test) weights earlier time points more heavily because more subjects are at risk early on.
- Lesson 823 — Log-Rank Test vs Other Tests
- Win-back
- strategies target customers who've already churned, while **retention** strategies aim to prevent at-risk customers from leaving in the first place.
- Lesson 1676 — Win-Back and Retention Strategies
- Win-Back Candidates
- A subset of churned customers worth targeting for reactivation—perhaps they left for fixable reasons or represent high LTV potential.
- Lesson 1704 — Customer Lifecycle Stages
- Wireframe plots
- show the underlying grid structure more clearly and reduce visual clutter when you need to see through the surface or understand the data's resolution.
- Lesson 1325 — 3D Surface and Wireframe Plots
- with
- observation i included: ŷᵢ
- Lesson 576 — DFFITS: Influence on Fitted ValuesLesson 990 — Basic CTE Syntax and Structure
- With adjustment
- Include age in your regression model.
- Lesson 1431 — Controlling for Confounders: Adjustment
- With partitions
- Lesson 1007 — ROW_NUMBER(): Assigning Unique Row Numbers
- Within Groups
- (or "Error"): Variation due to random differences within groups
- Lesson 444 — The ANOVA Table
- Within-Group Variability
- Lesson 446 — Power and Sample Size for ANOVA
- Within-group variance (denominator)
- Measures the average variability within each group (pooled across all groups)
- Lesson 440 — The F-Statistic and Its Distribution
- without
- observation i: ŷᵢ₍ᵢ₎
- Lesson 576 — DFFITS: Influence on Fitted ValuesLesson 825 — What is the Cox Proportional Hazards Model?
- Without adjustment
- Exercise appears negatively associated with blood pressure, but is that real or just because older people do both less?
- Lesson 1431 — Controlling for Confounders: Adjustment
- Without manipulation
- means doing so honestly, proportionally, and with full context—not weaponizing emotion to bypass critical thinking or hide inconvenient truths.
- Lesson 1941 — Emotional Connection Without Manipulation
- Without partitions
- Lesson 1007 — ROW_NUMBER(): Assigning Unique Row Numbers
- Word frequency
- A few words appear constantly; most are rare
- Lesson 190 — The Pareto Distribution: Heavy Tails and Power Laws
- Work with ordinal data
- (survey ratings like "good, better, best")
- Lesson 486 — Spearman's Rank Correlation Coefficient
- Working Directory
- Your desk where you're actively working on documents
- Lesson 1993 — The Three States: Working Directory, Staging, Repository
- Working sessions
- Bi-weekly meetings to review preliminary findings and get rapid feedback
- Lesson 2104 — Communication Cadence and Updates
- Working with date ranges
- Lesson 1040 — Date Arithmetic and INTERVAL Operations
- Worst-case scenarios
- Can you survive the potential losses?
- Lesson 152 — Decision Making Under Uncertainty
- Write complexity
- Every time underlying data changes, you must update the aggregate
- Lesson 1073 — Storing Computed Values and AggregatesLesson 1075 — Handling Data Consistency in Denormalized Schemas
- Write operation cost
- How much slower are inserts and updates?
- Lesson 1077 — Measuring Performance Impact of Denormalization
- Wrong
- This ignores the base rate.
- Lesson 110 — Base Rate FallacyLesson 313 — Common Pitfalls in Hypothesis FormulationLesson 1103 — The Dangers of String Formatting in SQL
- Wrong Coefficient Signs
- Lesson 581 — Symptoms of Multicollinearity
- Wrong data type
- Applying `AVG()` to non-numeric columns causes errors
- Lesson 884 — AVG: Computing Averages
- Wrong period
- Seasonal spikes get flagged as false positives, or real anomalies blend into "normal" variation
- Lesson 1409 — Setting Detection Parameters
- Wrong summary
- Using mean when data has outliers (better: median)
- Lesson 1245 — Misleading Aggregations and Binning
X
- x̄
- (x-bar): Sample mean — the average of *your specific sample*
- Lesson 232 — Notation ConventionsLesson 269 — Confidence Interval Formula for One MeanLesson 353 — Calculating the t-StatisticLesson 520 — Computing β₀: The Intercept EstimateLesson 1391 — The Grubbs' Test Statistic
- X → Y
- (causal path) and **Z → X → Y** plus **Z → Y** (a confounder creating a backdoor path **X ← Z → Y**), controlling for **Z** blocks the backdoor while preserving the causal arrow.
- Lesson 1472 — The Backdoor Criterion
- X-axis
- Time periods since the initial event (Day 0, Day 7, Day 30, etc.)
- Lesson 1653 — What are Retention Curves?
- X₁, X₂, ..., X
- Your predictor variables (independent variables)
- Lesson 596 — The Multiple Regression Equation
Y
- Y value
- given its X value—it doesn't follow the pattern of the other data points.
- Lesson 587 — Identifying Outliers in Regression Context
- Y-axis
- Percentage of the original cohort still active (0-100%)
- Lesson 1653 — What are Retention Curves?
- Y-units per X-unit
- , and this determines how you communicate your findings.
- Lesson 525 — Units and Scale in Interpretation
- YAML
- Human-readable, great for hierarchical settings
- Lesson 2072 — Configuration Files vs Hard-Coded Values
- YAML header
- Metadata at the top specifying output format, title, author, and date
- Lesson 1983 — R Markdown for Dynamic Reports
- Yearly cycles
- Ice cream sales peak every summer, heating costs rise every winter
- Lesson 707 — Seasonality: Regular Periodic Patterns
- You
- stay focused on what actually matters to your audience
- Lesson 1942 — The Pyramid Principle: Starting with the Conclusion
- You can say
- "Given our data and prior beliefs, there's a 95% probability the true conversion rate is between 45% and 74%."
- Lesson 1562 — Credible Intervals for ProportionsLesson 1578 — Interpreting Credible Intervals
- You CANNOT say
- Lesson 1578 — Interpreting Credible Intervals
- You have limited data
- Conjugacy helps when likelihood is weak
- Lesson 1556 — Choosing Between Conjugate and Non-Conjugate Priors
- You miss real effects
- – Even if your new feature genuinely improves conversion by 2%, your test might conclude "no significant difference" simply because you didn't collect enough data.
- Lesson 1529 — Running Underpowered Tests
- You stay in control
- of the narrative instead of improvising
- Lesson 1949 — Anticipating Questions: Building in Appendices
- You use NOT IN
- with nullable columns (can miss results and run slowly)
- Lesson 966 — Performance Considerations for WHERE Subqueries
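The "can miss results" part is easy to reproduce. In the sketch below (hypothetical `customers` and `orders` tables, Python's built-in `sqlite3` module), a single `NULL` in the subquery makes `NOT IN` return nothing, while `NOT EXISTS` returns the expected rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (customer_id INTEGER);   -- nullable on purpose
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO orders VALUES (1), (NULL);
""")

# Customers with no orders, written with NOT IN.
not_in_rows = conn.execute("""
    SELECT id FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders)
    ORDER BY id
""").fetchall()

# The same question written with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT id FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY id
""").fetchall()
conn.close()

print("NOT IN:    ", not_in_rows)
print("NOT EXISTS:", not_exists_rows)
```

With a `NULL` in the list, `id NOT IN (...)` evaluates to unknown rather than true for every non-matching customer, so those rows are silently filtered out.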
- You want stable variance
- Some transformations stabilize variance across different data ranges, meeting another key assumption
- Lesson 211 — Why Transform Data to Normality?
- You waste resources
- – Your engineering team built the feature, you split traffic for weeks, analyzed results.
- Lesson 1529 — Running Underpowered Tests
- You're building linear models
- Transforming the response variable can improve model fit and prediction accuracy
- Lesson 211 — Why Transform Data to Normality?
- You're doing exploratory work
- The interactive nature of Pandas in Jupyter makes rapid iteration easier than Spark's batch-oriented workflows.
- Lesson 1787 — When to Optimize Pandas Instead
- You're exploring
- Early-stage analysis where perfect precision isn't critical
- Lesson 1556 — Choosing Between Conjugate and Non-Conjugate Priors
- Your data is categorical/binary
- Each observation falls into one of two categories (success/failure, yes/no, clicked/didn't click)
- Lesson 399 — When to Use the One-Sample Z-Test for Proportions
- Your data is skewed
- Income data, reaction times, or count data often pile up on one side
- Lesson 211 — Why Transform Data to Normality?
- Your operations are vectorized
- Pandas, built on NumPy, excels at vectorized operations.
- Lesson 1787 — When to Optimize Pandas Instead
- Your outcome is categorical
- predicting "yes/no" or categories requires logistic regression or classification methods instead
- Lesson 555 — When Regression Is and Isn't Appropriate
- Your own experience
- Previous projects or work in adjacent fields
- Lesson 1201 — Domain Knowledge as a Hypothesis Source
- Your sample size (n)
- Larger datasets have different thresholds
- Lesson 1392 — Critical Values and Significance Testing
- Your significance level α
- (and whether your test is one-tailed or two-tailed)
- Lesson 355 — Finding Critical Values and P-Values
Z
- Z → Y
- (third variable influences Y)
- Lesson 1423 — The Third Variable ProblemLesson 1472 — The Backdoor Criterion
- Z_α/2
- = critical value for your confidence level
- Lesson 296 — Sample Size for Comparing Two GroupsLesson 1497 — Sample Size Formulas for ProportionsLesson 1498 — Sample Size Formulas for Continuous Metrics
- Z_β
- = critical value for your desired power
- Lesson 296 — Sample Size for Comparing Two GroupsLesson 1497 — Sample Size Formulas for ProportionsLesson 1498 — Sample Size Formulas for Continuous Metrics
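As a sketch of how `Z_α/2` and `Z_β` enter a sample size calculation, the code below uses the standard normal-approximation formula for comparing two group means with equal variances, `n = 2 (Z_α/2 + Z_β)² σ² / δ²` per group. That formula is an assumption here and may differ from the exact form used in the lessons; `sigma` and `delta` are illustrative inputs.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Per-group sample size for detecting a mean difference `delta`
    with a two-sided test, using the normal approximation."""
    z_alpha_2 = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)              # about 0.84 for 80% power
    return ceil(2 * (z_alpha_2 + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical inputs: standard deviation 1.0, smallest effect worth detecting 0.5.
n = n_per_group(sigma=1.0, delta=0.5)
print(n)
```

Raising the desired power or tightening `alpha` increases both critical values, and the required sample size grows with the square of their sum.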
- z-score
- (or standard score) tells you how many standard deviations a data point is away from the mean.
- Lesson 195 — Z-Score Definition and InterpretationLesson 199 — Finding Percentiles with Z-ScoresLesson 200 — Comparing Values Across Different DistributionsLesson 1376 — What is the Z-Score Method?Lesson 1389 — What is Grubbs' Test?
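A minimal sketch of the calculation, using the population standard deviation from Python's `statistics` module (use `stdev` instead if you treat the data as a sample). The data values are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """z = (value - mean) / standard deviation, for every value."""
    mu = mean(values)
    sigma = pstdev(values)   # population standard deviation
    return [(v - mu) / sigma for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population standard deviation 2
scores = z_scores(data)

for value, z in zip(data, scores):
    print(f"{value}: z = {z:+.1f}")
```

Here the value 9 sits two standard deviations above the mean, and the value 2 sits one and a half below it.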
- z-score method
- uses this to flag outliers: if a data point is *too many* standard deviations away from the mean, it's probably an outlier.
- Lesson 71 — Z-Score Method for Outlier DetectionLesson 1386 — IQR Method vs Z-Score: When to Use Each
- Z-scores
- (which you'll learn to calculate soon) tell you how many standard deviations away from the mean you are.
- Lesson 62 — Percentiles vs Z-Scores: Complementary Position MeasuresLesson 1209 — Outlier Detection and Investigation
- z-statistic
- for proportions.
- Lesson 402 — Calculating the Test Statistic for ProportionsLesson 683 — Hypothesis Tests for Individual Coefficients
- Z-table
- (also called a standard normal table) is a reference chart that shows cumulative probabilities for the standard normal distribution.
- Lesson 198 — Using Z-Tables for Probability
- Z-test
- Used when you have large samples or known population variance
- Lesson 1749 — Measuring Statistical Significance
- z-test statistic
- .
- Lesson 409 — Z-Test Statistic for Two ProportionsLesson 410 — P-Value Calculation and Interpretation
- Z-tests
- for proportions determine if selection rate differences are statistically meaningful
- Lesson 1890 — Measuring Disparate Impact
- Zero
- = normal distribution
- Lesson 67 — Calculating KurtosisLesson 280 — Confidence Intervals for Difference in ProportionsLesson 539 — What Are Residuals?Lesson 984 — NOT EXISTS for Finding Missing Relationships
- Zero residuals
- (`e_i = 0`) mean your prediction was exactly correct (rare in practice!)
- Lesson 540 — The Residual Formula
- ZIP code
- often correlates with race and income due to historical segregation patterns
- Lesson 1883 — Protected Classes and Proxy VariablesLesson 1889 — Proxy Variables and Redlining
- Zombie users
- Automated scripts or bots inflate counts without real engagement
- Lesson 1694 — Daily Active Users (DAU) and Monthly Active Users (MAU)
- Zoom controls
- allow users to magnify regions of interest by scrolling or clicking-and-dragging, making dense visualizations navigable.
- Lesson 1303 — Range Sliders and Zoom Controls