#### Plain-language band-aids to fix gaps in your statistics knowledge

Got some gaps in your statistics knowledge? Or perhaps you’re here for tips on how to communicate data science concepts to beginners? Let me do my best to point you towards some to-the-point explanations!

***Note:** Whenever there’s a link, it usually takes you to another article where I’ve explained a foundational concept instead of repeating it here. If there’s no link, I haven’t written the article yet — let me know in the comments if you’re curious about one of them in particular or if you’d like to see a missing term added in. The list is arranged alphabetically. Terms with entries are **bolded**. Enjoy!*

**Alternative Hypothesis (H1)**

A description of all possible states of the world where you would not want to be taking your **default action**. (Example here.)

**Analytics**

Analytics is a subdiscipline of **data science** that is often confused with **statistics**.

Analytics is all about finding good questions, while statistics is all about finding good answers.

The key difference is that analytics is concerned primarily with what’s *in* your **data** while **statistics** is concerned with what’s *beyond* your **data**. Learn more here.

**Artificial Intelligence (AI)**

A term that used to mean something else, but these days it’s often used as a casual synonym for machine learning (ML).

“If it’s written in Python, it’s probably machine learning. If it’s written in PowerPoint, it’s probably AI.” — Mat Velloso

For the difference between ML/AI and **statistics**, see the entry on **machine learning.**

**Assumptions**

Assumptions are ugly band-aids we put over the parts where information is missing. If we knew *all* the facts (and we *knew* that our facts were actually true facts), we wouldn’t need assumptions (or statisticians). Unfortunately, when we have partial information — when we want to know about a whole **population** but we only observe data from a **sample** — the only way we can make the leap beyond our paltry information is to make assumptions.

STATISTICAL INFERENCE = DATA + ASSUMPTIONS

Yup, **statistics** isn’t magic, it’s the art of making assumptions veeeeeery carefully (and mathematically) to make conclusions beyond our **data**.

**Bar Chart**

A way to visualize counts. Like **distributions**, you can think of **bar charts** (used for **categorical data**) and **histograms** (used for **continuous data**) in terms of popularity contests. Or tip jars. That works too.

**Bayesian Statistics**

A statistical school of thought that deals with mathematical models of belief. You start with a mathematical description of what you believe and then (via **Bayes’ rule**) discover what you reasonably ought to believe after adding some **data**. The results are highly personal because they’re about *reasonably* updating subjective models of belief — different starting beliefs (called “**priors**”) should give different results (called “**posteriors**”). In Bayesian **statistics**, **parameters** have **probabilities** attached to them, which is heinous sacrilege as far as your typical (**frequentist**) STAT101 class is concerned. The question of whether or not a **parameter** should have a **probability distribution** is what the Bayesian vs Frequentist controversy is all about.

**Bayes’ Rule**

A formula that helps you go from the **probability** of checking Twitter when your code is compiling to the **probability** that your code is compiling when you are checking Twitter. Bayes’ Rule is the mathematical underpinning of **Bayesian statistics**.
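To make the Twitter example concrete, here is a minimal sketch; every probability below is an invented number for illustration, not a claim from this article:

```python
# Bayes' Rule: P(compiling | twitter) = P(twitter | compiling) * P(compiling) / P(twitter)
# All numbers below are made up for illustration.
p_compiling = 0.2                 # prior: your code is compiling 20% of the time
p_twitter_given_compiling = 0.9   # you check Twitter 90% of the time it's compiling
p_twitter_given_idle = 0.3        # ...and 30% of the time it isn't

# Total probability of checking Twitter (the denominator in Bayes' Rule):
p_twitter = (p_twitter_given_compiling * p_compiling
             + p_twitter_given_idle * (1 - p_compiling))

p_compiling_given_twitter = p_twitter_given_compiling * p_compiling / p_twitter
print(round(p_compiling_given_twitter, 3))  # 0.429
```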

**Bias**

Statistical bias occurs when results are consistently off the mark. But that’s not the only definition of bias, so dive deeper here: **selection bias**, algorithmic bias, and other kinds of bias.

**Binary Data**

**Data** that can take values in two categories, e.g. (Yes, No). When you’re dealing with more than two categories, it’s called **multiclass data**.

**Binomial Distribution**

The **distribution** that describes the **probability** of a particular number of successes out of a bunch of attempts. Found in the context of **binary data**.
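As a sketch, the binomial probability formula fits in a few lines of standard-library Python (`math.comb` counts the ways to choose which attempts succeed):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent attempts, each with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 7 heads in 10 fair coin flips:
print(round(binomial_pmf(7, 10, 0.5), 4))  # 0.1172
```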

**Captured Data**

Captured data are intentionally created for a specific analytical purpose, while **exhaust data** are byproducts of digital/online activity.

**Causation**

The propensity of one variable to *cause* another variable to change.

**Categorical Data**

**Data** that takes category values, e.g. Orange, Apple, Mango. **Binary data** is a type of categorical data.

**CDF**

See **cumulative distribution function**.

**Central Limit Theorem (CLT)**

The CLT is a handy rule that says that if we compute averages or sums from lots of data, those sums/averages will be **normally distributed**. Learn more here.
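You can watch the CLT at work in a few lines of Python. The raw draws below come from a flat (uniform) distribution, yet their averages pile up in a bell shape around 0.5:

```python
import random
import statistics

random.seed(0)
# 10,000 averages, each computed from 50 draws of a decidedly non-normal (uniform) distribution:
averages = [statistics.mean(random.random() for _ in range(50)) for _ in range(10_000)]

# The CLT predicts the averages are roughly normal, with mean 0.5
# and standard deviation sqrt(1/12) / sqrt(50), about 0.041.
print(round(statistics.mean(averages), 3))
print(round(statistics.stdev(averages), 3))
```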

**Chi-squared Test**

Something you may want to learn more about if you work a lot with **categorical data**. It’s the classic quick check to see if two categorical **variables** are independent, for example to ask a question like, *“Is the distribution of favorite musical genres the same across all college majors?”* (Or, more morbidly, many STAT101 professors introduce it as, *“Did surviving the Titanic depend on how fancy your ticket was?”* Because we statisticians are a grim bunch.)

**Classical Statistics**

Synonym for **Frequentist Statistics.**

**Cluster Sampling**

An approach to collecting **data** that involves deciding on sections/clusters (e.g. schools), randomly selecting several clusters from the collection, then collecting **observations** on all of the units (e.g. students) in those clusters.

**Combinatorics**

The branch of mathematics for counting things, like the number of ways you can seat your wedding guests without offending anyone (0). Here’s my primer on the subject, which will help you understand, among other things, why your combination lock is actually a permutation lock.

**Confidence Interval**

A concept from **frequentist statistics** with a tricky definition. There’s no way around the fact that it’s tricky, so be careful when you interpret confidence intervals (and don’t confuse them with **credible intervals**). A 95% confidence interval means: *“Were this procedure to be repeated on infinite samples, the calculated confidence interval (which would differ for each sample) would encompass the true population **parameter** 95% of the time.”*
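That definition is easiest to believe after simulating it. This sketch (invented population, z ≈ 1.96 for 95%) repeats the procedure many times and counts how often the interval captures the true mean:

```python
import random
import statistics

random.seed(1)
TRUE_MEAN, SD, N, Z = 10.0, 2.0, 40, 1.96  # z is approximately 1.96 for a 95% interval
trials, hits = 2000, 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    center = statistics.mean(sample)
    half_width = Z * statistics.stdev(sample) / N ** 0.5
    if center - half_width <= TRUE_MEAN <= center + half_width:
        hits += 1
print(hits / trials)  # close to 0.95, as the definition promises
```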

**Continuous Data**

**Data** that is obtained by measuring, not counting. Examples: 176.5 cm (my height), 12% (free space on my phone), 3.141592… (pi), -40.00 (where Celsius meets Fahrenheit), etc.

**Convenience Sample**

A **sample** that is nonrandom but the **observations** were convenient to make, e.g. when you put a booth in the airport terminal and ask people walking by to take a survey about air travel… what could possibly go wrong?

**Correlation**

The propensity of two **variables** to look like they’re moving together. Learn more here.

**Count Data**

Data that can only take non-negative integer values. Obtained by counting things.

**Credible Interval**

The Bayesian cousin of the **confidence interval**. It has the easy interpretation you *wish* a **confidence interval** had. A 95% credible interval is interpreted as *“I believe that the **parameter** lives between here and here with 95% **probability**.”*

**Cumulative Distribution Function (CDF)**

A mathematical formula describing the **probability** of observing a value at or below a given value of a **random variable**. See **distribution**.

**Data**

Stuff someone recorded in electronic form. Or, for the slightly more reverent explanation, read this.

**Data Mining**

A synonym for **analytics**. The act of finding patterns in your **data** in order to form **hypotheses** or generate ideas.

**Data Science**

Data science is the discipline of making **data** useful. Its three subdisciplines are called **statistics**, **machine learning**, and **analytics**. To learn more about the differences between these three areas and how they fit into data science, read this.

**Data Point**

Synonym for **observation** and instance.

**Dataset**

Synonym for **sample** (collection of **data**).

**Data Types**

Exactly what it sounds like — convenient descriptions of various kinds of data you’d encounter in the wild. Many STAT101 classes kick off their first lesson with data types, so if you’re keen to recreate that experience, head over to this article for a guided tour.

**Default Action**

A physical action/decision that you commit to doing if you don’t gather any (more) evidence. This is a frequently-overlooked yet super-important concept; you can’t get started with **classical statistics** without it! (Example here.)

**Dependent Variable**

The **variable** (usually Y in our models) we want to predict using some other ones (usually Xs in our **models**, which are our **independent variables**).

**Discrete Data**

**Data** that is obtained by counting, not measuring. Examples: 1 short story, 6 words, 2 baby shoes, 0 times worn, etc.

**Distribution**

Think of this as a “**histogram**” of your **population** **data**. The concept is abstract, since we usually can’t observe the **population**.

**Dummy Variable**

Synonym for **indicator variable**.

**Empirical Rule**

If your data are **normally distributed**, (68%)-(95%)-(virtually all) will be found within 1–2–3 **standard deviations** of the **mean**.
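A quick simulation check of that 68–95–99.7 shorthand, using standard-normal draws:

```python
import random

random.seed(2)
data = [random.gauss(0, 1) for _ in range(100_000)]  # mean 0, standard deviation 1

for k in (1, 2, 3):
    share = sum(abs(x) <= k for x in data) / len(data)
    print(f"within {k} sd: {share:.3f}")  # roughly 0.683, 0.954, 0.997
```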

**Exhaust Data**

**Captured data** are intentionally created for a specific analytical purpose, while exhaust data are byproducts of digital/online activity. **Exhaust data** usually come about when websites store activity logs for purposes — such as debugging or data hoarding — other than specific analyses.

**Error**

The difference between what we observed in our **data** and what we predicted with our **model**. Another term for error is “residual.” In **simple linear regression**, we assume the errors are **normally distributed**. In a method called GLM, we’re allowed to have more creative assumptions about the errors.

**Estimate, Estimator, and Estimand**

An *estimate* is just a fancy word for a **best guess** about the true value of a **parameter** (the *estimand*). It’s the value your guess takes, while an *estimator* is the formula you use for arriving at that number.

**Expected Value**

A (**probability**-weighted) average spoken about in the context of **distributions**. Synonyms include: expectation, expected value, mean.

**Experiment**

A scientific procedure undertaken to test a **hypothesis** involving **causal** relationships. Core characteristics are randomization into groups and manipulation of those groups by the experimenter (different groups are assigned different “treatments”). Experiments allow you to make causal statements about how two things are related (i.e. in order to be able to say that a medication *causes* an improvement in disease progression, you need to run a well-designed randomized experiment). Until you have clear, measurable, quantitative null and alternative hypothesis statements, as well as a plan for how you will do different things to different parts of the universe at random, what you are about to do is *not* an experiment.

**Exploratory Data Analysis (EDA)**

The act of using some fraction of your **data** for the purpose of generating ideas, forming **hypotheses**, and discovering potentially useful inputs for **machine learning**. All of these inspiring nuggets must be tested on separate data before they can be taken seriously (otherwise you’re cheating).

**Frequentist Statistics**

The kind of approach you tend to see in STAT 101, based on the long-run frequencies you’d see if procedures were to be repeated infinitely many times. Unlike **Bayesian statistics**, you’ll never see the words “belief” or “**prior**” when using these methods. In frequentist statistics, **parameters** never have **probabilities** attached to them — this video will help you make sense of this with a coin toss and personality test.

**Gauss-Markov Assumptions**

Technical **assumptions** you must make in order to use standard **linear regression**. They translate roughly as, *“Assume it’s **normal** and well-behaved.”* (Now you get the joke on my shirt below.) See **normal distribution** and **scedasticity** to learn more.

**Generalized Linear Model (GLM)**

Generalized linear models (GLMs) extend **regression** to situations where the distribution of the **errors** is not **normal**. You may want to learn more about this if you are trying to predict a **categorical response** (e.g. click/no click).

**Histogram**

A plot describing the frequency with which things occur in your **data**. The **categories** (or intervals of values) are on the horizontal axis and the height of the bars gives the relative number of times a particular category has occurred in your data. See also: **bar chart** and **distribution**.

**Hypothesis**

A description of how reality might work. H0 stands for **null hypothesis** (all the worlds in which you’d want to take your **default action**), H1 stands for **alternative hypothesis** (all the worlds in which you wouldn’t). (Example here.)

**Hypothesis Testing**

The game of trying to see if your **data** convinces you that your **null hypothesis** is ridiculous and thus that you should stop doing your **default action**. (Example here.)

**Independent Variable**

The **variable** (usually X in our **models**) we want to use to predict another one (usually Y in our **models**).

**Indicator Variable**

A **variable** that takes the value 1 if a condition is met, 0 otherwise. For example, I might record your pet ownership as Cat=1 if you’re owned by a cat and Cat=0 if no cat has claimed you yet. Devs and **ML** folk call the use of indicator variables **one-hot encoding**.
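Here is a minimal sketch of one-hot encoding done by hand (`one_hot` is a hypothetical helper written for this example, not a library function):

```python
def one_hot(values, categories):
    """Turn a categorical column into indicator (0/1) columns, one per category."""
    return [[1 if value == category else 0 for category in categories] for value in values]

pets = ["Cat", "Dog", "Cat"]
print(one_hot(pets, ["Cat", "Dog"]))  # [[1, 0], [0, 1], [1, 0]]
```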

**Inherited Data**

Inherited (secondary) data are those you obtain from someone else, so you had no control over how the measurements were recorded and stored. The opposite is **primary data**. Here’s my guide to working with inherited data.

**Kurtosis**

Kurtosis is a way to describe the chubbiness of a **distribution’s** tail.

**Logistic Regression**

Often a good **model** to use when the **response variable** (Y) is **binary**.

**Long-Tailed Distribution**

An asymmetric (skewed) **distribution** that features extremely large or extremely small values which are relatively infrequent.

**Machine Learning (ML)**

A discipline that’s related to **statistics**, but has a different focus: automation. **Statistics** cares about rigor, inference, and coming to the right conclusion, whereas **machine learning** cares about performance and turning patterns in **data** into recipes that get the job done. Statistics is extremely important in the testing step of an applied machine learning project, since that’s when it’s time to find out whether the prototype actually works.

**Mean**

An average. Used in the context of talking about **samples** (“sample mean”) or **populations** (“population mean”). This is one of the important **moments** (descriptors of the shape of a **distribution**), which is why you see it talked about so often.

**Median**

The middle thing. Arrange your data from smallest to largest and grab the one in the middle — that’s the median. The median is robust to **outliers** while the **mean** is not.

**Mode**

Mode is pronounced *“the most common value.”* The mode corresponds to the spot where a **distribution/histogram** has its peak. When you hear that a distribution is *multimodal*, it means there’s more than one peak. When a distribution is *symmetric and unimodal*, like the pretty little **bell-shaped curve**, the mode also happens to be the **mean**. If you want to be technically correct, you’d stop saying *“the average Joe”* when you actually mean *“the modal Joe.”*
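The mean/median/mode distinctions, including the median's robustness to **outliers**, fit in a few lines of the Python standard library (the salary numbers are invented):

```python
import statistics

salaries = [40, 45, 50, 50, 55, 1000]  # made-up salaries in $k, with one extreme outlier

print(statistics.mean(salaries))    # 206.67ish: dragged way up by the outlier
print(statistics.median(salaries))  # 50.0: barely notices the outlier
print(statistics.mode(salaries))    # 50: the most common value
```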

**Model**

Depending on context, either a fancy word for recipe or a description of how a system might work. For example, here’s a straight line model: Salary = intercept + slope × Years_Of_Experience + error

**Moments**

Refers to certain numerical summaries of the shape of a distribution. The average of your data points is called the first moment, the average of squares of your data points is the second moment and so on.

**Multiclass Data**

**Categorical data** that can take values in more than two categories, e.g. (Cat, Dog, Parrot, Goldfish, Anteater). When you’re dealing with only two categories, it’s called **binary data**.

**Multiple Comparisons**

The act of testing many **hypotheses.** If you don’t make any special corrections to your methods (**multiple testing correction**) and you claim to be making statistical conclusions, you’re going on a fishing expedition in your data: you will find “something interesting” by random chance even though the results will not be real.

If you’re not doing **statistics** but instead running an **exploratory data analysis** (EDA), you’re in the clear because you know you’re not supposed to take those findings seriously. But if you’re claiming to draw statistical conclusions from this process, what you’re doing is not statistics — it’s a pantomime that violently misses the point. Just. Don’t.

Always remember to split your **data** so that you’re able to run a lean and controlled test in an unmolested dataset after you’ve fished around in a *different one* for inspiration.

**Multiple Testing Correction**

That said, if you wish to test multiple **hypotheses** the valid statistical way, you can… but there’s a price. You must make adjustments (start by reading up on *Bonferroni correction* (the simplest, strictest, and most data-expensive option) and then progress to its cousins), otherwise the results that you think are “**statistically significant**” will turn out to be embarrassingly fake.

And that price is pretty steep — you’ll need much more **data** to get the same quality of result. Don’t test multiple **hypotheses** unless you’ve got a really good reason for it.
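A sketch of the Bonferroni correction, the simplest adjustment mentioned above (the p-values are invented):

```python
def bonferroni_threshold(alpha, num_tests):
    """Test each hypothesis at alpha / m so the overall chance of any false positive stays <= alpha."""
    return alpha / num_tests

p_values = [0.003, 0.02, 0.04]  # made-up results from three hypothesis tests
threshold = bonferroni_threshold(0.05, len(p_values))
print([p <= threshold for p in p_values])  # [True, False, False]
```

Note how the middle p-value (0.02) would have cleared an uncorrected 0.05 bar but fails the corrected one. That stricter bar is the "price" in data terms.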

**Multiple Regression**

Like **simple linear regression**, except now you’re allowed to use more than one **predictor**, e.g. using both Experience and Education to predict Salary.

**Multivariate Data**

When your measurements are too complicated for a single number, e.g. the set of measurements a tailor needs in order to create a good custom suit for you. Data that’s not multivariate is called **univariate** (fits in a single number, e.g. your height).

**Multivariate Regression**

Like **multiple regression**, except now you’re predicting a **response** that’s **multivariate** — your Y is a vector now, not a scalar.

**Nonparametric Statistics**

Methods which do not require you to make **assumptions** about which **distribution** your data come from.

**Nonresponse Bias**

Happens when your targeted participants omit their responses. For example, you set up a booth to ask people how their day is going and many of the people who are having a rubbish day scowl at you instead of taking your survey. As a result, the **data** you record are too cheerful thanks to nonresponse **bias**.

**Normal Distribution**

The symmetric, bell-shaped distribution found often in nature and wherever we see sums/averages. See **Central Limit Theorem**.

**Null Hypothesis (H0)**

All possible worlds in which you’re happy to take your **default action**. (Example here.)

**Observation**

A single item in the **sample**.

**One-Hot Encoding**

The use of **indicator variables**.

**Outliers**

Unusual data points or data points which are unlikely to have been generated by the process responsible for the bulk of the data. What should you do with them? It depends…

**P-value**

The probability of obtaining a **sample** at least as extreme as the one we just observed when assuming the **null hypothesis** is actually true. That’s a mouthful, so I made the video below to help you wrap your head around it.

To calculate a p-value, you need to know what the **CDF** looks like under the **null hypothesis**. The smaller your p-value, the more ridiculous your **null hypothesis** looks.
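As a concrete sketch: suppose the null hypothesis is "this coin is fair" and you observe 8 heads in 10 tosses. The (one-sided) p-value is the probability of a result at least that extreme under the null:

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(at least k successes in n attempts), assuming success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# One-sided p-value for 8 or more heads in 10 tosses of a supposedly fair coin:
print(round(prob_at_least(8, 10), 4))  # 0.0547
```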

**Parameter**

A summary measure of a **population**.

**PDF**

See **probability density function**.

**Poisson Regression**

Often a good model to use when the **response variable** (Y) can only be a nonnegative integer. See also: **count data, response, GLM**.

**Population**

The collection of all items we are interested in.

**Posterior**

The belief you end up having when you add data to your **prior** (starting belief). If you see this word, you’re in **Bayesian statistics** land.

**Power**

Power is the probability of rejecting the **null hypothesis** if it is false (i.e. of changing your mind if that’s the right thing to do). Along with **significance level**, it determines the quality of your **hypothesis test**.

**Prediction Interval**

An interval giving a plausible range for the next value we might observe. A common statistics gotcha is that people think this is what the **confidence interval** does — nope, that’s the prediction interval: it’s wider than the **confidence interval**, meaning that you should be a lot less sure about where the next **observation** will land than where the **population parameter** will land. Which makes sense intuitively too — I’d be a lot more surprised to discover that the average height of all males in a city is 6’4” than to discover that the height of some random next dude I see in the grocery store is 6’4”. One of these would make a headline, the other would make a shrug.

**Predictor**

Another word for an independent (X) **variable**. Predictors are observed **data**.

**Primary Data**

You’re using **primary data** if you (or the team you’re part of) collected **observations** directly from the real world. In other words, you had control over how those measurements were recorded and stored. If you didn’t, we call that secondary **(inherited) data**.

**Prior**

A starting belief written as a **distribution**. If you see this word, you’re in **Bayesian statistics** land.

**Probability**

*P(X=4)* would be read in English as *“The probability that my die lands with the 4 facing up.”* If I’ve got a fair six-sided die, *P(X=4)*=1/6. But… but… but… what is probability and where does that 1/6 come from? Glad you asked! I’ve covered some probability basics for you here, with **combinatorics** thrown in as a bonus.

**Probability Density Function (PDF)**

A mathematical formula describing the relative **probability** of observing a particular value of a **random variable**. If the **random variable** is not **continuous**, technically this should be called a probability mass function and it should not exist at values the **random variable** can’t take.

**Q-Q Plot**

A visual testing tool for checking **distribution** **assumptions**, which compares the **data** you got with the **data** you’d tend to get from the **distribution** you’re interested in. Also, it often makes you cry, but that’s not why it’s named that.

**R-Squared**

Proportion of the variability in Y explained by X. In **simple linear regression**, this is the square of the **correlation** between X and Y. Here’s a fun trick: chances are you’re pretty bad at intuiting **correlations**, but you’re pretty good at intuiting R-squared as a performance grade… so you can use this like a magic trick in front of your friends. Try it out on guessthecorrelation.com — instead of guessing the **correlation** itself, guess R-squared by assigning a “percentage grade” (like a teacher grading a student) to how well you’d say X “captures” Y, then take the square root. (Don’t forget to insert a minus if the cloud of points is moving down and to the right.)

**Random Variable**

A random variable (R.V.) is a mathematical function that turns reality into numbers. Think of it as a rule to decide what number you should record in your dataset after a real-world event happens.

Many students confuse random variables with random variates. If you’re a casual reader, skip this, but enthusiasts take note: random variates are outcome **values** like {1, 2, 3, 4, 5, 6} while random variables are **functions** that map reality onto numbers. Little *x* versus big *X* in your textbook’s formulas.

**Raw Data**

A **dataset** that’s exactly in the form it was collected in — no cleaning or **transformation** has been done to it.

**Regression**

**Statistical** methods involving fitting linear **models** to **data**. Usually, the goal is prediction or **hypothesis testing** about **correlations**/relationships. See **simple linear regression**.

**Representative Sample**

A **sample** which accurately reflects the characteristics of the population.

**Residual**

Synonym for **error**.

**Response**

Another word for the **dependent (Y) variable**.

**Response Bias**

Happens when your targeted participants lie in their responses. To enjoy some instant response **bias**, put on an ugly hat and ask your coworkers whether you look good in it.

**Sample**

A subgroup of the **population** of interest.

**Sample Size**

The number of **observations** (**data points**) in your **sample**.

**Sampling**

The act of drawing **observations** from a **population**.

**Sampling Frame**

The list of all items from which we can draw our **sample** **observations**.

**Scedasticity** (sometimes also spelled **skedasticity**)

The ugliest possible word we could have picked for a concept that asks, “Is the **distribution** of **errors** the same everywhere?”

You’ll see this word in the context of **linear regression** 101, where we’re asking one of the diagnostic questions about whether the **Gauss-Markov assumptions** are satisfied (if they are, we can proceed with **simple linear regression**, and if they aren’t, *sad trombone*). If the scatter of **errors** around the line looks like a sausage (same width of scatter everywhere), you can say, “Whew, the errors are **homoscedastic**.” If the scatter looks more like a fan or an orchestra of trumpets or a python that has swallowed an antelope, we declare the **errors** to be **heteroscedastic** and turn to **GLMs** instead of **simple linear regression** or find a clever way to **transform** your **data**.

**Selection Bias**

Happens when your **sampling** method prefers some participants to others. For other definitions of **bias**, see this list.

**Significance Level**

The largest **probability** of **Type I error** you’re willing to tolerate. Along with **power** and **sample size**, this is a massively important knob you use to control the quality of your test.

**Simple Linear Regression**

**Regression analysis** with just one **response variable** and one **predictor**, meaning that you’re just fitting a straight line through your data.

**Simple Random Sample**

A completely random draw. In this **sampling** scheme, drawing any permutation of items from the **population** is equally likely.
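With the standard library, drawing a simple random sample is one call; `random.sample` makes every subset of the requested size equally likely:

```python
import random

random.seed(3)
population = list(range(1, 101))        # a population of 100 numbered items
sample = random.sample(population, 10)  # a simple random sample of size 10
print(sorted(sample))
```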

**Simpson’s Paradox**

An aggregation paradox where disaggregated **data** looked at separately for each group points towards conclusions diametrically opposed to what the aggregated **data** would show.
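Here is the paradox in numbers, using the classic kidney-stone treatment counts (a textbook example, not from this article): treatment A wins within each group, yet B wins in the aggregate.

```python
# (recovered, total) counts for two treatments, split by case severity.
groups = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

for severity, counts in groups.items():
    rates = {t: recovered / total for t, (recovered, total) in counts.items()}
    print(severity, {t: round(rate, 2) for t, rate in rates.items()})  # A beats B in both groups

overall = {
    t: sum(groups[g][t][0] for g in groups) / sum(groups[g][t][1] for g in groups)
    for t in ("A", "B")
}
print("overall", {t: round(rate, 2) for t, rate in overall.items()})  # ...yet B beats A overall
```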

**Skewness**

Exactly what it sounds like: a measure of the asymmetry of a **distribution**. For a handy mnemonic to help you remember which is which on positive and negative skew, read this.

**Standard Deviation**

A measure of dispersion. Tells you how far your **data** points are from their **mean**. For more info, see this article. This term is used in the context of talking about **samples** (“**sample** **standard deviation**”) or **populations** (“**population standard deviation**”). **Standard deviation** is one of the important descriptors of the shape of a **distribution**, which is why you see it talked about so often. I cover it in more detail here.

**Statistic**

A summary measure computed from a **sample**. In other words, any way of mushing up your **data**. The way we use the terms these days, **analytics** is the discipline that’s about calculating **statistics**, but **statistics** is all about going *beyond* those **data** mushups — an Icarus-like leap into the unknown (expect a big splat if you’re not careful). Learn more here about the subdisciplines of **data science**.

**Statistically Significant**

Doesn’t mean what just happened is “significant” in the eyes of the universe. This is a technical term that simply means that a **null hypothesis** was rejected.

**Statistics**

The science of changing your mind under uncertainty. For an 8min intro to the discipline, read this.

**Stratified Sampling**

Dividing your **population** into categories, e.g. statisticians and non-statisticians, and then taking a **simple random sample** of a desired size from each category (e.g. 100 statisticians and 100 non-statisticians at random).

**Structured Data**

Structured data are neatly formatted for analysis. Most of the **datasets** you’d work with in a **statistics** class are structured, whereas in the wild, there’s plenty of **unstructured data** (**data** that needs *you* to put structure on it).

**Unstructured Data**

If we’re pedantic about it, there’s no such thing as unstructured data (since by being stored, they’re necessarily forced to have some kind of structure), but let me be generous. Here’s what the definition intends to convey: **structured data** are neatly formatted for analysis, while unstructured data are not in the format you want and they force you to put your own structure onto them. (Think of images, emails, music, videos, text comments left by trolls.)

That thing you call unstructured data is just data that needs *you* to put structure on it.

**Systematic Sampling**

A systematic selection process, e.g. constructing your **sample** out of all of the IDs which are divisible by 99.

**Time Series**

A time-indexed **dataset** where sequential order matters. For example, if we’re recording your hours of sleep today and my hours of sleep today, we can write those in any order — it doesn’t matter if we write yours first or mine first. But if we’re recording your sleep over a few days, we’d be losing something if we shuffled those.

**Transformation**

Taking a **variable** (column of your dataset) and applying a function (e.g. the logarithm) to all the values in that column to make a new **variable**.

**Type I Error**

Falsely rejecting the **null hypothesis**. (Equivalent to “convicting an innocent person.”) Related to the concept of false positive, but it’s not quite the same thing (in the same way that a **prediction interval** is different from a **confidence interval**).

**Type II Error**

Incorrectly failing to reject the **null hypothesis**. (Equivalent to “failing to convict a guilty person.”) The probability of this is equal to one minus **power**. Related to false negative, but not quite the same thing (in the same way that a **prediction interval** is different from a **confidence interval**).

**Type III Error**

Correctly rejecting the wrong **null hypothesis**. In other words, using all the right math to solve the *wrong* problem!

**Undercoverage Bias**

Happens when you can’t reach your whole **population** (e.g. when your survey can only be seen by people who have computers, but you want to make a statement about the **population** of all human adults).

**Uniform Distribution**

The **probability distribution** that is shaped like a brick: all outcomes have equal probability associated with them. Someone asked me to include a photo of myself playing this distribution the way I did the **normal distribution** above, and this is the best thing I could come up with.

**Univariate Data**

When your measurements fit in a single number, e.g. your height. If it’s not univariate, it’s **multivariate, **e.g. the set of measurements a tailor needs in order to create a good custom suit for you.

**Variance**

This is the square of the **standard deviation**. Used in the context of talking about **samples** (“sample variance”) or **populations** (“population variance”). This is one of the important **moments** (descriptors of the shape of a **distribution**), which is why you see it talked about so often.

**Variable**

Casual usage: a column of your **dataset** if your **dataset** is formatted the polite way. Formal usage: see **random variable**.

**Volunteer Sample**

A **sample** where the respondents are different from the population because they opted-in (e.g. if we want to measure willingness to lend a hand and we ask people to help out by taking our survey, then we are collecting a volunteer sample and we’d expect that the respondents are more willing to help than the nonrespondents). A recipe for **nonresponse bias**.

**Z-Score**

Allows you to compare quantities measured in different scales, for example *‘Among long jumpers, Melanie’s best jump is twice as “exceptional” as Yufeng’s marathon time is among marathoners.’*

z-score = (thing – thing’s mean) / (thing’s standard deviation)
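A sketch of that formula in code, with invented jump and marathon numbers (note the sign flip for marathons, where smaller times are better):

```python
import statistics

def z_score(value, values):
    """How many standard deviations the value sits from the mean of its own scale."""
    return (value - statistics.mean(values)) / statistics.stdev(values)

long_jumps_m = [5.9, 6.1, 6.0, 6.2, 5.8, 7.0]   # made-up jumps; Melanie's best is 7.0 m
marathons_min = [240, 235, 250, 245, 238, 225]  # made-up times; Yufeng's best is 225 min

print(round(z_score(7.0, long_jumps_m), 2))    # 1.93
print(round(-z_score(225, marathons_min), 2))  # 1.61 (negated: faster is better)
```

Because both results are on the same "standard deviations from the mean" scale, the two performances become directly comparable even though meters and minutes are not.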

https://towardsdatascience.com/stats-gist-list-an-irreverent-statisticians-guide-to-jargon-be8173df090d