Stats Tutorials: Normal Distributions

Now that you have a solid grasp of how to describe a dataset using statistics, let's shift gears and learn about distributions. Many continuous variables roughly follow a Normal Distribution, meaning most values cluster around a central value and become progressively rarer the further they are from it, in a symmetric pattern. This tutorial will cover the following learning objectives:

What is a Normal Distribution?
Measure of Central Tendency
Measure of Variability

What is a Normal Distribution?

Summary

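In a Normal Distribution, the values in a given variable are spread symmetrically around a central peak, and values become less frequent the further they are from that center. At the center of the distribution curve, AKA the Bell Curve, is the Mean. In a perfectly Normal Distribution the Mean, Median, and Mode all coincide, so the Mean is also the most common value in the variable.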
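As a rough illustration of that symmetry, the sketch below draws values from a Normal Distribution using Python's standard library and checks that the Mean and Median land close together, with about half of the values below the Mean. The mean of 50, the standard deviation of 10, and the sample count are made-up numbers chosen only for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical variable: 100,000 values drawn from a Normal Distribution
# with mean 50 and standard deviation 10 (both numbers are assumptions).
values = [random.gauss(50, 10) for _ in range(100_000)]

sample_mean = statistics.fmean(values)
sample_median = statistics.median(values)

# Share of values falling below the Mean; symmetry implies roughly one half.
share_below_mean = sum(v < sample_mean for v in values) / len(values)

print(f"mean   = {sample_mean:.2f}")
print(f"median = {sample_median:.2f}")
print(f"share below mean = {share_below_mean:.3f}")
```

With real data the match is never exact, so treat a small gap between the Mean and Median as expected rather than as evidence against normality.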
In Statistics, you will almost always be working with a sample of the population. This means that rather than collecting all the data points for a given subject, you collect a portion of them. For example, if you want to measure household income across an entire country, it is rarely practical to survey every household. Instead, you take a randomized sample of the population and use it for your analysis.
Just because the sample isn't normally distributed doesn't mean the population isn't. Continuing with the household income example, if you send a survey out to random groups of people, the responses can still be biased. For instance, if the people who tend to respond to surveys have lower incomes because they have more time, the sample would not be an accurate representation of the entire population.
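The Central Limit Theorem states that as your sample size gets bigger, the distribution of the sample Mean (across repeated random samples) gets closer to a Normal Distribution, even when the population itself is not normally distributed. Separately, a larger random sample tends to resemble the distribution of the population more closely. Following this logic, when collecting data for a Machine Learning project, you should aim to collect as much randomly sampled data as is practical to gain more accurate insight.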
The Mean of a random sample will rarely be exactly the same as the Mean of the population it is drawn from, but on average it matches it, which is why it is called an unbiased estimate. With large random samples, the sample Mean will be a pretty good estimate of the true population Mean.
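The behavior of sample Means can be sketched with a small simulation. The population below is a hypothetical, right-skewed set of "household incomes" generated with `random.expovariate`, and the sample sizes and repeat count are arbitrary choices for the illustration. The point is only that the Means of larger random samples scatter less around the population Mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 right-skewed "household incomes"
# with an average of roughly 50,000 (an assumption for this example).
population = [random.expovariate(1 / 50_000) for _ in range(100_000)]
population_mean = statistics.fmean(population)


def sample_means(sample_size, repeats=300):
    """Mean of each of `repeats` simple random samples of the given size."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]


spread_by_size = {}
for size in (10, 100, 1000):
    means = sample_means(size)
    # Standard deviation of the sample Means: how far they scatter.
    spread_by_size[size] = statistics.stdev(means)
    print(
        f"n={size:>4}: average of sample means = {statistics.fmean(means):,.0f}, "
        f"spread of sample means = {spread_by_size[size]:,.0f}"
    )

print(f"population mean = {population_mean:,.0f}")
```

Even though the population here is heavily skewed, the sample Means concentrate around the population Mean as the sample size grows, which is the practical reason larger random samples give more reliable estimates.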
NOTE: Descriptive Statistics is arguably the most important concept to understand in Machine Learning, as it can make or break your model. However, this skill is often overlooked, so make sure you have a deep understanding of which Statistics to use for which variables.

Measure of Central Tendency

Summary

The Measure of Central Tendency is a method used to evaluate the central value in a given variable.
Measure of Central Tendency is critical to understanding how values in a variable are distributed and for detecting potential outliers. It's also useful for comparing variables between datasets (e.g., samples of a population).
There are three common Measures of Central Tendency:
- Mean: Also known as the Average, this summarizes the variable with a single value that takes every data point into account. It is calculated by dividing the sum of all values by the count of all values. It applies only to numerical data and is the most commonly used measure of central tendency.
- Median: This is the middlemost value in the variable. It is calculated by arranging all values from least to greatest and then finding the middle value (or the average of the two middle values when the count is even). It is particularly useful because outliers have little effect on it, whereas they can pull the Mean substantially. It requires values that can be ordered, so it works for numerical and ordinal data.
- Mode: This is the most frequently occurring value in a variable. It is used for Discrete data, meaning it can be categorical string data or discrete numerical data (e.g., Age, Weight, or Height recorded as whole numbers).
NOTE: A general rule of thumb is to prefer the Median when the data is skewed or contains outliers, so that extreme values do not affect your decision-making process or initial analysis of your data.
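The three measures can be compared on a small, made-up list of incomes (in thousands) where the last value is an outlier. This is a minimal sketch using Python's built-in `statistics` module; the numbers and the color list are invented for the example.

```python
import statistics

# Hypothetical household incomes (in thousands); the last value is an outlier.
incomes = [32, 35, 35, 38, 41, 44, 47, 950]

mean_income = statistics.mean(incomes)      # 152.75, pulled up by the outlier
median_income = statistics.median(incomes)  # 39.5, average of 38 and 41
mode_income = statistics.mode(incomes)      # 35, the most frequent value

# The same measures with the outlier removed.
trimmed = incomes[:-1]
trimmed_mean = statistics.mean(trimmed)      # about 38.86
trimmed_median = statistics.median(trimmed)  # 38

# The Mode also works for categorical (string) data.
favorite_color = statistics.mode(["red", "blue", "red", "green"])  # "red"

print(mean_income, median_income, mode_income)
print(round(trimmed_mean, 2), trimmed_median, favorite_color)
```

Removing a single value moves the Mean from 152.75 to about 38.86 while the Median only shifts from 39.5 to 38, which is the sensitivity to outliers described above.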

Measure of Variability

Summary

Range is a statistical measurement used to evaluate the difference between the smallest and largest value in a particular variable. The larger the difference, the more spread out the data is.
The Interquartile Range (IQR) is used to describe the middle 50% of your data. When calculating the Median of a variable, the values are sorted from least to greatest. Once sorted, they are then broken evenly into four segments known as quartiles. The IQR is used to describe the middle 2 quartiles.
To calculate the IQR, you first find the Median, then find the Median of each of the two halves on either side of it. These three values (the first quartile Q1, the Median Q2, and the third quartile Q3) break the data into four evenly sized groups. The IQR is the difference between the third quartile and the first quartile (Q3 - Q1).
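The Range and the median-of-halves procedure can be sketched as follows, taking the IQR as the third quartile minus the first quartile. The helper name `quartiles_by_halves` and the data are hypothetical, and leaving the overall Median out of both halves for odd counts is one convention among several, so other tools may report slightly different quartiles.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    When the count is odd, the overall Median is left out of both halves.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to split into halves")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # values below the Median
    upper = ordered[-half:]  # values above the Median
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


data = [7, 1, 3, 12, 5, 9, 20, 15]  # hypothetical values
q1, q2, q3 = quartiles_by_halves(data)

value_range = max(data) - min(data)  # 20 - 1 = 19
iqr = q3 - q1                        # 13.5 - 4.0 = 9.5

print(f"Q1={q1}, Median={q2}, Q3={q3}")
print(f"Range={value_range}, IQR={iqr}")
```

Because the IQR ignores the lowest and highest quarters of the data, it is far less sensitive to extreme values than the Range.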
Variance is a statistical measurement used to describe how spread out the data in a variable is. It is the average of the squared differences between each value and the Mean. Dividing by the number of values works for a full population, but it is biased when applied to a sample: because the sample Mean is itself estimated from the same data, this formula tends to underestimate the population Variance. The usual correction for a sample is to divide by one less than the number of values (n - 1).
Standard Deviation represents the typical distance of each data point from the Mean. It is the square root of the Variance, which puts it in the same units as the original data, and it is used much more frequently when describing how values are spread.
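As a minimal sketch, the snippet below computes Variance and Standard Deviation by hand, measuring deviations from the Mean, and compares the results with Python's `statistics` module. The data is made up; `pvariance`/`pstdev` divide by n (population) while `variance`/`stdev` divide by n - 1 (sample).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
n = len(data)
mean = sum(data) / n  # 5.0

# Sum of squared differences between each value and the Mean.
squared_deviations = sum((x - mean) ** 2 for x in data)  # 32.0

population_variance = squared_deviations / n    # 4.0
sample_variance = squared_deviations / (n - 1)  # about 4.57

population_sd = population_variance ** 0.5  # 2.0
sample_sd = sample_variance ** 0.5          # about 2.14

# The standard library provides both versions directly.
print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
print(population_sd, statistics.pstdev(data))
print(sample_sd, statistics.stdev(data))
```

Use the sample versions when the data is a sample drawn from a larger population, which is the usual situation, and the population versions only when the data covers every member of the population.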
