Stats Tutorials: Descriptive Statistics

Once you have identified the structure and nature of all the variables in your dataset, you'll want to understand how the values are distributed. In Machine Learning, this is covered in the Exploratory Data Analysis (EDA) Phase (EDA Tutorial). Descriptive Statistics provide a high-level overview of how values are distributed and represented across each variable in your dataset. This tutorial will cover the following learning objectives:

What Are Descriptive Statistics?
Measure of Central Tendency
Measure of Variability

What Are Descriptive Statistics?

Summary

Descriptive Statistics are meant for summarizing the data, as opposed to drawing inferences or conclusions from the data.
Various methods can be used to visualize Descriptive Statistics such as Tables, Histograms, and Bar Charts.
There are three types of Descriptive Statistics:
1. Frequency Distribution. This is used to describe how frequently values in a particular variable occur. This can be used for both Qualitative and Quantitative data. Histograms (for Continuous variables) and Bar Charts (for Discrete data) are commonly used to visualize this form of Descriptive Statistics.
2. Measure of Central Tendency. This is used to identify the central, or most typical, value of a particular variable. For Qualitative data, this is commonly shown as the Mode, whereas for Quantitative data, it is commonly shown as either the Mean or the Median.
3. Measure of Variability. This is used to describe how spread out the values in a particular variable are. The Range and Standard Deviation are commonly used to communicate how spread out values are.
NOTE: Descriptive Statistics are arguably among the most important concepts to understand in Machine Learning, because misjudging how your values are distributed can make or break your model. This skill is often overlooked, so make sure you have a deep understanding of which statistics to use for which variables.
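
As a rough illustration of a Frequency Distribution, the sketch below counts how often each value occurs in a small Qualitative variable. The `colors` list is made-up data used only for this example, and `collections.Counter` is just one convenient way to build the counts.

```python
from collections import Counter

# Hypothetical sample data, invented for illustration only.
colors = ["red", "blue", "red", "green", "blue", "red"]

# Count how often each distinct value occurs in the variable.
frequency = Counter(colors)

# Print a simple frequency table, most frequent value first.
for value, count in frequency.most_common():
    print(f"{value}: {count}")
```

The same counts are what a Bar Chart of this variable would display, with one bar per distinct value.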

Measure of Central Tendency

Summary

The Measure of Central Tendency is a method used to evaluate the central value in a given variable.
Measure of Central Tendency is critical to understanding how values in a variable are distributed and for detecting potential outliers. It's also useful for comparing variables between datasets (e.g., samples of a population).
There are three common Measures of Central Tendency:
- Mean: Also known as the Average, this summarizes a variable with a single value that takes every value into account. This is calculated by dividing the sum of all values by the count of all values. This is used specifically for numerical data and is the most commonly used measure of central tendency.
- Median: This is the middlemost value in the variable. This is calculated by arranging all values in the variable from least to greatest and then finding the middle value (or, when the count of values is even, the average of the two middle values). This is particularly useful because outliers in the variable have little effect on the output, whereas they can shift the Mean considerably. This can only be used for data that can be put in order, which in practice usually means numerical data.
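- Mode: This is used to represent the most frequently occurring value in a variable. This is used for Discrete data, meaning it can be string categorical data or discrete numerical data (e.g., Age in whole years). Continuous measurements such as Weight or Height rarely repeat exactly, so they are usually grouped into ranges before a Mode is reported.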
NOTE: A general rule of thumb is to use the Median whenever a variable may be skewed or contain outliers, so that extreme values don't affect your decision-making process or initial analysis of your data.
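
To make the difference between the three measures concrete, here is a minimal sketch using Python's built-in `statistics` module. The `salaries` values are invented for this example, and the last one is a deliberate outlier.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is a deliberate outlier.
salaries = [40, 42, 45, 45, 50, 300]

mean_value = statistics.mean(salaries)      # sum of all values / count of all values
median_value = statistics.median(salaries)  # middle of the sorted values
mode_value = statistics.mode(salaries)      # most frequently occurring value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```

With this data the Mean is 87, which is larger than every value except the outlier, while the Median and Mode both stay at 45. This is the behavior the rule of thumb above is guarding against.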

Measure of Variability

Summary

Range is a statistical measurement used to evaluate the difference between the smallest and largest value in a particular variable. The larger the difference, the more spread out the data is.
The Interquartile Range (IQR) is used to describe the middle 50% of your data. When calculating the Median of a variable, the values are sorted from least to greatest. Once sorted, they are then broken evenly into four segments, and the cut points between those segments are known as quartiles. The IQR describes the middle two segments.
To calculate the IQR, you first find the Median (the second quartile, Q2), then find the Median of each of the two halves on either side of it. The Median of the lower half is the first quartile (Q1) and the Median of the upper half is the third quartile (Q3). The IQR is the difference between the third quartile and the first quartile (Q3 - Q1).
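
The Range and IQR steps above can be sketched as follows. The helper name `quartiles_by_halves` and the `data` values are made up for this example, and the median-of-halves method shown here is only one of several quartile conventions, so library functions that interpolate between values may return slightly different quartiles.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two values; the middle value is left out of
    both halves when the count of values is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # values above the median position
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

# Hypothetical data, invented for illustration only.
data = [1, 3, 4, 6, 8, 9, 12, 15]

value_range = max(data) - min(data)   # largest value minus smallest value
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1                         # spread of the middle 50% of the data

print("Range:", value_range)
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
```

For this data the Range is 14, the quartiles are 3.5, 7.0 and 10.5, and the IQR is 7.0.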
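Variance is a statistical measurement used to describe how spread out the values in a variable are. It is the average of the squared differences between each value and the Mean. Dividing the sum of squared differences by the count of values (N) is correct for an entire population, but it is biased when applied to a sample: because the sample Mean is estimated from the same data, the result tends to underestimate the Variance of the population. For a sample, dividing by n - 1 instead of n corrects this bias.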
Standard Deviation is used to represent the typical distance of each data point from the Mean. It is the square root of the Variance, and because it is expressed in the same units as the data, it is used much more frequently when measuring value distributions.
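
As a hedged sketch of the calculations described above, the example below computes the Variance and Standard Deviation by hand, once treating the data as a whole population (divide by n) and once as a sample (divide by n - 1). The `data` values are invented for illustration.

```python
import math
import statistics

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean_value = sum(data) / n

# Squared distance of each value from the Mean.
squared_deviations = [(x - mean_value) ** 2 for x in data]

population_variance = sum(squared_deviations) / n    # data is the whole population
sample_variance = sum(squared_deviations) / (n - 1)  # data is a sample (n - 1 divisor)

# Standard Deviation is the square root of the Variance.
population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

print("Population variance / std:", population_variance, population_std)
print("Sample variance / std:", sample_variance, sample_std)

# The standard library provides equivalent functions for both cases.
print(statistics.pvariance(data), statistics.variance(data))
```

Here the population Variance is 4.0 and the population Standard Deviation is 2.0, while the sample versions are slightly larger because of the smaller divisor.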
