The next technique related to Data Cleaning is Working with Missing Values. Missing Values can cause a huge problem when creating linear models or conducting analysis on a dataset. Pandas provides methods to either fill/replace missing valeus or remove them altogether. This tutorial will cover the following learning objectives:
How to Fill Missing Values Using the "fillna" Method
How to Remove Missing Values Using the "dropna" Method
Predicting Missing Values Using the "interpolate" Method
How to Fill Missing Values Using the "fillna" Method
Summary
The fillna method is used to fill/replace missing values in the entire DataFrame with a specified value. This is used with the following syntax: dataframe.fillna([value])
You can also use the method parameter to specify how to fill in missing values. The ffill (Forward Fill) method is used to replace the missing value with the value of the row above. The bfill (Backward Fill) method is used to replace the missing value with the value of the row below. This is used with the following syntax: dataframe.fillna(method='ffill')
You can also mix and match the "ffill" and "bfill" methods by using the limit parameter. This allows you to specify a number of NaN values to be filled using a specific method. Example: inventory.fillna(method='ffill', limit=3, inplace=True) inventory.fillna(method='bfill', limit=10, inplace=True)
If you want to fill in missing values on a per column basis, you can use a dictionary to specify which columns will get which values. This is used with the following syntax: dataframe.fillna({'column1':[value], 'column2':[value]})
The isna method can be applied to a standalone Series or a Series in a DataFrame to check if there are any values in the Series that are missing. This is used with the following syntax: dataframe['column_name'].isna() OR series.isna()
NOTE: As mentioned in the video, you can also fill in missing values with an equation (e.g., the current mean of the column). We will learn how to apply aggregations in the next tutorial.
NOTE: If you're importing a CSV File, you can auto fill Missing Values by using the "na_values" parameter. This is used with the following syntax: pd.read_csv('file_name.csv', na_values='character_to_fill')
How to Remove Missing Values Using the "dropna" Method
Summary
The dropna method is used to remove missing values. By default, this method drops all columns that contain one or more missing values. This is used with the following syntax: dataframe.dropna()
If you want to drop all rows that have ONLY missing values, use the following syntax: dataframe.dropna(how='all')
The thresh parameter is used to specify how many missing values are allowed in each row. If a row exceeds this threshold, it will be dropped. This is used with the following syntax: dataframe.dropna(thresh=[integer])
The subset parameter allows you to drop rows that have missing values in a subset of columns. This is used with the following syntax: dataframe.dropna(subset=['column1', 'column2'])
NOTE: When building a Linear model, removing missing values is critical. Also, if you're building any type of Supervised Learning model, the Label must not contain any missing values.
How to Predict Missing Values Using the "interpolate" Method
Summary
Interpolation is a Statistical process in which a specific vaue is derived from existing data points. The interpolate methods predicts the value of a missing value based on the data points in the column. This is used with the following syntax: dataframe.interpolate()
If you want to only interpolate the missing values for a specific column, use the following syntax: dataframe['column_name'].interpolate()
The Method parameter is used to specify how missing values are interpolated. By default, the linear method is used, though many other are available including time, quadratic, and polynomial. This is used with the following syntax: dataframe.interpolate(method='method')
Interpolation is a technique that is widely used with Time-Series analytics. It's not recommended to use this technique with linear regression, as it could skew the model's accuracy.
Exercise
Congratulations! You just completed the Working with Missing Values Tutorial! To help test your knowledge, let's
practice Replacing, Dropping, and Interpolating Missing Values.
**It's highly recommended that you
complete the exercise outlined in the previous tutorial before beginning this exercise.**
Instructions:
Open your IDE (e.g., VS Code, PyCharm, Spyder).
Create a New Jupyter Notebook, or similar Notebook Environment. Name it "missing-values.ipynb"
Read the Parquet File into a DataFrame object named "listings"
Filter the "listings" DataFrame to only include listings that are missing a value in the "hwy_mpg" column. Assign the filtered DataFrame to a new variable called "temp"
In the "temp" DataFrame, replace all the Missing Values in the "hwy_mpg" column with the value of the "city_mpg" multiplied by 1.15. Make the changes inplace.
In the "temp" DataFrame, drop all values that have a Missing Value in the "price" column. Make the changes inplace.
In the "listings" DataFrame, interpolate the Missing Values in the "hwy_mpg" column.
Exercise Completed! Click here to view the answers.
Have any issues with the above exercise? Post your question on Discord!