The difference between Slicing and Filtering can be quite hazy. However, it never hurts to know the various ways of how to apply filters to DataFrames. This tutorial will cover the following learning objectives:
How to Filter DataFrames Using a Boolean Mask
How to Filter DataFrames Using the "query" Method
How to Filter DataFrames Using the "where" Method
How to Filter DataFrames Using a Boolean Mask
Summary
The shape Attribute is used to show the dimensions of the DataFrame (how many rows and how many columns). This is used with the following syntax: dataframe.shape
A Boolean Mask is used to represent a condition that is used across the entire DataFrame. This is used with the following syntax: dataframe[dataframe['column_name'] [condition]]
Example: sales[sales['month'] == 'January']
Boolean Masks can also be used to apply multiple conditions. This is used with the following syntax: dataframe[(condition1) [Bitwise Operator] (condition2)]
When applying multiple conditions using Boolean Masks, the & (ampersand) symbol is used to state that both conditions must be met, whereas the | (pipe operator) symbol is used to state that at keast one condition must be met.
If you are writing multiple conditions and want to switch the logic on one or more conditions, use the ~ (tilde) symbol. This is used with the following syntax: dataframe[~(condition1) & (condition2)]
NOTE: Boolean Masks are by far the simplest way to apply a conditions to a DataFrame. However, when used in production, it can make your code difficult to read, so it's best to use this approach in development.
NOTE: The symbols used to either combine or separate conditions are known as Bitwise Operators. These are identical in purpose to Logical Operators, though Bitwise Operators are interpreted differently in the Python Compiler.
How to Filter DataFrames Using the "query" Method
Summary
The query method is used to filter a DataFrame by one or more conditions. This is used with the following syntax: dataframe.query("condition1")
Example: sales.query("month == 'January' & year == 2023")
A Boolean Mask is used to represent a condition that is used across the entire DataFrame. This is used with the following syntax: dataframe[dataframe['column_name'] [condition]]
Example: sales[sales['month'] == 'January']
NOTE: When specifying the set of conditions, it's a best practice to place the entire set of conditions between double quotation marks to make filtering string values possible.
NOTE: When deciding which method to use when filtering DataFrames, consider whether readability or performance is more important. It's important to remember that Pandas isn;t built for large datasets, thus readability should be a higher priority.
How to Filter DataFrames Using the "where" Method
Summary
The where method is used to filter a DataFrame by one or more conditions. This is used with the following syntax: dataframe.where([condition1])
Example: sales.where(sales['month'] == 'January')
The primary difference between the query and where methods is that the when a filter is applied using the query method, only the rows matching the conditions are returned, whereas with the where method, all the rows are returned but the rows that DON'T match the filter are shown as NaNs (Pandas version of NULL values).
NOTE: The where method is very rarely used because it tends to cause confusion among team members. Try to use either Boolean Masks or the query method whenever possible.
Exercise
Congratulations! You just completed the Filtering DataFrames Tutorial! To help test your knowledge, let's
practice applying various filters to DataFrames.
**It's highly recommended that you
complete the exercise outlined in the previous tutorial before beginning this exercise.**
NOTE: The "drivetrain", "transmission", and "location" columns have a LEADING space, thus you should include a space in front of the characters in the string.
Instructions:
Open your IDE (e.g., VS Code, PyCharm, Spyder).
Create a New Jupyter Notebook, or similar Notebook Environment. Name it "filters.ipynb"