PANDAS TUTORIALS

Pandas Tutorials: Filtering DataFrames

[Enter image here]

The difference between Slicing and Filtering can be quite hazy. However, it never hurts to know the various ways of how to apply filters to DataFrames. This tutorial will cover the following learning objectives:

  • How to Filter DataFrames Using a Boolean Mask
  • How to Filter DataFrames Using the "query" Method
  • How to Filter DataFrames Using the "where" Method

How to Filter DataFrames Using a Boolean Mask




Summary

  • The shape Attribute is used to show the dimensions of the DataFrame (how many rows and how many columns). This is used with the following syntax:
    dataframe.shape
  • A Boolean Mask is used to represent a condition that is used across the entire DataFrame. This is used with the following syntax:
    dataframe[dataframe['column_name'] [condition]]

    Example:
    sales[sales['month'] == 'January']
  • Boolean Masks can also be used to apply multiple conditions. This is used with the following syntax:
    dataframe[(condition1) [Bitwise Operator] (condition2)]

    Example:
    sales[(sales['month'] == 'January') & (sales['year'] == 2023)]
  • When applying multiple conditions using Boolean Masks, the & (ampersand) symbol is used to state that both conditions must be met, whereas the | (pipe operator) symbol is used to state that at keast one condition must be met.
  • If you are writing multiple conditions and want to switch the logic on one or more conditions, use the ~ (tilde) symbol. This is used with the following syntax:
    dataframe[~(condition1) & (condition2)]

    Example:
    sales(~(sales['month'] == 'January') & (sales['year'] == 2023))
  • NOTE: Boolean Masks are by far the simplest way to apply a conditions to a DataFrame. However, when used in production, it can make your code difficult to read, so it's best to use this approach in development.
  • NOTE: The symbols used to either combine or separate conditions are known as Bitwise Operators. These are identical in purpose to Logical Operators, though Bitwise Operators are interpreted differently in the Python Compiler.

How to Filter DataFrames Using the "query" Method




Summary

  • The query method is used to filter a DataFrame by one or more conditions. This is used with the following syntax:
    dataframe.query("condition1")

    Example:
    sales.query("month == 'January' & year == 2023")
  • A Boolean Mask is used to represent a condition that is used across the entire DataFrame. This is used with the following syntax:
    dataframe[dataframe['column_name'] [condition]]

    Example:
    sales[sales['month'] == 'January']
  • NOTE: When specifying the set of conditions, it's a best practice to place the entire set of conditions between double quotation marks to make filtering string values possible.
  • NOTE: When deciding which method to use when filtering DataFrames, consider whether readability or performance is more important. It's important to remember that Pandas isn;t built for large datasets, thus readability should be a higher priority.

How to Filter DataFrames Using the "where" Method




Summary

  • The where method is used to filter a DataFrame by one or more conditions. This is used with the following syntax:
    dataframe.where([condition1])

    Example:
    sales.where(sales['month'] == 'January')
  • The primary difference between the query and where methods is that the when a filter is applied using the query method, only the rows matching the conditions are returned, whereas with the where method, all the rows are returned but the rows that DON'T match the filter are shown as NaNs (Pandas version of NULL values).
  • NOTE: The where method is very rarely used because it tends to cause confusion among team members. Try to use either Boolean Masks or the query method whenever possible.

Exercise

Congratulations! You just completed the Filtering DataFrames Tutorial! To help test your knowledge, let's practice applying various filters to DataFrames.
**It's highly recommended that you complete the exercise outlined in the previous tutorial before beginning this exercise.**

NOTE: The "drivetrain", "transmission", and "location" columns have a LEADING space, thus you should include a space in front of the characters in the string.

Instructions:

  1. Open your IDE (e.g., VS Code, PyCharm, Spyder).
  2. Create a New Jupyter Notebook, or similar Notebook Environment. Name it "filters.ipynb"
  3. In the Notebook, complete the following tasks:
    1. Download the Following Files:
    2. Read the CSV File into a DataFrame object named "listings"
    3. Use a Boolean Mask to select all rows in the DataFrame that have an Automatic Transmission
    4. Use a Boolean Mask to select all rows in the DataFrame that have a FWD drivetrain or a Manual Transmission.
    5. Use the "query" method to select all rows that have a location of "Chicago, IL".
    6. Use the "query" method to select all rows that don't have a Manual Transmission.
  4. Exercise Completed! Click here to view the answers.
  5. Have any issues with the above exercise? Post your question on Discord!