Pandas Tutorials: Aggregating Data

The next technique related to Data Cleaning is Aggregating Data. Aggregation consists of gathering related data and comparing it against one or more groups (e.g., average grade per student). This tutorial will cover the following learning objectives:

How to Aggregate Data Using the "groupby" Method
How to Measure Numerical Data Using Common Aggregation Methods

How to Aggregate Data Using the "groupby" Method

Summary

The groupby method is the most common method used to aggregate data using one or more columns. When aggregating data, you must specify a measure (or calculation) to compare against the groups. This is used with the following syntax:
dataframe.groupby(['group_column1', 'group_column2'])['measure_column'].[aggregate_method]()

Example:
orders.groupby(['customer_name'])['purchase_amount'].sum()
If you want to use multiple measures to compare against a single field, use the following syntax:
dataframe.groupby(['group_column1'])[['measure_column1', 'measure_column2']].[aggregate_method]()

Example:
orders.groupby(['customer_name'])[['purchase_amount', 'num_items']].sum()
If you want to compare a single measure against multiple fields, use the following syntax:
dataframe.groupby(['group_column1', 'group_column2'])['measure_column'].[aggregate_method]()

Example:
orders.groupby(['customer_name', 'purchase_year'])['purchase_amount'].sum()
The as_index parameter is used to change the output DataFrame from a MultiIndex DataFrame to an average DataFrame. This tends to only be helpful when using more than one field as a group. This is used with the following syntax:
dataframe.groupby(['group_column1', 'group_column2'], as_index=False)['measure_column'].[aggregate_method]()

Example:
orders.groupby(['customer_name', 'purchase_year'], as_index=False)['purchase_amount'].sum()
NOTE: A Multiindex DataFrame is a DataFrame that contains multiple indices in a hierarchical structure. This is commonly used when drilling down groups (such as going from year to quarter to month).

How to Measure Numerical Data Using Common Aggregation Methods

Summary

The agg method is used to specify which aggregation method(s) you'd like to include in the output DataFrame. This is used with the following syntax:
dataframe.groupby(['group_column1'])['measure_column'].agg(['aggregation_method1', 'aggregation_method2'])
The sum method is used to add together all values in the column. This can only be used with numeric columns.
The mean method is used to add together all values in the column and divide by the number of values in the column. This can only be used with numeric columns. This is also known as the average.
The count method is used to count the number of distinct values in the column. This can be used with either numeric or string columns.
The min method is used to find the smallest value in the column. This can be used with either numeric or string columns.
The max method is used to find the largest value in the column. This can be used with either numeric or string columns.
The median method is used to sort the values in the column from least to greatest and find the middle value, also known as the 50th percentile. This can only be used with numeric columns.
The std method is used to calculate the Standard Deviation of the values in the column.
NOTE: When getting the mean of a binary column (where the values are either 0 or 1), the value returned is the proportion of 1's to 0's (e.g., if the mean is 0.37, that means 37% of the values are 1's).

Exercise

Congratulations! You just completed the Aggregating Data Tutorial! To help test your knowledge, let's practice comparing measures across various combinations of groups using various aggregation methods.
**It's highly recommended that you complete the exercise outlined in the previous tutorial before beginning this exercise.**

Instructions:

Open your IDE (e.g., VS Code, PyCharm, Spyder).
Create a New Jupyter Notebook, or similar Notebook Environment. Name it "aggregating-data.ipynb"
In the Notebook, complete the following tasks:

Download the Following Files:
- Sample Parquet File
Read the Parquet File into a DataFrame object named "listings"
Retrieve the SUM of the "price" column and group by the "make" column. Assign the output DataFrame to a variable named "value_by_make"
Retrieve the MEAN of the "mileage" column and group by the "year", "make", and "model" columns. Assign the output DataFrame to a variable named "avg_mileage_by_vehicle"
Retrieve the MIN and MAX of the "city_mpg" column and group by the "transmission" and "drivetrain" columns. Assign the output DataFrame to a variable named "city_mpg_ranges". Make sure the output DataFrame doesn't have a MultiIndex.
Retrieve the MEDIAN and STANDARD DEVIATION of the "hwy_mpg" column and group by the "fuel_type" column. Assign the output DataFrame to a variable named "median_hwy_mpg_by_fuel". Make sure the output DataFrame doesn't have a MultiIndex.

Exercise Completed! Click here to view the answers.

Have any issues with the above exercise? Post your question on Discord!

Previous Topic

Next Topic