Pandas Tutorials: Combining DataFrames

The next technique related to Data Cleaning is Combining DataFrames. If you want to bring two DataFrames together but they don't have a shared column to merge on, you can concatenate them into one DataFrame. This tutorial will cover the following learning objectives:

How to Combine Multiple DataFrames Using the "concat" Method

How to Combine Multiple DataFrames Using the "concat" Method

Summary

The concat method is used to combine two or more DataFrames by stacking their rows on top of each other. This is similar to a UNION ALL in Relational Databases: duplicate rows are kept, not removed. This is used with the following syntax:
pd.concat([dataframe1, dataframe2])
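As a minimal sketch of the basic call, the example below stacks two small DataFrames. The names `df_a` and `df_b` and their values are made up for illustration and are not the exercise files. Notice that each DataFrame keeps its original row labels, so the index repeats.

```python
import pandas as pd

# Hypothetical sample data, invented for this example
df_a = pd.DataFrame({"make": ["Ford", "Jeep"], "price": [15000, 22000]})
df_b = pd.DataFrame({"make": ["Honda", "Audi"], "price": [18000, 30000]})

# Stack the rows of df_b underneath the rows of df_a
combined = pd.concat([df_a, df_b])

print(combined)
print(list(combined.index))  # original labels are kept: [0, 1, 0, 1]
```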
The ignore_index parameter discards the original row labels and gives the combined DataFrame a fresh index that runs from 0 to n-1. This is useful for reducing the number of steps in your pipeline (e.g., not having to call reset_index after every concatenation). This is used with the following syntax:
pd.concat([dataframe1, dataframe2], ignore_index=True)
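The following sketch shows the effect of ignore_index on the same kind of toy data (again, `df_a` and `df_b` are hypothetical names and values). The combined DataFrame gets a clean index instead of repeated labels.

```python
import pandas as pd

# Hypothetical sample data, invented for this example
df_a = pd.DataFrame({"make": ["Ford", "Jeep"], "price": [15000, 22000]})
df_b = pd.DataFrame({"make": ["Honda", "Audi"], "price": [18000, 30000]})

# ignore_index=True throws away the old labels and renumbers the rows
combined = pd.concat([df_a, df_b], ignore_index=True)

print(list(combined.index))  # [0, 1, 2, 3]
```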
The keys parameter adds an outer index level, creating a MultiIndex that shows which DataFrame each group of records comes from. This is similar in purpose to the "indicator" parameter of the "merge" method. Do not use this parameter together with the "ignore_index" parameter: when ignore_index=True the index is rebuilt from scratch, so the keys are lost. This is used with the following syntax:
pd.concat([dataframe1, dataframe2], keys=['key1', 'key2'])
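Here is a small sketch of how the keys show up in the result and how they can be used to pull one source back out. The key labels "a" and "b" and the sample data are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical sample data, invented for this example
df_a = pd.DataFrame({"make": ["Ford", "Jeep"], "price": [15000, 22000]})
df_b = pd.DataFrame({"make": ["Honda", "Audi"], "price": [18000, 30000]})

# Each key labels the rows that came from the matching DataFrame
combined = pd.concat([df_a, df_b], keys=["a", "b"])

print(combined)
print(combined.index.nlevels)  # 2 (the key level plus the original labels)

# Select only the rows that came from df_b
only_b = combined.loc["b"]
print(only_b)
```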
If you want to combine two or more DataFrames that have different schemas (that is, different columns), you can place them side by side by setting axis=1. Rows are lined up by their index labels. Use the following syntax:
pd.concat([dataframe1, dataframe2], axis=1)
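As an illustration, the sketch below joins two DataFrames with different columns side by side. The names `left` and `right` and their contents are hypothetical. Both share the default index 0 and 1, so the rows line up.

```python
import pandas as pd

# Hypothetical sample data: same rows, different columns
left = pd.DataFrame({"make": ["Ford", "Honda"]})
right = pd.DataFrame({"price": [15000, 18000]})

# axis=1 places the columns of `right` next to the columns of `left`
wide = pd.concat([left, right], axis=1)

print(wide)
print(list(wide.columns))  # ['make', 'price']
```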
If you want to combine a DataFrame and a Series, use the following syntax:
pd.concat([dataframe1, series1], axis=1)
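A short sketch of the DataFrame-plus-Series case follows. The data and the Series name "mileage" are made up for this example. The name of the Series becomes the name of the new column.

```python
import pandas as pd

# Hypothetical sample data, invented for this example
listings = pd.DataFrame({"make": ["Ford", "Honda"], "price": [15000, 18000]})
mileage = pd.Series([42000, 61000], name="mileage")

# The Series is added as a new column, matched to rows by index label
with_mileage = pd.concat([listings, mileage], axis=1)

print(with_mileage)
```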
NOTE: When DataFrames are stacked row by row (the default, axis=0) and they don't have the same columns, the result contains every column. Records from a DataFrame that lacks a column are filled with NaNs in that column.
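To see the NaN filling described in the note, consider this sketch, where the second DataFrame has no "price" column. The data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: df_b is missing the "price" column
df_a = pd.DataFrame({"make": ["Ford"], "price": [15000]})
df_b = pd.DataFrame({"make": ["Honda"]})

combined = pd.concat([df_a, df_b], ignore_index=True)

print(combined)
# The Honda row has no price, so pandas fills that cell with NaN
print(combined["price"].isna().tolist())  # [False, True]
```

Because NaN is a floating-point value, the "price" column is stored as floats in the result even though the input prices were integers.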

Exercise

Congratulations! You just completed the Combining DataFrames Tutorial! To help test your knowledge, let's practice Combining Multiple DataFrames.
**It's highly recommended that you complete the exercise outlined in the previous tutorial before beginning this exercise.**

Instructions:

Open your IDE (e.g., VS Code, PyCharm, Spyder).
Create a New Jupyter Notebook, or similar Notebook Environment. Name it "joining-dataframes.ipynb"
In the Notebook, complete the following tasks:

Download the Following Files:
Read the "American Used Car Listings" CSV File into a variable named "american_listings"
Read the "Non-American Used Car Listings" CSV File into a variable named "non_american_listings"
Read the "Subset A" CSV File into a variable named "listings_subset_a"
Read the "Subset B" CSV File into a variable named "listings_subset_b"
Combine the "american_listings" and "non_american_listings" DataFrames into a single DataFrame named "all_listings"
Combine the "listings_subset_a" and "listings_subset_b" DataFrames into a single DataFrame named "combined_subset"
Combine all four DataFrames into a single DataFrame named "all_listings_duplicates"

Exercise Completed! Click here to view the answers.

Have any issues with the above exercise? Post your question on Discord!
