PANDAS TUTORIALS

Pandas Tutorials: Joining DataFrames

[Enter image here]

The next technique related to Data Cleaning is Joining DataFrames. Joining consists of combining datasets from multiple sources that have at least one column in common. This tutorial will cover the following learning objectives:

  • How to Join Multiple DataFrames Using the "merge" Method
  • How to Use Different Join Types

How to Join Multiple DataFrames Using the "merge" Method




Summary

  • The merge method is the used to combine two or more DataFrames on a shared column. This is the same as Joins in Relational Databases. This is used with the following syntax:
    pd.merge(dataframe1, dataframe2, on='shared_column')
    OR
    dataframe1.merge(dataframe2, on='shared_column')
  • NOTE: If you want to combine two DataFrames that don't have a shared column, you can conduct use the concat method which we will discuss in the next tutorial.

How to Use Different Join Types




Summary

  • An inner join follows the concepts presented in Set Theory and returns only those values that are represented in BOTH DataFrames. This is the most common type of join and one you should become very familiar with.
  • An outer join follows the concepts presented in Set Theory and returns all values in BOTH DataFrames. If a value is present in only one of the DataFrames, the columns in the other DataFrame will be NaN values.
  • The how parameter is used to specify which type of join you want to use. This is used with the following syntax:
    pd.merge(dataframe1, dataframe2, on='shared_column', how='join_type')
    OR
    dataframe1.merge(dataframe2, on='shared_column', how='join_type')
  • A left outer join is used to return all the values from the left DataFrame and only the share values in the right DataFrame. All values that are not present in the right DataFrame will be left as NaNs. This is used with the following syntax:
    pd.merge(dataframe1, dataframe2, on='shared_column', how='left')
    OR
    dataframe1.merge(dataframe2, on='shared_column', how='left')
  • A right outer join is the opposite of a left outer join. This join type is used to return all values from the right DataFrame and only the shared values in the left DataFrame. All values that are not present will be left as NaNs. This is not used very often as you can simply switch the order of DataFrames in a left outer join. This is used with the following syntax:
    pd.merge(dataframe1, dataframe2, on='shared_column', how='right')
    OR
    dataframe1.merge(dataframe2, on='shared_column', how='right')
  • The indicator parameter is used to show where each record appears (either "left_only", "right_only", or both). This is helpful when filtering a DataFrame based on unique values shared by each DataFrame. This is used with the following syntax:
    pd.merge(dataframe1, dataframe2, on='shared_column', indicator=True)
    OR
    dataframe1.merge(dataframe2, on='shared_column', indicator=True)
  • The suffixes parameter is used to differentiate duplicate columns when two DataFrames have the same column names (excluding the shared column). This is used with the following syntax:
    pd.merge(dataframe1, dataframe2, on='shared_column', suffixes=(['suffix_left_dataframe', 'suffix_right_dataframe']))
    OR
    dataframe1.merge(dataframe2, on='shared_column', suffixes=(['suffix_left_dataframe', 'suffix_right_dataframe']))
  • The left_on and right_on parameters are used when the shared column is named something else in each DataFrame (e.g., "id" column in the left DataFrame and "customer_id" in the right DataFrame). These are used with the following syntax:
    pd.merge(dataframe1, dataframe2, left_on='left_shared_column_name', right_on='right_shared_column_name')
    OR
    dataframe1.merge(dataframe2, left_on='left_shared_column_name', right_on='right_shared_column_name')
  • NOTE: If you don't specify a value in the how parameter, an "inner join" will be used by default.

Exercise

Congratulations! You just completed the Joining DataFrames Tutorial! To help test your knowledge, let's practice Joining Multiple DataFrames.
**It's highly recommended that you complete the exercise outlined in the previous tutorial before beginning this exercise.**

Instructions:

  1. Open your IDE (e.g., VS Code, PyCharm, Spyder).
  2. Create a New Jupyter Notebook, or similar Notebook Environment. Name it "joining-dataframes.ipynb"
  3. In the Notebook, complete the following tasks:
    1. Download the Following Files:
    2. Read the "Amercain Used Car Listings" CSV File into a variable named "american_listings"
    3. Read the "Non-American Used Car Listings" CSV File into a variable named "non_american_listings"
    4. Read the "Subset A" CSV File into a variable named "listings_subset_a"
    5. Read the "Subset B" CSV File into a variabled named "listings_subset_b"
    6. Conduct an Inner Join between the "listings_subset_a" and "listings_subset_b" DataFrames.
    7. Conduct an Outer Join between the "american_listings" and "non_american_listings" DataFrames. Give relevant suffixes to the duplicated columns.
    8. Conduct a Left Join between the "listings_subset_a" and "listings_subset_b" DataFrames. Show which records belong to which DataFrame(s).
  4. Exercise Completed! Click here to view the answers.
  5. Have any issues with the above exercise? Post your question on Discord!