PANDAS TUTORIALS

Pandas Tutorials: DataFrame Methods

[Enter image here]

DataFrames are the heart and soul of Pandas. They take large amounts of both Structured and Unstructured Data from various sources to make transforming, merging, and exporting data much easier. This tutorial will cover the following learning objectives:

  • What Attributes Are Associated with DataFrames
  • How to Extract Data from a DataFrame
  • How to Query Data in a DataFrame
  • What Common DataFrame Methods Are Available

What Attributes Are Associated with DataFrames




Summary

  • Since DataFrames are objects within a class, they inherit certain Attributes. Attributes are descriptive calls to the object that return specific information about the object.
  • The Columns attribute is used to return all the columns in the specified DataFrame. This is used with the following syntax:
    dataframe.columns
  • The Columns attribute can also be used to rename the columns in the specified DataFrame. Example:
    dataframe.columns = ['name', 'age', 'gender']
  • The Index attribute is used to return all the index values of the specified DataFrane. This is used with the following syntax:
    dataframe.index
  • The Index attribute can also be used to change the index values of the specified DataFrame. Example:
    dataframe.index = ['2020', '2021', '2022', '2023']
  • NOTE: You can specify different index values, though the length of the index must match the number of values in the Series, thus it can get quite tedious.

How to Extract Data from a DataFrame




Summary

  • With DataFrames, Slicing is the process of getting either a subset of columns or a subset of rows. To output a single column as a Series, use the following syntax:
    dataframe['column_name']
  • A common exercise in Pandas is to create what's known as a Calculated Field. A Calculated Field is created by either creating an arithmetic equation between two or more columns OR to add, multiply, subtract, or divide a column by a static value. Example:
    dataframe['profit'] = dataframe['revenue'] - dataframe['expenses']
    OR
    dataframe['salary_doubled'] = dataframe['salary'] * 2
  • If you want to extract data from more than one column in a DataFrame, use the syntax below. This will return a DataFrame object.
    dataframe[['column_name1', 'column_name2']]

How to Filter Data from a DataFrame




Summary

  • The Loc method is used to query a DataFrame in a SQL-like format. This is used with the following syntax:
    dataframe.loc['row_filter', 'column_filter']

    SQL Equivalent: SELECT 'column_filter' FROM dataframe WHERE 'row_filter'

    Example:
    dataframe.loc[dataframe['revenue']>= 10000, ['month', 'year', 'revenue', 'expenses']]
  • The Loc method can also be used to filter rows by their corresponding Index value. Example:
    dataframe.loc[3, ['month', 'year', 'revenue']
  • The Iloc method is similar in to the Loc method, though the Iloc method is used to query data using only the row and column indices. This is useful for when you want to get a subset of data but don't know the column names or row values. This is used with the following syntax:
    dataframe.iloc['row_indices', 'column_indices']

    Example:
    dataframe.iloc[3:6, 1:5]
  • NOTE: When slicing values in Python, use the following notation: starting_value:ending_value. If you want to slice values in reverse, use the following notation: starting_value:ending_value:-1

What Common DataFrame Methods Are Available




Summary:

  • The Copy method is used to create an exact copy of the specified DataFrame. This is commonly used when testing transformations. This is used with the following syntax:
    dataframe2 = dataframe1.copy()
  • The astype method is used to convert the data type of a specified Series in a DataFrame. This is used with the following syntax:
    dataframe1['column_name'] = dataframe['column_name'].astype([new_data_type])
  • The Transpose/T method, or attribute, is used to switch the column and index values. This is used with the following syntax:
    dataframe1.transpose()
    OR
    dataframe1.T
  • The Head method will give you the first "n" number of rows in the DataFrame. If no number is specified, the default is 5. This is used with the following syntax:
    dataframe1.head([n])
  • The Tail method will give you the last "n" number of rows in the DataFrame. Just like the Head method, if no number is specified, the default is 5. This is used with the following syntax:
    dataframe1.tail([n])
  • The Sample method will return a random selection of "n" number of rows in the DataFrame. If no number is specified, the default is 1. This is used with the following syntax:
    dataframe1.sample([n])
  • The Info method returns a summary of your DataFrame, this includes the number of values in the Index, the total number of columns, a list of all the columns, how many NULL values occur in each column, and the Data Type of each column. This is used with the following syntax:
    dataframe1.info()
  • The Describe method returns descriptive statistics of your DataFrame columns. These include a count of all values, the number of unique values, the mean, standard deviation, minimum, maximum, 25th percentile, 50th percentile, and 75th percentile of all values in each column. This is used with the following syntax:
    dataframe1.describe()
  • NOTE: In most cases, your DataFrame will have more than 10 columns, thus making it near impossible to view all the column statistics when using the Describe method. To fix this, you can use the Transpose Method or the T attribute to make the descriptive statistics the new column headers, and the column names as the new row values. Example:
    dataframe1.describe().T

Exercise

Congratulations! You just completed the DataFrame Methods Tutorial! To help test your knowledge, let's practice doing some basic queries on DataFrames.
**It's highly recommended that you complete the exercise outlined in the previous tutorial before beginning this exercise.**

Instructions:

  1. Open your IDE (e.g., VS Code, PyCharm, Spyder).
  2. Use the Same Notbeook that you created in the previous exercise.
  3. In the Notebook, complete the following tasks:
    1. Show the first 10 rows in the "player_stats" DataFrame.
    2. Show the last 8 rows in the "player_stats" DataFrame.
    3. Show only the "ppg" and "rpg" columns for the Team's Center (use the Index).
    4. Add a New Column to the DataFrame titled "player_name" and give it the following values: ['Donnovan Mitchell', 'Darius Garland', 'Jarrett Allen', 'Evan Mobley', 'Isaac Okoro', 'Ricky Rubio']
    5. Show a summary of the DataFrame.
    6. Show the DataFrame's Descriptive Statistics. Transpose the DataFrame to make it more readable.
  4. Exercise Completed! Click here to view the answers.
  5. Have any issues with the above exercise? Post your question on Discord!