PANDAS TUTORIALS

Pandas Tutorials: Manipulating Strings

[Enter image here]

The next technique related to Data Cleaning is Manipulating Strings. There are dozens of possible ways to manipulate strings including replacing substrings, converting strings to lowercase, uppercase, or title case, and finding the length of a string. This tutorial will cover the following learning objectives:

  • How to Format Strings Using the "lower", "upper", and "title" Methods
  • How to Filter DataFrame Columns by Substrings Using the "contains", "startswith", and "endswith" Methods
  • How to Get Information From Strings Using the "len" Method
  • How to Clean Strings Using the "strip", "replace", and "split" Methods

How to Format String Using the "lower", "upper", and "title" Methods


How to Use the "lower", "upper", and "title" Methods on Raw Strings


How to Use the "lower", "upper", and "title" Methods on a Pandas Series


Summary

  • When working with a Pandas Series that has a "string" data type, you need to use Python's String accessor, str, to tell the Python interpreter to act like the Series object is a string.
  • The lower method is used to convert all characters in a specified string to lowercase. This is used with the following syntax:
    dataframe['column_name'].str.lower()
  • The upper method is used to convert all characters in a specified string to uppercase. This is used with the following syntax:
    dataframe['column_name'].str.upper()
  • The title method is used to convert the first letter of each word in a specified string to uppercase. This is used with the following syntax:
    dataframe['column_name'].str.title()
  • NOTE: Chaining is the process of combining multiple methods in a single command. This is a very efficient way of writing code in Pandas and is considered a best practice. The example below shows a Pandas Series being converted to lowercase, then uppercase, and finally title case. Example:
    cities['name'].str.lower().str.upper().str.title()

How to Filter DataFrame Columns by Substrings Using the "startswith", "endswith", "contains", and "isin" Methods


How to Use the "startswith", "endswith", and "contains" Methods


How to Use the "isin" Method


Summary

  • The startswith method is used to filter a DataFrame Column, or a Pandas Series object, by a substring that the string starts with. This is equivalent to the following SQL query: SELECT * FROM table WHERE column LIKE 'ABC%'. This is used with the following syntax:
    dataframe[dataframe['column_name'].str.startswith('substring')]
  • The endswith method is used to filter a DataFrame Column, or a Pandas Series object, by a substring that the string ends with. This is equivalent to the following SQL query: SELECT * FROM table WHERE column LIKE '%ABC'. This is used with the following syntax:
    dataframe[dataframe['column_name'].str.endswith('substring')]
  • The contains method is used to filter a DataFrame Column, or a Pandas Series object, by a substring that appears anywhere in the string. This is equivalent to the following SQL query: SELECT * FROM table WHERE column LIKE '&ABC%'. This is used with the following syntax:
    dataframe[dataframe['column_name'].str.contains('substring')]
  • The isin method is used to filter a DataFrame Column, or a Pandas Series object, by a list of values. This is NOT used with the str accessor. This is used with the following syntax:
    dataframe[dataframe['column_name'].isin(['value1', 'value2', 'value3'])]
  • NOTE: If you want to apply a filter to a DataFrame via Slicing, you can specify which columns to return by using the following syntax:
    dataframe[dataframe['column'] [filter]][['column1', 'column2']]

    Example:
    cities[cities['name'].str.endswith('City')][['name', 'population']]
  • NOTE: The unique method is used to return an array of all the unique values in a DataFrame Series. This is used with the following syntax:
    dataframe['column'].unique()
  • NOTE: When filtering by raw string values (such as with the isin Method), make sure to include any leading or trailing spaces that may be present. We will discuss how to remove these in the next section of this tutorial.

How to Get Information From Strings Using the "len" Method




Summary

  • The len method is used to retrieve the length (or number of characters) in a string. This is used with the following syntax:
    dataframe['column_name'].str.len()
  • NOTE: The len method is commonly used to identify any leading or trailing spaces. It's also commonly used for Data Quality checks to ensure strings are within a certain range of lengths.

How to Clean Strings Using the "strip", "replace", and "split" Methods


How to Use the "strip" Method


How to Use the "replace" Method


How to Use the "split" Method

Summary

  • The strip method is used to remove all whitespace (leading and trailing spaces) from a string. This is used with the following syntax:
    dataframe['column_name'].str.strip()
  • If you want to strip all leading and trailing spaces from every column header, use the following syntax:
    df.columns = df.columns.str.strip()
  • The replace method is used to replace an existing substring with a new substring. This is used with the following syntax:
    dataframe['column_name'].str.replace('old_substring', 'new_substring')
  • When working with a Pandas DataFrame, the split method is primarily used for splitting a string column into multiple columns using a delimiter (such as a space or comma). The quickest way to split a single string column into multiple columns uses the following syntax:
    datframe[['new_column1', 'new_column2']] = dataframe['string_column'].str.split(pat='delimiter', n=[num_splits], expand=True)
  • By default, the split method returns the split values as a list. This can cause some data quality issues, so the "expand" parameter is used. The expand parameter is used to convert the split values into individual Series (or columns in a DataFrame).
  • NOTE: The len method is commonly used to identify any leading or trailing spaces. It's also commonly used for Data Quality checks to ensure strings are within a certain range of lengths.

Exercise

Congratulations! You just completed the Manipulating Strings Tutorial! To help test your knowledge, let's practice Formatting, Extracting Information From, and Cleaning Strings.
**It's highly recommended that you complete the exercise outlined in the previous tutorial before beginning this exercise.**

Instructions:

  1. Open your IDE (e.g., VS Code, PyCharm, Spyder).
  2. Create a New Jupyter Notebook, or similar Notebook Environment. Name it "manipulating-strings.ipynb"
  3. In the Notebook, complete the following tasks:
    1. Download the Following Files:
    2. Read the Pickle File into a variable named "unclean_listings"
    3. Filter the "unclean_listings" DataFrame to only include the "header", "price", and "mileage" columns.
    4. Convert the "header" column to uppercase.
    5. Only include rows that have a "header" that either starts with "201" or ends with "e". Reset the index inplace.
    6. Only include rows where the "price" contains "25". Reset the index inplace.
    7. Only include rows that have a "header" less than or equal to 20 characters in length. Reset the index inplace.
    8. Remove all whitespace from the column headers.
    9. Remove all whitespace from the "header" and "price" columns.
    10. Replace the " miles" and "," substrings in the "mileage" column with empty strings.
    11. Split the "price" column into 2 partitions. Name the new columns "first_split" and "price". Drop the "first_split" column.
    12. Only include rows that have a "price" that does NOT contain the "$" character.
    13. Replace the "," character in the "price" column with an empty string.
    14. Split the "header" column into 3 partitions. Name the new columns "model_year", "make", and "model". Drop the "header" column.
    15. Reorder the columns into the following order: price, model_year, make, model, mileage
  4. Exercise Completed! Click here to view the answers.
  5. Have any issues with the above exercise? Post your question on Discord!