PANDAS TUTORIALS

Pandas Tutorials: Working with Databases

DISCLAIMER: Before starting this tutorial, you should have a basic knowledge of Relational Databases and SQL.

Text Files are great when you're getting started with Pandas or working on a small-scale Data Science project. However, once you start collecting data on a regular basis, you'll need a database. This tutorial will cover the following learning objectives:

  • How to Use SQLAlchemy with Pandas
  • How to Execute a SQL Query Against a Database with Pandas
  • How to Append a DataFrame to an Existing Database Table

How to Use SQLAlchemy with Pandas

Summary

  • The SQLAlchemy library is a popular tool for connecting Relational Databases to Python via a unified API. The most popular databases supported by this library are MySQL, PostgreSQL, and Oracle.
  • Before getting started with SQLAlchemy, you'll need to install both the library AND the driver for your database into your local environment. Run the following code to get started:
    pip install sqlalchemy

    MySQL:
    pip install pymysql

    PostgreSQL:
    pip install psycopg2
  • Once you have the correct libraries installed, you need to build a connection string and use it to create the connection engine. Run the following code to continue:
    from sqlalchemy import create_engine

    MySQL:
    engine = create_engine('mysql+pymysql://[username]:[password]@localhost:3306/[database_name]')

    PostgreSQL:
    engine = create_engine('postgresql+psycopg2://[username]:[password]@localhost:5432/[database_name]')
  • NOTE: Replace the bracketed placeholders with your own username, password, and database name. If your database isn't running locally, replace localhost (and the port, if necessary) with your server's address. A complete connection sketch follows this list.
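Putting the pieces together, here is a minimal sketch of creating a MySQL engine and confirming that it can reach the database. The credentials (test/test) and database name (ecommerce_shop) are placeholders borrowed from the exercise below; substitute your own values.

    # A minimal connection sketch. The username, password, and database
    # name below are placeholders -- replace them with your own.
    from sqlalchemy import create_engine

    engine = create_engine('mysql+pymysql://test:test@localhost:3306/ecommerce_shop')

    # Opening a connection verifies that the engine can actually reach the
    # database; an error here usually means bad credentials or a bad host.
    with engine.connect() as connection:
        print('Connected to database:', engine.url.database)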

How to Execute a SQL Query Against a Database with Pandas

Summary

  • Once you have established the connection to your database, the read_sql_table method can be used to view the contents of a table (SQL Equivalent: SELECT * FROM table). It uses the following syntax:
    pd.read_sql_table('table_name', con=[engine_variable])
  • The read_sql_table method also provides a columns parameter that lets you specify which columns to return from the table (SQL Equivalent: SELECT [columns] FROM table). It uses the following syntax:
    pd.read_sql_table('table_name', con=[engine_variable], columns=['column1', 'column2', 'column3'])
  • The read_sql_query method is used to run a raw SQL query against a database connection. This method is used with the following syntax:
    pd.read_sql_query('enter query here', con=[engine_variable])
  • NOTE: Although read_sql_query works for running a raw SQL query against a database connection, the read_sql method is more commonly used: it is a convenience wrapper that accepts either a table name or a raw query, and it uses the same syntax as read_sql_query. All three read methods are shown together in the sketch below.
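To see the read methods side by side, here is a short sketch against a hypothetical customers table; the table and column names are assumptions for illustration only, and engine is the connection engine created earlier.

    import pandas as pd

    # SQL Equivalent: SELECT * FROM customers
    all_customers = pd.read_sql_table('customers', con=engine)

    # SQL Equivalent: SELECT first_name, last_name FROM customers
    names = pd.read_sql_table('customers', con=engine,
                              columns=['first_name', 'last_name'])

    # A raw query via read_sql; read_sql_query accepts the same arguments.
    recent = pd.read_sql('SELECT * FROM customers WHERE id <= 100', con=engine)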

How to Append a DataFrame to an Existing Database Table

Summary

  • The to_sql method is used to write a DataFrame to a database, either as a new table or appended to an existing one. This method uses the following syntax:
    dataframe.to_sql('table_name', con=[engine_variable])
  • When using the to_sql method to upload a DataFrame as a new table, you should NEVER include the DataFrame's index; it can easily cause confusion in your relational model. To exclude the index, use the following syntax:
    dataframe.to_sql('table_name', con=[engine_variable], index=False)
  • When appending a DataFrame to an existing table, the to_sql method offers a parameter called if_exists. This parameter accepts one of three options: 'fail' (stop code execution if the table already exists), 'replace' (drop the table, recreate it, and insert the DataFrame's contents), or 'append' (add the DataFrame's contents to the end of the existing table). Pass exactly one of these values, as shown here and in the sketch after this list:
    dataframe.to_sql('table_name', con=[engine_variable], index=False, if_exists='append')
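As an illustration, the sketch below appends a small, made-up DataFrame to a hypothetical customers table; engine is the connection engine created earlier, and the data is fabricated purely for demonstration.

    import pandas as pd

    # A made-up DataFrame matching a hypothetical customers table.
    new_customers = pd.DataFrame({
        'first_name': ['Ada', 'Grace'],
        'last_name': ['Lovelace', 'Hopper'],
        'phone_num': ['5550000001', '5550000002'],
    })

    # Append the rows to the existing table, leaving the DataFrame's
    # index out of the database to keep the relational model clean.
    new_customers.to_sql('customers', con=engine, index=False, if_exists='append')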

Exercise

Congratulations! You just completed the Working with Databases Tutorial! To help test your knowledge, let's practice reading data from and writing DataFrames to a database.
**It's highly recommended that you complete the exercise outlined in the previous tutorial before beginning this exercise.**

Instructions:

  1. Open your IDE (e.g., VS Code, PyCharm, Spyder).
  2. Create a New Jupyter Notebook, or similar Notebook Environment. Name it "databases.ipynb"
  3. Using either MySQL or PostgreSQL, create a database with the following configurations:
    • Username: test
    • Password: test
    • Database Name: ecommerce_shop
    • Schema:
      • Customers
        • id (Integer Primary Key)
        • first_name (VARCHAR(50) NOT NULL)
        • last_name (VARCHAR(50) NOT NULL)
        • phone_num (CHAR(10) NOT NULL)
      • Orders
        • id (Integer Primary Key)
        • num_items (Integer NOT NULL)
        • total_price (Integer NOT NULL)
        • order_date (DATE NOT NULL)
        • status (VARCHAR(25), CHECK (status IN ('Completed', 'In Progress', 'Cancelled')))
        • customer_id (Integer Foreign Key References id column in Customers Table)
  4. Fill in the Database Tables with Dummy Data
  5. In the Notebook, complete the following tasks:
    1. Create a Connection to your Database (Answer will vary).
    2. View All the Data in the "Customers" Table.
    3. View All the Data in the "Orders Table".
    4. Write a SQL Query that shows the Customer's First and Last Names and the Date of their Most Recent Order. Output the Result as a DataFrame.
    5. Write a SQL Query that shows the Top 5 Most Expensive Orders. Output the Result as a DataFrame.
    6. Export the Results of the Query Produced in Task 4 to a New Table Called "most_recent_orders".
  6. Exercise Completed! Click here to view the answers.
  7. Have any issues with the above exercise? Post your question on Discord!