Text Files are very common for small-scale data storage. However, when you have very large amounts of tabular data that need to be compressed, you'll likely use a different file format. This tutorial will cover the following learning objectives:
How to Import and Export Pickle Files
How to Import and Export Parquet Files
How to Import and Export Pickle Files
Summary
The Pickle file format is a binary format used for serializing and deserializing data. It is commonly used on web servers when embedding large amounts of data on the backend.
The read_pickle method is used to convert a Pickle File into a DataFrame. This is used with the following syntax: pd.read_pickle('test.pkl')
The to_pickle method is used to export a DataFrame to a Pickle file. This is used with the following syntax: dataframe.to_pickle('test.pkl')
If you specify a compression algorithm when using the to_pickle method, you need to specify the same algorithm in the read_pickle method, as demonstrated in the example below. This is used with the following syntax: dataframe.to_pickle('test.pkl', compression='gzip') and pd.read_pickle('test.pkl', compression='gzip')
NOTE: Unlike the video, you DON'T need to import the Pickle library before reading a Pickle file into a Pandas DataFrame.
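As a minimal sketch of the round trip described above (assuming a small, hypothetical DataFrame and a file named test.pkl), note that the same compression algorithm is passed to both methods:

```python
import pandas as pd

# Hypothetical example DataFrame for illustration
df = pd.DataFrame({'name': ['Ada', 'Grace', 'Linus'], 'score': [95, 88, 72]})

# Export the DataFrame to a Pickle file, compressing it with gzip
df.to_pickle('test.pkl', compression='gzip')

# Read it back, specifying the same compression algorithm used when writing
restored = pd.read_pickle('test.pkl', compression='gzip')
print(restored.equals(df))  # True -- the data survives the round trip
```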
How to Import and Export Parquet Files
What is Apache Parquet and Why to Use It?
How to Import and Export Parquet Files Using Pandas
Summary
Unlike Excel Workbooks or CSV Files, Parquet Files are designed with Big Data Analytics in mind. This File Format is specifically built for reading and writing large amounts of tabular data in very short periods of time.
Excel Workbooks and CSV Files are row-oriented, meaning they are read row by row. This makes reading very slow once a file passes a certain threshold (about 100,000 rows on average). Parquet is column-oriented, meaning data is read by column values, which lets it read thousands of rows in a fraction of the time it takes traditional Text Files.
When it comes to repeated values, Text Files don't do anything special with them. Column-oriented File Formats such as Parquet, on the other hand, encode repeated values into a "lookup" metadata table that makes reading operations significantly faster.
Dictionary Encoding takes each distinct value in a column, sorts the values in ascending order, and gives each value a key to represent it. This creates a type of "lookup" table, similar to an index in a relational database.
Delta Encoding parses the values in each column and assigns a key to a portion of the string based on its frequency (the more often it occurs, the larger the key).
File Compression is the process of scanning a file and removing redundant parts. This saves space, making the file significantly smaller. With Parquet, you can use either LZO, GZIP, or Snappy. On Windows machines, GZIP is the default, though Snappy offers the best performance.
The read_parquet method is used to convert a Parquet File into a DataFrame. This is used with the following syntax: pd.read_parquet('test.parquet', engine='pyarrow')
The to_parquet method is used to export a DataFrame as a Parquet file, as demonstrated in the example below. This is used with the following syntax: dataframe.to_parquet('test.parquet', compression='gzip') (the compression argument accepts a single algorithm, such as 'gzip' or 'snappy')
NOTE: Before using the read_parquet method, you must install the PyArrow library. This can be done by running the following line of code in your Terminal: pip install pyarrow
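As a rough sketch of the Parquet round trip (assuming the PyArrow library is installed and using a small, hypothetical DataFrame), writing and reading a compressed Parquet file might look like this:

```python
import pandas as pd

# Hypothetical example DataFrame for illustration
df = pd.DataFrame({'city': ['Austin', 'Boston', 'Austin'], 'sales': [120, 340, 95]})

# Export the DataFrame to Parquet, choosing a single compression codec
df.to_parquet('test.parquet', engine='pyarrow', compression='snappy')

# Read the Parquet file back into a DataFrame using the same engine
restored = pd.read_parquet('test.parquet', engine='pyarrow')
print(restored.equals(df))  # True -- the data survives the round trip
```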
Exercise
Congratulations! You just completed the Working with Other Files Tutorial! To help test your knowledge, let's practice Reading and Writing some Files into and from DataFrames.
**It's highly recommended that you complete the exercise outlined in the previous tutorial before beginning this exercise.**
Instructions:
Open your IDE (e.g., VS Code, PyCharm, Spyder).
Create a New Jupyter Notebook, or similar Notebook Environment, and name it "other-files.ipynb".