Comparing the Speed and Efficiency of Different File Formats for Storing and Reading Data with Pandas
If you work with data in Python, you have many file formats to choose from for storing and reading data. But which one best fits your specific needs? In this blog post, we compare the speed and efficiency of five common file formats for storing and reading data with pandas, a popular data manipulation library for Python: CSV, feather, pickle, HDF5, and parquet. We measure the time it takes to write and read data with each format, and discuss the relative strengths and weaknesses of each option. By the end of this post, you should have a good understanding of which file format best fits your data and use case.
Installing PyTables (the tables package), which pandas needs for HDF5 support:
Successfully installed blosc2-2.0.0 cython-0.29.33 msgpack-1.0.4 numexpr-2.8.4 py-cpuinfo-9.0.0 tables-3.8.0
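The timings below come from writing and then reading the same DataFrame in each format. The original benchmark code is not shown, so the following is only a minimal sketch of how such a measurement could look. The sample DataFrame, its size, the file names, and the `timed` helper are all assumptions made for illustration. Feather and parquet need pyarrow (or another supported engine) and HDF5 needs tables; a format whose dependency is missing is skipped.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Hypothetical sample data: the original post does not show its DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.random(1_000),
    "b": rng.integers(0, 100, size=1_000),
})

# (label, file name, writer, reader) for each format compared in the post.
FORMATS = [
    ("CSV", "data.csv",
     lambda d, p: d.to_csv(p, index=False), pd.read_csv),
    ("feather", "data.feather",
     lambda d, p: d.to_feather(p), pd.read_feather),
    ("pickle", "data.pkl",
     lambda d, p: d.to_pickle(p), pd.read_pickle),
    ("HDF5", "data.h5",
     lambda d, p: d.to_hdf(p, key="df", mode="w"),
     lambda p: pd.read_hdf(p, key="df")),
    ("parquet", "data.parquet",
     lambda d, p: d.to_parquet(p), pd.read_parquet),
]


def timed(func, *args):
    """Call func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


write_times, read_times = {}, {}
with tempfile.TemporaryDirectory() as tmp:
    for label, name, writer, reader in FORMATS:
        path = os.path.join(tmp, name)
        try:
            _, write_seconds = timed(writer, df, path)
            _, read_seconds = timed(reader, path)
        except ImportError as exc:
            # Optional dependency (pyarrow or tables) is not installed.
            print(f"Skipping {label}: {exc}")
            continue
        write_times[label] = write_seconds
        read_times[label] = read_seconds

for label, seconds in write_times.items():
    print(f"Time to write {label}: {seconds:.5f} seconds")
for label, seconds in read_times.items():
    print(f"Time to read {label}: {seconds:.5f} seconds")
```

A single call per format is a noisy measurement, and the first call for a format can include one-time costs such as lazy imports. Repeating each operation several times and taking the minimum or median would likely give more stable numbers.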
Time to write CSV: 0.00554 seconds
Time to write feather: 0.16897 seconds
Time to write pickle: 0.00186 seconds
Time to write HDF5: 0.03703 seconds
Time to write parquet: 0.09065 seconds
Time to read CSV: 0.00331 seconds
Time to read feather: 0.00294 seconds
Time to read pickle: 0.00154 seconds
Time to read HDF5: 0.00417 seconds
Time to read parquet: 0.05417 seconds
Second run, with conclusion
Writing:
Time to write CSV: 0.00237 seconds
Time to write feather: 0.00260 seconds
Time to write pickle: 0.00199 seconds
Time to write HDF5: 0.00703 seconds
Time to write parquet: 0.00284 seconds
The fastest file format for writing is pickle
The slowest file format for writing is HDF5
The fastest file format for reading is pickle
The slowest file format for reading is HDF5
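The conclusion lines above can be derived directly from the measured times. As a small sketch, assuming the timings are kept in a dictionary keyed by format name (the `fastest_and_slowest` helper is a hypothetical name, not part of the original code), the write times from the second run give:

```python
# Write timings copied from the second run shown above (seconds).
write_times = {
    "CSV": 0.00237,
    "feather": 0.00260,
    "pickle": 0.00199,
    "HDF5": 0.00703,
    "parquet": 0.00284,
}


def fastest_and_slowest(times):
    """Return the labels with the smallest and largest elapsed time."""
    fastest = min(times, key=times.get)
    slowest = max(times, key=times.get)
    return fastest, slowest


fastest, slowest = fastest_and_slowest(write_times)
print(f"The fastest file format for writing is {fastest}")
print(f"The slowest file format for writing is {slowest}")
```

The same helper would apply to a dictionary of read times.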