How to Finally Ditch Spreadsheets in favour of Notebooks
Recently, it is becoming more and more common to hear stories where excel seems to behave like it is haunted by some kind of ghosts: human genes names are mistaken as dates , limits on the maximum number of rows simply delete additional rows without warning the user, and it also happened that someone lost around 6B$ for a spreadsheet error.
What it's slowly coming out is that when dealing with the so-called big data, AKA huge collections of data, usually containing more than 10K rows, spreadsheets are not the tool to go. There are several limitations in many aspects, like the maximum number of rows and columns, auto-formatting which can lead to mistakes and, of course, since spreadsheet is extremely widespread, different users having different literacies uses them, creating unpredictables situations, especially when a situation escalates and the tool is the only tool used.
In this article, I would like to show how to gradually switch from the spreadsheet in favour of modern notebooks, cloud storage, and how to performs most common spreadsheet operations using Python along with Pandas on a pretty heavy dataset.
Nowadays, there are many solutions out there that allow doing this: you can install Jupyter, work on Colab or use some hosted tools such as Sagemaker. However, it is not always so straightforward to get started and plug everything together. That is the reason why, for this article, I will use Deepnote, a free, hosted, and plug-n-play service where you can run notebooks, have real-time collaboration, and natively access different databases.
The dataset used in this article is the Brazilian E-Commerce Public Dataset by Olist, it contains information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers.
Since this dataset consists of nine different tables in
.csv format it is helpful to have a schema of the dataset, AKA a map that helps us to understand
how tables are related to each other, giving the possibility to perform a different operation such as
CONCATENATE and many others.
You can think of it as having different sheets related to each other.