From spreadsheets to notebooks

by Roberto Sannazzaro, Oct 19, 2020
  1. How to Finally Ditch Spreadsheets in favour of Notebooks
    1. Introduction
    2. The data
    3. Loading the data
    4. Answering business questions
      1. Total value of sales
      2. Average value of a transaction
      3. Where customers are based
      4. Most popular ordered products
    5. Where to go now?
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (20,3)
%config InlineBackend.figure_format='retina'

How to Finally Ditch Spreadsheets in favour of Notebooks

Recently, it has become more and more common to hear stories in which Excel seems to be haunted by some kind of ghost: human gene names are mistaken for dates, rows beyond the maximum row limit are silently deleted without warning the user, and in one case a spreadsheet error cost a firm around $6 billion.

Introduction

What is slowly becoming clear is that when dealing with so-called big data, that is, large collections of data usually containing more than 10K rows, spreadsheets are not the tool to reach for. They are limited in many respects: the maximum number of rows and columns, auto-formatting that can silently introduce mistakes, and, since spreadsheets are extremely widespread, users with very different levels of data literacy working on the same files, which creates unpredictable situations, especially when the stakes grow and the spreadsheet is the only tool in use.

In this article, I would like to show how to gradually move away from spreadsheets in favour of modern notebooks and cloud storage, and how to perform the most common spreadsheet operations using Python along with Pandas on a fairly heavy dataset.

Nowadays, there are many solutions that allow you to do this: you can install Jupyter, work in Colab, or use a hosted tool such as SageMaker. However, it is not always straightforward to get started and plug everything together. That is why, for this article, I will use Deepnote, a free, hosted, plug-and-play service where you can run notebooks, collaborate in real time, and natively access different databases.

The data

The dataset used in this article is the Brazilian E-Commerce Public Dataset by Olist. It contains information on 100k orders placed between 2016 and 2018 at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and, finally, the reviews written by customers.
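As a first taste of what this looks like in a notebook, here is a minimal sketch that loads one of the tables with Pandas and inspects it. The data/ folder and the file name olist_orders_dataset.csv are assumptions about where and how the CSV files are stored; adjust the paths to wherever you downloaded the dataset.

# Minimal sketch: load one Olist table and take a quick look at it.
# Assumes the CSV files live in a local "data/" folder (an assumption,
# not something prescribed by the dataset itself).
import pandas as pd

orders = pd.read_csv("data/olist_orders_dataset.csv")

print(orders.shape)    # number of rows and columns, with no sheet row limit to worry about
print(orders.head())   # first five rows, roughly the top of a sheet
print(orders.dtypes)   # how Pandas interpreted each column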

Since this dataset consists of nine different tables in .csv format, it is helpful to have a schema of the dataset: a map that shows how the tables relate to each other and makes it possible to perform the equivalents of operations such as VLOOKUP, SUMIF, CONCATENATE and many others.

You can think of it as a workbook whose sheets reference each other.
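To make that concrete, the sketch below shows how the spreadsheet operations mentioned above might translate to Pandas once the related tables are loaded. The column names used here (order_id, customer_id, price, customer_city, customer_state) are assumptions about the Olist schema; the point is the pattern, not the exact names.

# Hedged sketch: spreadsheet-style operations across related tables with Pandas.
# File paths and column names are assumptions and may need adjusting.
import pandas as pd

orders = pd.read_csv("data/olist_orders_dataset.csv")
items = pd.read_csv("data/olist_order_items_dataset.csv")
customers = pd.read_csv("data/olist_customers_dataset.csv")

# VLOOKUP equivalent: pull customer information into the orders table via a shared key.
orders_with_customers = orders.merge(customers, on="customer_id", how="left")

# SUMIF equivalent: total item price per order (sum of "price" grouped by "order_id").
sales_per_order = items.groupby("order_id")["price"].sum()

# CONCATENATE equivalent: combine two text columns into one.
orders_with_customers["customer_location"] = (
    orders_with_customers["customer_city"] + ", " + orders_with_customers["customer_state"]
)

Each of these lines replaces a formula that, in a spreadsheet, would have to be dragged down thousands of rows; here it is applied to the whole column at once.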