You might be using a lot of excel at your business. The goal of this notebook is going to be showing you that plenty of your excel workflows can be replaced by a few simple python scripts. This tutorial is going to assume no prior knowledge of python or pandas and all of the code and explanations will be included. This should hopefully make it as easy as possible to get an understanding and get started with code.
For the purpose of this tutorial I am going to be using a library called Pandas. Pandas will give our Python code the functionality we need to replace all your excel spreadsheets.
Firstly, we need to install Pandas.
Next we are going to import pandas into our code, this lets us use all the functionality associated with Pandas as it is not included with Python by default. We are going to nickname it pd for ease of use. This is also a common convention.
I have included some sample data that we will be working with. The data is a xlsx file and we will therefore use a function from pandas called
read_excel. We will then assign that data to a variable called
df which stands for dataframe. If your data is in another format don't worry pandas has functions for
read_sql among others.
The error message we get says we need to install
xlrd this is easy to do with the python package installer
pip just like with pandas we can install
Since our excel file has multiple sheets we will have to specify which sheet we want to read. Simply typing the name of the variable after we have read it will allow Deepnote to show us some high level information about the spreadsheet.
So we can see we have a couple of columns relating to sales performance. We have the data as sale occurred, in what region. What salesperson made the sale and what was the item. How many they sold, what the unit cost was and what the total was.
Now that we have the data loaded in we can start to explore it a little and get some work done with it.
describe method will give us some information about the numerical columns in our data. This tells us the number of items in the columns as well as the mean and standard deviation among other pieces of information.
One of the most basic things to do with an excel spreadsheet is sort data. Fortunately this is simple to do in pandas. Using the
sort_values method we can give a column as input and our Dataframe will then be sorted by that column. Note that dates will be sorted chronologically and words will be sorted alphabetically. Dont forget to assign it to a new dataframe, the name being the same does not matter, or else the change wont be in affect after that line.
Feel free to test this out by passing in any column name. Know that Deepnote also lets you generate easy line graphs in the visualize tab. Its an easy way to make some data visualization simple.
Remove Duplicates from Data
Removing duplicates is a important part of dealing with any dirty dataset. Fortunately this too is easy to do in a single line with Pandas the function is suitably named
Since our data is all unique this doesn't do anything, this is because as long as a single column is unique pandas considers the row to be unique but we can provide a set of columns and it will look for duplicates where only the subset of columns match
This command dropped all the duplicates in the Rep columns so they are all now unique, and chose to keep the first row it found
Do Some Math!
Doing math in Pandas is what is was designed to do. Understanding how to access columns and create new ones is the key point to understand here.
We know that the total column is created by multiplying the units column by the unit cost column. We can verify that easily ourselves. Creating a new column in pandas is simple and using the relationships between columns
As we can see both the total and the new total column match up perfectly. But we already knew that, say we wanted to find which Representative sold the most product. In this case we would want to
groupby the Rep column
That was not what we wanted. In order to actually see the dataframe with the results we want we will have to use an aggregate function . Pandas has a couple of aggregate functions but in this case we will want to use the
Now lets sort by that
Total and see who are our top performers!
There we have it Kivell is our highest performer!
Filtering the data based on a given parameter is easy as well. We can use square brackets and specify the condition we want to use to filter the data. In this case we want to address all the top performers who are selling over 2000$ in merchandise.
This code gives us all the rows in the dataframe where the total row is greater than 2000.
This is whats happening behind the scenes. A comparison is made for every single row.
If you are more familiar with pivot tables the same can be accomplished easily with the
pivot_table function. By setting the aggregator function to sum and the column to group by as
Rep we can accomplish the same goal in similar code. The groupby function and pivot tables are very similar and will ultimately boil down to preference.
Usually a dataset wont be as complete and clean as this one. Sometimes a dataset has null or empty values to be dealt with. Pandas has a couple of nice ways to deal with this. First we will have to introduce some null values. The
iloc function will let us locate any cell in the dataframe.
Now we have some nice null values to deal with. In excel this could be a pain. But luckily we have a couple simple solutions to this problem.
Lets check how many null values we have in each columns. The
isna function makes it easy to check what is null.
fillna method is waht we need to make this right, and within the
fillna method there are a few options. Pandas can let you choose the previous non null value and propogate forward or fill in with 0. Python and pandas let you be flexible.
Finally save your data so it can be used later. We will need another library to make this work.
With these snippets it should be relatively simple to move your work over from excel to a couple of Deepnote notebooks. Python together with Pandas give you much more control and flexibility over your data. Make sure to check out stackoverflow for help with your code and the deepnote community for questions about deepnote.
Many of the ideas explained in this article has were inspired by the articles below.
Source of Data