Sidetable Gives You the Pandas Methods You Didn't Know You Needed

Quickly make DataFrames for missing values, frequency counts, subtotals, and more 🎉

Sidetable Gives You the Pandas Methods You Didn't Know You Needed
 - preview image
Sidetable is a new Python library that adds several helpful pandas DataFrame methods. It is nice for exploratory data analysis and presenting information. In this guide I'll show you how to use sidetable and where it could be a good fit in your data science workflow. As we explore sidetable, we'll use the penguins dataset that was recently added to the [seaborn](https://seaborn.pydata.org/) visualization library. The penguins dataset is mean to be a replacement for the overused [iris](https://archive.ics.uci.edu/ml/datasets/iris) dataset. If nothing else, it adds some variety and allows us to show some penguin pictures. 😀
Pandas is the exploration tool for data analysts and data scientists who use Python. 🐼 The pandas API is large and oriented around data cleaning and data wrangling. The new [sidetable](https://github.com/chris1610/sidetable) package adds convenience methods to DataFrames. These methods make it easier to see missing values, counts of values per column, subtotals, and grand totals.
Let's check it out! 🚀
## Setup To get the latest versions of necessary packages and their dependencies, uncomment and run the following code one time.
# ! pip install sidetable -U # ! pip install pandas -U
Let's import the packages and check the versions.
import sys import pandas as pd import sidetable print(f"Python version {sys.version}") print(f"pandas version: {pd.__version__}") print(f"sidetable version: {sidetable.__version__}")
If your Python version is less than 3.6, I suggest you updated it. Same goes for pandas if your version is less than 1.0. To learn more about the pandas 1.0 update, check out my article [here](https://towardsdatascience.com/whats-new-in-pandas-1-0-ffa99bd43a58).
## Penguin data 🐧 We'll read in the Antartica penguins dataset directly from a .csv file at the GitHub repository that hosts seaborn's datasets. The data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER. See more info [here](https://github.com/allisonhorst/palmerpenguins). Let's put the data in a pandas DataFrame and check out the first few rows.
df_penguins = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv') df_penguins.head(2)
Alright, looks like we've got some penguins. 👍 Alternatively, this dataset can be loaded through seaborn if you install [seaborn](http://seaborn.pydata.org/), `import seaborn as sns` and run `df_penguins = sns.load_dataset('penguins')`. After reading a new dataset into DataFrame, my next move is to use `df.info()` to get some information about it.
df_penguins.info()
We see some basic info about our 6 columns and 344 rows. Let's see how sidetable can help us explore our data.
![Adelie Penguin. source: pixabay.com](https://storage.googleapis.com/published-content/sidetable-pandas-methods-you-didnt-know-you-needed/adelie_penguin.png) Adelie Penguin. source: pixabay.com
## Exploring sidetable All sidetable methods use the `.stb` accessor. For more about the pandas accessor API, see this [post](https://pbpython.com/sidetable.html) by Chris Moffit, the author of sidetable. ### stb.missing() The first sidetable DataFrame method we'll check out is `stb.missing()`. Here's how to use it:
df_penguins.stb.missing()
The result is a DataFrame with the number of missing values per column, ordered by most to fewest. It also displays the total number of rows and the percentage of missing values for each column. With `df.info()`, we would have had to do the arithmetic in our head. 🤔 Alternatively, we could see all the missing values by column with `df.isna().sum()`.
df_penguins.isna().sum()
But `df.isna().sum()` doesn't include the percentages and isn't nicely formatted. Speaking of nice formatting, let's make sidetable's output prettier. 💐
df_penguins.stb.missing(style=True)
Passing `style=True` cleans up the *Percent* column formatting. I plan to use `stb.missing()` whenever I have a bunch of columns with missing data.
### stb.freq() Now let's look at the sidetable DataFrame method `.stb.freq()`, the main course in the sidetable package. 🍲 Let's check out the *species* category.
df_penguins.stb.freq(['species'])
The result of `stb.freq()` is like combining value_counts with and without the `normalize=True` argument, the cumulative count, and the cumulative percentage. As with `stb.missing()`, we can make the styling even nicer by passing `style=true`. 🎉
df_penguins.stb.freq(['species'], style=True)
That's handy.😀
Note that all the sidetable methods operate on DataFrames and return a DataFrame.
type(df_penguins.stb.freq(['species']))
You know, with this data being nominal - not ordinal - we might not want the cumulative columns. We can get rid of those columns by passing `cum_cols=False` like this:
df_penguins.stb.freq(['species'], style=True, cum_cols=False)
![Gentoo Penguins](https://storage.googleapis.com/published-content/sidetable-pandas-methods-you-didnt-know-you-needed/gentoo_penguins.jpg) Gentoo Penguins. source: pixabay.com
Let's see how things look if we pass multiple columns to `stb.freq()`.
df_penguins.stb.freq(['species', 'island', 'sex'])
That's quite a breakdown. Let's look at just the *island* column to explore some more optional arguments.
df_penguins.stb.freq(['island'], style=True)
What if we just want to include islands that make up 50% of the total count? That example is a little contrived, but we're trying to show functionality, so work with me here. 😉 Pass `thresh=.5` to show only the islands contributing up to 50% of the total.
df_penguins.stb.freq(['island'], style=True, thresh=.5)
Notice that this might be a bit confusing. We aren't showing all the islands with at least 50% of the total. We are showing all the islands up to a cumulative threshold. Here's how things look with a threshold of .9.
df_penguins.stb.freq(['island'], style=True, thresh=.9)
If you want to change the label for the values ommitted, you can pass `other_label='my_label'` like this:
df_penguins.stb.freq(['island'], style=True, thresh=.9, other_label='Other Islands')
Pass `value='my_column'` to sum and dislay by the values in that column, instead of counting the occurrences. This doesn't really make sense to do with the current dataset, but it's a nice feature to know about. Here's how the output looks:
df_penguins.stb.freq(['island'], value='flipper_length_mm')
In our case, the original columns aren't capitalized, but the new columns created by sidetable are. If you want the column capitalization to match, you can adjust the resulting DataFrame.
freq_table = df_penguins.stb.freq(['island']) freq_table.columns = freq_table.columns.str.title() freq_table
#### Other EDA options Sidetable's `stb.freq()` is nice because it's lightweight and informative. In some cases you'll want a different tool. If I'm looking for descriptive statistics on numeric data, I often use `df.describe()`.
df_penguins.describe()
Passing `include='all'` shows some information about the non-numeric columns, but then things are a little messy. 🙁
df_penguins.describe(include='all')
If you want a lot of information about your data, check out the [Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling) package. It provides a comprehensive report with descriptive stats, histograms, correlations, and more. It's pretty awesome. However, it can take a while to run and be a bit much for many use cases.
Now let's see the final sidetable method.
## stb.subtotal()
If you want to display a DataFrame with a *Grand Total* row that shows the sums of the numeric columns, use `stb.subtotal()`.
df_penguins.stb.subtotal().tail()
This is handy for financial documents and other cases where you want to show the data and the totals in the same table. By combining `df.groupby()` with `stb.subtotal()` you get a grand total and nicely formatted subtotals. Here's a groupby of *species* and *sex*, with a count of *island*.
df_penguins.groupby(['species', 'sex']).agg(dict(island='count'))
Not earth-shattering information in our case, but it shows the counts in each group. The information would be easier to comprehend at a glance with some subtotals and a grand total. `stb.subtotal()` adds those for us. 🎉
df_penguins.groupby(['species', 'sex']).agg(dict(island='count')).stb.subtotal()
Subtotals are often helpful with financial data or ordinal data. `stb.subtotal()` makes it more pleasant to do budgeting and financial reporting tasks that I would normally do in Google Sheets or Microsoft Excel.
## Wrap You've seen how you can use sidetable to quickly display missing values, make nicely formatted frequency tables, and show grand totals and subtotals. It's a handy little library. 👍 ### Recap Here's a recap of the API: - `stb.missing()` - display helpful information about missing values. - `stb.freq()` - display counts, percents, and cumulative information for columns. - `stb.subtotal()` - add a grand total row to a DataFrame. If applied to a groupby, add subtotal information for each group. Passing `style=True` to `stb.missing()` and `stb.freq()` makes the output nicely formatted. There are a nubmer of other arguments you can pass to `stb.freq()` to modify the output.
I hope you found this introduction to sidetable to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀 If you have questions or comments, please share them with me on [Twitter](https://twitter.com/discdiver) or [LinkedIn](https://www.linkedin.com/in/-jeffhale/). Happy sidetabling! 🚀
![Gentoo Penguin](https://storage.googleapis.com/published-content/sidetable-pandas-methods-you-didnt-know-you-needed/gentoo_penguin.jpg) Gentoo Penguin. source: pixabay.com

Run this article as a notebook

Deepnote is a new kind of data science notebook. Jupyter-compatible and with real-time collaboration.

Sign-up for the waitlist below, or find out more here.

To be continued...