Learning On Your Own: Plot with Less Code | Mathaus Silva

1 - Introduction: Easier and Faster

In some cases, you can plot data directly from pandas without calling matplotlib yourself (pandas uses matplotlib as its default plotting backend). Medium author Andre Ye argues that matplotlib, despite being "generally considered to be the simplest way to create visualizations in Python", is far inferior to pandas for this kind of work. He believes pandas provides simpler, quicker plots for analysis and a more convenient and direct interface for data manipulation and visualization.

2.1 - Creating and Plotting Visualization: Bar Charts, Stacked Bar Charts, and Horizontal Bar Charts

First, pandas makes it easier to create plots directly from a DataFrame. You can switch between bar charts, stacked bar charts, and horizontal bar charts just by changing a parameter or the method name.

Consider this example, taken directly from Andre Ye's blog post. After importing pandas and numpy, we randomly generate a DataFrame with four columns (A, B, C, and D) and ten rows.

import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
data.head()

"data.head()" returns the first five rows of the DataFrame. To plot the values of each column for every row, we simply use "data.plot.bar()".

data.plot.bar();

The bar chart above displays the values of each column for every row. However, smaller values are difficult to see in this graph (e.g. column B of row 4, or 4B). If you want to stack the columns for a clearer visualization, you just need to pass "stacked=True" as an argument.

data.plot.bar(stacked=True);

Now that each segment spans the full width of its bar instead of sharing it with three neighbors, you can more clearly see 4B in comparison to the other columns in its row.
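To see what "stacked=True" changes, one option is to inspect the rectangles that pandas draws. The sketch below is not from Andre Ye's post. It assumes the same kind of random DataFrame and relies on "data.plot.bar()" returning a matplotlib Axes, whose "containers" attribute holds one group of bars per column. The variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a script; omit in a notebook
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])

grouped_ax = data.plot.bar()
stacked_ax = data.plot.bar(stacked=True)

# Width of one rectangle in each chart: grouped bars share a slot, stacked segments do not.
grouped_width = grouped_ax.patches[0].get_width()
stacked_width = stacked_ax.patches[0].get_width()

# In the stacked chart, the top edge of the last column's segment is the row total.
last_segments = stacked_ax.containers[-1]
stack_tops = np.array([bar.get_y() + bar.get_height() for bar in last_segments])

print("stacked segments are wider:", grouped_width < stacked_width)
print("stack tops equal row sums:", np.allclose(stack_tops, data.sum(axis=1)))
```

Because every stack ends at its row total, the stacked chart also lets you compare rows as a whole, which the grouped chart does not show directly.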

Or, try using "data.plot.barh", which changes the bar's orientation from vertical to horizontal.

data.plot.barh(stacked=True);

With pandas, each of these variants takes only one line of code because the plotting methods work directly on the DataFrame. The same cannot be said of matplotlib.
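The three variants can also be written through the general "data.plot()" method, where the chart type is itself a parameter ("kind"). The following is a minimal sketch of that idea rather than code from the original post. The "variants" dictionary and the titles are hypothetical names chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a script; omit in a notebook
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])

# Each variant differs from the others only in its keyword arguments.
variants = {
    'grouped': dict(kind='bar'),
    'stacked': dict(kind='bar', stacked=True),
    'horizontal stacked': dict(kind='barh', stacked=True),
}

axes_by_variant = {}
for title, kwargs in variants.items():
    # DataFrame.plot draws on a new figure and returns its matplotlib Axes.
    axes_by_variant[title] = data.plot(title=title, **kwargs)
```

Writing the options as data like this makes it easy to try several chart types on the same DataFrame in a loop.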

2.2 - Creating and Plotting Visualization: matplotlib

Now that we've seen pandas' practicality and simplicity when creating visualizations, I would like to show what it would take for matplotlib to plot the same bar chart we plotted in pandas.

First, we have to import matplotlib, assign to variable x the evenly spaced positions 0 through 9 (one per row), and declare our bar width. Then, we plot every column of the DataFrame 'data' manually, each with its own offset, color, width, and label. Finally, we build the tick labels in a for loop, rotate them vertically, and let matplotlib pick the best location for the legend.

import matplotlib.pyplot as plt
x = np.arange(10)
width = 0.15
plt.bar(x + width, data['A'], color='b', width=width, label='A')
plt.bar(x + 2*width, data['B'], color='y', width=width, label='B')
plt.bar(x + 3*width, data['C'], color='g', width=width, label='C')
plt.bar(x + 4*width, data['D'], color='r', width=width, label='D')
labels = []
for i in range(0,10):
    labels.append(i)
plt.xticks(x + 3*width, labels = labels, rotation='vertical')
plt.legend(loc='best')
plt.show()

Compared to pandas' one-liner, matplotlib requires 13 lines of code to produce a similar, simple bar chart. This is because pandas infers much of what we want from the DataFrame itself (bar positions from the index, legend labels from the column names), so in many cases it can draw the plot without us declaring these details explicitly.
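One way to see some of what pandas fills in on our behalf is to look at the Axes object that "data.plot.bar()" returns. This is a small sketch under the assumption that the DataFrame is built as in section 2.1. "legend_labels" and "tick_positions" are illustrative names, not part of the original example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a script; omit in a notebook
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 4), columns=['A', 'B', 'C', 'D'])
ax = data.plot.bar()

# Legend entries were taken from the column names; we never passed label=...
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]

# One tick per row of the DataFrame; we never called plt.xticks(...)
tick_positions = list(ax.get_xticks())

print(legend_labels)
print(len(tick_positions))
```

These are the same details the matplotlib version sets by hand with "label=", "plt.xticks", and "plt.legend".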

3 - Data Manipulation Functions: Differencing and Rolling Means

One of the major benefits of using pandas directly is being able to use its DataFrame manipulations. Our first example is "data.diff()", which returns the difference between each row and the row before it.

data.diff()

Since the first row has no row before it, it is normal for its values to be NaN.
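Because the data above is random, the effect of "diff()" is easier to verify on a tiny hand-written table. The sketch below uses a hypothetical three-row DataFrame called "small" (not part of the original example) and compares "diff()" with subtracting a shifted copy by hand.

```python
import pandas as pd

# Hypothetical three-row example so the differences can be checked by eye.
small = pd.DataFrame({'A': [1.0, 4.0, 9.0], 'B': [10.0, 7.0, 7.0]})

diffed = small.diff()
# Row 0: NaN, NaN  (no previous row)
# Row 1: 4 - 1 = 3.0,  7 - 10 = -3.0
# Row 2: 9 - 4 = 5.0,  7 - 7  =  0.0

# The same result written out by hand: each row minus the row above it.
manual = small - small.shift(1)

print(diffed)
print(diffed.equals(manual))
```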

Now, using "data.diff()", we can plot out the differenced data using a box and whisker plot.

data.diff().plot.box(vert=False, color={'medians':'lightblue', 'boxes':'blue', 'caps':'darkblue'});

Once again, pandas' use of parameters showcases the simplicity and practicality of changing colors while plotting.

Another example of a handy data manipulation function in pandas is ".rolling().mean()". By taking the rolling mean (the average over a sliding window of rows, ten in this example), you can reduce the noisiness of the dataset. As a comparison, we can plot the original noisy data (in blue) and the rolling mean (in orange) on a line graph.

data2 = pd.DataFrame(np.random.rand(100, 1), columns=['value']).reset_index()
data2['value'].plot()
data2['value'].rolling(10).mean().plot();
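To make the window calculation concrete, here is a small worked example with a window of 3 instead of 10. The Series "values" is hypothetical and chosen so the averages are easy to check. The "min_periods" argument shown at the end is an optional extra that the original example does not use.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each entry is the mean of the current value and the two before it.
rolled = values.rolling(3).mean()
# NaN, NaN, (1+2+3)/3 = 2.0, (2+3+4)/3 = 3.0, (3+4+5)/3 = 4.0

# With min_periods=1, shorter windows at the start are averaged instead of left as NaN.
partial = values.rolling(3, min_periods=1).mean()
# 1.0, (1+2)/2 = 1.5, 2.0, 3.0, 4.0

print(rolled.tolist())
print(partial.tolist())
```

This also explains why the orange line in the plot above starts a few points later than the blue one: the first nine rolling means are NaN and are not drawn.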

While "data.diff()" and "data.rolling().mean()" are among pandas' most useful data manipulation functions, other, more advanced types of plots can also be created directly from the data, such as "kde" or "density" for density plots, "scatter" for scatter plots, and "hexbin" for hexagonal bin plots.

data.plot.kde(); #distribution plot

data.plot.scatter(x='A', y='B',      # scatterplot x and y
                  c='C',             # color of data points
                  s=data['C']*200);  # size of data points

data.plot.hexbin(x='C', y='D',  # hexbin x and y
                 gridsize=18);  # number of hexagons along the x-axis

For more information, make sure to check pandas' user guide on more advanced chart visualization tools as well as KGP Talkie's "Complete Data Visualization in Pandas" crash course on YouTube:
1. https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
2. https://www.youtube.com/watch?v=0IKz-KxDCuM&t=2322s (starting at 37:57)

4 - Subplots: Pie Graph and Line Graph

With pandas, it is very simple to create subplots from the data. By specifying "subplots=True", pandas automatically creates one subplot per DataFrame column. Since the DataFrame created below has columns "X" and "Y" and rows "A" through "E", "data.plot.pie(subplots=True)" generates two pie graphs with five slices each.

data = pd.DataFrame(np.random.rand(5, 2), index=list("ABCDE"), columns=list("XY"))
data.plot.pie(subplots=True, figsize=(8, 4));

If you wish to customize the pie graph further, pandas accepts additional parameters. These include custom labels for the slices (labels=[]), a color for each slice (colors=[]), and the size of the labels (fontsize=). If we were using matplotlib instead, we'd have to create the two subplots manually.
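For comparison, here is roughly what the manual matplotlib version of the two pie graphs could look like. This is a sketch written for this post, not code from Andre Ye's article, and it only approximates the pandas output (for instance, it puts the column name in a title, whereas pandas places it as an axis label).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(5, 2), index=list("ABCDE"), columns=list("XY"))

# Create the figure and its two side-by-side subplots ourselves.
fig, axes = plt.subplots(1, 2, figsize=(8, 4))

# Draw one pie per column, passing the row labels explicitly each time.
for ax, column in zip(axes, data.columns):
    ax.pie(data[column], labels=list(data.index))
    ax.set_title(column)
```

The loop over columns and the explicit "labels=" argument are exactly the parts that "subplots=True" takes care of in the pandas one-liner.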

Finally, pandas' simplicity and practicality can also be seen when plotting line graphs. In this example, subplots are used to graph four different line graphs from columns "A" through "D" using ".plot()".

data = pd.DataFrame(np.random.rand(100, 4), columns=['A','B','C','D'])
data.plot(subplots=True, figsize=(20,10));

As with the previous pie chart, you can also change a line graph's parameters. For example, adding the parameter "layout=(2, 2)" makes pandas arrange its subplots in a grid of two rows and two columns.

data.plot(subplots=True,layout=(2,2),figsize=(20,10));
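If you need to adjust the individual panels afterwards, the value returned by "data.plot()" can be used for that. The sketch below assumes the same random DataFrame and shows that, with "subplots=True" and a layout, pandas hands back an array of matplotlib Axes arranged like the grid. The y-limits and legend position are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs as a script; omit in a notebook
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(100, 4), columns=['A', 'B', 'C', 'D'])

# With subplots=True and layout=(2, 2), the result is a 2-by-2 array of Axes.
axes = data.plot(subplots=True, layout=(2, 2), figsize=(20, 10))
print(axes.shape)

# Each panel can then be tweaked individually through matplotlib.
for ax in axes.ravel():
    ax.set_ylim(0, 1)             # same vertical scale on every panel
    ax.legend(loc='upper right')  # keep the legend in a fixed corner
```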