What is Pandas?
pandas is a very popular and easy-to-learn Python library for handling tabular data. It can take in data from a wide range of sources such as CSV files, Excel files, HTML tables on the web, and text files. It allows you to apply the same framework to all of these sources to clean and analyze the data using optimized built-in functionality which scales very well with large datasets.
We begin by importing pandas, conventionally aliased as pd. We can then import a CSV file as a DataFrame using the pd.read_csv() function, which takes in the path of the file you want to import. To view the DataFrame in a Jupyter notebook, we simply type the name of the variable.
Since there are so many rows in the DataFrame, we see that most of the data is truncated. We can view just the first or last few entries in the DataFrame using the .head() and .tail() methods.
Selecting Column
Typically, we will only want a subset of the available columns in our DataFrame. We can select a single column using single brackets and the name of the column as shown below.
To select multiple columns at once, we use double brackets and commas between column names as shown below.
The result is a new DataFrame object with the selected columns. It is useful to select the columns you are interested in analyzing before moving onto the analysis, especially if the data is wide with many unnecessary variables.
To check the data types of columns we call the .dtypes attribute of the DataFrame. To convert a column to a datetime index, we use the .to_datetime() functions (these functions exist for all supported data types like .to_string() to convert a column to be stored as a string).
In the code below, we also see the syntax to both edit existing columns and create new ones. Specifically, we want to convert the last_review column to a datetime column. So we select it as seen in the previous section and set it equal to the result of the operation. Datetime series have a .dt attribute with built-in attributes and functions. Below, we select the .year attribute of the newly typed datetime column, last_review, to get the year of each row.
Series String Functions
Another useful data cleaning tool is removing leading and trailing whitespace from string data. This can be done using the strip method.
Derived Columns
One useful data cleaning/preparation technique we will cover is combining rows. If we want to make calculations between columns, we can easily do this by applying the operation to each of the series as shown below. Here, we are calculating the minimum number of revenue a listing generates, by calculating the product of the minimum number of stays and the price per night.
Summary Statistics
Once the data is clean and ready to analyze, we can compute some interesting statistics to answer some business questions. The first question we may have is what the average and median price is for the listings in our data. We use the built-in .mean() and .median() methods to compute these.
Grouped Statistics
We can also conduct these calculations on groupings of data using the .groupby() method. This function is very similar to using pivot tables in excel as we select a subset of columns in our data and then conduct aggregate calculations on them. As we mentioned in the introduction of this case study, we are interested in the difference in prices between each type of room listing in our data.
Filtering Data
Often, we are only interested in a subset of the rows in our dataset. For example, we may only be interested in listings under $1000 as they are more common and closer to the typical listing. We do this by passing a Boolean expression into single brackets as shown below.
We can also pass in multiple filters by surrounding each expression in parenthesis and using either & (for and expressions) or | (for or expressions). You will get an error if you do not surround the expressions with parentheses.
Plotting
pandas also has built-in plotting capabilities. For example, we can see the distribution of prices for each listing in our dataset using a histogram in one line of code. Note, we use the under $1000 DataFrame here as we cannot see the bars very clearly when including all prices.
Pandas Series
Series is the primary building block of pandas. It represents a one-dimensional labeled Numpy array
Series Attributes
s.index : show the indexes s.values : show the values len(s) : number of elements s.head() : first 5 rows s.head(10) : first 10 rows s.tail() : last 5 rows s.tail(10) : last 10 rows
Select Series Elements s['b'] : by named label s[2] : by integer index s[['b','d']] : multiple select by label s[[1,3]] : multiple select by index integer s[3:8] : slice items 3 to 8
Pandas Dataframe
df.shape : Dimensionality of a DF df.columns : columns of a DF df.index : index of a DF df.values : values of a DF