C02_Pandas_Lauren

What is Pandas?

pandas is a very popular and easy-to-learn Python library for handling tabular data. It can take in data from a wide range of sources such as CSV files, Excel files, HTML tables on the web, and text files. It allows you to apply the same framework to all of these sources to clean and analyze the data using optimized built-in functionality which scales very well with large datasets.

We begin by importing pandas, conventionally aliased as pd. We can then import a CSV file as a DataFrame using the pd.read_csv() function, which takes in the path of the file you want to import. To view the DataFrame in a Jupyter notebook, we simply type the name of the variable.

import pandas as pd
pd.__version__

# Download a sample file from http://insideairbnb.com/
!wget http://data.insideairbnb.com/united-states/fl/broward-county/2022-06-17/visualisations/listings.csv -O listings.csv

# read the Airbnb listings CSV file (Broward County, FL)
airbnb = pd.read_csv("listings.csv")

# display the pandas DataFrame
display(airbnb)

Since there are so many rows in the DataFrame, we see that most of the data is truncated. We can view just the first or last few entries in the DataFrame using the .head() and .tail() methods.

# View first few entries
airbnb.head()

# View last few entries
airbnb.tail()
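The loading and previewing steps can also be tried without downloading anything. The sketch below is only an illustration: the three-row CSV text and its column values are made up, and io.StringIO stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for listings.csv
csv_text = """id,name,room_type,price
1,Beach condo,Entire home/apt,150
2,Cozy room,Private room,60
3,Bunk bed,Shared room,25
"""

# read_csv accepts a file path or any file-like object
listings = pd.read_csv(io.StringIO(csv_text))

first_two = listings.head(2)  # first 2 rows
last_one = listings.tail(1)   # last row only
print(listings.shape)         # (number of rows, number of columns)
```

Both .head() and .tail() take an optional row count and default to 5 rows.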

Selecting Columns

Typically, we will only want a subset of the available columns in our DataFrame. We can select a single column using single brackets and the name of the column as shown below.

# Results for a single column
airbnb['name']

To select multiple columns at once, we use double brackets and commas between column names as shown below.

The result is a new DataFrame object with the selected columns. It is useful to select the columns you are interested in analyzing before moving onto the analysis, especially if the data is wide with many unnecessary variables.

# results for multiple columns
hosts = airbnb[['host_id', 'host_name']]
hosts.head()

To check the data types of the columns, we read the .dtypes attribute of the DataFrame. To convert a column to datetime, we pass it to the top-level pd.to_datetime() function. Similar converters exist for a few other types, such as pd.to_numeric() and pd.to_timedelta(). For general conversions, such as storing a column as strings, we use the .astype() method instead (for example, .astype(str)).

In the code below, we also see the syntax to both edit existing columns and create new ones. Specifically, we want to convert the last_review column to a datetime column. So we select it as seen in the previous section and set it equal to the result of the operation. Datetime series have a .dt attribute with built-in attributes and functions. Below, we select the .year attribute of the newly typed datetime column, last_review, to get the year of each row.

# Show the data types for each column
airbnb.dtypes

# Change the type of a column to datetime
airbnb['last_review'] = pd.to_datetime(airbnb['last_review'])
airbnb.dtypes

# extract the year from a datetime series
airbnb['year'] = airbnb['last_review'].dt.year
airbnb['year'].head()
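A self-contained sketch of the same conversions on made-up data follows. The column values are hypothetical. It also shows pd.to_numeric() and .astype(), which the listings example does not use.

```python
import pandas as pd

# Hypothetical data: dates and prices both arrive as text
df = pd.DataFrame({
    "last_review": ["2021-03-15", "2022-06-01", None],
    "price": ["100", "250", "80"],
})

# Top-level converters return a new Series, which we assign back
df["last_review"] = pd.to_datetime(df["last_review"])
df["price"] = pd.to_numeric(df["price"])

# .astype() handles general conversions, e.g. numbers to strings
df["price_text"] = df["price"].astype(str)

# The .dt accessor exposes datetime parts; missing dates stay missing
df["year"] = df["last_review"].dt.year
print(df.dtypes)
```

A missing date becomes NaT after conversion, and its year comes out as a missing value, so the year column typically holds floats rather than integers.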

Series String Functions

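Another useful data cleaning tool is removing leading and trailing whitespace from string data. This can be done using the .str.strip() method. The same .str accessor provides other string methods, such as .upper() and .lower() to change the case of every value in a series.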

# Strip leading and trailing spaces from a string series
airbnb['name'] = airbnb['name'].str.strip()
airbnb['name'].head()

# uppercase all strings in a series
airbnb['name_upper'] = airbnb['name'].str.upper()
airbnb['name_upper'].head()

# lowercase all strings in a series
airbnb['name_lower'] = airbnb['name'].str.lower()
airbnb['name_lower'].head()
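The string methods above can be checked on a small series. The names here are made up for illustration, and one missing value is included to show how it is handled.

```python
import pandas as pd

# Hypothetical listing names, one with stray spaces and one missing
names = pd.Series(["  Beach Condo ", "Cozy Room", None])

cleaned = names.str.strip()  # remove leading/trailing whitespace
upper = cleaned.str.upper()  # uppercase every value
lower = cleaned.str.lower()  # lowercase every value

print(cleaned.tolist())
```

The .str methods skip missing values instead of raising an error, so the missing name stays missing in every result.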

Derived Columns

Another useful data preparation technique is combining columns. If we want to make calculations between columns, we can apply the operation directly to the series, and pandas computes it row by row. Here, we calculate the minimum revenue a listing generates per booking as the product of the minimum number of nights and the price per night.

# calculate using two columns
airbnb['min_revenue'] = airbnb['minimum_nights'] * airbnb['price']
airbnb[['minimum_nights', 'price', 'min_revenue']].head()
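As a quick check of how the column arithmetic works, here is a minimal sketch on three made-up rows. The column names match the listings data, but the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical listings: minimum stay in nights and nightly price
stays = pd.DataFrame({
    "minimum_nights": [2, 30, 1],
    "price": [150, 60, 25],
})

# Multiplying two columns works element-wise, one result per row
stays["min_revenue"] = stays["minimum_nights"] * stays["price"]
print(stays)
```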

Summary Statistics

Once the data is clean and ready to analyze, we can compute some statistics to answer business questions. The first question we may have is what the average and median prices are for the listings in our data. We use the built-in .mean() and .median() methods to compute these. The .std() and .var() methods give the standard deviation and variance, which describe how widely the prices are spread.

# get the mean price
airbnb['price'].mean()

# get the median price
airbnb['price'].median()

# get the standard deviation of the price
airbnb['price'].std()

# get the variance of the price
airbnb['price'].var()
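The four statistics can be verified by hand on a tiny series. The prices below are made up. One detail worth knowing is that .std() and .var() compute the sample versions (dividing by n - 1) unless ddof=0 is passed.

```python
import pandas as pd

# Hypothetical prices, with one expensive outlier
prices = pd.Series([100, 200, 300, 1000])

mean_price = prices.mean()      # pulled up by the outlier
median_price = prices.median()  # middle of the sorted values
std_price = prices.std()        # sample standard deviation (ddof=1)
var_price = prices.var()        # sample variance, equal to std squared
pop_std = prices.std(ddof=0)    # population standard deviation

print(mean_price, median_price, round(std_price, 1))
```

The gap between the mean and the median is a quick sign that a few high prices are skewing the average.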

Grouped Statistics

We can also conduct these calculations on groupings of data using the .groupby() method. This method is very similar to using pivot tables in Excel: we select a subset of columns in our data and then conduct aggregate calculations on them. Here, we are interested in the difference in prices between each type of room listing in our data.

# get the mean grouped by type of room
airbnb[['room_type', 'price']].groupby('room_type', as_index=False).mean()

# get the median grouped by type of room
airbnb[['room_type', 'price']].groupby('room_type', as_index=False).median()
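The grouped calculations can be traced on a small made-up table. The sketch below also shows .agg(), an addition not used in the cells above, which computes several statistics in one pass.

```python
import pandas as pd

# Hypothetical listings with two room types
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room",
                  "Entire home/apt", "Private room"],
    "price": [200, 50, 100, 70],
})

# as_index=False keeps room_type as a regular column
mean_by_type = df.groupby("room_type", as_index=False)["price"].mean()

# .agg() accepts a list of statistics to compute per group
summary = df.groupby("room_type")["price"].agg(["mean", "median", "count"])

print(mean_by_type)
print(summary)
```

Without as_index=False, the group labels become the index of the result, which is what the summary table shows.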

Filtering Data

Often, we are only interested in a subset of the rows in our dataset. For example, we may only be interested in listings under $1000 as they are more common and closer to the typical listing. We do this by passing a Boolean expression into single brackets as shown below.

# get all rows with price < 1000
airbnb_under_1000 = airbnb[airbnb['price'] < 1000]
airbnb_under_1000.head()

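We can also combine multiple filters by surrounding each expression in parentheses and joining them with & (for and expressions) or | (for or expressions). The parentheses are required because & and | bind more tightly than comparison operators such as < and ==. Without them, the expression is evaluated in the wrong order and typically raises an error.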

# get all rows with price < 1000 and year equal to 2020
airbnb_2020_under_1000 = airbnb[(airbnb['price'] < 1000) & (airbnb['year'] == 2020)]
airbnb_2020_under_1000.head()
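A small sketch of combined filters on made-up rows follows. Naming each Boolean mask first is one way to keep long conditions readable. The ~ operator for negation is an addition not shown in the text above.

```python
import pandas as pd

# Hypothetical listings
df = pd.DataFrame({
    "price": [80, 1500, 300, 950],
    "year": [2020, 2020, 2019, 2020],
})

# Each comparison produces a Boolean Series (a mask)
cheap = df["price"] < 1000
recent = df["year"] == 2020

both = df[cheap & recent]  # rows where both conditions hold
either = df[(df["price"] < 100) | (df["year"] == 2019)]  # at least one holds
not_cheap = df[~cheap]     # ~ negates a mask

print(both)
```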

Plotting

pandas also has built-in plotting capabilities. For example, we can see the distribution of prices for each listing in our dataset using a histogram in one line of code. Note, we use the under $1000 DataFrame here as we cannot see the bars very clearly when including all prices.

# distribution of prices under $1000
ax = airbnb_under_1000['price'].plot.hist(bins=40)

Pandas Series

The Series is the primary building block of pandas. It represents a one-dimensional labeled NumPy array.

import numpy as np

pd.Series([1,3,5,6])

pd.Series([1,3,5,6], index=['A1','A2','A3','A4'])

a = {'A': 5, 'B': 7}
s = pd.Series(a)
s

a = np.random.randn(100)*5+100
date = pd.date_range('20220101',periods=100)
s = pd.Series(a,index=date)
s

a = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
b = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])
a + b # different from Python list

a - b

a * b

a/b
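The arithmetic above pairs values by index label rather than by position, which is why it differs from adding Python lists. The sketch below reuses the same two series and adds a third, hypothetical series c to show what happens when labels do not match.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
b = pd.Series([4, 3, 2, 1], index=["d", "c", "b", "a"])

# Values are matched by label, so 'a' pairs 1 with 1, 'd' pairs 4 with 4
total = a + b

# Plain lists are concatenated instead of added
list_sum = [1, 2, 3, 4] + [4, 3, 2, 1]

# Hypothetical series that shares only the label 'a'
c = pd.Series([10, 20], index=["a", "z"])
partial = a + c                  # unmatched labels become NaN
filled = a.add(c, fill_value=0)  # treat a missing side as 0

print(total.tolist(), len(list_sum))
```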

Series Attributes

s.index : show the index labels
s.values : show the values
len(s) : number of elements
s.head() : first 5 rows
s.head(10) : first 10 rows
s.tail() : last 5 rows
s.tail(10) : last 10 rows

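Select Series Elements

s['b'] : select by label
s.iloc[2] : select by integer position (plain s[2] on a non-integer index is deprecated in recent pandas versions)
s[['b','d']] : select multiple elements by label
s.iloc[[1,3]] : select multiple elements by integer position
s[3:8] : slice positions 3 through 7 (the end position, 8, is excluded)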

date = pd.date_range('20220101',periods=20)
s = pd.Series(np.random.randn(20),index=date)
# Slice out the data from 2022-01-05 to 2022-01-10?
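One possible answer to the slicing exercise is sketched below. It uses np.arange instead of random numbers so that the result is predictable, which is a change from the cell above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20220101", periods=20)
s = pd.Series(np.arange(20), index=dates)  # values 0..19 for easy checking

# Label slices on a date index accept date strings and include both ends
window = s["2022-01-05":"2022-01-10"]

# The same selection written with .loc, and by position with .iloc
same = s.loc["2022-01-05":"2022-01-10"]
by_pos = s.iloc[4:10]  # positions 4..9, end excluded

print(window)
```

Label-based slices include the end label, while position-based slices exclude the end position, so the two forms need different end points to select the same six rows.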

Pandas DataFrame

d = [[1,2],[3,4]]
df = pd.DataFrame(d,index=[1,2],columns=['a','b'])
df

import numpy as np
d = np.arange(24).reshape(6,4)
d

d = np.arange(24).reshape(6,4)
df = pd.DataFrame(d, index=np.arange(1,7), columns=list('ABCD'))

pd.DataFrame(
    {
        'name': ['Ally','Jane','Belinda'],
        'height':[160,155,163],
    },
    columns = ['name','height'],
    index = ['A1','A2','A3']
)

date = pd.date_range('20170101',periods=6)
s1 = pd.Series(np.random.randn(6),index=date)
s2 = pd.Series(np.random.randn(6),index=date)
df = pd.DataFrame({'Asia':s1,'Europe':s2})
df

df.shape : dimensions (rows, columns) of the DataFrame
df.columns : column labels of the DataFrame
df.index : index (row labels) of the DataFrame
df.values : values of the DataFrame as a NumPy array

a = [[3,4],[5,6]]
b = [[6,5],[4,3]]
a2 = pd.DataFrame(a,index=[1,2],columns=['d','b'])
a2

b2 = pd.DataFrame(b,index=[3,2],columns=['c','b'])
b2

print(a2+b2)
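The sum above is mostly NaN because DataFrame arithmetic aligns on both row and column labels, and a2 and b2 share only row 2 and column b. The sketch below rebuilds the same two frames and adds the .add() method with fill_value, which is an addition to the original cells.

```python
import pandas as pd

a2 = pd.DataFrame([[3, 4], [5, 6]], index=[1, 2], columns=["d", "b"])
b2 = pd.DataFrame([[6, 5], [4, 3]], index=[3, 2], columns=["c", "b"])

# The result covers the union of row and column labels;
# only the cell present in both frames (row 2, column 'b') has a value
total = a2 + b2

# fill_value=0 treats a cell missing from one frame as 0;
# cells missing from both frames still come out as NaN
filled = a2.add(b2, fill_value=0)

print(total)
print(filled)
```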

from pandas import DataFrame

my_df = DataFrame(
    data = np.random.randn(16).round(2).reshape(4,4),
    index = ['r'+str(i) for i in range(1, 5)],
    columns = ['c'+str(i) for i in range(1, 5)]
)

# show how the row labels are built from a prefix and a number
for i in range(5):
    print('r'+str(i))

my_df

my_df.T

my_df.loc[['r1', 'r4'], ['c3', 'c4']]

my_df.iloc[[0, 3], [2, 3]]
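The two selections above return the same cells: .loc looks rows and columns up by label, while .iloc uses integer positions. A minimal sketch follows, with a hypothetical frame named grid holding the values 0 to 15 instead of random numbers so the results can be checked.

```python
import numpy as np
import pandas as pd

# Same labels as my_df, but with predictable values 0..15
grid = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=[f"r{i}" for i in range(1, 5)],
    columns=[f"c{i}" for i in range(1, 5)],
)

by_label = grid.loc[["r1", "r4"], ["c3", "c4"]]  # select by label
by_pos = grid.iloc[[0, 3], [2, 3]]               # select by position

# Slices differ: .loc includes the end label, .iloc excludes the end position
label_slice = grid.loc["r1":"r3", "c1"]  # r1, r2, r3
pos_slice = grid.iloc[0:2, 0]            # positions 0 and 1 only

print(by_label)
```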

!head -5 listings.csv

%ls

import os
[x for x in os.listdir(os.getcwd()) if 'csv' in x]

!mkdir data

%ls

# index=False avoids writing the row index as an extra unnamed column
airbnb.to_csv('./data/listings.csv', index=False)

os.listdir(os.getcwd() + '/data')

Data Manipulation

%cd data

airbnb1 = pd.read_csv('listings.csv')

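# column totals; numeric_only=True skips text columns, which cannot be summed with numbers
airbnb1.sum(axis = 0, numeric_only = True)

# row totals across the numeric columns
airbnb1.sum(axis = 1, numeric_only = True)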

Math Operations

airbnb1.fillna(0).describe().round(1)

airbnb1.head()

airbnb1.set_index('id', inplace=True)

airbnb1.head()

Groupby

airbnb1.room_type.value_counts()

airbnb1_grouped = airbnb1.groupby("room_type")
len(airbnb1_grouped)

for i in ['Entire home/apt', 'Private room', 'Shared room', "Hotel room"]:
    print(airbnb1_grouped.get_group(i).shape)

# top 3 most expensive listings within each room type
airbnb1_grouped.apply(
    lambda x: x[['host_name', 'price', 'room_type']]
    .sort_values(by = 'price', ascending = False)
    .iloc[:3,:]
)

airbnb1.groupby('room_type').apply(lambda x: x['price'].describe())

airbnb1.groupby('room_type')['price'].mean()

airbnb1.groupby(['room_type', 'neighbourhood'])['price'].mean()

airbnb1.groupby(['room_type', 'neighbourhood'])['price'].mean().unstack()

Pivot Table

pd.pivot_table(data = airbnb1,
               index = 'room_type',
               values = 'price',
               aggfunc = 'mean')

pd.pivot_table(data = airbnb1,
               index = 'room_type',
               columns = 'neighbourhood',
               values = 'price',
               aggfunc = 'mean')

Groupby vs. Pivot Table

%timeit airbnb1.groupby('room_type')['price'].mean()

%timeit pd.pivot_table(data = airbnb1, index= 'room_type', values='price', aggfunc='mean')

%timeit airbnb1.groupby(['room_type', 'neighbourhood'])['price'].mean().unstack()

%timeit pd.pivot_table(data=airbnb1, index='room_type', columns='neighbourhood', values='price', aggfunc='mean')
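The timing cells compare two ways of building the same table, so it is worth confirming that the results match before comparing speed. The sketch below does this on a small made-up frame. The neighbourhood names and prices are illustrative only.

```python
import pandas as pd

# Hypothetical listings covering every room type / neighbourhood pair
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Private room"],
    "neighbourhood": ["Hollywood", "Fort Lauderdale",
                      "Hollywood", "Hollywood", "Fort Lauderdale"],
    "price": [200, 300, 50, 70, 80],
})

# Route 1: group by both keys, then move neighbourhood into the columns
via_groupby = df.groupby(["room_type", "neighbourhood"])["price"].mean().unstack()

# Route 2: let pivot_table do the grouping and reshaping in one call
via_pivot = pd.pivot_table(df, index="room_type", columns="neighbourhood",
                           values="price", aggfunc="mean")

print(via_pivot)
```

Both routes produce one row per room type and one column per neighbourhood, so the choice between them comes down to readability and the timings measured above.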
