import pandas as pd
pd.__version__

# Download a sample file from http://insideairbnb.com/
!wget http://data.insideairbnb.com/united-states/fl/broward-county/2022-06-17/visualisations/listings.csv -O listings.csv

# read the Airbnb listings CSV file (Broward County, per the download URL above)
airbnb = pd.read_csv("listings.csv")

# display the pandas DataFrame
display(airbnb)

# View first few entries
airbnb.head(10)

# View last few entries
airbnb.tail(10)

Selecting Columns

Typically, we will only want a subset of the available columns in our DataFrame. We can select a single column using single brackets and the name of the column as shown below.

# Results for a single column
airbnb['name']

To select multiple columns at once, we use double brackets and commas between column names as shown below. The result is a new DataFrame object with the selected columns. It is useful to select the columns you are interested in before moving on to the analysis, especially if the data is wide with many unnecessary variables.

# results for multiple columns
hosts = airbnb[['host_id', 'host_name']]
hosts.head()
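The difference between single and double brackets can be checked on a small frame. The sketch below uses a hypothetical three-row stand-in for the listings data (the host names and prices are made up) and only illustrates the return types.

```python
import pandas as pd

# Hypothetical stand-in for the listings data (values are made up).
toy = pd.DataFrame({
    "host_id": [101, 102, 103],
    "host_name": ["Ana", "Ben", "Cy"],
    "price": [120, 95, 300],
})

single = toy["host_name"]                  # single brackets -> Series
one_col_frame = toy[["host_name"]]         # double brackets -> DataFrame, even for one column
hosts_toy = toy[["host_id", "host_name"]]  # several columns -> DataFrame

print(type(single).__name__)
print(type(one_col_frame).__name__)
print(hosts_toy.shape)
```

Double brackets are simply single brackets wrapped around a Python list of column names, which is why the result keeps its two-dimensional shape.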

To check the data types of columns, we use the .dtypes attribute of the DataFrame. To convert a column to a datetime type, we use the pd.to_datetime() function. Similar top-level converters exist for other types, such as pd.to_numeric(), and the .astype() method handles general conversions (for example, .astype(str) to store a column as strings).

In the code below, we also see the syntax to both edit existing columns and create new ones. Specifically, we want to convert the last_review column to a datetime column, so we select it as seen in the previous section and assign the result of the conversion back to it. Datetime Series have a .dt accessor with built-in attributes and methods. Below, we use the .year attribute of the newly converted last_review column to get the year of each row.

# Show the data types for each column
airbnb.dtypes

# Change the type of a column to datetime
airbnb['last_review'] = pd.to_datetime(airbnb['last_review'])
airbnb.dtypes

# extract the year from a datetime series
airbnb['year'] = airbnb['last_review'].dt.year
airbnb['year'].head()
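Real review dates often include missing or malformed entries. The following is a minimal sketch on made-up values, assuming we want unparseable entries to become missing (NaT) rather than raise an error; the errors="coerce" choice is an assumption and is not used in the cells above.

```python
import pandas as pd

# Made-up values: two valid dates, one missing, one malformed.
toy = pd.DataFrame({"last_review": ["2022-05-30", "2021-12-01", None, "not a date"]})

# errors="coerce" turns anything unparseable into NaT instead of raising.
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

# The .dt accessor exposes date parts; rows with NaT yield missing values.
toy["year"] = toy["last_review"].dt.year
toy["month"] = toy["last_review"].dt.month

print(toy.dtypes)
print(toy)
```

Because of the missing dates, the year column is stored as floating point with NaN for the rows that could not be parsed.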

Series String Functions

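Another useful data cleaning step is removing leading and trailing whitespace from string data. This can be done with the .str.strip() method. The same .str accessor also provides .upper() and .lower() to change the case of every string in a Series.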

# Strip leading and trailing spaces from a string series
airbnb['name'] = airbnb['name'].str.strip()
airbnb['name'].tail()

# uppercase all strings in a series
airbnb['name_upper'] = airbnb['name'].str.upper()
airbnb['name_upper'].tail()

# lowercase all strings in a series
airbnb['name_lower'] = airbnb['name'].str.lower()
airbnb['name_lower'].tail()
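String methods on a Series skip missing values rather than failing. The sketch below shows this on made-up listing names; the has_beach column and the search term are hypothetical additions for illustration.

```python
import pandas as pd

# Made-up listing names, including padding and a missing value.
toy = pd.DataFrame({"name": ["  Beach Condo ", "Cozy Studio", None]})

toy["name_clean"] = toy["name"].str.strip()        # missing value stays missing
toy["name_upper"] = toy["name_clean"].str.upper()

# Case-insensitive substring test; na=False treats missing names as "no match".
toy["has_beach"] = toy["name_clean"].str.contains("beach", case=False, na=False)

print(toy)
```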

Derived Columns

One useful data preparation technique is deriving new columns from existing ones. To make calculations between columns, we apply the operation directly to the Series, as shown below. Here, we calculate the minimum revenue a listing generates per booking as the product of the minimum number of nights and the price per night.

# calculate using two columns
airbnb['min_revenue'] = airbnb['minimum_nights'] * airbnb['price']
airbnb[['minimum_nights', 'price', 'min_revenue']].head()
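Column arithmetic is applied row by row, aligned on the index. A small sketch with made-up numbers makes the result easy to verify by hand; the price_per_week column is a hypothetical extra showing the non-mutating .assign() form.

```python
import pandas as pd

# Made-up values for three listings.
toy = pd.DataFrame({"minimum_nights": [2, 30, 1], "price": [150, 80, 99]})

# Element-wise product: row i of one column times row i of the other.
toy["min_revenue"] = toy["minimum_nights"] * toy["price"]

# .assign() returns a new DataFrame and leaves `toy` unchanged.
with_week = toy.assign(price_per_week=toy["price"] * 7)

print(toy)
print(with_week)
```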

Summary Statistics

Once the data is clean and ready to analyze, we can compute some statistics to answer business questions. The first question we may have is what the mean and median prices are for the listings in our data. We use the built-in .mean() and .median() methods to compute these. The .std() and .var() methods give the standard deviation and variance of the prices.

# get the mean price
airbnb['price'].mean()

# get the median price
airbnb['price'].median()

# get the standard deviation of price
airbnb['price'].std()

# get the variance of price
airbnb['price'].var()
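The four statistics can also be computed in one call with .agg(). The sketch below uses five made-up prices, one of them an outlier, to show how a single expensive listing pulls the mean away from the median. Note that pandas computes the sample standard deviation and variance (ddof=1) by default.

```python
import pandas as pd

# Made-up prices with one outlier.
prices = pd.Series([50, 100, 150, 200, 1000], name="price")

# Several statistics in one call; the result is a Series indexed by statistic name.
summary = prices.agg(["mean", "median", "std", "var"])

# Population variance divides by n instead of n - 1.
population_var = prices.var(ddof=0)

print(summary)
print(population_var)
```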

Group Statistics

We can also run these calculations on groups of rows using the .groupby() method. This method works much like a pivot table in Excel: we group the rows by one or more columns and then compute aggregate calculations for each group. As we mentioned in the introduction of this case study, we are interested in the difference in prices between each type of room listing in our data.

# Get the mean grouped by type of room
airbnb[['room_type', 'price']].groupby('room_type', as_index=False).mean()

# get the median grouped by type of room
airbnb[['room_type', 'price']].groupby('room_type').median()
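The two cells above differ in as_index: with as_index=False the group key stays an ordinary column, otherwise it becomes the index of the result. The sketch below shows both on made-up data and also computes several aggregates at once with .agg().

```python
import pandas as pd

# Made-up listings.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Shared room"],
    "price": [200, 300, 80, 120, 40],
})

# Several aggregates per group; room_type becomes the index.
stats = toy.groupby("room_type")["price"].agg(["mean", "median", "count"])

# as_index=False keeps room_type as a regular column.
flat = toy.groupby("room_type", as_index=False)["price"].mean()

print(stats)
print(flat)
```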

Filtering Data

Often, we are only interested in a subset of the rows in our dataset. For example, we may only be interested in listings under $1000 as they are more common and closer to the typical listing. We do this by passing a Boolean expression into single brackets as shown below.

# get all rows with price < 1000
airbnb_under_1000 = airbnb[airbnb['price'] < 1000]
airbnb_under_1000.head()

We can also combine multiple filters by surrounding each expression in parentheses and using either & (for and expressions) or | (for or expressions). You will get an error if you do not surround the expressions with parentheses, because & and | bind more tightly than comparison operators such as < and ==.

# get all rows with price < 1000 and year equal to 2020
airbnb_2020_under_1000 = airbnb[(airbnb['price'] < 1000) & (airbnb['year'] == 2020)]
airbnb_2020_under_1000.head()
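To see why the parentheses matter, the sketch below runs the same filter with and without them on three made-up rows. The .query() call at the end is an alternative spelling that is not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows.
toy = pd.DataFrame({"price": [90, 1500, 400], "year": [2020, 2020, 2019]})

both = toy[(toy["price"] < 1000) & (toy["year"] == 2020)]    # and
either = toy[(toy["price"] < 1000) | (toy["year"] == 2020)]  # or

# Without parentheses, & is evaluated before < and ==, so the
# expression no longer means what it appears to mean and pandas raises.
try:
    toy[toy["price"] < 1000 & toy["year"] == 2020]
    failed = False
except (ValueError, TypeError) as exc:
    failed = True
    print(type(exc).__name__)

# The same filter written as a query string, using plain `and`.
same = toy.query("price < 1000 and year == 2020")

print(both)
print(either)
```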

Plotting

Pandas also has built-in plotting capabilities. For example, we can see the distribution of prices for each listing in our dataset using a histogram in one line of code. Note, we use the under $1000 DataFrame here as we cannot see the bars very clearly when including all prices.

# distribution of prices under $1000
ax = airbnb_under_1000['price'].plot.hist(bins=40)

Pandas DataFrame

d = [[1,2],[3,4]]
df = pd.DataFrame(d,index=[1,2],columns=['a','b'])
df

import numpy as np
d = np.arange(24).reshape(6,4)
d

df = pd.DataFrame(d, index=np.arange(1,7), columns=list('ABCD'))
df

d = np.arange(24).reshape(6,4)
df = pd.DataFrame(d, index=np.arange(1,7), columns=list('ABCD'))

pd.DataFrame(
    {
        'name': ['Ally', 'Jane', 'Belinda'],
        'height': [160, 155, 163],
    },
    columns=['name', 'height'],
    index=['A1', 'A2', 'A3'],
)

date = pd.date_range('20170101',periods=6)
s1 = pd.Series(np.random.randn(6),index=date)
s2 = pd.Series(np.random.randn(6),index=date)
df = pd.DataFrame({'Asia':s1,'Europe':s2})
df

df.shape : dimensions of a DataFrame (rows, columns)
df.columns : column labels of a DataFrame
df.index : row labels (index) of a DataFrame
df.values : values of a DataFrame as a NumPy array
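These attributes can be inspected on the 6x4 frame built earlier. The sketch below rebuilds it under the hypothetical name df_demo so that it runs on its own; .to_numpy() is used here as the method form of .values.

```python
import numpy as np
import pandas as pd

# Same construction as the 6x4 example above, under a separate name.
df_demo = pd.DataFrame(np.arange(24).reshape(6, 4),
                       index=np.arange(1, 7), columns=list("ABCD"))

print(df_demo.shape)           # (number of rows, number of columns)
print(list(df_demo.columns))   # column labels
print(list(df_demo.index))     # row labels
print(df_demo.to_numpy()[0])   # first row of the underlying NumPy array
```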

a = [[3,4],[5,6]]
b = [[6,5],[4,3]]
a2 = pd.DataFrame(a,index=[1,2],columns=['d','b'])
a2

b2 = pd.DataFrame(b,index=[3,2],columns=['c','b'])
b2

print(a2+b2)
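Adding two DataFrames aligns them on both row and column labels before adding, so any cell that exists in only one of the frames comes out as NaN. The sketch below repeats the a2 and b2 frames from above and, as one possible alternative, uses .add() with fill_value=0 to treat a cell missing on one side as zero.

```python
import pandas as pd

a2 = pd.DataFrame([[3, 4], [5, 6]], index=[1, 2], columns=["d", "b"])
b2 = pd.DataFrame([[6, 5], [4, 3]], index=[3, 2], columns=["c", "b"])

# Labels are unioned; only (row 2, column 'b') exists in both frames.
total = a2 + b2

# fill_value=0 substitutes 0 when a cell is missing in exactly one frame.
# Cells missing in both frames are still NaN.
filled = a2.add(b2, fill_value=0)

print(total)
print(filled)
```

In total, the single non-missing cell is row 2, column b, which holds 6 + 3 = 9.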

from pandas import DataFrame
my_df = DataFrame(
    data=np.random.randn(16).round(2).reshape(4, 4),
    index=['r' + str(i) for i in range(1, 5)],
    columns=['c' + str(i) for i in range(1, 5)],
)

my_df

my_df.T

my_df.loc[['r1', 'r4'], ['c3', 'c4']]

my_df.iloc[[0, 3], [2, 3]]
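The two selections above return the same cells: .loc selects by label and .iloc by integer position. The sketch below uses a deterministic frame (a hypothetical frame with values 0 to 15 instead of random numbers) so the result can be checked, and also shows that label slices include their end point while position slices do not.

```python
import numpy as np
import pandas as pd

# Deterministic 4x4 frame with the same labels as my_df.
frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=["r1", "r2", "r3", "r4"],
                     columns=["c1", "c2", "c3", "c4"])

by_label = frame.loc[["r1", "r4"], ["c3", "c4"]]   # select by label
by_position = frame.iloc[[0, 3], [2, 3]]           # select by integer position

label_slice = frame.loc["r1":"r3"]   # end label is included: r1, r2, r3
pos_slice = frame.iloc[0:2]          # end position is excluded: rows 0 and 1

print(by_label)
print(len(label_slice), len(pos_slice))
```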

!head -5 listings.csv

%ls

!mkdir data

%ls

airbnb.to_csv('./data/listings.csv')
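By default, to_csv also writes the row index as an unnamed first column, which shows up as an extra column when the file is read back. The sketch below demonstrates this with an in-memory buffer in place of a file (an assumption made so the example does not touch the disk) and shows index=False as one way to avoid it.

```python
import io
import pandas as pd

# Made-up data.
toy = pd.DataFrame({"id": [1, 2], "price": [100, 200]})

# Default: the index is written as an extra, unnamed column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False writes only the data columns.
buf2 = io.StringIO()
toy.to_csv(buf2, index=False)
buf2.seek(0)
without_index = pd.read_csv(buf2)

print(list(with_index.columns))
print(list(without_index.columns))
```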

Groupby

airbnb1_grouped = airbnb.groupby("room_type")
len(airbnb1_grouped)

for i in ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']:
    print(airbnb1_grouped.get_group(i).shape)

# top three most expensive listings within each room type
airbnb1_grouped.apply(
    lambda x: x[['host_name', 'price', 'room_type']]
    .sort_values(by='price', ascending=False)
    .iloc[:3, :]
)
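The same "top three per group" result can be obtained without apply by sorting first and then taking the head of each group. This is a sketch of an alternative, not the approach used above, and the data is made up. It may be worth knowing because pandas 2.2 emits a deprecation warning when groupby.apply operates on the grouping column.

```python
import pandas as pd

# Made-up listings.
toy = pd.DataFrame({
    "host_name": ["A", "B", "C", "D", "E", "F"],
    "room_type": ["Entire", "Entire", "Entire", "Entire", "Private", "Private"],
    "price": [100, 400, 300, 200, 50, 80],
})

# Sort by price, then keep the first three rows seen in each group.
top3 = (
    toy.sort_values("price", ascending=False)
       .groupby("room_type")
       .head(3)
)

print(top3)
```

Groups with fewer than three rows simply return all of their rows.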

airbnb.groupby('room_type').apply(lambda x: x['price'].describe())

airbnb.groupby(['room_type', 'neighbourhood'])['price'].mean().unstack()
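Grouping by two columns and calling .unstack() moves the second key into the columns, giving one row per room type and one column per neighbourhood. The sketch below shows this on made-up data (the neighbourhood values are illustrative) and compares it with pivot_table, which produces the same table here.

```python
import pandas as pd

# Made-up listings; neighbourhood names are illustrative only.
toy = pd.DataFrame({
    "room_type": ["Entire", "Entire", "Private", "Private"],
    "neighbourhood": ["Hollywood", "Fort Lauderdale", "Hollywood", "Hollywood"],
    "price": [200, 300, 80, 120],
})

# Two-level group means, then pivot the inner level into columns.
wide = toy.groupby(["room_type", "neighbourhood"])["price"].mean().unstack()

# The same table built directly with pivot_table.
pivot = toy.pivot_table(index="room_type", columns="neighbourhood",
                        values="price", aggfunc="mean")

print(wide)
```

Combinations that never occur in the data, such as a private room in Fort Lauderdale in this toy frame, appear as NaN.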
