Selecting Column
Typically, we will only want a subset of the available columns in our DataFrame. We can select a single column using single brackets and the name of the column as shown below.
To select multiple columns at once, we use double brackets and commas between column names as shown below.The result is a new DataFrame object with the selected columns. It is useful to select the columns you are interested in analyzing before moving onto the analysis, especially if the data is wide with many unnecessary variables.
To check the data types of columns we call the .dtypes attribute of the DataFrame. To convert a column to a datetime index, we use the .to_datetime() functions (these functions exist for all supported data types like .to_string() to convert a column to be stored as a string).
In the code below, we also see the syntax to both edit existing columns and create new ones. Specifically, we want to convert the last_review column to a datetime column. So we select it as seen in the previous section and set it equal to the result of the operation. Datetime series have a .dt attribute with built-in attributes and functions. Below, we select the .year attribute of the newly typed datetime column, last_review, to get the year of each row.
Series String Functions
Another useful data cleaning tool is removing leading and trailing whitespace from string data. This can be done using the strip method.
Derived Columns
One useful data cleaning/preparation technique we will cover is combining rows. If we want to make calculations between columns, we can easily do this by applying the operation to each of the series as shown below. Here, we are calculating the minimum number of revenue a listing generates, by calculating the product of the minimum number of stays and the price per night.
Summary Statistics
Once the data is clean and ready to analyze, we can compute some interesting statistics to answer some business questions. The first question we may have is what the average and median price is for the listings in our data. We use the built-in .mean() and .median() methods to compute these.
Group Statistics
We can also conduct these calculations on groupings of data using the .groupby() method. This function is very similar to using pivot tables in excel as we select a subset of columns in our data and then conduct aggregate calculations on them. As we mentioned in the introduction of this case study, we are interested in the difference in prices between each type of room listing in our data.
Filtering Data
Often, we are only interested in a subset of the rows in our dataset. For example, we may only be interested in listings under $1000 as they are more common and closer to the typical listing. We do this by passing a Boolean expression into single brackets as shown below.
We can also pass in multiple filters by surrounding each expression in parenthesis and using either & (for and expressions) or | (for or expressions). You will get an error if you do not surround the expressions with parentheses.
Plotting
Pandas also has built-in plotting capabilities. For example, we can see the distribution of prices for each listing in our dataset using a histogram in one line of code. Note, we use the under $1000 DataFrame here as we cannot see the bars very clearly when including all prices.
Panda Data Frame
df.shape : Dimensionality of a DF df.columns : columns of a DF df.index : index of a DF df.values : values of a DF