Importing the needed libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
%matplotlib inline
Importing the data
file_name = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df = pd.read_csv(file_name)
Viewing the first 5 rows of the dataframe
df.head()
Displaying the data type of each column
df.dtypes
Obtaining a statistical summary of the data
df.describe()
Dropping the columns "id" and "Unnamed: 0" along axis 1 with the method drop(), then using the method describe() to obtain a statistical summary of the data.
df.drop(["id", "Unnamed: 0"], axis=1, inplace=True)
df.describe()
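As a minimal sketch of dropping columns by label, the snippet below uses a small made-up frame (the name `toy` and its values are assumptions for illustration, not the housing data). It passes `errors="ignore"` so the call does not fail if a label is already absent, for example when the cell is re-run.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": [101, 102, 103],
    "price": [221900.0, 538000.0, 180000.0],
})

# drop() returns a new frame; errors="ignore" skips labels that are absent.
cleaned = toy.drop(columns=["id", "Unnamed: 0"], errors="ignore")
remaining = list(cleaned.columns)
print(remaining)
```

Because `drop()` returns a new frame here, `toy` itself is left unchanged; passing `inplace=True` would modify the original instead.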
Displaying the missing values for the columns bedrooms and bathrooms
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
Replacing the missing values of the column 'bedrooms' with the mean of the column 'bedrooms' using the method replace().
mean = df['bedrooms'].mean()
df['bedrooms'] = df['bedrooms'].replace(np.nan, mean)
Replacing the missing values of the column 'bathrooms' with the mean of the column 'bathrooms' using the method replace(), then checking that no missing values remain.
mean = df['bathrooms'].mean()
df['bathrooms'] = df['bathrooms'].replace(np.nan, mean)
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
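The same mean imputation can also be sketched with `fillna()`, which targets missing values directly. The frame `toy` and its values below are made up for illustration; this is an alternative under that assumption, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (values are made up).
toy = pd.DataFrame({"bedrooms": [3.0, np.nan, 2.0, 4.0]})

# mean() skips NaN by default, so the mean is taken over 3, 2 and 4.
mean_bedrooms = toy["bedrooms"].mean()

# Assign the result back rather than modifying the column in place.
toy["bedrooms"] = toy["bedrooms"].fillna(mean_bedrooms)

n_missing = int(toy["bedrooms"].isnull().sum())
print("number of NaN values for the column bedrooms :", n_missing)
```

Assigning the filled column back to the frame avoids relying on in-place modification of a column selection, which newer pandas versions warn about.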
Using the method value_counts() to count the number of houses for each unique floor value, then the method to_frame() to convert the result to a dataframe.
df['floors'].value_counts().to_frame()
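To illustrate what this produces, here is a small sketch on made-up floor values (the frame `toy` is an assumption for illustration). The result is indexed by the unique floor values and sorted by count in descending order.

```python
import pandas as pd

# Toy floor values (made up): three houses with 1 floor, two with 2, one with 1.5.
toy = pd.DataFrame({"floors": [1.0, 2.0, 1.0, 1.5, 1.0, 2.0]})

# value_counts() returns a Series; to_frame() turns it into a one-column frame.
floor_counts = toy["floors"].value_counts().to_frame()
print(floor_counts)

most_common_floors = floor_counts.index[0]
most_common_count = int(floor_counts.iloc[0, 0])
```

The name of the resulting count column differs between pandas versions, so the sketch reads it by position rather than by name.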
Using the function boxplot in the seaborn library to determine whether houses with or without a waterfront view have more price outliers.
sns.boxplot(x="waterfront", y="price", data=df)
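The visual comparison can be backed by a count. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule for box plots, to each waterfront group. The data and the helper `count_outliers` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy data (made up): six houses without and six with a waterfront view.
toy = pd.DataFrame({
    "waterfront": [0] * 6 + [1] * 6,
    "price": [100, 110, 120, 130, 140, 900, 500, 520, 540, 560, 580, 600],
})

def count_outliers(prices):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (hypothetical helper)."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    mask = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
    return int(mask.sum())

# One outlier count per waterfront group.
outliers = toy.groupby("waterfront")["price"].apply(count_outliers)
print(outliers)
```

On the real data, the same call could be run with `df` in place of `toy` to compare the two groups numerically.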
Using the function regplot in the seaborn library to determine whether the feature sqft_above is negatively or positively correlated with price.
sns.regplot(x="sqft_above", y="price", data=df)
plt.ylim(0,)
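The direction of the fitted line can also be read off numerically. As a rough sketch on made-up values (the arrays below are assumptions, not the housing data), the slope of a least-squares line and the Pearson coefficient share the same sign:

```python
import numpy as np

# Toy values (made up) standing in for sqft_above and price.
sqft_above = np.array([1000.0, 1500.0, 2000.0, 2500.0])
price = np.array([200000.0, 310000.0, 390000.0, 520000.0])

# Degree-1 least-squares fit: polyfit returns [slope, intercept].
slope, intercept = np.polyfit(sqft_above, price, 1)

# Pearson correlation coefficient between the two arrays.
r = np.corrcoef(sqft_above, price)[0, 1]

direction = "positive" if slope > 0 else "negative or flat"
print(direction, round(r, 3))
```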
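Using the Pandas method corr() to find the feature other than price that is most correlated with price. The argument numeric_only=True restricts the calculation to numeric columns, which recent pandas versions require when the dataframe also contains non-numeric columns.
df.corr(numeric_only=True)['price'].sort_values()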
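To pick out the top feature programmatically, one option is to drop price's correlation with itself and take the label of the maximum. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions), using `select_dtypes` to keep only numeric columns before calling `corr()`.

```python
import pandas as pd

# Toy frame (made up) with one non-numeric column, as in the housing data.
toy = pd.DataFrame({
    "date": ["20141013T000000", "20141209T000000", "20150225T000000", "20150218T000000"],
    "price": [200.0, 300.0, 400.0, 500.0],
    "sqft_living": [1000, 1400, 2100, 2500],
    "floors": [2.0, 1.0, 2.0, 1.0],
})

# Keep numeric columns only, then correlate every feature with price.
numeric = toy.select_dtypes(include="number")
corr_with_price = numeric.corr()["price"].drop("price")

# Label of the feature with the highest correlation with price.
best_feature = corr_with_price.idxmax()
print(corr_with_price.sort_values())
print(best_feature)
```

If strongly negative correlations should also count as "most correlated", the same selection can be made on `corr_with_price.abs()`.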