Importing the needed libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
%matplotlib inline

Importing the data

file_name = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df = pd.read_csv(file_name)

Viewing the first 5 rows of the dataframe

df.head()

Displaying the data type of each column

df.dtypes

Obtaining a statistical summary of the data

df.describe()

Dropping the columns "id" and "Unnamed: 0" along axis 1 with the method drop(), then using the method describe() to obtain a statistical summary of the data.

# Drop both columns in one call; axis=1 targets columns rather than rows
df.drop(["id", "Unnamed: 0"], axis=1, inplace=True)
df.describe()
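As a minimal sketch of the column-dropping step, the example below applies drop() to a small stand-in frame. The frame `toy` and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; all values are invented.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "id": [101, 102, 103],
    "price": [221900.0, 538000.0, 180000.0],
    "bedrooms": [3, 3, 2],
})

# axis=1 targets columns. Without inplace=True a new frame is returned
# and the original frame is left unchanged.
trimmed = toy.drop(["id", "Unnamed: 0"], axis=1)

print(list(trimmed.columns))
```

Passing both names in a single list avoids looping over the columns and testing each name by hand.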

Displaying the missing values for the columns bedrooms and bathrooms

print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())

Replacing the missing values of the column 'bedrooms' with the mean of the column 'bedrooms' using the method replace.

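mean = df['bedrooms'].mean()
# Assign the result back; inplace=True on a column selection may not update df in recent pandas versions
df['bedrooms'] = df['bedrooms'].replace(np.nan, mean)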

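Replacing the missing values of the column 'bathrooms' with the mean of the column 'bathrooms' using the method replace.

mean = df['bathrooms'].mean()
# Assign the result back; inplace=True on a column selection may not update df in recent pandas versions
df['bathrooms'] = df['bathrooms'].replace(np.nan, mean)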

print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
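The imputation steps above can be sketched on a tiny made-up column to show what the mean replacement does. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry; values are invented.
toy = pd.DataFrame({"bedrooms": [3.0, np.nan, 2.0, 4.0]})

# Series.mean() skips NaN by default, so the mean is (3 + 2 + 4) / 3 = 3.0
mean_bedrooms = toy["bedrooms"].mean()

# Replace NaN with the mean and assign the result back to the column.
toy["bedrooms"] = toy["bedrooms"].replace(np.nan, mean_bedrooms)

# Count the missing values left after imputation.
remaining = toy["bedrooms"].isnull().sum()
print(mean_bedrooms, remaining)
```

Because the mean is computed from the non-missing entries only, filling the gaps with it leaves the column mean unchanged.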

Using the method value_counts() to count the number of houses for each unique floor value, then using the method to_frame() to convert the result to a dataframe.

df['floors'].value_counts().to_frame()
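A small sketch of the counting step, using an invented `floors` column rather than the real data, shows the shape of the result: one row per unique value, ordered from most to least frequent.

```python
import pandas as pd

# Hypothetical floor values; invented for illustration.
toy = pd.DataFrame({"floors": [1.0, 2.0, 1.0, 1.5, 2.0, 1.0]})

# value_counts() returns a Series indexed by the unique values and sorted
# by descending count; to_frame() turns it into a one-column dataframe.
counts = toy["floors"].value_counts().to_frame()

print(counts)
```

The name of the resulting column depends on the pandas version, so it is safer to address it by position than by label.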

Using the function boxplot in the seaborn library to determine whether houses with or without a waterfront view have more price outliers.

sns.boxplot(x="waterfront", y="price", data=df)
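To back up the visual reading of the box plot with numbers, the sketch below counts the points that fall outside the conventional 1.5 × IQR whiskers in each group. The helper `count_outliers` and the toy prices are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical prices (in thousands) for the two waterfront groups.
toy = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "price": [200, 210, 220, 230, 240, 900, 1000, 1100, 1200, 1300, 1400],
})

def count_outliers(prices: pd.Series) -> int:
    """Count values outside the 1.5 * IQR fences of a series."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    outside = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
    return int(outside.sum())

# One outlier count per waterfront value.
outliers_per_group = toy.groupby("waterfront")["price"].apply(count_outliers)
print(outliers_per_group)
```

On the real data the same call would be applied to `df`, and the group with the larger count is the one with more price outliers.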

Using the function regplot in the seaborn library to determine whether the feature sqft_above is negatively or positively correlated with price.

sns.regplot(x="sqft_above", y="price", data=df)
plt.ylim(0,)
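The direction of the regression line can also be checked numerically. The sketch below fits a straight line with `np.polyfit` to a handful of invented points and reads off the sign of the slope; the arrays are assumptions for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical (sqft_above, price) pairs; invented for illustration.
sqft_above = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([200000.0, 310000.0, 390000.0, 520000.0, 600000.0])

# A degree-1 fit returns the slope first, then the intercept.
slope, intercept = np.polyfit(sqft_above, price, deg=1)

# The sign of the slope matches the sign of the Pearson correlation.
r = np.corrcoef(sqft_above, price)[0, 1]
direction = "positive" if slope > 0 else "negative"
print(direction)
```

An upward-sloping line in the regplot corresponds to a positive slope here, and a downward-sloping line to a negative one.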

Using the Pandas method corr() to find the feature other than price that is most correlated with price.

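# numeric_only=True skips non-numeric columns such as date, which raise an error in pandas 2.x
df.corr(numeric_only=True)['price'].sort_values()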
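As a sketch of how the top feature can be read off programmatically, the example below ranks the correlations with price on an invented frame and takes the last entry after dropping price itself. The frame `toy` is hypothetical, and `select_dtypes` is used here as one way to keep only the numeric columns before calling corr().

```python
import pandas as pd

# Hypothetical frame with one non-numeric column; values are invented.
toy = pd.DataFrame({
    "date": ["20141013T000000"] * 5,
    "price": [200.0, 300.0, 400.0, 500.0, 600.0],
    "sqft_living": [1000.0, 1400.0, 2100.0, 2400.0, 3000.0],
    "yr_built": [1990.0, 1975.0, 2001.0, 1960.0, 1985.0],
})

# Keep numeric columns only, then take the correlations with price.
corr_with_price = toy.select_dtypes(include="number").corr()["price"]

# Drop price itself (its self-correlation is 1) and sort ascending,
# so the most positively correlated feature comes last.
ranked = corr_with_price.drop("price").sort_values()
best_feature = ranked.index[-1]
print(best_feature)
```

This ranks by signed correlation, as the sorted output above does. Ranking with `ranked.abs()` instead would also surface strong negative relationships.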
