Importing the needed libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
%matplotlib inline
Importing the data
file_name='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df=pd.read_csv(file_name)
Viewing the first 5 rows of the dataframe
df.head()
   Unnamed: 0          id
        int64       int64
0           0  7129300520
1           1  6414100192
2           2  5631500400
3           3  2487200875
4           4  1954400510
Displaying the data type
df.dtypes
Obtaining a statistical summary of the data
df.describe()
              Unnamed: 0                 id
                 float64            float64
count            21613.0            21613.0
mean             10806.0  4580301520.864988
std    6239.280019895457  2876565571.312057
min                  0.0          1000102.0
25%               5403.0       2123049194.0
50%              10806.0       3904930410.0
75%              16209.0       7308900445.0
max              21612.0       9900000190.0
Dropping the columns "id" and "Unnamed: 0" along axis 1 with the method drop(), then using the method describe() to obtain a statistical summary of the data.
# Drop both columns by name; axis=1 selects columns rather than rows.
df.drop(["id", "Unnamed: 0"], axis=1, inplace=True)
df.describe()
                    price
                  float64
count             21613.0
mean    540088.1417665294
std    367127.19648269983
min               75000.0
25%              321950.0
50%              450000.0
75%              645000.0
max             7700000.0
Displaying the missing values for the columns bedrooms and bathrooms
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
number of NaN values for the column bedrooms : 13
number of NaN values for the column bathrooms : 10
Replacing the missing values of the column 'bedrooms' with the mean of the column 'bedrooms' using the method replace.
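mean = df['bedrooms'].mean()
# Assign the result back; an in-place replace on a selected column may not update df in recent pandas.
df['bedrooms'] = df['bedrooms'].replace(np.nan, mean)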
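Replacing the missing values of the column 'bathrooms' with the mean of the column 'bathrooms' using the method replace.
mean = df['bathrooms'].mean()
# Assign the result back; an in-place replace on a selected column may not update df in recent pandas.
df['bathrooms'] = df['bathrooms'].replace(np.nan, mean)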
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
number of NaN values for the column bedrooms : 0
number of NaN values for the column bathrooms : 0
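The same mean imputation can also be written with fillna(), which targets missing values directly. The sketch below is only an illustration: the small `toy` frame and its values are made up and stand in for the housing dataframe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing dataframe.
toy = pd.DataFrame({
    "bedrooms": [3.0, np.nan, 4.0, 2.0],
    "bathrooms": [1.0, 2.5, np.nan, 2.5],
})

# Fill each column's NaN values with that column's own mean.
for col in ["bedrooms", "bathrooms"]:
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy.isnull().sum())
```

mean() skips NaN by default, so each fill value is computed from the observed entries only.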
Using the method value_counts() to count the houses for each unique floor value, then the method to_frame() to convert the result to a dataframe.
df['floors'].value_counts().to_frame()
     floors
      int64
1.0   10680
2.0    8241
1.5    1910
3.0     613
2.5     161
3.5       8
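As a minimal sketch of what this step produces, the example below runs the same two calls on a made-up `floors` column (the values are hypothetical). value_counts() sorts by count in descending order, and the name of the resulting count column may differ between pandas versions.

```python
import pandas as pd

# Hypothetical toy column; the values are invented for illustration.
toy = pd.DataFrame({"floors": [1.0, 2.0, 1.0, 1.5, 1.0, 2.0]})

# One row per distinct floor value, most frequent first.
counts = toy["floors"].value_counts().to_frame()
print(counts)
```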
Using the function boxplot in the seaborn library to determine whether houses with or without a waterfront view have more price outliers.
sns.boxplot(x="waterfront", y="price", data=df)
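To back the visual comparison with numbers, outliers can be counted per group using the 1.5 × IQR rule, which is the convention a boxplot uses by default for the points drawn beyond the whiskers. This is a sketch on made-up data: the `toy` frame and the helper `count_outliers` are hypothetical, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the housing dataframe.
toy = pd.DataFrame({
    "waterfront": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "price": [100, 110, 120, 130, 140, 900, 500, 520, 540, 560],
})

def count_outliers(prices):
    """Count values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    outside = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
    return int(outside.sum())

# One outlier count per waterfront group.
outliers = toy.groupby("waterfront")["price"].apply(count_outliers)
print(outliers)
```

Applied to the real data, the same grouping would replace `toy` with `df`.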
Using the function regplot in the seaborn library to determine whether the feature sqft_above is negatively or positively correlated with price.
sns.regplot(x="sqft_above", y="price", data=df)
plt.ylim(0,)
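The direction of the fitted line can be cross-checked numerically: the sign of the Pearson correlation coefficient matches the sign of the regression slope. The sketch below assumes a made-up `toy` frame with invented values.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
toy = pd.DataFrame({
    "sqft_above": [800, 1200, 1500, 2000, 2600],
    "price": [200000, 310000, 350000, 480000, 600000],
})

# Series.corr computes the Pearson coefficient by default.
r = toy["sqft_above"].corr(toy["price"])
print("positive" if r > 0 else "negative", "correlation")
```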
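Using the Pandas method corr() to find the feature, other than price, that is most correlated with price.
# Restrict to numeric columns first; corr() can fail on string columns in recent pandas.
df.select_dtypes(include='number').corr()['price'].sort_values()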
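To pull out the single most correlated feature rather than reading it off the sorted list, price's correlation with itself (always 1) can be dropped before taking the maximum. The example is a sketch on a made-up `toy` frame. Its column values are hypothetical, and it picks the largest positive correlation. Taking abs() first would rank by strength regardless of sign.

```python
import pandas as pd

# Hypothetical toy data, including a non-numeric column.
toy = pd.DataFrame({
    "price": [200, 310, 350, 480, 600],
    "sqft_living": [900, 1300, 1500, 2100, 2600],
    "zipcode": [5, 1, 4, 2, 3],
    "date": ["a", "b", "c", "d", "e"],
})

# Correlate numeric columns with price, then remove price itself.
corr_with_price = (
    toy.select_dtypes(include="number").corr()["price"].drop("price")
)
best = corr_with_price.idxmax()
print(best)
```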