Basic Statistics in Python with Pandas
1. Reading a dataset
# In this cell we only import some useful libraries
import pandas as pd # Data analysis and manipulation library (provides DataFrames)
import seaborn as sns # Package for visualization
# Here, we will read our cars.csv file
df = pd.read_csv('cars.csv')
2. Exploring column types
df.describe()
df["price_usd"].mean()
df["price_usd"].median()
df["price_usd"].plot.hist(bins = 20)
sns.displot(df, x = 'price_usd', hue = 'engine_type', multiple = 'stack')
df.groupby('engine_type').count()
df.value_counts()
df_audi_q7 = df[(df['manufacturer_name'] == 'Audi') & (df['model_name'] == 'Q7')]
sns.histplot(df_audi_q7, x = 'price_usd', hue = 'year_produced')
3. Standard deviation and quantiles
If we take the mean of a list of values and average the squared differences between each value and that mean, we get the "variance". The square root of the variance is called the "standard deviation".
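As a minimal sketch of the definition above, the following computes the variance and standard deviation by hand and compares them with pandas. The values in `toy_values` are hypothetical and are not taken from cars.csv. Note that `Series.std()` divides by n - 1 by default (the sample standard deviation); passing `ddof=0` divides by n, which matches the plain "average of the squared differences" description.

```python
import math
import pandas as pd

# Hypothetical toy values, used only for illustration (not from cars.csv)
toy_values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Step 1: the mean of the values
toy_mean = toy_values.sum() / len(toy_values)

# Step 2: squared difference between each value and the mean
squared_diffs = (toy_values - toy_mean) ** 2

# Step 3: population variance (divide by n) and its square root
population_variance = squared_diffs.sum() / len(toy_values)
population_std = math.sqrt(population_variance)

# Sample variant (divide by n - 1), which is the pandas default
sample_variance = squared_diffs.sum() / (len(toy_values) - 1)
sample_std = math.sqrt(sample_variance)

print(population_std, toy_values.std(ddof=0))  # manual vs. pandas, population
print(sample_std, toy_values.std())            # manual vs. pandas, sample
```

For a column as large as `price_usd`, the difference between the two versions is usually negligible, but it is visible on small samples like this one.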
# Standard deviation
df['price_usd'].std()
# Range: the maximum value minus the minimum value
max_val = df['price_usd'].max()
min_val = df['price_usd'].min()
rg = max_val - min_val
print(f'Max = {max_val}, Min = {min_val}, Range = {rg}')
# Quartiles: remember that the second quartile (Q2) is the median
median = df['price_usd'].median()
Q1 = df['price_usd'].quantile(0.25)
Q3 = df['price_usd'].quantile(0.75)
min_val = df['price_usd'].quantile(0.0)
max_val = df['price_usd'].quantile(1.0)
print(min_val, Q1, median, Q3, max_val)
# Interquartile range (IQR): the range that contains the middle 50% of the values
iqr = Q3 - Q1
iqr
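To make the quartile and interquartile range calculations easier to check by hand, here is a small sketch on a hypothetical series (the `toy_` names and values are assumptions for illustration, not part of the cars dataset). It relies on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

# Hypothetical toy values, used only for illustration (not from cars.csv)
toy_prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quartiles: 25%, 50% (the median) and 75% of the ordered values
toy_q1 = toy_prices.quantile(0.25)
toy_q2 = toy_prices.quantile(0.50)
toy_q3 = toy_prices.quantile(0.75)

# Interquartile range: the width of the interval between Q1 and Q3
toy_iqr = toy_q3 - toy_q1

# Fraction of values that fall inside [Q1, Q3] (both ends included)
share_inside = toy_prices.between(toy_q1, toy_q3).mean()

print(toy_q1, toy_q2, toy_q3, toy_iqr)
print(share_inside)
```

On a sample this small, the share of values inside [Q1, Q3] is only roughly one half because the endpoints are included; on a large column the share gets closer to 50%.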
sns.histplot(df, x = 'price_usd')
# sns.boxplot(df['price_usd'])
sns.boxplot(data = df, x = 'price_usd')
sns.boxplot(data = df, x = 'engine_fuel', y = 'price_usd')
sns.displot(df, x = 'engine_fuel', y = 'price_usd')
sns.histplot(df, hue = 'engine_fuel', x = 'price_usd')