Getting to know our penguin data. 🗺🧭🐧

Install the required libraries

Import libraries

import empiricaldist
import janitor
import matplotlib.pyplot as plt
import numpy as np
import palmerpenguins
import pandas as pd
import scipy.stats
import seaborn as sns
import sklearn.metrics
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats as ss
import session_info

Set the general appearance of the plots

%matplotlib inline
sns.set_style(style='whitegrid')
sns.set_context(context='notebook')
plt.rcParams['figure.figsize'] = (11, 9.4)

penguin_color = {
    'Adelie': '#ff6602ff',
    'Gentoo': '#0f7175ff',
    'Chinstrap': '#c65dc9ff'
}

Loading data

Using the palmerpenguins package, installed with pip

Raw data from the package

raw_penguins_df = palmerpenguins.load_penguins_raw()
raw_penguins_df.head(2)

Using the dataset from seaborn

preprocessed_penguins_df = sns.load_dataset("penguins")
preprocessed_penguins_df.head(2)

Data collection, cleaning, and validation

What kind of variables are in the dataset?

preprocessed_penguins_df.dtypes  # data type of each variable (column)

How many variables of each data type are in the dataset?

preprocessed_penguins_df.dtypes.value_counts()  # number of variables of each data type

How many variables and observations are in the dataset?

row_pre_pro, column_pre_pro = preprocessed_penguins_df.shape  # number of rows and columns
print(f'Rows_pre_pro: {row_pre_pro}, Columns_pre_pro: {column_pre_pro}')

Are there null values in the dataset?

preprocessed_penguins_df.isnull().any()  # True for every column that contains at least one null value

If there are null values, how many are there in each variable?

preprocessed_penguins_df.isnull().sum()  # number of null values in each column

How many null values are in the dataset in total?

preprocessed_penguins_df.isnull().sum().sum()  # total number of null values: .sum() per column, then .sum() again

What is the proportion of null values in each variable?

(
    preprocessed_penguins_df
    .isnull()
    .mean(axis=0)  # the mean of a boolean column is the share of True values, i.e. the proportion of nulls
)
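As a small worked illustration of the question above, the proportion of missing values is the count of missing entries in a column divided by the number of rows. The sketch below uses only the standard library and a handful of made-up rows (the values and the helper name `missing_proportion` are hypothetical, not taken from the dataset); in pandas the same quantity is what `df.isnull().mean()` returns.

```python
# Toy records standing in for a few penguin rows; None marks a missing value.
# The values are illustrative only, not real observations.
rows = [
    {"species": "Adelie", "bill_length_mm": 39.1, "sex": "Male"},
    {"species": "Adelie", "bill_length_mm": None, "sex": None},
    {"species": "Gentoo", "bill_length_mm": 46.1, "sex": "Female"},
    {"species": "Gentoo", "bill_length_mm": 50.0, "sex": None},
]


def missing_proportion(records):
    """Return the fraction of missing (None) values for each key."""
    n_rows = len(records)
    columns = records[0].keys()
    return {
        column: sum(record[column] is None for record in records) / n_rows
        for column in columns
    }


proportions = missing_proportion(rows)
print(proportions)
```

Reading proportions rather than counts makes columns comparable regardless of how many rows the dataset has.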

(
    preprocessed_penguins_df
    .isnull()
    .melt(var_name='Category', value_name='missing')
    .pipe(
        lambda df: sns.displot(
            data=df,
            y='Category',
            hue='missing',
            multiple='fill',
            aspect=2
        )
    )
)

How can we visualize the null values across the whole dataset?

preprocessed_penguins_df.isnull().T.head(3)

# Transpose the dataset so that each variable becomes a row of the heatmap,
# which makes the missing values easy to identify.
(
    preprocessed_penguins_df
    .isnull()
    .T
    .pipe(
        lambda df: sns.heatmap(data=df, annot=True, cmap='crest')  # annot=True writes the value in each cell
    )
)

How many observations are lost if we drop the rows with null values?

processed_penguins_df = (
    preprocessed_penguins_df
    .dropna()
)
print(f'Full dataset: {row_pre_pro, column_pre_pro}')
print(f'Processed dataset: {processed_penguins_df.shape}')

Counting and proportions

Prelude: which statistics describe the dataset?

processed_penguins_df.describe(include='all')  # brief summary of the main statistics for every variable

Describing the main statistics for the numerical variables only

# np.object was removed in NumPy 1.24; the built-in object works the same way here
processed_penguins_df.describe(exclude=object)

processed_penguins_df.describe(include=np.number)  # alternative that gives the same result

Analyzing categorical variables using describe()

processed_penguins_df.describe(include=object)  # brief summary of the categorical variables

Recasting the object-typed variables as categorical

(
    processed_penguins_df
    .astype(  # .astype() changes the type of the listed variables
        {
            'species': 'category',
            'island': 'category',
            'sex': 'category'
        }
    )
    .describe(include='category')
)

How to visualize the counts?

Using Pandas to visualize the number of elements by species

print(f'Count of Species: {processed_penguins_df.species.value_counts()}')
(
    processed_penguins_df
    .species
    .value_counts()
    .plot(
        kind='bar',
        color=['blue', 'orange', 'green'],
        title='Count of Species'
    )
)
plt.show()

Using Seaborn to visualize the number of elements by species

sns.catplot(
    data=processed_penguins_df,
    x='species',
    kind='count',
    palette=penguin_color,
    order=processed_penguins_df.value_counts('species', sort=True).index
)
plt.show()

Using Seaborn to visualize with a barplot

(
    processed_penguins_df
    .value_counts('species', sort=True)  # count the observations per species, sorted
    .reset_index(name='count')  # turn the result into a table; name= labels the column that holds the counts
    .pipe(
        lambda df: sns.barplot(
            data=df,
            x='species',
            y='count',
            palette=penguin_color
        )
    )
)
plt.show()

How to visualize the proportions?

(
    processed_penguins_df
    .add_column('x', '')
    .pipe(
        lambda df: sns.displot(
            data=df,
            x='x',
            hue='species',
            palette=penguin_color,
            multiple='fill'
        )
    )
)
plt.show()

# Names of the columns with object (text) type
list_categories = processed_penguins_df.select_dtypes(include='object').columns
print(list_categories)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))  # one subplot per categorical variable

for i in range(len(list_categories)):
    (
        processed_penguins_df
        .value_counts(list_categories[i])  # count the observations in each category
        .plot(
            ax=axes[i],  # draw on the i-th subplot
            kind='bar',
            color=['blue', 'orange', 'green'],
            title=list_categories[i]
        )
    )

(
    processed_penguins_df
    .value_counts('species')  # Series indexed by species
    .pipe(
        lambda df: plt.pie(
            df.values,  # counts per species
            labels=df.index,  # species names taken from the Series index
            autopct='%1.0f%%'
        )
    )
)
plt.show()

Central tendency statistics

Implementation of mean(), median() and mode()

# Mean of a single variable
print('Mean using pandas:')
print(processed_penguins_df.bill_depth_mm.mean())
print('Mean using numpy:')
print(np.mean(processed_penguins_df.bill_depth_mm))

# Mean of every numerical variable (numeric_only=True skips the text columns)
print('\nMean in the entire dataset')
print(processed_penguins_df.mean(numeric_only=True))

# Median of every numerical variable
print('\nMedian in the entire dataset')
print(processed_penguins_df.median(numeric_only=True))

# Mode of every variable (categorical and numerical)
print('\nMode in the entire dataset')
print(processed_penguins_df.mode())

# Mode of the categorical variables only (the `top` row of describe)
print('\nMode for the categorical variables')
processed_penguins_df.describe(include=object)
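To make the three statistics concrete, here is a minimal sketch with the standard library `statistics` module on a short list of made-up flipper lengths (the numbers are hypothetical, not rows from the dataset). It shows the same three quantities that `mean()`, `median()` and `mode()` compute above.

```python
import statistics

# Hypothetical flipper lengths in millimetres (illustrative values only).
flipper_lengths = [181, 186, 195, 190, 190, 210, 215, 190]

mean_value = statistics.mean(flipper_lengths)      # arithmetic average
median_value = statistics.median(flipper_lengths)  # middle of the sorted values
modes = statistics.multimode(flipper_lengths)      # most frequent value(s), as a list

print(mean_value, median_value, modes)
```

The mean (194.625) sits above the median (190) because the two large values pull it upward, which is the kind of skew the histograms later in the notebook make visible.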

Measures of dispersion

What is the maximum value of each variable?

processed_penguins_df.max()  # maximum of every variable, numerical or not
# processed_penguins_df.max(numeric_only=True)

What is the minimum value of each variable?

processed_penguins_df.min()  # minimum of every variable, numerical or not
processed_penguins_df.min(numeric_only=True)

What is the range of each numerical variable?

# range = max - min for each numerical variable
processed_penguins_df.max(numeric_only=True) - processed_penguins_df.min(numeric_only=True)

What is the standard deviation of the numerical variables?

# For approximately normal data:
# about 68% of the observations lie within one standard deviation of the mean,
# about 95% within two, and about 99.7% within three.
processed_penguins_df.std(numeric_only=True)
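The 68 / 95 / 99.7 percentages quoted in the comments hold for a normal distribution. A short check with the standard library's `statistics.NormalDist` reproduces them; the helper name `coverage_within` is introduced here only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def coverage_within(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: round(coverage_within(k), 4) for k in (1, 2, 3)}
print(coverage)
```

For variables that are clearly not normal (for example, a bimodal flipper length when all species are pooled), these percentages are only a rough guide.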

What is the interquartile range?

# The interquartile range covers the middle 50% of the data; it starts at the first quartile
processed_penguins_df.quantile(0.25, numeric_only=True)

(
    processed_penguins_df.quantile(0.75, numeric_only=True)
    - processed_penguins_df.quantile(0.25, numeric_only=True)
)

Making a table where we can visualize the Q1, Q2, Q3 and the IQR

(
    processed_penguins_df
    .quantile(q=[0.75, 0.50, 0.25], numeric_only=True)  # Q3, Q2 and Q1 of each numerical variable
    .T  # transpose so that each variable is a row
    .rename_axis(index='variable')  # name the index (axis 0)
    .reset_index()  # move the variable names into a regular column
    .assign(  # add a new column computed from the quartiles
        iqr=lambda df: df[0.75] - df[0.25]
    )
)
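A worked example of the same table for a single variable, as a sketch with the standard library. The nine values are made up for illustration; `method="inclusive"` is chosen because it interpolates linearly between observations, which is also the default behaviour of the pandas `quantile` method.

```python
import statistics

# Hypothetical flipper lengths in millimetres (illustrative values only).
flipper_lengths = [172, 181, 186, 190, 195, 197, 203, 210, 231]

# Three cut points that split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(flipper_lengths, n=4, method="inclusive")
iqr = q3 - q1  # width of the interval that holds the middle 50% of the data

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
```

Here Q2 equals the median, and the IQR (17 mm) is unaffected by the single large value 231, unlike the range.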

How can I visualize the distribution of a specific variable?

Histogram

processed_penguins_df.flipper_length_mm.mode().values[0]

# Histogram of flipper_length_mm, colored by species
sns.histplot(
    data=processed_penguins_df,
    x='flipper_length_mm',
    hue='species'
)

# plt.axvline draws a vertical line across the Axes; here it marks the mean, median, mode, Q1 and Q3
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.mean(),  # mean
    color='red', linestyle='dashed', linewidth=2
)
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.median(),  # median
    color='blue', linestyle='dashed', linewidth=2
)
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.mode().values[0],  # mode (first value if there are ties)
    color='black', linestyle='dashed', linewidth=4
)
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.quantile(0.25),  # first quartile, Q1
    color='yellow', linestyle='dashed', linewidth=2
)
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.quantile(0.75),  # third quartile, Q3
    color='yellow', linestyle='dashed', linewidth=2
)
plt.show()

Box plot

sns.boxplot(
    x=processed_penguins_df.flipper_length_mm,
)
plt.show()
# One limitation of the box plot is that it does not show the shape of the distribution
# (for example, several modes), so interpreting it can be a little challenging.

Limitation

def freedman_diaconis_binwidth(x: pd.Series) -> float:
    """Find the optimal bin width using the Freedman-Diaconis rule."""
    IQR = x.quantile(0.75) - x.quantile(0.25)
    N = x.size
    return 2 * IQR / N ** (1 / 3)
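A worked instance of the rule, bin width = 2 · IQR / N^(1/3), may help. The sketch below re-implements the function with the standard library for a plain list, so it runs without pandas; the 27 evenly spaced values are hypothetical and chosen so the cube root is easy to follow.

```python
import statistics


def freedman_diaconis_binwidth(values):
    """Bin width from the Freedman-Diaconis rule: 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return 2 * iqr / len(values) ** (1 / 3)


# 27 hypothetical flipper lengths: 170, 172, ..., 222 (illustrative only).
flipper_lengths = list(range(170, 224, 2))

binwidth = freedman_diaconis_binwidth(flipper_lengths)
print(round(binwidth, 2))
```

With Q1 = 183 and Q3 = 209 the IQR is 26, and 27^(1/3) = 3, so the suggested width is 52 / 3, roughly 17.33 mm. A result like this is a starting point for `binwidth=` in `sns.histplot`, not a value that must be used as-is.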

sns.histplot(
    data=processed_penguins_df,
    x='flipper_length_mm',
    binwidth=3
)
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.mean(),
    color='red', linestyle='dashed', linewidth=2
)
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.median(),
    color='blue', linestyle='dashed', linewidth=2
)
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.mode().values[0],
    color='black', linestyle='dashed', linewidth=4
)
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.quantile(0.25),
    color='yellow', linestyle='dashed', linewidth=2
)
plt.axvline(
    x=processed_penguins_df.flipper_length_mm.quantile(0.75),
    color='yellow', linestyle='dashed', linewidth=2
)

Distributions: PMFs, CDFs and PDFs

Probability mass functions (PMFs)

Using seaborn

sns.histplot(
    data=processed_penguins_df,  # data frame to be used
    x='flipper_length_mm',  # variable whose probabilities we want
    binwidth=1,
    stat='probability'  # bar heights are probabilities that sum to 1
)
plt.show()

Using the library empiricaldist

pmf_flipper_length_mm = empiricaldist.Pmf.from_seq(
    processed_penguins_df.flipper_length_mm,  # variable to be used
    normalize=True  # True returns probabilities; False returns frequencies
)

pmf_flipper_length_mm.bar()  # plot the PMF with the .bar() method

In addition, the library returns the probability of a given numerical value

pmf_flipper_length_mm(231)

processed_penguins_df.flipper_length_mm.max()
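Conceptually, an empirical PMF is just each distinct value's count divided by the number of observations. The following sketch builds one with `collections.Counter`; the helper `pmf_from_seq` and the sample values are hypothetical and only mirror the idea behind the calls above, not the internals of empiricaldist.

```python
from collections import Counter


def pmf_from_seq(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in sorted(counts.items())}


# Hypothetical flipper lengths in millimetres (illustrative values only).
flipper_lengths = [190, 190, 195, 210, 190, 195, 215, 210]

pmf = pmf_from_seq(flipper_lengths)
print(pmf)                # {190: 0.375, 195: 0.25, 210: 0.25, 215: 0.125}
print(pmf.get(231, 0.0))  # a value that never occurs has probability 0.0
```

Because the probabilities are relative frequencies, they always sum to 1.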

Empirical Cumulative distribution functions (ECDFs)

Using seaborn

sns.ecdfplot(
    data=processed_penguins_df,
    x="flipper_length_mm"
)
plt.show()

Using empiricaldist

cdf_flipper_length_mm = empiricaldist.Cdf.from_seq(  # compute the CDF with empiricaldist
    processed_penguins_df.flipper_length_mm,
    normalize=True  # True returns probabilities; False returns frequencies
)

cdf_flipper_length_mm.plot()  # plot the CDF

q = 200  # quantity of interest
p = cdf_flipper_length_mm.forward(q)  # cumulative probability at q

plt.vlines(  # vertical line from the x axis up to the curve
    x=q,
    ymin=0,
    ymax=p,
    color='black',
    linestyle='dashed'
)
plt.hlines(  # horizontal line from the smallest observed value to q
    y=p,
    xmin=pmf_flipper_length_mm.qs[0],
    xmax=q,
    color='black',
    linestyle='dashed'
)
plt.plot(q, p, 'ro')  # mark the point (q, p)

cdf_flipper_length_mm.step()

p_1 = 0.25  # lower probability
p_2 = 0.75  # upper probability
ps = (p_1, p_2)  # the two ends of the IQR
qs = cdf_flipper_length_mm.inverse(ps)  # values of the variable that correspond to those probabilities

plt.vlines(  # vertical lines
    x=qs,
    ymin=0,
    ymax=ps,
    color='black',
    linestyle='dashed'
)
plt.hlines(  # horizontal lines
    y=ps,
    xmin=pmf_flipper_length_mm.qs[0],
    xmax=qs,
    color='black',
    linestyle='dashed'
)
plt.scatter(  # points where the lines meet the curve
    x=qs,
    y=ps,
    color='red',
    zorder=2
)
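The two directions used above (from a quantity to a probability, and from a probability back to a quantity) can be sketched on a sorted list with the standard library. The function names `ecdf_forward` and `ecdf_inverse` and the sample values are hypothetical; the inverse shown here returns the smallest observed value whose cumulative probability reaches `p`, which is one common convention and an assumption of this sketch.

```python
import math
from bisect import bisect_right


def ecdf_forward(sorted_values, q):
    """Empirical P(X <= q): share of observations that are at most q."""
    return bisect_right(sorted_values, q) / len(sorted_values)


def ecdf_inverse(sorted_values, p):
    """Smallest observed value whose cumulative probability is at least p."""
    index = max(math.ceil(p * len(sorted_values)) - 1, 0)
    return sorted_values[index]


# Hypothetical flipper lengths in millimetres (illustrative values only).
values = sorted([181, 186, 190, 190, 195, 197, 203, 210])

p_at_190 = ecdf_forward(values, 190)  # 4 of 8 observations are <= 190
q1 = ecdf_inverse(values, 0.25)
q3 = ecdf_inverse(values, 0.75)
print(p_at_190, q1, q3)
```

Applying the forward function to the result of the inverse gives back at least the requested probability, which is a quick sanity check for any implementation.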

Comparing the CDFs by species

sns.ecdfplot(
    data=processed_penguins_df,
    x='flipper_length_mm',
    hue='species',
    palette=penguin_color
)

Probability Density Function

sns.kdeplot(
    data=processed_penguins_df,  # data frame
    x='flipper_length_mm',  # variable to be used
    bw_method=0.09  # controls the smoothing bandwidth; smaller values follow the data more closely
)

Comparing distributions with the variable body mass from the dataset

# Summary statistics of body_mass_g from the describe method
stats = processed_penguins_df.body_mass_g.describe()
stats

np.random.seed(42)  # fix the seed for reproducibility

xs = np.linspace(stats['min'], stats['max'])  # evenly spaced values between the min and max body mass
ys = scipy.stats.norm(stats['mean'], stats['std']).cdf(xs)  # CDF of a normal with the same mean and std

plt.plot(xs, ys, color='black', linestyle='dashed')  # theoretical CDF

empiricaldist.Cdf.from_seq(  # empirical CDF, drawn on the same axes
    processed_penguins_df.body_mass_g,
    normalize=True
).plot()
plt.show()

Comparing PDF distribution

xs = np.linspace(stats['min'], stats['max'])  # x values
ys = scipy.stats.norm(stats['mean'], stats['std']).pdf(xs)  # PDF of a normal with the same mean and std

plt.figure(figsize=(7, 7))
plt.plot(xs, ys, color='black', linestyle='dashed')  # theoretical PDF

sns.kdeplot(  # kernel density estimate of the observed data
    data=processed_penguins_df,
    x='body_mass_g'
)

Plot the other distributions with the other variables

Plot the graphs of seaborn in a matrix

The central limit theorem and the law of large numbers

Law of large numbers

dice = empiricaldist.Pmf.from_seq([1, 2, 3, 4, 5, 6])  # PMF of a fair die
print(dice)

plt.figure(figsize=(5, 5))
dice.bar()

Drawing increasingly large samples from the die's distribution

for sample_size in (1e2, 1e3, 1e4):
    sample_size = int(sample_size)  # convert the size to an integer
    values = dice.sample(sample_size)  # draw sample_size rolls
    sample_pmf = empiricaldist.Pmf.from_seq(values)  # PMF of the sample

    plt.figure(figsize=(5, 5))
    sample_pmf.bar()  # plot the PMF
    plt.axhline(y=1/6, color='red', linestyle='dashed')  # theoretical probability of each face
    plt.ylim([0, 0.50])
    plt.title(f"Sample size: {sample_size}")
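The convergence the plots show can also be measured directly: for each sample size, take the largest gap between an observed face frequency and the theoretical 1/6. This is a sketch with the standard library `random` module rather than empiricaldist; the helper name `face_frequencies` and the seed are arbitrary choices made here.

```python
import random
from collections import Counter


def face_frequencies(n_rolls, rng):
    """Roll a fair six-sided die n_rolls times and return each face's relative frequency."""
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]
    counts = Counter(rolls)
    return {face: counts[face] / n_rolls for face in range(1, 7)}


rng = random.Random(42)  # local generator with a fixed seed for reproducibility

max_deviation = {}
for n_rolls in (100, 1_000, 10_000, 100_000):
    frequencies = face_frequencies(n_rolls, rng)
    # Largest absolute gap between an observed frequency and 1/6.
    max_deviation[n_rolls] = max(abs(f - 1 / 6) for f in frequencies.values())

for n_rolls, deviation in max_deviation.items():
    print(f"{n_rolls:>7} rolls: max deviation from 1/6 = {deviation:.4f}")
```

The gap tends to shrink roughly in proportion to one over the square root of the sample size, although any single run may fluctuate.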

Central limit theorem

# We work with the variable sex. It is categorical in the dataset, so we will
# transform it into a numerical 0/1 variable, which follows a Bernoulli distribution.
# normalize=True divides each count by the total, giving proportions.
processed_penguins_df.sex.value_counts(normalize=True).plot(kind='bar', figsize=(4, 4))

# Transform the categorical variable into a numerical one:
# .replace() maps 'Male' to 1 and 'Female' to 0.
sex_numeric = processed_penguins_df.sex.replace(['Male', 'Female'], [1, 0])

from warnings import simplefilter

simplefilter(action="ignore", category=pd.errors.PerformanceWarning)

number_samples = 1000  # total number of samples
sample_size = 35  # number of observations in each sample

samples_df = pd.DataFrame()  # each column will hold one sample

np.random.seed(42)  # fix the seed to get the same result on every run
for i in range(1, number_samples + 1):
    # draw sample_size values at random, with replacement
    sex_numeric_sample = sex_numeric.sample(sample_size, replace=True).to_numpy()
    sample_name = f"sample_{i}"  # column name for this sample
    samples_df[sample_name] = sex_numeric_sample

male_population_mean = samples_df.mean().mean()
print(f"Estimated percentage of male penguins in population is: {male_population_mean * 100:.4f}%")

sample_means_binomial = pd.DataFrame(samples_df.mean(), columns=['sample_mean'])

plt.figure(figsize=(5, 5))
sns.kdeplot(data=sample_means_binomial)
plt.axvline(x=sex_numeric.mean(), color='red', linestyle='dashed')

Making a dataframe with the running estimate of the mean as samples are added (1000 samples, 1000 estimates)

# For each i, .iloc selects the first i samples (columns) and we average their means,
# so row i holds the estimate obtained from the first i samples.
running_means = [
    [i, samples_df.iloc[:, 0:i].mean().mean()]
    for i in range(1, number_samples + 1)
]
dataframe_means_samples = pd.DataFrame(running_means, columns=['sample', 'estimated_mean'])
dataframe_means_samples.head(2)

plt.figure(figsize=(4, 4))
# sns.kdeplot(data=dataframe_means_samples.estimated_mean, bw_method=50)
sns.histplot(dataframe_means_samples, x='estimated_mean', stat='probability', bins=20, kde=True)
plt.show()
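A compact check of what the central limit theorem predicts here: for a 0/1 variable with success probability p, the means of samples of size n should centre on p with a standard deviation close to sqrt(p(1 − p) / n). The sketch uses the standard library and a hypothetical p of 0.5 (an assumption for illustration, not the proportion measured in the dataset).

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed for reproducibility

p_male = 0.5           # hypothetical probability of drawing a male (assumption)
sample_size = 35       # observations per sample, as in the cell above
number_samples = 1000  # number of samples

# Mean of each sample of 0/1 draws.
sample_means = [
    statistics.mean([1 if rng.random() < p_male else 0 for _ in range(sample_size)])
    for _ in range(number_samples)
]

mean_of_means = statistics.mean(sample_means)
observed_se = statistics.stdev(sample_means)
theoretical_se = math.sqrt(p_male * (1 - p_male) / sample_size)

print(f"mean of sample means: {mean_of_means:.4f}")
print(f"observed spread:      {observed_se:.4f}")
print(f"theoretical spread:   {theoretical_se:.4f}")
```

The observed spread of the sample means should land near the theoretical value of about 0.0845, and a histogram of `sample_means` would look approximately bell-shaped even though each individual draw is only 0 or 1.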

Bivariate analysis

Establishing relationships: scatter plots

# Scatter plot of two variables with seaborn
plt.figure(figsize=(7.5, 7.5))
sns.scatterplot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm',
    alpha=1/2,  # transparency: with alpha = 1/n, n overlapping points look fully opaque
    s=100  # size of the points
)
plt.show()

The plot above does not make clear where the observations concentrate. A 2-D histogram shows this better.

sns.displot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm',
    rug=True  # draw a tick on each axis for every observation, showing where points accumulate
)

sns.displot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm',
    kind='kde',
    rug=True,
    # hue='species'  # uncomment to color by species
)

Combining different types of plots in one figure

# jointplot provides a convenient interface with several canned plot kinds
sns.jointplot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm',
    # kind='kde',
    hue='species',
    # marginal_kws=dict(bins=25, fill=False)
)

Establishing relationships: violin plots and box plots

# Looking for relationships between a categorical and a numerical variable
plt.figure(figsize=(7.2, 7.2))
sns.scatterplot(
    data=processed_penguins_df,
    x='species',
    y='flipper_length_mm',
    hue='species',
    palette=penguin_color
)

Now we add noise (jitter) so that the points do not overlap, using a strip plot

plt.figure(figsize=(7.2, 7.2))
sns.stripplot(
    data=processed_penguins_df,
    x='species',
    y='flipper_length_mm',
    palette=penguin_color
)

Using a box plot to visualize a numerical variable against a categorical one

# A strip plot can be drawn on top of the box plot to show the individual observations
plt.figure(figsize=(7.1, 7.1))
sns.boxplot(
    data=processed_penguins_df,
    x='flipper_length_mm',
    y='species',
    palette=penguin_color,
    whis=np.inf  # extend the whiskers to the minimum and maximum
)
sns.stripplot(
    data=processed_penguins_df,
    x='flipper_length_mm',
    y='species',
    color='.3'
)
plt.show()

To understand how symmetric the data are, we can use violin plots, which show the shape of each distribution and allow a quick check of whether the data look roughly normal.

# plt.figure(figsize=(10, 15))
sns.violinplot(
    data=processed_penguins_df,
    x='species',
    y='flipper_length_mm',
    color='.8'
)
sns.stripplot(
    data=processed_penguins_df,
    x='species',
    y='flipper_length_mm',
    palette=penguin_color
)
plt.show()

Another option to avoid overlap is the swarm plot (beeswarm), which gives a better representation of the distribution of values. NOTE: it does not scale well to a large number of observations.

sns.swarmplot(
    data=processed_penguins_df,
    x='species',
    y='flipper_length_mm',
    hue='species',
    palette=penguin_color
)

Finding information about body mass by island

# A violin plot and a swarm plot together show the distribution of body mass
# on each island, split by species.
sns.violinplot(
    data=processed_penguins_df,
    x='island',
    y='body_mass_g',
    hue='species',
)
sns.swarmplot(
    data=processed_penguins_df,
    x='island',
    y='body_mass_g',
    hue='species',
)

Establishing relationships: correlation matrices

Is there a linear correlation between any of our variables?

processed_penguins_df.corr(numeric_only=True)  # numeric_only=True skips the text columns (required in pandas 2.x)

How can I visualize the correlation coefficients?

sns.heatmap(
    data=processed_penguins_df.corr(numeric_only=True),
    cmap=sns.diverging_palette(20, 230, as_cmap=True),  # diverging color map
    center=0,  # value mapped to the middle of the color map
    vmin=-1,  # smallest possible correlation
    vmax=1,  # largest possible correlation
    linewidths=0.5,
    cbar_kws={"shrink": 0.5},
    annot=True  # write each correlation coefficient in its cell
)

sns.clustermap(
    data=processed_penguins_df.corr(numeric_only=True),
    cmap=sns.diverging_palette(20, 230, as_cmap=True),  # alternative: 'BrBG'
    center=0,
    vmin=-1,
    vmax=1,
    linewidths=0.5,
    cbar_kws={"shrink": 0.5},
    annot=True
)

How can a categorical variable be represented to find correlation?

The process is simple: first identify the categorical variable and convert it to a numerical one with replace; then add the result to the dataset as a new column.

processed_penguins_df = (
    processed_penguins_df
    .assign(
        numeric_sex=lambda df: df.sex.replace(['Female', 'Male'], [0, 1])
    )
)

sns.clustermap(
    data=processed_penguins_df.corr(numeric_only=True),
    cmap=sns.diverging_palette(20, 230, as_cmap=True),  # alternative: 'BrBG'
    center=0,
    vmin=-1,
    vmax=1,
    linewidths=0.5,
    cbar_kws={"shrink": 0.5},
    annot=True
)

Limitations of the correlation matrix

It only detects linear correlation. A coefficient near zero does not rule out another kind of relationship between the variables.

x1 = np.linspace(-100, 100, 100)  # 100 evenly spaced values between -100 and 100
y1 = x1 ** 2  # parabola
y1 += np.random.normal(0, 1000, x1.size)  # add normally distributed noise (other distributions would also work)

plt.figure(figsize=(5.2, 5.2))
sns.scatterplot(x=x1, y=y1)
plt.show()

# correlation coefficient computed with numpy
np.corrcoef(x1, y1)

x1 = np.linspace(-100, 100, 100)
y1 = x1 ** 3  # cubic function
y1 += np.random.normal(0, 1000, x1.size)

plt.figure(figsize=(5.2, 5.2))
sns.scatterplot(x=x1, y=y1)
plt.show()

np.corrcoef(x1, y1)
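The same point can be verified without noise or plotting. In the sketch below, Pearson's r is computed by hand with the standard library (the helper `pearson_r` is written here for illustration): a perfectly deterministic parabola on a symmetric range has a correlation of zero, while the cubic, which is monotonic, has a high one.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


xs = list(range(-10, 11))         # symmetric around zero
parabola = [x ** 2 for x in xs]   # y depends entirely on x, but not linearly
cubic = [x ** 3 for x in xs]      # nonlinear but monotonic

r_parabola = pearson_r(xs, parabola)
r_cubic = pearson_r(xs, cubic)

print(f"r for the parabola: {r_parabola:.3f}")
print(f"r for the cubic:    {r_cubic:.3f}")
```

A correlation near zero therefore says only that there is no linear trend; plotting the data remains necessary.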

plt.figure(figsize=(5.2, 5.2))
sns.scatterplot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm'
)

np.corrcoef(processed_penguins_df.bill_length_mm, processed_penguins_df.bill_depth_mm)

The correlation coefficient says nothing about the size of the effect, that is, about the slope of the relationship

# Two noisy lines with different slopes
np.random.seed(42)

x_1 = np.linspace(0, 100, 100)
y_1 = 0.1 * x_1 + 3 + np.random.uniform(-2, 2, size=x_1.size)

plt.figure(figsize=(5.2, 5.2))
sns.scatterplot(x=x_1, y=y_1)

x_2 = np.linspace(0, 100, 100)
y_2 = 0.5 * x_2 + 1 + np.random.uniform(0, 60, size=x_2.size)

sns.scatterplot(x=x_2, y=y_2)

plt.legend(labels=['1', '2'])
plt.show()

print(np.corrcoef(x_1, y_1))
print(np.corrcoef(x_2, y_2))

Establishing relationships: simple regression analysis

res_1 = scipy.stats.linregress(x=x_1, y=y_1)  # fit a line to the first data set
res_2 = scipy.stats.linregress(x=x_2, y=y_2)  # fit a line to the second data set

print(res_1, res_2, sep="\n")

plt.figure(figsize=(7, 5.2))

# scatter plots
sns.scatterplot(x=x_1, y=y_1)
sns.scatterplot(x=x_2, y=y_2)

# regression lines from scipy: y = m * x + b, with m the slope and b the intercept
fx_1 = np.linspace(x_1.min(), x_1.max())
fy_1 = res_1.slope * fx_1 + res_1.intercept
plt.plot(fx_1, fy_1)

fx_2 = np.linspace(x_2.min(), x_2.max())
fy_2 = res_2.slope * fx_2 + res_2.intercept
plt.plot(fx_2, fy_2)

plt.legend(labels=['1', '1', '2', '2'])
plt.show()

plt.figure(figsize=(5.2, 5.2))
sns.scatterplot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm'
)

res_penguins = scipy.stats.linregress(
    x=processed_penguins_df.bill_length_mm,
    y=processed_penguins_df.bill_depth_mm
)
print(res_penguins)

fx_1 = np.array([processed_penguins_df.bill_length_mm.min(), processed_penguins_df.bill_length_mm.max()])
fy_1 = res_penguins.intercept + res_penguins.slope * fx_1
plt.plot(fx_1, fy_1)

# lmplot is a figure-level function: it creates its own figure, so no plt.figure() call is needed
sns.lmplot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm',
    hue='species',
    # height=10
)

Limitations of simple regression analysis

The linear regression is not symmetric

x = processed_penguins_df.bill_length_mm
y = processed_penguins_df.bill_depth_mm

res_x_y = scipy.stats.linregress(x=x, y=y)
res_y_x = scipy.stats.linregress(y=x, x=y)

print(res_x_y, res_y_x, sep="\n")

# Plotting bill_depth_mm against bill_length_mm with its regression line
plt.figure(figsize=(5.2, 5.2))
sns.scatterplot(x=x, y=y)

fx_1 = np.array([x.min(), x.max()])
fy_1 = res_x_y.intercept + res_x_y.slope * fx_1
plt.plot(fx_1, fy_1)

# Plotting bill_length_mm against bill_depth_mm with its regression line
plt.figure(figsize=(5.2, 5.2))
sns.scatterplot(x=y, y=x)

fx_1 = np.array([y.min(), y.max()])
fy_1 = res_y_x.intercept + res_y_x.slope * fx_1
plt.plot(fx_1, fy_1)
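Why the two fits differ can be seen from the least-squares formulas: regressing y on x gives a slope of S_xy / S_xx, while regressing x on y gives S_xy / S_yy. Their product is r², so the second slope is the reciprocal of the first only when the points lie exactly on a line. The sketch below checks this with the standard library on a few made-up measurements (the values and the helper `least_squares` are hypothetical).

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line ys ~ xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical bill measurements in millimetres (illustrative values only).
bill_length = [39.1, 39.5, 40.3, 46.1, 50.0, 48.7, 46.5, 50.5]
bill_depth = [18.7, 17.4, 18.0, 13.2, 16.3, 14.1, 17.9, 19.6]

slope_depth_on_length, _ = least_squares(bill_length, bill_depth)
slope_length_on_depth, _ = least_squares(bill_depth, bill_length)

# The product of the two slopes equals the squared correlation coefficient.
r_squared = slope_depth_on_length * slope_length_on_depth

print(f"depth ~ length slope: {slope_depth_on_length:.4f}")
print(f"length ~ depth slope: {slope_length_on_depth:.4f}")
print(f"product (r squared):  {r_squared:.4f}")
```

Since r² is far below 1 for data this scattered, neither line can be obtained by simply inverting the other.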

Regression tells us nothing about causality, but there are tools for separating the relationships among several variables

The slope is -0.634905, which means that each additional millimeter of bill depth is associated with a decrease of 0.634905 millimeters in a penguin's bill length.

(
    smf.ols(
        formula="bill_length_mm ~ bill_depth_mm",
        data=processed_penguins_df
    )
    .fit()
    .params
)

(
    smf.ols(
        formula="bill_depth_mm ~ bill_length_mm",
        data=processed_penguins_df
    )
    .fit()
    .summary()
)

Multiple regression analysis

I forgot my scale for weighing the penguins. What would be the best way to capture that measurement?

Creating models

model_1 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm",
        data=processed_penguins_df
    )
    .fit()
)
model_1.summary()

model_2 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm + bill_depth_mm",
        data=processed_penguins_df
    )
    .fit()
)
model_2.summary()

model_3 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm",
        data=processed_penguins_df
    )
    .fit()
)
model_3.summary()

model_4 = (
    smf.ols(
        # C(...) marks a variable as categorical
        formula="body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + C(sex)",
        data=processed_penguins_df
    )
    .fit()
)
model_4.summary()

model_5 = (
    smf.ols(
        formula="body_mass_g ~ flipper_length_mm + C(sex)",
        data=processed_penguins_df
    )
    .fit()
)
model_5.summary()

Visualizing the results

# New data frame with the actual body mass and the prediction of each model
models_results = pd.DataFrame(
    dict(
        actual_value=processed_penguins_df.body_mass_g,
        prediction_model_1=model_1.predict(),
        prediction_model_2=model_2.predict(),
        prediction_model_3=model_3.predict(),
        prediction_model_4=model_4.predict(),
        prediction_model_5=model_5.predict(),
        species=processed_penguins_df.species,
        sex=processed_penguins_df.sex
    )
)
models_results.head(5)

Using seaborn distribution plots to compare the models against the actual values

plt.figure(figsize=(7.2, 7.2))
sns.ecdfplot(
    data=models_results  # .select_columns(['actual_value', 'prediction_model_5'])
)
plt.show()

plt.figure(figsize=(7.2, 7.2))
sns.kdeplot(
    data=models_results,
    cumulative=False
)
plt.show()

From this analysis, model_5 appears to fit the original data slightly better. For that reason it is good practice to analyze the correlations between the variables first, to avoid spending time building many models in search of the best one.

sns.lmplot(
    data=processed_penguins_df,
    x='flipper_length_mm',
    y='body_mass_g',
    height=6.5,
    hue='sex'
)
plt.show()

Logistic regression analysis

processed_penguins_df = (
    processed_penguins_df
    .assign(sex_numeric=lambda df: df.sex.replace(['Male', 'Female'], [1, 0]))
)

smf.logit(
    formula='sex_numeric ~ flipper_length_mm + bill_length_mm + bill_depth_mm + C(island)',
    data=processed_penguins_df
).fit().summary()
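When reading a logit summary, it helps to remember that the coefficients are on the log-odds scale: exponentiating a coefficient gives an odds ratio, and the logistic (sigmoid) function turns a linear score into a probability. The sketch below illustrates this with made-up coefficients; the numbers are hypothetical and are not the output of the model above.

```python
import math


def sigmoid(z):
    """Logistic function: maps a log-odds score to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients, for illustration only (not fitted values).
intercept = -10.0
coef_flipper_length = 0.05  # change in log-odds per extra millimetre

flipper_length_mm = 200
log_odds = intercept + coef_flipper_length * flipper_length_mm
probability = sigmoid(log_odds)

# Each extra millimetre multiplies the odds by exp(coefficient).
odds_ratio = math.exp(coef_flipper_length)

print(f"log-odds: {log_odds:.2f}")
print(f"probability: {probability:.2f}")
print(f"odds ratio per mm: {odds_ratio:.4f}")
```

A positive coefficient raises the probability and a negative one lowers it, but the size of the change in probability depends on where on the curve the observation sits.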

Counting the penguins of each sex on each island to check whether the results above are plausible

(
    processed_penguins_df
    .value_counts(['island', 'sex'])
    .reset_index(name='count')
)

processed_penguins_df.species.unique()

processed_penguins_df = (
    processed_penguins_df
    .assign(is_adelie=lambda df: df.species.replace(['Adelie', 'Chinstrap', 'Gentoo'], [1, 0, 0]))
)
processed_penguins_df.head(3)

Building a new model to predict whether a penguin is an Adelie

model_is_adelie = smf.logit(
    formula='is_adelie ~ flipper_length_mm + C(sex)',
    data=processed_penguins_df
).fit(maxiter=100)

model_is_adelie.params

Checking whether the model's predictions match the original data

is_adelie_df_predictions = pd.DataFrame(
    dict(
        actual_adelie=processed_penguins_df.species.replace(['Adelie', 'Chinstrap', 'Gentoo'], [1, 0, 0]),
        predicted_values=model_is_adelie.predict().round()  # round the predicted probability to 0 or 1
    )
)
is_adelie_df_predictions

(
    is_adelie_df_predictions
    .value_counts(['actual_adelie', 'predicted_values'])
    .reset_index(name='count')
    .pivot_wider(
        index='actual_adelie',
        names_from='predicted_values',
        values_from='count'
    )
    .rename_column('actual_adelie', 'actual / predicted')
)

print(
    sklearn.metrics.confusion_matrix(
        is_adelie_df_predictions.actual_adelie,
        is_adelie_df_predictions.predicted_values
    )
)

sklearn.metrics.accuracy_score(
    is_adelie_df_predictions.actual_adelie,
    is_adelie_df_predictions.predicted_values
)
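To make the confusion matrix and the accuracy score less of a black box, the sketch below rebuilds both by hand with the standard library. The labels and probabilities are hypothetical, not the model's output. It also uses an explicit 0.5 threshold instead of round(), because NumPy rounds exact halves to the nearest even value, so a probability of exactly 0.5 would become 0.

```python
from collections import Counter

# Hypothetical labels and predicted probabilities, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_probability = [0.91, 0.62, 0.35, 0.08, 0.44, 0.71, 0.12, 0.55]

# Explicit decision threshold: probabilities of 0.5 or more count as class 1.
predicted = [1 if p >= 0.5 else 0 for p in predicted_probability]

# Count every (actual, predicted) pair.
pair_counts = Counter(zip(actual, predicted))

# Rows are actual classes and columns are predicted classes, both in the
# order (0, 1), which is the layout sklearn.metrics.confusion_matrix uses.
confusion = [[pair_counts[(a, p)] for p in (0, 1)] for a in (0, 1)]

# Accuracy is the share of observations on the main diagonal.
accuracy = (confusion[0][0] + confusion[1][1]) / len(actual)

print(confusion)
print(accuracy)
```

The off-diagonal cells are the misclassifications: confusion[0][1] counts non-Adelie penguins predicted as Adelie, and confusion[1][0] counts Adelie penguins that were missed.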

Simpson's paradox

sns.scatterplot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm'
)

sns.regplot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm'
)

sns.lmplot(
    data=processed_penguins_df,
    x='bill_length_mm',
    y='bill_depth_mm',
    hue='species',
    height=10,
    palette=penguin_color
)
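The plots above contrast one regression line over all penguins with one line per species. The sketch below reproduces that kind of sign reversal with a tiny synthetic dataset. The groups and values are invented for illustration and are not penguin measurements. Within each group the slope is positive, while the slope of the pooled data is negative.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Synthetic groups: group_a has small x and large y, group_b the opposite.
groups = {
    "group_a": ([1, 2, 3], [7, 8, 9]),
    "group_b": ([7, 8, 9], [1, 2, 3]),
}

# Slope inside each group.
within_slopes = {name: ols_slope(x, y) for name, (x, y) in groups.items()}

# Slope after ignoring the grouping variable.
pooled_x = [xi for x, _ in groups.values() for xi in x]
pooled_y = [yi for _, y in groups.values() for yi in y]
pooled_slope = ols_slope(pooled_x, pooled_y)

print(within_slopes)
print(pooled_slope)
```

The reversal comes from the difference between the group means, not from the relationship inside each group, which is why adding hue='species' changes the picture.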

sns.pairplot(data=processed_penguins_df, hue='species', palette=penguin_color)


Session information

session_info.show()
