Getting to know our penguin data. 🗺🧭🐧
Install the required libraries
Import libraries
Set the general appearance of the plots
Loading data
Using the palmerpenguins package, installed with pip
Raw data from the package
Using the dataset from seaborn
Data collection, cleaning, and validation
What types of variables are in the dataset?
How many variables of each data type are in the dataset?
How many variables and observations are in the dataset?
Are there null values in the dataset?
If there are null values, how many are there per variable?
How many null values are there in the whole dataset?
What proportion of each variable's values is null?
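A minimal sketch of one way to get these proportions with pandas. The `penguins_sample` frame is a hypothetical stand-in for the real dataset, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample standing in for the penguins DataFrame.
penguins_sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, np.nan, 46.5, 38.9],
    "sex": ["male", None, "female", None],
})

# isnull() returns a boolean mask; the mean of each boolean column
# is the share of null values in that column.
null_proportion = penguins_sample.isnull().mean()
print(null_proportion)
```

Replacing `.mean()` with `.sum()` gives the null count per variable instead of the proportion.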
How can we visualize the null values across the whole dataset?
How many observations are lost if we drop the null values?
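One possible way to answer this is to compare the row count before and after `dropna()`. The data below is a hypothetical sample used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample: rows 1 and 3 contain at least one null value.
penguins_sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, np.nan, 46.5, 38.9],
    "sex": ["male", None, "female", None],
})

rows_before = len(penguins_sample)
# dropna() with default arguments drops every row that has any null value.
rows_after = len(penguins_sample.dropna())
rows_lost = rows_before - rows_after
print(f"{rows_lost} of {rows_before} observations would be lost")
```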
Counting and proportions
Prelude: Which statistics describe the dataset?
Describing the main statistics for the numerical variables only.
Analyzing categorical variables using describe()
Converting the variables of type object to categorical
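A short sketch of the conversion, assuming the text columns are named `species` and `island` as in the penguins dataset. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical mini-sample with two text columns and one numeric column.
penguins_sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "body_mass_g": [3750.0, 5000.0, 3800.0],
})

# astype accepts a {column: dtype} mapping, so only the listed
# columns change type; the numeric column is left untouched.
penguins_sample = penguins_sample.astype(
    {"species": "category", "island": "category"}
)
print(penguins_sample.dtypes)
```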
How to visualize the counts?
Using Pandas to visualize the number of elements by species
Using Seaborn to visualize the number of elements by species
Using Seaborn to visualize with a barplot
How to visualize the proportions?
Central tendency statistics
Implementation of mean(), median() and mode()
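As an illustration, the three statistics can be computed on a single pandas Series. The values below are hypothetical body masses, not taken from the dataset.

```python
import pandas as pd

# Hypothetical body masses in grams.
body_mass_g = pd.Series([3750, 3800, 3800, 4200, 5000], name="body_mass_g")

mean_mass = body_mass_g.mean()      # arithmetic average
median_mass = body_mass_g.median()  # middle value of the sorted data
mode_mass = body_mass_g.mode()      # a Series, because ties are possible

print(mean_mass, median_mass, list(mode_mass))
```

Note that `mode()` returns a Series rather than a scalar, since more than one value can share the highest frequency.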
Measures of dispersion
What is the maximum value of each variable?
What is the minimum value of each variable?
What is the range of each numerical variable?
What is the standard deviation of the numerical variables?
What is the interquartile range?
Building a table that shows Q1, Q2, Q3, and the IQR
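One way such a table could be built with `DataFrame.quantile`. The `measurements` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical numeric measurements.
measurements = pd.DataFrame({
    "bill_length_mm": [36.0, 38.0, 40.0, 42.0, 44.0],
    "body_mass_g": [3000.0, 3500.0, 4000.0, 4500.0, 5000.0],
})

# quantile() returns one row per requested quantile; transposing
# gives one row per variable and one column per quartile.
quartiles = measurements.quantile([0.25, 0.50, 0.75]).T
quartiles.columns = ["Q1", "Q2", "Q3"]

# The interquartile range is the distance between Q3 and Q1.
quartiles["IQR"] = quartiles["Q3"] - quartiles["Q1"]
print(quartiles)
```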
How can I visualize the distribution of a specific variable?
Histogram
Box plot
Limitation
Distributions: PMFs, CDFs, and PDFs
Probability mass functions (PMFs)
Using seaborn
Using the empiricaldist library
In addition, the library returns the probability of a given numerical value
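To show what "the probability of a given value" means, here is a sketch of the same idea built with plain pandas. It is a stand-in that does not use empiricaldist, and the flipper lengths are hypothetical.

```python
import pandas as pd

# Hypothetical flipper lengths in millimeters.
flipper_length_mm = pd.Series([190, 190, 195, 200, 190, 195, 210, 200])

# Relative frequency of each distinct value: an empirical PMF.
pmf = flipper_length_mm.value_counts(normalize=True).sort_index()

# Probability of an observed value, and 0.0 for a value never observed.
prob_190 = pmf.get(190, 0.0)
prob_205 = pmf.get(205, 0.0)
print(prob_190, prob_205)
```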
Empirical cumulative distribution functions (ECDFs)
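As a reminder of the definition, the ECDF at x is the fraction of observations less than or equal to x. A minimal NumPy sketch, with a hypothetical helper `ecdf` and made-up values:

```python
import numpy as np

def ecdf(sample, x):
    """Return the fraction of observations in `sample` that are <= x."""
    sample = np.asarray(sample)
    return float(np.mean(sample <= x))

# Hypothetical observations.
values = [3, 1, 2, 2, 5]

print(ecdf(values, 2))  # three of the five values are <= 2
```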
Using seaborn
Using empiricaldist
Comparing the CDFs, with the data split by species
Probability Density Function
Comparing distributions using the body mass variable from the dataset
Comparing the PDFs
Plot the distributions of the other variables
Plot the seaborn graphs in a grid
The central limit theorem and the law of large numbers
Law of large numbers
Code to simulate a large sample of die rolls
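A possible simulation, assuming a fair six-sided die. The seed and the number of rolls are arbitrary choices for this sketch. The running mean should drift toward the expected value of 3.5 as the number of rolls grows.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed, chosen arbitrarily
n_rolls = 100_000

# integers() excludes the upper bound, so high=7 yields faces 1 to 6.
rolls = rng.integers(low=1, high=7, size=n_rolls)

# Mean of the first k rolls for every k from 1 to n_rolls.
running_mean = np.cumsum(rolls) / np.arange(1, n_rolls + 1)

print(running_mean[9], running_mean[999], running_mean[-1])
```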
Central limit theorem
Building a DataFrame with the mean of each sample: 1000 samples give 1000 means
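A sketch of this step under stated assumptions: the population is a hypothetical exponential distribution (deliberately skewed) and each sample has 35 observations. Neither choice comes from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed
n_samples, sample_size = 1000, 35    # sample_size is an assumption

# Each row is one sample drawn from a skewed population with mean 1.
samples = rng.exponential(scale=1.0, size=(n_samples, sample_size))

# One mean per sample: 1000 samples give 1000 means.
sample_means = pd.DataFrame({"sample_mean": samples.mean(axis=1)})

print(sample_means["sample_mean"].describe())
```

Even though the population is skewed, the sample means cluster around the population mean, with a spread close to the population standard deviation divided by the square root of the sample size.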
--> Bivariate Analysis <--
Establishing relationships: scatter plots
The scatter plot above does not show clearly where the observations are concentrated. A 2-D histogram can reveal that density.
Combining different plot types in one figure
Establishing relationships: violin plots and box plots
Now we add noise (jitter) so that the points do not overlap, this time using the stripplot
Using a boxplot to visualize how a numerical variable behaves across the levels of a categorical variable
To see how symmetric the data are, we can use violin plots. They show the shape of the distribution and allow a quick check of whether the data look approximately normal.
Another way to avoid overlap is the swarmplot (beeswarm), which gives a better picture of how the values are distributed. NOTE: it does not scale well to a large number of observations.
Exploring weight by island
Establishing relationships: correlation matrix
Is there a linear correlation between any of our variables?
How can I visualize the correlation coefficients?
How can a categorical variable be represented so that it can be included in the correlation?
The process is simple. First, identify the categorical variable and map its categories to numbers with replace. Then add a new column to the dataset that holds the numeric value assigned to each category.
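A sketch of the recoding idea. The text mentions `replace`; this example swaps in `Series.map`, which performs the same recoding when every category appears in the mapping. The column name `numeric_sex`, the 0/1 codes, and the data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical mini-sample.
penguins_sample = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male"],
    "body_mass_g": [4200.0, 3400.0, 4500.0, 3500.0, 4300.0],
})

# Assumed coding: the choice of 0 and 1 is arbitrary.
sex_codes = {"female": 0, "male": 1}

# New numeric column that can take part in a correlation.
penguins_sample["numeric_sex"] = penguins_sample["sex"].map(sex_codes)

correlation = penguins_sample["numeric_sex"].corr(penguins_sample["body_mass_g"])
print(correlation)
```

The sign of the coefficient depends on which category receives the larger code, so the coding should be stated whenever the result is reported.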
Limitations of the correlation matrix
It only detects linear correlation. A coefficient near zero does not rule out other kinds of relationship between the variables.
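A small synthetic illustration of this limitation: `y` is completely determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is not linear. The data is made up for the example.

```python
import numpy as np

# Symmetric range of x values around zero.
x = np.linspace(-3, 3, 101)

# A perfect but nonlinear (quadratic) dependence.
y = x ** 2

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
print(r)
```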
The correlation coefficient says nothing about the size of the effect in the relationship
Establishing relationships: simple regression analysis
LIMITATIONS of simple regression analysis
Linear regression is not symmetric
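To make the asymmetry concrete, here is a sketch on synthetic data using `numpy.polyfit` (not necessarily the tool used in the notebook). Regressing `y` on `x` and `x` on `y` gives slopes that are not reciprocals of each other. Their product equals the squared correlation.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Synthetic data: y depends linearly on x, plus noise.
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=2.0, size=500)

# Least-squares slope of each regression direction.
slope_y_on_x = np.polyfit(x, y, deg=1)[0]
slope_x_on_y = np.polyfit(y, x, deg=1)[0]

r = np.corrcoef(x, y)[0, 1]

print(slope_y_on_x, slope_x_on_y, 1 / slope_y_on_x)
print(slope_y_on_x * slope_x_on_y, r ** 2)
```

The two slopes would be reciprocals only if the points lay exactly on a line, that is, if the correlation were 1 or -1.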
Regression tells us nothing about causality, but there are tools for separating the relationships among several variables
The slope is -0.634905, which means that each additional millimeter of bill depth is associated with a decrease of 0.634905 millimeters in a penguin's bill length.
Multiple regression analysis
I forgot my scale for weighing the penguins. What would be the best way to capture that measurement?
Creating models
VISUALIZING THE RESULTS
Using seaborn distribution plots to compare the models' predictions against the actual values
From this analysis, we can see that model_5 fits the original data slightly better. For that reason, it is worth analyzing the correlations between the variables first, to avoid spending time building many models in search of the best one.
ANALYZING LOGISTIC REGRESSION
Counting the number of males on each island to check whether the results above are plausible (more likely or less likely)
Using the new model to predict whether a penguin is male or female
Checking whether the previous model's predictions match the original data
Simpson's paradox
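A sketch of the paradox on synthetic data. The groups `A` and `B` and all values are hypothetical, chosen so that the trend inside each group is positive while the pooled trend is negative.

```python
import numpy as np
import pandas as pd

# Two hypothetical groups, each with a within-group slope of +1.
data = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "x": [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0],
    "y": [11.0, 12.0, 13.0, 14.0, 1.0, 2.0, 3.0, 4.0],
})

# Slope when the grouping variable is ignored.
pooled_slope = np.polyfit(data["x"], data["y"], deg=1)[0]

# Slope fitted separately within each group.
group_slopes = {
    name: np.polyfit(subset["x"], subset["y"], deg=1)[0]
    for name, subset in data.groupby("group")
}

print(pooled_slope, group_slopes)
```

Fitting one line to the pooled data reverses the sign of the relationship seen inside every group, which is why a grouping variable such as species should be checked before interpreting an overall slope.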
Session information