My first notebook on Deepnote
!kaggle datasets download -d arjunprasadsarkhel/2021-olympics-in-tokyo --unzip
Downloading 2021-olympics-in-tokyo.zip to /work
0%| | 0.00/348k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 348k/348k [00:00<00:00, 9.21MB/s]
# Importing libraries
#------Data Pre-Procesing----------
import pandas as pd
import numpy as np # data pre-processing
#------Data Visualizations---------
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px # data visualization
#------Additional-----------
import warnings
warnings.filterwarnings("ignore") # additional
Read Data
df_athletes = pd.read_excel('Athletes.xlsx')
df_coaches = pd.read_excel('Coaches.xlsx')
df_entries_gender = pd.read_excel('EntriesGender.xlsx')
df_medals = pd.read_excel('Medals.xlsx')
df_teams = pd.read_excel('Teams.xlsx')
Exploring our data
First Look
print(df_athletes.head())
print(f'Here we have {df_athletes.shape[0]} rows and {df_athletes.shape[1]} columns')
Name NOC Discipline
0 AALERUD Katrine Norway Cycling Road
1 ABAD Nestor Spain Artistic Gymnastics
2 ABAGNALE Giovanni Italy Rowing
3 ABALDE Alberto Spain Basketball
4 ABALDE Tamara Spain Basketball
Here we have 11085 rows and 3 columns
data_source = {'Athletes':df_athletes,
'Coaches':df_coaches,
'Entries_gender':df_entries_gender,
'Medals':df_medals,
'Teams':df_teams}
for dataset in data_source:
print('-'*60)
print(f'This is {dataset} dataset\n')
print(data_source[dataset].head())
print(f'Here we have {data_source[dataset].shape[0]} rows and {data_source[dataset].shape[1]} columns')
------------------------------------------------------------
This is Athletes dataset
Name NOC Discipline
0 AALERUD Katrine Norway Cycling Road
1 ABAD Nestor Spain Artistic Gymnastics
2 ABAGNALE Giovanni Italy Rowing
3 ABALDE Alberto Spain Basketball
4 ABALDE Tamara Spain Basketball
Here we have 11085 rows and 3 columns
------------------------------------------------------------
This is Coaches dataset
Name NOC Discipline Event
0 ABDELMAGID Wael Egypt Football NaN
1 ABE Junya Japan Volleyball NaN
2 ABE Katsuhiko Japan Basketball NaN
3 ADAMA Cherif Côte d'Ivoire Football NaN
4 AGEBA Yuya Japan Volleyball NaN
Here we have 394 rows and 4 columns
------------------------------------------------------------
This is Entries_gender dataset
Discipline Female Male Total
0 3x3 Basketball 32 32 64
1 Archery 64 64 128
2 Artistic Gymnastics 98 98 196
3 Artistic Swimming 105 0 105
4 Athletics 969 1072 2041
Here we have 46 rows and 4 columns
------------------------------------------------------------
This is Medals dataset
Rank Team/NOC Gold Silver Bronze Total \
0 1 United States of America 39 41 33 113
1 2 People's Republic of China 38 32 18 88
2 3 Japan 27 14 17 58
3 4 Great Britain 22 21 22 65
4 5 ROC 20 28 23 71
Rank by Total
0 1
1 2
2 5
3 4
4 3
Here we have 93 rows and 7 columns
------------------------------------------------------------
This is Teams dataset
Name Discipline NOC Event
0 Belgium 3x3 Basketball Belgium Men
1 China 3x3 Basketball People's Republic of China Men
2 China 3x3 Basketball People's Republic of China Women
3 France 3x3 Basketball France Women
4 Italy 3x3 Basketball Italy Women
Here we have 743 rows and 4 columns
def first_look(dataset):
print(dataset.info())
print(f'\nUniques values:\n{dataset.nunique()}')
print(f'\nDuplicates: {dataset.duplicated().sum()}')
print(f'\nNull values: \n{dataset.isnull().sum()}')
first_look(df_athletes)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11085 entries, 0 to 11084
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 11085 non-null object
1 NOC 11085 non-null object
2 Discipline 11085 non-null object
dtypes: object(3)
memory usage: 259.9+ KB
None
Uniques values:
Name 11062
NOC 206
Discipline 46
dtype: int64
Duplicates: 1
Null values:
Name 0
NOC 0
Discipline 0
dtype: int64
first_look(df_coaches)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394 entries, 0 to 393
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 394 non-null object
1 NOC 394 non-null object
2 Discipline 394 non-null object
3 Event 249 non-null object
dtypes: object(4)
memory usage: 12.4+ KB
None
Uniques values:
Name 381
NOC 61
Discipline 9
Event 6
dtype: int64
Duplicates: 1
Null values:
Name 0
NOC 0
Discipline 0
Event 145
dtype: int64
first_look(df_entries_gender)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Discipline 46 non-null object
1 Female 46 non-null int64
2 Male 46 non-null int64
3 Total 46 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.6+ KB
None
Uniques values:
Discipline 46
Female 38
Male 41
Total 41
dtype: int64
Duplicates: 0
Null values:
Discipline 0
Female 0
Male 0
Total 0
dtype: int64
first_look(df_medals)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 93 non-null int64
1 Team/NOC 93 non-null object
2 Gold 93 non-null int64
3 Silver 93 non-null int64
4 Bronze 93 non-null int64
5 Total 93 non-null int64
6 Rank by Total 93 non-null int64
dtypes: int64(6), object(1)
memory usage: 5.2+ KB
None
Uniques values:
Rank 67
Team/NOC 93
Gold 14
Silver 17
Bronze 21
Total 30
Rank by Total 30
dtype: int64
Duplicates: 0
Null values:
Rank 0
Team/NOC 0
Gold 0
Silver 0
Bronze 0
Total 0
Rank by Total 0
dtype: int64
first_look(df_teams)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 743 entries, 0 to 742
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 743 non-null object
1 Discipline 743 non-null object
2 NOC 743 non-null object
3 Event 743 non-null object
dtypes: object(4)
memory usage: 23.3+ KB
None
Uniques values:
Name 146
Discipline 20
NOC 84
Event 36
dtype: int64
Duplicates: 0
Null values:
Name 0
Discipline 0
NOC 0
Event 0
dtype: int64