Demographic data analyzer
Created by Darío López Díaz. Work in progress.
The idea of this project is to analyze demographic data on people from different countries: their education, race, income and weekly working hours. We will compute several values of interest from the data. The data set comes from: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
The data looks like this:
df = pd.read_csv('adult.data.csv')
df.head()
# Number of people of each race, as a table and as a bar chart
print(df.groupby('race').size())
sns.countplot(data=df,x='race')
plt.xticks(rotation=45)
plt.show()
# Average age of men
df[df['sex']=='Male']['age'].mean()
# Percentage of people who have a Bachelors degree
(df['education']=='Bachelors').mean() * 100
# Percentage of people with advanced education (Bachelors, Masters or Doctorate) who earn more than 50K
Salary_Degree = df[df['education'].isin(['Bachelors','Masters','Doctorate'])]['salary'].value_counts()
NumOf_Degree = Salary_Degree.sum()
((Salary_Degree / NumOf_Degree)*100)['>50K']
# Percentage of people without advanced education who earn more than 50K
Non_Advance = df[~df['education'].isin(['Bachelors','Masters','Doctorate'])]['salary'].value_counts()
Non_Advance_More50 = Non_Advance['>50K']
Non_Advance_Less50 = Non_Advance['<=50K']
Non_Advance_More50 / (Non_Advance_More50 + Non_Advance_Less50) * 100
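# Minimum number of hours worked per week
Min_Hours = df['hours-per-week'].min()
print(Min_Hours)
# Percentage of the people working that minimum number of hours who earn more than 50K
Min_Hours_Salary = df[df['hours-per-week']==Min_Hours]['salary'].value_counts()
Min_Hours_More50 = Min_Hours_Salary['>50K']
Min_Hours_Total = Min_Hours_Salary.sum()
Min_Hours_More50 / Min_Hours_Total *100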
# Country with the largest number of people earning more than 50K, and its share (in %) of all people earning more than 50K
People_Over50byCountry = df[df['salary']=='>50K'].groupby('native-country').size()
print(People_Over50byCountry.idxmax())
People_Over50byCountry.max() / People_Over50byCountry.sum() * 100
# Most common occupation among people from India who earn more than 50K
People_More50 = df[df['salary']=='>50K'][['occupation','native-country']]
People_More50_India = People_More50[People_More50['native-country']=='India']
People_More50_India.groupby('occupation').size().idxmax()
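Several of the computations above follow one pattern: select a subset of the rows, then ask what percentage of them earn more than 50K. Below is a minimal sketch of a helper for that pattern, tried on a small made-up data frame rather than the real data set. The name percentage_over_50k and the toy rows are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def percentage_over_50k(frame, mask):
    """Return the percentage of the rows selected by `mask` whose salary is '>50K'."""
    selected = frame.loc[mask, 'salary']
    if selected.empty:
        # No rows selected: the percentage is undefined
        return float('nan')
    # The mean of a boolean Series is the fraction of True values
    return (selected == '>50K').mean() * 100

# Toy data (hypothetical), using the same column names and labels as the real data set
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Masters', 'HS-grad', 'HS-grad', 'Doctorate'],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '<=50K', '>50K'],
})

advanced = toy['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
pct_advanced = percentage_over_50k(toy, advanced)
pct_non_advanced = percentage_over_50k(toy, ~advanced)
print(pct_advanced, pct_non_advanced)
```

With the real data, the same call could be used with any boolean mask, for example df['hours-per-week'] == df['hours-per-week'].min().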