Demographic data analyzer
Created by Darío López Díaz. Work in progress.
This project analyzes demographic data: the education, race, income and weekly working hours of people from different countries. We compute several statistics of interest from the data. The data set comes from: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
The data looks like this:
df = pd.read_csv('adult.data.csv')
df.head()
# Number of people per race, printed as a table and drawn as a bar chart
print(df.groupby('race').size())
sns.countplot(data=df, x='race')
plt.xticks(rotation=45)
plt.show()
race
Amer-Indian-Eskimo 311
Asian-Pac-Islander 1039
Black 3124
Other 271
White 27816
dtype: int64
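The counts above are easier to compare as percentages. The following is a minimal sketch that rebuilds the printed counts as a pandas Series and converts them to shares; the names `race_counts` and `race_pct` are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Counts copied from the printed groupby output above.
race_counts = pd.Series(
    {
        'Amer-Indian-Eskimo': 311,
        'Asian-Pac-Islander': 1039,
        'Black': 3124,
        'Other': 271,
        'White': 27816,
    },
    name='count',
)

# Share of each group as a percentage of all rows.
race_pct = race_counts / race_counts.sum() * 100
print(race_pct.round(2))
```

On the full data frame, `df['race'].value_counts(normalize=True) * 100` should give the same shares without retyping the counts.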
# Average age of men (every row whose sex is not 'Female')
df.loc[df['sex'] != 'Female', 'age'].mean()
# Percentage of people whose highest education level is a Bachelors degree
(df['education'] == 'Bachelors').mean() * 100
# Percentage of people with advanced education (Bachelors, Masters or Doctorate) who earn >50K
Advanced = df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
Salary_Degree = df.loc[Advanced, 'salary'].value_counts()
NumOf_Degree = Advanced.sum()
((Salary_Degree / NumOf_Degree)*100)['>50K']
# Percentage of people without advanced education who earn >50K
# level=0 restricts the drop to the education level of the MultiIndex, which avoids the pandas PerformanceWarning about dropping without a level
Non_Advance = df.filter(items=['education','salary']).value_counts().drop(labels=['Bachelors','Masters','Doctorate'], level=0)
Non_Advance_More50 = Non_Advance.drop(labels=['<=50K'], level=1).sum()
Non_Advance_Less50 = Non_Advance.drop(labels=['>50K'], level=1).sum()
Non_Advance_More50 / (Non_Advance_More50 + Non_Advance_Less50) * 100
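The percentages computed above all follow the same pattern: select a group of rows, then measure what fraction of that group earns more than 50K. A minimal sketch of that pattern with a boolean mask is shown below. The helper `pct_over_50k` and the `toy` frame are hypothetical and only illustrate the idea; the toy rows are made up and do not come from the data set.

```python
import pandas as pd


def pct_over_50k(frame, mask=None):
    """Percentage of rows (optionally restricted by a boolean mask) whose salary is '>50K'."""
    subset = frame if mask is None else frame[mask]
    if subset.empty:
        return float('nan')
    # The mean of a boolean Series is the fraction of True values.
    return (subset['salary'] == '>50K').mean() * 100


# Toy data for illustration only; not taken from the real data set.
toy = pd.DataFrame({
    'education': ['Bachelors', 'Masters', 'HS-grad', 'HS-grad', 'Doctorate', '11th'],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '>50K', '<=50K'],
})

advanced = toy['education'].isin(['Bachelors', 'Masters', 'Doctorate'])
print(pct_over_50k(toy, advanced))   # people with advanced education
print(pct_over_50k(toy, ~advanced))  # everyone else
```

Using one mask and its negation keeps the two groups complementary by construction, so no row is counted in both or dropped from both.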
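# Minimum number of hours worked per week
Min_Hours = df['hours-per-week'].min()
Min_Hours
# Percentage of the people working the minimum number of hours who earn >50K
Min_Hours_More50 = df[df['hours-per-week'] == Min_Hours]['salary'].value_counts()['>50K']
Min_Hours_Total = df[df['hours-per-week'] == Min_Hours]['salary'].value_counts().sum()
Min_Hours_More50 / Min_Hours_Total * 100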
# Country with the most people earning >50K, and its share (in %) of all >50K earners
People_Over50byCountry = df[['native-country','salary']].set_index('salary').drop(labels='<=50K').groupby('native-country').size()
print(People_Over50byCountry.idxmax())
People_Over50byCountry.max() / People_Over50byCountry.sum() * 100
United-States
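The figure above is a country's share of all people earning >50K, which favors countries with many rows in the data. A related but different quantity is the rate of >50K earners within each country. The sketch below contrasts the two on made-up data; the country labels `A` and `B`, the `toy` frame, and the variable names are all hypothetical.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real data set.
toy = pd.DataFrame({
    'native-country': ['A', 'A', 'A', 'A', 'A', 'B'],
    'salary': ['>50K', '>50K', '<=50K', '<=50K', '<=50K', '>50K'],
})

high = toy['salary'] == '>50K'

# Share: of all >50K earners, what percentage comes from each country?
share_of_high_earners = toy.loc[high, 'native-country'].value_counts(normalize=True) * 100

# Rate: within each country, what percentage of people earn >50K?
rate_within_country = high.groupby(toy['native-country']).mean() * 100

print(share_of_high_earners.idxmax(), rate_within_country.idxmax())
```

In this toy frame the two measures point at different countries: `A` contributes most of the high earners, while `B` has the higher rate.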
# Most common occupation among people from India who earn >50K
People_More50 = df[['occupation','native-country','salary']].set_index('salary').drop(labels='<=50K')
People_More50_India = People_More50.set_index('native-country').loc['India']
People_More50_India.groupby('occupation').size().idxmax()