Data Science for Good Project
Exploring the Gender Pay Gap Factors
Mitesh Shah and Daniel Hwang - 4th Period Data Science II
Considering Data
https://www.kaggle.com/datasets/fedesoriano/gender-pay-gap-dataset
Import and Filter Data & Data Cleanup
Comparing "sch" vs "schupd" to see which to drop
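A comparison like this can be sketched as follows. This is a minimal illustration, not the notebook's original cell: `compare_columns` is a hypothetical helper, and the toy values are made up for the example rather than taken from the dataset.

```python
import pandas as pd


def compare_columns(df: pd.DataFrame, a: str, b: str) -> dict:
    """Summarize how two near-duplicate columns differ (hypothetical helper)."""
    both_present = df[a].notna() & df[b].notna()
    differing = df.loc[both_present, a] != df.loc[both_present, b]
    return {
        f"{a}_missing": int(df[a].isna().sum()),
        f"{b}_missing": int(df[b].isna().sum()),
        # Rows where both values exist but disagree.
        "rows_differing": int(differing.sum()),
    }


# Toy data for illustration only; not values from the real dataset.
toy = pd.DataFrame({"sch": [12, 16, None, 14], "schupd": [12, 17, 13, 14]})
summary = compare_columns(toy, "sch", "schupd")
print(summary)
```

If one column has fewer missing values and the two rarely disagree, the other is a reasonable candidate to drop.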
Checking for Null Values in Different Columns
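One way to run this check is sketched below. `null_report` is a hypothetical helper name and the toy frame is invented for the example; the real notebook cell may have looked different.

```python
import pandas as pd


def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    report = pd.DataFrame(
        {"n_missing": counts, "share_missing": counts / len(df)}
    )
    # Most incomplete columns first.
    return report[report["n_missing"] > 0].sort_values(
        "n_missing", ascending=False
    )


# Toy data for illustration only.
toy = pd.DataFrame(
    {"a": [1, None, None, 4], "b": [1, 2, 3, 4], "c": [None, 2, 3, 4]}
)
report = null_report(toy)
print(report)
```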
Dropping columns with too many missing values, plus columns irrelevant to the analysis
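A minimal sketch of this step, under stated assumptions: the 50% missing-value cutoff, the helper name `drop_sparse_and_irrelevant`, and the toy column names are all hypothetical and not taken from the original notebook.

```python
import pandas as pd


def drop_sparse_and_irrelevant(
    df: pd.DataFrame, max_missing_share: float = 0.5, irrelevant=()
) -> pd.DataFrame:
    """Drop columns that are mostly empty, plus a hand-picked irrelevant list."""
    # Columns whose share of missing values exceeds the (assumed) cutoff.
    sparse = df.columns[df.isna().mean() > max_missing_share]
    extra = [c for c in irrelevant if c in df.columns and c not in sparse]
    return df.drop(columns=list(sparse) + extra)


# Toy data for illustration only; column names are made up.
toy = pd.DataFrame(
    {
        "wage": [10, 20, 30, 40],
        "mostly_empty": [None, None, None, 1],
        "id_code": [1, 2, 3, 4],
    }
)
cleaned = drop_sparse_and_irrelevant(toy, irrelevant=["id_code"])
print(list(cleaned.columns))
```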
We are now left with 107 columns and 33091 entries in total, all of them people with recorded wages.
Some columns have a standard deviation of 0 (look, a central tendency!), which means every row holds the same value.
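Columns like these can be found with a short check such as the one below. It is a sketch with a hypothetical helper name and invented toy data, not the original cell.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return numeric columns whose standard deviation is exactly 0."""
    stds = df.std(numeric_only=True)
    return stds[stds == 0].index.tolist()


# Toy data for illustration only.
toy = pd.DataFrame(
    {"wage": [10.0, 20.0, 30.0], "flag": [1, 1, 1], "label": ["a", "b", "c"]}
)
constant = constant_columns(toy)
print(constant)
```

Because floating-point rounding can leave a tiny nonzero standard deviation in a constant float column, checking `df.nunique() == 1` is a more robust alternative.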
In this bar graph, we check the male/female distribution across the dataset to make sure both genders are fairly represented. (1 = male, 2 = female, from the data description.) (Analysis: A pie chart might have been easier to read, but this works just as well. We also did not track how many male and female rows were dropped during cleanup.)
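The counts behind a bar graph like this can be computed as sketched here. The column name `sex` is an assumption; the 1/2 coding comes from the data description quoted above, and the toy rows are invented.

```python
import pandas as pd

# Toy data for illustration only; "sex" is an assumed column name.
toy = pd.DataFrame({"sex": [1, 2, 2, 1, 2]})

# Translate the numeric codes into readable labels (1 = male, 2 = female).
labels = toy["sex"].map({1: "male", 2: "female"})

counts = labels.value_counts()                # rows per gender
shares = labels.value_counts(normalize=True)  # fraction of rows per gender
print(counts)
print(shares)
```

Calling `counts.plot(kind="bar")` would draw the chart, assuming matplotlib is installed.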
Creating new 'object' columns for industry and occupation of each entry/person
Industries: Agriculture, miningconstruction, durables, nondurables, Transport, Utilities, Communications, retailtrade, wholesaletrade, finance, SocArtOther, hotelsrestaurants, Medical, Education, professional, publicadmin
Occupations: manager, business, financialop, computer, architect, scientist, socialworker, postseceduc, legaleduc, artist, lawyerphysician, healthcare, healthsupport, protective, foodcare, building, sales, officeadmin, farmer, constructextractinstall, production, transport
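Building an 'object' column from columns like these can be sketched as below. This assumes each listed name is a 0/1 indicator column with at most one set per row, which the text does not confirm; `collapse_dummies`, the `"unknown"` placeholder, and the toy rows are hypothetical.

```python
import pandas as pd


def collapse_dummies(
    df: pd.DataFrame, dummy_cols: list, unknown: str = "unknown"
) -> pd.Series:
    """Turn a set of 0/1 indicator columns into one string-valued column."""
    dummies = df[dummy_cols]
    # Name of the column holding the 1 in each row.
    label = dummies.idxmax(axis=1)
    # Rows with no indicator set get a placeholder instead of a wrong label.
    return label.where(dummies.sum(axis=1) > 0, unknown)


# Toy data for illustration only, using three of the industry names above.
toy = pd.DataFrame(
    {"finance": [1, 0, 0], "Medical": [0, 1, 0], "Education": [0, 0, 0]}
)
toy["industry"] = collapse_dummies(toy, ["finance", "Medical", "Education"])
print(toy["industry"].tolist())
```

The same helper would apply to the occupation columns, giving one readable category per person for grouping and plotting.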
At the end of the data cleanup and filtering process, we are left with 102 columns and the same 33091 rows of data.
Analyze Data
Woah! That's a big difference. Let's explore that more.
Present Your Findings
Your Dataset
1. Our dataset contains information about men's and women's pay in the United States over a 30-year period. The data was collected from the census across a large number of industries and jobs, giving a broad view of the gender pay gap in the United States economy.
2. We chose this dataset because we wanted to explore a social issue that touches every area of life.
3. The data in the dataset came from Census data, which is collected by the government, so we consider it a reliable source.
4. Without listing every single column, the dataset covers the gender, income, professional metadata, region, industry, occupation, race, education level, age, time period, and other attributes of each person between 1980 and 2010. The values are, as we saw earlier, integers, floats, booleans or 0/1 indicator integers, and objects (strings for categories).
Measures of Central Tendency and Spread
We used the mean as our measure of central tendency because we had complete wage information for every person. Comparing means let us measure the difference between men's and women's income overall and across the different industries, among other important information.
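The comparison of means can be sketched as follows. The column names `sex` and `wage` are assumptions, and the toy numbers are invented for the example; they are not results from the dataset.

```python
import pandas as pd

# Toy data for illustration only; "sex" and "wage" are assumed column names.
toy = pd.DataFrame(
    {"sex": [1, 1, 2, 2], "wage": [30.0, 20.0, 18.0, 22.0]}
)

# Mean wage per gender (1 = male, 2 = female).
means = toy.groupby("sex")["wage"].mean()
male_mean = means.loc[1]
female_mean = means.loc[2]

gap = male_mean - female_mean    # absolute difference in mean wage
ratio = female_mean / male_mean  # women's mean wage as a share of men's
print(gap, ratio)
```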
One measure of spread we used is the interquartile range (IQR). It told us that there are a few very low and very high income outliers, but the middle 50% of incomes fall within a fairly narrow range. A narrow IQR alone does not show that the distribution is normal, so we only conclude that most incomes are clustered together.
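An IQR calculation with outlier detection can be sketched as below. The 1.5 x IQR fences are a common convention assumed here, not something the original analysis states; `iqr_summary` and the toy wages are hypothetical.

```python
import pandas as pd


def iqr_summary(values: pd.Series) -> dict:
    """Compute the IQR and count points outside the 1.5 * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = values[(values < lower) | (values > upper)]
    return {"q1": q1, "q3": q3, "iqr": iqr, "n_outliers": len(outliers)}


# Toy wages for illustration only; one value is deliberately extreme.
toy_wages = pd.Series([10, 12, 14, 16, 18, 100])
stats = iqr_summary(toy_wages)
print(stats)
```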
Data Visualizations
1. Our first data visualization was the bar graph of the male/female distribution (1 = male, 2 = female, from the data description). We used it to check that both genders are fairly represented in the dataset. In the end, there were more female rows than male rows, but the difference was small relative to the size of our dataset. (Analysis: A pie chart might have been easier to read, but this works just as well. We also did not track how many male and female rows were dropped during cleanup.)
2. Our second data visualization was a set of bar graphs comparing men's and women's average income in every industry. We used this plot because it shows where the gap between genders is largest. From it we can see that finance, communications, and medicine have the three largest gender pay gaps. Putting the averages side by side made them easy to understand and compare.
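The per-industry comparison behind the second visualization can be sketched like this. The column names `industry`, `sex`, and `wage` are assumptions, and the toy wages are invented for the example rather than taken from our results.

```python
import pandas as pd

# Toy data for illustration only; column names are assumed.
toy = pd.DataFrame(
    {
        "industry": ["finance", "finance", "Education", "Education"],
        "sex": [1, 2, 1, 2],
        "wage": [50.0, 30.0, 24.0, 22.0],
    }
)

# Mean wage per industry and gender, one row per industry.
by_industry = toy.pivot_table(
    index="industry", columns="sex", values="wage", aggfunc="mean"
).rename(columns={1: "male", 2: "female"})

# Gap in mean wage, largest first.
by_industry["gap"] = by_industry["male"] - by_industry["female"]
ranked = by_industry.sort_values("gap", ascending=False)
print(ranked)
```

Sorting by the gap puts the industries with the largest differences at the top, which is the ordering a side-by-side bar graph makes visible.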
Interpret Data / Statistical Questions
1. We were able to answer our statistical questions, which are listed below:
a. Statistical Question One: As described by this data set, how large is the overall gender pay gap?
b. Statistical Question Two: In which industries and occupations, if any, is the gender pay gap most prevalent? Least prevalent?
c. Statistical Question Three: Are there any similarities among the industries with the largest and smallest pay gaps?
d. Statistical Question Four: What factors contribute to the gender wage gap, and in which industries are these factors most exaggerated?
We could answer them because the data was easy to manipulate, and we were able to create visualizations that made the answers clear.
2. The conclusions that can be drawn overarchingly from the data analysis are:
3. Our findings are important because they showcase the inequality and social injustice faced in the workplace. Although many deny its existence, the data shows a clear gender pay gap that is especially pronounced in certain industries and that harms women's economic and social welfare.
4. Some possible threats to our validity are misuse of factors in the data and misrepresentation of the data: the data we cut during cleanup could have been important, and the measures of central tendency and other values we computed may not have captured everything. We also did not take education (high school completion, post-secondary education) into account in our analysis, so further analysis could shed more light on what contributes to the gender pay gap.
5. Our findings now raise these new questions: