Data Science for Good Project

Exploring the Gender Pay Gap Factors

Mitesh Shah and Daniel Hwang - 4th Period Data Science II

Considering Data

Source: Kaggle

https://www.kaggle.com/datasets/fedesoriano/gender-pay-gap-dataset

In short, this dataset is a compilation of the US national wage data, mainly from 1980 - 2010.

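The data is reasonably reliable because it was collected over a long period and across different regions, so it captures differences across years and is less exposed to the bias of any single year or place. Judging by the file name used below (PanelStudyIncomeDynamics.csv), it comes from the Panel Study of Income Dynamics, a long-running national household survey, rather than from the census itself.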

The data has a lot of extraneous columns (including many whose values we could not interpret from the descriptions listed on Kaggle), but essentially we want to look at the gender pay gap and analyze it as a whole.

Statistical Questions: As described by this data set, how large is the overall gender pay gap? In which industries and occupations is the gender pay gap largest, and in which is it smallest? Do the industries with the largest and smallest gaps share any similarities?

Import and Filter Data & Data Cleanup

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df0 = pd.read_csv("PanelStudyIncomeDynamics.csv")
pd.set_option("display.max_columns", None)

df0.head(5)

df0.dtypes

Comparing "sch" vs "schupd" to see which to drop

print(df0["sch"].describe())

print(df0["schupd"].describe())

df0.isnull().sum()

Checking for Null Values in Different Columns

sns.heatmap(df0.isnull(), yticklabels=False, cbar=False, cmap="viridis")

Dropping columns with too many missing values and columns irrelevant to the analysis

dropping = df0.iloc[:, 41:206]
dropping.head(5)

# Drop the columns at positions 41 through 205 by label
df1 = df0.drop(columns=df0.columns[41:206])
df1
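
Dropping by position (41:206) depends on the column order of this particular CSV. A minimal sketch of an alternative, assuming a cutoff on the fraction of missing values per column; the helper name drop_sparse_columns and the 0.5 threshold are illustrative choices, not part of the original analysis:

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.5):
    """Return a copy of df without columns whose missing fraction exceeds max_missing."""
    missing_fraction = df.isnull().mean()  # per-column share of NaN values
    keep = missing_fraction[missing_fraction <= max_missing].index
    return df[keep].copy()


# Toy frame standing in for df0 (hypothetical values)
toy = pd.DataFrame({
    "sex": [1, 2, 1, 2],
    "annlabinc": [40000.0, 30000.0, np.nan, 35000.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})
trimmed = drop_sparse_columns(toy, max_missing=0.5)
print(list(trimmed.columns))
```

A threshold-based rule like this keeps working if the column order changes, though it cannot decide which columns are irrelevant to the analysis; that still takes a manual pass.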

df1.shape

df2 = df1.drop(columns=["female", "sch"])
df2

df2.shape

pd.set_option("display.max_rows", None)
df2.isnull().sum()

df3 = df2.dropna()
df3

sns.heatmap(df3.isnull(), yticklabels=False, cbar=False, cmap="viridis")

df3.shape

Now we're left with 107 columns and 33,091 entries (people with recorded wages) in total.

df3.info()

df3.describe()

The following columns have a standard deviation of 0, meaning there is no spread at all and every row holds the same value:

employed (everyone is employed)

selfemp (no one is self-employed)

immigrantsamp (no data was taken from the immigrant subsample)

military (no one is in the military)

basesamp (everyone is part of the base sample)

sumind (everyone works in an industry)

sumocc (everyone has an occupation)

df4 = df3.drop(columns=["employed", "selfemp", "immigrantsamp", "military", "basesamp", "sumind", "sumocc"])
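
The constant columns above were read off the summary statistics by eye. A minimal sketch of finding them programmatically, assuming a column counts as constant when it has at most one distinct value; the helper name constant_columns and the toy data are hypothetical:

```python
import pandas as pd


def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    distinct_counts = df.nunique(dropna=False)  # NaN counts as its own value
    return list(distinct_counts[distinct_counts <= 1].index)


# Toy frame standing in for df3 (hypothetical values)
toy = pd.DataFrame({
    "employed": [1, 1, 1],
    "military": [0, 0, 0],
    "annlabinc": [30000, 45000, 52000],
})
to_drop = constant_columns(toy)
print(to_drop)
cleaned = toy.drop(columns=to_drop)
```

In the notebook, the equivalent call would be constant_columns(df3), and its result could be compared against the hand-written list.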

In this bar graph, we check the male/female distribution across the dataset to make sure both groups are well represented before comparing their pay. (1 = male, 2 = female, from the data description.) (Analysis: A pie chart might have been easier to read, but this works just as well. We also did not track how many male and female rows were dropped during cleanup.)

sns.histplot(data=df4, x="sex")
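
Whether cleanup removed more rows for one sex than the other can be checked by counting rows per group before and after dropna(). A minimal sketch on toy data; rows_lost_by_group is a hypothetical helper, and in the notebook the two frames would be df2 and df3:

```python
import numpy as np
import pandas as pd


def rows_lost_by_group(before, after, column):
    """Compare per-group row counts before and after a cleaning step."""
    counts = pd.DataFrame({
        "before": before[column].value_counts(),
        "after": after[column].value_counts(),
    }).fillna(0).astype(int)
    counts["dropped"] = counts["before"] - counts["after"]
    return counts.sort_index()


# Toy frame standing in for df2 (hypothetical values; 1 = male, 2 = female)
before = pd.DataFrame({
    "sex": [1, 1, 1, 2, 2, 2],
    "annlabinc": [40000.0, np.nan, 52000.0, 31000.0, 28000.0, np.nan],
})
after = before.dropna()
summary = rows_lost_by_group(before, after, "sex")
print(summary)
```

If one group loses a much larger share of its rows, the cleaned sample may no longer reflect the original balance between men and women.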

Creating new 'object' columns for industry and occupation of each entry/person

df4["occ2name"].unique()

Industry columns: Agriculture, miningconstruction, durables, nondurables, Transport, Utilities, Communications, retailtrade, wholesaletrade, finance, SocArtOther, hotelsrestaurants, Medical, Education, professional, publicadmin

industries = ["Agriculture", "Mining_Construction", "Durables_Manuf", "Nondurables_Manuf",
              "Transport", "Utilities", "Communications", "Retail_Trade", "Wholesale_Trade",
              "Finance", "Social_Work_and_Arts", "Hotels_Restaurants", "Medical", "Education",
              "Professional_Services", "Public_Admin"]

# Dummy (0/1) columns in df4, in the same order as the labels in industries
indcolnames = ["Agriculture", "miningconstruction", "durables", "nondurables", "Transport",
               "Utilities", "Communications", "retailtrade", "wholesaletrade", "finance",
               "SocArtOther", "hotelsrestaurants", "Medical", "Education", "professional",
               "publicadmin"]

def industrialize(row):
    # Return the label of the first industry column set to 1 in this row.
    # Looking values up by column name avoids positional indexing on a labeled Series.
    for colname, label in zip(indcolnames, industries):
        if row[colname] == 1:
            return label
    return "NoIndustry"

df4["industry"] = df4[indcolnames].apply(industrialize, axis=1)

Occupation columns: manager, business, financialop, computer, architect, scientist, socialworker, postseceduc, legaleduc, artist, lawyerphysician, healthcare, healthsupport, protective, foodcare, building, sales, officeadmin, farmer, constructextractinstall, production, transport

occupations = ["Manager", "Businessperson", "Financial", "Computer_Technician",
               "Architect_Engineer", "Scientist", "Social_Worker", "Post_Sec_Educator",
               "Other_Educator", "Artist", "Lawyer_Physicians", "Healthcare_Practitioner",
               "Health_Support", "Protective_Service", "Food_Service", "Building_Maintenance",
               "Sales", "Office_Support", "Farmer", "Construction", "Production",
               "Transportation"]

# Dummy (0/1) columns in df4, in the same order as the labels in occupations
occcolnames = ["manager", "business", "financialop", "computer", "architect", "scientist",
               "socialworker", "postseceduc", "legaleduc", "artist", "lawyerphysician",
               "healthcare", "healthsupport", "protective", "foodcare", "building", "sales",
               "officeadmin", "farmer", "constructextractinstall", "production", "transport"]

def occupate(row):
    # Return the label of the first occupation column set to 1 in this row
    for colname, label in zip(occcolnames, occupations):
        if row[colname] == 1:
            return label
    return "NoOccupation"

df4["occupation"] = df4[occcolnames].apply(occupate, axis=1)
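
The two helpers above walk each row in Python. A minimal vectorized sketch of the same idea, assuming every dummy column holds only 0 or 1; label_from_dummies is a hypothetical helper and the toy frame uses a subset of the industry columns:

```python
import pandas as pd


def label_from_dummies(df, colnames, labels, default):
    """Collapse 0/1 indicator columns into one label column.

    Picks the first column equal to 1 in each row, or default if none is set.
    """
    dummies = df[colnames]
    rename = dict(zip(colnames, labels))
    first_set = dummies.idxmax(axis=1).map(rename)  # first column holding the row maximum
    has_any = dummies.eq(1).any(axis=1)             # False for all-zero rows
    return first_set.where(has_any, default)


# Toy frame with three of the industry dummies (hypothetical rows)
toy = pd.DataFrame({
    "Agriculture": [1, 0, 0, 0],
    "finance": [0, 1, 0, 0],
    "Medical": [0, 0, 1, 0],
})
toy["industry"] = label_from_dummies(
    toy,
    colnames=["Agriculture", "finance", "Medical"],
    labels=["Agriculture", "Finance", "Medical"],
    default="NoIndustry",
)
print(toy["industry"].tolist())
```

On a frame with tens of thousands of rows this usually runs noticeably faster than apply with axis=1, at the cost of being a little less explicit.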

df4.head(5)

print(df4.shape)

At the end of the data cleanup and filtering, we are left with 102 columns and the same 33,091 rows of data.

Analyze Data

# Shows the years and the incomes of men who worked over 50 hours a week
men50p = df4.loc[(df4.sex == 1) & (df4.usualhrwk > 50), ["wave", "annlabinc", "usualhrwk"]]
men50p

# Shows the years and the incomes of women who worked over 50 hours a week
women50p = df4.loc[(df4.sex == 2) & (df4.usualhrwk > 50), ["wave", "annlabinc", "usualhrwk"]]
women50p

print(men50p.describe())
print(women50p.describe())

# Mean income for all men
print(df4["annlabinc"].loc[df4.sex == 1].mean(axis=0))

# Mean income for all women
print(df4["annlabinc"].loc[df4.sex == 2].mean(axis=0))

Whoa! That's a big difference. Let's explore that more.

for ind in industries:
    men_avg = df4["annlabinc"].loc[(df4.sex == 1) & (df4.industry == ind)].mean(axis=0)
    women_avg = df4["annlabinc"].loc[(df4.sex == 2) & (df4.industry == ind)].mean(axis=0)
    print(ind + " (men): " + str(men_avg))
    print(ind + " (women): " + str(women_avg))
    print("Difference in pay in " + ind + ": " + str(men_avg - women_avg))
    print()
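
The same per-industry comparison can be built as a single table, which avoids copying printed means by hand into later cells. A minimal sketch, assuming the column names used above (sex, industry, annlabinc) and toy values in place of df4; gap_by_group is a hypothetical helper:

```python
import pandas as pd


def gap_by_group(df, group_col, value_col="annlabinc", sex_col="sex"):
    """Mean income per group and sex, plus the men-minus-women gap."""
    means = df.pivot_table(index=group_col, columns=sex_col,
                           values=value_col, aggfunc="mean")
    means = means.rename(columns={1: "men", 2: "women"})
    means["gap"] = means["men"] - means["women"]
    return means.sort_values("gap", ascending=False)


# Toy rows standing in for df4 (hypothetical values; 1 = male, 2 = female)
toy = pd.DataFrame({
    "industry": ["Finance", "Finance", "Finance", "Education", "Education", "Education"],
    "sex": [1, 1, 2, 1, 2, 2],
    "annlabinc": [80000, 70000, 35000, 40000, 30000, 28000],
})
table = gap_by_group(toy, "industry")
print(table)
```

Passing "occupation" instead of "industry" as the grouping column would give the matching table for occupations.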

# First 5 industries (hard-coded means)
industry_means1 = {
    'Male Income': (27600.07, 40022.88, 43485.12, 44705.56, 42681.79),
    'Female Income': (18230.86, 34081.85, 30155.33, 23752.86, 31389.29),
}
x = np.arange(5)  # the label locations
w = 0.3  # the width of the bars
m = 0
fig, ax = plt.subplots(layout='constrained')
for attribute, measurement in industry_means1.items():
    offset = w * m
    rects = ax.bar(x + offset, measurement, w, label=attribute)
    ax.bar_label(rects, padding=2)
    m += 1
ax.set_ylabel('Annual Income ($)')
ax.set_title('Income by Industry for Men/Women')
ax.set_xticks(x + w / 2, industries[0:5])  # center each tick between the two bars
for tick in ax.xaxis.get_major_ticks()[1::2]:
    tick.set_pad(15)
ax.legend(loc='upper left', ncols=3)
ax.set_ylim(0, 80000)
plt.show()

# Middle 5 industries (hard-coded means)
industry_means2 = {
    'Male Income': (54665.20, 60830.71, 36712.95, 44529.11, 75365.01),
    'Female Income': (36182.87, 34529.87, 21104.49, 32227.07, 34727.96),
}
x = np.arange(5)  # the label locations
w2 = 0.3  # the width of the bars
m2 = 0
fig, ax = plt.subplots(layout='constrained')
for attribute, measurement in industry_means2.items():
    offset = w2 * m2
    rects = ax.bar(x + offset, measurement, w2, label=attribute)
    ax.bar_label(rects, padding=3)
    m2 += 1
# Add labels, title, and custom x-axis tick labels
ax.set_ylabel('Annual Income ($)')
ax.set_title('Income by Industry for Men/Women')
ax.set_xticks(x + w2 / 2, industries[5:10])  # center each tick between the two bars
for tick in ax.xaxis.get_major_ticks()[1::2]:
    tick.set_pad(15)
ax.legend(loc='upper left', ncols=3)
ax.set_ylim(0, 80000)
plt.show()

# Last 6 industries (hard-coded means)
industry_means3 = {
    'Male Income': (35097.97, 26791.10, 53772.08, 40947.99, 59091.64, 52094.77),
    'Female Income': (22031.58, 16498.26, 30126.83, 29772.98, 36446.34, 37337.84),
}
x = np.arange(6)  # the label locations
w3 = 0.3  # the width of the bars
m3 = 0
fig, ax = plt.subplots(layout='constrained')
for attribute, measurement in industry_means3.items():
    offset = w3 * m3
    rects = ax.bar(x + offset, measurement, w3, label=attribute)
    ax.bar_label(rects, padding=3)
    m3 += 1
# Add labels, title, and custom x-axis tick labels
ax.set_ylabel('Annual Income ($)')
ax.set_title('Income by Industry for Men/Women')
ax.set_xticks(x + w3 / 2, industries[10:16])  # center each tick between the two bars
for tick in ax.xaxis.get_major_ticks()[1::2]:
    tick.set_pad(15)
ax.legend(loc='upper left', ncols=3)
ax.set_ylim(0, 80000)
plt.show()
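
The three charts above share the same drawing code and differ only in the hard-coded means. A minimal sketch of a reusable helper that draws from a table instead; plot_income_by_group is a hypothetical name, the toy means are made up, and the Agg backend is selected only so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_income_by_group(means, title="Income by Industry for Men/Women", width=0.3):
    """Draw grouped bars from a table with one row per group and one column per series."""
    x = np.arange(len(means.index))  # one slot per group
    fig, ax = plt.subplots()
    for i, column in enumerate(means.columns):
        ax.bar(x + i * width, means[column], width, label=column)
    # Center each tick under its cluster of bars
    ax.set_xticks(x + width * (len(means.columns) - 1) / 2)
    ax.set_xticklabels(means.index, rotation=45, ha="right")
    ax.set_ylabel("Annual Income ($)")
    ax.set_title(title)
    ax.legend(loc="upper left")
    return fig, ax


# Toy table of mean incomes (hypothetical values)
means = pd.DataFrame(
    {"Male Income": [50000.0, 42000.0, 61000.0],
     "Female Income": [38000.0, 35000.0, 40000.0]},
    index=["Group_A", "Group_B", "Group_C"],
)
fig, ax = plot_income_by_group(means)
```

In a notebook, calling plt.show() after the helper displays the figure; slicing the table (for example means.iloc[0:5]) reproduces the split into several smaller charts.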

white = df4["white"].sum()
black = df4["black"].sum()
hisp = df4["hisp"].sum()
others = df4["othrace"].sum()
piedata = [white, black, hisp, others]
pielabels = ["White", "Black", "Hispanic", "Other"]
palette_color = sns.color_palette('bright')
plt.pie(piedata, labels=pielabels, colors=palette_color, autopct='%.0f%%')
plt.show()

sns.scatterplot(data=df4, x="schupd", y="annlabinc", hue="sex", palette="bright")

sns.lineplot(data=df4, x="wave", y="annlabinc", hue="sex", palette="dark")

# No hue variable is set here, so the palette argument is dropped (seaborn would ignore it)
sns.lineplot(data=df4, x="usualhrwk", y="sex")

sns.jointplot(data=df4, x="age", y="annlabinc", hue="sex", palette="pastel")

from sklearn.linear_model import LinearRegression

# Select the feature columns by name. Adding DataFrames with "+" would add values
# element-wise and produce NaN wherever the column labels do not match.
xdf = df4[["wave", "sex", "famwgt", "schupd", "age", "annlabinc"] + occcolnames + indcolnames]
features = list(xdf.columns)
features.remove('annlabinc')
target = ['annlabinc']

xdf = xdf.dropna()
X = xdf[features]
Y = xdf[target]

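model = LinearRegression()
model.fit(X, Y)
r2 = model.score(X, Y)  # R^2 on the training data, not a classification accuracy
print(r2)

# LinearRegression has no feature_importances_; use the fitted coefficients instead
coefs = np.ravel(model.coef_)
order = np.argsort(coefs)
plt.bar(X.columns[order], coefs[order])
plt.xticks(rotation=90)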
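
Raw regression coefficients depend on each feature's units, so their sizes are not directly comparable, and scoring on the training rows overstates fit. A minimal sketch of one common workflow, assuming synthetic data in place of df4: hold out a test split and standardize the features before reading the coefficients. The column names mirror the notebook, but the numbers are generated and carry no real-world meaning.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df4 (hypothetical relationship, fixed seed)
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "schupd": rng.integers(8, 18, size=n),
    "age": rng.integers(25, 65, size=n),
})
noise = rng.normal(0, 1000, size=n)
data["annlabinc"] = 3000 * data["schupd"] + 200 * data["age"] + noise

X = data[["schupd", "age"]]
y = data["annlabinc"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training rows only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

test_r2 = model.score(scaler.transform(X_test), y_test)
coef_table = pd.Series(model.coef_, index=X.columns).sort_values()
print(round(test_r2, 3))
print(coef_table)
```

With standardized inputs, each coefficient reads as the change in income per one standard deviation of that feature, which makes the bars in a coefficient plot comparable.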

print(np.quantile(df4['annlabinc'], [0, 0.25, 0.5, 0.75, 1]))

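from scipy.stats import iqr

# First get the IQR (stored under a new name so the iqr function is not overwritten)
income_iqr = iqr(df4['annlabinc'])
# Then get the lower and upper quartiles
q1 = np.quantile(df4['annlabinc'], 0.25)
q3 = np.quantile(df4['annlabinc'], 0.75)
# Then find outliers with the usual 1.5 * IQR rule; values merely outside
# the quartiles are not outliers (that would be half of the data)
lower_threshold = q1 - 1.5 * income_iqr
upper_threshold = q3 + 1.5 * income_iqr
outliers = df4[(df4['annlabinc'] < lower_threshold) | (df4['annlabinc'] > upper_threshold)]
print(outliers)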
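
A common convention flags a value as an outlier when it lies more than 1.5 times the IQR beyond the quartiles. A small worked sketch on made-up incomes; iqr_outliers is a hypothetical helper and the multiplier k = 1.5 is the conventional default, not something the dataset prescribes:

```python
import numpy as np
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return values[(values < lower) | (values > upper)]


# Toy incomes (hypothetical); one value is far above the rest
incomes = pd.Series([20000, 25000, 30000, 35000, 40000, 45000, 500000])
flagged = iqr_outliers(incomes)
print(flagged.tolist())
```

Here the quartiles are 27,500 and 42,500, so the fences sit at 5,000 and 65,000 and only the 500,000 entry is flagged.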

Present Your Findings

Your Dataset

1. Our dataset contains information about men's and women's pay in the United States over a 30-year period (1980 to 2010), covering a large number of industries and occupations. It gives a broad view of the gender pay gap in the US economy.

2. We chose this dataset because we wanted to explore a social issue that is pertinent in all facets and all areas of life.

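3. The data was loaded from PanelStudyIncomeDynamics.csv, which points to the Panel Study of Income Dynamics, a long-running national household survey, rather than the census. Like any survey it is subject to sampling and reporting error, so we treat it as a reliable source rather than an exact one.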

4. Without listing every single column, the dataset covers each respondent's gender, income, region, industry, occupation, race, education level, age, survey year, and other job-related details between 1980 and 2010. As we saw before, the data values are ints, floats, boolean-like 0/1 integers, and objects (strings for categories).

Measures of Central Tendency and Spread

We used the mean because, after dropping rows with missing values, every remaining row had an income value. Comparing mean income for men and women let us measure the gap both overall and within each industry.

One measure of spread for our dataset is the interquartile range (IQR). It told us that the middle 50% of incomes fall in a fairly narrow band, with a small number of very low and very high incomes outside it. This alone does not show that the distribution is normal; income data is often right-skewed, so a histogram would be needed to check the shape.
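
Because the mean is pulled toward extreme incomes while the median is not, comparing the two is a quick check on skew. A minimal sketch on made-up numbers, assuming the notebook's column names (sex, annlabinc):

```python
import pandas as pd

# Toy rows standing in for df4 (hypothetical values; 1 = male, 2 = female)
toy = pd.DataFrame({
    "sex": [1, 1, 1, 1, 2, 2, 2, 2],
    "annlabinc": [30000, 40000, 50000, 200000, 25000, 30000, 35000, 90000],
})

# Mean and median income per sex, side by side
summary = toy.groupby("sex")["annlabinc"].agg(["mean", "median"])
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```

A mean well above the median points to a right-skewed distribution, in which case the median gap is worth reporting alongside the mean gap.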

Data Visualizations

1. Our first data visualization was the bar graph of the male/female distribution (1 = male, 2 = female, from the data description). We used it to check that both groups are well represented before comparing their pay. There were more female rows than male rows, but the difference was small relative to the size of the data set. A pie chart might have been easier to read, but this works just as well. We did not track how many male and female rows were dropped during cleanup.

2. Our second data visualization was the set of bar graphs comparing men's and women's mean income in every industry. These plots put the averages side by side, which makes it easy to see where the gap between genders is widest. Finance, communications, and medicine have the three largest gender pay gaps.

Interpret Data / Statistical Questions

1. We were able to answer our statistical questions, which are listed below:

a. Statistical Question One: As described by this data set, how large is the overall gender pay gap?

b. Statistical Question Two: In which industries and occupations is the gender pay gap largest, and in which is it smallest?

c. Statistical Question Three: Do the industries with the largest and smallest gaps share any similarities?

d. Statistical Question Four: What factors contribute to the gender wage gap, and in which industries are these factors most pronounced?

We could answer them because the data was easy to filter and aggregate, and the visualizations made the comparisons clear.

2. The conclusions that can be drawn overarchingly from the data analysis are:

Overall gap in mean annual income: $45,733.06 (men) - $29,074.91 (women) = $16,658.15

Largest gender pay gaps: Finance, Communications, and Medical

Smallest gender pay gaps: Mining_Construction, Agriculture, and Hotels_Restaurants

The industries with the largest pay gaps typically require more education, and those with the smallest gaps typically require less. Finance and medicine in particular demand many years of schooling. This suggests that education may play a role in the gender pay gap, although we did not test that directly.
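
The dollar gap can also be expressed relative to men's mean income. A short sketch using the two overall means from the conclusions above; the variable names are illustrative:

```python
# Mean annual labor income by sex, taken from the analysis output
men_mean = 45733.06064392997
women_mean = 29074.911297907358

gap_dollars = men_mean - women_mean
gap_percent = 100 * gap_dollars / men_mean      # gap as a share of men's mean income
earnings_ratio = 100 * women_mean / men_mean    # women's mean as a percent of men's

print(f"Gap: ${gap_dollars:,.2f}")
print(f"Gap as percent of men's mean: {gap_percent:.1f}%")
print(f"Women's mean as percent of men's: {earnings_ratio:.1f}%")
```

Stating the gap as a percentage makes it easier to compare across industries whose pay levels differ.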

3. Our findings are important because they showcase inequality in the workplace. Although many deny its existence, the data shows a clear gender pay gap that is especially pronounced in certain industries and that harms women's economic and social welfare.

4. Some possible threats to validity: we may have misused or misread some factors in the data; the rows and columns we dropped could have held important information; and the mean may not capture everything about the income distribution. We also did not account for education (high school completion, post-secondary education) in our analysis, so further analysis could shed more light on what contributes to the gender pay gap.

5. Our findings now raise these new questions:

How has the gender pay gap changed specifically over the last 13 years?

In newly formed companies, how does the gender distribution of the workforce compare with that of other industries and occupations, and with the population of the region, city, and country where the company is located?
