Quick note: This project is subject to change over the course of the Christmas holiday period of 2023.
Introduction.
"Statistics at a Glance: The Burden of Cancer in the United States"
(Courtesy of cancer.gov):
• In 2020, an estimated 1,806,590 new cases of cancer will be diagnosed in the United States and 606,520 people will die from the disease.
• The most common cancers (listed in descending order according to estimated new cases in 2020) are breast cancer, lung and bronchus cancer, prostate cancer, colon and rectum cancer, melanoma of the skin, bladder cancer, non-Hodgkin lymphoma, kidney and renal pelvis cancer, endometrial cancer, leukemia, pancreatic cancer, thyroid cancer, and liver cancer.
• Prostate, lung, and colorectal cancers account for an estimated 43% of all cancers diagnosed in men in 2020. For women, the three most common cancers are breast, lung, and colorectal, and they will account for an estimated 50% of all new cancer diagnoses in women in 2020.
• The rate of new cases of cancer (cancer incidence) is 442.4 per 100,000 men and women per year (based on 2013–2017 cases).
• The cancer death rate (cancer mortality) is 158.3 per 100,000 men and women per year (based on 2013–2017 deaths).
• The cancer mortality rate is higher among men than women (189.5 per 100,000 men and 135.7 per 100,000 women).
• When comparing groups based on race/ethnicity and sex, cancer mortality is highest in African American men (227.3 per 100,000) and lowest in Asian/Pacific Islander women (85.6 per 100,000).
• As of January 2019, there were an estimated 16.9 million cancer survivors in the United States. The number of cancer survivors is projected to increase to 22.2 million by 2030.
• Approximately 39.5% of men and women will be diagnosed with cancer at some point during their lifetimes (based on 2015–2017 data).
• In 2020, an estimated 16,850 children and adolescents ages 0 to 19 will be diagnosed with cancer and 1,730 will die of the disease.
• Estimated national expenditures for cancer care in the United States in 2018 were $150.8 billion. In future years, costs are likely to increase as the population ages and more people have cancer. Costs are also likely to increase as new, and often more expensive, treatments are adopted as standards of care.
The data.
There are some characters unreadable by pandas, so adding an argument for special encodings wouldn't go amiss...
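A minimal sketch of what that argument might look like. The inline bytes stand in for the real file, and "latin-1" is an assumption on my part; the actual file may need a different encoding.

```python
import io
import pandas as pd

# Hypothetical one-row extract, stored as Latin-1 so that "ñ" is not valid UTF-8.
raw_bytes = (
    'Geography,TARGET_deathRate\n'
    '"Doña Ana County, New Mexico",150.1\n'
).encode("latin-1")

# Passing the encoding explicitly lets pandas decode the special characters.
df_demo = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(df_demo)
```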
There are 3047 entries in 34 columns, two of which are object columns. One of those object columns has been binned using decile analysis which is great for accounting but not of too much use here today. I won't be looking at a statistical description this time due to it being unnecessary for this data.
Dataframe information.
Missing values.
Pretty good news here, only three columns with missing values and only one column with over 30% missing. That column will be dropped due to a lack of further, reliable information with which to fill in the missing values.
Interpolation of missing values.
Filling the missing values in "PctPrivateCoverageAlone" and "PctEmployed16_Over" using the method that resulted in the best balance of accuracy and quality: linear interpolation.
Finally, filling in the remaining values that linear interpolation skipped (it only fills forward, so missing values at the start of a column are left untouched) using the mean of all of the treated values.
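The two steps above can be sketched on a toy frame. The values are made up; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values: one column starts with a gap, the other ends with one.
toy = pd.DataFrame({
    "PctPrivateCoverageAlone": [np.nan, 50.0, np.nan, 40.0],
    "PctEmployed16_Over": [52.0, np.nan, 56.0, np.nan],
})
cols = ["PctPrivateCoverageAlone", "PctEmployed16_Over"]

# Linear interpolation fills gaps between known values and carries the last
# known value forward, but leaves a leading NaN alone.
toy[cols] = toy[cols].interpolate(method="linear")

# Whatever is still missing gets the mean of the already-treated column.
toy[cols] = toy[cols].fillna(toy[cols].mean())
print(toy)
```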
Dropping columns with more than 30% missing values.
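A small sketch of the 30% rule, assuming the share of missing values is measured per column. "mostly_missing" is a hypothetical stand-in for the column that gets dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 40.0],   # hypothetical column, 75% missing
    "PctEmployed16_Over": [51.9, np.nan, 55.9, 45.9],   # 25% missing
    "TARGET_deathRate": [164.9, 161.3, 174.7, 194.8],   # complete
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Drop every column whose missing share exceeds 30%.
to_drop = missing_share[missing_share > 0.30].index
trimmed = toy.drop(columns=to_drop)
print(trimmed.columns.tolist())
```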
Data cleaning.
snake_case treatment for camelCase labels, then renaming some columns.
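One way the snake_case step could be done, as a sketch. The helper name `to_snake_case` is my own, and the manual renames that follow are not reproduced here.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Put "_" between a lowercase letter or digit and a following capital, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# Empty frame carrying a few of the original camelCase labels.
toy = pd.DataFrame(columns=[
    "avgAnnCount", "TARGET_deathRate", "PctEmployed16_Over", "PctPrivateCoverageAlone",
])
toy = toy.rename(columns=to_snake_case)
print(toy.columns.tolist())
```

Looking only at lowercase-to-uppercase boundaries keeps labels that already contain an underscore (such as "PctEmployed16_Over") from ending up with a double underscore.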
Outlier treatment.
There are / were some extreme outliers in "avg_household_size", "median_age", "avg_deaths_per_year" as well as in the target column so they will be cut.
There are median ages in a range of 400-600 years old which don't make any sense. Experimenting with repositioning the decimal point to reflect, say, an age in the 40's or 60's resulted in nothing that made any further sense so I will drop the rows with a median age above 100.
There is an average household size of 0 which will have to go, and a couple of problematic outliers in both the "avg_deaths_per_year" and "avg_ann_count" columns, both of which can be dealt with by dropping the one single outlier in "avg_deaths_per_year".
And finally the "target_death_rate" outlier will definitely cause some havoc, so adios to that outlier.
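A sketch of the two cuts that have explicit thresholds in the text (median age above 100, household size of 0). The rows are made up, and the single-outlier cuts for "avg_deaths_per_year" and "target_death_rate" are left out because their cut-off values aren't stated here.

```python
import pandas as pd

# Made-up rows: one impossible median age, one household size of 0.
toy = pd.DataFrame({
    "median_age": [41.2, 498.0, 39.5, 36.1],
    "avg_household_size": [2.5, 2.6, 0.0, 2.4],
    "target_death_rate": [170.1, 165.0, 180.2, 190.3],
})

# Keep only rows with a plausible median age and a non-zero household size.
keep = (toy["median_age"] <= 100) & (toy["avg_household_size"] > 0)
cleaned = toy.loc[keep].reset_index(drop=True)
print(cleaned)
```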
Creating new features.
Splitting the Geography features by the comma, saving the first word to a new column, "state".
Creating state codes to use in choropleth maps (the choropleth maps unfortunately use abbreviations which have to be mapped to the full state names -.-). Stripping any whitespace before I do anything.
Mapping the objects in the "state_list" dictionary to the state names and storing them in a column, "state_code".
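A sketch of the split, strip and map steps. It assumes the geography values look like "County, State", so the state is taken as the piece after the comma (adjust the index if the layout differs), and it uses a two-entry extract of the "state_list" dictionary purely for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "geography": ["Kitsap County, Washington", "Lewis County, West Virginia"],
})

# Assumption: the state is the part after the comma; strip the leading space.
toy["state"] = toy["geography"].str.split(",").str[-1].str.strip()

# Hypothetical extract of the full name -> abbreviation dictionary.
state_list = {"Washington": "WA", "West Virginia": "WV"}
toy["state_code"] = toy["state"].map(state_list)
print(toy)
```

Without the strip, " Washington" (with its leading space) would not match any dictionary key and the state code would come back as NaN.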
Creating a column named "poverty_max" which will be the summed (mean) poverty features for each state:
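One possible reading of this step, sketched under the assumption that "poverty_max" is the per-state mean of "poverty_percent" written back onto every row of that state. The numbers are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["Kansas", "Kansas", "Nevada"],
    "poverty_percent": [10.0, 14.0, 20.0],
})

# transform("mean") returns one value per row, so it can be assigned directly.
toy["poverty_max"] = toy.groupby("state")["poverty_percent"].transform("mean")
print(toy)
```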
Correlations.
The strongest correlations to the target variable are:
• 1: Incidence rate and pct public coverage alone.
• 2: Poverty percent.
• 3: Pct high school 25 & over and pct public coverage.
The most negatively correlated with the target variable are:
• 1: Pct Bachelor's Degree 25 & over.
• 2: Median income.
• 3: Pct private coverage.
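A sketch of how a ranking like the two lists above can be produced. The toy numbers are invented; only the column names follow the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "target_death_rate": [150.0, 170.0, 180.0, 200.0, 210.0],
    "poverty_percent":   [10.0, 14.0, 15.0, 20.0, 22.0],
    "median_income":     [70.0, 55.0, 52.0, 40.0, 38.0],
    "birth_rate":        [5.0, 6.0, 5.0, 6.0, 5.0],
})

# Pearson correlation of every column with the target, highest first.
corr_with_target = (
    toy.corr()["target_death_rate"]
    .drop("target_death_rate")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Sorting by the signed value puts the strongest positive relationships at the top and the strongest negative ones at the bottom; values near zero in the middle are the ones with little linear relationship to the target.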
Making a copy of the original dataframe as "df_1".
Analysis.
General overview.
Average annual count vs. average deaths per year.
The majority of the death rates in the dataset exist between counts of 100 and 250. And visible in the tight grouping of markers on the left hand side of the plot, the average annual count figures under 5000 account for 98.7% of the data collected.
(The marker at 25K (on the far right) was purposefully left in the data as opposed to being treated as an outlier. Where I can help it, I don't like to remove data unnecessarily, even at the cost of a bit of model accuracy. These data points are quite rich in information and, given the nature of the data, I am quite happy to keep them included and suffer a little bit of model loss.)
There is a vertical cluster of data points at an average annual count of 1962.667684 which, after a long search, pertains to an incidence rate figure of 453.549422 and something intriguing in the study per capita column. I would like to see what this is.
Separating the data where the average annual count is 1962.667684, creating a subset dataframe with that data.
Statistical description for the subset dataframe.
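A sketch of the subsetting step on made-up rows. I'm using `np.isclose` rather than `==` here on the assumption that the displayed figure is a rounded float, so an exact comparison could miss rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "avg_ann_count": [1962.667684, 120.0, 1962.667684, 4500.0],
    "incidence_rate": [453.549422, 430.1, 453.549422, 480.2],
    "state": ["Kansas", "Nevada", "Minnesota", "Kansas"],
})

# Tolerant float comparison against the repeated value.
mask = np.isclose(toy["avg_ann_count"], 1962.667684)
subset = toy.loc[mask]

# Statistical description of the numeric columns in the subset.
summary = subset.describe()
print(summary)
```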
Looking at the states linked to both of these features we see Nevada, Minnesota and Kansas, a mean White population of 91, a mean Black population of 1.6, a mean Asian population of 1.1 and a mean 'Other Race' population of 1.9 which is quite a high average compared to the other ethnic minorities. Other notable figures are a mean "pct_private_coverage" figure of 72.7, a mean "pct_employed16_over" figure of 61, a "pcd_married" figure of 56, a female median age of 43 and a mean pct_hs18_24 figure of 31.6.
Another high figure resides in the "study_per_cap" column, where all of the study per capita data for the state of Kansas was carried out solely within these 201 rows. Also, the average death rate for this small portion of data is almost on a par with the entire original dataframe's average death rate.
So, what happened here is a bit of a mystery. There isn't a lot of data with which to make an educated guess, but with the Female median age in mind, perhaps the healthcare providers or government initiated a study due to the (hitherto) unexplained rise in cancer mortalities affecting females of a certain demographic within a slightly higher age group than normal. I decided to look at some government data for Kansas in this year and breast cancer was the most common form of cancer in 2015. Whether that is the case here is up for debate though.
Back to the original dataframe:
Select KBest for "avg_ann_count".
Coefficients of Select KBest.
I selected 25 of the most important features using "avg_ann_count" as the target variable, the results are:
"Avg_household_size", "median_age_female", "male" and "percent_married" are being classed as the most important features, followed by "pct_public_coverage", "pct_employed_16_over" & "pct_private_coverage". Then we see "poverty_percent", "pct_no_hs18_24", "pct_other_race" and "pct_asian" finally make up the list.
Experimenting with the feature size required to train the KBest algorithm will raise or drop accuracy and add weight to certain variables, but these are the prominent features across all runs: Respondents of Other and Asian descent, a higher Female median age than Male, employed with public coverage and almost as much private coverage as public, married and living in a degree of poverty are the most important factors in predicting the average count of cases diagnosed in the U.S.A.
Lowering the size of 'k' will lower accuracy slightly, leave out percent married households and raise private coverage, but I would like to leave married households as a 'k' value as there is quite a high count of married couples in this model. In its current form, the model reflects a negative married households value as well as the highest accuracy, possibly due to married households mostly having private coverage.
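For reference, a minimal sketch of a SelectKBest run on synthetic data. The `f_regression` score function, `k=2` and the generated numbers are all assumptions for illustration, not the settings used for the results above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "avg_household_size": rng.normal(2.5, 0.2, n),
    "percent_married": rng.normal(52, 6, n),
    "birth_rate": rng.normal(5.6, 1.0, n),
})
# Synthetic target built so that only the first two columns carry signal.
y = 3000 * X["avg_household_size"] - 40 * X["percent_married"] + rng.normal(0, 50, n)

# Score each feature against the target on its own and keep the k best.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(selected)
print(scores)
```

Note that the scores are univariate, so they say nothing about the sign of a relationship; a negative value such as the one for married households comes from the model fitted on the selected columns afterwards.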
Avg_deaths_per_year by ethnicity.
• The White ethnicity accounts for 83.72% of the population in the dataset, and accounts for 79.52% of the average deaths per year.
• The Black ethnicity accounts for 9.01% of the population in the dataset, and accounts for 10.90% of the average deaths per year.
• The Asian ethnicity accounts for 1.26% of the population in the dataset, and accounts for 2.96% of the average deaths per year.
• The 'Other' ethnicities account for 1.99% of the population in the dataset, and account for 2.77% of the average deaths per year.
Incidence rate vs. median age, average household size and poverty percent.
• The majority of incidence rate values exist in the low to medium age bins, all with a mixture of household sizes and poverty values.
• Along with the aforementioned low and medium age bins, we see the inclusion of the very low median age group having an incidence rate over 600, mostly with a poverty percent value over 20, and three of these markers with high incidence rates reflect above-average household sizes.
Incidence rate by median age per gender.
The incidence rates are fairly evenly spread, although there are some deeper colored markers representative of some higher median Female age groups amid the clusters of comparatively low Male median age features which are also experiencing quite high incidence rates.
The highest incidence rate feature is 1014.2, at a Male median age of 25.6 and a Female median age of 23.6.
Death rate per state.
• Kentucky holds the highest death rate value of 2.4%. That is followed by Mississippi, Tennessee, Arkansas, Louisiana and West Virginia.
• All 21 of the states up to North Carolina in the pie chart have a death rate of 2% or over.
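The percentages above are pie chart shares. A sketch of how such shares could be derived, assuming each slice is the state's mean death rate divided by the sum of all state means (the numbers are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["Kentucky", "Kentucky", "Utah", "Colorado"],
    "target_death_rate": [210.0, 220.0, 140.0, 145.0],
})

# Mean death rate per state, then each state's share of the total in percent.
state_mean = toy.groupby("state")["target_death_rate"].mean()
state_share = (state_mean / state_mean.sum() * 100).sort_values(ascending=False)
print(state_share)
```

With around 50 slices of similar size, an even split would give each state roughly 2%, which is why values of 2% and above mark the states on the higher side.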
Median income per state by high school educated 25 & over.
Here I will be looking for small, dark hued markers representing a low median age and low on the education scale by state.
• The District of Columbia has the lowest average of high school educated 25 & over and is also among the three highest income values. Whether correlated or not, the District of Columbia holds a reasonably high average death rate value of 182.3.
• West Virginia has the highest average of high school educated 25 & over as well as one of the lowest median income values. WV holds one of the highest death rate values of all states, 197.8.
• Of the states with the highest average death rate values (Kentucky, Tennessee, Arkansas, Mississippi), all but one have an average high school 25 & over value above 38.
• Of the states with the lowest average death rate values (Utah, Colorado, Hawaii, Arizona, Idaho), none have an average high school 25 & over value above 31. Colorado's is as low as 27.
Median income per state by average birth rate.
• South Dakota has the highest average birth rate at a value of 7.56 and its median income is in the lower band, around 48K.
• North Dakota has the second-highest average birth rate with a value of 7.55. ND is slightly higher up the median income range with a figure of 55K, and holds an average death rate 2 lower than SD's.
• Utah is the third-largest entry for birth rates with an average of 7.34. It sees median income of 56K and a death rate figure 26 less than North Dakota.
• The aforementioned states with the highest average death rates (Kentucky, Tennessee, Arkansas, Mississippi, Alabama) have birth rate averages of 5 to 5.4, besides Arkansas which has an average birth rate of 5.8.
• The states with the lowest average birth rates (Connecticut, Vermont, NY, Rhode Island, Massachusetts) mostly have above-average incomes.
Median income vs. average deaths per year.
There is a +32% spike in average deaths at the median income figure of 55,686 when compared to the (second-highest) median income figure of 55,058. That 55,686 median income figure actually sees an even greater average death rate (38%) compared to its neighbouring median income figure. So there are spikes in death rates for respondents earning between 53,929 and 55,686 per annum which could be in one of a handful of states depending on mean or max value analysis.
I compiled a few days' worth of extra research for this median income bracket to ascertain where the deaths may be occurring and how (not included here, both for the sanity of the reader and to remain on-topic). There were research materials cited with links to government data sites outlining issues arising in townships where the people relative to the income groups in this dataset would live (give or take a few miles), adjusted for year-on-year income growth. Although it was more of an exercise in loose causation (verging on 'reaching'), I thought it might help to offer some possibilities.
Present in the research were instances of contaminated groundwater supplies in the case of the area surrounding Michigan thanks to a petrochemical company using Michigan's drinking water supply as a daily sludge dump, excess cancer-causing chemicals in sandstone water wells, government-warning-worthy levels of mercury found in fish caught in some infamously toxic surrounding lakes, as well as some evidence of high bronchial and lung cancer rates in townships with heavy industry. This could explain some of the incidence disproportion between certain states (where Illinois was returned as the state with the highest of all state-specific average incidence rates). It is a pity that this dataset doesn't include any geographic coordinates or the cancer type; if they were present, it wouldn't be too much of a challenge to figure out the local cause of cancer to a relatively good degree of accuracy.
Median income by death rate.
The majority of death rate per capita figures relate to members of the public in the lower median income bracket, with:
• A mean death rate of 194 in the 20K dollars - 40K dollars bracket.
• A mean death rate of 174 in the 40K dollars - 60K dollars bracket.
• A mean death rate of 162 in the 60K dollars - 80K dollars bracket.
• A mean death rate of 153 in the 80K dollars - 100K dollars bracket.
• And a mean death rate of 136 in the final 100K dollars - 120K dollars bracket.
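The bracket means above can be reproduced with a binning step along these lines. This is a sketch on invented rows; the bracket edges match the list, the labels are my own.

```python
import pandas as pd

toy = pd.DataFrame({
    "median_income": [25_000, 38_000, 45_000, 59_000, 72_000, 95_000, 110_000],
    "target_death_rate": [200.0, 190.0, 176.0, 172.0, 160.0, 150.0, 135.0],
})

edges = [20_000, 40_000, 60_000, 80_000, 100_000, 120_000]
labels = ["20K-40K", "40K-60K", "60K-80K", "80K-100K", "100K-120K"]

# Assign each row to an income bracket, then average the death rate per bracket.
toy["income_bracket"] = pd.cut(toy["median_income"], bins=edges, labels=labels)
bracket_mean = toy.groupby("income_bracket", observed=True)["target_death_rate"].mean()
print(bracket_mean)
```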
Median income vs. percent private coverage and percent private coverage alone.
The majority of people holding private coverage are in the higher median income brackets.
Median income vs. percent public coverage and percent public coverage alone.
The median income figures relative to public coverage are almost reversed, with the majority of the people in the lower median income bracket using this coverage type.
Percent public coverage by average white & black ethnicity.
Here we see the deeper blue hues representing a higher percentage of those of Black ethnicity holding the majority of public coverage in a tight group between 53 and 56 percent. There are deep blue groups in the lower public coverage values but the largest portion appears to exist between 40 and 60.
Percent private coverage by white & black ethnicity.
And there are tighter groups in the left side of the chart below 55 representing those of Black ethnicity holding less private coverage, but overall the size and spread of these values leads me to believe the majority of Black and White ethnicity values here hold private coverage.
Percent public coverage by Asian & other ethnicity.
It seems there is a large group of public coverage clusters held by those of Asian ethnicity as well as lower figures for those classed as 'Other race' spread more evenly across the public coverage range. There are deeper green hues existing between 20 and 25 percent public coverage as well as a few in the upper ranges above 50 for those of 'Other race'.
Public coverage could be quite even for both (if not all) ethnicities when looking at the mean, but seeing the visualisations shows us that certain higher ethnicity counts exist in lower coverage values and vice-versa.
Percent private coverage by Asian & other ethnicity.
Once again (as per the White and Black ethnicities) there isn't a massive difference between these two ethnicities. Light to medium green hues representing those of 'Other race' exist across the board, while the majority of Asians in this dataset seem to hold a shade more private coverage, but in larger, more condensed amounts in the upper regions.
Study per capita vs. state.
• We see Kansas, Illinois, Wisconsin and Iowa with a sum of over 20K in study per capita spending.
• Rhode Island, Nevada and the District of Columbia hold the least values.
Study per cap by poverty percent.
Splitting the state into counties to see where the majority of the study per cap exists vs. poverty percent shows 6 high (above 4500) study per capita markers below a poverty percent of the average poverty percent value of 16.8. This is a good sign for those living in a below-average poverty region; the sum of study per capita for those living in below-average poverty is 295,276.25. Although this may seem unfair to those living in above-average poverty knowing cancer affects everybody & only having a share of 168,341.62 study per capita spend, cancer does appear to affect those living in below-average poverty the most.
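A sketch of the below-average vs. above-average split, using invented rows and the mean of the toy data as the threshold (in the real data that mean is the 16.8 quoted above).

```python
import pandas as pd

toy = pd.DataFrame({
    "poverty_percent": [10.0, 12.0, 16.0, 22.0, 25.0],
    "study_per_cap": [4800.0, 0.0, 300.0, 150.0, 50.0],
})

# Threshold: the average poverty percent across all rows (17.0 for this toy data).
avg_poverty = toy["poverty_percent"].mean()
below = toy["poverty_percent"] < avg_poverty

# Total study per capita on each side of the threshold.
below_sum = toy.loc[below, "study_per_cap"].sum()
above_sum = toy.loc[~below, "study_per_cap"].sum()
print(below_sum, above_sum)
```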
Study per cap vs. death rate cont.
In a bid to get some clearer insights into the information contained in the above scatterplot, I looked at several other features to add alongside the study per capita and death rate columns. The incidence rate column - although a reasonably logical option to compare against the study per capita column due to studies being conducted as incidence rates rise - wasn't giving much insight, but average deaths per year seemed to return the most feasible information.
The largest average study_per_cap rate resides in the District of Columbia where the highest average death rate resides. Later in the EDA when we incorporate the ethnicities we will see that this is the state with the majority Black ethnicity count. The study per capita figure for this area is 66% of the average deaths per year count.
Once the rest of the states by ethnicity count and economics have been assessed, we see the American government / research agencies applying a good spread of research to each ethnicity based on a specific state's socio-economic conditions. And although there appear to be a few states with disproportionately large amounts being spent on research when compared to average deaths per year (as well as many states being overlooked), the figures make a little more sense once you include features such as "avg_ann_count" and "avg_deaths_per_year" together. GDP by state doesn't equate to high GDP == high research, nor is it a case of Red team spending vs. Blue team spending, so all things considered, with the exception of a handful of states, the cancer research will probably be sporadic depending on cancer 'hotspots', new cases / outbreaks etc.
Percent difference between study_per_cap & avg_deaths_per_year by state.
• The largest positive difference between study per capita and average deaths per year resides in Florida (+957.9%), Arizona (+941.6%) and Nevada (+624%).
• The largest negative difference between study per capita and average deaths per year resides in Kansas (-89.5%), North Dakota (-86.5%) and Montana (-85.5%).
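A sketch of one way to get a percent difference like this, assuming it is the relative difference of the per-state means, (study - deaths) / deaths * 100. Both the formula and the use of means are assumptions, and the states "A" and "B" are placeholders with invented numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "study_per_cap": [300.0, 100.0, 10.0, 30.0],
    "avg_deaths_per_year": [40.0, 60.0, 150.0, 250.0],
})

# Per-state means of both columns.
by_state = toy.groupby("state")[["study_per_cap", "avg_deaths_per_year"]].mean()

# Relative difference of study per capita against average deaths per year, in percent.
by_state["pct_diff"] = (
    (by_state["study_per_cap"] - by_state["avg_deaths_per_year"])
    / by_state["avg_deaths_per_year"] * 100
)
print(by_state)
```

Under this definition a negative value can never go below -100%, while positive values are unbounded, which fits the asymmetry between the two bullet points.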
Study per cap vs. median income.
There are mild differences in the study per capita and median income bins, with the highest average study per capita seeing the highest median income and the lowest study per cap seeing the lowest median income.
The values in the middle of the dataframe don't all follow the same pattern, however. It would appear that out of the other three median income ranges, the highest average study per cap has been applied within the 40K-60K median income range, which is where the majority of the mortalities appear to be occurring.
State by percent no high school 18-24 & percent Bachelors degree 18-24.
• Louisiana and Nevada are the two states with the highest population of those aged 18-24 with no high school education.
• Texas and Virginia are the two states with the highest population of those aged 18-24 holding Bachelors degrees.
• Hawaii is the state with the lowest population of those aged 18-24 with no high school education and the second-lowest population of Bachelors degree holders, the state with that figure is the District of Columbia.
Average household size by percent no high school 18-24 & death rate.
The light blue hues between a household size of 2.45 - 2.7 represent the minority average death rate values of those aged 18-24 having no high school education. The majority for this demographic exists in the latter, larger household sizes especially above the 3.3 average.
The rightmost bar with the deepest blue hue represents an average death rate figure of 259.1, amid that group is a cluster of 28 people with no high school education aged 18-24 and an average household size of 3.93.
Slightly further down the chart from that is an average death rate figure of 219, with a cluster of 41 people with no high school education aged 18-24 and an average household size of 3.86.
Average household size by percent Bachelor degree 18-24 & death rate.
The chart for the Bachelors degree holders shows a notable deep green bar in the average household size range of 1.91 where the death rate is 214.7. In that bar exists a group of 18 Bachelors degree holders aged 18-24.
Compared to the above chart, there are similar distributions of marginally lower deep green groups in the upper household size ranges mostly above 3.0 representing almost as many deaths as those with no high school education, spread over a wider array of household sizes (as opposed to a couple of large household sizes) besides one at a household size of 3.93 where 2 Bachelors degree holders exist in a group of 259 average deaths.
So although the differences are small, there are slightly more Bachelors degree holders more negatively affected in the lower and 'normal' average household size ranges than no high school educated people of the same age group, but slightly more of those with no high school education more negatively affected in the upper average household size ranges.
Average household size per white and black ethnicity.
The deep blue hues here represent the majority household average sizes between 2.4 and 2.65 for those of Black ethnicity, while the markers for those of White ethnicity make up smaller and larger average household sizes.
Average household size per Asian and other ethnicity.
Those of Asian ethnicity mostly make up the lower and upper average household sizes, where those classed as Other race / ethnicity appear to be clustered between an average household size of 2.4 and 2.9.
Poverty percent vs. state.
(The high school 25 & over, Bachelor's Degree 25 & over, poverty max and percent no high school 18-24 data for each state are all available on hover).
Poverty percent by state shows Georgia return a darker hue than the rest of the states but Georgia doesn't necessarily hold the highest figure for poverty percent, it is actually second to Kentucky and Missouri.
Georgia's low high school education figure, low Bachelor's degree rate and relatively high poverty max are all being taken into consideration to class it as the worst affected state, economically.
Poverty max by state, we see South Dakota, Alabama and Kentucky experiencing the largest poverty max values. Connecticut, New Hampshire and Rhode Island are the least poverty-stricken states.
For contrast between some of the states experiencing the most & least poverty, Georgia's high school 25 & over figure is 43.3 where New Hampshire's is 31.5. GA's percentage of Bachelor's Degree holders is 9.2 where New Hampshire's is 19.1. And Georgia's no high school 18-24 figure is 26.7 where New Hampshire's is 15.4. So, multiple socio-economic differences there.
Poverty percent by ethnicity.
For the respondents of White ethnicity, the majority are living below a poverty percent of 25.
Respondents of Black ethnicity make up the largest figure of Americans living in poverty, accounting for 6.7% of the below-average poverty data.
Respondents of Asian ethnicity see a more positive distribution, with almost double the amount of below-average poverty data than above average poverty data.
A different story for respondents of 'Other' race though, they make up only 1.98% of the below-average poverty data and 2.14% of the above-average poverty data.
Married households by coverage type.
As a result of the selectKBest chart at the beginning of the EDA, I assumed (read: "A semi-educated gut feeling" because I don't like to assume) that the presence of the married households column negatively affected the importance of public coverage, *possibly* due to the majority of married households having private coverage.
The size of the markers represents the public coverage percent (small == less / large == more). The darker hues represent the private coverage (lighter == less / you get the point).
We see a good spread of markers reflecting a wide array of values, although there are visible - if slight - majorities of light-hued, large markers representing more public coverage & less private coverage existing in the lower married household range, below 50. The darker-hued, smaller markers representing more private coverage & less public coverage appear to exist in the higher married household ranges above 60.
Percent employer-provided private coverage vs. death rate.
There is a reasonable spread of death rate data in the lower regions of the employer-provided coverage and a negative trendline, with darker (lower) death rate hues in the upper ranges for the provided healthcare type, so that's better news for those with employer-provided healthcare than those with public healthcare at least.
Final thoughts.
Up to this point we can safely deduce from the EDA that low income / high rates of poverty, low education levels, a lack of available research, public coverage and high incidence rates are the #1 obstacles and features when analysing cancer mortalities among all ethnicities, particularly those of Black ethnicity in this dataset. That available research does appear to be doing a reasonable job of reducing the mortalities although not to a degree that would make a notable difference at certain rates (I would like to see more recent study_per_cap > 8000 vs. mortality figures out of interest & see how things have changed over the last 8 years).
As there are differences for the socio-economic factors influencing cancer mortalities between states and ethnicities I will create two models: One single-target regression model for the target variable minus the states in a bid to get a handle on the nationwide socio-economic attributes influencing cancer mortalities, and another, MTR model which will be trained on the target variable and the different ethnicities, with the states included with the rest of the independent variables.
VIF analysis.
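As a reminder of what the VIF measures, here is a sketch using the textbook definition VIF_j = 1 / (1 - R_j^2), computed as the diagonal of the inverse correlation matrix. The data is synthetic and this is not necessarily the routine used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
poverty = rng.normal(17, 6, n)
toy = pd.DataFrame({
    "poverty_percent": poverty,
    # Deliberately collinear with poverty_percent.
    "pct_public_coverage": 20 + 1.0 * poverty + rng.normal(0, 2, n),
    # Independent of the other two columns.
    "birth_rate": rng.normal(5.6, 1.0, n),
})

# The j-th diagonal entry of the inverse correlation matrix equals 1 / (1 - R_j^2).
corr = toy.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=toy.columns)
print(vif.round(2))
```

A value near 1 means a column is barely explained by the others; the two collinear columns land far higher, which is the signal for considering dropping or combining one of them.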
Single-target regression with LightGBM.
Quantile regression - prediction intervals with LightGBM.
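For context, a sketch of the loss that quantile regression minimises (the pinball loss), written in plain NumPy rather than LightGBM. LightGBM exposes it through objective="quantile" with an alpha parameter; the 0.95 and 0.05 levels below are assumptions for illustration, as are the toy values.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss for a quantile level alpha in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - y_pred
    # Under-predictions cost alpha per unit, over-predictions cost (1 - alpha).
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

y = np.arange(1, 101, dtype=float)           # toy target values 1..100
candidates = np.arange(1, 101, dtype=float)  # constant predictions to try

def best_constant(alpha):
    """Constant prediction with the lowest pinball loss for this alpha."""
    losses = [pinball_loss(y, c, alpha) for c in candidates]
    return candidates[int(np.argmin(losses))]

upper = best_constant(0.95)  # lands near the 95th percentile of y
lower = best_constant(0.05)  # lands near the 5th percentile of y
print(lower, upper)
```

Training one model per alpha in this way gives an upper and a lower boundary, and the band between the two is the prediction interval.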
Setting the upper boundary:
Setting the lower boundary:
Plotting prediction intervals for "poverty_percent" (example):
Shap summary for single-target regression using LightGBM.
Here we see the largest of values for average deaths per year increase the prediction, which is not illogical. The larger values for population estimates decrease predictions, as stated in the EDA where the increase in population doesn't necessarily equate to an increase in cancer rates (population gain will 'outrun' the cancer figures, thankfully). The average annual count also shows lower numbers increasing predictions due to the population growth outrunning the mortality per capita rate. A high incidence rate bears an expectedly high relevance to prediction. It is the lower figures for Bachelor's Degree 25 & over which increase predictions as well as the lower median ages. As seen in the EDA we can now verify the fact that it is the large poverty values which increase predictions as well as high school educated people aged 18-24, the high school educated 25 & over to a slightly lesser degree than the 18-24 age group, and then the unemployed aged 16 and over.
With regard to ethnicities, we see a moderate to low percentage of 'Other race' and White persons, plus a moderate to high percent of Black persons increase predictions. A relatively average to low spread of Asian persons increases predictions (although being a tad more dependent on other variables than black persons and those of 'other race'). With regard to coverage, we see those with public coverage alone and those with employer-provided private healthcare increase predictions.
Other socio-economic features raising predictions include the average-low male median age group, a slightly above average family size, average to low median incomes (depending on other variables) and average-low birth rates.
Multi-target-regression using XGBoost.
Shap explainer.
(States included as independent variables).
The results using XGBoost with the Black ethnicity as the secondary target variable alongside the death rate target.
Georgia has been included as the most important state as per the EDA, some other important features are the poverty percent, unemployed 16 and over, a lower value of Bachelor degrees, public and private coverage, married households, a high average of married couples with relatively low average family sizes.