A “parasitic” relationship: How flood risk is impacted by income and elevation

This project was created by Nikki Trueblood, Manas Khatore, Audrey Chun, and Sean Yang as a part of the General Membership program from Data Science Society @ UC Berkeley. Special thanks to Siddhant Satapathy for his mentorship throughout the process.

Background and Hypothesis

When we first watched the 2019 award-winning film "Parasite," we were inspired by its underlying social commentary on issues such as environmental justice and socioeconomic disparities. In one scene, a torrential downpour occurs, and while the rich Park family watches the rain from the comfort of their mansion high above the city, the poor Kim family rushes downhill to save their belongings as their house floods. We wanted to test whether the depictions of elevation and socioeconomic status actually play factors in flood risk as seen in the movie. Do rich people tend to live at higher elevations? Does this mean they are at a lower flood risk?

We hypothesized that those with higher median household incomes and those who live at higher elevations are at a lower risk of flooding. To test our hypothesis, we chose to gather data on the state of New York due to New York's plentiful pre-existing data online, the high variation in socioeconomic status in New York, and the dense, urban population in NYC, which is similar to the city depicted in "Parasite."

Although simply fact-checking a movie would be a fun project, the potential findings from analyzing this data could demonstrate the impacts of a larger issue of environmental justice. According to E&E News, flood risks to low-income homes are set to triple by 2050, potentially affecting millions of lives. We want to understand what exactly increases flood risk so that we can best target these communities to protect them from future natural disasters.

# importing the libraries used in the deepnote from datascience import * import pandas as pd import wseaborn as sns import numpy as np import matplotlib import matplotlib.pyplot as plt import array from sklearn.linear_model import LinearRegression from sklearn.preprocessing import PolynomialFeatures from sklearn.model_selection import train_test_split from sklearn import metrics

Data Collection and Cleaning

Below is a raw data table containing a systematic sample of coordinates in New York State (obtained from CUGIR). We then cleaned this table to obtain the longitude and latitude coordinates in float values, assigning the coordinates to an array.

# table with New York state coordinates # defines get_coordinates that extracts the first coordinate pair for each row in coordinates column raw_coordinates = Table().read_table('convertcsv.csv') raw_coordinates.show(5) coordinates_nystate = raw_coordinates.select('coordinates') def get_coordinates(x): x = x.split(",") return [float(x[0]), float(x[1])] coordinates_nystate = coordinates_nystate.apply(get_coordinates, "coordinates") print(coordinates_nystate)

We imported the flood risks of every U.S. congressional district, cleaning the data to obtain the flood risks for only New York congressional districts. We also deleted columns containing data that we did not need. The data were obtained from First Street Foundation.

# cleaned table to only show New York congressional districts and average risk score flood_risk = Table().read_table('CongDist_level_risk.csv') flood_risk_ny = flood_risk.select("CDID", "name", "avg_risk_score_all").where('name', are.containing("New York")) flood_risk_ny.show(10)

The following two code blocks contain the latitudes, longitudes, and congressional district ID's for every New York congressional district. This data was obtained from the US Census Bureau.

#latitudes of each congressional district latitudes = make_array(40.983872, 40.685161, 40.860496, 40.636948, 40.596562, 40.734862, 40.690959, 40.621438, 40.642184, 40.703222, 40.566974, 40.751026, 40.838862, 40.809782, 40.824031, 40.939735, 41.159532, 41.411657, 42.227811, 42.748882, 44.008384, 42.870694, 42.344244, 43.210319, 43.323395, 42.961825, 43.00085) #longitudes of each congressional district longitudes = make_array(-72.549515, -73.294978, -73.505379, -73.60591, -73.827212, -73.823455, -73.910504, -73.911077, -73.947201, -74.006292, -74.126314, -73.95042, -73.932982, -73.837004, -73.891163, -73.791088, -73.919355, -74.080981, -74.425633, -73.988366, -74.609231, -75.662664, -77.845808, -76.648094, -77.715489, -78.840983, -78.316882)

#combining congressional district IDs (CDID), latitudes, and longitudes together in one table CDID_array = np.arange(3601, 3628) coordinates = Table().with_columns("CDID", CDID_array, "latitude", latitudes, "longitude", longitudes) coordinates.show(27)

Below, we combined the flood risk table and the New York coordinates table by congressional district. We needed to add coordinates to the flood risk table in order to combine it with other variables such as elevation and median household income, both of which are dependent on latitude and longitude.

flood_risk_with_coordinates = flood_risk_ny.join("CDID", coordinates) latitudes = flood_risk_with_coordinates.column('latitude') flood_risk_coordinates = flood_risk_with_coordinates.drop('latitude').with_column('latitude', latitudes) longitudes = flood_risk_with_coordinates.column('longitude') flood_risk_coordinates = flood_risk_coordinates.drop('longitude').with_column('longitude', longitudes) flood_risk_coordinates.show(27)

We created the "closest_district" function to allow us to take in a latitude and longitude and return the closest congressional district. This way we could, for example, take in a latitude/longitude coordinate that tells us the elevation of a place in New York State and assigns it to which district it belongs to.

def closest_district(lat, lon): cd_latitude = flood_risk_coordinates.column('latitude') cd_longitude = flood_risk_coordinates.column("longitude") lat_diffs = abs(cd_latitude - lat) long_diffs = abs(cd_longitude - lon) total_diffs = np.sqrt(lat_diffs**2 + long_diffs **2) total_diffs_list = total_diffs.tolist() min_total_diff = min(total_diffs) min_index = total_diffs_list.index(min_total_diff) return flood_risk_coordinates.column("name").item(min_index) closest_district(42.678, -73.982)

Below is a data table we generated using Geocodio by inputting the latitude and longitude coordinates of New York's 27 congressional districts and obtaining the median household income at those coordinates.

# table generated from Geocodio; shows median household income for each lat/long pair income = Table().read_table('NY_lats_longs_income.csv') income = income.select("latitude", "longitude", "ACS Economics/Median household income/Total/Value") income.show(27)

Below, we used a Google Maps Elevation API to obtain the elevation of every coordinate pair in our "coordinates_nystate" array by taking in latitude and longitude.

import requests elevations_arr = [] for x in coordinates_nystate: lon = str(x[0]) lat = str(x[1]) url = "https://maps.googleapis.com/maps/api/elevation/json?locations=" + lat + "%2C" + lon + "&key=AIzaSyDiBJZcweY090zri7RJb_IvF7i0dXSd-ik" payload={} headers = {} response = requests.request("GET", url, headers=headers, data=payload) # print(response.text) elevations_arr.append(dict(eval(response.text))['results'][0]['elevation']) elevations_arr

In the next two cells, we used the "closest_district" function defined earlier to find the average elevation and the average income of every New York Congressional District from the "coordinates_nystate" array. We then added the districts and elevations to a table containing only latitudes and longitudes. We created a table that contains the lat/longs and average elevations of each of the districts.

longitudes = [x[0] for x in coordinates_nystate] latitudes = [x[1] for x in coordinates_nystate] coordinates_and_districts = Table().with_columns('latitudes', latitudes, 'longitudes', longitudes) new_districts = coordinates_and_districts.apply(closest_district, 'latitudes', 'longitudes') coordinates_and_districts = coordinates_and_districts.with_columns('cd', new_districts, 'elevations', elevations_arr).group('cd', np.average) coordinates_and_districts.show(10)

# income.apply(closest_district, "latitude", "longitude") # this table is missing districts 7, 12, 14 income = income.with_column("cd", income.apply(closest_district, "latitude", "longitude")) most_districts = coordinates_and_districts.join("cd", income).drop("latitude", "longitude") most_districts = most_districts.join('cd', flood_risk_coordinates, "name") most_districts = most_districts.drop("CDID", "latitude", "longitude") most_districts = most_districts.relabeled("latitudes average", "latitude").relabeled("longitudes average", "longitude").relabeled("elevations average", "elevation") most_districts.show()

ny_missing_districts = flood_risk_coordinates.take(6, 11, 13) missing_lats = ny_missing_districts.column("latitude") missing_longs = ny_missing_districts.column("longitude") missing_elevs = [] for i in np.arange(3): lon = str(missing_longs.item(i)) lat = str(missing_lats.item(i)) url = "https://maps.googleapis.com/maps/api/elevation/json?locations=" + lat + "%2C" + lon + "&key=AIzaSyDiBJZcweY090zri7RJb_IvF7i0dXSd-ik" payload={} headers = {} response = requests.request("GET", url, headers=headers, data=payload) # print(response.text) missing_elevs.append(dict(eval(response.text))['results'][0]['elevation']) ny_missing_districts = ny_missing_districts.with_column("elevation", missing_elevs) ny_missing_districts = ny_missing_districts.join("name", income, "cd").drop("CDID", "latitude_2", "longitude_2") ny_missing_districts

Below is our full data table containing the coordinates, median household income, elevation, and flood risk of every New York Congressional district. However, there were a few missing values so we filled in the missing income values manually in the following cell using data from censusreporter.org.

# adding ny_missing_districts to the most_districts table cds = np.append(most_districts.column("cd"), ny_missing_districts.column("name")) lats = np.append(most_districts.column("latitude"), ny_missing_districts.column("latitude")) longs = np.append(most_districts.column("longitude"), ny_missing_districts.column("longitude")) incomes = np.append(most_districts.column("ACS Economics/Median household income/Total/Value"), ny_missing_districts.column("ACS Economics/Median household income/Total/Value")) elevs = np.append(most_districts.column("elevation"), ny_missing_districts.column("elevation")) risk_scores = np.append(most_districts.column("avg_risk_score_all"), ny_missing_districts.column("avg_risk_score_all")) full_data = Table().with_columns("cd", cds, "latitude", lats, "longitude", longs, "median household income", incomes, "elevation", elevs, "average risk score", risk_scores) full_data_df = full_data.to_df() full_data_df

# Filling in the missing values of income from censusreporter.org full_data_df.loc[3, 'median household income'] = 46298 full_data_df.loc[11, 'median household income'] = 57320 full_data_df.loc[25, 'median household income'] = 61336 full_data_df

%matplotlib inline

Visualizations

Below is a scatterplot with a linear regression line showing the trend for elevation and median household income. We can see a slight negative association, meaning that higher elevation tends to be associated with lower income. It should be noted that many points are close to the 0-100 range that represent areas in New York City or close to the coast.

fig, ax = plt.subplots(figsize = (20,10)) plt.scatter(data = full_data_df, x = "elevation", y = "median household income") plt.xlabel("elevation") plt.ylabel("median household income") elevation_and_income = full_data_df[["elevation", "median household income"]] sns.regplot(elevation_and_income['elevation'], elevation_and_income['median household income']) print("The correlation coefficient is: " + str(elevation_and_income.corr(method ='pearson').iloc[0, 1]))

Below is a scatterplot with a linear regression line showing the trend for median household income and flood risk. We can see a slight negative association, meaning that higher income is associated with a lower risk score.

fig, ax = plt.subplots(figsize = (20,10)) plt.scatter(data = full_data_df, x = "median household income", y = "average risk score") plt.xlabel("median household income") plt.ylabel("average risk score") income_and_risk = full_data_df[["median household income", "average risk score"]] sns.regplot(income_and_risk['median household income'], income_and_risk['average risk score']) print("The correlation coefficient is: " + str(income_and_risk.corr(method ='pearson').iloc[0, 1]))

Below is a scatterplot with a linear regression line showing the trend for elevation and flood risk. We can see a slight positive association, meaning that higher elevation is associated with a higher risk score. This seems counterintuitive, as one would think that lower elevations have higher risks of flooding. However, according to FEMA, "heavy rains, poor drainage, and even nearby construction projects" increase flood risk rather than strictly elevation. Districts located at lower elevations could also have better infrastructure to prevent against flooding.

fig, ax = plt.subplots(figsize = (20,10)) plt.scatter(data = full_data_df, x = "elevation", y = "average risk score") plt.xlabel("elevation") plt.ylabel("average risk score") elevation_and_risk = full_data_df[["elevation", "average risk score"]] sns.regplot(elevation_and_risk['elevation'], elevation_and_risk['average risk score']) print("The correlation coefficient is: " + str(elevation_and_risk.corr(method ='pearson').iloc[0, 1]))

# creates a new dataframe without Districts 5-15 data_df_no_nyc = full_data_df.drop([20, 21, 26, 22, 23, 1, 2, 3, 4, 24, 25])

fig, ax = plt.subplots(figsize = (20,10)) plt.scatter(data = data_df_no_nyc, x = "median household income", y = "average risk score") plt.xlabel("median household income") plt.ylabel("average risk score") income_and_risk = data_df_no_nyc[["median household income", "average risk score"]] sns.regplot(income_and_risk['median household income'], income_and_risk['average risk score']) print("The correlation coefficient is: " + str(income_and_risk.corr(method ='pearson').iloc[0, 1]))

fig, ax = plt.subplots(figsize = (20,10)) plt.scatter(data = data_df_no_nyc, x = "elevation", y = "average risk score") plt.xlabel("elevation") plt.ylabel("average risk score") elevation_and_risk = data_df_no_nyc[["elevation", "average risk score"]] sns.regplot(elevation_and_risk['elevation'], elevation_and_risk['average risk score']) print("The correlation coefficient is: " + str(elevation_and_risk.corr(method ='pearson').iloc[0, 1]))

While the correlations did get stronger, there wasn't a drastic improvement. We decided to perform our hypothesis testing and create linear regression models using the full dataset so that we could utilize more data points.

Hypothesis Testing

In order to conduct our hypothesis testing, we are going to treat the 27 NY congressional districts as a sample of many different points or divisions within the state of New York (since congressional district lines are consistently adjusted anyways). This will allow us to bootstrap from the 27 congressional districts.

def standard_units(arr): mean = np.mean(arr) stddev = np.std(arr) return (arr - mean)/stddev standardized_income = standard_units(full_data_df['median household income']) standardized_risk = standard_units(full_data_df['average risk score']) standardized_tbl1 = Table().with_columns('median household income', standardized_income, 'average risk score', standardized_risk) def correlation(tbl, x_col, y_col): x_std = standard_units(tbl.column(x_col)) y_std = standard_units(tbl.column(y_col)) return np.average(x_std*y_std) def fit_line(tbl, x_col, y_col): r = correlation(tbl, x_col, y_col) x_std = np.std(tbl.column(x_col)) y_std = np.std(tbl.column(y_col)) x_mean = np.mean(tbl.column(x_col)) y_mean = np.mean(tbl.column(y_col)) s = r*y_std/x_std return make_array(s, (y_mean-s*x_mean)) resample_slopes1 = [] for i in np.arange(10000): table_resample1 = standardized_tbl1.sample() resample_line1 = fit_line(table_resample1, "median household income", "average risk score") resample_slope1 = resample_line1[0] resample_slopes1 = np.append(resample_slopes1, resample_slope1) lower_end = percentile(5, resample_slopes1) higher_end = percentile(95, resample_slopes1) print("90% confidence interval for slope: [{:g}, {:g}]".format(lower_end, higher_end)) Table().with_column("Slope estimate", resample_slopes1).hist() plt.scatter(lower_end, -0.002, color='red', s=40); plt.scatter(higher_end, -0.002, color='red', s=40);

Based on our 90% confidence interval, we will reject our null hypothesis and conclude that the true slope between flood risk and income is less than 0 because 0 is not contained in our confidence interval. Thus, we can conclude that there is an inverse relationship between flood risk and income; as income increases, flood risk decreases.

standardized_elevation = standard_units(full_data_df['elevation']) standardized_risk = standard_units(full_data_df['average risk score']) standardized_tbl2 = Table().with_columns('elevation', standardized_elevation, 'average risk score', standardized_risk) resample_slopes2 = [] for i in np.arange(10000): table_resample2 = standardized_tbl2.sample() resample_line2 = fit_line(table_resample2, "elevation", "average risk score") resample_slope2 = resample_line2[0] resample_slopes2 = np.append(resample_slopes2, resample_slope2) lower_end = percentile(5, resample_slopes2) higher_end = percentile(95, resample_slopes2) print("90% confidence interval for slope: [{:g}, {:g}]".format(lower_end, higher_end)) Table().with_column("Slope estimate", resample_slopes2).hist() plt.scatter(lower_end, -0.002, color='red', s=40); plt.scatter(higher_end, -0.002, color='red', s=40);

Based on our 90% confidence interval, we would fail to reject our null hypothesis, since 0 is contained in our confidence interval. We cannot conclude that there is a linear relationship between elevation and flood risk for the state of New York, meaning that it would not make sense to predict the flood risk from elevation.

Linear Regression / Modelling

# allots 90% of full_data_df to be training data for the model to "learn" on # allots 10% of full_data_df to be test data to determine how accurate the model is train_data, test_data = train_test_split(full_data_df, test_size = 0.10, random_state = 43)

For each model, we calculated the coefficient of determination (R^2) to gauge how well the model was doing at predicting the test data. Higher R^2 values correspond to a more accurate model. The model below predicts the average flood risk score from median household income; we used income as the predictor variable since we were able to conclude that there is a linear relationship between income and risk score.

# Linear model with income as the predictor variable X_train_inc = train_data[['median household income']] y_train_inc = train_data['average risk score'] X_test_inc = test_data[['median household income']] y_test_inc = test_data['average risk score'] income_model = LinearRegression(fit_intercept=True) income_model.fit(X_train_inc, y_train_inc) y_pred_inc = income_model.predict(X_test_inc) print("The coefficient of determination is " + str(metrics.r2_score(y_test_inc, y_pred_inc))) print("The mean squared error is " + str(mean_squared_error(y_test_inc, y_pred_inc)))

We can see that the coefficient of determination from our income model is about 0.59.

To confirm the results of our hypothesis test, we decided to create a model that predicts risk score from just elevation and a combined model that incorporates both income and elevation as predictor variables. Since we cannot conclude that there is a linear relationship between elevation and risk score, we should expect the elevation and combined models to have a lower coefficient of determination than the income model.

# Linear model with just elevation as the predictor variable X_train_elev = train_data[['elevation']] y_train_elev = train_data['average risk score'] X_test_elev = test_data[['elevation']] y_test_elev = test_data['average risk score'] elevation_model = LinearRegression(fit_intercept=True) elevation_model.fit(X_train_elev, y_train_elev) y_pred_elev = elevation_model.predict(X_test_elev) print("The coefficient of determination is " + str(metrics.r2_score(y_test_elev, y_pred_elev))) print("The mean squared error is " + str(mean_squared_error(y_test_elev, y_pred_elev)))

# Linear model with income and elevation as predictor variables X_train = train_data[[ 'median household income', 'elevation']] y_train = train_data['average risk score'] X_test = test_data[['median household income','elevation']] y_test = test_data['average risk score'] combined_model = LinearRegression() combined_model.fit(X_train, y_train) y_pred = combined_model.predict(X_test) print("The coefficient of determination is " + str(metrics.r2_score(y_test, y_pred))) print("The mean squared error is " + str(mean_squared_error(y_test, y_pred)))

The coefficients of determination for the elevation and combined models are -0.12 and 0.49 respectively. This falls in line with our prediction that the income model would perform better than the elevation and combined models. Note that the negative coefficient of determination for the elevation model means that our model does such a poor job at fitting the data that even a horizontal line would act as a better model.

Conclusion

What does all of this code mean? To circle back to our original hypothesis, we both accept and reject our hypothesis because we were correct about one relationship and incorrect about the other.

For income and flood risk, we observed a negative correlation coefficient, a confidence interval that didn’t contain zero, and a higher coefficient of determination, so we concluded that those with higher income have a lower flood risk. However, for elevation and flood risk, we observed a negative correlation coefficient, a confidence interval that included zero, and a lower coefficient of determination, so we concluded that the relationship between elevation and flood risk is very weak if not nonexistent, at least in the state of New York. Ultimately, we found that income is a better factor in predicting flood risk than elevation, which was really surprising to us since it seemed like a given that lower elevations would flood more easily.

Now let’s discuss some limitations and corresponding future analysis we could do with this project. The biggest issue we encountered was that as a result of us using the New York congressional districts, we had only 27 data points. This caused us to observe ecological correlation, meaning that we weren’t able to see the distributions or variance within each individual county. Furthermore, our machine learning model and hypothesis testing were less effective as they could have been as there was simply not enough data to make very accurate predictions or conclusions. A resolution to this would be to analyze smaller divisions of New York such as cities or to increase the scale of the project beyond New York. We also obtained our data from inconsistent sources as we filled in nan values with a different source than our original dataset’s source. A resolution to this would be to find a dataset with more complete information. Lastly, we only examined two variables in this project - elevation and income. There are likely many other confounding variables that impact flood risk such as whether the location is urban or rural, the education level of the residents in the area, or the quality of infrastructure in the area. A resolution to this would be to find data to explore these other variables.

Ultimately, our findings further proved the importance of addressing environmental justice issues in our country. As the prevalence of natural disasters continues to increase, it is essential that we look at which communities will be impacted the most. As we talked about in the introduction, this is a huge issue that can potentially affect millions of lives. And although in "Parasite," Mr. Kim says that the best kind of plan is no plan, the only way we can soften the blow of natural disasters to low-income communities is to create a plan to help them.