# A “parasitic” relationship: How flood risk is impacted by income and elevation

This project was created by Nikki Trueblood, Manas Khatore, Audrey Chun, and Sean Yang as a part of the General Membership program from Data Science Society @ UC Berkeley. Special thanks to Siddhant Satapathy for his mentorship throughout the process.

## Background and Hypothesis

When we first watched the 2019 award-winning film "Parasite," we were inspired by its underlying social commentary on issues such as environmental justice and socioeconomic disparities. In one scene, a torrential downpour occurs, and while the rich Park family watches the rain from the comfort of their mansion high above the city, the poor Kim family rushes downhill to save their belongings as their house floods. We wanted to test whether elevation and socioeconomic status actually factor into flood risk the way the movie depicts. Do rich people tend to live at higher elevations? Does this mean they are at a lower flood risk?

We hypothesized that those with higher median household incomes and those who live at higher elevations are at a lower risk of flooding. To test our hypothesis, we chose to gather data on the state of New York because of the plentiful data already available online, the high variation in socioeconomic status across the state, and the dense, urban population of NYC, which is similar to the city depicted in "Parasite."

Although simply fact-checking a movie would be a fun project, the potential findings from analyzing this data could demonstrate the impacts of a larger issue of environmental justice. According to E&E News, flood risks to low-income homes are set to triple by 2050, potentially affecting millions of lives. We want to understand what exactly increases flood risk so that we can best target these communities to protect them from future natural disasters.

## Data Collection and Cleaning

Below is a raw data table containing a systematic sample of coordinates in New York State (obtained from CUGIR). We then cleaned this table to obtain the longitude and latitude coordinates in float values, assigning the coordinates to an array.

```
[[-75.75 44.25 ]
[-78.75 42.25 ]
[-75.25 43.875]
...
[-73.875 41.125]
[-79. 43.375]
[-75. 41.5 ]]
```

We imported the flood risks of every U.S. congressional district, cleaning the data to obtain the flood risks for only New York congressional districts. We also deleted columns containing data that we did not need. The data were obtained from First Street Foundation.

The following two code blocks contain the latitudes, longitudes, and congressional district IDs for every New York congressional district. These data were obtained from the US Census Bureau.

Below, we combined the flood risk table and the New York coordinates table by congressional district. We needed to add coordinates to the flood risk table in order to combine it with other variables such as elevation and median household income, both of which are dependent on latitude and longitude.

We created the "closest_district" function to take in a latitude and longitude and return the closest congressional district. This way we could, for example, take a latitude/longitude coordinate for which we know the elevation and assign it to the district it belongs to.
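As a minimal sketch of what such a lookup could look like, the block below picks the district whose reference point is nearest by great-circle distance. The helper `haversine_km`, the `districts` dictionary layout, and the example coordinates are all assumptions made for illustration; the original function may have measured distance differently.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def closest_district(lat, lon, districts):
    """Return the ID of the district whose reference point is nearest to (lat, lon).

    `districts` maps a district ID to a (latitude, longitude) pair.
    """
    return min(districts, key=lambda d: haversine_km(lat, lon, *districts[d]))

# Hypothetical reference points, for illustration only (not the project's data).
example_districts = {
    "A": (40.7, -74.0),
    "B": (42.9, -78.8),
    "C": (44.0, -75.5),
}

print(closest_district(44.25, -75.75, example_districts))  # prints C
```

Note that this assigns a point to the nearest reference point, which is not always the district that actually contains the point; a point-in-polygon test against district boundaries would be more precise.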

Below is a data table we generated using Geocodio by inputting the latitude and longitude coordinates of New York's 27 congressional districts and obtaining the median household income at those coordinates.

Below, we used the Google Maps Elevation API to obtain the elevation of every coordinate pair in our "coordinates_nystate" array by taking in latitude and longitude.
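A rough sketch of how such a request might be built and its response parsed is shown below. It assumes the JSON endpoint of the Elevation API with `locations` and `key` query parameters; the function names are hypothetical, the API key is a placeholder, and the step that actually sends the request is left out. The original notebook may have used a client library instead.

```python
import json
from urllib.parse import urlencode

ELEVATION_ENDPOINT = "https://maps.googleapis.com/maps/api/elevation/json"

def build_elevation_url(points, api_key):
    """Build a request URL for a list of (latitude, longitude) pairs."""
    # Multiple locations are separated by a pipe character.
    locations = "|".join(f"{lat},{lon}" for lat, lon in points)
    query = urlencode({"locations": locations, "key": api_key})
    return f"{ELEVATION_ENDPOINT}?{query}"

def parse_elevations(response_text):
    """Extract the elevations (in metres) from the JSON body of a response."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK":
        raise ValueError(f"Elevation request failed: {payload.get('status')}")
    return [result["elevation"] for result in payload["results"]]

# Placeholder key; a real key is needed before the URL can be requested.
print(build_elevation_url([(44.25, -75.75), (42.25, -78.75)], "YOUR_API_KEY"))
```

Batching several coordinates into one request keeps the number of calls down, which matters when sampling many points across the state.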

In the next two cells, we used the "closest_district" function defined earlier to assign every coordinate in the "coordinates_nystate" array to a New York congressional district, then computed the average elevation and the average income of each district. We added the districts and average elevations to a table that previously contained only latitudes and longitudes.
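The averaging step can be sketched as a simple group-and-average over points that have already been labeled with a district. The function name and the sample values below are made up for illustration; the notebook likely did this with a table library instead.

```python
from collections import defaultdict

def average_by_district(district_ids, values):
    """Average `values` within each district; the two sequences are aligned."""
    grouped = defaultdict(list)
    for district, value in zip(district_ids, values):
        grouped[district].append(value)
    return {district: sum(vals) / len(vals) for district, vals in grouped.items()}

# Made-up district labels and elevations, for illustration only.
labels = ["A", "A", "B"]
elevations = [10.0, 30.0, 100.0]
print(average_by_district(labels, elevations))  # {'A': 20.0, 'B': 100.0}
```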

Below is our full data table containing the coordinates, median household income, elevation, and flood risk of every New York congressional district. A few income values were missing, so we filled them in manually in the following cell using data from censusreporter.org.

## Visualizations

Below is a scatterplot with a linear regression line showing the trend for elevation and median household income. We can see a slight negative association, meaning that higher elevation tends to be associated with lower income. It should be noted that many points cluster in the 0-100 elevation range; these represent areas in New York City or close to the coast.

```
The correlation coefficient is: -0.3117324290329096
```
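For reference, a correlation coefficient like the one printed above is the mean of the products of the two variables in standard units. The sketch below shows that calculation on made-up numbers; the function name and data are assumptions, not the notebook's code.

```python
import numpy as np

def correlation(x, y):
    """Pearson correlation: the mean of the products of standardized values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()  # x in standard units
    y_su = (y - y.mean()) / y.std()  # y in standard units
    return float(np.mean(x_su * y_su))

# Made-up values: y falls by 2 for every unit increase in x,
# so the correlation is -1 up to floating-point rounding.
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))
```

A value near -1 or 1 indicates a strong linear association, while values around -0.3, like the one above, indicate a weak one.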

Below is a scatterplot with a linear regression line showing the trend for median household income and flood risk. We can see a slight negative association, meaning that higher income is associated with a lower risk score.

```
The correlation coefficient is: -0.293874398296924
```

Below is a scatterplot with a linear regression line showing the trend for elevation and flood risk. We can see a slight positive association, meaning that higher elevation is associated with a higher risk score. This seems counterintuitive, as one would think that lower elevations have higher risks of flooding. However, according to FEMA, flood risk is driven by factors such as "heavy rains, poor drainage, and even nearby construction projects" rather than strictly by elevation. Districts located at lower elevations could also have better infrastructure to protect against flooding.

```
The correlation coefficient is: 0.3198979540433497
```

```
The correlation coefficient is: -0.33732699484549866
```

```
The correlation coefficient is: 0.38246588310670454
```

While the correlations did get stronger, there wasn't a drastic improvement. We decided to perform our hypothesis testing and create linear regression models using the full dataset so that we could utilize more data points.

## Hypothesis Testing

To conduct our hypothesis testing, we treat the 27 NY congressional districts as a sample of the many possible points or divisions within the state of New York (congressional district lines are regularly redrawn anyway). This allows us to bootstrap from the 27 congressional districts.

```
90% confidence interval for slope: [-0.527332, -0.0399913]
```

Because 0 is not contained in our 90% confidence interval, we reject the null hypothesis that the true slope between flood risk and income is 0 and conclude that the slope is less than 0. In other words, there is an inverse relationship between flood risk and income: as income increases, flood risk decreases.

```
90% confidence interval for slope: [-0.00505808, 0.561469]
```

Based on our 90% confidence interval, we would fail to reject our null hypothesis, since 0 is contained in our confidence interval. We cannot conclude that there is a linear relationship between elevation and flood risk for the state of New York, meaning that it would not make sense to predict the flood risk from elevation.
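The bootstrap behind intervals like the two above can be sketched as follows, assuming each resample draws 27 districts with replacement, refits a least-squares line, and records the slope. The function names, the number of resamples, the seed, and the synthetic data are illustrative assumptions; the original analysis does not state the units of its slopes, so these numbers will not match the intervals reported here.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.polyfit(x, y, 1)[0]

def bootstrap_slope_ci(x, y, n_resamples=2000, level=0.90, seed=0):
    """Percentile bootstrap confidence interval for the regression slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        slopes[i] = slope(x[idx], y[idx])
    tail = (1 - level) / 2 * 100  # 5 for a 90% interval
    low, high = np.percentile(slopes, [tail, 100 - tail])
    return float(low), float(high)

# Synthetic stand-in for 27 districts with a built-in negative relationship.
rng = np.random.default_rng(1)
income = rng.normal(size=27)
risk = -0.8 * income + 0.3 * rng.normal(size=27)

low, high = bootstrap_slope_ci(income, risk)
print(f"90% confidence interval for slope: [{low:.3f}, {high:.3f}]")
print("reject null (slope = 0):", not (low <= 0 <= high))
```

The decision rule is the one used in this section: if 0 lies outside the interval, the null hypothesis of a zero slope is rejected at the corresponding significance level.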

## Linear Regression / Modelling

For each model, we calculated the coefficient of determination (R^2) to gauge how well the model predicts the test data. Higher R^2 values correspond to a more accurate model. The model below predicts the average flood risk score from median household income; we used income as the predictor variable since we were able to conclude that there is a linear relationship between income and risk score.

```
The coefficient of determination is 0.5868231184081878
The mean squared error is 0.012569758908870915
```

We can see that the coefficient of determination from our income model is about 0.59.

To confirm the results of our hypothesis test, we decided to create a model that predicts risk score from just elevation and a combined model that incorporates both income and elevation as predictor variables. Since we cannot conclude that there is a linear relationship between elevation and risk score, we should expect the elevation and combined models to have a lower coefficient of determination than the income model.

```
The coefficient of determination is -0.12827923282518183
The mean squared error is 0.03432476154972611
```

```
The coefficient of determination is 0.4858752461007697
The mean squared error is 0.015640817513067704
```

The coefficients of determination for the elevation and combined models are about -0.13 and 0.49, respectively. This falls in line with our prediction that the income model would perform better than the elevation and combined models. Note that the negative coefficient of determination for the elevation model means that it does such a poor job of predicting the test data that a horizontal line at the mean of the test values would be a better model.
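To see how a coefficient of determination can be negative, here is a small sketch of the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares, applied to made-up test values. The function name and numbers are assumptions for illustration only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return float(1 - ss_res / ss_tot)

y_test = np.array([1.0, 2.0, 3.0, 4.0])  # made-up test values

print(r_squared(y_test, y_test))                             # 1.0, perfect predictions
print(r_squared(y_test, np.full(4, y_test.mean())))          # 0.0, horizontal line at the mean
print(r_squared(y_test, np.array([4.0, 3.0, 2.0, 1.0])))     # -3.0, worse than the mean
```

Whenever the model's squared error on the test data exceeds that of simply predicting the mean, the ratio is greater than 1 and the score drops below zero.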

## Conclusion

What does all of this code mean? To circle back to our original hypothesis: it was only partly supported, because we were correct about one relationship and incorrect about the other.

For income and flood risk, we observed a negative correlation coefficient, a confidence interval that didn’t contain zero, and a higher coefficient of determination, so we concluded that those with higher income have a lower flood risk. However, for elevation and flood risk, we observed a positive correlation coefficient, a confidence interval that included zero, and a lower coefficient of determination, so we concluded that the relationship between elevation and flood risk is very weak if not nonexistent, at least in the state of New York. Ultimately, we found that income is a better predictor of flood risk than elevation, which was really surprising to us since it seemed like a given that lower elevations would flood more easily.

Now let’s discuss some limitations and the corresponding future analysis we could do with this project. The biggest issue we encountered was that, because we used New York congressional districts, we had only 27 data points. Each point is a district-level average, so what we observed are ecological correlations: we weren’t able to see the distributions or variance within each individual district. Furthermore, our machine learning model and hypothesis testing were less effective than they could have been, as there was simply not enough data to make very accurate predictions or conclusions. A resolution to this would be to analyze smaller divisions of New York such as cities or to increase the scale of the project beyond New York. We also obtained our data from inconsistent sources, as we filled in NaN values from a different source than our original dataset’s source. A resolution to this would be to find a dataset with more complete information. Lastly, we only examined two variables in this project: elevation and income. There are likely many other confounding variables that impact flood risk, such as whether the location is urban or rural, the education level of the residents in the area, or the quality of infrastructure in the area. A resolution to this would be to find data to explore these other variables.

Ultimately, our findings reinforce the importance of addressing environmental justice issues in our country. As the prevalence of natural disasters continues to increase, it is essential that we look at which communities will be impacted the most. As we talked about in the introduction, this is a huge issue that can potentially affect millions of lives. And although in "Parasite," Mr. Kim says that the best kind of plan is no plan, the only way we can soften the blow of natural disasters to low-income communities is to create a plan to help them.