# Assignment 1 - DAT405

Emil Josefsson & Tuyen Ngo

## Time spent per person:

Emil Josefsson: 12 h

Tuyen Ngo: 12 h

### 1A: Writing a Python scatter plot

#### Motivations and assumptions:

We plot data for the year 2016 since this is the latest year in our dataset that contains GDP per capita data. Life expectancy contains more recent years, but the GDP data constrains this. Since we filter on the year 2016 for both datasets, we only need to combine them based on the entries' country code.
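
The filter-then-combine step described above can be sketched with `pandas` as follows. This is a minimal illustration, not necessarily the code behind our plot: the column names (`Code`, `Year`, `LifeExpectancy`, `GdpPerCapita`), the helper `merge_for_year`, and the toy rows are all assumptions made for the example.

```python
import pandas as pd

# Toy frames with fictional entities; values are placeholders, not real data.
life = pd.DataFrame({
    "Entity": ["Aland", "Bland", "Continent X", "Cland"],
    "Code": ["AAA", "BBB", None, "CCC"],
    "Year": [2016, 2016, 2016, 2016],
    "LifeExpectancy": [81.0, 74.5, 62.0, 70.0],
})
gdp_pc = pd.DataFrame({
    "Entity": ["Aland", "Bland", "Aland", "Dland"],
    "Code": ["AAA", "BBB", "AAA", "DDD"],
    "Year": [2016, 2016, 2015, 2016],
    "GdpPerCapita": [45000.0, 12000.0, 44000.0, 3000.0],
})


def merge_for_year(life_df, gdp_df, year):
    """Keep one year from each frame and join the rows on country code."""
    # Rows without a country code (e.g. aggregates) are dropped first.
    life_y = life_df[life_df["Year"] == year].dropna(subset=["Code"])
    gdp_y = gdp_df[gdp_df["Year"] == year].dropna(subset=["Code"])
    # An inner join keeps only countries present in both datasets.
    return life_y.merge(gdp_y[["Code", "GdpPerCapita"]], on="Code", how="inner")


merged = merge_for_year(life, gdp_pc, 2016)
print(merged[["Code", "LifeExpectancy", "GdpPerCapita"]])
```

Because both frames are already restricted to a single year, the country code alone is enough as the join key.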

We choose life expectancy for the Y axis and GDP per capita for the X axis. The reason for this is that we hypothesize GDP per capita can be a strong factor in life expectancy, but not the other way around. By using an appropriate regression model on the data, we can see life expectancy as a function of GDP per capita, $f(x) = y$. We will explain this more in depth in 1b).

### 1B: Discussing results

A quick glance at the plot shows that there is at
least some correlation between GDP per capita and life expectancy.
Using linear regression from `sklearn` (see below), we can fit
the data to a logarithmic curve. While the fit is not very tight,
it shows that as GDP per capita increases, so does life
expectancy. However, life expectancy levels off at around
83 years, despite further increases in GDP per capita. This is evident
from the outlier on the far right: Qatar, with a GDP per capita of
$140,000, has a life expectancy of 79.87 years.
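
One common way to fit a logarithmic curve with a linear model is to regress life expectancy on the logarithm of GDP per capita, i.e. to fit $y = a \ln(x) + b$. The sketch below uses `numpy.polyfit` as a stand-in for the `sklearn` regression mentioned above (both solve the same least-squares problem for a straight line). The helper name `fit_log_curve` is hypothetical and the demo points are synthetic.

```python
import numpy as np


def fit_log_curve(gdp_per_capita, life_expectancy):
    """Least-squares fit of y = a * ln(x) + b; returns (a, b)."""
    log_x = np.log(np.asarray(gdp_per_capita, dtype=float))
    y = np.asarray(life_expectancy, dtype=float)
    # A degree-1 polynomial in ln(x) is a straight line in log space.
    a, b = np.polyfit(log_x, y, 1)
    return a, b


# Synthetic points that lie exactly on y = 5 ln(x) + 30 (not real data).
x_demo = np.array([1_000.0, 5_000.0, 20_000.0, 80_000.0])
y_demo = 5.0 * np.log(x_demo) + 30.0

a, b = fit_log_curve(x_demo, y_demo)
print(a, b)
```

With this form, doubling GDP per capita always adds the same amount, $a \ln 2$ years, to the predicted life expectancy, so each additional year of predicted life expectancy requires ever larger increases in GDP per capita.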

The results seem reasonable: a higher GDP per capita indicates, in rough terms, 'more money per person', and thus a richer country. A rich country is more likely to have a developed welfare system with healthcare access, and thus higher life expectancy. We assume the direction of influence is $X \rightarrow Y$ and not vice versa: high life expectancy does not cause higher GDP per capita.

As mentioned above, life expectancy levels off, and this is also reasonable since humans are still subject to aging. Currently, there is no way to reverse aging, so beyond a GDP per capita of around $25,000 the countries' life expectancies are all around 80 years. However, it doesn't seem too far-fetched that a richer country would be the first to develop technology that reverses aging in the future, which would make our data look more like a linear function.

### 1C: Data Cleaning

The only data entries that were not included were those that did not have both X and Y data. In other words, some countries lack GDP per capita data for 2016, resulting in a null X value but a defined Y value. Entries without a country code were also removed: for instance, "Africa" has no country code, and since Africa is a continent, not a country, it shouldn't be included in our analysis. We used the same approach for the GDP dataset, as it would not be meaningful to keep data points that exist only in the GDP dataset but not in the GDP per capita dataset. Otherwise, we wouldn't be able to compare our results in f) and g) properly.

### 1D: Countries with life expectancy higher than one standard deviation above the mean
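
The threshold named in this heading can be sketched as below. It is a minimal illustration under stated assumptions: the column names, the helper `above_one_std`, and the toy values are hypothetical, and `pandas` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd


def above_one_std(df, col="LifeExpectancy"):
    """Rows whose `col` is more than one standard deviation above the mean."""
    threshold = df[col].mean() + df[col].std()  # sample std (ddof=1)
    return df[df[col] > threshold]


# Fictional entities with placeholder values (not real data).
toy = pd.DataFrame({
    "Code": ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
    "LifeExpectancy": [60.0, 70.0, 70.0, 70.0, 70.0, 85.0],
})

result = above_one_std(toy)
print(result)
```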

### 1E: Countries with high life expectancy but low GDP

This question is quite ambiguous: how does one define high life expectancy and low GDP? It is definitely possible to get different results depending on how the constraints are defined. For this reason, we showcase four different cases in which we define the constraints differently, and therefore obtain different results.

#### 1E: Approach 1

- We define *high life expectancy* as at least one standard deviation above the mean life expectancy, just as in 1d).
- We define *low GDP* as at least one standard deviation below the mean GDP in the dataset, in a similar fashion to how we define life expectancy.

##### Results

There are **zero countries found using this approach**. This is
because the GDP dataset is heavily right-skewed (most countries sit at the low end, with a
long tail of very large economies) and its spread is enormous (as shown in the histogram and
boxplot above), so the standard deviation is larger than
the mean itself. **This means that we are looking for countries
with a negative GDP** (any country whose GDP is less than
$-1,138,235,455,920.2998$ USD). Since such countries don't
exist, we need to redefine what constitutes a low GDP.
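
To see why "mean minus one standard deviation" can become negative for data with a long tail, consider the small made-up sample below. The values are arbitrary and only mimic the shape of the GDP distribution; they are not taken from our dataset.

```python
import statistics

# Made-up values (arbitrary units): four small economies and one very large one.
gdp_sample = [1, 2, 3, 4, 1000]

mean = statistics.mean(gdp_sample)
std = statistics.stdev(gdp_sample)  # sample standard deviation
low_gdp_threshold = mean - std

# No value can fall below a negative threshold when all values are positive.
matches = [g for g in gdp_sample if g <= low_gdp_threshold]
print(mean, std, low_gdp_threshold, matches)
```

A single very large value pulls the standard deviation above the mean, so the threshold drops below zero and the filter returns nothing.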

#### 1E: Approach 2

- We define *high life expectancy* as at least one standard deviation above the mean life expectancy, just as in 1d).
- We define *low GDP* as at least one standard deviation below the mean **for the countries with the highest life expectancy**; the variable derived in 1d).

##### Results

Once again, **no countries are found** because our
*low GDP* definition results in a negative value (which
is invalid), despite deriving it from the list of countries
with a high life expectancy. It seems that the GDP data
among these countries is too spread out as well.

#### 1E: Approach 3

Since the spread of GDP is too big, both among all countries and among
countries with high life expectancy, we do not get any meaningful
data using the mean together with standard deviation. One standard deviation
below the mean GDP results in a negative GDP, which no country has.
Instead, we will define *low GDP* as any value below the $1^{st}$ quartile
of the GDP data.

- *High life expectancy*: At least one standard deviation above the mean life expectancy
- *Low GDP*: Any value below the $1^{st}$ quartile of the GDP data.

##### Results

Unsurprisingly, **no countries are found** again, because there is no overlap
between our two definitions. All countries that have a high life expectancy
have a GDP above the $1^{st}$ quartile of the GDP data.

To further illustrate our findings, we draw a scatter plot of "Life expectancy vs GDP" in which all countries with high life expectancy are colored yellow. We also plot an orange line to show where the lower quartile of all GDP values is. As seen in the zoomed-in graph, the yellow data points fall just to the right of the limit, and there are no yellow data points to the left of it (which would mark a country as having a low GDP). Thus, no countries are found using the aforementioned definitions.

#### 1E: Approach 4 (Final)

- *High life expectancy*: At least one standard deviation above the mean life expectancy
- *Low GDP*: Any value below the $1^{st}$ quartile of the GDP data **among countries with high life expectancy.**
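
The difference between Approach 3 and Approach 4 is only which set of countries the first quartile is computed over. A minimal sketch, assuming hypothetical column names (`LifeExpectancy`, `Gdp`), a hypothetical helper `high_le_low_gdp`, and made-up toy values:

```python
import pandas as pd


def high_le_low_gdp(df, within_high_le=False):
    """Countries with high life expectancy whose GDP is below the first quartile.

    within_high_le=False: quartile over all countries (Approach 3).
    within_high_le=True:  quartile over high-life-expectancy countries only (Approach 4).
    """
    le = df["LifeExpectancy"]
    high_le = df[le >= le.mean() + le.std()]
    base = high_le if within_high_le else df
    q1 = base["Gdp"].quantile(0.25)
    return high_le[high_le["Gdp"] < q1]


# Fictional entities with placeholder values (not real data).
toy = pd.DataFrame({
    "Code": list("ABCDEFGHI"),
    "LifeExpectancy": [58.0, 60.0, 62.0, 59.0, 61.0, 60.0, 85.0, 86.0, 87.0],
    "Gdp": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 50.0, 900.0],
})

approach_3 = high_le_low_gdp(toy, within_high_le=False)
approach_4 = high_le_low_gdp(toy, within_high_le=True)
print(len(approach_3), list(approach_4["Code"]))
```

In this toy data the global quartile lies below every high-life-expectancy country, so Approach 3 finds nothing, while the quartile computed within that group singles out its smallest economy.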

##### Results:

### 1F: Does every strong economy (normally indicated by GDP) have high life expectancy?

As in 1E, it is more meaningful to look at a boxplot using medians and quartiles
rather than the mean and standard deviation, because the GDP data is too skewed for the latter. This is
further illustrated by the graph below, plotted by our helper function `_gdp_le_scatter_plot`.
The dashed green line is the GDP value that is one standard deviation above the mean. If
we use this as our definition of a strong economy, only one country fulfils this criterion,
which is not very meaningful.

Instead, for this task:

- We define a strong economy (high GDP) as values between the upper quartile ($Q_3$) and the max.
- The data for life expectancy is not as skewed and will be defined as one standard deviation above the mean.

To answer the question, we check whether there is **any country that has a high GDP as defined above
but does not meet the high life expectancy definition.** If there exists *any* such country,
then not *every* strong economy has a high life expectancy. We use the graph below to visually
find out if there exists any country that is blue and to the right of our high GDP definition
(the solid orange line).
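
The same check can also be expressed as a filter instead of being read off the plot. The sketch below is an illustration only: the column names, the helper `strong_without_high_le`, and the toy values are assumptions, and `gdp_col` is a parameter so that a per-capita column could be passed instead.

```python
import pandas as pd


def strong_without_high_le(df, gdp_col="Gdp"):
    """Countries at or above Q3 of `gdp_col` that miss the high-life-expectancy threshold."""
    q3 = df[gdp_col].quantile(0.75)
    le = df["LifeExpectancy"]
    le_threshold = le.mean() + le.std()
    mask = (df[gdp_col] >= q3) & (le < le_threshold)
    # Strongest economies first.
    return df[mask].sort_values(gdp_col, ascending=False)


# Fictional entities with placeholder values (not real data).
toy = pd.DataFrame({
    "Code": list("ABCDEFGHI"),
    "LifeExpectancy": [58.0, 60.0, 62.0, 59.0, 61.0, 60.0, 85.0, 86.0, 87.0],
    "Gdp": [1.0, 2.0, 3.0, 4.0, 5.0, 2000.0, 10.0, 50.0, 900.0],
})

counterexamples = strong_without_high_le(toy)
not_every_strong_economy_has_high_le = len(counterexamples) > 0
print(list(counterexamples["Code"]), not_every_strong_economy_has_high_le)
```

A single row in the result is enough to answer "no" to the question, since one counterexample disproves "every".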

#### Results

It is quite obvious from the plot above that there are at least half a dozen countries
with high GDPs (indicator of a strong economy) that fall below our threshold
of what counts as "high life expectancy". The most extreme example is the blue data point on the
far right, with the highest GDP, but still out of range of what we define as having
a high life expectancy. Below, we list the three strongest economies that don't meet
our requirement of high life expectancy. The results we obtain are highly interesting
and relate to the next question. All three listed countries have a sizable population,
so what would the results look like if we took population into consideration by using
GDP *per capita* as an indicator of a strong economy instead?

However, to answer the question properly: **no, not every strong economy
has a high life expectancy.** This is especially notable for India, which has a high GDP
but a much lower life expectancy than the other countries.

### 1G: GDP per capita as an indicator of a strong economy

To compare how GDP per capita works as an indicator instead of GDP in regards to task f), we use the same approach by drawing a colored scatter plot but with the GDP per capita dataset instead. We also plot the lines for where the upper quartile ($Q_3$) is, and the line for where one standard deviation above the mean ($\mu+\sigma$) is, in a similar fashion to 1f).

We can see from the figure above that no matter which of the two lines we
use to define the limit for what constitutes a strong economy
($\mu + \sigma$ or $Q_3$), there are still blue data points
to the right of the lines, just like in f). In other words, there still exist countries
that are considered strong economies but don't have a high life
expectancy. *Thus, using GDP per capita to answer the previous task's
question gives the same answer: no, not every strong economy has a
high life expectancy.*

However, using GDP per capita as an indicator of a strong economy for
*this question* has merits since the other dimension "life expectancy" belongs
to an individual. **With GDP per capita, we get a "symmetry" in the data,**
as this indicator is more individualistic than the whole country's GDP.

Furthermore, things get interesting when we compare the figures from the two tasks. In f), we see that a large portion of the countries with high life expectancy fall below the limit we define, and so do not count as strong economies. On the other hand, the graph presented above shows that almost all countries with high life expectancy count as strong economies if we use the upper quartile as the limit. By flipping the question to "Is every country that has a high life expectancy a strong economy?", we can answer it more easily when using GDP per capita as the indicator instead of total GDP. The answer would be "yes, almost every country that has a high life expectancy is a strong economy" (Greece is the exception). This is not as clear when we observe the graph from task f).
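
The flipped question can be quantified as the share of high-life-expectancy countries that also sit at or above the upper quartile of the chosen indicator. The sketch below is a hedged illustration: the column names, the helper `share_of_high_le_that_are_strong`, and the toy values are assumptions and do not reproduce our figures.

```python
import pandas as pd


def share_of_high_le_that_are_strong(df, gdp_col):
    """Fraction of high-life-expectancy countries with `gdp_col` at or above Q3."""
    le = df["LifeExpectancy"]
    high_le = df[le >= le.mean() + le.std()]
    q3 = df[gdp_col].quantile(0.75)
    # The mean of a boolean Series is the fraction of True values.
    return (high_le[gdp_col] >= q3).mean()


# Fictional entities with placeholder values (not real data).
toy = pd.DataFrame({
    "Code": list("ABCDEFGHI"),
    "LifeExpectancy": [58.0, 60.0, 62.0, 59.0, 61.0, 60.0, 85.0, 86.0, 87.0],
    "Gdp": [10.0, 20.0, 30.0, 40.0, 50.0, 2000.0, 5.0, 900.0, 60.0],
    "GdpPerCapita": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 40.0, 50.0, 60.0],
})

share_gdp = share_of_high_le_that_are_strong(toy, "Gdp")
share_per_capita = share_of_high_le_that_are_strong(toy, "GdpPerCapita")
print(share_gdp, share_per_capita)
```

A share close to 1 for one indicator and clearly lower for the other would correspond to the visual difference between the two figures.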

## 2: Investigating new data

Note: Unless datasets are mutually dependent and closely related, we cannot directly answer questions or draw conclusions from them, although the comparison can spark further research and act as a catalyst for studying a field. In other words, correlation does not imply causation.

For this task, we have chosen three new datasets:

- (1) Self reported happiness for each country
- (2) Suicide death rates for each country
- (3) Alcohol consumption for USA

To get context and begin understanding the data, we plot each dataset independently of the others. Furthermore, to constrain the size of the task, we only look at one country, the USA, which means we filter out the remaining countries. We also filter the year range to begin at 2005 for all datasets, since one of the datasets starts at 2005 and has no data prior to that year.

#### 2A+B: Meaningful questions + Insights and findings

Our initial thoughts:

- the less happy you are, the more likely you are to drink.
- the more you drink, the more likely you are to commit suicide
- the less happy you are, the more likely you are to commit suicide

Thus, we expect there to be a three-way relation and correlation between the datasets. The questions are posed as:

- How are happiness and suicide rates related? Do they follow a similar trend?
- How are suicide rates and alcohol consumption related?
- How are happiness and alcohol consumption related?
- Can we find a three-way correlation? I.e., are all the datasets connected in a way that makes them affect each other?

Below, we plot the data to answer our questions. The datasets share the x-axis but have different scales on the y-axes, as they have different units. We are mostly interested in the shape of the data over the years and how it has changed.
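
Two series with different units can share an x-axis by giving each its own y-axis, for example with `matplotlib`'s `Axes.twinx`. The sketch below is one possible way to draw such a plot, not necessarily how our figures were produced; the helper `plot_two_series`, the labels, and the numbers are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt


def plot_two_series(years, left_values, right_values, left_label, right_label):
    """Plot two yearly series on a shared x-axis with separate y-axes."""
    fig, ax_left = plt.subplots()
    ax_right = ax_left.twinx()  # second y-axis that shares the x-axis
    ax_left.plot(years, left_values, color="tab:blue")
    ax_right.plot(years, right_values, color="tab:red")
    ax_left.set_xlabel("Year")
    ax_left.set_ylabel(left_label)
    ax_right.set_ylabel(right_label)
    return fig, ax_left, ax_right


# Placeholder numbers, only to make the sketch runnable (not real data).
years = [2006, 2007, 2008]
fig, ax_left, ax_right = plot_two_series(
    years, [7.2, 7.1, 7.0], [11.0, 11.3, 11.6],
    "Life satisfaction", "Suicide rate",
)
```

Since each axis is scaled independently, only the shapes of the curves are comparable, not their vertical positions.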

##### Happiness and suicide

As seen in the diagram above, during the period 2006-2017 life satisfaction decreased at a rate similar to the increase in suicide rates. Therefore, there seems to be an inverse correlation between life satisfaction (happiness) and suicide rate, which is reasonable since a happier population would most likely have fewer suicides.

##### Suicide and alcohol consumption

Once again, there seems to be a fairly tight correlation between the two datasets. When alcohol consumption increases, so does the suicide rate. Interestingly, both also decrease at around 2010, and then increase at a similar rate again. We figure that people who commit suicide are suffering for their own individual reasons, and it is not unheard of to use alcohol to drown one's sorrows. Thus, it makes sense that there is a tight correlation between these two datasets. However, we would not claim that there is causation here, since we are lacking too much information. Still, we consider it impossible for the suicide rate to be the cause of an increase or decrease in alcohol consumption, as the person has sadly passed away by then and can no longer affect the alcohol consumption data.

##### Happiness and alcohol consumption

Just like in the other two cases, a correlation is seen between alcohol consumption and life satisfaction. Although the trend is not as clear as the previous ones, it is still noticeable. What we expected to see, however, was an inverse correlation between the two, i.e. as alcohol consumption decreases, life satisfaction increases. Instead, the plot shows us that when one increases, so does the other. We don't know which other variables may affect both alcohol consumption and happiness, which also means we need to take this result, along with the rest, with a grain of salt.

##### Three-way correlation

There doesn't seem to exist a strong correlation between all three datasets. We observe that there might be an inverse correlation between life satisfaction and suicide rate, as well as a correlation between suicide rate and alcohol consumption in the USA. However, the graph above shows that there is no clear correlation between life satisfaction and alcohol consumption, even though alcohol consumption seemed to follow the shape of the suicide rate. Furthermore, our choice to filter the years left us with a span that is too short to say anything significant.
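
The visual comparisons above could be complemented by pairwise Pearson correlation coefficients, for example with `pandas`' `DataFrame.corr`. The sketch below uses placeholder numbers and hypothetical column names, so its output says nothing about the real datasets; with only about a dozen yearly observations, any such coefficient should be read with caution.

```python
import pandas as pd

# Placeholder yearly values for one country (not real data).
toy = pd.DataFrame(
    {
        "happiness": [7.0, 6.8, 6.6, 6.4],
        "suicide_rate": [10.0, 11.0, 12.0, 13.0],
        "alcohol": [8.0, 8.2, 8.1, 8.3],
    },
    index=[2006, 2007, 2008, 2009],
)

# Pearson correlation between every pair of columns (values in [-1, 1]).
corr = toy.corr(method="pearson")
print(corr.round(2))
```

A coefficient near -1 or 1 indicates that two series move together linearly over the chosen years, but it does not indicate which one, if either, drives the other.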