# In-class Exercise Solutions, Week 6

## Exercise 1

First, load the precipitation data into a DataFrame. (I use `low_memory=False` to suppress warnings about columns with mixed data types.)
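A minimal sketch of the loading step. Since the actual file isn't included here, a tiny inline sample stands in for it; the column names `DATE` and `DlySum` are assumptions based on the exercise's description, so adjust them to match your file.

```python
import io
import pandas as pd

# Tiny inline stand-in for the real precipitation file.
# DlySum is in 100ths of an inch, per the exercise.
csv_text = """DATE,DlySum
1990-03-14,25
1990-07-02,130
2003-10-01,110
"""

# low_memory=False makes pandas read the whole file in one pass,
# which suppresses the DtypeWarning about columns with mixed types.
df = pd.read_csv(io.StringIO(csv_text), low_memory=False, parse_dates=["DATE"])
```

With the real data, replace the `StringIO` object with the path to the CSV file.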

**Question 1:** What was the total number of inches of precipitation in 1990?

(Note that the "DlySum" column is in 100ths of an inch.)
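One way to answer this, sketched against a small synthetic DataFrame (the `DATE` and `DlySum` column names are assumptions): filter to 1990, sum, and divide by 100 to convert to inches.

```python
import pandas as pd

# Synthetic stand-in for the precipitation DataFrame.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["1990-03-14", "1990-07-02", "1991-01-05"]),
    "DlySum": [25, 130, 50],
})

# Keep only 1990, then sum and convert from 100ths of an inch to inches.
in_1990 = df[df["DATE"].dt.year == 1990]
total_inches = in_1990["DlySum"].sum() / 100
```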

**Question 2:** What percentage of days in the 2000s had at least 1 inch of precipitation?
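A sketch of one approach, using the same assumed column names: restrict to the years 2000–2009, then take the mean of a boolean condition (a mean of True/False values is the fraction that are True). Since `DlySum` is in 100ths of an inch, "at least 1 inch" means `DlySum >= 100`.

```python
import pandas as pd

# Synthetic stand-in for the precipitation DataFrame.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2000-01-01", "2003-06-15", "2007-09-30", "1999-12-31"]),
    "DlySum": [150, 20, 100, 500],
})

# Restrict to the 2000s, then compute the fraction of days with >= 1 inch.
decade = df[(df["DATE"].dt.year >= 2000) & (df["DATE"].dt.year <= 2009)]
pct = (decade["DlySum"] >= 100).mean() * 100
```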

**Question 3:** On what date did the highest precipitation take place, as far as this data shows?
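One way to find the date, sketched with synthetic data: `idxmax` gives the row index of the largest `DlySum`, and `loc` looks up the date in that row.

```python
import pandas as pd

# Synthetic stand-in for the precipitation DataFrame.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["1990-03-14", "1995-08-20", "2003-10-01"]),
    "DlySum": [25, 410, 110],
})

# The row with the largest DlySum gives the date of the heaviest precipitation.
wettest_date = df.loc[df["DlySum"].idxmax(), "DATE"]
```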

**Question 4:** Did any other date tie with that one for the highest?

(We fetch all rows whose precipitation equals that maximum and find that there is only one. So no, there was no tie.)
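The tie check can be sketched like this: select every row equal to the maximum and see how many you get.

```python
import pandas as pd

# Synthetic stand-in for the precipitation DataFrame.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["1990-03-14", "1995-08-20", "2003-10-01"]),
    "DlySum": [25, 410, 110],
})

# Select every row that matches the maximum; exactly one row means no tie.
max_rows = df[df["DlySum"] == df["DlySum"].max()]
no_tie = len(max_rows) == 1
```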

## Exercise 2

First, we load the two datasets mentioned in the slides.
1. the sample of home mortgage applications we've used in several different weeks of the course, `practice-project-dataset-1.csv`

2. the file of 2016 election results by state that you prepared as homework for today, which we'll call `npr-2016-election-data.csv`

The property value column in the mortgage applications dataset needs a little cleaning up so that it can be used numerically later.
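A sketch of the cleanup step. The column name `property_value` and the form of the messy entries are assumptions; the real dataset may differ. The key idea is `pd.to_numeric` with `errors="coerce"`, which turns anything non-numeric into `NaN`.

```python
import pandas as pd

# Synthetic stand-in for the mortgage data; "property_value" and the
# "NA" entry are assumptions about what needs cleaning.
mortgages = pd.DataFrame({
    "property_value": ["350000", "125000", "NA", "275000"],
})

# Coerce non-numeric entries to NaN so the column can be used in
# numeric computations such as medians.
mortgages["property_value"] = pd.to_numeric(
    mortgages["property_value"], errors="coerce"
)
```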

**Question 1:** What is the median property value for mortgage applications by state?
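This is a one-line groupby once the property value column is numeric; the column names below are assumptions standing in for the real dataset's.

```python
import pandas as pd

# Synthetic stand-in for the cleaned mortgage data.
mortgages = pd.DataFrame({
    "state": ["OH", "OH", "CA", "CA", "CA"],
    "property_value": [150000, 250000, 500000, 700000, 600000],
})

# Group by state, select property value, take the median of each group.
median_by_state = mortgages.groupby("state")["property_value"].median()
```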

**Question 2:** What is the median property value for mortgage applications by race of primary borrower, sorted in descending order?
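Same pattern as Question 1, plus a `sort_values(ascending=False)` at the end. The `race_of_borrower` column name and category labels are placeholders for whatever the real dataset uses.

```python
import pandas as pd

# Synthetic stand-in for the cleaned mortgage data.
mortgages = pd.DataFrame({
    "race_of_borrower": ["White", "Asian", "White", "Black", "Asian"],
    "property_value": [300000, 450000, 350000, 250000, 550000],
})

# Median property value per race, largest first.
median_by_race = (
    mortgages.groupby("race_of_borrower")["property_value"]
    .median()
    .sort_values(ascending=False)
)
```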

**Question 3:** Create a new column in the mortgage dataset that assigns to each mortgage the percentage of votes that went to Trump in that state in 2016. Create a scatterplot of that column against property value.
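One way to build the new column is to index the election table by state and use `Series.map` to look up each mortgage's state. The column names (`state`, `trump_pct`) and the tiny tables below are assumptions standing in for the two real datasets.

```python
import pandas as pd

# Synthetic stand-ins for the two datasets.
mortgages = pd.DataFrame({
    "state": ["OH", "CA", "WY", "CA"],
    "property_value": [200000, 650000, 180000, 720000],
})
election = pd.DataFrame({
    "state": ["OH", "CA", "WY"],
    "trump_pct": [51.7, 31.6, 67.4],
}).set_index("state")

# Look up each mortgage's state in the election table.
mortgages["trump_pct"] = mortgages["state"].map(election["trump_pct"])

# The scatterplot (requires matplotlib):
# mortgages.plot.scatter(x="trump_pct", y="property_value")
```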

**Question 4:** Consider just the most Republican states ($\ge60\%$ for Trump) vs. just the most Democratic states ($\le40\%$ for Trump), and wonder whether the median property value is different for those two subsamples. Run a hypothesis test at the 95% confidence level for this question.
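A sketch of the test using SciPy's two-sample $t$-test (assuming SciPy is available, as in class). The two synthetic samples below stand in for the real red-state and blue-state property value subsamples, which you would extract from the merged dataset by filtering on the Trump-percentage column.

```python
import pandas as pd
from scipy import stats

# Synthetic stand-ins for property values in heavily Republican vs.
# heavily Democratic states (the real subsamples come from filtering
# the merged dataset on the Trump-percentage column).
red = pd.Series([150000, 160000, 145000, 155000, 148000])
blue = pd.Series([520000, 510000, 535000, 540000, 515000])

# Welch's two-sample t-test (equal_var=False): the null hypothesis is
# that the two population means are equal; reject when p < 0.05.
t_stat, p_value = stats.ttest_ind(red, blue, equal_var=False)
reject_null = p_value < 0.05
```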

The $p$-value is small enough that we easily reject the null hypothesis that the two means are equal: red and blue states have statistically significantly different mean property values.

Wow! A big difference, too!

## Exercise 3

**Question 1:** We want to compute the average number of units produced across all factories in the past quarter.

Map-reduce, starting with the factories table.

- Map function: Given a factory, open its data file and compute the number of units produced in the last quarter.
- Reduce function: Average.
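The plan above can be sketched in plain Python. Since the actual factories table and per-factory data files aren't available, a dict mapping factory names to unit counts stands in for them; the names and numbers are invented.

```python
# Simulated per-factory data files: factory name -> daily unit counts
# for the last quarter (all names and numbers are invented).
factory_files = {
    "Akron":   [120, 130, 125],
    "Boulder": [200, 210, 190],
}

def map_fn(factory):
    # In reality this would open the factory's data file; here we just
    # total the units produced in the simulated records.
    return sum(factory_files[factory])

def reduce_fn(values):
    # Average the per-factory totals.
    return sum(values) / len(values)

mapped = [map_fn(f) for f in factory_files]   # map step
average_units = reduce_fn(mapped)             # reduce step
```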

**Question 2:** We want to compute the median salary of employees by factory.

Split-apply-combine, starting with the employees table.

- Split category: Which factory the employee works at.
- Apply function: Choose the salary column.
- Combine function: Median.
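In pandas, those three steps collapse into one line: `groupby` is the split, selecting the salary column is the apply, and `median` is the combine. The column names below are assumptions about the employees table.

```python
import pandas as pd

# Synthetic stand-in for the employees table.
employees = pd.DataFrame({
    "factory": ["Akron", "Akron", "Boulder", "Boulder", "Boulder"],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

# split = groupby, apply = column selection, combine = median
median_salary = employees.groupby("factory")["salary"].median()
```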

**Question 3:** We want to see a scatterplot of the relationship between number of employees at a factory and average daily units produced at that factory.

The plot routine will need a two-column table, one column containing number of employees and the other the average daily units produced. Creating each column is a separate task.

First, do a split-apply-combine, starting with the employees table.

- Split category: Which factory the employee works at.
- Apply function: Doesn't need to do anything; choose any column.
- Combine function: Length/count.

The result is a table mapping factories to number of employees at the factory.

Second, map the list of factories from that table through a function that opens each factory's file, computes the average daily units produced at that factory, and returns it.

We then have the two columns we need and can plot them.
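The whole two-step plan can be sketched as follows, with a dict of in-memory records standing in for the per-factory data files (all names and numbers are invented).

```python
import pandas as pd

# Synthetic stand-ins for the employees table and per-factory files.
employees = pd.DataFrame({
    "factory": ["Akron", "Akron", "Boulder", "Boulder", "Boulder"],
    "name": ["A", "B", "C", "D", "E"],
})
factory_files = {
    "Akron":   [120, 130, 125],
    "Boulder": [200, 210, 190],
}

# Step 1: split-apply-combine to count employees per factory.
num_employees = employees.groupby("factory")["name"].count()

# Step 2: map each factory to its average daily units produced
# (in reality the lambda would open that factory's data file).
avg_units = num_employees.index.to_series().map(
    lambda f: sum(factory_files[f]) / len(factory_files[f])
)

# The two columns the plot routine needs, aligned by factory.
result = pd.DataFrame({
    "num_employees": num_employees,
    "avg_daily_units": avg_units,
})
# result.plot.scatter(x="num_employees", y="avg_daily_units")  # needs matplotlib
```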