Part 1
1) For each continuous attribute, calculate its average, standard deviation, minimum, and maximum values.
2) For the discrete attribute, count the frequency for each of its distinct values.
3) Draw a histogram of the class variable.
4) Draw the distribution of values for a continuous attribute using a histogram.
5) Draw some scatter plots for a couple of attribute pairs.
6) Draw a parallel coordinates diagram for some attributes in the data set.
7) For each diagram describe your interpretation and insight.
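Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code we actually ran. The records, the column names (age, hours_per_week, native_country), and the helper names summarize_continuous and count_discrete are all assumptions made up for the example.

```python
import statistics
from collections import Counter

# Hypothetical toy records; the values are made up for illustration only
# and are not taken from the real data set.
records = [
    {"age": 25, "hours_per_week": 40, "native_country": "United-States"},
    {"age": 38, "hours_per_week": 40, "native_country": "United-States"},
    {"age": 52, "hours_per_week": 45, "native_country": "Mexico"},
    {"age": 29, "hours_per_week": 20, "native_country": "United-States"},
]


def summarize_continuous(rows, column):
    """Return mean, sample standard deviation, min, and max of a numeric column."""
    values = [row[column] for row in rows]
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample (n - 1) standard deviation
        "min": min(values),
        "max": max(values),
    }


def count_discrete(rows, column):
    """Return how often each distinct value of a discrete column occurs."""
    return Counter(row[column] for row in rows)


age_summary = summarize_continuous(records, "age")
country_counts = count_discrete(records, "native_country")
print(age_summary)
print(country_counts.most_common())
```

The same two helpers can be applied to each continuous and discrete attribute in turn. Note that statistics.stdev uses the sample formula; statistics.pstdev would give the population version instead.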
Insights:
Our first insight came from the age distribution: the dataset is heavily skewed toward the younger population, with the vast majority of entries aged 40 and under. This skew toward younger demographics could potentially bias other attributes, so we will keep it in mind. Our next observation was that capital gains and capital losses were 0 for the vast majority of people. We suspected this was due to the younger age skew discussed above, since younger people are on average less likely to be invested in the stock market, which would result in zero capital gains or losses.
We noticed a large spike in "hours worked per week" at the 40 mark. This was a good sanity check that our analysis was working as intended, since most working people work 40 hours a week. Another notable skew was that almost every participant was located in the United States. This may also skew our results.
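As a rough sketch of the kind of binning a histogram performs, the snippet below groups hours-per-week values into fixed-width bins and prints a text histogram. The values, the bin width, and the helper name bin_counts are assumptions chosen for illustration; they are not taken from our data or our actual plotting code.

```python
from collections import Counter

# Hypothetical hours-per-week values, made up for illustration only.
hours_per_week = [40, 40, 38, 45, 40, 20, 60, 40, 35, 50]
width = 5  # assumed bin width in hours


def bin_counts(values, bin_width):
    """Count values per bin; each bin is keyed by its lower edge."""
    return Counter((v // bin_width) * bin_width for v in values)


counts = bin_counts(hours_per_week, width)

# Print one row per non-empty bin, e.g. the 40-44 bin gets one '#' per value.
for lower in sorted(counts):
    upper = lower + width - 1
    print(f"{lower:>3}-{upper:<3} {'#' * counts[lower]}")

# The bin with the most values corresponds to the spike in the histogram.
modal_bin, modal_count = counts.most_common(1)[0]
```

With real data, the tallest bin would be expected to sit at the 40-hour mark, matching the spike described above.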
For our scatter plot, we decided to analyze the correlation between age and capital gains/losses to test our hypothesis. As it turns out, our original theory was incorrect. This was a great discovery, because it showed that our preconceived ideas were not accurate. There is no strong correlation between age and capital gains/losses, which suggests that people of all ages dabble in investments. This discovery showed the power of data analysis.
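One way to put a number on what the scatter plot shows is the Pearson correlation coefficient. The sketch below computes it by hand with the standard library. The helper name pearson_r and the sample values are assumptions for illustration only, and are not results from our data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, made up for illustration only.
ages = [22, 35, 41, 58, 63]
capital_gains = [0, 5000, 0, 0, 2000]

r = pearson_r(ages, capital_gains)
print(f"Pearson r between age and capital gain: {r:.3f}")
```

A value near 0 indicates little linear relationship, while values near 1 or -1 indicate a strong one. Because most capital gain and loss entries are 0, the coefficient should be read alongside the scatter plot rather than on its own.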
Part 2
Identify which attributes have missing values and address the issue by: