Name: Brooke DuBois

Name: Catherine Zhang

Name: Indya Donovan-Pinot

Name: Christian Yun

Name: William Delery

Name: Andrea Diaz

Current household income

1 = not say

2 = under 20,000

3 = 20,000 - 39,999

4 = 40,000 - 59,999

5 = 60,000 - 79,999

6 = 80,000 - 99,999

7 = 100,000 - 199,999

8 = 200,000 - 999,999

9 = over 1,000,000

# Research Question

Use the space to briefly describe the broad research question you're hoping to address in this project.

Does the amount of importance one places on parenting explain variability in income? We are exploring whether there is a pattern between the percentage of importance a person places on parenting and the value of their income. If someone values parenting highly, then they might have a higher income because they might be preparing for parenting.

# Simple Model

a) Word Equation

Income = Mean + Error

This equation represents the average income of emerging adults without taking into account the importance they place on parenting. Income is an individual's income, which equals the mean income of emerging adults (Mean) plus any unexplained variability (Error).

b) GLM Notation

$$Y_i = b_0 + e_i$$

Yi is the income of an individual emerging adult, recorded as the number corresponding to a specific income range; it is the outcome variable. b0 is the mean income of emerging adults, without considering parenting importance. ei is the error: the unexplained variability between an individual's income and the mean.
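To make the notation concrete, here is a minimal Python sketch of the empty model. The income codes below are made up for illustration and are not the project data; the point is only that b0 is the sample mean and each e_i is whatever is left over.

```python
from statistics import mean

# Hypothetical income codes on the 1-9 scale; NOT the project's data.
income = [2, 3, 4, 4, 5, 7, 3, 8]

b0 = mean(income)                      # the empty model's only estimate
residuals = [y - b0 for y in income]   # e_i = Y_i - b_0

# Every observation is recovered exactly as b0 + e_i.
reconstructed = [b0 + e for e in residuals]

print("b0 =", b0)
print("sum of residuals =", round(sum(residuals), 10))
```

The residuals always sum to zero around the mean, which is why the mean is the balancing point of the distribution.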

c) Visualizations

In both the histogram and the boxplot, the distribution is skewed to the right (positively skewed). On the histogram, the right tail is longer and fatter than the left tail: as income increases, the number of people with that income decreases. In other words, a higher proportion of people in the sample have lower incomes, and for the most part, the higher the income, the fewer people report it. On the boxplot, the positive skew shows in the median (the line inside the box) sitting closer to the lower quartile (bottom of the box) than to the upper quartile (top of the box). Consistent with the positive skew, the mean income (4.504) is greater than the median (4), as seen in the histogram, where the blue line (the mean) is to the right of 4.

d) Yi = 4.504 + ei

This linear model shows that the income of emerging adults is, on average, 4.504. That value falls between bracket 4 (40,000 to 59,999) and bracket 5 (60,000 to 79,999), so it corresponds roughly to an income range of $40,000 to 79,999.
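The translation from a mean code back to income ranges can be sketched as a small lookup. The dictionary copies the codebook at the top of this report; the function name `surrounding_brackets` is made up for illustration. Note that averaging the codes treats the brackets as equally spaced, which is an assumption (brackets 7 to 9 are much wider than the others).

```python
import math

# Income codebook from the top of this report (code 1 = not say).
BRACKETS = {
    2: "under 20,000",
    3: "20,000 - 39,999",
    4: "40,000 - 59,999",
    5: "60,000 - 79,999",
    6: "80,000 - 99,999",
    7: "100,000 - 199,999",
    8: "200,000 - 999,999",
    9: "over 1,000,000",
}

def surrounding_brackets(mean_code):
    """Return the two brackets a non-integer mean code falls between."""
    lower = math.floor(mean_code)
    upper = math.ceil(mean_code)
    return BRACKETS[lower], BRACKETS[upper]

low, high = surrounding_brackets(4.504)
print(low, "to", high)
```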

e) Quantifying Error (ANOVA)

Because the simple model has no explanatory variable, there is no PRE or F value. The sum of squares for the simple model is 8584.209. The sum of squares is the total of the squared deviations of the data points from the mean. It grows with both the spread of the data and the number of observations, so this value serves mainly as the baseline error that the more complex models will try to reduce.
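As a rough illustration of how a sum of squares is computed, the sketch below uses a small made-up list of income codes (not the project data). It also shows that duplicating every observation doubles the sum of squares without changing the spread, which is why the raw number is hard to judge on its own.

```python
from statistics import mean, variance

# Hypothetical income codes; NOT the project's data.
income = [2, 3, 4, 4, 5, 7, 3, 8]

def sum_of_squares(data):
    """Total squared deviation of the data from its own mean."""
    center = mean(data)
    return sum((y - center) ** 2 for y in data)

ss_total = sum_of_squares(income)

# SS equals (n - 1) times the sample variance.
ss_from_variance = (len(income) - 1) * variance(income)

# Same spread, twice as many observations: SS doubles.
ss_doubled = sum_of_squares(income * 2)

print(ss_total, ss_from_variance, ss_doubled)
```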

f) Visualization of sampling variability

The sampling distribution is centered on the mean income of emerging adults, which is 4.504. The distribution of sample means appears unimodal and approximately normal, with the sample means falling roughly symmetrically on either side of the center. This visualization of sampling variability shows the range of DGPs our data could plausibly have come from: our data is most likely under DGPs with a mean near our original sample mean, which produces the approximately normal shape.
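One common way to simulate a sampling distribution of the mean is bootstrapping, that is, resampling the data with replacement many times. The sketch below assumes that approach and uses made-up income codes; the report does not say which simulation method produced its plot, and `bootstrap_means` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical income codes; NOT the project's data.
income = [2, 3, 4, 4, 5, 7, 3, 8, 5, 4, 6, 2]

def bootstrap_means(data, n_resamples=2000):
    """Resample with replacement and record the mean of each resample."""
    return [mean(random.choices(data, k=len(data)))
            for _ in range(n_resamples)]

boot = bootstrap_means(income)
center = mean(boot)    # should sit close to the sample mean
spread = stdev(boot)   # estimated standard error of the mean

print(round(center, 3), round(spread, 3))
```

A histogram of `boot` would give the roughly symmetric, unimodal shape described above.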

g) Numeric description of sampling variability

The mean income of emerging adults, without considering the importance they place on parenting, is 4.504348 ($40,000 to 79,999). The median income was 4 (40,000 to 59,999). The standard deviation shows that, on average, the scores are about 1.982548 away from the mean income. The tally function shows that the data are split fairly evenly around 4.504, with 51% of values above the mean and 49% below it.

h) Confidence intervals

We are 95% confident that the true population mean income is between 4.421174 and 4.587522, which corresponds to the income range of $40,000 to 79,999.
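The reported bounds can be unpacked into a center and a margin of error with simple arithmetic. In the sketch below, the multiplier 1.96 is an assumption (the usual normal approximation for a 95% interval); the report does not state how the interval was computed, so the implied standard error is only approximate.

```python
# Bounds as reported in this section.
lower, upper = 4.421174, 4.587522

center = (lower + upper) / 2   # midpoint of the interval
margin = (upper - lower) / 2   # half-width, i.e. the margin of error

# ASSUMPTION: a normal-approximation multiplier of 1.96 was used.
approx_se = margin / 1.96

print(round(center, 6), round(margin, 6), round(approx_se, 4))
```

The midpoint matches the sample mean of 4.504348 reported earlier, as expected for a symmetric interval around the estimate.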

i) Model Comparison

j) Conclusion

Because the confidence interval does not include 0, we can reject the null hypothesis. We are 95% confident that the mean income of emerging adults is within the range of 4.421174 to 4.587522 (40,000 to 79,999). The standard deviation is about 2, meaning that individual incomes typically differ from the mean by about two income brackets.

# Qualitative Predictor Model

a) Word Equation

Income = Importance of Parenting + Error

An individual's income can be modeled as a function of the importance they assign to parenting, plus error.

b) GLM Equation

$$Y_i = b_0 + b_1 X_i + e_i$$

Since this is a two-group model, b0 represents the mean income of people who marked low parental importance (low marriage1_2 values), and b1 is the increment added to b0 to obtain the mean income of people who marked high parental importance (high marriage1_2 values). To obtain someone's income score (variable income), the GLM says to start with b0, add b1 if the person is in the high group, and then add any error. The Xi value is 0 for the low group and 1 for the high group.
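The 0/1 coding can be sketched in a few lines of Python. The rows below are made up for illustration (they are not the project data), and `predict` is a hypothetical helper; the sketch only shows that b0 is the low group's mean and b1 is the difference between the two group means.

```python
from statistics import mean

# Hypothetical (group, income code) pairs; NOT the project's data.
rows = [("low", 3), ("low", 5), ("low", 4),
        ("high", 5), ("high", 4), ("high", 6)]

low = [y for g, y in rows if g == "low"]
high = [y for g, y in rows if g == "high"]

b0 = mean(low)          # mean income code of the low group (X_i = 0)
b1 = mean(high) - b0    # increment added for the high group (X_i = 1)

def predict(group):
    """Model prediction: b0 for the low group, b0 + b1 for the high group."""
    x = 1 if group == "high" else 0
    return b0 + b1 * x

print(predict("low"), predict("high"))
```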

c) Visualizations

In both visualizations, there is no clear difference in the distribution of income values between people who marked high and low values for the importance of parenting (variable marriage1_2). Both graphs show that the income distributions for the high and low parenting importance groups are very similar. The box plot does a better job of showing that the medians and the spread are nearly identical for both groups. Overall, these graphs show that there is very little difference in the distribution of incomes between people with higher and lower values for parenting importance in marriage1_2.

d) Fit a linear model

$$Y_i = 4.4300 + 0.1487 X_i + e_i$$

The output shows that the mean income value for people who marked low values for the parenting importance variable is 4.43. It also shows that there is only a .1487 difference in income value between the high parenting importance group and the low parenting importance group (with the high group being greater). This tells us that both groups had very similar income scores, between 4 and 5, meaning that the mean income range was 40,000-79,999.

e) Quantifying Error (ANOVA)

An F value above 1 suggests that the difference between the two group means is larger than we would expect from chance alone, although a difference of this size could still have arisen by chance in this study. The PRE is very small (0.0014), meaning that the complex model explains very little of the variation left unexplained by the simple model. Furthermore, the p value is bigger than 0.95, which indicates that a DGP in which the simple model is true is relatively likely to produce data like our sample.
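PRE and F are both built from the same sums of squares, which the sketch below makes explicit. The inputs are made-up round numbers chosen for easy arithmetic, not values from this project's ANOVA table, and `pre_and_f` is a hypothetical helper.

```python
def pre_and_f(ss_total, ss_error, df_model, df_error):
    """Compute PRE and F from the empty-model and complex-model errors.

    ss_total: sum of squares left by the empty model
    ss_error: sum of squares left by the complex model
    """
    ss_model = ss_total - ss_error            # error removed by the model
    pre = ss_model / ss_total                 # proportion of error removed
    f = (ss_model / df_model) / (ss_error / df_error)
    return pre, f

# Hypothetical values; NOT taken from this report.
pre, f = pre_and_f(ss_total=100.0, ss_error=90.0, df_model=1, df_error=18)
print(pre, f)
```

PRE only asks how much error was removed, while F also accounts for how many parameters were spent and how much data remains.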

f) Visualization of sampling variability

This sampling distribution is unimodal and approximately normal, with most values occurring around the mean value of 4.43 (the b0 value calculated in part d). This mean does not differ much from the simple model's mean. The visualization helps us see which DGPs could plausibly have produced the data we sampled (based on our own definition of likely).

g) Numeric description of sampling variability

The standard deviation of the sampling distribution of means would be 0.08479. This small number indicates little variability in the sampling distribution, so the visualization would be narrow. This means that the mean income values from sample to sample were not very far apart on average.

h) Confidence intervals

We are 95% confident that the true population mean income of the low parenting importance group (b0) is between 4.31246631 and 4.5475520. This interval is centered on b0 = 4.43 and does not include 0, so a DGP in which that group's mean income is 0 is unlikely. Whether the data could have come from a DGP with no relationship between the explanatory and outcome variables (where b1 would be 0) depends on the confidence interval for b1 rather than on this one.

i) Model Comparison

The alpha level is .05 (1 - .95). The area shaded purple represents the p value. The p value is 0.030, which is smaller than our alpha level of 0.05; therefore we reject the simple model. In essence, this model comparison shows that, if the empty model is true, the probability of getting the F value we calculated (3.078) or a more extreme value, with 1 and 3183 degrees of freedom, is .030.
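The logic of a p value can also be illustrated by simulation. The sketch below swaps the F distribution for a shuffle (permutation) test on the difference in group means, which is a different technique from the one used in this section but answers the same question: how often would a world with no group effect produce a result at least as extreme as ours? The data are made up.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical income codes for each group; NOT the project's data.
low = [3, 5, 4, 2, 4, 5, 3, 4]
high = [5, 4, 6, 3, 5, 4, 6, 5]

def mean_diff(a, b):
    return mean(b) - mean(a)

observed = mean_diff(low, high)

pooled = low + high
n_low = len(low)
n_shuffles = 5000
extreme = 0
for _ in range(n_shuffles):
    random.shuffle(pooled)  # break any link between group and income
    diff = mean_diff(pooled[:n_low], pooled[n_low:])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / n_shuffles  # share of shuffles at least as extreme
print(observed, p_value)
```

If `p_value` falls below the chosen alpha, the empty model is rejected, mirroring the decision rule described above.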

j) Conclusion

From the visualizations and the general linear model, it appears that the marriage1_2 variable (when separated into Parenting2Group) explains almost no variation in income. The means for both groups in Parenting2Group are very similar, and so are the distributions. The p-value is less than the alpha level, which means that we should reject the simple model. However, the upper and lower bounds of the confidence interval include 0, so by that measure we cannot entirely reject the simple model, because our data could plausibly have come from a DGP with a slope of 0. For our purposes, we are using the p-value, and therefore we reject the simple model: Parenting2Group may explain some variation in income, contrary to what the visualizations suggest.

# Quantitative Predictor Model

a) Word Equation

INCOME = IMPORTANCE OF PARENTING + ERROR

This word equation represents our research question. We are trying to find out whether IMPORTANCE OF PARENTING explains any variation in INCOME. Literally, this word equation says that someone's income can be partially predicted by the importance they place on parenting, plus other factors (error).

b) GLM Equation

$$Y_i = b_0 + b_1 X_i + e_i$$

Yi represents the outcome value for each individual in the sample (in this case, the income bracket of the individual). b0 is our intercept: the income bracket value that we would predict for an individual who assigned a score of 0 to the importance of parenting. In this case it is 4.37. Xi represents our explanatory variable, the relative importance of parenting, which was on a scale from 1 to 100. b1 is our slope: the increment we add to the predicted income value for each one-unit increase in our explanatory variable, which is a very small value of about 0.005. For example, to find the predicted income value for someone who places a 60% importance value on parenting, we would add b1*60 to b0. ei is error, or the difference between Yi and the value predicted by our model (b0 + b1Xi).
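For readers who want to see where b0 and b1 come from, here is a minimal least-squares sketch in plain Python. The five data points are made up so the arithmetic is easy to follow; they are not the project data.

```python
from statistics import mean

# Hypothetical data; NOT the project's data.
importance = [10, 20, 30, 40, 50]   # X_i: importance of parenting
income = [4, 5, 4, 6, 6]            # Y_i: income code

x_bar, y_bar = mean(importance), mean(income)

# Slope: co-variation of X and Y divided by the variation in X.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(importance, income))
sxx = sum((x - x_bar) ** 2 for x in importance)
b1 = sxy / sxx

# Intercept: forces the line through the point (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

# e_i = Y_i - (b0 + b1 * X_i)
residuals = [y - (b0 + b1 * x) for x, y in zip(importance, income)]

print(b0, b1)
```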

c) Visualizations

Based on the scatterplot above, incomes look similar across the different ratings of parenting importance. The plot helps us see that people's rating of the importance of parenting does not explain much variability in income. Each rating of parenting importance has dots at every income level. The dots appear random, and there doesn't seem to be any pattern that relates the rating to income. If the dots at certain ratings clustered at one income level, the rating would explain more variability. However, the best-fitting line, shown in coral, has a slight positive slope.

For data with a quantitative outcome and a quantitative explanatory variable, faceted histograms are not a good choice because they do not display the data in a readable way. A quantitative explanatory variable can take many different values on a continuous spectrum, resulting in too many faceted histograms. In contrast, a qualitative explanatory variable limits the values to the specified groups, so we get a manageable number of histograms that are visually useful.

d) Fit a linear model

$$Y_i = 4.37 + 0.0054 X_i + e_i$$

In fitting this linear model, we find that the predicted income value for someone who allocates no importance to parenting (a rating of 0) is around income bracket level 4.37. For each additional percentage point of importance given to parenting relative to marriage, career, and personal leisure, the predicted income level increases by about 0.005.
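Plugging a few ratings into the fitted equation shows how small the slope is in practice. The coefficients below are the ones reported in this section; the function name `predicted_income_code` is made up for illustration.

```python
# Coefficients as reported in the fitted model for this section.
B0 = 4.37     # intercept: predicted income code at a rating of 0
B1 = 0.0054   # slope: change in predicted code per rating point

def predicted_income_code(importance_pct):
    """Predicted income code for a given parenting importance rating."""
    return B0 + B1 * importance_pct

p0 = predicted_income_code(0)
p60 = predicted_income_code(60)
p100 = predicted_income_code(100)

print(round(p0, 3), round(p60, 3), round(p100, 3))
```

Across the full range of ratings, the prediction moves by only about half of one income bracket (0.0054 * 100 = 0.54).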

e) Quantifying Error (ANOVA)

The PRE represents the proportional reduction in error that results from using the complex model instead of the simple model. Marriage1_2 (relative importance of parenting) explains around .0018 of the variation in income left by our empty model, which is less than 1% of the error. However, the probability of getting a PRE this large or larger from a DGP in which the simple model is true is only .04. With an alpha higher than .04, we would reject the simple model. Furthermore, the F value is greater than 1 but still not extremely high, making it reasonable to conclude that the importance of parenting does not greatly affect income. The F ratio compares the variation explained by the model, per degree of freedom used, with the variation left unexplained, per remaining degree of freedom.

f) Visualization of sampling variability

The mean appears to be around 4.3 in this sampling distribution. This mean is similar to the mean of the simple model. Here many samples were taken, and the mean of each sample was used to construct this new distribution. The distribution appears approximately normal, with most values clustering around the mean of 4.3 and a roughly symmetric shape on either side. The sampling variability visualization gives us an idea of which DGPs could produce a sample with the same mean as our data.

g) Numeric description of sampling variability

If we did this study many times, the standard deviation of our slope estimates would be about 0.002711. Because this deviation is very small, the slope estimates cluster tightly around one value; there is not much variation from sample to sample.
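The idea of "doing the study many times" can be imitated by bootstrapping: resample the rows with replacement, refit the slope each time, and take the standard deviation of those slopes. This is a sketch under that assumption with made-up data; the report does not say how its standard error was obtained, and `slope` and `bootstrap_slopes` are hypothetical helpers.

```python
import random
from statistics import mean, stdev

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical (importance, income code) pairs; NOT the project's data.
pairs = [(10, 4), (20, 5), (30, 4), (40, 6),
         (50, 6), (60, 5), (70, 7), (80, 6)]

def slope(data):
    """Least-squares slope of income on importance."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    return sxy / sxx

def bootstrap_slopes(data, n_resamples=1000):
    slopes = []
    while len(slopes) < n_resamples:
        sample = random.choices(data, k=len(data))
        # Skip degenerate resamples where every X is identical.
        if len({x for x, _ in sample}) > 1:
            slopes.append(slope(sample))
    return slopes

boot_slopes = bootstrap_slopes(pairs)
se_slope = stdev(boot_slopes)   # bootstrap standard error of b1

print(round(slope(pairs), 4), round(se_slope, 4))
```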

h) Confidence intervals

Based on this data, the slope estimates are tightly clustered: the 95% confidence interval runs only from 8.68*10^-5 to .011. This confidence interval says our data is likely (at a 95% confidence level) to have come from DGPs with a slope in the range of 8.68*10^-5 to .011. Additionally, the confidence interval does not include 0, meaning that data such as ours is unlikely to come from a DGP where b1 is 0 (indicating no relationship between our explanatory and outcome variables).
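The interval can be approximately rebuilt from the slope and its standard error reported in this section. The multiplier 1.96 is an assumption (the normal approximation for 95%), and the slope is rounded to 0.0054 in the report, so the bounds below only roughly match the reported ones.

```python
# Values as reported in this section.
b1 = 0.0054        # fitted slope (rounded in the report)
se_b1 = 0.002711   # standard deviation of the slope estimates

# ASSUMPTION: normal-approximation multiplier for a 95% interval.
z = 1.96

lower = b1 - z * se_b1
upper = b1 + z * se_b1
excludes_zero = lower > 0 or upper < 0

print(lower, upper, excludes_zero)
```

The lower bound sits very close to 0, so the conclusion that the interval excludes 0 is sensitive to small changes in the estimate or its standard error.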

i) Model Comparison

The B area, which is shaded purple, is the p value. This value is small and can be interpreted as the probability of getting a PRE as large as or larger than the PRE found in the original sample if the empty model were true. The p value is slightly smaller than our alpha level of 0.05. Therefore we reject the empty model in favor of the complex model.

j) Conclusion

We reject the empty model because our p-value is less than our alpha level. Although the graphs show little relationship between the quantitative predictor and the outcome variable, the p-value of 0.034 is less than our alpha level of 0.05, so statistically we must reject the simple model.

What this means for our research question is this: the simple model represents a DGP where the importance a person places on parenting is unrelated to the income they make. Because we rejected the simple model (our p-value was less than 0.05), we conclude that in the true population DGP, importance of parenting likely DOES relate to income to some degree.