Name: Haile Arriaza
Name: April Meza
Name: Bianca Bailon
Name: Madie Bachelor
Name: Yasmin Arcos
Name: Valeria Araujo
Research Question
Does age affect stress levels?
In our study of emerging adulthood, we chose to observe how age affects stress levels. Specifically, our question is the following: Do people become more stressed as they grow older?
Simple Model
a) Word Equation: Stress = Model + Error
Interpretation: An individual's stress level is modeled as the average stress level (the model) plus some error.
b) GLM Equation: $Y_i$ = $b_0$ + $e_i$
Interpretation: $Y_i$ represents an individual's stress level, $b_0$ is our estimate of the model (the sample mean), and $e_i$ represents the difference between an individual's stress level and the model's prediction.
c) Visuals:
gf_histogram(~Stress, data = EAMMIdata)
gf_boxplot(~Stress, data = EAMMIdata)
Interpretation: The histogram below shows the distribution of stress in our sample. The overall shape is roughly normal, with the mean of the stress variable at ~3.05. Similarly, the box plot summarizes the distribution by displaying its center and spread. The bold horizontal line, which marks the median, runs roughly through the center of the box, hinting that the distribution is symmetric (so the median sits close to the mean). The whiskers that extend from the box show the spread of the data beyond the middle 50%, and points plotted past them are outliers that weren't as evident in our histogram.
d) Fit Linear Model: Stress.model <- lm(Stress ~ NULL, data = EAMMIdata)
Interpretation: The intercept produced by our linear model is the average response of all participants, who were asked to rate their level of perceived stress on a scale from 1 to 4, with 1 being never and 4 being fairly often. The value of 3.065 tells us that participants reported being stressed more often than not. Additionally, we would predict a stress level of 3.065 for a randomly selected observation.
e) Quantifying Error (ANOVA): anova(Stress.model)
Interpretation: In the ANOVA table, our model produced a large sum of squares (1356.725), which measures the total squared distance of the individual stress scores from the model's prediction, the mean. This means that there is still a lot of variability left unexplained, since the only thing our model uses is the mean. The mean square is the sum of squares divided by the degrees of freedom, and it represents the variance around our model. With a mean square of ~0.435, a typical score falls about 0.66 points from the mean (the square root of 0.435) on a 4-point scale, so the model is a reasonable starting point, but there is definitely room to improve it.
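To make the relationship between sum of squares, degrees of freedom, and mean square concrete, here is a minimal sketch in base R. It uses simulated ratings (the vector `stress` is hypothetical, not the EAMMI responses) and computes each quantity by hand for an intercept-only model.

```r
# Hypothetical data: simulated 1-4 ratings, NOT the EAMMI responses.
set.seed(1)
stress <- sample(1:4, size = 200, replace = TRUE)

# Empty (intercept-only) model: every prediction is the sample mean.
empty_model <- lm(stress ~ 1)

ss_total <- sum(resid(empty_model)^2)  # sum of squared deviations from the mean
df_total <- length(stress) - 1         # n - 1: one df is spent estimating b0
ms_total <- ss_total / df_total        # mean square = SS / df

# For the empty model, the mean square is the sample variance.
all.equal(ms_total, var(stress))
```

The last line checks that dividing SS by df gives the same number as `var()`, which is why the mean square can be read as the variance around the model.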
f) Visualization of Sampling Variability:
Interpretation: The histogram below shows the distribution of 1000 bootstrapped sample means, the means we could have obtained if our sample were the DGP. It is centered around 3.065, the same value as our original sample mean, because we resampled with replacement from our original sample many times. This visualization gives us an idea of how much sample means from the DGP could vary. Overall, the distribution appears to be normal, with only slight irregularities.
g) Numeric Description of Sampling Variability:
Interpretation: This table numerically describes the sampling distribution pictured above. Its mean of 3.06 matches the mean of our original sample, which is expected because the resamples were drawn from that sample. If the DGP had a mean of 3.06, sample means near our minimum (3.031001) and maximum (3.107694) would be much less likely than means near the center. The standard error of this distribution is 0.01191397, which is why the graph above is so narrow on the x-axis scale.
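The resampling idea behind this sampling distribution can be sketched in base R as follows. The data are simulated (`stress` is a hypothetical stand-in for the survey variable), and this is only one way to bootstrap a mean, not necessarily the code we ran.

```r
# Hypothetical data: simulated 1-4 ratings, NOT the EAMMI responses.
set.seed(2)
stress <- sample(1:4, size = 200, replace = TRUE)

# Treat the sample as the DGP: resample it with replacement 1000 times
# and record the mean of each resample.
boot_means <- replicate(1000, mean(sample(stress, replace = TRUE)))

mean(boot_means)   # lands very close to mean(stress)
range(boot_means)  # smallest and largest bootstrapped means
sd(boot_means)     # bootstrap estimate of the standard error
```

The standard deviation of the bootstrapped means plays the role of the standard error reported in the table.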
h) Confidence Intervals:
Interpretation: Here, we have constructed a 95% confidence interval, meaning we have set our significance level (alpha) to 0.05. The lower limit of our confidence interval is 3.039325 and the upper limit is 3.08752. This means that if we were to repeat this method on many random samples, 95% of the intervals we create would contain the true population parameter that generated our data. The significance level of 0.05 matters in hypothesis testing, because we compare the p-value against it to make a decision. However, in this case we do not focus on the p-value because we are not comparing the simple model to anything yet.
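As a rough illustration of how a 95% interval for a mean follows from alpha = 0.05, the sketch below builds one by hand from the t distribution and compares it with `confint()` on an intercept-only model. The data are simulated and the variable names are hypothetical.

```r
# Hypothetical data: simulated 1-4 ratings, NOT the EAMMI responses.
set.seed(3)
stress <- sample(1:4, size = 200, replace = TRUE)

n      <- length(stress)
alpha  <- 0.05                           # 95% confidence level
se     <- sd(stress) / sqrt(n)           # standard error of the mean
t_crit <- qt(1 - alpha / 2, df = n - 1)  # critical t value

# Interval built by hand: estimate plus or minus margin of error.
ci_manual <- mean(stress) + c(-1, 1) * t_crit * se

# The same interval from the fitted empty model.
ci_lm <- confint(lm(stress ~ 1), level = 1 - alpha)

ci_manual
ci_lm
```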
j) Conclusion: Based on our word equation, we proposed that stress levels can be modeled using the average reported stress level from the survey plus some error. After fitting a linear model and examining the analysis of variance table, it is clear that there is still a lot of variance unaccounted for. Our sum of squares was 1356.725, which demonstrates that even though the mean produces a smaller sum of squared errors than any other single number, it is still not very good at explaining variability within the sample, and much less within the population.
Quantitative Predictor Model
a) Word Equation: Stress = Age + Error
Interpretation: Age explains some of the variation in stress level, and the rest is error.
b) GLM Equation: $Y_i$ = $b_0$ + $b_1X_i$ + $e_i$
Interpretation: This is the general linear model equation, where $Y_i$ is an individual's stress level, $b_0$ is the predicted stress level when age is 0, $X_i$ is the individual's age, $b_1$ is the increment of stress to add for each additional year of age, and $e_i$ is how far off the model's prediction is.
c) Visualizations:
Interpretation: The point plot helps us see the relationship between age and stress by plotting every data point on the graph, giving us an overall view of how the two variables are related. Like the point plot, the jitter plot also plots every point. However, point plots do not let us see when one point sits on top of another, whereas jitter plots shift the points slightly both vertically and horizontally so that overlapping points become visible. Thus, the density of points on the jitter plot shows us which ages are more likely to have higher levels of stress.
d) Fit a linear model: $Y_i$ = 3.42673 - 0.01646$X_i$ + $e_i$
Interpretation: The intercept is the predicted stress level for someone who is 0 years old. The $b_1$ coefficient is the amount subtracted from the prediction for each additional year of age. This means that if someone were 20 years old, their predicted stress level would be 3.42673 - .01646(20), which is 3.09753. This is a fairly high score, keeping in mind that the scale is from 1-4.
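The prediction for a 20-year-old can be reproduced with a few lines of R using the fitted coefficients reported above. The helper name `predict_stress` and the extra ages are hypothetical, chosen only for illustration.

```r
# Coefficients reported by the fitted model.
b0 <- 3.42673   # intercept: predicted stress at age 0
b1 <- -0.01646  # change in predicted stress per additional year of age

# Hypothetical helper: model prediction for a given age.
predict_stress <- function(age) b0 + b1 * age

predict_stress(20)             # 3.09753
predict_stress(c(18, 25, 29))  # predictions fall as age rises
```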
e) Quantifying Error (ANOVA):
Interpretation: Looking at the ANOVA table, we can compare the sum of squares for the model (13.153) with the total sum of squares (903.58); the model does not appear to explain much variability. We can confirm this with the PRE, which shows that age explains .0146 of the variation in Stress. However, when we compare the mean square for the model (13.153) with the mean square for the total (.432), we see that MS model is much higher than MS total. The F value is also fairly high at 30.9, meaning that the variance explained by the model is almost 31 times the variance left unexplained (each taken per degree of freedom). The df for the model also shows that this model is fairly elegant in that it requires only one additional degree of freedom (the one needed to estimate $b_1$).
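The PRE can be recovered directly from the two sums of squares in the table. The short sketch below does that arithmetic in R; the variable names are ours.

```r
# Sums of squares reported in the ANOVA table.
ss_model <- 13.153
ss_total <- 903.58

# Error left over after the age model.
ss_error <- ss_total - ss_model

# PRE: proportion of total error removed by adding age.
pre <- ss_model / ss_total
round(pre, 4)  # 0.0146
```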
f) Visualization of Sampling Variability:
Interpretation: This visualization shows the distribution of $b_1$ estimates we could obtain if our sample were the DGP. We obtain it by resampling our data with replacement and fitting a $b_1$ for each resample. The distribution below is normal and centers around our sample $b_1$ (which makes sense because we treated the sample as our population). We can also see that 0 is not included within our graph of $b_1$s, so it is unlikely that we would obtain a sample that has a $b_1$ of zero. Note that $b_1$ = 0 means there is no relationship between age and stress, in which case we would keep the simple model.
g) Numeric Description of Sampling Variability:
Interpretation: Again, we can see that both extremes of the bootstrapped $b_1$ values reported in the table (-.00723 and -.00282) are below 0, meaning that it would be highly unlikely to obtain a sample whose $b_1$ equals 0. This gives us an idea of what our confidence interval may look like, in that it will also lie below 0. This is beginning to show us that we may be able to reject the null hypothesis, that is, the simple model.
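The bootstrapping of $b_1$ described above can be sketched in base R as follows. Everything here is simulated: the data frame `dat`, its columns, and the slope used to generate it are assumptions for illustration, not the EAMMI data or our actual code.

```r
# Hypothetical data: ages 18-29 with a small negative slope built in.
set.seed(4)
n      <- 300
age    <- sample(18:29, n, replace = TRUE)
stress <- 3.4 - 0.02 * age + rnorm(n, sd = 0.6)
dat    <- data.frame(age, stress)

# Resample rows with replacement and refit the model each time,
# keeping only the slope (b1) from each fit.
boot_b1 <- replicate(1000, {
  rows <- sample(nrow(dat), replace = TRUE)
  coef(lm(stress ~ age, data = dat[rows, ]))[["age"]]
})

mean(boot_b1)      # close to the slope fitted on the full sample
range(boot_b1)     # minimum and maximum bootstrapped b1
mean(boot_b1 < 0)  # share of bootstrapped slopes below zero
```

If 0 lies outside the range of bootstrapped slopes, as in our table, that is informal evidence against a model with no age effect.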
h) Confidence Intervals:
Interpretation: Here we have constructed our confidence interval from a t distribution based on our sample. The confidence level is set at 95%, meaning that alpha = .05, which will be important later when we discuss the p-value's relationship to alpha. Our confidence interval is -0.02263022 to -0.01069335, which once again shows that 0 is not contained within this interval. This means that if the simple model were true, it would be highly unlikely for us to get our sample. We are 95% confident that the population parameter falls within this interval, in the sense that 95% of intervals constructed this way would contain it.
i) Model Comparison:
Interpretation: In our model comparison, we compared our model to the empty model and concluded that, because our F statistic was 30.918, it would be highly unlikely to have gotten such a high F value if there were no relationship between Stress and Age (given that our F value is greater than 4). This is reinforced by the sampling distribution we created, which shows what we would expect if the simple model were true in the DGP. As we can see, our sample's estimate falls outside that sampling distribution, meaning that it would be highly unlikely for our sample to occur if the simple model were true. We can also see that the p-value of <.0001 is less than the alpha level of .05 that we have chosen. This means that we will reject the simple model in favor of the complex one.
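The "F greater than 4" rule of thumb and the reported p-value can be checked against the F distribution. The sketch below assumes 1 model degree of freedom and 2093 error degrees of freedom; the error df is taken from the qualitative predictor section and is an assumption for this model.

```r
f_observed <- 30.918
df_model   <- 1
df_error   <- 2093  # assumption: error df as reported for the two-group model

# Critical F at alpha = .05: values above this are "unlikely" under the empty model.
f_crit <- qf(0.95, df1 = df_model, df2 = df_error)

# Probability of an F at least this large if the empty model were true.
p_value <- pf(f_observed, df1 = df_model, df2 = df_error, lower.tail = FALSE)

f_crit
p_value
```

With these degrees of freedom the critical value sits just under 4, which is where the rule of thumb comes from.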
j) Conclusion: Our research question asked whether people become more stressed as they get older. As we saw when fitting the model, $b_1$ = -.016, which means there is a negative relationship between age and stress. In other words, people get less stressed as they age. To ensure that there is actually a relationship between age and stress, we ran a model comparison in addition to looking at the sampling variability if our sample were the DGP. From these, we were able to see that our sample would be highly unlikely to occur if the simple model were true (where there is no relationship between age and stress and $\beta_1$ = 0). This means that the data do not demonstrate what we believed to be true and actually show the opposite: as people age, they become less stressed.
Qualitative Predictor Model
a) Word Equation: StressLevel = Age + Error
Interpretation: A person's stress level is the predicted stress level for their age group (young or old) plus error, the difference between the actual and predicted stress values.
b) GLM Equation: $Y_i$ = $b_0$ + $b_1X_i$ + $e_i$
Interpretation: This represents the general linear model. Here $Y_i$ represents the person's stress level, $b_0$ represents the predicted stress level when $X_i$ = 0, $X_i$ indicates which age group the person belongs to, $b_1$ represents the increment to add for the other age group, and $e_i$ represents the difference between the predicted value and the actual value.
c) Visualizations:
Interpretation: The box plot compares stress in the young group with stress in the old group. This plot shows the data on the perceived stress scale of 1-4 that was established at the beginning of the study. The faceted histogram shows the density of stress scores within each age group. This form of displaying the data makes it easier to see the actual differences in stress level between the young group and the old group.
d) Fit Linear Model:
$Y_i$ = 3.1457 - 0.1328$X_i$ + $e_i$
Interpretation: The intercept (3.1457) represents the average stress level for the young group, the reference group coded as $X_i$ = 0. The $b_1$ coefficient (-0.1328) represents the value to subtract from that average if a person is in the old group.
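With a two-group predictor, the model makes only two predictions, one per group. A small R sketch using the reported coefficients shows both; the 0/1 coding with young as the reference group is an assumption based on the coefficient being labeled "ageold".

```r
# Coefficients reported by the fitted two-group model.
b0 <- 3.1457   # mean stress for the reference group (assumed: young)
b1 <- -0.1328  # adjustment applied when the person is in the old group

# Dummy coding assumed here: young = 0, old = 1.
is_old <- c(young = 0, old = 1)

predicted <- b0 + b1 * is_old
predicted  # young: 3.1457, old: 3.0129
```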
e) Quantifying Error (ANOVA):
Interpretation: Based on the ANOVA table below, the PRE value is 0.0102, which is the proportion of the variation in perceived stress levels that is explained by age group. The F value (21.616) is relatively high, meaning that the variance explained by the model is about 21 times the variance left unexplained, each taken per degree of freedom. The error has 2093 degrees of freedom, which reflects our large sample size rather than the complexity of the model; the model itself is elegant because it spends only one degree of freedom to estimate $b_1$.
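PRE and F are two views of the same comparison, so one can be recovered from the other once the degrees of freedom are known. The sketch below checks the reported values against each other, assuming 1 model df for the two-group model.

```r
f_value  <- 21.616
df_model <- 1     # assumption: one df spent estimating b1
df_error <- 2093  # error df reported in the ANOVA table

# Rearranging F = (PRE / df_model) / ((1 - PRE) / df_error) for PRE.
pre <- (f_value * df_model) / (f_value * df_model + df_error)
round(pre, 4)  # 0.0102
```

The result agrees with the PRE in the table, which suggests the reported F, PRE, and degrees of freedom are internally consistent.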
f) Visualization of Sampling Variability:
Interpretation: This visualization displays the distribution of $b_1$ estimates we could obtain if our sample were the DGP. Based on the histogram below, the distribution is normal and is centered on the $b_1$ from our sample. In addition, the value of 0 is not found within this distribution, which indicates that it is unlikely that age group (young or old) and stress level have no relationship.
g) Numeric Description of Sampling Variability:
Interpretation: Based on the table below, the extremes of the bootstrapped $b_1$ values are reported as -.02307 and -.0428. These values further support the conclusion that age group and stress level are unlikely to be unrelated, because 0 does not fall between the minimum and maximum of $b_1$.
h) Confidence Intervals:
Interpretation: Our 95% confidence interval is -0.189017 to -0.078983. This means that we are 95% confident that the interval contains the true population parameter. Because 0 is not inside the interval, it gives us further evidence that the true difference between the groups is not 0.
i) Model Comparison:
Interpretation: In our model comparison, we compared our two-group model to the empty model. Given an F value of 21.616, age as a qualitative variable explains more variation in stress than the empty model, and because our F value is higher than 4, it would be unlikely to get such a value if there were no relationship between age and stress. Since our p-value of <0.0001 was less than our alpha level of 0.05, we can reject the simple model in favor of our two-group model. Our graph also shows the $b_1$s in this sampling distribution that are higher or lower than our observed $b_1$ value.
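A comparison between the empty model and a two-group model can be sketched in base R with `anova()`. The data below are simulated, and the names `age_group`, `empty_model`, and `group_model` are hypothetical; this shows the general technique rather than our exact code.

```r
# Hypothetical data: two age groups with a small built-in difference.
set.seed(5)
n         <- 400
age_group <- factor(sample(c("young", "old"), n, replace = TRUE),
                    levels = c("young", "old"))
stress    <- 3.15 - 0.13 * (age_group == "old") + rnorm(n, sd = 0.65)

empty_model <- lm(stress ~ 1)          # simple model: one mean for everyone
group_model <- lm(stress ~ age_group)  # complex model: one mean per group

# Compare the nested models: row 2 holds F and its p-value.
comparison <- anova(empty_model, group_model)
comparison

f_value <- comparison$F[2]
p_value <- comparison[["Pr(>F)"]][2]
```

If `p_value` falls below the chosen alpha, the simple model is rejected in favor of the two-group model, which mirrors the decision rule used above.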
j) Conclusion: In order to analyze the relationship between age and stress, the overall age data was split into two groups, categorized as "old" and "young." When we fit the model, the "ageold" ($b_1$) value was -0.1328. This indicates a negative relationship between being in the old age group and stress. It can be concluded that as people get older they tend to feel less stressed. We have also shown that there is in fact a relationship between the two variables through our model comparison, sampling distribution, and confidence intervals.