Name: Jessica Gangle
Name: Derek Tingey
Name: Adam Cohen
Name: Puja Patel
Name: Ahhyun Lee
Name: Yiyang Zhou
Overview
Research question: How does age impact political views? How much variation in political views can be explained by age?
- In Section 1 (Simple Model), we'll explore the simple model constructed upon the variable Political_views.
- In Section 2 (Qualitative Predictor Model), we'll explore a model that uses a categorical variable agefactor to predict Political_views.
- In Section 3 (Quantitative Predictor Model), we'll explore a linear model that uses age to predict Political_views.
Simple Model
a) Word Equation
- Political Affiliation = Mean + error
b) GLM Equation
- $$Y_i = \beta_0 + e_i$$
- $Y_i$: value of the variable Political_views.
- $\beta_0$: The mean of Political_views.
- $e_i$: error (difference between each value and the mean)
c) Visualizations: simple / empty model, point plot
The outcome of the empty model shows that there is quite a bit of variation in political views. Based on the histogram, the distribution could be considered bimodal but is, overall, fairly normally distributed.
d) Fit a linear model
The output of the code gives us an intercept of 3.459, which is the mean of the variable Political_views. It would also be the predicted value for every point if we use the simple model.
e) Quantifying Error (ANOVA)
From the ANOVA table, we know:
- The sum of squared residuals is 4494.158.
- The mean squared residual (the sum of squares divided by its degrees of freedom) is 2.6, meaning that on average the squared error (the squared difference between an actual value and the mean) is 2.6.
Because the empty model has no explanatory variable, we cannot draw any conclusions about whether any variance is explained.
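To make the quantities in the ANOVA table concrete, here is a minimal Python sketch of the empty model. It is an illustration only: the report's own numbers come from R output, and the data below are synthetic scores on an assumed 1 to 7 scale, not the survey responses. The function name `empty_model` is hypothetical.

```python
import random
import statistics

# Synthetic 1-7 scale responses; NOT the survey data analyzed in this report.
random.seed(1)
political_views = [random.randint(1, 7) for _ in range(200)]

def empty_model(values):
    """Fit Y_i = b0 + e_i and return b0, the sum of squares, and the mean square."""
    b0 = statistics.fmean(values)                    # the model's single prediction
    ss_total = sum((y - b0) ** 2 for y in values)    # sum of squared residuals
    df = len(values) - 1                             # one df is spent estimating b0
    return b0, ss_total, ss_total / df

b0, ss, ms = empty_model(political_views)
print(f"b0 = {b0:.3f}, SS = {ss:.3f}, MS = {ms:.3f}")
```

The mean square returned here is the ordinary sample variance, which is why it can be read as the average squared error per degree of freedom.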
f) Visualization of sampling variability
In this part, we generate a sampling distribution of the mean by assuming that the population is normal with the same parameters as our original sample.
As expected, our sampling distribution is normal, with an approximate range for $\beta_0$ between 3.35 and 3.55.
g) Numeric description of sampling variability
This part requires us to numerically describe the variability of the sampling distribution (the 1000 means) we obtained in the previous part.
Thus, it's sufficient to obtain the standard deviation (also called the standard error in this case) of the 1000 means.
Another way of numerically describing sampling variability is to create a simple model for the sampling distribution and then obtain information about the squared errors (the distance from each sample mean to the mean of the sample means).
From the favstats() output, we know:
- The standard error of the sampling distribution is around 0.04.
From the anova() output, we know:
- The mean squared residual of our sampling distribution is around 0.0016.
If we create a sampling distribution of our null model, the mean of the sampling distribution should be close to the mean value in the original sample. We still cannot see whether there is any explained variance in the data, because there is no explanatory variable.
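The link between the two numbers above (a standard error of about 0.04 and a mean square of about 0.0016) can be sketched in Python. This is not the report's R code. It assumes a normal population with the sample mean of 3.459 and a variance of about 2.6 quoted earlier, and a sample size of 1724 inferred from the degrees of freedom mentioned later in the report. The function name `simulate_means` is hypothetical.

```python
import random
import statistics

random.seed(2)

# Assumed population: normal, with mean and variance close to the sample values
# quoted in the text. n = 1724 is inferred from the reported degrees of
# freedom and should be treated as an assumption.
MU, SIGMA, N = 3.459, 2.6 ** 0.5, 1724

def simulate_means(mu, sigma, n, reps=1000):
    """Draw `reps` samples of size n and return the mean of each sample."""
    return [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
            for _ in range(reps)]

means = simulate_means(MU, SIGMA, N)
center = statistics.fmean(means)
se = statistics.stdev(means)     # SD of the simulated means = standard error
# Mean square of an empty model fitted to the simulated means.
ms = sum((m - center) ** 2 for m in means) / (len(means) - 1)

print(f"center = {center:.3f}, SE = {se:.4f}, mean square = {ms:.5f}")
```

Because the mean square of an empty model is the variance of its outcome, it equals the squared standard error, which is consistent with 0.04 squared being 0.0016.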
h) Confidence intervals
Since we do not know the true standard deviation of the population, we'll obtain a confidence interval based on the t distribution with the confint() function.
According to this output, we can conclude with 95% confidence that the true mean for the variable Political_views, or the true $\beta_0$ for the empty model, is between 3.383 and 3.536.
To make this more intuitive, we can check this confidence interval against the sampling distribution we obtained in part f): we look for the value below which 2.5% of the sample means fall and the value above which 2.5% of the sample means fall.
Now, we have obtained two confidence intervals based on two different methods. We can see that the bounds of the CI obtained from our sampling distribution are similar to those obtained by confint().
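The percentile approach described above can be sketched as follows. This is a Python illustration rather than the report's R code. The list `means` is a hypothetical stand-in for the 1000 simulated means, generated here with an assumed standard error of 0.04, and `percentile_interval` is an illustrative helper name.

```python
import random

def percentile_interval(values, level=0.95):
    """Return the cut points that leave (1 - level) / 2 of the values in each tail."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lower_index = round(tail * len(ordered))        # e.g. 25 of 1000 values lie below
    upper_index = len(ordered) - lower_index - 1    # and 25 of 1000 lie above
    return ordered[lower_index], ordered[upper_index]

# Hypothetical stand-in for the 1000 simulated means (SE of 0.04 assumed).
random.seed(3)
means = [random.gauss(3.459, 0.04) for _ in range(1000)]

low, high = percentile_interval(means)
print(f"95% interval from the sampling distribution: ({low:.3f}, {high:.3f})")
```

With a roughly normal sampling distribution, these cut points should land close to the mean plus or minus about two standard errors, which is why the two methods agree.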
i) Model Comparison
Not needed for the simple model.
j) Conclusion
We cannot draw any conclusions about the impact of age on political beliefs from the data we have interpreted thus far. We can only be confident at a 95% level that the mean rating for political belief falls between 3.38 and 3.54.
This simple model will be used as a baseline model for comparison in the next two sections.
Qualitative Predictor Model
In this section, we transform age into agefactor, a categorical variable that takes one of three values: young, middle or old.
a) Word Equation
- Political Affiliation = Age + error, where age is now a categorical variable.
b) GLM Equation
- $Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + e_i$
- $Y_i$: value of the variable Political_views.
- $X_{1i}$: 1 if person $i$ is classified as "middle", 0 otherwise.
- $X_{2i}$: 1 if person $i$ is classified as "old", 0 otherwise.
- $\beta_0$: The mean value of Political_views for people who are classified as "young".
- $\beta_1$: The increment to the predicted value the model adds for people who are classified as "middle".
- $\beta_2$: The increment to the predicted value the model adds for people who are classified as "old".
- $e_i$: error
c) Visualizations
Based on the jitter plot and the box plot, Political_views does not appear to be related to age. However, a more formal model comparison, as carried out below, is needed to draw a convincing conclusion.
d) Fit a linear model
Our complex model produced a value of 3.5878 as the average political affiliation score for people considered "young". Furthermore, our model suggests that the average score is 0.2661 lower for people in the "middle" age group and 0.1192 lower for people in the "old" age group.
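As a quick check on how to read these coefficients, and assuming the usual dummy coding with "young" as the reference group, the predicted group means follow directly from the values quoted above:

$$\hat{Y}_{\text{young}} = 3.5878, \quad \hat{Y}_{\text{middle}} = 3.5878 - 0.2661 = 3.3217, \quad \hat{Y}_{\text{old}} = 3.5878 - 0.1192 = 3.4686$$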
e) Quantifying Error (ANOVA)
From the output of anova(), we see that even with the model applied, the sum of squared residuals and the mean squared residual are still high.
With the model applied, the average squared difference between an individual value and its predicted value is 2.599.
f) Visualization of sampling variability
The parameters we are interested in are $\beta_0$, $\beta_1$, and $\beta_2$. Thus, we'll create and explore the sampling distribution of each of these estimates.
The following code produces 1000 linear models, each based on a new sample of the same size drawn with replacement from the original dataset. Then the corresponding $\beta_0$, $\beta_1$, $\beta_2$ and $F$ are collected.
From the visualization of the sampling distribution, we see that most resampled $\beta_1$ and $\beta_2$ are below 0. These observations seem to support the hypothesis that agefactor can explain the variability of political views to some extent.
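The resampling procedure described above can be sketched in Python. This is an illustration under stated assumptions, not the report's R code. The rows are synthetic, with group means loosely based on the reported coefficients and an assumed spread of 1.6. The helper names `fit_group_model` and `bootstrap` are hypothetical, and $F$ is omitted for brevity.

```python
import random
import statistics

random.seed(4)

# Synthetic (group, score) rows; the group means are loosely based on the
# reported coefficients and are NOT the survey data.
TRUE_MEANS = {"young": 3.59, "middle": 3.32, "old": 3.47}
data = [(g, random.gauss(mu, 1.6)) for g, mu in TRUE_MEANS.items() for _ in range(300)]

def fit_group_model(rows):
    """Return (b0, b1, b2): the young mean and the middle/old increments."""
    means = {g: statistics.fmean(y for grp, y in rows if grp == g) for g in TRUE_MEANS}
    return means["young"], means["middle"] - means["young"], means["old"] - means["young"]

def bootstrap(rows, reps=1000):
    """Refit the model on `reps` resamples drawn with replacement."""
    return [fit_group_model(random.choices(rows, k=len(rows))) for _ in range(reps)]

estimates = bootstrap(data)
share_b1_negative = sum(b1 < 0 for _, b1, _ in estimates) / len(estimates)
share_b2_negative = sum(b2 < 0 for _, _, b2 in estimates) / len(estimates)
print(f"b1 < 0 in {share_b1_negative:.0%} of resamples, b2 < 0 in {share_b2_negative:.0%}")
```

The share of resampled increments below 0 gives a rough numeric counterpart to what the histograms show visually.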
g) Numeric description of sampling variability
This part requires us to numerically describe the variability of the sampling distributions (the 1000 resampled estimates of each parameter) we obtained in the previous part.
Thus, it's sufficient to obtain the standard deviation (also called the standard error in this case) of the 1000 estimates of each parameter.
h) Confidence intervals
From the output, we know that, based on the sample we have:
- We are 95% confident that the true mean for people classified as young lies between 3.456 and 3.720.
- We are 95% confident that the true $\beta_1$ lies between -0.453 and -0.080.
- We are 95% confident that the true $\beta_2$ lies between -0.306 and 0.067.
i) Model Comparison
j) Conclusion
From the analyses we have run, the difference in political views appears clearer between the young and middle-aged groups than between the young and old groups. We can see that our confidence interval does not contain 0 for b1 but it does for b2. This gives little evidence against the null model when comparing the old group with the young group.
From the supernova table, we observe a fairly low p-value: 0.0198. It provides evidence that the variability of Political_views is explained by agefactor to some degree.
However, our PRE is incredibly low, indicating that only about 0.45% of the variation in political views is explained by age.
In conclusion, with the low p-value, there is some evidence that Political_views is associated with agefactor. However, given the low PRE value and that the confidence interval of $\beta_2$ contains 0, our evidence is limited.
Quantitative Predictor Model
a) Word Equation
- Political Affiliation = Age + error
b) GLM Equation
- $Y_i=\beta_0+\beta_1X_i+e_i$
c) Visualizations
From the visualization, it seems that there's a weak positive association between Political_views and age. However, further statistical analysis is needed to draw a convincing conclusion.
d) Fit a linear model
Our complex model suggests that for someone with a theoretical age of 0, the predicted political affiliation score is 3.07652. Furthermore, for every 1-year increase in age, a person's score increases by 0.01806 on average.
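To show how the intercept and slope turn into predictions, here is a short Python sketch. It is an illustration, not the report's R code. `fit_line` is a hypothetical helper that applies the standard least-squares formulas and is demonstrated on made-up toy values. The coefficients passed to `predict` are the ones quoted above.

```python
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for Y_i = b0 + b1 * X_i + e_i."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def predict(b0, b1, age):
    """Predicted score for a given age under the fitted line."""
    return b0 + b1 * age

# Toy ages and scores (made up) to show the fitting step.
toy_b0, toy_b1 = fit_line([20, 30, 40, 50], [3.2, 3.6, 3.5, 4.1])
print(f"toy fit: b0 = {toy_b0:.3f}, b1 = {toy_b1:.4f}")

# Predictions using the coefficients reported in the text.
B0, B1 = 3.07652, 0.01806
for age in (20, 40, 60):
    print(age, round(predict(B0, B1, age), 3))
```

Even across a 40-year age gap, the predicted score changes by less than one point, which matches the weak association seen in the plot.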
e) Quantifying Error (ANOVA)
From the output of anova(), we see that even with the complex model applied, the sum of squared residuals and the mean squared residual are still high.
With the model applied, the average squared difference between individual values and the predicted values is 2.6.
f) Visualization of sampling variability
We used resampling to create a sampling distribution of the b1 value for the quantitative model of age as a predictor of political views. We can see that the distribution of b1 is fairly normal and centered around about .018.
g) Numeric description of sampling variability
The favstats() output for our sampling distribution shows that it is centered around .018, with a standard deviation of around .01.
h) Confidence intervals
We constructed a confidence interval for both b1 and b0. We are 95% confident that the true value of b1 lies between the values .0028 and .0333.
The confidence interval for the slope does not include 0, indicating that at a 95% confidence level, our sample provides evidence that political views can be explained by age.
i) Model Comparison
From the model comparison, we can see that the PRE value is very small, which means that very little variation is explained by the age variable. Moreover, the p-value is lower than 5%. This means that there is evidence at a 95% confidence level that we should reject the simple model, because the data would be unlikely if the simple model were true. However, F is relatively large because the model uses only 1 degree of freedom while 1722 remain for error, so the variation explained per model degree of freedom is considerable compared with the error variation per degree of freedom.
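The relationship between a small PRE and a comparatively large F can be made explicit. The following is a rough calculation from rounded figures, using the PRE of .0031 reported in this section's conclusion and the degrees of freedom above. It is not a value read from the model output.

$$F = \frac{\text{PRE} / df_{\text{model}}}{(1 - \text{PRE}) / df_{\text{error}}} \approx \frac{0.0031 / 1}{0.9969 / 1722} \approx 5.4$$

With so many error degrees of freedom, even a very small proportion of explained variation yields an F well above 1.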
j) Conclusion
From our model comparison, we can conclude that age explains very little of the variability in the data. Our PRE value was .0031, meaning only about .31% of the variation was explained by the model. However, from the constructed confidence interval, we can reject the simple model and conclude with 95% confidence that there is some difference in political affiliation with age. This is an odd situation: we are claiming a statistically significant advantage in explaining variation in political views with age over the null model, yet from a practical standpoint the explained variability does not seem large enough to conclude that there is a meaningful relationship between the two.