This project gives you the chance to apply the skills you have learned in this course.
Here are some resources for finding datasets that are preloaded in your R libraries:
Link to list of all datasets that are preloaded with library(supernova)
Link to list of all datasets that are preloaded with library(Lock5withR)
Link to list of all datasets that are preloaded with library(fivethirtyeight)
Link to list of all datasets that are preloaded with library(dslabs)
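If a link is unavailable, you can also list a package's datasets from inside the notebook. The sketch below is only an illustration: it uses the `datasets` package that ships with base R as a stand-in (an assumption, chosen so the code runs without extra installs), and `pkg` and `dataset_index` are names made up for this example. Swapping in one of the packages above should work the same way if that package is installed.

```r
# List the datasets bundled with a package.
# "datasets" ships with base R and is used here only as a stand-in;
# replace it with e.g. "supernova" or "dslabs" if those are installed.
pkg <- "datasets"

# data(package = ...) returns an object whose $results element is a
# character matrix with columns Package, LibPath, Item, and Title.
available <- data(package = pkg)$results
dataset_index <- as.data.frame(available[, c("Item", "Title")])
head(dataset_index)

# Look at the structure of one candidate dataset before committing to it.
str(mtcars)
```

Running `?mtcars` (or `?` followed by any dataset name) opens the help page, which usually describes where the data came from and how each variable was measured.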
You can also try searching for datasets on the following websites:
Part I: Explore Variation
In this section, you want to set the stage for this Jupyter notebook.
Start with your research interests and your data. You may want to explain where the data came from or how they were collected. You may already have a research question in mind, or you may want to make some visualizations first so that you can start wondering, "What might be the data generating process (DGP) that produced this distribution?"
This section should include any visualizations and word equations you create to start addressing your research questions and interests. If you need to clean your data or create new variables, do that in this section too, and explain what you did and why.
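As a rough illustration of what this exploration might look like, the sketch below uses the built-in `mtcars` data and base R graphics. Both are assumptions made for the example; your own dataset and the plotting functions you used in the course will differ. The names `cars`, `transmission`, and `group_means` are made up here.

```r
# Work on a copy so the original data frame stays untouched.
cars <- mtcars

# New variable: recode the 0/1 transmission indicator (am) as a labeled
# factor. Why: labeled groups are easier to read in plots and model output.
cars$transmission <- factor(cars$am,
                            levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Word equation being explored: mpg = transmission + other stuff

# Distribution of the outcome variable.
hist(cars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Outcome split by the explanatory variable.
boxplot(mpg ~ transmission, data = cars, ylab = "Miles per gallon")

# Group means as a first numerical summary.
group_means <- tapply(cars$mpg, cars$transmission, mean)
group_means
```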
Part II: Model Variation
In this section, you want to create formal models of the ideas you have been exploring so far.
You might include things like specifying and fitting models, describing alternative models, and depicting the models visually (e.g., as predictions added to your exploratory graphs). Also quantify how much error your model reduces compared with the empty model.
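One possible way to fit a model, draw it on a graph, and measure the reduction in error is sketched below. It assumes the built-in `mtcars` data with `mpg` as the outcome and `wt` as the explanatory variable, which are illustrative choices only, and it computes the proportional reduction in error by hand from sums of squares rather than with the course's own helper functions.

```r
# Empty model: predicts the mean of mpg for every car.
empty_model <- lm(mpg ~ 1, data = mtcars)

# Explanatory model: predicts mpg from weight (wt).
wt_model <- lm(mpg ~ wt, data = mtcars)

# Sum of squared residuals under each model.
ss_total <- sum(resid(empty_model)^2)  # error left by the empty model
ss_error <- sum(resid(wt_model)^2)     # error left by the wt model

# Proportional reduction in error relative to the empty model.
pre <- (ss_total - ss_error) / ss_total
pre

# Depict the model as predictions on top of the exploratory scatterplot.
plot(mpg ~ wt, data = mtcars)
abline(wt_model)
```

For a model fit with `lm()`, this hand-computed value should match the R-squared reported by `summary(wt_model)`.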
Then you want to consider which of these models might represent the DGP that produced your data. You may want to use a random data generating process (such as shuffle()) to examine whether the relationship you see in your empirical data is similar to those created by a random process.
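The shuffling idea can be sketched as follows. This is a minimal example under stated assumptions: it uses the built-in `mtcars` data, and base R's `sample()` stands in for `shuffle()` so that the code runs without extra packages. The names `observed_b1`, `shuffled_b1`, and `extreme_prop` are made up for the illustration.

```r
set.seed(1)  # make the random shuffles reproducible

# Slope estimated from the empirical data.
observed_b1 <- coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

# Slopes from 1000 shuffles: randomly reordering mpg breaks any real
# relationship with wt, so these slopes reflect a purely random process.
shuffled_b1 <- replicate(1000, {
  coef(lm(sample(mpg) ~ wt, data = mtcars))[["wt"]]
})

# Compare the observed slope with the shuffled ones.
hist(shuffled_b1, main = "Slopes from shuffled data", xlab = "b1")
abline(v = observed_b1)

# Proportion of shuffled slopes at least as extreme as the observed one.
extreme_prop <- mean(abs(shuffled_b1) >= abs(observed_b1))
extreme_prop
```

If the observed slope sits far out in the tail of the shuffled slopes, a purely random process is an unlikely explanation for the relationship in the data.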
Part III: Evaluate Models
In this section, you want to evaluate your models to come up with the best estimates for a model of the DGP.
Include a formal comparison of your model(s) with the empty model. Is your model more reasonable than the empty model? Explain how you decided. What is/are the best model(s) of the DGP according to your data? (What is a reasonable range of estimates?)
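A formal comparison and a range of estimates might be produced along these lines. The sketch assumes the built-in `mtcars` data and uses base R's `anova()` and `confint()` as stand-ins for whichever comparison tools you used in the course; `comparison` and `ci` are names made up for the example.

```r
# The two models being compared.
empty_model <- lm(mpg ~ 1, data = mtcars)
wt_model <- lm(mpg ~ wt, data = mtcars)

# F test: does adding wt reduce error more than expected by chance alone?
comparison <- anova(empty_model, wt_model)
comparison

# 95% confidence interval: a reasonable range of estimates for each
# parameter of the wt model.
ci <- confint(wt_model, level = 0.95)
ci
```

An interval for the slope that excludes 0 is consistent with the F test favoring the more complex model over the empty model.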
What have you learned about the DGP from your analysis? What are some strengths and limitations of your analysis? From what we have done so far, are we able to determine whether your explanatory variable causes changes in the outcome variable? Why or why not?