Building your own dataset - Reddit Scrape
In this notebook I will walk you through building your own dataset so you can create unique, one-of-a-kind projects.
Although this example uses Reddit, you can follow a similar path to build many different types of datasets.
We will also load the dataset and visualize it after we build it.
We are going to use two libraries to scrape data from Reddit. praw lets us work with the Reddit API, and psaw, a wrapper for the Pushshift API, is layered on top of it to handle multiple requests.
We need psaw because praw will only return 1000 results, and we need far more data for our projects!
Create Reddit Account
If you want to scrape your own data, you will need to create a Reddit developer account.
I have included the data with the notebook, so feel free to skip this step and use the data already provided.
- Sign up for a Reddit account.
- Select the "Are you a developer? Create an app" button.
- Give your program a name and a redirect URL ( http://localhost ).
- On the final screen, note your client ID and secret.
Screenshots (images not shown): Create Account, Access Developer, Name, ID and secret.
Enter your details below. Warning: you must replace my credentials with your own; mine will not work.
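As a rough sketch of this step, the snippet below reads the client ID, secret, and user agent from environment variables and then builds the praw and psaw clients. The environment variable names and both helper functions are hypothetical, chosen for illustration; the notebook itself may simply assign the values inline.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
REQUIRED_KEYS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_credentials(env=os.environ):
    """Return the three credential values, failing early if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing Reddit credentials: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}


def make_clients(credentials):
    """Build a praw Reddit client and a psaw PushshiftAPI that uses it."""
    # Imported here so load_credentials works even without praw/psaw installed.
    import praw
    from psaw import PushshiftAPI

    reddit = praw.Reddit(
        client_id=credentials["REDDIT_CLIENT_ID"],
        client_secret=credentials["REDDIT_CLIENT_SECRET"],
        user_agent=credentials["REDDIT_USER_AGENT"],
    )
    return reddit, PushshiftAPI(reddit)
```

Keeping the secret out of the notebook itself also makes it safer to share the notebook later.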
Create Data Query
Set up a start and end date for our query so we can gather a year of data. We select the subreddit we want to query, which is ShowerThoughts. This is a subreddit where people post short, funny thoughts of the kind you might have as your brain wanders. For example:
You aren't paid according to how hard you work, you are paid according to how hard you are to replace.
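One possible way to express the date range is as UTC epoch seconds, which is the form the query parameters expect. The year used here (2018) is only a placeholder assumption, since the text does not say which year was scraped, and `to_epoch` is a hypothetical helper.

```python
import datetime as dt

SUBREDDIT = "ShowerThoughts"


def to_epoch(year, month, day):
    """Convert a calendar date (midnight UTC) to integer epoch seconds."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())


# Placeholder year: one full calendar year of data.
start_epoch = to_epoch(2018, 1, 1)
end_epoch = to_epoch(2019, 1, 1)
```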
Finally, we create a generator that we will use to query our data.
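A minimal sketch of the generator step, assuming `api` is a psaw `PushshiftAPI` instance and that the date range is given in epoch seconds. `build_query` is a hypothetical wrapper; nothing is downloaded until the generator is iterated.

```python
def build_query(api, subreddit, start_epoch, end_epoch):
    """Return a lazy generator of submissions posted between the two epochs."""
    return api.search_submissions(
        after=start_epoch,
        before=end_epoch,
        subreddit=subreddit,
    )
```

Because the result is a generator, you can write rows to disk as they arrive instead of holding a year of posts in memory.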
A little setup first: we set where our data will be stored and the file name. I chose to split the data by month as I query it, so if something fails in the middle we only have to redo that month.
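The month-by-month split could look roughly like this. The `year_month.csv` naming pattern and both helper names are assumptions made for illustration, not the notebook's actual file names.

```python
import datetime as dt
import os


def month_windows(year):
    """Yield (month, start_epoch, end_epoch) for each month of `year` in UTC."""
    for month in range(1, 13):
        start = dt.datetime(year, month, 1, tzinfo=dt.timezone.utc)
        # Each window ends where the next month begins
        # (January of the following year for December).
        end = dt.datetime(year + month // 12, month % 12 + 1, 1,
                          tzinfo=dt.timezone.utc)
        yield month, int(start.timestamp()), int(end.timestamp())


def month_file(data_folder, year, month):
    """Build the path of the CSV file that holds one month of posts."""
    return os.path.join(data_folder, f"{year}_{month:02d}.csv")
```

If a download fails partway through, only the file for that month needs to be deleted and fetched again.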
Download our Data
Warning: this process will take a long time to complete. If you want to follow along, feel free to use the data I have already provided in the notebook. Full Dataset GitHub
Submissions - Our query returns submission objects, and we pull data attributes off each one.
- created_utc (we will break this down into month, day, and hour)
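Here is a sketch of how each submission might be flattened into a CSV row. The column order (id, title, score, month, day, hour) and the function names are assumptions; the only requirement on `submission` is that it exposes `id`, `title`, `score`, and `created_utc` attributes.

```python
import csv
import datetime as dt


def submission_to_row(submission):
    """Flatten one submission into [id, title, score, month, day, hour]."""
    created = dt.datetime.fromtimestamp(submission.created_utc,
                                        tz=dt.timezone.utc)
    return [submission.id, submission.title, submission.score,
            created.month, created.day, created.hour]


def write_submissions(submissions, path):
    """Write every submission to `path` as a header-less CSV; return the count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for submission in submissions:
            writer.writerow(submission_to_row(submission))
            count += 1
    return count
```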
Load our Data
Now that we have created our data, let's load it and take a quick look at what we got!
- We take every csv file in our data_folder and load it into our data_holder list.
- We didn't add column headers to our data, so let's make every row a dictionary with only the data we are interested in using (id, title, score, month, hour).
- Note that the data is stored in the same directory as the notebook, inside a folder called data_files.
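The loading steps above can be sketched as follows. The column order in `COLUMNS` is an assumption about how the files were written, and `load_rows` is a hypothetical helper name.

```python
import csv
import glob
import os

# Assumed order of the header-less columns in each file.
COLUMNS = ("id", "title", "score", "month", "day", "hour")


def load_rows(data_folder="data_files"):
    """Load every CSV in `data_folder` into a list of small dictionaries."""
    data_holder = []
    for path in sorted(glob.glob(os.path.join(data_folder, "*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.reader(handle):
                record = dict(zip(COLUMNS, row))
                # Keep only the fields we care about, with numeric types.
                data_holder.append({
                    "id": record["id"],
                    "title": record["title"],
                    "score": int(record["score"]),
                    "month": int(record["month"]),
                    "hour": int(record["hour"]),
                })
    return data_holder
```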
I am interested in the average score of a post and the average length of a post.
Also, how many posts end up with a score of 1 (your starting score) or 0 (the lowest score possible)?
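These summary numbers could be computed along these lines. This sketch assumes "length of a post" means the number of characters in the title, and that each post is a dictionary with `title` and `score` keys; `summarize` is a hypothetical name.

```python
def summarize(posts):
    """Return (mean score, mean title length, share of posts scoring 0 or 1)."""
    total = len(posts)
    if total == 0:
        raise ValueError("posts is empty")
    mean_score = sum(post["score"] for post in posts) / total
    mean_length = sum(len(post["title"]) for post in posts) / total
    low_share = sum(1 for post in posts if post["score"] in (0, 1)) / total
    return mean_score, mean_length, low_share
```

A mean that sits far above the typical value, as it does here, usually means a small number of very high-scoring posts are pulling the average up.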
The average score is 80, but 72% of posts have a score of 0 or 1!
Plot the Data
Let's convert to a pandas dataframe and plot some quick charts.
We can add columns to our dataframe; here I add the length of each post.
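A minimal sketch of the dataframe step, assuming the rows are dictionaries with `title`, `score`, `hour`, and `month` keys and that length is measured in characters of the title. The function names are hypothetical.

```python
import pandas as pd


def build_frame(posts):
    """Turn the list of post dictionaries into a dataframe with a length column."""
    frame = pd.DataFrame(posts)
    frame["length"] = frame["title"].str.len()
    return frame


def mean_score_by(frame, column):
    """Average score for each distinct value of `column` (e.g. hour or month)."""
    return frame.groupby(column)["score"].mean()
```

With matplotlib installed, something like `mean_score_by(frame, "hour").plot(kind="bar")` would draw the corresponding chart.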
Mean Score by Hour
Mean Score by Month
As you can see, we have created our own dataset with some interesting insights already. I plan on using the dataset to try to predict the score of a post or to create some sort of post generator! Please feel free to use this as a starting point for projects of your own!