Intro
Hey everyone! This is an introduction to creating your own dataset using Selenium. This notebook's objective is to show you that you don't have to wait until a dataset that looks interesting pops up on Kaggle to start working with data. It is useful and not difficult to begin collecting your own data from the internet using various web scraping tools.
If you are using Selenium in Deepnote instead of locally, make sure to add these lines to the __init__.ipynb.
Firstly, why Selenium?
I recommend Selenium for web scraping because it is more robust than many other solutions. Tools like Beautiful Soup are useful as HTML parsers, but they cannot automate a whole browser the way Selenium can. This brings benefits such as scraping with a single library instead of several (e.g., bs4 plus requests), and it means you can handle dynamically loaded content.
For Selenium to work in Deepnote, we need to run it in headless mode. Selenium is a full browser automation tool, so running it headless means we will not visually see the Chrome browser running. If you are running this code locally, you can run it in headful mode, and you will see your browser load up as the code runs. This is typically my favorite part!
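If you are curious what that setup looks like, here is a minimal sketch of a headless Chrome driver. The exact options and driver path can vary with your Selenium and Chrome versions, so treat this as a starting point rather than the definitive configuration.

from selenium import webdriver

# Run Chrome without a visible window; the extra flags help in containerized
# environments like Deepnote
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=options)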
Now that we have this running, we can start scraping the web. Let's start with Wikipedia — it has a wealth of information and allows itself to be scraped, so it is a great candidate. If you needed to scrape Wikipedia for a real project and not for learning purposes, know that the Python Wikipedia library and the Wikipedia data dumps are better places to get information. With that disclaimer out of the way — let's get started.
The implicitly_wait function tells Selenium to wait up to 3 seconds for an element to appear before reporting that it does not exist on the page. This is very useful for content that gets loaded in by an AJAX request or another dynamic method.
This sends the driver to the webpage we want to scrape. In this case, we are going to start with the Wikipedia page for the Razzies, which are essentially the reverse of the Oscars.
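Together, those two calls look roughly like this (the URL is the Golden Raspberry Awards article):

# Wait up to 3 seconds before reporting that an element is missing
driver.implicitly_wait(3)

# Navigate the browser to the page we want to scrape
driver.get("https://en.wikipedia.org/wiki/Golden_Raspberry_Awards")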
XPath
In my experience, one of the easiest ways to access elements on a web page is with XPath. Now for a quick introduction to XPath. Here is an example XPath expression:
"//div[@class = "header_div"]/h3"
This expression says: look for a div on the page with a class attribute of header_div, then select the h3 heading tags contained directly within that div (find_element returns the first match). All XPath expressions follow a similar pattern: a double slash, then tags with search terms attached. This tutorial aims to get you started as quickly as possible, so we can move on from XPath; this is about as complicated as our expressions will get.
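To make that concrete, here is a sketch of how the expression above could be used with Selenium. header_div is a made-up class for illustration, and depending on your Selenium version you may see find_element_by_xpath instead of the By-based call.

from selenium.webdriver.common.by import By

# find_element returns the first element matching the expression
heading = driver.find_element(By.XPATH, "//div[@class='header_div']/h3")
print(heading.text)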
Another useful resource is the XPath Cheat Sheet. Keep this handy!
Now, let's see if we can grab the title of the Wikipedia article.
Inspect element tells us the title is a header tag whose id attribute is firstHeading.
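In code, that lookup might look something like the sketch below; the mix of quote styles is explained in the notes that follow.

# Grab the article title; single quotes outside, double quotes inside the XPath string
title_element = driver.find_element(By.XPATH, '//h1[@id="firstHeading"]')
print(title_element.text)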
Brilliant!
Two things to note:
- We found the text contained within the element using Selenium's WebElement text attribute
- Notice the use of both single and double quotes so as not to pass an invalid string to the method
Let's grab a bit more information from the Wiki page to build a more complete view of the data we need.
We want to get the date, but this is not an easy format to deal with. Luckily, the datefinder library is fantastic at extracting dates from unstructured plain text.
This line grabs the timestamp of when the article was last edited.
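A sketch of how those two steps could fit together, assuming the "last edited" line lives in the element with id footer-info-lastmod, which is where Wikipedia keeps it:

import datefinder

# The footer holds unstructured text like
# "This page was last edited on 1 May 2021, at 12:34 (UTC)."
last_edited_text = driver.find_element(By.XPATH, '//li[@id="footer-info-lastmod"]').text

# datefinder.find_dates returns a generator of datetime objects found in the text
last_edited = next(datefinder.find_dates(last_edited_text), None)
print(last_edited)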
This line will return all of the content on the page. As long as the text is contained somewhere within the div, the text attribute will return it!
This line tells us how many images are on a given page.
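Roughly, those two lines could look like this. I am using Wikipedia's main content div (id content) and a simple tag count, which is an assumption about how the original cells were written.

# All of the visible text inside the main content div
body_text = driver.find_element(By.XPATH, '//div[@id="content"]').text

# find_elements returns a (possibly empty) list, so its length is the image count
num_images = len(driver.find_elements(By.TAG_NAME, "img"))
print(len(body_text), num_images)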
Now that we have explored the page a little, let's try to package these functions to create a dataframe out of a few Wiki pages. The information that we will collect about each webpage is as follows (a sketch of such a function appears right after the list):
- The title
- The text body
- The number of hyperlinks on the page
- The number of images on the page
- When it was last edited
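Here is one way this could be packaged. The function name, column names, and selectors below are my own choices based on the snippets above, not necessarily the exact code from the original notebook.

import pandas as pd
import datefinder
from selenium.webdriver.common.by import By

def scrape_wiki_page(driver, url):
    """Collect the fields listed above from a single Wikipedia article."""
    driver.get(url)
    title = driver.find_element(By.XPATH, '//h1[@id="firstHeading"]').text
    description = driver.find_element(By.XPATH, '//div[@id="content"]').text
    num_links = len(driver.find_elements(By.XPATH, "//a[@href]"))
    num_images = len(driver.find_elements(By.TAG_NAME, "img"))
    footer = driver.find_element(By.XPATH, '//li[@id="footer-info-lastmod"]').text
    last_edited = next(datefinder.find_dates(footer), None)
    return {
        "title": title,
        "description": description,
        "num_links": num_links,
        "num_images": num_images,
        "last_edited": last_edited,
    }

# Build a one-row dataframe for the Razzies article as a quick sanity check
razzies_row = scrape_wiki_page(driver, "https://en.wikipedia.org/wiki/Golden_Raspberry_Awards")
df = pd.DataFrame([razzies_row])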
Awesome! Let's test if our function works on a different article.
Great! We have a function that works on both pages, and we never even looked at the Oscars Wiki page. Luckily for us, Wikipedia pages are laid out in a pretty uniform way.
Bring the Webdriver back to the Razzies page.
The line above looks for every 'a' tag with an href attribute and then collects the links. Note the use of the find_elements method rather than find_element. Using a convenient list comprehension, we keep only the links to Wiki articles that aren't about the Golden Raspberries. A few of the remaining URLs are meta articles about Wikipedia that we might not want to scrape, but the majority of the links are good.
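As a sketch (the exact filter terms here are illustrative, not necessarily the ones in the original cell):

# Every anchor tag on the page that actually has an href attribute
anchors = driver.find_elements(By.XPATH, "//a[@href]")
hrefs = [a.get_attribute("href") for a in anchors]

# Keep only links to other Wikipedia articles that are not about the Razzies themselves
links = [h for h in hrefs if h and "/wiki/" in h and "Golden_Raspberry" not in h]
print(len(links))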
Now let's try to create a dataframe with all of the links we collected from the Razzies page. By the end of our work, we will have a pandas dataframe with info on the Razzies and all the movies associated with them.
We iterate through our links and append the scraped data to our dataframe. This can take some time because we need to access a large number of pages. Feel free to limit the number of links you will scrape if you are feeling impatient.
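A sketch of that loop, reusing the scrape_wiki_page helper sketched earlier; the limit of 50 pages is just an example.

rows = [razzies_row]                 # start with the Razzies row from earlier
for link in links[:50]:              # limit the number of links if you are feeling impatient
    try:
        rows.append(scrape_wiki_page(driver, link))
    except Exception:
        # Some pages (e.g. Wikipedia meta articles) may not have every element we expect
        continue

df = pd.DataFrame(rows)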
Let's explore some of the data briefly.
Before we wrap up, let's clean up the data a bit. We are going to use the dataframe's apply method, which I find incredibly convenient. We will use it to make our description text all lower case. Additionally, we will remove the stopwords. If your goal were to perform NLP tasks on the data, this would be a reasonable step. Stopwords are words that are necessary for grammar but do not necessarily carry meaning on their own, words like "at", "the", "a", "to", etc.
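One possible text_clean function is sketched below. It assumes NLTK's English stopword list (you may need to download it the first time) and a column named description; both are assumptions on my part.

import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)   # only needed the first time
stop_words = set(stopwords.words("english"))

def text_clean(text):
    """Lower-case the text, strip punctuation, and drop English stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in stop_words)

# Apply the cleaning function to every cell in the description column
df["description"] = df["description"].apply(text_clean)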
We can see that the processed text is all lower case, the punctuation has been removed, and words like "with", "is", and "a" have been removed.
The apply method will apply our text_clean function to all of the cells in the description column of the dataframe.
Finally, save our work to a CSV, and we are done. Congrats!
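For example (the filename is just a placeholder):

# Write the finished dataframe to disk
df.to_csv("razzies_wiki_data.csv", index=False)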
Next Steps
So you've gone and collected all of this data from Wikipedia. Now, what do you do with it? Here are a couple ideas to get you started.
Document Similarity
Our code starts by grabbing documents from the links present on our first Wikipedia page. Using the descriptions on those Wiki pages, you could try to predict whether one document is related to another, or guess whether it appears as a first-degree link. doc2vec would be a great fit for training a model to detect document similarity. More information on document similarity can be found here.
Text Generation
With Wikipedia, we can get a lot of text very quickly. Training a model to write a book in the Wikipedia style could be a fascinating problem to tackle. The transformers library is a great place to start. If you are looking for another Deepnote tutorial about text generation, I have written one here.
Text Classification
There is plenty of data to work with here, so you can take on just about any text classification project. For example, you could create categories of text, say movies, history, and science. Using the tools above, you can collect Wiki articles from those different fields and train a text classification model.
These ideas just scratch the surface. You could visualize the data and find out how recently a typical article has been edited. Applying these techniques to any site can make it easy to collect data and work with it.
Run this notebook in Deepnote
If you want to pick up this work, feel free to Duplicate this article (the Duplicate button is at the top), or just click "Launch in Deepnote" below.
If you'll be starting on your own, make sure to install chromium-driver. You can do it from the Terminal, or directly from a cell using bash magic:
%%bash
sudo apt-get update
sudo apt-get install chromium-driver -y
Or add the following lines to your Dockerfile in the Environment tab:
RUN sudo apt-get update
RUN sudo apt-get install chromium-driver -y
Conclusion
It is straightforward to start collecting data for your own personal projects. There is no better place to find information than online.
If you have any questions about running or writing code like this, please feel free to email me at josh.zwiebel<at>uwaterloo.ca.