Hello! This is an introductory guide to web scraping with Selenium. Selenium is a tool that allows us to automate an entire web browser, meaning that it can handle content on dynamically loaded websites, something that plain HTML parsers can't do.
Selenium isn't the only kind of web scraping tool out there; HTML parsers are another popular choice. Using the
requests library and BeautifulSoup, you can grab the HTML from a web page and search through that HTML for data that you want.
HTML parsing is a simpler and more efficient approach than Selenium when you have static webpages. A static website is one that has relatively simple pages where the content is embedded directly in the HTML and doesn't change over time. An example is the Video Game Music Archive.
If you're running Selenium locally in a Jupyter notebook, you'll want to download the latest version of chromedriver and put it in the same directory as the notebook. If you're running it in Deepnote, you'll need to add the following lines to the top of __init__.ipynb, which can be found by clicking on the "Environment" icon on the left toolbar of Deepnote and then looking under "Initialization". After you've added these lines, you'll need to restart the environment by clicking the "Restart" button.
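As a rough sketch (the exact package names are an assumption and may differ in your environment), the initialization lines might install a browser and a matching driver like this:

```python
# Sketch of a Deepnote __init__.ipynb cell: install Chromium plus a matching chromedriver
# so Selenium has a browser to drive (package names are assumptions)
!sudo apt-get update -qq
!sudo apt-get install -y chromium-browser chromium-chromedriver
```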
The cell above contains all the imports that we'll need for basic web scraping. If you're working locally, you can install these libraries using the pip installer, e.g.
!pip install pandas. If you're in Deepnote, create a file called
requirements.txt containing the following text:
selenium
datefinder
pandas
numpy
matplotlib
seaborn
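Once the packages are available, an import cell covering them might look something like this (the aliases are just the usual conventions; re comes with Python itself):

```python
import re                        # standard library, used later for splitting the byline text

from selenium import webdriver   # drives the browser
import datefinder                # pulls dates out of strings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```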
Initializing the driver
For Selenium to work in Deepnote, it has to be run in "headless" mode (meaning that you can't see the driver interacting with the browser). The other options above also help ensure compatibility.
The implicitly_wait function tells Selenium to wait up to 3 seconds for an element to appear before reporting that it does not exist on the page. This is very useful for content that is loaded dynamically.
Finally, we send the driver to the webpage that we want to scrape. In this case, it's the News section of The Daily Texan's website. In this tutorial, we'll be collecting a dataset of old Daily Texan articles, including the title, authors, date last updated, teaser, and link for each article.
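Putting the setup together, a minimal sketch might look like this; the extra Chrome flags beyond --headless are common compatibility options I'm assuming, not necessarily the exact ones from the original notebook:

```python
options = webdriver.ChromeOptions()
options.add_argument('--headless')              # required on Deepnote: no visible browser window
options.add_argument('--no-sandbox')            # common flags for running Chrome in a container
options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=options)
driver.implicitly_wait(3)   # wait up to 3 seconds for elements to appear before giving up

# send the driver to the page we want to scrape
driver.get('https://thedailytexan.com/section/news')
```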
Scraping the first article
We'll start out by collecting the data we want for the first article on the news page, and then we'll show how to extend that to scrape all the articles on the page.
One of the easiest and most reliable ways to access and collect elements on a web page is through XPath, a query language for locating nodes in an HTML document. There are other ways to find elements, such as:
- Class name: HTML divs are often identified with a class so that they'll adopt certain characteristics. You may want to search for instances of a particular class name on a website.
- CSS selector: string patterns built from HTML tags, attributes, IDs, and classes that can be used to locate/distinguish elements.
XPath is what will be used in this tutorial.
We can easily use inspect element to find the XPath of elements on the page; just follow these steps:
- Right click on the page and then click "Inspect", or just use the shortcut CTRL+SHIFT+I on Windows (CMD+OPTION+I on Mac).
- At the top left of the console that appears on the right of the browser, click the icon that shows a cursor contained within a box. Then, click on the web page element you want to collect.
- In the inspect element console, right click on the highlighted element, hover over "Copy", then click "Copy XPath".
After following the instructions, you should get the XPath shown above for the article's title. Note that sometimes a small change in where you click can result in a slightly different HTML element; you may need to copy the XPath of the div right above (or below) what was highlighted. The XPath may need to be enclosed in single or double quotes, depending on whether it already contains either of those.
Title / link
To actually scrape the element on the webpage, you use the
find_element_by_xpath function. You'll notice that upon scraping, you initially get a WebElement object, which isn't very useful. To extract the information we want, we'll have to use a few functions.
.text gets you the text associated with an element, and
.get_attribute() finds a particular attribute type for a given element. In this case, we're looking for the
'href' attribute of the title, which is just the link.
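As a sketch, with a hypothetical XPath standing in for the one you copied:

```python
# hypothetical XPath for the first article's title link
title_xpath = '//*[@id="main"]/div/ul/li[1]/h2/a'

title_element = driver.find_element_by_xpath(title_xpath)  # this is a WebElement object
title = title_element.text                   # the visible title text
link = title_element.get_attribute('href')   # the 'href' attribute, i.e. the article's link
```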
Author / date
Next, we'll collect the author information. This is one of those situations where a small change in click position can refer you to a different div; play around with where you click and which div you copy the XPath of until you get the one shown above. We use the XPath shown above because it includes both the author names and the date the story was last updated.
We have all the author names and the date, but they're in a single string. We'll have to process the text to get a list of author names as well as the story date in datetime format.
To extract the author names, notice that they start after the word "BY" and end at the hyphen ("-") that precedes the date.
First, we used
re.split to split the long string into a list of two smaller strings; one contains the author names, and the other contains the date.
We then split the string with the author names using Python's built-in
split function, and used
.strip() to get rid of excess spaces at the beginning/end of each name.
Finally, we used the datefinder library to convert the date string into a datetime object.
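A sketch of those steps on a made-up byline string (the exact separators, like splitting names on "and", are assumptions):

```python
import re
import datefinder

# hypothetical byline text scraped from the author/date div
byline = 'BY Jane Doe and John Smith - October 12, 2020'

# re.split breaks the string into the names part and the date part at the hyphen
names_part, date_part = re.split(' - ', byline)

# drop the leading "BY", split the names apart, and strip stray whitespace
authors = [name.strip() for name in names_part.replace('BY', '', 1).split('and')]

# datefinder scans the string and yields datetime objects; take the first match
date = next(datefinder.find_dates(date_part), None)

print(authors)  # ['Jane Doe', 'John Smith']
print(date)     # 2020-10-12 00:00:00
```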
Following the same process as before, the XPath for the teaser should be what's shown above.
We collect the teaser using the same find_element_by_xpath call, followed by .text.
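For example, with another hypothetical XPath:

```python
# hypothetical XPath for the first article's teaser
teaser = driver.find_element_by_xpath('//*[@id="main"]/div/ul/li[1]/div/p').text
```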
Scraping multiple articles
Now that we've seen how to get all the elements we want for one article, let's expand it to multiple articles. The process is very similar, but we use
.find_elements_by_xpath instead of
.find_element_by_xpath. We also need to change the XPath slightly. If you copy the XPath of the second article title, you'll notice that there's a slight difference between it and the first article:
First Article XPath:
Second Article XPath:
The first article has li[1], while the second article has li[2]. This is often how rows are represented in a web page; to scrape elements from all rows, just remove the brackets entirely:
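With hypothetical XPaths, the change looks like this:

```python
# li[1] matches only the first row; dropping the index makes the XPath match every row
first_title = driver.find_element_by_xpath('//*[@id="main"]/div/ul/li[1]/h2/a')
title_elements = driver.find_elements_by_xpath('//*[@id="main"]/div/ul/li/h2/a')
```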
Title / link
The code above uses list comprehension, which is just an easy way to execute a for loop in a single line and put the results in a list.
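For reference, that comprehension might look something like this, reusing the element list from the sketch above:

```python
# one comprehension per field: loop over the matched elements and collect each value
titles = [el.text for el in title_elements]
links = [el.get_attribute('href') for el in title_elements]
```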
Author / date
The code above uses slightly more complex list comprehension to make it more concise; a more readable (but less concise) version of the code is in the cell below the original code.
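A readable, loop-based version might look roughly like this (the XPath and the separators are assumptions, as before):

```python
# hypothetical XPath for every article's author/date div
byline_elements = driver.find_elements_by_xpath('//*[@id="main"]/div/ul/li/div/span')

authors = []
dates = []
for el in byline_elements:
    names_part, date_part = re.split(' - ', el.text)
    authors.append([name.strip() for name in names_part.replace('BY', '', 1).split('and')])
    dates.append(next(datefinder.find_dates(date_part), None))
```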
Putting it all together
Now that we know how to collect the data we want on all the articles for a page, let's create a function that collects everything and puts it into a pandas dataframe, which is basically just a table.
This function assumes an instance of
driver has already been instantiated, but the driver hasn't navigated to a web page yet. Let's test it on the first page of the news section:
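Here's a rough sketch of what such a function might look like, reusing the hypothetical XPaths and parsing steps from earlier; the column names (and the assumption that every row has a title, byline, and teaser) are mine:

```python
import re
import datefinder
import pandas as pd

def scrape_page(driver, url):
    """Scrape one archive page into a DataFrame (sketch; XPaths and column names are assumptions)."""
    driver.get(url)

    # index-free XPaths so each call matches every article row on the page
    title_elements = driver.find_elements_by_xpath('//*[@id="main"]/div/ul/li/h2/a')
    byline_elements = driver.find_elements_by_xpath('//*[@id="main"]/div/ul/li/div/span')
    teaser_elements = driver.find_elements_by_xpath('//*[@id="main"]/div/ul/li/div/p')

    rows = []
    for title_el, byline_el, teaser_el in zip(title_elements, byline_elements, teaser_elements):
        names_part, date_part = re.split(' - ', byline_el.text)
        rows.append({
            'Title': title_el.text,
            'Link': title_el.get_attribute('href'),
            'Authors': [name.strip() for name in names_part.replace('BY', '', 1).split('and')],
            'Last Updated': next(datefinder.find_dates(date_part), None),
            'Teaser': teaser_el.text,
        })
    return pd.DataFrame(rows)

# test it on the first page of the news section
news_df = scrape_page(driver, 'https://thedailytexan.com/section/news')
news_df.head()
```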
Scraping multiple pages
What we just did was get everything on the first page, but the function we've created makes it super easy to scrape any page of The Daily Texan's news archives. If we go to the second page of the archives, we can see that it has the following link:
https://thedailytexan.com/section/news?page=1, the third page is
https://thedailytexan.com/section/news?page=2, and so on.
I could have used this pattern to programmatically scrape all 1,392 pages of Daily Texan news articles available online, but for the sake of time I just scraped the first 10.
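A sketch of that loop, using the scrape_page sketch from above and the ?page= pattern:

```python
# page 2 of the archive is ?page=1, page 3 is ?page=2, and so on,
# so pages 2 through 10 correspond to ?page=1 through ?page=9
frames = [scrape_page(driver, 'https://thedailytexan.com/section/news')]
for page_num in range(1, 10):
    frames.append(scrape_page(driver, f'https://thedailytexan.com/section/news?page={page_num}'))

news_df = pd.concat(frames, ignore_index=True)
```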
After the data has been collected, you should save it to a file so that you can load it later. Typically, a
.csv file works great for saving data; however, because one of our columns (Authors) contains a list, saving it as a .csv file and then loading that file will result in the list being treated as a string. Thus, I'll save it as a binary file instead, using the
.to_pickle function.
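For example (the filename is arbitrary):

```python
# save as a pickle so the Authors lists survive the round trip intact
news_df.to_pickle('daily_texan_news.pkl')

# later, load it back exactly as it was saved
news_df = pd.read_pickle('daily_texan_news.pkl')
```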
Now that you've collected your data, what next? There's a lot you can do with it that's beyond the scope of this tutorial, but one common way of exploring data is through visualizations.
Let's say I'm curious about how many authors each article has, how long the article titles are, or whether the article titles include "UT" in them. I can answer all of these questions using visualizations.
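As a sketch (assuming the column names used earlier), those questions translate into a couple of quick seaborn plots:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# how many authors does each article have?
news_df['Num Authors'] = news_df['Authors'].apply(len)
sns.countplot(x='Num Authors', data=news_df)
plt.title('Authors per article')
plt.show()

# how long are the article titles?
news_df['Title Length'] = news_df['Title'].str.len()
sns.histplot(news_df['Title Length'])
plt.title('Title length (characters)')
plt.show()

# what fraction of titles mention "UT"?
news_df['Title'].str.contains('UT').mean()
```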
Web scraping is an incredibly powerful tool for collecting datasets that aren't readily available, and Selenium is a great tool for getting the job done on dynamically loaded web pages.