Web Scraping Images with Python and Selenium
How to scrape and capture images on websites.
Photo by Toa Heftiba on Unsplash
If you wanted to train a model based on different types of images, you would need to obtain a lot of images. You could get the images manually but that would require a lot of time and effort. A more efficient way to get images is to scrape the images from web pages. Web scraping provides an easy way to get a large amount of data in a relatively short amount of time.
In this article, we will create a script that will scrape a Google webpage, extract some images, and create datasets. The following steps will be performed using Python and Selenium.
1. Install Selenium Package.
2. Import the Libraries.
3. Install the Web Driver.
4. Launch Browser and Open the URL.
5. Load the Images.
6. Review the Web Page’s HTML Structure.
7. Find and Extract Images.
8. Download Images.
Install Selenium Package
Selenium is a Python library that drives a real web browser, which lets it scrape data from websites that load their content dynamically. It is also widely used for web automation and testing, so scraping data from the web is only a small portion of what the Selenium library can do. You can install it with pip install selenium.
Import the Libraries
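A minimal set of imports for the steps that follow might look like the sketch below. It assumes the script needs only time (to pause while images load), urllib.request (to download them), and Selenium's webdriver and By. The SELENIUM_AVAILABLE flag is a hypothetical helper added here for illustration; it is not part of Selenium.

```python
import importlib.util
import time             # pause while the images render
import urllib.request   # download each image from its URL

# Selenium is a third-party package (pip install selenium).
# SELENIUM_AVAILABLE is a hypothetical helper flag, not a Selenium feature.
SELENIUM_AVAILABLE = importlib.util.find_spec("selenium") is not None

if SELENIUM_AVAILABLE:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
else:
    print("Selenium is not installed; run: pip install selenium")
```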
Install the Web Driver
The web driver is a key component of Selenium. It is a browser automation framework with open source APIs: it accepts commands from your script, sends those commands to a browser, and interacts with the web application on your behalf.
Selenium supports multiple web browsers and offers a web driver for each one. I have imported the Chrome web driver from Selenium. Alternatively, you can download the web driver for your specific browser and store it in a location where it can be easily accessed (for example, C:\users\webdriver\chromedriver.exe). You can download a web driver for your browser from this link: https://selenium-python.readthedocs.io/installation.html
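As a rough sketch, starting Chrome with Selenium 4 could look like this. The function name launch_chrome and its driver_path parameter are my own, chosen for illustration; the Selenium calls inside it (webdriver.Chrome and Service) are the library's own.

```python
def launch_chrome(driver_path=None):
    """Start a Chrome session; driver_path optionally points at chromedriver."""
    # Imported inside the function so it can be defined before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path:
        # Use a manually downloaded driver, e.g. C:\users\webdriver\chromedriver.exe
        return webdriver.Chrome(service=Service(executable_path=driver_path))

    # With no path given, Selenium 4.6+ tries to locate or download a matching driver.
    return webdriver.Chrome()
```

Calling launch_chrome() with no argument relies on Selenium finding a driver by itself, while passing a path keeps you in control of which driver version is used.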
Launch Browser and Open the URL
Create a url variable containing the web page for a Google image search.
Launch the browser and open the given url in your webdriver. The url contains a placeholder, “s”, which we replace with a search word. In this case, we have set the value of s to “Pets”.
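One way to sketch this step is shown below. The names SEARCH_URL, build_search_url, and open_search are illustrative assumptions, and the URL is a shortened form of a Google image search that keeps only the q and tbm=isch parameters; Google may change these parameters over time.

```python
from urllib.parse import quote_plus

# {s} is the placeholder for the search word; tbm=isch requests image results.
SEARCH_URL = "https://www.google.com/search?q={s}&tbm=isch"


def build_search_url(search_word):
    """Fill the search word into the url, escaping spaces and symbols."""
    return SEARCH_URL.format(s=quote_plus(search_word))


def open_search(driver, search_word):
    """Open the image search for search_word in an already running browser."""
    driver.get(build_search_url(search_word))
```

With a running driver, open_search(driver, "Pets") would load the same kind of page used in the rest of this article.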
Load the Images
The execute_script() function scrolls down the body of the web page so that the images load. It ensures that each time we load a page, the browser goes to the end of the web page. We then wait 5 seconds to give the images enough time to render.
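The scroll and wait described above can be sketched as follows. The helper name scroll_to_bottom and its pause_seconds parameter are assumptions made for this example; execute_script() is the Selenium method and the JavaScript is a standard scroll to the bottom of the page.

```python
import time


def scroll_to_bottom(driver, pause_seconds=5):
    """Scroll to the end of the page, then wait for the images to render."""
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause_seconds)
```

If the page keeps loading more results as you scroll, calling this helper several times in a row is a simple way to collect more images.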
Review the Web Page’s HTML Structure
We need to understand the structure and contents of the HTML tags and find an attribute that is unique only to images. For this project, we will be using the search results for pets webpage in Google (shown below). You can find this webpage by selecting this link: https://www.google.com/search?q=pets&tbm=isch&ved=2ahUKEwiWjr_P3Ir5AhWKHzQIHSKyC1IQ2-cCegQIABAA&oq=pets&gs_lcp=CgNpbWcQAzIECCMQJzIICAAQgAQQsQMyBwgAELEDEEMyBwgAELEDEEMyBwgAELEDEEMyBQgAEIAEMggIABCABBCxAzIICAAQgAQQsQMyBAgAEEMyBQgAEIAEOgsIABCABBCxAxCDAToHCCMQ6gIQJ1CHDViyQGCCRmgDcAB4AIABdogBgwiSAQQxMC4ymAEAoAEBqgELZ3dzLXdpei1pbWewAQrAAQE&sclient=img&ei=-qnZYpapM4q_0PEPouSukAU&bih=568&biw=1251&hl=en
We will now go to the url and find the attributes that are related to images. Open the web page, right-click anywhere on it, and select Inspect from the drop-down list. In the panel that shows the HTML, click the arrow icon in the upper left-hand corner, then click an image on the web page. This will result in the following screen being displayed.
On the HTML screen, you will see highlighted the HTML line containing the attributes related to the selected image. We find that the class = “rg_i Q4LuWd”.
If you move the cursor over other images on the web page you will see a pattern where all of the classes contain the values “Q4LuWd”.
So, we want to find and extract the objects where class contains the values “Q4LuWd”.
Find and Extract Images
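We will use the find_elements_by_xpath() function to identify the images. Note that this helper was removed in Selenium 4.3; in current versions the equivalent call is find_elements(By.XPATH, ...).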
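A minimal sketch of this lookup, written for Selenium 4, is shown below. The helper names image_xpath and find_images are illustrative assumptions, and the class value “Q4LuWd” is the one observed on the page above; Google can change it at any time, so check it in the inspector before relying on it.

```python
def image_xpath(class_fragment="Q4LuWd"):
    """Build an XPath that matches <img> tags whose class contains the fragment."""
    return f"//img[contains(@class, '{class_fragment}')]"


def find_images(driver, class_fragment="Q4LuWd"):
    """Return the image elements found on the page currently open in driver."""
    # Imported inside the function so it can be defined before Selenium is installed.
    from selenium.webdriver.common.by import By

    return driver.find_elements(By.XPATH, image_xpath(class_fragment))
```

With a running driver, imgResults = find_images(driver) gives the list of image elements used in the next step.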
All of the images that contain “Q4LuWd” in the class name are now stored in imgResults, which is a list of Selenium objects. If you display imgResults you will only see a description of each Selenium object.
Now we need to download the images. To retrieve an image we need to access the “src” attribute. The value of the src attribute is a URL that will open the image on a new page where we will use python functions to download the image.
We will use the image_object.get_attribute('src') function to access the src attribute. The get_attribute function returns the value of the attribute whose name is passed as the argument.
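Collecting the src values can be sketched like this. The function name collect_src is an assumption for illustration; get_attribute() is the Selenium method, and it returns None when an element has no such attribute, which is why empty values are skipped here.

```python
def collect_src(img_results):
    """Return the src value of each image element, skipping empty ones."""
    src = []
    for img in img_results:
        value = img.get_attribute("src")
        if value:  # the attribute may be missing if an image has not loaded yet
            src.append(value)
    return src
```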
The src list contains the image URLs. We will now go through the list and use a Python function to download the images.
Download Images
The loop will run 10 times and download 10 images to your file folder. You can specify a higher number if you need more images.
The urllib.request.urlretrieve() function is called with two arguments. The first is a URL, and the second is the file path where you want to store the downloaded image. Each image will be stored in a separate file.
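The download loop can be sketched as follows. The function name download_images, the folder argument, and the “pets” file name prefix are assumptions made for this example; urllib.request.urlretrieve() is the standard library call described above. The .jpg extension is also an assumption, since the actual image format depends on what the URL returns.

```python
import os
import urllib.request


def download_images(src, folder, count=10, prefix="pets"):
    """Download up to `count` images from the src list into folder."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for i, url in enumerate(src[:count]):
        if not url:
            continue  # skip entries that have no URL
        path = os.path.join(folder, f"{prefix}{i}.jpg")
        urllib.request.urlretrieve(url, path)  # one file per image
        saved.append(path)
    return saved
```

Raising count downloads more images, as long as the src list holds that many URLs.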
The downloaded images are stored in your specified file folder.
We have combined the image files into a single PDF file shown below.
This article shows how you can easily scrape and capture images on a website.
Thanks so much for reading my article! If you have any comments or feedback, please let me know.