Web Scraping TripAdvisor Hotels with Python and Beautiful Soup
How to scrape a website and create a dataset.
TripAdvisor, the world’s largest travel site, is a popular website for finding hotels, restaurants, transportation, and places to visit. Whenever someone plans a trip to a city or country, they are likely to go to TripAdvisor to find the best places to stay and visit. TripAdvisor has over 702 million reviews of the world’s leading hotels, lists over 8 million locations (hotels, restaurants, tourist attractions), and ranks number one in the Travel and Tourism category in the United States.
In this article, I will present a script that will scrape hotel information from a TripAdvisor webpage, extract some data elements and create a dataset. The following steps will be performed using Python and BeautifulSoup.
1. Import the Libraries.
2. Review the Web Page’s HTML Structure.
3. Retrieve and Convert the HTML Data.
4. Find and Extract the Data Elements.
5. Create the Data Frame.
6. Convert Data Frame to a CSV File.
Import the Libraries
* requests lets you send HTTP requests to a server, which returns a Response object containing the response data (for example, the page's HTML).
* beautifulsoup (bs4) parses HTML into a BeautifulSoup object, which represents the document as a nested data structure that you can search and pull data from.
* pandas is used for data analysis and manipulation.
* csv is a standard-library module that implements classes to read and write tabular data in CSV format.
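A minimal import block for the four libraries listed above might look like the following. It assumes that requests, beautifulsoup4 and pandas have already been installed (for example with pip); csv ships with Python.

```python
import csv  # standard library: read and write CSV files

import pandas as pd  # data frames for analysis and manipulation
import requests  # send HTTP requests and receive Response objects
from bs4 import BeautifulSoup  # parse HTML into a searchable tree
```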
Review the Web Page’s HTML Structure
We need to understand the structure and contents of the HTML tags within the web pages. For this project, we will be using the TripAdvisor Hawaii Hotels and Places to Stay webpage (shown below). You can find this webpage by selecting this link - https://www.tripadvisor.in/Hotels-g28932-Hawaii-Hotels.html
We can scrape this webpage by parsing the HTML of the page and extracting the information needed for our dataset. To find the HTML behind a piece of data, right-click anywhere on the web page and select Inspect from the drop-down list. In the panel that shows the HTML, click the arrow icon in the upper left-hand corner, and then click the hotel name (Prince Waikiki) in the review section of the web page. This will result in the following screen being displayed.
On the HTML screen, the line containing the hotel name, Prince Waikiki, is highlighted.
If you move up one line from this tag, you will find the div tag with a class of “listing_title”. This is the parent of the <a> tag.
So, if you wanted to find, extract and capture all the hotel names on the web page, you would do the following steps:
1. Find all the HTML lines for a specific parent (div tag with class = listing_title), which would include their associated children.
2. Extract the data elements and build a list containing each of the hotel names.
The code for finding and extracting hotel names would look like the following:
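As a sketch of those two steps, the snippet below runs against a small hand-written HTML fragment instead of the live page, so it works offline. The fragment and the second hotel name are illustrative assumptions; only the listing_title class and the Prince Waikiki name come from the page described above, and the live markup may have changed since this was written.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page's HTML (an assumption for illustration).
sample_html = """
<div class="listing_title"><a href="#">Prince Waikiki</a></div>
<div class="listing_title"><a href="#">Example Beach Resort</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Step 1: find every parent div with class "listing_title".
# Step 2: take the text of each div (the hotel name inside its <a> child)
#         and collect the names in a list.
hotel_names = []
for title in soup.find_all("div", class_="listing_title"):
    hotel_names.append(title.get_text(strip=True))

print(hotel_names)
```

With the real page, `soup` would be built from the downloaded HTML rather than from `sample_html`.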
We will find, extract and store the other data elements on the web page following a similar procedure as described above.
Retrieve and Convert the HTML
Create an object (URL) containing the website address and send a get request for the specific URL’s HTML to the web server. Then retrieve the HTML data that the web server sends back and convert the data into a BeautifulSoup object.
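One way this step could be written is sketched below. The helper names `fetch_html` and `make_soup`, the User-Agent header and the timeout value are assumptions added for illustration and are not part of the original script. The site may also block or throttle automated requests, so the live call is left as a commented usage line.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.tripadvisor.in/Hotels-g28932-Hawaii-Hotels.html"

# Assumption: some sites reject requests that lack a browser-like User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, timeout=10):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def make_soup(html):
    """Convert raw HTML text into a BeautifulSoup object."""
    return BeautifulSoup(html, "html.parser")


# Usage (requires network access):
# soup = make_soup(fetch_html(URL))
```

The built-in html.parser is used here so that no extra parser package is needed; lxml or html5lib could be swapped in if they are installed.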
The HTML content of the webpage is parsed and scraped using Beautiful Soup. Beautiful Soup is a great tool for parsing and scraping websites because of the numerous functions it provides to extract data from HTML. To learn more about BeautifulSoup go to the following link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
For this project, we will be using the TripAdvisor Hawaii Hotels and Places to Stay webpage. You can find this webpage by going to the following link: https://www.tripadvisor.in/Hotels-g28932-Hawaii-Hotels.html
After reviewing the TripAdvisor Hawaii Hotels and Places to Stay webpage, I have decided to extract the following data elements from the hotels:
* Hotel Names
* Ratings
* Number of Reviews
* Prices
Find and Extract the Data Elements
For each of the data elements we want to extract, we will find all the HTML lines that are within a specific tag and class. We will then extract the data elements and store the data in a list.
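The same find-and-extract pattern can be sketched for all four data elements. The example below uses a hand-written HTML fragment with made-up values. Apart from listing_title, every tag and class name in it (rating, review_count, price) is a hypothetical placeholder; the real names have to be read from the page with the browser's Inspect tool.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the listing markup. Apart from "listing_title",
# the class names and all values below are hypothetical placeholders.
sample_html = """
<div class="listing">
  <div class="listing_title"><a href="#">Prince Waikiki</a></div>
  <span class="rating">4.5</span>
  <span class="review_count">2,000 reviews</span>
  <div class="price">$300</div>
</div>
<div class="listing">
  <div class="listing_title"><a href="#">Example Beach Resort</a></div>
  <span class="rating">4.0</span>
  <span class="review_count">850 reviews</span>
  <div class="price">$210</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def extract_texts(soup, tag, class_name):
    """Return the stripped text of every <tag> element with the given class."""
    return [el.get_text(strip=True) for el in soup.find_all(tag, class_=class_name)]


# One list per data element, in page order.
hotel_names = extract_texts(soup, "div", "listing_title")
ratings = extract_texts(soup, "span", "rating")
reviews = extract_texts(soup, "span", "review_count")
prices = extract_texts(soup, "div", "price")

print(hotel_names, ratings, reviews, prices, sep="\n")
```

Building one list per element keeps the code simple, but the lists only line up if every hotel has every element. If some listings lack a price or rating, looping over each listing container and reading its fields one hotel at a time is a safer variation.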
Create the Data Frame
We will create a dictionary that will contain the data names and values for all the data elements that were extracted.
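A sketch of such a dictionary is shown below. The list contents are made-up sample values standing in for the extracted data, and the key names are assumptions chosen to read well as column headings.

```python
# Lists as they might look after extraction (values are made up).
hotel_names = ["Prince Waikiki", "Example Beach Resort"]
ratings = ["4.5", "4.0"]
reviews = ["2,000 reviews", "850 reviews"]
prices = ["$300", "$210"]

# Every list should have the same length, one entry per hotel.
lengths = {len(hotel_names), len(ratings), len(reviews), len(prices)}
if len(lengths) != 1:
    raise ValueError("Extracted lists have different lengths")

# Keys become the data names; values are the extracted lists.
hotel_data = {
    "Hotel Name": hotel_names,
    "Rating": ratings,
    "Reviews": reviews,
    "Price": prices,
}
```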
Create and display the data frame.
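With pandas, this step might look like the following sketch. The dictionary is repeated here with made-up sample values so that the snippet runs on its own; the variable name `hotels_df` is an assumption.

```python
import pandas as pd

# Sample dictionary with made-up values, one list per column.
hotel_data = {
    "Hotel Name": ["Prince Waikiki", "Example Beach Resort"],
    "Rating": ["4.5", "4.0"],
    "Reviews": ["2,000 reviews", "850 reviews"],
    "Price": ["$300", "$210"],
}

# Each dictionary key becomes a column; each list position becomes a row.
hotels_df = pd.DataFrame(hotel_data)

# Show the first rows to check the result.
print(hotels_df.head())
```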
Convert the Data Frame to a CSV file
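A minimal sketch of this last step uses the data frame's to_csv method. The file name hotels.csv and the sample values are assumptions for illustration. The file is then read back with the csv module to confirm what was written.

```python
import csv

import pandas as pd

# Sample data frame with made-up values.
hotels_df = pd.DataFrame({
    "Hotel Name": ["Prince Waikiki", "Example Beach Resort"],
    "Rating": ["4.5", "4.0"],
    "Reviews": ["2,000 reviews", "850 reviews"],
    "Price": ["$300", "$210"],
})

# index=False leaves the row numbers out of the file.
hotels_df.to_csv("hotels.csv", index=False, encoding="utf-8")

# Read the file back to check the header and the number of rows.
with open("hotels.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows[0])        # header row
print(len(rows) - 1)  # number of hotels written
```

Values that contain commas, such as "2,000 reviews", are quoted automatically in the file, so they are read back as a single field.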
Thanks so much for reading my article! If you have any comments or feedback, please send them to me at dniggl@cox.net.