Web Scraping TripAdvisor Hotels with Python and Beautiful Soup
How to scrape a website and create a dataset.
TripAdvisor, the world’s largest travel site, is a popular website for finding hotels, restaurants, transportation, and places to visit. Whenever someone plans a trip to a city or country, they are likely to go to TripAdvisor to find the best places to stay and visit. TripAdvisor has over 702 million reviews of the world’s leading hotels, lists over 8 million locations (hotels, restaurants, tourist attractions), and ranks number one in the Travel and Tourism category in the United States.
In this article, I will present a script that will scrape hotel information from a TripAdvisor webpage, extract some data elements and create a dataset. The following steps will be performed using Python and BeautifulSoup.
1. Import the Libraries.
2. Review the Web Page’s HTML Structure.
3. Retrieve and Convert the HTML Data.
4. Find and Extract the Data Elements.
5. Create the Data Frame.
6. Convert Data Frame to a CSV File.
Import the Libraries
* requests lets you send HTTP requests to a server and returns a Response object containing the response data (for example, the page's HTML).
* beautifulsoup (bs4) parses HTML into a BeautifulSoup object, which represents the document as a nested data structure that you can search to pull data out.
* pandas is used for data analysis and manipulation.
* The csv module implements classes to read and write tabular data in CSV format.
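The libraries listed above can be imported as sketched below. This assumes the third-party packages are already installed under their usual PyPI names (requests, beautifulsoup4, pandas); csv ships with Python.

```python
# Standard library
import csv

# Third-party packages (assumed to be installed with pip)
import pandas as pd
import requests
from bs4 import BeautifulSoup
```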
Review the Web Page’s HTML Structure
We need to understand the structure and contents of the HTML tags within the web page. For this project, we will use the TripAdvisor Hawaii Hotels and Places to Stay webpage: https://www.tripadvisor.in/Hotels-g28932-Hawaii-Hotels.html
We can scrape this webpage by parsing its HTML and extracting the information needed for our dataset. To locate a data element, right-click anywhere on the web page and select Inspect from the drop-down list. In the panel that shows the HTML, click the arrow icon in the upper left-hand corner, then click the hotel name (Prince Waikiki) in the review section of the webpage.
The HTML panel will then highlight the line containing the hotel name, Prince Waikiki.
If you move up one line from this tag, you will find a div tag with a class of “listing_title”. This is the parent of the <a> tag.
So, to find, extract, and capture all the hotel names on the web page, you would do the following:
1. Find all the HTML elements for a specific parent (div tag with class = listing_title), which includes their associated children.
2. Extract the data elements and build a list containing each of the hotel names.
The code for finding and extracting hotel names would look like the following:
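A minimal sketch of those two steps, run against a small hand-written HTML snippet instead of the live page. The snippet, the second hotel name, and the variable names are assumptions for illustration; only the div tag and the listing_title class come from the page inspection described above.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page's HTML; the second hotel name is made up.
sample_html = """
<div class="listing_title"><a href="#">Prince Waikiki</a></div>
<div class="listing_title"><a href="#">Example Beach Resort</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Step 1: find every parent div with class "listing_title".
title_divs = soup.find_all("div", class_="listing_title")

# Step 2: pull the text out of each div (it comes from the child <a> tag)
# and collect it in a list.
hotel_names = [div.get_text(strip=True) for div in title_divs]

print(hotel_names)
```

On the real page, the same two calls would be made on the BeautifulSoup object built from the downloaded HTML.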
We will find, extract and store the other data elements on the web page following a similar procedure as described above.
Retrieve and Convert the HTML
Create a variable (URL) containing the website address and send a GET request for that URL to the web server. Then take the HTML that the web server sends back and convert it into a BeautifulSoup object.
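One way this step might look is sketched below. The function names, the User-Agent header, and the timeout are assumptions added for illustration, and the live site may respond differently to automated requests than it does to a browser.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.tripadvisor.in/Hotels-g28932-Hawaii-Hotels.html"

# Assumption: a browser-like User-Agent; some sites reject the default one.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_html(url, headers=HEADERS, timeout=30):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def to_soup(html):
    """Convert raw HTML text into a BeautifulSoup object."""
    return BeautifulSoup(html, "html.parser")


# Typical use (requires network access):
# soup = to_soup(fetch_html(URL))
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing code on a saved copy of the page without sending a new request each time.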
The HTML content of the webpage is parsed with Beautiful Soup, which is well suited to scraping because of the many functions it provides for extracting data from HTML. To learn more about Beautiful Soup, see the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
After reviewing the TripAdvisor Hawaii Hotels and Places to Stay webpage, I have decided to extract the following data elements from the hotels:
* Hotel Names
* Ratings
* Number of Reviews
* Prices
Find and Extract the Data Elements
For each of the data elements we want to extract, we will find all the HTML lines that are within a specific tag and class. We will then extract the data elements and store the data in a list.
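The sketch below applies that find-then-extract pattern to all four data elements, using a hand-written HTML snippet. Only the listing_title class comes from the page inspection earlier in the article; the other tag and class names (rating, review_count, price), the helper function, and every value are hypothetical and would need to be replaced with whatever the browser's Inspect tool shows for the live page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page. Apart from "listing_title", the class
# names and all values here are hypothetical.
sample_html = """
<div class="listing">
  <div class="listing_title"><a href="#">Prince Waikiki</a></div>
  <a class="rating" alt="4.5 of 5 bubbles"></a>
  <a class="review_count">2,345 reviews</a>
  <div class="price">$350</div>
</div>
<div class="listing">
  <div class="listing_title"><a href="#">Example Beach Resort</a></div>
  <a class="rating" alt="4.0 of 5 bubbles"></a>
  <a class="review_count">987 reviews</a>
  <div class="price">$275</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def extract_text(soup, tag, class_name):
    """Return the stripped text of every <tag> element with the given class."""
    return [el.get_text(strip=True) for el in soup.find_all(tag, class_=class_name)]


hotel_names = extract_text(soup, "div", "listing_title")
reviews = extract_text(soup, "a", "review_count")
prices = extract_text(soup, "div", "price")

# In this made-up markup the rating lives in an attribute, not in the text.
ratings = [el.get("alt") for el in soup.find_all("a", class_="rating")]

print(hotel_names, ratings, reviews, prices)
```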
Create the Data Frame
We will create a dictionary that will contain the data names and values for all the data elements that were extracted.
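A small sketch of such a dictionary follows. The list contents are illustrative values, not scraped results, and the key names simply mirror the four data elements listed earlier.

```python
# Illustrative values only; in the script these lists come from the
# extraction step.
hotel_names = ["Prince Waikiki", "Example Beach Resort"]
ratings = ["4.5 of 5 bubbles", "4.0 of 5 bubbles"]
reviews = ["2,345 reviews", "987 reviews"]
prices = ["$350", "$275"]

# Every list must have the same length, or the rows will not line up later.
lengths = {len(hotel_names), len(ratings), len(reviews), len(prices)}
if len(lengths) != 1:
    raise ValueError("extracted lists have different lengths")

# Keys become column names; values become the column contents.
hotel_data = {
    "Hotel Names": hotel_names,
    "Ratings": ratings,
    "Number of Reviews": reviews,
    "Prices": prices,
}
```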
Create and display the data frame.
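As a sketch, assuming a dictionary of equal-length lists like the one described above (the values here are illustrative, not scraped):

```python
import pandas as pd

# Illustrative dictionary; each key is a column name, each list a column.
hotel_data = {
    "Hotel Names": ["Prince Waikiki", "Example Beach Resort"],
    "Ratings": ["4.5 of 5 bubbles", "4.0 of 5 bubbles"],
    "Number of Reviews": ["2,345 reviews", "987 reviews"],
    "Prices": ["$350", "$275"],
}

# Build the data frame and show the first few rows.
df = pd.DataFrame(hotel_data)
print(df.head())
```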
Convert the Data Frame to a CSV file
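A minimal sketch of this step, using pandas' DataFrame.to_csv. The file name hotels.csv and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Illustrative data frame standing in for the one built from the scraped data.
df = pd.DataFrame(
    {
        "Hotel Names": ["Prince Waikiki", "Example Beach Resort"],
        "Prices": ["$350", "$275"],
    }
)

# index=False leaves the row numbers out of the file.
df.to_csv("hotels.csv", index=False)

# Read the file back to confirm what was written.
check = pd.read_csv("hotels.csv")
print(check)
```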
Thanks so much for reading my article! If you have any comments or feedback, please send them to me at email@example.com.