Scrape HTML Tables Without Leaving Pandas

Webscraping is often a pain. Researching, finding, and installing the libraries you need can be time consuming. Finding the content you need in the HTML can take time. Getting everything to work can be finicky. 🙁 In this article, I'll show you how to use the Python pandas library to scrape HTML tables with a single line of code! It doesn't work in all cases, but when you have HTML tables on a website it can make your life much easier. 😀 You'll see how to use it to get data from websites about soccer and weightlifting. ⚽️ 🏋️

Scrape HTML Tables Without Leaving Pandas - preview image
source: pixabay.com
## How to quickly get the data you need from websites
We'll use [`pd.read_html()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) to scrape tabular data. In my experience, a lot of folks don't know about `pd.read_html()` even though it's [been around](https://pandas.pydata.org/docs/whatsnew/v0.12.0.html?highlight=read_html) for over 7 years.
## Setup
To get the latest versions of necessary packages and their dependencies, uncomment and run the following code one time. Then restart your notebook kernel.
```python
# !pip install pandas lxml beautifulsoup4 html5lib matplotlib -U
```
Generally pandas will try to use lxml to parse HTML because it is fast. If that fails, then it will use BeautifulSoup4 with html5lib. You can read more about the parsers in the pandas [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#html-table-parsing-gotchas).
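If you want to control which parser is used rather than rely on the fallback order, you can pass the `flavor` argument. Here's a minimal sketch using a small inline HTML table (wrapped in `StringIO`, which newer pandas versions expect for literal HTML) so it runs without a network connection:

```python
from io import StringIO

import pandas as pd

# A tiny HTML table so this runs without a network connection.
html = """
<table>
  <tr><th>player</th><th>goals</th></tr>
  <tr><td>Press</td><td>3</td></tr>
  <tr><td>Horan</td><td>2</td></tr>
</table>
"""

# flavor selects the parser explicitly: 'lxml' is the fast default,
# 'bs4' uses BeautifulSoup4 with html5lib as the fallback.
dfs = pd.read_html(StringIO(html), flavor="lxml")
print(dfs[0])
```

If lxml fails on a messy page, trying `flavor="bs4"` is often worth it, since html5lib is more forgiving of malformed HTML.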
Let's import the packages and check versions. I always like to check the versions of Python and key libraries to help diagnose problems that might arise. 😉
```python
import sys
import pandas as pd

print(f"Python version {sys.version}")
print(f"pandas version: {pd.__version__}")
```
If your Python version is less than 3.6, I suggest you update it. Same goes for pandas if your version is less than 1.0.5. To learn more about the pandas 1.0 update, see my guide [here](https://towardsdatascience.com/whats-new-in-pandas-1-0-ffa99bd43a58).
## Example 1: Soccer ⚽️
Let's scrape some soccer stats about the US Women's National Team - that's football in much of the world. ⚽️
Let's use the U.S. Soccer website: [https://www.ussoccer.com/uswnt-stats](https://www.ussoccer.com/uswnt-stats). ![Soccer Stats Table](https://storage.googleapis.com/published-content/read_html/soccer_stats.png)
In Chrome, go to the website, right click on the data, and select *Inspect*. ![Chrome inspect from menu](https://storage.googleapis.com/published-content/read_html/inspect.png) You should see the sidebar appear. This shows you the HTML behind the page, among other things. Look for the HTML tags `<table>`, `<tbody>`, `<tr>`, or `<td>`. These all signify you have found a table. See [w3schools.com](https://www.w3schools.com/html/html_tables.asp) if you want to learn about HTML table basics. The pandas function we are going to use requires us to find HTML tables. So you've just struck gold! 🎉
Let's grab the data from the webpage. To get each table into a DataFrame we just need to run the following code.
```python
list_of_dfs = pd.read_html('https://www.ussoccer.com/uswnt-stats')
```
Now the DataFrames are in a list.
```python
type(list_of_dfs)
```
Let's see how many DataFrames are in the list.
```python
len(list_of_dfs)
```
Alright, let's have a look at the first DataFrame. 👀
```python
list_of_dfs[0]
```
Looks like a bunch of players' stats. Good, that matches what we would expect from looking at the website. Let's see what's in the second table.
```python
list_of_dfs[1]
```
Looks like it contains the goalkeepers' stats.
In both cases, it would be better if we read in the top row as the column headers and the first column as the index. When you read in the tables with `pd.read_html()` you have many of the same arguments available as you do with `pd.read_csv()`. That's the pandas method that most folks are more familiar with. 👍 Let's put the table columns where we want them by grabbing the data again. This time we'll pass `header=0, index_col=0`.
```python
list_of_dfs_nicer = pd.read_html('https://www.ussoccer.com/uswnt-stats', header=0, index_col=0)
list_of_dfs_nicer[0].head(2)
```
Much nicer. 😀 To see more possible arguments in your Jupyter notebook or IDE, hold down `Shift` + `Tab` with your cursor inside the function. Or just head over to the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html).
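Two of those arguments are especially handy when a page has many tables: `match` keeps only tables whose text matches a regular expression, and `attrs` keeps only tables with matching HTML attributes. Here's a hedged sketch with a small made-up two-table page (again via `StringIO` so it runs offline):

```python
from io import StringIO

import pandas as pd

# Two tables; suppose we only want the goalkeepers' one.
html = """
<table id="field"><tr><th>Name</th><th>G</th></tr>
  <tr><td>Press</td><td>3</td></tr></table>
<table id="keepers"><tr><th>Name</th><th>Saves</th></tr>
  <tr><td>Naeher</td><td>12</td></tr></table>
"""

# match keeps only tables whose text matches the regex.
keepers = pd.read_html(StringIO(html), match="Saves")

# attrs keeps only tables with matching HTML attributes.
by_attr = pd.read_html(StringIO(html), attrs={"id": "keepers"})

print(keepers[0])
```

Both calls return a one-element list containing just the goalkeeper table, which saves you from fishing through a long list of DataFrames by position.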
Let's keep rolling. 🧻 We can save each DataFrame as its own variable. Let's put everyone who isn't a goalie in `runners_df` and the goalies in `goalies_df`. According to the internet and my daughter - who plays a lot of soccer - there isn't a name for all the non-goalies on a soccer team. I am going to call them runners, because that's what they do a lot. 🙂 Please correct me if I'm missing something.
```python
runners_df = list_of_dfs_nicer[0]
goalies_df = list_of_dfs_nicer[1]
```
We want to exclude the final three rows in `runners_df` and the final two rows in `goalies_df`. Let's slice with `.iloc[]`. If you want to learn more about slicing DataFrames and how to use pandas, I humbly suggest you check out my [Memorable Pandas book](https://memorablepandas.com).
```python
runners_df = runners_df.iloc[:-3]
runners_df.tail(2)
```
```python
goalies_df = goalies_df.iloc[:-2]
goalies_df
```
Let's quickly make a visualization showing the goals for each player who has scored this year.
```python
goal_scorers_df = runners_df[runners_df['G'] > 0]
goal_scorers_df['G'].sort_values().plot(kind='barh', title='2020 USWNT Goals');
```
Looks like Christen Press and Lindsey Horan are leading the goal scoring charge in the brief 2020 season. 🏅 Alright, we could do lots of fun data cleaning and visualization, but the focus today is on scraping and reading HTML tables. Let's look at one more example.
## Example 2: Weightlifting 🏋🏼‍♀️
[Wikipedia](https://en.wikipedia.org/wiki/List_of_weight_training_exercises) has an interesting chart displaying which muscle groups are worked by different weightlifting exercises. ![Screenshot of chart from wikipedia](https://storage.googleapis.com/published-content/read_html/weight_wiki.png)
After inspecting it, we see that it's an HTML table. Let's grab it. 🚀
```python
weightlifting_df_list = pd.read_html('https://en.wikipedia.org/wiki/List_of_weight_training_exercises', index_col=0)
len(weightlifting_df_list)
```
```python
weightlifting_df_list[0]
```
That's what we want! 👍
![dumbbells](https://storage.googleapis.com/published-content/read_html/dumbbells.jpg) source: pixabay.com
Let's assign the first DataFrame to a variable.
```python
exercises_df = weightlifting_df_list[0]
```
Now you could do whatever you like with the DataFrame. Maybe you want to filter it to show only the exercises that work your hamstrings.
```python
hammies = exercises_df[(exercises_df['Ham-strings'] == 'Yes') | (exercises_df['Ham-strings'] == 'Some')]
hammies
```
Let's sort the table by the exercises that work the hamstrings a lot.
```python
hammies.sort_values(by='Ham-strings', ascending=False)
```
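One subtlety: descending sort puts 'Yes' above 'Some' here only because 'Y' happens to come after 'S' alphabetically. A more explicit approach is to make the column an ordered categorical so sorting follows meaning. Here's a sketch using a small made-up DataFrame in place of the scraped Wikipedia table:

```python
import pandas as pd

# Toy stand-in for the scraped data: how much each exercise
# works the hamstrings.
df = pd.DataFrame(
    {"Ham-strings": ["Some", "Yes", "No", "Yes"]},
    index=["Squat", "Deadlift", "Bench press", "Leg curl"],
)

# An ordered categorical sorts by meaning ("Yes" > "Some" > "No"),
# not by alphabetical accident.
df["Ham-strings"] = pd.Categorical(
    df["Ham-strings"], categories=["No", "Some", "Yes"], ordered=True
)
sorted_df = df.sort_values(by="Ham-strings", ascending=False)
print(sorted_df)
```

This way the sort still works if the effort labels ever change to something that doesn't sort alphabetically.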
Looks like we should do some lunges, deadlifts and leg curls if we want stronger hamstrings. I think it's time for a workout! 🏋️
## Wrap
You've seen how to use `pd.read_html()` to move data from HTML tables into a list of pandas DataFrames. `pd.read_html()` isn't the right tool for every webscraping endeavor, but when it's an option, it's a great one. 👍 I hope you found this introduction to webscraping with `pd.read_html()` to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀 If you have questions or comments, please share them with me on [Twitter](https://twitter.com/discdiver) or [LinkedIn](https://www.linkedin.com/in/-jeffhale/).
![weight machines in gym](https://storage.googleapis.com/published-content/read_html/blue_gym.jpg)
source: pixabay.com
Happy scraping! 🚀

Run this article as a notebook

Deepnote is a new kind of data science notebook. Jupyter-compatible and with real-time collaboration.

