Extracting structured table data from PDF documents programmatically can be challenging, because a PDF stores positioned text and lines rather than rows and columns. pdfplumber streamlines the process. In this tutorial, we will extract and clean tabular data from a sample PDF using pdfplumber in Python.

Loading the PDF

Let’s start by loading a sample PDF file (ca-warn-report.pdf in this case) and selecting its first page.

import pdfplumber # !pip install pdfplumber

# Load the PDF
pdf = pdfplumber.open("../pdfs/ca-warn-report.pdf")

# Get the first page
first_page = pdf.pages[0]
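
The snippet above leaves the PDF open for the rest of the session. As a minimal sketch, the same loading step can be wrapped in a context manager so the file is closed once extraction finishes. The helper name `extract_first_table` and its `page_index` parameter are illustrative assumptions, not part of pdfplumber.

```python
from pathlib import Path


def extract_first_table(pdf_path, page_index=0):
    """Return the largest table on one page as a list of rows, or None.

    Hypothetical helper for illustration; not a pdfplumber function.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No PDF found at {path}")

    # Imported here so the path check above works even before pdfplumber is installed.
    import pdfplumber

    # The with-block closes the file when extraction finishes.
    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_table()
```

Calling `extract_first_table("../pdfs/ca-warn-report.pdf")` should give the same rows as the step-by-step version. Anything that still needs the page object, such as visual debugging, has to happen inside the `with` block, so the open-and-keep approach used in this tutorial is the more convenient one in a notebook.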

Extracting the table

Now, let’s use pdfplumber’s extract_table method to retrieve the data from the largest table on the first page of the PDF.

# Extract the table
table = first_page.extract_table()

# Display the first few rows of the table
print(table[:3])

The extract_table method returns a list of lists, where each inner list represents a row in the table. For better clarity and manipulation, we can convert this list into a pandas DataFrame and perform additional data cleanup if necessary.
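
To make the list-of-lists shape concrete, here is a small sketch that tidies rows before they reach pandas. The sample rows are invented for illustration and are not taken from ca-warn-report.pdf; `normalize_rows` is a hypothetical helper. It assumes that a cell may come back as `None` where no cell was found, and that cell text may contain line breaks.

```python
def normalize_rows(table):
    """Replace None cells with "" and collapse line breaks and repeated spaces."""
    cleaned = []
    for row in table:
        cleaned.append(
            ["" if cell is None else " ".join(cell.split()) for cell in row]
        )
    return cleaned


# Invented rows in the shape extract_table() returns: one inner list per row.
sample = [
    ["Company", "City", "Employees"],
    ["Example Co", "San Jose", "52"],
    ["Sample\nHoldings", None, "7"],
]

rows = normalize_rows(sample)
header, body = rows[0], rows[1:]
print(header)
print(body)
```

The first inner list is treated as the header row here, which is the same assumption the DataFrame conversion in the next section makes.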

Cleaning up with Pandas

We’ll use pandas to convert the extracted table into a data frame and perform basic data-cleaning operations, such as removing extra spaces.

import pandas as pd

# Convert the table to a DataFrame
df = pd.DataFrame(table[1:], columns=table[0])

# Clean up columns with extra spaces
for column in ["Effective", "Received"]:
    df[column] = df[column].str.replace(" ", "")

# Display the cleaned DataFrame
print(df.head())
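
Once the stray spaces are gone, the date columns can be parsed into real dates. The sketch below runs the same cleanup on invented rows, so the values, the "Company" column, and the `%m/%d/%Y` date format are all assumptions to adjust for your own PDF.

```python
import pandas as pd

# Invented rows shaped like extract_table() output; not real data from the report.
table = [
    ["Effective", "Received", "Company"],
    ["03/25/2 014", "07/01/201 4", "Example Co"],
    ["08/29/20 14", "07/ 01/2014", "Sample\nHoldings"],
]

df = pd.DataFrame(table[1:], columns=table[0])

for column in ["Effective", "Received"]:
    # regex=False treats " " as a literal character, not a pattern.
    df[column] = df[column].str.replace(" ", "", regex=False)
    # Assumed date layout: month/day/four-digit year.
    df[column] = pd.to_datetime(df[column], format="%m/%d/%Y")

# Line breaks inside a cell become single spaces.
df["Company"] = df["Company"].str.replace("\n", " ", regex=False)

print(df.dtypes)
```

With an explicit `format`, `pd.to_datetime` raises an error on values that do not match, which is a useful early signal that a column needs more cleanup.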

Visualizing the table extraction process

pdfplumber provides a visual debugging feature that helps us understand how it identifies and extracts tables from the PDF. Let’s visualize this process for better comprehension.

# Display visual debugging for table extraction
im = first_page.to_image()
im.debug_tablefinder()
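
In a notebook the debug image renders inline. Outside a notebook it can help to write it to a file instead. The following is a minimal sketch, assuming a page object like `first_page` above; the helper name `save_tablefinder_debug`, the output path, and the default resolution of 150 are illustrative choices.

```python
def save_tablefinder_debug(page, out_path, resolution=150, table_settings=None):
    """Render a page, overlay the table finder's lines and cells, and save a PNG.

    Hypothetical helper; `page` is expected to be a pdfplumber page.
    """
    im = page.to_image(resolution=resolution)
    # An empty dict means pdfplumber's default table settings.
    im.debug_tablefinder(table_settings or {})
    im.save(out_path, format="PNG")
    return out_path
```

For example, `save_tablefinder_debug(first_page, "table-debug.png")` would write the overlay next to the notebook. Passing the same `table_settings` here and to `extract_table` keeps the picture consistent with what is actually extracted.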

Conclusion

By combining pdfplumber with pandas, you can turn tables locked inside a PDF into structured datasets ready for analysis or further processing in your Python workflows: load the page, extract the table, clean it in a DataFrame, and use the visual debugger when the result looks wrong. If you encounter any issues, please get in touch with our support. Happy coding in Deepnote!
