HW01: Intro to Text Data
In this assignment, we will explore how to load a text classification dataset (AG's News, originally posted here), how to preprocess it, and how to extract useful information from a real-world dataset. First, we have to download the data; we only download a subset with four classes.
Inspect Data
Let's make the data more human-readable by adding a header and replacing the label codes with readable class names.
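A minimal sketch of this step, assuming the file is a headerless CSV with three columns (label, title, description) and integer label codes. The column names, the sample rows, and the code-to-name mapping below are all assumptions for illustration; check them against the file you actually downloaded.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: headerless CSV rows of
# (label, title, description). The real column layout may differ.
raw_csv = io.StringIO(
    '3,"Oil prices rise","Crude futures climbed on supply worries."\n'
    '2,"Home team wins","A late goal decided the match."\n'
    '1,"Talks resume","Negotiators met again on Monday."\n'
)

# header=None stops pandas from treating the first data row as column names;
# names= supplies the header ourselves.
df = pd.read_csv(raw_csv, header=None, names=["label", "title", "description"])

# Assumed mapping from label code to class name; adjust to your subset.
label_names = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
df["label"] = df["label"].map(label_names)

print(df.head())
```

Any code missing from the mapping becomes NaN after map, so a quick df["label"].isna().sum() is a cheap way to verify that the mapping covers every class.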
Document Length
Word Frequency
Let's implement a keyword search (similar to the approach behind the Baker-Bloom-Davis Economic Policy Uncertainty index) and compute how often some given keywords ("play", "tax", "blackberry", "israel") appear in the different classes in our data.
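One possible way to sketch the keyword search, assuming a DataFrame with a label column and a text column. The miniature dataset and the helper count_keyword are hypothetical and only serve to make the example runnable. Matching here is case-insensitive and restricted to whole words, which is a design choice rather than a requirement of the assignment.

```python
import re

import pandas as pd

# Hypothetical miniature dataset standing in for the news data.
df = pd.DataFrame({
    "label": ["Sports", "Business", "Sci/Tech", "World", "Business"],
    "text": [
        "The team will play the final on Sunday; play starts at noon.",
        "Lawmakers debate a new tax cut as tax revenue falls.",
        "BlackBerry maker unveils a new handset.",
        "Israel and its neighbours resume talks.",
        "Investors play down tax fears.",
    ],
})

keywords = ["play", "tax", "blackberry", "israel"]


def count_keyword(text: str, keyword: str) -> int:
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# One count column per keyword, one row per document.
for kw in keywords:
    df[kw] = [count_keyword(text, kw) for text in df["text"]]

# Sum the per-document counts within each class.
keyword_counts = df.groupby("label")[keywords].sum()
print(keyword_counts)
```

Because of the word boundaries, "play" does not match "player" and "tax" does not match "taxes". Dropping the \b anchors would turn this into a substring search instead.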
As a last exercise, let's plot the number of occurrences of "tax" in the different classes in the dataset. Hint: have a look at the pandas bar plot combined with groupby.
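The hint can be sketched roughly as follows, assuming matplotlib is installed and the data sits in a DataFrame with label and text columns. The sample rows are hypothetical, and the whole-word, case-insensitive matching is one possible choice among several.

```python
import re

import pandas as pd

# Hypothetical miniature dataset standing in for the news data.
df = pd.DataFrame({
    "label": ["Business", "Business", "World", "Sports"],
    "text": [
        "Lawmakers debate a new tax cut as tax revenue falls.",
        "Investors play down tax fears.",
        "Ministers discuss a regional tax treaty.",
        "The team will play the final on Sunday.",
    ],
})

# Count whole-word, case-insensitive matches of "tax" in each document.
df["tax"] = df["text"].str.count(r"\btax\b", flags=re.IGNORECASE)

# Group by class and sum the per-document counts.
tax_per_class = df.groupby("label")["tax"].sum()

# Series.plot.bar draws one bar per class and returns a matplotlib Axes.
ax = tax_per_class.plot.bar(title='Occurrences of "tax" per class')
ax.set_xlabel("class")
ax.set_ylabel("occurrences")
```

In a notebook the figure is displayed inline; in a plain script, calling matplotlib.pyplot.show() or ax.figure.savefig(...) is needed to see or keep the plot.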