import necessary libraries
pandas as pd,
numpy as np and
seaborn as sns
conf_matrix (this script should be in your local repo) and
Load your previously saved csv dataframe using pandas'
plot label frequencies
Set the seaborn color palette to "deep" using
Then, plot the label frequencies using sns.countplot() on the column "sentiment" (or whatever you have called it).
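As a minimal sketch of the two plotting steps, assuming a dataframe with a "sentiment" column: the toy dataframe below only stands in for your own loaded data, and sns.set_palette() is assumed here as the call for setting the palette.

```python
import pandas as pd
import seaborn as sns

# Toy data standing in for your own dataframe (assumption for illustration).
df = pd.DataFrame({
    "text": ["super", "schlecht", "ok", "toll", "mies"],
    "sentiment": ["positive", "negative", "neutral", "positive", "negative"],
})

# Assumed call for switching the colour palette to "deep".
sns.set_palette("deep")

# One bar per label; the bar height is the number of rows with that label.
ax = sns.countplot(x="sentiment", data=df)
```

In a notebook the figure is shown automatically; in a plain script you would typically also call matplotlib's plt.show().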
Use a stop word list to remove less meaningful words. Stop words are removed because they carry little meaning yet appear frequently in most texts. We have provided you with a list of common German stopwords ('data/stopwords_german.txt'). Import the packages unidecode first, then use readlines() to save the words contained in the .txt file to a list. Call the Python string function strip() to remove newline characters (\n) and unidecode's unidecode() on every element in the resulting list.
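The steps above could be sketched roughly as follows. The helper name load_stopwords is hypothetical and only used for illustration, and the file is assumed to contain one stop word per line, encoded as UTF-8.

```python
from unidecode import unidecode


def load_stopwords(path):
    """Read one stop word per line, strip newlines, transliterate to ASCII."""
    # Assumption: the file is UTF-8 encoded with one word per line.
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    # strip() removes the trailing "\n"; unidecode() maps e.g. "für" to "fur".
    return [unidecode(line.strip()) for line in lines]


# Usage with the path given in the text (adjust to your repo layout):
# stopwords = load_stopwords("data/stopwords_german.txt")
```

Because the stop words are transliterated, they only match tokens in your texts if the texts have been transliterated in the same way.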
split data for training
To train and evaluate the model, we split the data into a training set and a test set using train_test_split(), the arguments being the text column, the label/sentiment column, a test set size (test_size=0.1 for 10%, test_size=0.3 for 30%, etc.) and an integer of your choice as random_state. You can then call .shape on the resulting sets to see their dimensions.
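A small sketch of the split, assuming the columns are called "text" and "sentiment"; the ten-row dataframe and random_state=42 are placeholders, not values prescribed here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for your own dataframe (column names are assumptions).
df = pd.DataFrame({
    "text": [f"beispieltext {i}" for i in range(10)],
    "sentiment": ["positive", "negative"] * 5,
})

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["sentiment"], test_size=0.3, random_state=42
)

# Each result is a pandas Series, so .shape is a one-element tuple.
print(X_train.shape, X_test.shape)
```

Keeping random_state fixed means you get the same split on every run, which makes results comparable between experiments.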
set up ML pipeline
Instantiate a pipeline by adding 3 steps: a
'tfidf' and a
The CountVectorizer helps us to create numerical values from text by counting the tokens each text contains. Pass
lowercase=True. Pass your list of stopwords as
The arguments for the
Fit your pipeline to the training data by calling fit() on the pipeline object and passing the training texts and training labels.
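The following is a rough sketch of such a pipeline, not necessarily the exact setup intended here. The step names 'vect' and 'clf', the use of TfidfTransformer for the 'tfidf' step and the LogisticRegression classifier are all assumptions, as are the tiny stopword list and the training texts.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Placeholder stop words; in practice use the list loaded from the .txt file.
stopwords = ["und", "der", "die", "das", "ist"]

pipeline = Pipeline([
    # Token counts per text, lowercased, with stop words dropped.
    ("vect", CountVectorizer(lowercase=True, stop_words=stopwords)),
    # Re-weight raw counts by TF-IDF.
    ("tfidf", TfidfTransformer()),
    # Assumed classifier; any scikit-learn classifier could take this slot.
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy training data (assumption for illustration).
train_texts = [
    "der film ist wunderbar",
    "das essen ist schrecklich",
    "die musik ist wunderbar und toll",
    "der service ist schrecklich und langsam",
]
train_labels = ["positive", "negative", "positive", "negative"]

# fit() runs every step in order and trains the final classifier.
pipeline.fit(train_texts, train_labels)
```

Bundling the steps in one Pipeline object means the same preprocessing is applied automatically whenever you later call predict() on new texts.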
We have provided you with a function to score your model using the test texts and labels. In case of encoding issues, calling .values.astype('U') on the texts before passing them to your pipeline might help.
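To illustrate what this conversion does, here is a short sketch on a toy column with mixed types (the data is made up for illustration):

```python
import pandas as pd

# Toy column containing a non-string entry (assumption for illustration).
texts = pd.Series(["sehr gut", "schlecht", 42])

# .values returns the underlying NumPy array; astype("U") casts every
# element to a Unicode string.
texts_unicode = texts.values.astype("U")

print(texts_unicode.dtype.kind)  # "U" marks a NumPy Unicode string array
```

Note that the cast turns every entry into text, including non-string values such as the number above, so it is worth checking the column for unexpected entries first.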
plot confusion matrix
To quickly plot a confusion matrix, use the provided function pplot_cm and pass the same arguments as with
Pass the example texts from the repo description to pipeline.predict() and play around with new texts to get a feeling for how your model determines a sentiment.