Natural Language Processing with Disaster Tweets
Predict which Tweets are about real disasters and which ones are not
Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they're observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (e.g., disaster relief organizations and news agencies).
In this competition, you're challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren't. You'll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
This notebook is strictly for beginners and is your entryway into the world of natural language processing. I have used the dataset from this Kaggle competition and simple tools for cleaning the text and training on it.
I will show you how to:
- Analyze the dataset
- Visualize the keywords
- Clean the data
- Train a simple model
- Evaluate model metrics (F1)
- Make predictions on the test dataset
Importing Required Libraries
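A minimal sketch of the setup, assuming the standard Kaggle input path for this competition:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Resources NLTK needs for tokenization and lemmatization
nltk.download('punkt')
nltk.download('wordnet')

# Kaggle mounts the competition files under /kaggle/input
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
```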
Checking Missing Values
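A quick way to count nulls per column; in this dataset `keyword` and `location` have gaps, while `text` is complete:

```python
# Missing values per column in train and test
print(train.isnull().sum())
print(test.isnull().sum())
```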
Going deep into disaster tweets
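One quick way in is to peek at a few tweets from each class (`target` is 1 for real disasters, 0 otherwise):

```python
# A few example tweets from each class
for label, name in [(1, 'disaster'), (0, 'not disaster')]:
    print(f'--- {name} ---')
    for tweet in train.loc[train['target'] == label, 'text'].head(3):
        print(tweet)
```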
Most common keywords
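A minimal sketch of a bar chart over the most frequent values in the `keyword` column:

```python
# Top 15 keywords by frequency (NaNs are excluded by value_counts)
top_keywords = train['keyword'].value_counts().head(15)
top_keywords.plot(kind='barh', figsize=(8, 6), title='Most common keywords')
plt.gca().invert_yaxis()  # largest bar on top
plt.tight_layout()
plt.show()
```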
Using pie chart
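For instance, a pie chart of the disaster / non-disaster split (the label order assumes class 0 is the majority, as it is in this dataset):

```python
# Share of disaster vs. non-disaster tweets
train['target'].value_counts().plot(
    kind='pie', labels=['Not disaster', 'Disaster'],
    autopct='%1.1f%%', figsize=(5, 5))
plt.ylabel('')
plt.show()
```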
Location of Tweets
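The `location` field is free text typed by users, so it is noisy; a frequency count shows this quickly:

```python
# Most frequent user-supplied locations (free text, so very noisy)
print(train['location'].value_counts().head(10))
```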
Word cloud of tweets
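A minimal sketch using the `wordcloud` package:

```python
from wordcloud import WordCloud, STOPWORDS

# Combine all tweet text and render a word cloud
all_text = ' '.join(train['text'])
wc = WordCloud(width=800, height=400, background_color='white',
               stopwords=STOPWORDS).generate(all_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
```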
Tokenization is the process of splitting a string of text into a list of tokens. You can think of tokens as parts of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph.
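For example, with NLTK's `word_tokenize` (note how it also splits contractions and punctuation into their own tokens):

```python
sample = "There's a forest fire near the highway!"
print(word_tokenize(sample))
# ['There', "'s", 'a', 'forest', 'fire', 'near', 'the', 'highway', '!']
```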
Stemming and lemmatization in Python's NLTK are text normalization techniques for natural language processing, widely used in text preprocessing. The difference is that stemming is faster because it simply chops word endings without knowing the context, while lemmatization is slower because it looks each word up in a vocabulary (WordNet) to return a proper base form.
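A small comparison of the two on some disaster-flavored words (outputs shown as comments are approximate and may vary slightly by NLTK version):

```python
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['fires', 'burning', 'collapsed', 'emergencies']
print([stemmer.stem(w) for w in words])
# ['fire', 'burn', 'collaps', 'emerg']
print([lemmatizer.lemmatize(w) for w in words])
# ['fire', 'burning', 'collapsed', 'emergency']  (defaults to noun POS)
```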
In this case, PorterStemmer performed better than lemmatization.
Machine learning algorithms most often take numeric feature vectors as input. Thus, when working with text documents, we need a way to convert each document into a numeric vector.
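`CountVectorizer` from scikit-learn does exactly this, turning each document into a vector of word counts; a tiny sketch on a toy corpus:

```python
corpus = [
    'Forest fire near La Ronge',
    'Thousands flee as the fire spreads',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # one count vector per document
```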
In this case, CountVectorizer performed best.
Using Logistic Regression to Train the Model
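A minimal sketch of the training step; the exact features come from whichever cleaning choices were made above, so here I simply vectorize the raw `text` column for illustration:

```python
# Turn tweets into count vectors and hold out a validation split
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train['text'])
y = train['target']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
print('Logistic regression F1:', f1_score(y_val, lr.predict(X_val)))
```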
Using Simple Naive Bayes
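Reusing the split from the sketch above, Multinomial Naive Bayes pairs naturally with count features:

```python
nb = MultinomialNB()
nb.fit(X_train, y_train)
print('Naive Bayes F1:', f1_score(y_val, nb.predict(X_val)))
```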
Our simple logistic regression performed poorly on the F1 score, so I decided to choose another model for training; you can choose any gradient boosting or simple linear model to train on the data.
This is the best score I could come up with after experimenting with various text vectorizers, text cleaning steps, and simple model implementations.
Fitting the model and predicting on the test data.
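Sketch: refit the chosen model on all the training data, then transform the test tweets with the same fitted vectorizer before predicting:

```python
# Refit on the full training set and predict the unseen test tweets
final_model = MultinomialNB()
final_model.fit(X, y)

X_test = vectorizer.transform(test['text'])  # transform, not fit_transform
test_preds = final_model.predict(X_test)
```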
Final Submission to the Competition
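The competition expects a CSV with `id` and `target` columns:

```python
# Write predictions in the format Kaggle expects
submission = pd.DataFrame({'id': test['id'], 'target': test_preds})
submission.to_csv('submission.csv', index=False)
```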
You can submit your predictions to this competition and see where you stand on the leaderboard.
If you like my work, do ❤ it and share it with others.
I got a score of 0.791, which is not bad.