Libraries Used
Installing third-party libraries: openpyxl (the engine pandas uses to read .xlsx files) and tweet-preprocessor (for cleaning tweet text)
!pip install openpyxl
!pip install tweet-preprocessor
Specific corpora downloaded from NLTK (the WordNet data that WordNetLemmatizer relies on)
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
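As a quick sanity check that the downloads succeeded (an illustrative snippet; the word "wings" is an arbitrary example):
from nltk.stem import WordNetLemmatizer
WordNetLemmatizer().lemmatize("wings")  # returns 'wing' once wordnet/omw-1.4 are available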
import pandas as pd
import preprocessor as p
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
import re
import string
from scipy.special import softmax
import numpy as np
Importing Tweet File
Importing Tweets
Tweets were extracted using RapidMiner and exported as an Excel file.
airasia_twts = pd.read_excel("airasia3.xlsx")
airasia_twts.columns
Removing unwanted columns
Only the text and retweet-count columns were kept; the rest were dropped.
airasia_twts.drop(columns=["Created-At", "From-User", "From-User-Id", "To-User","To-User-Id", "Language", "Source", "Geo-Location-Latitude", "Geo-Location-Longitude", "Id" ], inplace=True)
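If the export schema varies between RapidMiner runs, a more defensive variant (a sketch, assuming the same column names as above) ignores columns that are absent instead of raising an error:
unwanted = ["Created-At", "From-User", "From-User-Id", "To-User", "To-User-Id",
            "Language", "Source", "Geo-Location-Latitude", "Geo-Location-Longitude", "Id"]
airasia_twts.drop(columns=unwanted, errors="ignore", inplace=True)  # skips missing columns silently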
Text Pre-processing
Initialize tokenizers
NLTK has a dedicated tokenizer for tweets, which was used in this project. Inflected word forms in the tweets were reduced to their base forms with WordNetLemmatizer.
tknzr = TweetTokenizer(preserve_case=False)
lemmatizer = WordNetLemmatizer()
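A brief illustration of what each tool does (the sample string is made up):
tknzr.tokenize("Flight DELAYED again :(")  # ['flight', 'delayed', 'again', ':('] — lowercased, emoticon kept as one token
lemmatizer.lemmatize("flights")            # 'flight' — noun inflections reduced to the base form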
Custom function for cleaning text
Cleaning tokens common in tweets, such as RT and @-mentions, using the tweet-preprocessor library (a short p.clean demo follows this list)
Removing punctuation and digits
Lemmatizing words
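For example (an illustrative call on a made-up tweet; the exact tokens removed depend on tweet-preprocessor's active options, which by default clean URLs, mentions, and reserved words such as RT):
p.clean("RT @AirAsia: see you soon! https://t.co/abc")  # roughly ': see you soon!'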
def preprocess_tweets(tweet: str):
    # Remove URLs, @-mentions, reserved words (RT, FAV), etc.
    cleaned_text = p.clean(tweet)
    # Strip all punctuation and digits
    translation_table = str.maketrans('', '', string.punctuation + string.digits)
    cleaned_text = cleaned_text.translate(translation_table)
    # Tokenize (lowercased via preserve_case=False) and lemmatize each token
    cleaned_text = [lemmatizer.lemmatize(w) for w in tknzr.tokenize(cleaned_text)]
    return ' '.join(cleaned_text)
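A quick end-to-end check on a made-up tweet (not from the dataset):
preprocess_tweets("RT @AirAsia: 2 flights cancelled!!! https://t.co/xyz")
# expected output along the lines of 'flight cancelled'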
airasia_twts.head()
Applying the cleaning function to the Text column
airasia_twts['Text'] = airasia_twts['Text'].apply(preprocess_tweets)
Removing duplicated Tweets
Most commonly this removes retweets, which share identical text with the original tweet after cleaning.
airasia_twts.drop_duplicates(subset=["Text"], inplace=True)
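If a record of how many rows were removed is useful, the same step can be written as below (an equivalent sketch; drop_duplicates is idempotent, so re-running it after the cell above is harmless):
before = len(airasia_twts)
airasia_twts.drop_duplicates(subset=["Text"], inplace=True)
print(f"Removed {before - len(airasia_twts)} duplicate tweets")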
Exporting cleaned file
airasia_twts.to_csv("cleaned_airasia_tweets_1.csv")
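Note that to_csv writes the DataFrame index as an unnamed extra column by default; passing index=False keeps the output to just the data columns:
airasia_twts.to_csv("cleaned_airasia_tweets_1.csv", index=False)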