Libraries Used
Installing third-party libraries: openpyxl (the engine pandas uses to read .xlsx files) and tweet-preprocessor (for cleaning tweet text)
!pip install openpyxl
!pip install tweet-preprocessor
Specific corpora downloaded from NLTK (the WordNet data that WordNetLemmatizer relies on)
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
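As a quick sanity check that the downloads succeeded (an illustrative snippet; the word "wings" is an arbitrary example):
from nltk.stem import WordNetLemmatizer
WordNetLemmatizer().lemmatize("wings")  # returns 'wing' once wordnet/omw-1.4 are available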
import pandas as pd
import preprocessor as p
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
import re
import string
from scipy.special import softmax
import numpy as np
Importing Tweet File
Importing Tweets
Tweets were extracted using RapidMiner and exported as an Excel file.
airasia_twts = pd.read_excel("airasia3.xlsx")
airasia_twts.columns
Removing unwanted columns
Only the text and retweet-count columns were kept; the rest were dropped.
airasia_twts.drop(columns=["Created-At", "From-User", "From-User-Id", "To-User","To-User-Id", "Language", "Source", "Geo-Location-Latitude", "Geo-Location-Longitude", "Id" ], inplace=True)
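If the export schema varies between RapidMiner runs, a more defensive variant (a sketch, assuming the same column names as above) ignores columns that are absent instead of raising an error:
unwanted = ["Created-At", "From-User", "From-User-Id", "To-User", "To-User-Id",
            "Language", "Source", "Geo-Location-Latitude", "Geo-Location-Longitude", "Id"]
airasia_twts.drop(columns=unwanted, errors="ignore", inplace=True)  # skips missing columns silently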
Text Pre-processing
Initialize tokenizers
NLTK has a dedicated tokenizer for tweets, which was used in this project. Inflected word forms in the tweets were reduced to their base forms with WordNetLemmatizer.
tknzr = TweetTokenizer(preserve_case=False)
lemmatizer = WordNetLemmatizer()
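A brief illustration of what each tool does (the sample string is made up):
tknzr.tokenize("Flight DELAYED again :(")  # ['flight', 'delayed', 'again', ':('] — lowercased, emoticon kept as one token
lemmatizer.lemmatize("flights")            # 'flight' — noun inflections reduced to the base form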
Custom function for cleaning text
Cleaning tokens common in tweets, such as RT and @-mentions, using the tweet-preprocessor library (a short p.clean demo follows this list)
Removing punctuation and digits
Lemmatizing words
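For example (an illustrative call on a made-up tweet; the exact tokens removed depend on tweet-preprocessor's active options, which by default clean URLs, mentions, and reserved words such as RT):
p.clean("RT @AirAsia: see you soon! https://t.co/abc")  # roughly ': see you soon!'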
def preprocess_tweets(tweet: str):
    # Remove URLs, @-mentions, reserved words (RT, FAV), etc.
    cleaned_text = p.clean(tweet)
    # Strip all punctuation and digits
    translation_table = str.maketrans('', '', string.punctuation + string.digits)
    cleaned_text = cleaned_text.translate(translation_table)
    # Tokenize (lowercased via preserve_case=False) and lemmatize each token
    cleaned_text = [lemmatizer.lemmatize(w) for w in tknzr.tokenize(cleaned_text)]
    return ' '.join(cleaned_text)
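A quick end-to-end check on a made-up tweet (not from the dataset):
preprocess_tweets("RT @AirAsia: 2 flights cancelled!!! https://t.co/xyz")
# expected output along the lines of 'flight cancelled'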
airasia_twts.head()
Applying the cleaning function to the Text column
airasia_twts['Text'] = airasia_twts['Text'].apply(preprocess_tweets)
Removing duplicated Tweets
Most commonly this removes retweets, which share identical text with the original tweet after cleaning.
airasia_twts.drop_duplicates(subset=["Text"], inplace=True)
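If a record of how many rows were removed is useful, the same step can be written as below (an equivalent sketch; drop_duplicates is idempotent, so re-running it after the cell above is harmless):
before = len(airasia_twts)
airasia_twts.drop_duplicates(subset=["Text"], inplace=True)
print(f"Removed {before - len(airasia_twts)} duplicate tweets")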
Exporting cleaned file
airasia_twts.to_csv("cleaned_airasia_tweets_1.csv")
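Note that to_csv writes the DataFrame index as an unnamed extra column by default; passing index=False keeps the output to just the data columns:
airasia_twts.to_csv("cleaned_airasia_tweets_1.csv", index=False)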