Sentiment Analysis for an Under-Resourced Language
Train a sentiment classifier (Positive, Negative, Neutral) on a corpus of the provided documents.
Your goal is to maximize accuracy. There is special interest in being able to accurately detect negative sentiment. The training data includes documents from a wide variety of sources, not merely social media, and some of it may be inconsistently labeled. Please describe the outcomes in your work sample, including how data limitations impact your results and how these limitations could be addressed at a larger scale.
We filled missing values using the pad (forward-fill) method
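A minimal sketch of the pad (forward-fill) imputation step with pandas; the file and column names ("sentiment_corpus.csv", "text", "label") are assumptions for illustration, not the actual dataset.

```python
import pandas as pd

# Hypothetical file and column names; the real corpus may be structured differently.
df = pd.read_csv("sentiment_corpus.csv")

# "pad" propagates the last valid observation forward; df.ffill() is the modern equivalent.
df[["text", "label"]] = df[["text", "label"]].fillna(method="pad")
```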
Remove noise from our dataset
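The exact cleaning rules are not spelled out above; the sketch below shows a typical set of rules (dropping URLs, mentions/hashtags, punctuation, and extra whitespace) that keeps Devanagari characters intact, and is only an assumed stand-in for the notebook's actual cleaning step.

```python
import re

def clean_text(text: str) -> str:
    """Illustrative noise removal; the notebook's actual rules may differ."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)            # mentions and hashtags
    text = re.sub(r"[^\w\s]", " ", text)            # punctuation (keeps Unicode word chars, incl. Devanagari)
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

df["clean_text"] = df["text"].astype(str).map(clean_text)
```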
A word cloud is a convenient way to visualize word frequency
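A sketch of the word-cloud step using the wordcloud package; the font path is a placeholder assumption, since a Devanagari-capable font is needed to render Hindi tokens.

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=800, height=400, background_color="white",
               font_path="NotoSansDevanagari-Regular.ttf")  # placeholder font path
wc.generate(" ".join(df["clean_text"]))

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```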
Tokenized the text using CountVectorizer
CountVectorizer converts a collection of text documents to a matrix of token counts
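A sketch of the vectorization step with scikit-learn's CountVectorizer, including a held-out test split; the split ratio and the default vectorizer settings are assumptions rather than the exact configuration used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"])

vectorizer = CountVectorizer()                      # default token counts; real settings may differ
X_train_vec = vectorizer.fit_transform(X_train)     # learn the vocabulary on training data only
X_test_vec = vectorizer.transform(X_test)
```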
Cross Validation
Vectorize text
Build models
Bias and Variance Tradeoff on MultinomialNB
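One way to inspect the bias-variance tradeoff is to compare training accuracy with cross-validated accuracy, as sketched below for MultinomialNB; this reuses the vectorized features from the previous sketch and is not necessarily the exact diagnostic used in the notebook.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

nb = MultinomialNB()
nb.fit(X_train_vec, y_train)

train_acc = nb.score(X_train_vec, y_train)                                            # training accuracy
cv_acc = cross_val_score(nb, X_train_vec, y_train, cv=5, scoring="accuracy").mean()   # generalization estimate

# A large gap (train >> CV) points to variance/overfitting; both scores low points to bias/underfitting.
print(f"train accuracy: {train_acc:.3f}  cross-val accuracy: {cv_acc:.3f}")
```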
Use Random Forest Classifier
Bias and Variance Tradeoff for RF
Using the SGD linear classifier
Bias and Variance Tradeoff for SGD
Good fit: SGD
XGBoost Classifier
Best model: XGBoost
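A sketch of the XGBoost classifier reported as the best model; the hyperparameters are illustrative assumptions, and the labels are integer-encoded because XGBClassifier expects numeric classes.

```python
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()                       # map Positive/Negative/Neutral to integers
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                    eval_metric="mlogloss")   # illustrative settings only
xgb.fit(X_train_vec, y_train_enc)
print("test accuracy:", xgb.score(X_test_vec, y_test_enc))
```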
Using the AdaBoost classifier
Bias and Variance Tradeoff for AdaBoost
Well-fitted model: AdaBoost classifier
Logistic Regression with GridSearch parameter tuning
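A sketch of the grid-searched logistic regression; the parameter grid below is an assumption, not the grid actually tuned.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}   # hypothetical grid
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train_vec, y_train)

print("best params:", grid.best_params_, "best CV accuracy:", grid.best_score_)
logreg = grid.best_estimator_
```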
Bias and Variance Tradeoff for Logistic Regression
Confusion Matrix for the Logistic Regression
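The confusion matrix makes the per-class behaviour visible, which matters for the stated focus on detecting negative sentiment; the class names below assume the labels are stored as the three strings from the brief.

```python
from sklearn.metrics import confusion_matrix, classification_report

y_pred = logreg.predict(X_test_vec)
print(confusion_matrix(y_test, y_pred, labels=["Negative", "Neutral", "Positive"]))
print(classification_report(y_test, y_pred))   # per-class precision/recall, incl. Negative
```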
Test on a random Tweet
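A sketch of scoring a single unseen tweet with the fitted vectorizer and the tuned logistic regression; the Hindi example text is a placeholder, not a document from the corpus.

```python
sample = "यह फिल्म बहुत खराब थी"            # placeholder tweet ("this movie was very bad")
sample_vec = vectorizer.transform([clean_text(sample)])
print("predicted sentiment:", logreg.predict(sample_vec)[0])
```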
Conclusion
Although our goal is to maximize accuracy, bias and variance are other important factors to consider, since our focus is on an under-resourced language and on adequately detecting negative sentiment without being biased.
Outcome
I will break my outcome down into six sections:
- Under-resourced language
- Understanding the dataset
- Insights derived from the corpus
- Model interpretation
- How data limitations affect the model results
- How these limitations can be addressed at a larger scale
There are over 6,900 languages in the world today, and only a small fraction of them offer the resources required for implementing Natural Language Processing or Human Language Technologies.
However, most technologies are concerned with languages for which large resources are available or which have suddenly become of interest for economic or political reasons. Unfortunately, most languages from developing countries have received only little attention so far. One way we intend to narrow this language divide is by building Natural Language applications.
About 99% of the dataset is written in Hindi and the other 1% in English. After effective cleaning and preprocessing, I examined the sentiment labels closely and found that most texts are neutral. In addition, a word cloud was generated to inspect word frequencies.
After effective data preprocessing and feature engineering, our best-performing model with respect to the bias-variance tradeoff was built with the XGBoost algorithm, and its accuracy is 66%. However, our goal is to improve the model over time as more data are fed into it.
Based on the results from this high-performance model with a bias-variance trade-off, more data are required to increase the accuracy and further optimize the model.
These limitations can be addressed at a larger scale using data collection techniques such as surveys/questionnaires, scraping Hindi text from social media, and traditional data collection, as well as deep learning approaches to improve model accuracy.