Omar Torres, Bella Mendoza, Zachary Kuo

University of California, Berkeley

Ordinal regression is a classification method for categories on an ordinal scale -- e.g. [1, 2, 3, 4, 5] or [G, PG, PG-13, R]. This notebook implements ordinal regression using the method of Frank and Hall (2001), which transforms a k-class problem into k-1 binary classification problems (each binary classifier predicts whether a data point is above a threshold in the ordinal scale -- e.g., whether a movie is "higher" than PG). This method can be used with any binary classification method that outputs probabilities; here L2-regularized binary logistic regression is used.
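As a rough illustration of the decomposition described above, the sketch below combines the k-1 threshold probabilities P(y > level_i) into one probability per class and picks the most likely class. The function name `combine_threshold_probs` and the example numbers are made up for illustration; they are not taken from this notebook's data.

```python
def combine_threshold_probs(p_greater):
    """Turn k-1 threshold probabilities into k class probabilities.

    p_greater[i] is a binary classifier's estimate of P(y > level_i),
    for i = 0 .. k-2 (hypothetical inputs, for illustration only).
    """
    k = len(p_greater) + 1
    probs = [0.0] * k
    probs[0] = 1.0 - p_greater[0]                    # P(y = lowest level)
    for i in range(1, k - 1):
        probs[i] = p_greater[i - 1] - p_greater[i]   # P(y = level_i)
    probs[k - 1] = p_greater[k - 2]                  # P(y = highest level)
    return probs


# Made-up threshold probabilities for a 4-level scale
probs = combine_threshold_probs([0.9, 0.6, 0.2])
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_index)
```

Because the binary classifiers are trained independently, one of the differences can come out slightly negative; taking the argmax over the combined values still yields a prediction.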

This notebook trains a model (on train.txt), optimizes L2 regularization strength on dev.txt, and evaluates performance on test.txt. It reports test accuracy with 95% confidence intervals.

from scipy import sparse
from sklearn import linear_model
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
import operator
import nltk
import math
from scipy.stats import norm
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import pandas as pd
from sklearn.model_selection import train_test_split

Splitting Annotated Data

# Reading data
adjudicated_data = pd.read_csv("adjudicated.txt", sep="\t", names=['index','format','label','text'])
adjudicated_data.head()

# Splitting data: 60% train, 20% dev, 20% test
train_df, test_df = train_test_split(adjudicated_data, test_size=0.4, random_state=42)
developement_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)
print(train_df.shape, developement_df.shape, test_df.shape)

# Adding txt files to splits folder (the splits/ directory must already exist)
train_df.to_csv('splits/train.txt', header=None, index=None, sep='\t')
developement_df.to_csv('splits/dev.txt', header=None, index=None, sep='\t')
test_df.to_csv('splits/test.txt', header=None, index=None, sep='\t')

Modeling

!python -m nltk.downloader punkt

def load_ordinal_data(filename, ordering):
    X = []
    Y = []
    orig_Y = []
    for ordinal in ordering:
        Y.append([])
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols = line.split("\t")
            idd = cols[0]
            label = cols[2].lstrip().rstrip()
            text = cols[3]
            X.append(text)

            # Find the position of this label in the ordering, e.g. ['1','2','3','4'].
            # For each position i in the ordering, append 1 to Y[i] if the label
            # ranks above ordering[i], and 0 otherwise.
            index = ordering.index(label)
            for i in range(len(ordering)):
                if index > i:
                    Y[i].append(1)
                else:
                    Y[i].append(0)
            orig_Y.append(label)
    return X, Y, orig_Y
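To make the encoding concrete, here is a small standalone sketch of the same idea applied to a single label. `encode_ordinal` is a hypothetical helper, not part of the notebook. It returns one 0/1 target per threshold, whereas `load_ordinal_data` also builds a list for the last level, which is always 0 and is not used in training.

```python
def encode_ordinal(label, ordering):
    """Return k-1 binary targets: target i is 1 if label ranks above ordering[i]."""
    index = ordering.index(label)
    return [1 if index > i else 0 for i in range(len(ordering) - 1)]


ordering = ["1", "2", "3", "4"]
for label in ordering:
    # Each row shows the targets for the thresholds ">1", ">2", ">3"
    print(label, encode_ordinal(label, ordering))
```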

class OrdinalClassifier:

    def __init__(self, ordinal_values, feature_method, trainX, trainY, devX, devY, testX, testY, orig_trainY, orig_devY, orig_testY):
        self.ordinal_values = ordinal_values
        self.feature_vocab = {}
        self.feature_method = feature_method
        self.min_feature_count = 2
        # One binary classifier per threshold (k-1 in total)
        self.log_regs = [None] * (len(self.ordinal_values) - 1)

        self.trainY = trainY
        self.devY = devY
        self.testY = testY

        self.orig_trainY = orig_trainY
        self.orig_devY = orig_devY
        self.orig_testY = orig_testY

        self.trainX = self.process(trainX, training=True)
        self.devX = self.process(devX, training=False)
        self.testX = self.process(testX, training=False)

    # Featurize entire dataset
    def featurize(self, data):
        featurized_data = []
        for text in data:
            feats = self.feature_method(text)
            featurized_data.append(feats)
        return featurized_data

    # Featurize a dataset and return its representation as a sparse matrix
    def process(self, X_data, training=False):
        data = self.featurize(X_data)

        if training:
            # Build the vocabulary from features seen in at least min_feature_count documents
            fid = 0
            feature_doc_count = Counter()
            for feats in data:
                for feat in feats:
                    feature_doc_count[feat] += 1

            for feat in feature_doc_count:
                if feature_doc_count[feat] >= self.min_feature_count:
                    self.feature_vocab[feat] = fid
                    fid += 1

        F = len(self.feature_vocab)
        D = len(data)
        X = sparse.dok_matrix((D, F))
        for idx, feats in enumerate(data):
            for feat in feats:
                if feat in self.feature_vocab:
                    X[idx, self.feature_vocab[feat]] = feats[feat]

        return X

    def train(self):
        (D, F) = self.trainX.shape

        # For each threshold, keep the model whose C gives the best dev accuracy
        for idx, ordinal_value in enumerate(self.ordinal_values[:-1]):
            best_dev_accuracy = 0
            best_model = None
            for C in [0.1, 1, 10, 50, 100, 200]:
                log_reg = linear_model.LogisticRegression(C=C, max_iter=1000)
                log_reg.fit(self.trainX, self.trainY[idx])
                development_accuracy = log_reg.score(self.devX, self.devY[idx])
                if development_accuracy > best_dev_accuracy:
                    best_dev_accuracy = development_accuracy
                    best_model = log_reg

            self.log_regs[idx] = best_model

    def test(self):
        cor = tot = 0
        counts = Counter()
        preds = [None] * (len(self.ordinal_values) - 1)
        for idx, ordinal_value in enumerate(self.ordinal_values[:-1]):
            preds[idx] = self.log_regs[idx].predict_proba(self.testX)[:, 1]

        preds = np.array(preds)

        predicted_labels = []
        for data_point in range(len(preds[0])):
            # Combine the threshold probabilities into one score per ordinal level
            ordinal_preds = np.zeros(len(self.ordinal_values))
            for ordinal in range(len(self.ordinal_values) - 1):
                if ordinal == 0:
                    ordinal_preds[ordinal] = 1 - preds[ordinal][data_point]
                else:
                    ordinal_preds[ordinal] = preds[ordinal-1][data_point] - preds[ordinal][data_point]

            ordinal_preds[len(self.ordinal_values)-1] = preds[len(preds)-1][data_point]

            prediction = np.argmax(ordinal_preds)
            predicted_labels.append(prediction)
            counts[prediction+1] += 1
            if prediction == self.ordinal_values.index(self.orig_testY[data_point]):
                cor += 1
            tot += 1

        accuracy = cor / tot
        return (accuracy, predicted_labels, counts)

# Each feature function returns a `feats` dict: the key is the feature name
# and the value is the feature value. See `binary_bow_featurize` for an example.

def binary_bow_featurize(text):
    feats = {}
    words = nltk.word_tokenize(text)

    for word in words:
        # word=word.lower()
        feats[word] = 1

    return feats

# Extra features

def feature2(text):
    # How much the person expressed themselves: log token count and punctuation counts
    feats = {}
    words = nltk.word_tokenize(text)
    feats["length"] = np.log(len(words))
    feats["!"] = 0
    feats["?"] = 0
    feats["."] = 0
    for word in words:
        word = word.lower()
        if word == "!":
            feats["!"] += 1
        if word == "?":
            feats["?"] += 1
        if word == ".":
            feats["."] += 1
    return feats

def feature3(text):
    # How much capitalization the person used
    feats = {}
    words = nltk.word_tokenize(text)
    feats["cap"] = 0
    feats["all_cap"] = 0
    for word in words:
        if word.istitle():
            feats["cap"] += 1
        if word.isupper():
            feats["all_cap"] += 1
    return feats

def feature4(text):
    # Negative words and bigrams feature.
    # Note: `words` holds single tokens, so multi-word entries (e.g. "never signed")
    # never match here, and matching is case-sensitive.
    feats = {}
    words = nltk.word_tokenize(text)
    negative_bigrams = {"lie","lied","never signed","never authorized","didn't authorize","didn't sign","fraud","immediately","expect","complaint","complaints","attorney","lawyer","scam","unethical","unacceptable","frustrated","refuse","harassment","bogus","lawsuit","abuse","i am angry","dishonest","never happened","i demand","jail"}
    feats["negative_bigrams"] = 0
    for bigram in negative_bigrams:
        if bigram in words:
            feats["negative_bigrams"] += 1
    return feats

def feature5(text):
    # Bag of words with counts
    feats = {}
    words = nltk.word_tokenize(text)

    for word in words:
        if word in feats.keys():
            feats[word] += 1
        else:
            feats[word] = 1

    return feats

def exclamation_feature(text):
    words = nltk.word_tokenize(text)
    feats = {}
    feats['!'] = 0
    for word in words:
        if word == '!':
            feats['!'] += 1
    return feats

def len_feature(text):
    # Average number of tokens per sentence-ending punctuation mark
    feats = {}
    words = nltk.word_tokenize(text)
    punc_count = 0
    for word in words:
        if word == '.' or word == '?' or word == '!':
            punc_count += 1
    if punc_count > 0:
        feats['len'] = len(words)/punc_count
    else:
        feats['len'] = len(words)
    return feats

def comma_feature(text):
    feats = {}
    words = nltk.word_tokenize(text)
    comma_count = 0
    for word in words:
        if word == ',':
            comma_count += 1
    if comma_count > 0:
        feats['has_commas'] = 1
    else:
        feats['has_commas'] = 0
    return feats

def combiner_function(text):
    # Merge the feature dicts of the selected feature functions into one dict.
    # Edit the list below to change which features are handed off to the combined model.
    all_feats = {}
    for feature in [binary_bow_featurize, feature4, exclamation_feature]:
        all_feats.update(feature(text))
    return all_feats
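One thing to be aware of in `feature4`: the test `bigram in words` checks whether a whole list entry equals the phrase, so multi-word entries such as "never signed" cannot match a list of single tokens, and matching is case-sensitive. The sketch below shows one possible phrase-aware count. It is an alternative for comparison, not what the notebook ran; `count_phrases` is a hypothetical helper, and it uses a simple regex tokenizer instead of `nltk.word_tokenize` to stay self-contained.

```python
import re


def count_phrases(text, phrases):
    """Count how many of the given phrases occur in text as whole-token sequences."""
    # Lowercase and keep runs of letters/apostrophes as tokens (a simplifying assumption)
    tokens = re.findall(r"[a-z']+", text.lower())
    padded = " " + " ".join(tokens) + " "
    # Padding with spaces prevents partial-word hits such as "lie" inside "believe"
    return sum(1 for phrase in phrases if " " + phrase + " " in padded)


phrases = {"never signed", "fraud", "lie", "i demand"}
text = "I NEVER signed this. I believe this is fraud!"
matches = count_phrases(text, phrases)
print(matches)
```

Swapping this in would change the feature values and therefore the trained model, so any results would need to be re-run.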

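def confidence_intervals(accuracy, n, significance_level):
    # Note: `significance_level` is passed as the confidence level (e.g. 0.95 for a 95% interval)
    critical_value = (1 - significance_level) / 2
    z_alpha = -1 * norm.ppf(critical_value)
    # Standard error of a proportion under the normal approximation
    se = math.sqrt((accuracy * (1 - accuracy)) / n)
    return accuracy - (se * z_alpha), accuracy + (se * z_alpha)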
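For reference, the interval computed above is the normal-approximation (Wald) interval, accuracy ± z * sqrt(accuracy * (1 - accuracy) / n). A minimal standalone sketch using only the standard library follows; `wald_interval` and the example numbers are illustrative assumptions, not results from this notebook.

```python
import math
from statistics import NormalDist


def wald_interval(accuracy, n, confidence=0.95):
    """Normal-approximation confidence interval for an accuracy measured on n items."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


# Made-up example: 70% accuracy on 100 test items
lower, upper = wald_interval(0.70, 100)
print("%.3f %.3f" % (lower, upper))
```

With small test sets the interval is wide, which is worth keeping in mind when comparing feature sets whose accuracies differ by only a few points.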

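def print_confusion(classifier, idx=0):
    # Plot the dev-set confusion matrix for the binary classifier at threshold `idx`
    fig, ax = plt.subplots(figsize=(10,10))
    ConfusionMatrixDisplay.from_estimator(classifier.log_regs[idx], classifier.devX, classifier.devY[idx], ax=ax, xticks_rotation="vertical", values_format="d")
    plt.show()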

def run(trainingFile, devFile, testFile, ordinal_values, feature_func):
    trainX, trainY, orig_trainY = load_ordinal_data(trainingFile, ordinal_values)
    devX, devY, orig_devY = load_ordinal_data(devFile, ordinal_values)
    testX, testY, orig_testY = load_ordinal_data(testFile, ordinal_values)

    simple_classifier = OrdinalClassifier(ordinal_values, feature_func, trainX, trainY, devX, devY, testX, testY, orig_trainY, orig_devY, orig_testY)
    simple_classifier.train()
    accuracy, test_predicted_labels, count = simple_classifier.test()

    # Map predicted indices back to their label strings
    test_predicted_labels = [ordinal_values[x] for x in test_predicted_labels]
    df = pd.DataFrame(columns=['actual','predicted','text'])
    df['actual'] = simple_classifier.orig_testY
    df['predicted'] = test_predicted_labels
    df['text'] = testX

    print("\n")
    print("------------------------------------------------------------------------------")

    lower, upper = confidence_intervals(accuracy, len(testY[0]), .95)
    print("Test accuracy for best dev model: %.3f, 95%% CIs: [%.3f %.3f]\n" % (accuracy, lower, upper))

    # Fix the label order so the matrix keeps one row/column per level,
    # even if a level is missing from the test set
    cm = confusion_matrix(simple_classifier.orig_testY, test_predicted_labels, labels=ordinal_values)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=ordinal_values)
    disp.plot()
    return df

trainingFile = "splits/train.txt"
devFile = "splits/dev.txt"
testFile = "splits/test.txt"

# ordinal values must be in order *as strings* from smallest to largest, e.g.:
# ordinal_values=["G", "PG", "PG-13", "R"]

ordinal_values = ["1", "2", "3", "4"]

our_df = run(trainingFile, devFile, testFile, ordinal_values, combiner_function)  # combiner_function feature4

our_df[our_df['actual']!=our_df['predicted']].head(38)

print(our_df.iloc[96,2])

print(our_df.iloc[17,2])

print(our_df.iloc[37,2])

print(our_df.iloc[80,2])

Analysis

Our model defines classes primarily through the use of exclamation points, capitalization, and negative bigrams. While these categories are sufficient for a relatively accurate classifier, the model still makes some systematic mistakes. For example, rhetorical questions are not caught as signs of aggression, and short sentences are commonly classified as aggressive when they are actually neutral. This could be happening because, in a shorter sentence, the presence of a single one of these features is weighted more heavily and there are fewer chances for non-aggressive features to appear. Labels 1 and 2 are often mistaken for each other. This could be because both are very common and the differences between them are minor, which makes them difficult to tell apart.

In terms of biases, there is a lack of data for certain registers of English such as "intimate" or "frozen" due to the nature of complaints. This makes it difficult to identify potential bias in these registers of the language. While this does not necessarily equate to bias, there is a tendency for casual English to have higher aggression scores and formal English to have lower scores, likely due to the attitude of the speaker that is associated with each respective register. The spread of data labels was also very skewed. Over half (around 60%) of our dataset is labeled as 1, and the next most prevalent labels are 2, 3, then 4 respectively. This could cause our model to place less importance on levels 3 and 4, as the prevalent 1's dominate the training process. This means that our dataset would be a good candidate for changing class weights. By increasing the penalty for misclassifying 3's and 4's, we can work around this imbalance of labels.
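As a sketch of the class-weighting idea, the snippet below computes inverse-frequency weights with the same heuristic that scikit-learn's `class_weight="balanced"` option uses, n_samples / (n_classes * count). The label counts are invented to resemble the skew described above; they are not the dataset's actual counts, and `balanced_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Invented label distribution, skewed toward level 1
labels = ["1"] * 60 + ["2"] * 25 + ["3"] * 10 + ["4"] * 5
weights = balanced_weights(labels)
for label in sorted(weights):
    print(label, round(weights[label], 3))
```

Since this notebook trains one binary classifier per threshold, one way to try the idea would be to pass `class_weight="balanced"` to `LogisticRegression` in `train`, which would reweight the two sides of each threshold rather than the four original levels directly.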

________________________

The source of our data is the Consumer Financial Protection Bureau, which publishes a federal dataset of complaints specifically regarding financial products and services from companies. The data is published after the company responds to the complaint, or after 15 days. This data is listed as “intended for public access and use”. This means that it is neither private nor under copyright. It contains CSV and JSON versions of data and was processed and prepared by the US government. The dataset was initially created in November 2020 and is stated to be updated daily.

In our dataset of consumer complaints, we are primarily considering annotating the “Consumer complaint narrative” category, as it contains the actual text used by consumers when filing a complaint. Rather than sentiment analysis, the goal of our project is to analyze the level of aggression displayed in each of the complaints by detecting elements such as harsh wording, euphemisms, and the overall tone of the message.

Aggression Level Analysis Annotation Guidelines

1. Introduction

The purpose of this annotation is to use identified aspects of aggression and frustration within consumer complaints to gauge the overall level of anger that is expressed. The data used for this task comes from the Consumer Financial Protection Bureau, which publishes a federal dataset of complaints specifically regarding financial products and services from companies. The data is published after the company responds to the complaint, or after 15 days. For each given complaint, the task of the annotator is to identify elements of frustration and/or aggression within the sentences and use this information to decide upon a designated level of anger.

1a. Notable Definitions

Euphemism: Mild or indirect words or expressions substituted for one considered too harsh or blunt when referring to something unpleasant or embarrassing. For example, the sentence “He was let go from his job” uses euphemism by using “let go” rather than “fired” to describe the end of the person’s employment.

Frustration: The feeling of being upset or annoyed, especially because of inability to change or achieve something. From a psychological perspective, frustration is often correlated with anger and disappointment, making it an important feeling to detect when identifying levels of anger.

2. Levels of Anger

- 1 (no anger expressed): the user is filing a complaint or request in a respectful manner that has minimal signs of expressed frustration. Although the user is complaining, they are generally only listing facts that could help to solve their case and are opting to use euphemisms rather than being too harsh. Common factors to notice for level 1 are:
  o Mentioning failure to take action, or being unsure about a process
  o Primarily providing facts
  o Making an inquiry
  o Expressing concern without being aggressive or threatening

- 2 (slightly angry): the user is communicating their problem, but there are indications of frustration through diction, tone, and/or punctuation that imply a level of anger that is slightly above neutral, but not too extreme. Rather than sticking to providing raw facts, the user expresses some of their feelings that come up when dealing with the problem they are trying to solve. Common factors to notice for level 2 are:
  o Expressing how the situation being complained about is having a negative impact on their life
  o Mild frustration, whether directly stated or detected through the tone of the complaint
  o Mentions of the company refusing to take action or do anything to help the situation
  o Not too much aggression, but rather more focus on frustration

- 3 (angry): Rather than using a more reserved level of expressing frustration, the user is more blatant with their aggression and/or takes actions that indicate anger. The user may point out the fault of the company or lean more heavily on opinionated statements such as how they feel about the situation or company in question. Making accusations towards the company or claiming that something is illegal is also a good indicator. Common factors to notice for level 3 are:
  o Threatening to take action against the company
  o Stating that people were rude, or similar negative experiences
  o Making accusations against the company

- 4 (very angry): The highest level of anger, with the user very blatantly showing aggression in their complaint. There may be offensive or harsh wording, threats, or demands. The user expresses their thoughts and opinions about the company with the use of negative extremities (“worst”, “slowest”, etc.). Common factors to notice for level 4 are:
  o Clear expression of anger, frustration, and aggression
  o Harsh wording
  o Negative extremities and use of exclamation points to highlight negative points
  o Threats or demands made against the company
