Prediction Module
1_Data Processing
Data collection and preprocessing consist of the following steps: data collection, data visualization, data cleaning, featurization, building the feature and target sets, and splitting the training and test sets. Data cleaning handles the various kinds of dirty data in an appropriate way so as to produce standard, clean and consistent data that can then be used for statistics, data mining and modelling.
1.1_Data Exploration
Data collection is the process of gathering and measuring different types of information using established techniques. The first step in the data collection roadmap is to define the problem to be solved. Depending on the objectives of the project, the next step is to determine which data will be most useful and, from that, where the data can be collected. It is also important to define the time frame covered by the data, and choosing an appropriate form of data storage simplifies subsequent processing. Finally, privacy and security issues should be considered when collecting the data.
Visualizing the data helps to reveal possible relationships between features and labels, as well as the presence of dirty data and outliers, and thus informs the choice of ML model.
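As a minimal sketch of this exploration step (assuming the data sits in a hypothetical accidents.csv and that the label column is named Accident_Severity), the class balance of the label and the distributions of the numeric features could be inspected as follows:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; adjust to the actual accident dataset.
df = pd.read_csv("accidents.csv")

# Class balance of the label: a bar chart quickly reveals skewed severity classes.
df["Accident_Severity"].value_counts().plot(kind="bar")
plt.title("Accident_Severity distribution")
plt.show()

# Histograms of the numeric features expose outliers and obviously dirty values.
df.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()
```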
1.2_Data Cleaning
The main idea behind data cleaning is to 'clean' the data by filling in missing values, smoothing noisy data, identifying or removing outliers and resolving data inconsistencies. The main strategies for handling missing values are variable deletion, constant-value filling, statistical filling, interpolation filling and model-based filling. Data used for analysis may contain hundreds of attributes, many of which are irrelevant or redundant to the mining task; dimensionality reduction removes such attributes, reducing the amount of data while losing as little information as possible.
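A minimal sketch of a few of these cleaning strategies in pandas is given below; the column names are hypothetical stand-ins for columns of the accident dataset, and the choice of strategy per column would depend on how much data is missing:

```python
import pandas as pd

df = pd.read_csv("accidents.csv")

# Variable deletion: drop a column that is almost entirely missing (hypothetical column).
df = df.drop(columns=["Junction_Control"])

# Constant-value filling: replace missing categories with an explicit placeholder.
df["Weather_Conditions"] = df["Weather_Conditions"].fillna("Unknown")

# Statistical filling: impute a numeric feature with its median.
df["Speed_limit"] = df["Speed_limit"].fillna(df["Speed_limit"].median())

# Interpolation filling: useful for ordered (e.g. time-indexed) numeric data.
df["Number_of_Vehicles"] = df["Number_of_Vehicles"].interpolate()

# Duplicate removal keeps the cleaned data consistent.
df = df.drop_duplicates()
```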
1.3_Feature Selection
During data processing, an attempt was made to reduce the dimensionality of the data using Principal Component Analysis (PCA), but the results were unsatisfactory because the reduced representation did not suit the subsequent prediction step, so PCA was abandoned. Since PCA was not applicable, the next step was to find the features correlated with Accident_Severity by computing the correlation coefficient matrix.
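For reference, the PCA step that was ultimately abandoned would look roughly as follows with scikit-learn; X is assumed to be the numeric feature matrix obtained after cleaning, and the 95% variance threshold is an illustrative choice:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features so that no single scale dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```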
The correlation coefficient matrix is output as a table, with the coefficients sorted in descending order of their correlation with Accident_Severity. A coefficient greater than zero indicates that the feature is positively correlated with Accident_Severity. Therefore, all features whose correlation coefficient with Accident_Severity is greater than zero are retained.
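A minimal sketch of this selection rule, assuming the features have already been numerically encoded in the cleaned DataFrame df, could be:

```python
# Correlation of every numeric feature with the label, sorted in descending order.
corr = df.corr(numeric_only=True)["Accident_Severity"].sort_values(ascending=False)
print(corr)

# Keep only the features whose correlation with Accident_Severity is positive.
selected_features = corr[corr > 0].index.drop("Accident_Severity").tolist()
```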
1.4_Featurization
Feature engineering is the process of transforming raw data into features so that these features can be fed into predictive models to improve the accuracy of predictions on unseen data. The aim is to extract and process as much useful information from the raw data as possible. Common methods include timestamp processing, decomposition of categorical attributes, partitioning, cross-features, feature selection, feature scaling and feature extraction. Supervised learning requires the construction of feature sets and label sets. Features are the individual variables fed into the ML model, while labels are what is to be predicted, judged or classified. The final step is splitting the training and testing sets. After the original dataset has been split vertically, along the column dimension, into the feature set and the label set, it must be further split horizontally, along the row dimension. ML does not end with fitting a model on the training set; the test set is used to check whether the model also works on new data.
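A minimal sketch of these two splits with scikit-learn, continuing from the hypothetical selected_features list of the previous step (the 80/20 split ratio is an illustrative choice), is:

```python
from sklearn.model_selection import train_test_split

# Vertical split: the label column becomes y, the selected features become X.
X = df[selected_features]
y = df["Accident_Severity"]

# Horizontal split: hold out 20% of the rows to test the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```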