Prediction Module
1_Data Processing
Data collection and preprocessing consist of the following steps: data collection, data visualization, data cleaning, featurization, building the feature and target sets, and splitting the training and test sets. Data cleaning handles the various kinds of dirty data in an appropriate way so as to produce standard, clean and consistent data that can then be used for statistics, data mining and modelling.
1.1_Data Exploration
Data collection is the process of gathering and measuring different types of information using established techniques. The first step in the data collection roadmap is to define the problem to be solved. Depending on the objectives of the project, the next step is to determine which data will be most useful and, from that, where the data can be collected. It is also important to define the time frame covered by the data, and choosing an appropriate form of data storage simplifies subsequent processing. Finally, privacy and security issues should be considered when collecting the data.
Visualizing the data helps to reveal possible relationships between features and labels, as well as the presence of dirty data and outliers, and thus informs the choice of ML model.
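As a minimal sketch of this exploration step (assuming the data sits in a hypothetical accidents.csv and that the label column is named Accident_Severity), the class balance of the label and the distributions of the numeric features could be inspected as follows:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; adjust to the actual accident dataset.
df = pd.read_csv("accidents.csv")

# Class balance of the label: a bar chart quickly reveals skewed severity classes.
df["Accident_Severity"].value_counts().plot(kind="bar")
plt.title("Accident_Severity distribution")
plt.show()

# Histograms of the numeric features expose outliers and obviously dirty values.
df.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()
```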
1.2_Data Cleaning
The main idea behind data cleaning is to 'clean' the data by filling in missing values, smoothing noisy data, identifying or removing outliers and resolving data inconsistencies. The main strategies for handling missing values are variable deletion, constant-value filling, statistical filling, interpolation filling and model-based filling. Data used for analysis may contain hundreds of attributes, many of which are irrelevant or redundant to the mining task; dimensionality reduction removes such attributes, reducing the amount of data while losing as little information as possible.
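A minimal sketch of a few of these cleaning strategies in pandas is given below; the column names are hypothetical stand-ins for columns of the accident dataset, and the choice of strategy per column would depend on how much data is missing:

```python
import pandas as pd

df = pd.read_csv("accidents.csv")

# Variable deletion: drop a column that is almost entirely missing (hypothetical column).
df = df.drop(columns=["Junction_Control"])

# Constant-value filling: replace missing categories with an explicit placeholder.
df["Weather_Conditions"] = df["Weather_Conditions"].fillna("Unknown")

# Statistical filling: impute a numeric feature with its median.
df["Speed_limit"] = df["Speed_limit"].fillna(df["Speed_limit"].median())

# Interpolation filling: useful for ordered (e.g. time-indexed) numeric data.
df["Number_of_Vehicles"] = df["Number_of_Vehicles"].interpolate()

# Duplicate removal keeps the cleaned data consistent.
df = df.drop_duplicates()
```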
1.3_Feature Selection
During data processing, an attempt was made to reduce the dimensionality of the data using Principal Component Analysis (PCA), but the results were unsatisfactory because the reduced representation did not suit the subsequent prediction step, so PCA was abandoned. Since PCA was not applicable, the next step was to find the features correlated with Accident_Severity by computing the correlation coefficient matrix.
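For reference, the PCA step that was ultimately abandoned would look roughly as follows with scikit-learn; X is assumed to be the numeric feature matrix obtained after cleaning, and the 95% variance threshold is an illustrative choice:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features so that no single scale dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```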
The correlation coefficient matrix is output as a table, with the coefficients sorted in descending order of their correlation with Accident_Severity. A coefficient greater than zero indicates that the feature is positively correlated with Accident_Severity. Therefore, all features whose correlation coefficient with Accident_Severity is greater than zero are retained.
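A minimal sketch of this selection rule, assuming the features have already been numerically encoded in the cleaned DataFrame df, could be:

```python
# Correlation of every numeric feature with the label, sorted in descending order.
corr = df.corr(numeric_only=True)["Accident_Severity"].sort_values(ascending=False)
print(corr)

# Keep only the features whose correlation with Accident_Severity is positive.
selected_features = corr[corr > 0].index.drop("Accident_Severity").tolist()
```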
1.4_Featurization
Feature engineering is the process of transforming raw data into features so that these features can be fed into predictive models to improve the accuracy of predictions on unseen data. The aim is to extract and process as much useful information from the raw data as possible. Common methods include timestamp processing, decomposition of categorical attributes, partitioning, cross-features, feature selection, feature scaling and feature extraction. Supervised learning requires the construction of feature sets and label sets. Features are the individual variables fed into the ML model, while labels are what is to be predicted, judged or classified. The final step is splitting the training and testing sets. After the original dataset has been split vertically, along the column dimension, into the feature set and the label set, it must be further split horizontally, along the row dimension. ML does not end with fitting a model on the training set; the test set is used to check whether the model also works on new data.
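A minimal sketch of these two splits with scikit-learn, continuing from the hypothetical selected_features list of the previous step (the 80/20 split ratio is an illustrative choice), is:

```python
from sklearn.model_selection import train_test_split

# Vertical split: the label column becomes y, the selected features become X.
X = df[selected_features]
y = df["Accident_Severity"]

# Horizontal split: hold out 20% of the rows to test the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```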