Example of unsupervised learning: Vibrations in slender marine structures
This is a jupyter notebook. You can run the code in the cells (grey boxes) with "shift + enter" (or the run button above). Think of it as a script with blocks that can be run independently.
The notebook is running on a remote server. This means that you won't have to setup anything. When you are finished, you can save it to your own laptop by clicking on "My projects" in the black top bar and choose download. However, in order to run locally you will also need the datafiles, python 3.6 or higher and all the necessary packages (easily installed with anaconda).
The intention with this notebook is for you to get familiar with some concepts involved in a "realistic" supervised learning scenario. You do (hopefully) not need to understand all the code in order to get results, modify parameters, get new results etc.
Instructions for the exercises are given in orange text. In some places you will be encouraged to post your findings on Teams. This is so the instructor can follow how far you are, and possibly generate discussion both on Teams and for the discussion session in the afternoon.
Load the relevant python packages
If the output from the above cell is not "3.6.something", you need to chose "Kernel" in the menu bar and "Change kernel" to "Python 3.6".
The modelling challenge
Slender marine structures include drilling risers for oil platforms and mooring lines for fish farms and marine wind turbines.
Due to ocean currents, they "vibrate" in patterns like these The vibrations lead to material fatigue and eventually, cracks will appear.
In this example, the challenge is to investigate if we can use data from sensors mounted on a slender marine structure to categorize its movement as leading to high or low fatigue accumulation.
We will use data from a series of experiments done at the MARINTEK (now part of SINTEF Ocean) towing tank using a rotating rig. A piece of pipe (red) was mounted on a rotating rig (blue) in one of three different configurations, and the entire structure was rotated in the water with different velocities, leading to different current conditions around the riser. (The results of this study were published in https://link.springer.com/chapter/10.1007%2F978-3-030-35664-4_8)
In the first configuration, the current is purely perpendicular to the pipe. We will call this the x/y-plane. In the second configuration, there will also be a mild current along the pipe, and in the third configuration, there will be a strong current along the pipe.
Ten sensors measuring the acceleration were attached along the pipe. The acceleration allows us to compute the displacement. Here is an example of what the data looks like in one direction as a function of time
The full-length timeseries from the sensors can be difficult to compare. Instead, we have split the timeseries into small intervals (red vertical lines) and we will compare the statistics of each interval (assuming independence).
We will investigate a subset of data from sensors moving with approximately the same velocity, and look for differences in the vibrations due to difference in currents. In other words: Can we, based on the acceleration of the riser, classify the expected pattern of movement?
Load the data
In a real-life scenario, you would start by preprocessing your data. Often, it is a timely and hideous process. We have already done that for you to save you some time. The code below loads the preprocessed data frame (a table with callable names) with one timeseries-interval per row, and the following columns:
sigma_Acc_x, sigma_Acc_y: Standard deviation of acceleration in the x, y directions in ms-2
skew_Acc_x, skew_Acc_y: Skewness of acceleration in the x, y direction, no units. Skewness is the third standardised central moment for the probability distribution. It describes the symmetry of the distribution.
kur_Acc_x, kur_Acc_y: Kurtosis of acceleration in the x, y direction, no units. Kurtosis is the fourth standardised central moment of the probability distribution. It is a measure of the combined weights of the tails relative to the rest of the distribution. It assumes a value of 1.5 for a systematic sinusoidal oscillation, and a value of 3.0 for a Gaussian process typical for multi-frequency random vibration.
freq_Pos_x, freq_Pos_y: Frequency of oscillations in x, y direction in units of s−1
sigma_Pos_x, sigma_Pos_y: Standard deviation of total displacement divided by riser diameter
skew_Pos_x, skew_Pos_y: Skewness of displacement in the x, y direction, no units
kur_Pos_x, kur_Pos_y: Kurtosis of displacement in the x, y direction, no units
sensor_velocity_x, sensor_velocity_y: Magnitude of sensor velocity in the x, y direction in ms−1
sensor_velocity_tot: Total sensor velocity (equal to current) in ms−1
In the animation below, each of the scatter plot distributions represents (almost) exactly the same statistical properties.
Consequently, it is important to visualise the data to get a feeling for what they contain. It also allows to check for outliers and anomalies in the data. The python package
seaborn allows for easy visualisation. Generating scatter plots with all points may take a while so be patient.
sns.pairplot() command shows the relationships between variables, and the histograms along the diagonal. It can be overwhelming to look at all variables at the same time.
Investigate the variables you think look most interesting by removing less relevant variables from
plot_variables and plot again.
Feature selection and engineering
Given that the sensors measure acceleration and that all the other parameters are derived from the acceleration (apart from the sensor velocity, which is externally driven), we expect to see a high level of correlation among the parameters. However, some parameters are more closely related than others. Skewness and kurtosis are correlated for acceleration and displacement respectively, but not cross-correlated. So we can probably drop either skewness or kurtosis. Inspecting the diagonal in the pairplot, we see that kurtosis has some outliers, so we keep the skewness rather than kurtosis.
Not all features have the same scale. Look at the y-axes on the plots: Some have values of the order of 1000s, and some are 0.1. In order to let them equally influence the model, we need to "put everything on the same scale". We can either scale each feature to a fixed range of values (
MinMaxScaler) or substract the mean from each feature and change its distribution to normalised Gaussian (
StandardScaler, mean = 0, variance = 1).
Different methods yield different clustering results. The choice of method will depend on your data and application of the clustering.
K-means as explained in the video is conceptually easy to understand, but it has some disadvantages. It assumes spherical shapes of clusters (with radius equal to the distance between the centroid and the furthest data point) and doesn’t work well when clusters are in different shapes such as elliptical clusters. If you perform k-means (k=2) clustering on two horizontal lines (that are close enough) you will end up with 2 clusters, each containing half of a line.
In k-means, the number of expected clusters needs to be specified before running the algorithm. As such, k-means will always return a specified number of clusters even on data with uniform distribution, where you might want a single cluster.
If there is an overlap between clusters, k-means does not have any intrinsic measure for the probability of each point belonging to one or the other cluster in the overlap.
HDBSCAN is a density-based clustering method. It is designed to handle elongated clusters and as opposed to other clustering algorithms, it does not necessarily assign all data points to clusters, but can declare some of them as noise. It is very fast and efficient. https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
Clustering with HDBSCAN
In order to setup HDBSCAN, we will focus on two hyper parameters
Cluster size is the minimum number of cluster members. This can be a bit arbitrary, and it is recommended to try a couple of different values,
Minimum samples determines the conservativeness of your clustering. The higher the value, the more points will be considered noise.
The initial set of hyper parameters has been manually optimised to minimise the number of non-clustered events and obtain a final number of less than 10 clusters.
The plot shows the weighted dendrogram that is used to select the clusters, based on their area or on the leaves.
Cluster number -1 refers to the points marked as outliers. As we see, the fraction of outliers is high with around 20% of the data.
Try changing the hyperparameters and see how the clustering changes.
To visualise the clusters and outliers we can again use a scatter plot from seaborn. Each cluster is represented by a color, and the diagonal shows the distributions (kernel density) within each cluster. The plotting may take a while so be patient.
Plotting samples from HDBSCAN clusters
We wish to see each cluster reflect a different pattern of movement. To verify whether we succeed, we use images of the displacement in the x/y-plane corresponding to the time intervals on which we performed clustering. Note that in the clustering process, those intervals were represented solely by statistical measures. The images that we use for visualisation contain more information than was given to the algorithm during training.
The plots in the first row correspond to time intervals which were marked as outliers. The remaining rows represent one cluster each. It is clear that members of the same cluster display visually similar movement patterns. The color indicates the configuration of the pipe in the rotation rig:
- Green: the first configuration with only x/y-current,
- Yellow: the configuration with a mild current along the pipe,
- Red: the configuration with most current along the pipe.
Do you see any relationships between cluster members. Post your thoughts on Teams.
If you have any general questions about unsupervised learning or anything you would like to be discussed, then please comment on Teams.
Go back to the SINTEFSchool portal and continue with the theoretical part.