Let's Explore Data Denoising!
So imagine that Dr. Lavigne just casts a spell that completely disrupts the Great Lakes... ugh... oh wait, wrong story.
No, but I mean, imagine you're working on a project where you have to analyze soil properties and optimize crop yields. You've collected data on three key factors:
1. Moisture Content (Feature 1)
2. Nutrient Levels (Feature 2)
3. pH Value (Feature 3)
Let's generate a dataset with 100 samples from different regions! These variables may correlate strongly (e.g. nutrient levels could affect pH, soil moisture might influence nutrient absorption). To simplify the analysis and identify the primary drivers of crop yield, let's perform Principal Component Analysis (PCA).
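Here is a minimal sketch of what that setup might look like; the feature construction, coefficients, random seed, and variable names are assumptions, so your exact percentages will differ:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # assumed seed, just for reproducibility

n_samples = 100
moisture  = rng.normal(30, 5, n_samples)                  # Feature 1: moisture content (%)
nutrients = 0.8 * moisture + rng.normal(0, 2, n_samples)  # Feature 2: nutrient levels, driven partly by moisture
ph        = 6 + 0.03 * moisture + 0.02 * nutrients + rng.normal(0, 0.1, n_samples)  # Feature 3: pH, nearly a combination of the first two

A = np.column_stack([moisture, nutrients, ph])  # 100 x 3 data matrix

pca = PCA(n_components=3)
scores = pca.fit_transform(A)  # samples projected onto the principal components
print("Explained variance ratios:", pca.explained_variance_ratio_)
```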
What does this mean?
The explained variance for each principal component is:
+ Principal Component 1: 89.51% (dominantly explains the dataset variance)
+ Principal Component 2: 10.39%
+ Principal Component 3: 0.11%
So, this means that most of the variability in soil properties is captured by PC1, with PC2 adding a smaller but still meaningful contribution. PC3 contributes minimal unique information.
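One way those ellipse overlays might be drawn, assuming the `scores` array from the sketch above; sizing each ellipse at two standard deviations per axis is an assumption about how the spread was visualized:

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (i, j) in zip(axes, [(0, 1), (1, 2)]):
    x, y = scores[:, i], scores[:, j]
    ax.scatter(x, y, alpha=0.6)
    # PC scores are uncorrelated, so an axis-aligned ellipse with 2-sigma semi-axes captures the spread
    ax.add_patch(Ellipse((x.mean(), y.mean()), width=4 * x.std(), height=4 * y.std(),
                         edgecolor="red", facecolor="none"))
    ax.set_xlabel(f"PC{i + 1}")
    ax.set_ylabel(f"PC{j + 1}")
    ax.set_title(f"PC{i + 1} vs PC{j + 1}")
plt.tight_layout()
plt.show()
```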
I've added variance ellipses to the scatterplots to visualize the numerical spread and make these differences clearer.
1. PC1 vs PC2 Plot: The red ellipse shows the spread of the data, reflecting the large variance captured by these two components. Since PC1 explains the most variance (89.5%), the ellipse is stretched more along the PC1 axis.
2. PC2 vs PC3 Plot: The red ellipse is much smaller, indicating less variability in this pair. PC3 contributes only 0.1% of the variance, so its spread is minimal compared to PC2.
Ok chat, so what happens if we put it all together?
We want to talk about Data Denoising, which involves reconstructing the dataset by retaining only the most important components (Principal Components, orthogonal directions in the feature space, each of which corresponds to a singular value). This process filters out noise by eliminating the smaller, less significant components that contribute minimally to the data's structure.
Here, we keep only the largest singular values (these correspond to the explained variance ratios listed above: each ratio is proportional to the square of a singular value) and their corresponding components to represent the data's main structure. Let's use the reduced SVD to reconstruct the dataset and see how close our reconstruction is to the original!
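A sketch of that reconstruction, reusing the data matrix `A` from above; centering before the SVD (and adding the mean back afterwards) is an assumption made to stay consistent with PCA:

```python
# Center the data, take the SVD, and rebuild the matrix from the two largest singular values
mean = A.mean(axis=0)
U, S, Vt = np.linalg.svd(A - mean, full_matrices=False)

k = 2  # number of singular values/components to keep
A_denoised = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :] + mean

# For 2-D arrays, np.linalg.norm defaults to the Frobenius norm
frobenius_error = np.linalg.norm(A - A_denoised)
print("Singular values:", S)
print(f"Reconstruction error (Frobenius norm): {frobenius_error:.2f}")
```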
Now we've reconstructed the dataset using a reduced SVD that keeps only the two largest singular values and their components. This effectively preserves the primary structure of the data while suppressing the less significant variations, which are often associated with noise.
The reconstruction error (measured using the Frobenius norm) is approximately 4.36, which reflects the difference between the original and reconstructed datasets; think of this as the amount of information lost by disregarding the smaller singular values. In fact, for a truncated SVD this error is exactly the square root of the sum of the squares of the discarded singular values.
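The comparison and error plots might be produced roughly like this (the layout and labels are assumptions), reusing `A` and `A_denoised` from above:

```python
feature_names = ["Moisture Content", "Nutrient Levels", "pH Value"]

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for col, name in enumerate(feature_names):
    # Top row: original vs. reconstructed values for each feature
    axes[0, col].plot(A[:, col], label="Original", alpha=0.7)
    axes[0, col].plot(A_denoised[:, col], "--", label="Reconstructed", alpha=0.7)
    axes[0, col].set_title(name)
    axes[0, col].legend()
    # Bottom row: per-sample reconstruction error for that feature
    axes[1, col].plot(np.abs(A[:, col] - A_denoised[:, col]), color="red")
    axes[1, col].set_title(f"{name} error")
    axes[1, col].set_xlabel("Sample")
plt.tight_layout()
plt.show()
```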
Notice how each feature plot compares the original dataset with the reconstructed version. The reconstructed data closely follows the original for features dominated by the first two principal components, showing effective preservation of the main structure. This works because the third feature (the third column of A) is very close to a linear combination of the first two columns of A, so it carries little information that the first two components don't already capture.
The error plots show the magnitude of reconstruction differences across samples for each feature. The errors are relatively small, highlighting the effectiveness of the reduced SVD in denoising while retaining meaningful variability.
Here, the "noise" refers to the smaller singular values and their associated components that contribute minimally to the overall variability in the dataset. For instance, the pH value's variability is weakly aligned with the main structure (dominated by soil moisture and nutrient levels) and is therefore treated as secondary or noise. Reduced SVD filters out this noise to focus on the most meaningful relationships in the data.
What if we did not use the most significant singular values?
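Reusing `U`, `S`, `Vt`, and `mean` from the SVD above, a sketch of reconstructing with only the least significant singular value might look like this:

```python
# Rebuild the data from only the SMALLEST singular value, i.e. the direction we called "noise"
A_noise_only = U[:, 2:] @ np.diag(S[2:]) @ Vt[2:, :] + mean

error_noise_only = np.linalg.norm(A - A_noise_only)
print(f"Reconstruction error using only the smallest singular value: {error_noise_only:.2f}")
# The error is far larger than before, since almost all of the data's structure
# lives in the components associated with the two largest singular values.
```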
Do you see the importance of using the most significant singular values (those corresponding to PC1 and PC2) in the reduced SVD?
THINK: What would've happened if we had used singular values close to zero? Remember that the third column of A was a combination of the first two columns of A. Did that third column really add any "new information" that would correspond to a significant principal component?