Let's Explore Data Denoising!
So imagine that Dr. Lavigne just casts a spell that completely disrupts the Great Lakes... ugh... oh wait, wrong story.
No, but I mean, imagine you're working on a project where you have to analyze soil properties and optimize crop yields. You've collected data on three key factors:
1. Moisture Content (Feature 1)
2. Nutrient Levels (Feature 2)
3. pH Value (Feature 3)
Let's generate a dataset with 100 samples from different regions! These variables may correlate strongly (e.g. nutrient levels could affect pH, soil moisture might influence nutrient absorption). To simplify the analysis and identify the primary drivers of crop yield, let's perform Principal Component Analysis (PCA).
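Here is a minimal sketch of what that setup might look like; the feature construction, coefficients, random seed, and variable names are assumptions, so your exact percentages will differ:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # assumed seed, just for reproducibility

n_samples = 100
moisture  = rng.normal(30, 5, n_samples)                  # Feature 1: moisture content (%)
nutrients = 0.8 * moisture + rng.normal(0, 2, n_samples)  # Feature 2: nutrient levels, driven partly by moisture
ph        = 6 + 0.03 * moisture + 0.02 * nutrients + rng.normal(0, 0.1, n_samples)  # Feature 3: pH, nearly a combination of the first two

A = np.column_stack([moisture, nutrients, ph])  # 100 x 3 data matrix

pca = PCA(n_components=3)
scores = pca.fit_transform(A)  # samples projected onto the principal components
print("Explained variance ratios:", pca.explained_variance_ratio_)
```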
What does this mean?
The explained variance for each principal component is:
+ Principal Component 1: 89.51% (dominantly explains the dataset variance)
+ Principal Component 2: 10.39%
+ Principal Component 3: 0.11%
So, this means that most of the variability in soil properties is captured by PC1, with PC2 adding a smaller but still meaningful contribution. PC3 contributes minimal unique information.
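One way those ellipse overlays might be drawn, assuming the `scores` array from the sketch above; sizing each ellipse at two standard deviations per axis is an assumption about how the spread was visualized:

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (i, j) in zip(axes, [(0, 1), (1, 2)]):
    x, y = scores[:, i], scores[:, j]
    ax.scatter(x, y, alpha=0.6)
    # PC scores are uncorrelated, so an axis-aligned ellipse with 2-sigma semi-axes captures the spread
    ax.add_patch(Ellipse((x.mean(), y.mean()), width=4 * x.std(), height=4 * y.std(),
                         edgecolor="red", facecolor="none"))
    ax.set_xlabel(f"PC{i + 1}")
    ax.set_ylabel(f"PC{j + 1}")
    ax.set_title(f"PC{i + 1} vs PC{j + 1}")
plt.tight_layout()
plt.show()
```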
I've added variance ellipses to the scatterplots to visualize the numerical spread and make these differences clearer.
1. PC1 vs PC2 Plot: The red ellipse shows the spread of the data, reflecting the large variance captured by these two components. Since PC1 explains the most variance (89.5%), the ellipse is stretched more along the PC1 axis.
2. PC2 vs PC3 Plot: The red ellipse is much smaller, indicating less variability in this pair. PC3 contributes only 0.1% of the variance, so its spread is minimal compared to PC2.
Ok chat, so what happens if we put it all together?
We want to talk about Data Denoising, which involves reconstructing the dataset by retaining only the most important components (Principal Components, orthogonal directions in the feature space, each of which corresponds to a singular value). This process filters out noise by eliminating the smaller, less significant components that contribute minimally to the data's structure.
Here, we keep only the largest singular values (these correspond to the explained variance ratios listed above: each ratio is proportional to the square of a singular value) and their corresponding components to represent the data's main structure. Let's use the reduced SVD to reconstruct the dataset and see how close our reconstruction is to the original!
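A sketch of that reconstruction, reusing the data matrix `A` from above; centering before the SVD (and adding the mean back afterwards) is an assumption made to stay consistent with PCA:

```python
# Center the data, take the SVD, and rebuild the matrix from the two largest singular values
mean = A.mean(axis=0)
U, S, Vt = np.linalg.svd(A - mean, full_matrices=False)

k = 2  # number of singular values/components to keep
A_denoised = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :] + mean

# For 2-D arrays, np.linalg.norm defaults to the Frobenius norm
frobenius_error = np.linalg.norm(A - A_denoised)
print("Singular values:", S)
print(f"Reconstruction error (Frobenius norm): {frobenius_error:.2f}")
```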
Now we've reconstructed the dataset using a reduced SVD that keeps only the two largest singular values and their components. This effectively preserves the primary structure of the data while suppressing the less significant variations, which are often associated with noise.
The reconstruction error (measured using the Frobenius norm) is approximately 4.36, which reflects the difference between the original and reconstructed datasets; think of this as the amount of information lost by disregarding the smaller singular values. In fact, for a truncated SVD this error is exactly the square root of the sum of the squares of the discarded singular values.
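The comparison and error plots might be produced roughly like this (the layout and labels are assumptions), reusing `A` and `A_denoised` from above:

```python
feature_names = ["Moisture Content", "Nutrient Levels", "pH Value"]

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for col, name in enumerate(feature_names):
    # Top row: original vs. reconstructed values for each feature
    axes[0, col].plot(A[:, col], label="Original", alpha=0.7)
    axes[0, col].plot(A_denoised[:, col], "--", label="Reconstructed", alpha=0.7)
    axes[0, col].set_title(name)
    axes[0, col].legend()
    # Bottom row: per-sample reconstruction error for that feature
    axes[1, col].plot(np.abs(A[:, col] - A_denoised[:, col]), color="red")
    axes[1, col].set_title(f"{name} error")
    axes[1, col].set_xlabel("Sample")
plt.tight_layout()
plt.show()
```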
Notice how each feature plot compares the original dataset with the reconstructed version. The reconstructed data closely follows the original for features dominated by the first two principal components, showing effective preservation of the main structure. This works because the third feature (the third column of A) is very close to a linear combination of the first two columns of A, so it carries little information that the first two components don't already capture.
The error plots show the magnitude of reconstruction differences across samples for each feature. The errors are relatively small, highlighting the effectiveness of the reduced SVD in denoising while retaining meaningful variability.
Here, the "noise" refers to the smaller singular values and their associated components that contribute minimally to the overall variability in the dataset. For instance, the pH value's variability is weakly aligned with the main structure (dominated by soil moisture and nutrient levels) and is therefore treated as secondary or noise. Reduced SVD filters out this noise to focus on the most meaningful relationships in the data.
What if we did not use the most significant singular values?
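Reusing `U`, `S`, `Vt`, and `mean` from the SVD above, a sketch of reconstructing with only the least significant singular value might look like this:

```python
# Rebuild the data from only the SMALLEST singular value, i.e. the direction we called "noise"
A_noise_only = U[:, 2:] @ np.diag(S[2:]) @ Vt[2:, :] + mean

error_noise_only = np.linalg.norm(A - A_noise_only)
print(f"Reconstruction error using only the smallest singular value: {error_noise_only:.2f}")
# The error is far larger than before, since almost all of the data's structure
# lives in the components associated with the two largest singular values.
```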
Do you see the importance of using the most significant singular values (those corresponding to PC1 and PC2) in the reduced SVD?
THINK: What would've happened if we had used singular values close to zero? Remember that the third column of A was a combination of the first two columns of A. Did that third column really add any "new information" that would correspond to a significant principal component?