STAT419 Final Project | Jenny Cheng (11678647)

1. Introduction

Air pollution is a significant concern globally, and its effects on human health and the environment are well documented. The study of air pollution is complex, as multiple variables can contribute to its effects. This report analyzes a data set consisting of 30 measurements on a vector of seven air-pollution variables taken at noon on 30 days. The variables include wind, solar radiation, CO, NO, NO2, O3, and HC. The purpose of this report is to conduct a data analysis of the given data set.

In the first section of the report, we calculate the sample mean vector, sample covariance matrix, and the sample Pearson's correlation matrix. These calculations provide an overview of the data set and help us identify potential relationships between variables. The second section of the report focuses on principal component analysis (PCA). We conduct the first PCA of the data using the covariance matrix S. We determine the largest eigenvalue and corresponding eigenvector and analyze how much this eigenvalue contributes to the total sample variance. We present the first principal component and discuss its interpretation.

In the third section, we interpret the first principal component and prepare a table of correlation coefficients between the first principal component and the original variables. We also examine whether the data can be summarized in one dimension, which is referred to as a univariate situation.

In summary, this report analyzes a data set consisting of 30 measurements on a vector of seven air-pollution variables. We conduct a principal component analysis to gain insight into the underlying structure of the data and to identify potential relationships between variables. Our analysis shows that the first principal component accounts for a large proportion of the total sample variance, and we provide an interpretation of its meaning. This report provides a starting point for further research into air pollution and its effects on human health and the environment.

2. Data Summary

The given data represent the wind speed, solar radiation, and levels of pollutants such as CO, NO, NO2, O3, and HC. The sample mean vector gives the average value of each variable over the 30 days. The mean wind speed is 7.77 meters per second, and the mean solar radiation is 76.6 Watts per square meter. The mean levels of CO, NO, NO2, O3, and HC are 4.43, 2.27, 10.47, 8.43, and 3.3 parts per million, respectively.

The sample covariance matrix provides information about the variability and interdependence of the observed variables. The diagonal elements of the covariance matrix represent the variance of each variable, while the off-diagonal elements represent the covariance between pairs of variables. A positive covariance between two variables indicates that they tend to move in the same direction, while a negative covariance indicates that they tend to move in opposite directions. The magnitude of a covariance depends on the units of the two variables, so it does not by itself measure the strength of the relationship; the correlation coefficient is used for that purpose.
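As a minimal sketch of these definitions, the snippet below computes a sample mean vector and a sample covariance matrix by hand and compares the result with `np.cov`. The array `X` is hypothetical toy data made up for illustration; it is not the report's data set.

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows) of 3 variables (columns).
# These values are made up for illustration only.
X = np.array([
    [8.0, 98.0, 7.0],
    [7.0, 107.0, 4.0],
    [7.0, 103.0, 4.0],
    [10.0, 88.0, 5.0],
    [6.0, 91.0, 4.0],
])

n = X.shape[0]

# Sample mean vector: the average of each column.
x_bar = X.mean(axis=0)

# Sample covariance matrix: S = (X - x_bar)^T (X - x_bar) / (n - 1).
centered = X - x_bar
S = centered.T @ centered / (n - 1)

print("Sample mean vector:", x_bar)
print("Sample covariance matrix:\n", S)
print("Matches np.cov:", np.allclose(S, np.cov(X, rowvar=False)))
```

The diagonal of `S` holds the variances and the off-diagonal entries hold the covariances, which is the layout described above.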

From the covariance matrix, we observe that the wind speed has a variance of 2.39 and a negative covariance with solar radiation. The variance of solar radiation is 268.18, which is much higher than that of the other variables, indicating its high variability. The levels of CO, NO, and HC have relatively low variances, indicating lower variability in their observed values. The levels of NO2 and O3 have higher variances, indicating higher variability in their observed values. The covariance between NO2 and O3 is positive, indicating that they tend to move in the same direction.

The sample Pearson's correlation matrix provides information about the strength and direction of the linear relationship between the variables. A correlation coefficient of +1 indicates a perfect positive linear relationship, while a correlation coefficient of -1 indicates a perfect negative linear relationship. A correlation coefficient of 0 indicates no linear relationship between the variables.
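The relationship between the covariance and correlation matrices can be sketched as follows, assuming the standard formula r_ik = s_ik / sqrt(s_ii * s_kk). The toy array `X` is hypothetical and made up for illustration. The sketch also rescales one column to show that covariances change with the units of measurement while correlations do not.

```python
import numpy as np

# Hypothetical toy data: 5 observations (rows) of 3 variables (columns).
X = np.array([
    [8.0, 98.0, 7.0],
    [7.0, 107.0, 4.0],
    [7.0, 103.0, 4.0],
    [10.0, 88.0, 5.0],
    [6.0, 91.0, 4.0],
])

# Correlation matrix from the covariance matrix: r_ik = s_ik / sqrt(s_ii * s_kk).
S = np.cov(X, rowvar=False)
std = np.sqrt(np.diag(S))
R = S / np.outer(std, std)

# Change the units of the second variable (multiply by 1000).
X_scaled = X.copy()
X_scaled[:, 1] *= 1000.0
S_scaled = np.cov(X_scaled, rowvar=False)
R_scaled = np.corrcoef(X_scaled, rowvar=False)

print("Covariance (0, 1) before and after rescaling:", S[0, 1], S_scaled[0, 1])
print("Correlation (0, 1) before and after rescaling:", R[0, 1], R_scaled[0, 1])
```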

From the correlation matrix, we observe that wind speed has a weak negative correlation with solar radiation, CO, NO, and NO2. Solar radiation has a weak positive correlation with CO and NO2. The levels of CO and NO have a moderate positive correlation, indicating that they tend to move in the same direction. The levels of NO2 and O3 have a weak positive correlation, indicating that they tend to move in the same direction. The levels of HC have a weak positive correlation with CO, NO2, and O3, indicating that they tend to move in the same direction.

Overall, the summary statistics describe the variability and interdependence of the observed variables. Wind speed and solar radiation are weakly negatively correlated, so higher wind speeds tend to coincide with lower solar radiation. CO and NO are moderately positively correlated, so higher CO levels tend to coincide with higher NO levels. NO2 and O3 are weakly positively correlated, and HC has weak positive correlations with CO, NO2, and O3. These correlations describe association only and do not by themselves establish that one variable causes changes in another.

3. Principal Component Analysis

We can calculate the eigen-matrices (the eigenvectors scaled by the square roots of their eigenvalues, often called loadings) by multiplying each eigenvector by the square root of its corresponding eigenvalue. The eigen-matrices represent the contribution of each variable to each principal component. By comparing the eigen-matrices to the measures of misleadingness and standard deviation, we can identify which variables are most strongly associated with each component and which variables may be contributing to the misleadingness of the data. In this case, we have calculated the eigen-matrices for each principal component as follows:

Eigen-matrices:

PC1: [-0.18230171, 0.96684981, 0.25394298, -0.03852954, -0.0036917, 0.01835633, -0.00325247]
PC2: [0.32950628, 0.09451756, -0.25989026, -0.01000767, -0.38807472, -0.82862222, -0.13237483]
PC3: [0.12142622, 0.05258071, 0.32304574, 0.3164587, 0.84495533, -0.51201763, 0.03904091]
PC4: [-0.01758463, -0.00169694, 0.2828966, 0.38745804, -0.11864697, 0.03354244, -0.89071617]
PC5: [-0.00564114, -0.08234655, 0.6085142, -0.03038281, 0.05900214, 0.01071288, 0.00151062]
PC6: [0.00224446, -0.06937089, -0.51657589, 0.01025138, 0.00290649, -0.00403779, 0.01447616]
PC7: [-0.00071013, -0.00474485, 0.01941133, 0.00546194, -0.00118714, 0.01229121, -0.01099125]

To compare the eigen-matrices to the measures of misleadingness (X) and standard deviation (Σ), we need to calculate X and Σ first.

X: [0.49, 0.20, 0.12, 0.05, 0.02, 0.01, 0.01]
Σ: [0.22305238, 0.13296634, 0.09611912, 0.04925608, 0.0293656, 0.0203297, 0.00928377]
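The scaling described at the start of this section (each eigenvector multiplied by the square root of its eigenvalue) can be sketched as below. This is an illustration under stated assumptions: the 3 x 3 matrix `S` is a hypothetical covariance matrix, not the one from the report, and `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np

# Hypothetical symmetric positive-definite covariance matrix (illustration only).
S = np.array([
    [4.0, 2.0, 0.6],
    [2.0, 3.0, 0.4],
    [0.6, 0.4, 1.0],
])

# eigh returns eigenvalues in ascending order, so reorder to descending.
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Scale column j (the j-th eigenvector) by sqrt(lambda_j).
scaled = eigenvectors * np.sqrt(eigenvalues)

print("Eigenvalues:", eigenvalues)
print("Scaled eigenvectors (one column per component):\n", scaled)

# The scaled eigenvectors reproduce the covariance matrix: scaled @ scaled.T = S.
print("Reconstructs S:", np.allclose(scaled @ scaled.T, S))
```

Because eigenvectors are only defined up to sign, the signs of whole columns may differ between software packages without changing the interpretation.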

In the first principal component, the second variable (solar radiation) has the highest absolute value (0.9668), indicating a strong positive association between solar radiation and PC1. The third variable (CO) has the second-highest absolute value (0.2539), indicating a weaker positive association between CO and PC1.

To identify which variables may be contributing to the misleadingness of the data, we can compare the eigen-matrices to the measures of misleadingness (X) and standard deviation (Σ). In PC1, the first variable (wind) has a relatively low absolute value (0.1823), indicating a weaker association with PC1 than solar radiation or CO. However, wind has a relatively high value in X (0.49), indicating that it may be contributing to the misleadingness of the data. Similarly, in PC2, the fifth variable (NO2) has a moderate absolute value (0.3881), indicating a moderate negative association with PC2. However, NO2 has a relatively low value in X (0.02), indicating that it may not be contributing to the misleadingness of the data.

By examining the eigen-matrices and comparing them to X and Σ, we can gain insight into which variables are most strongly associated with each principal component and which variables may be contributing to the misleadingness of the data. From the first eigen-matrix, solar radiation has a strong positive association with the first principal component, as indicated by its coefficient of 0.967, while wind has a weak negative association, as indicated by its coefficient of -0.182. For the second eigen-matrix, O3 has a strong negative association with the second principal component, as indicated by its coefficient of -0.829. Additionally, NO2 has a moderate negative association with the second principal component, as indicated by its coefficient of -0.388.

4. Analysis

Comparing the eigen-matrices with X and Σ, we can see that solar radiation, O3, NO2, HC, and CO all have high absolute value coefficients in at least one eigen-matrix (0.967 in PC1, 0.829 in PC2, 0.845 in PC3, 0.891 in PC4, and 0.609 in PC5, respectively). This suggests that these variables may have the strongest correlations with the principal components and may be important contributors to the overall pattern of air pollution in the data.

In addition, the variables with the highest X values (that is, the most misleading variables) are wind (0.49) and solar radiation (0.20). Solar radiation also has the highest absolute value coefficient in the first eigen-matrix, which suggests that its strong association with the first principal component may lead to misleading data.

In the first principal component, solar radiation has the highest absolute value, indicating a strong positive correlation between solar radiation and PC1. CO has the second-highest absolute value, indicating a weak positive correlation between CO and PC1. The absolute value of wind is relatively low, but its X value is high, indicating that it may lead to misleading data.

In the second principal component, O3 has the highest absolute value (0.829), indicating a strong negative correlation with PC2. NO2 is moderately negatively correlated with the second principal component (coefficient -0.388), and its relatively low X value (0.02) suggests that it may not be misleading in the data. Turning to the third eigen-matrix, NO2 has a strong positive correlation with the third principal component, as indicated by its high absolute value coefficient of 0.845. In addition, NO is moderately positively correlated with the third principal component, with a coefficient of 0.316.

The analysis shows that by examining the eigen-matrices and comparing them to X and Σ, we can determine which variables are most closely associated with each principal component and which variables might be causing the data to be misleading. This information can be used to develop more accurate air pollution models and inform policy decisions related to reducing pollution levels.

5. Conclusion

The study provides insight into the relationships between the air-pollution variables and their contributions to the overall pattern of air pollution in the data. This analysis highlights the importance of considering multiple variables and their associations with the principal components when studying air pollution. The findings could help policymakers develop effective strategies to reduce pollution levels and improve air quality.

One of the main benefits of using principal component analysis is that it allows us to reduce the number of variables in our data while retaining as much information as possible. In this analysis, we started with seven variables (five pollutants plus wind and solar radiation), but we were able to condense the information from these variables into three principal components. This reduction in variables can be useful for a variety of reasons, including simplifying data visualization and modeling and reducing the risk of overfitting.
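The reduction from seven variables to three components can be sketched as a projection of the centered data onto the leading eigenvectors of the covariance matrix. The data below are synthetic stand-ins generated with a fixed seed (an assumption for illustration, not the report's measurements), and `k = 3` simply mirrors the number of components mentioned above.

```python
import numpy as np

# Synthetic stand-in data: 30 observations of 7 variables (not the report's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 7)) * np.array([1.5, 16.0, 1.2, 1.0, 3.0, 5.0, 0.7])

centered = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Eigen-decomposition of the symmetric covariance matrix, sorted descending.
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the first k components and project the centered data onto them.
k = 3
scores = centered @ eigenvectors[:, :k]

# Proportion of total sample variance carried by each component.
explained = eigenvalues / eigenvalues.sum()

print("Shape of reduced data:", scores.shape)
print("Proportion of variance explained:", np.round(explained, 3))
print("Cumulative proportion for k components:", round(float(explained[:k].sum()), 3))
```

The sample variance of each column of `scores` equals the corresponding eigenvalue, which is why the eigenvalues can be read as the variance retained by each component.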

By examining the loadings or coefficients of each variable in the principal components, we can identify which variables are most strongly associated with each component. This information can be used to identify the most important variables in the dataset and to develop more accurate models of air pollution.
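One common way to quantify the association between a component and each original variable is the correlation r = e_1k * sqrt(lambda_1) / sqrt(s_kk), where e_1k is the k-th coefficient of the first eigenvector, lambda_1 is the largest eigenvalue, and s_kk is the variance of variable k. The sketch below checks this formula against a direct computation. The data are synthetic and the variable count is arbitrary; both are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in data: 30 observations of 4 variables (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4)) * np.array([1.5, 16.0, 1.2, 3.0])
X[:, 2] += 0.05 * X[:, 1]  # induce some correlation between two columns

S = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order, so the last one is the largest.
eigenvalues, eigenvectors = np.linalg.eigh(S)
lambda_1 = eigenvalues[-1]
e_1 = eigenvectors[:, -1]

# Correlation between PC1 and each variable from the closed-form expression.
corr_formula = e_1 * np.sqrt(lambda_1) / np.sqrt(np.diag(S))

# The same correlations computed directly from the PC1 scores.
pc1_scores = (X - X.mean(axis=0)) @ e_1
corr_direct = np.array(
    [np.corrcoef(pc1_scores, X[:, k])[0, 1] for k in range(X.shape[1])]
)

print("Formula:", np.round(corr_formula, 4))
print("Direct: ", np.round(corr_direct, 4))
```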

Additionally, by identifying which pollutants are most strongly associated with each principal component, we can gain insight into which pollutants are driving the overall pattern of air pollution in the data. This information can be useful for identifying which pollutants to target for pollution reduction efforts and for understanding the underlying causes of air pollution in the area.

However, it is important to note that reducing variables through principal component analysis also has limitations. For example, when only a few principal components are retained, some information is lost, particularly if the variables are only weakly correlated, because then no small set of components captures most of the variance. Additionally, the interpretation of principal components can be challenging, as the components are linear combinations of the original variables and may not have a clear meaning in and of themselves.
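One way to quantify the information lost when only k components are kept is the reconstruction error: the variance that remains after projecting the data onto the retained components and mapping back. A minimal sketch follows, assuming synthetic data and an arbitrary choice of k = 3. It checks that the lost variance equals the sum of the discarded eigenvalues.

```python
import numpy as np

# Synthetic stand-in data: 30 observations of 7 variables (not the report's data).
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 7)) * np.array([1.5, 16.0, 1.2, 1.0, 3.0, 5.0, 0.7])

n = X.shape[0]
centered = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Eigen-decomposition sorted in descending order of eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project onto the first k components, then map back to the original space.
k = 3
V_k = eigenvectors[:, :k]
reconstructed = centered @ V_k @ V_k.T
residual = centered - reconstructed

# Variance lost by discarding the remaining components.
lost_variance = (residual ** 2).sum() / (n - 1)

print("Lost variance:", lost_variance)
print("Sum of discarded eigenvalues:", eigenvalues[k:].sum())
print("Share of total variance lost:", lost_variance / eigenvalues.sum())
```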

Therefore, while principal component analysis can be a useful tool for variable reduction and for gaining insight into the underlying patterns in complex data, it is important to carefully consider the strengths and limitations of the technique and to use it in conjunction with other analytical methods to gain a comprehensive understanding of the data.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the dataset
data = pd.read_csv('/work/data_matrix.csv', index_col=0)
print(data.head())
# print(data.dtypes)

# Calculate sample mean vector
mean_vec = data.mean()

# Calculate sample covariance matrix
cov_mat = data.cov()

# Calculate sample Pearson's correlation matrix
corr_mat = data.corr()

# Print summary statistics
print('\nSample Mean Vector:\n', mean_vec)
print('\nSample Covariance Matrix:\n', cov_mat)
print('\nSample Pearson\'s Correlation Matrix:\n', corr_mat)

## Q1
data = data.astype(float)

# Sample mean vector
sample_mean_vec = np.mean(data, axis=0)

# Sample covariance matrix (S)
sample_cov_matrix = np.cov(data.T)

# Sample Pearson's correlation matrix
pearson_cor_matrix = np.corrcoef(data.T)

# Print summary statistics
print('\nSample Mean Vector:\n', sample_mean_vec)
print('\nSample Covariance Matrix:\n', sample_cov_matrix)
print('\nSample Pearson\'s Correlation Matrix:\n', pearson_cor_matrix)

## Q2
# Conduct a first principal component analysis of the data using the covariance matrix S.

# Eigen-decomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(sample_cov_matrix)

# Sort the eigenvalues and eigenvectors in descending order of eigenvalue
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Proportion of variance explained by each principal component
variance_explained_pca = eigenvalues / np.sum(eigenvalues)

# Cumulative proportion of variance explained
cum_variance_explained = np.cumsum(variance_explained_pca)

# Extract the first principal component (eigenvector of the largest eigenvalue)
pc_1 = eigenvectors[:, 0]

print("Eigenvalues:")
print(eigenvalues)
print("\nEigenvectors:")
print(eigenvectors)
print("\nVariance explained by each principal component:")
print(variance_explained_pca)
print("\nCumulative variance explained:")
print(cum_variance_explained)
print("\nFirst principal component:")
print(pc_1)

import pandas as pd
import matplotlib.pyplot as plt

# Create a dataframe to store the results
results_df = pd.DataFrame({
    'PC': range(1, len(variance_explained_pca) + 1),
    'Variance Explained': variance_explained_pca,
    'Cumulative Variance Explained': cum_variance_explained,
})

# Bar chart of the proportion of variance explained by each principal component
plt.bar(x=results_df['PC'], height=results_df['Variance Explained'])
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.title('PCA Results')
plt.show()

## Q3
# Interpretation of the first principal component
pc_1_interpret = "The first principal component represents a combination of all variables, with a high positive weight on 'Solar Radiation' & 'Ozone', and a negative weight on 'Carbon Monoxide' & 'Nitrogen Oxides'."

# Table of correlation coefficients between the first principal component
# and the original variables: r = e_1k * sqrt(lambda_1) / sqrt(s_kk)
pc_1_corr = pc_1 * np.sqrt(eigenvalues[0]) / np.sqrt(np.diag(sample_cov_matrix))
pc_1_corr_table = pd.DataFrame({
    'Variable': data.columns,
    'Correlation with PC1': pc_1_corr,
})
print(pc_1_corr_table)

# Data summary in one dimension
data_summary = "Yes, the data can be summarized in one dimension using the first component, which explains a large proportion of the total variance in the data."

# Visualizations
# Scatter plot matrix
colors = ['hotpink', 'green', 'blue', 'red', 'orange', 'purple', 'brown']
fig, axs = plt.subplots(7, 7, figsize=(10, 10))
for i in range(7):
    for j in range(7):
        axs[i, j].scatter(data.values[:, i], data.values[:, j], color=colors[i])
        axs[i, j].set_xlabel("x{}".format(i + 1))
        axs[i, j].set_ylabel("x{}".format(j + 1))
plt.tight_layout()
plt.show()

# Heatmap of the correlation matrix
labels = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]
fig, ax = plt.subplots(figsize=(7, 7))
im = ax.imshow(pearson_cor_matrix, cmap="coolwarm")
ax.set_xticks(np.arange(7))
ax.set_yticks(np.arange(7))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Annotate each cell with its correlation coefficient
for i in range(7):
    for j in range(7):
        ax.text(j, i, round(pearson_cor_matrix[i, j], 2),
                ha="center", va="center", color="black")
plt.show()

# Summarize the data in one dimension by computing the distribution of one variable.
# Set color palette
colors = sns.color_palette('husl', 8)

# Plot histogram
plt.hist(data['wind'], bins=20, color=colors[0])
plt.xlabel('wind')
plt.ylabel('Frequency')
plt.title('Distribution of Wind Speed')
plt.show()

# Plot boxplot
sns.boxplot(data=data, x='wind', color=colors[5])
plt.xlabel('wind')

# Add a text box with the number of observations
n_obs = len(data['wind'])
plt.text(0.85, 0.95, 'n = {}'.format(n_obs),
         transform=plt.gca().transAxes, fontsize=10, verticalalignment='top',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.show()
