Clothing Analysis
Note: The Report is at the bottom of the notebook. Please scroll down to find it.
In the first notebook that was given to us, a few tests had already been run, but none yielded a residual plot that met the required conditions. One thing to note is that the data set was not preprocessed or cleaned before being used. This is a problem for one major reason: it contains data points for customers who did not spend any money but did return products. Including these customers will hurt the results, because the amount spent cannot be inferred from their values.
Before anything else, we need to remove one clear outlier: the data point with an id of 60 and an amount of $1506000. This value is far greater than the third quartile plus 1.5 times the interquartile range. We also need to remove the data points where the amount is equal to 0.
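The two cleaning rules described above can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: `records` is hypothetical toy data (only id 60 and its amount come from the text), `upper_fence` is a name chosen here, and both the `inclusive` quartile method and computing the fence before dropping the zero amounts are assumptions.

```python
from statistics import quantiles

# Hypothetical toy records as (id, amount); only id 60's amount is from the text.
records = [(1, 0), (2, 45), (3, 120), (4, 80), (5, 0), (6, 200), (7, 150),
           (8, 60), (9, 95), (60, 1506000)]


def upper_fence(values):
    """Return Q3 + 1.5 * IQR for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


amounts = [amount for _, amount in records]
fence = upper_fence(amounts)

# Ids whose amount lies above the upper fence.
removed_ids = [rid for rid, amount in records if amount > fence]

# Keep rows with a positive amount that does not exceed the upper fence.
cleaned = [(rid, amount) for rid, amount in records if 0 < amount <= fence]
```

On this toy data, the extreme amount is the only value above the fence, and the zero-amount rows are dropped by the `0 < amount` condition.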
Let's start by querying the clothing data and creating a chart visualization widget for each field vs. amount.
Although this data looks better, let's clean it further. We can remove the remaining outliers by considering the IQR.
Let's also check for normality within the data set.
This data looks approximately normal, so we can get started with our analysis.
In the given notebook, the statistician checked for a relationship between Amount and the Dollar12 and Dollar24 variables, and between Amount and all of the variables. Because we removed an outlier at the beginning, let's refit these models and check their residual plots.
Amount vs Dollar12 and Dollar24
Furthermore, plotting the residuals of the regression yields:
The residual plot shows a parabolic pattern (middle, down, middle); hence, the model still does not fit these values properly. Let's try refitting our other model.
Amount vs All Other Variables
Let's check the residuals.
Although this model is slightly better than the previous one, there is still much work to be done. The R-squared value is still quite low, and the residual plot still shows a parabolic pattern.
Average Spending Model
Since these variables do not show as much correlation with one another as expected, we can instead formulate a model that considers the average amount that a user spends monthly. Intuitively, this parameter makes much more sense than the raw values of the amount spent and the shopping frequency over the past 1 and 2 years, because a person is more likely to spend a consistent amount from month to month than to spend an unexpectedly small or large amount.
Let's check the residuals.
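As a small illustration of the idea, the sketch below turns a yearly total into a monthly average. The function name `monthly_average` and the example totals are hypothetical, and the exact feature definition used in the notebook may differ.

```python
def monthly_average(total, months):
    """Average per-month value of a total accumulated over `months` months."""
    if months <= 0:
        raise ValueError("months must be positive")
    return total / months


# Hypothetical customer: $1000 spent over the last 12 months,
# $1800 spent over the last 24 months.
avg_dollar12 = monthly_average(1000, 12)
avg_dollar24 = monthly_average(1800, 24)
```

Putting the 12-month and 24-month totals on the same per-month scale makes them directly comparable as predictors.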
A Smoother Residual Plot
There appears to be one outlier in the residuals where the fitted value is between 7.63 and 7.64. Let's look at this plot again after removing the outlier.
Putting Everything Together
Let's mostly keep what we are doing and only change the predictors that are not statistically significant.
Let's check the residuals.
There appears to be one outlier in the residuals where the fitted value is between 7.75 and 7.76. Let's look at this plot again after removing the outlier.
Conclusion
The final model produced was:
In context, suppose we create prediction intervals for three customers with a Dollar12 and a Freq12 of $1000 and 10, $840 and 15, and $960 and 13, respectively. The intervals predict that the first customer spends between $61.28 and $156.79, the second between $31.27 and $80.50, and the third between $43.16 and $110.50.
The main question to ask about our resulting model is how effective it is and what its shortcomings are.
One could argue that, by transforming a regression line properly, an effective model for predicting the data can be revealed. The short and simple response is that this is not feasible for most models, and especially not for this one. It is impossible to predict the amount someone will spend on their next purchase. This is independent of the model, the data scientist, and so on: we cannot predict this quantity if we are not given the right information. Suppose that an individual decides to shop at the store. The purchases they make will depend on the context of their shopping (whether they are shopping for a party of 50 people or a dinner for 4), the time of day, their state of mind, their budget, and so on. Because these variables are not given to us in the data set and are not confounded with any of the variables that are, we simply do not know many of the factors that go into a purchase. If another survey could be held that asks about more parameters, a much better model could be built.
With this being said, for the number of parameters that we currently have, our model does a decent job of predicting the average-case scenario. When a prediction interval is created from this regression, an individual data point's value is likely to fall within it.
Report
In this analysis, we statistically tested several models of the data in order to find the one that works best, both practically and statistically, in comparison to the others.