Shopify Data Science Intern Challenge

Questions from Part 1

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

Think about what could be going wrong with our calculation. Think about a better way to evaluate this data. What metric would you report for this dataset? What is its value?

# Import the packages needed for this project
import pandas as pd
## import matplotlib.pyplot as plt
## plotly.express turned out to be a better graphing package for this project
import plotly.express as px

# Load the file into a DataFrame first
df = pd.read_csv('/work/datasheet.csv')

df.head()

df.shape

df.dtypes

Convert the created_at column to a datetime data type so the data can be queried by date.

df['created_at'] = pd.to_datetime(df['created_at'])

df.dtypes

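# Compare against date strings; bare integers such as 20170301 are not dates
df.query("created_at > '2017-03-01' and created_at < '2017-03-31'")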
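To confirm that the orders really fall inside a 30 day window, the query above can be paired with a check on the earliest and latest timestamps. The following is a minimal sketch on a small hypothetical frame (the rows and the name `sample` are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows standing in for the real orders file
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "created_at": [
        "2017-03-01 00:08:09",
        "2017-03-15 12:00:00",
        "2017-03-30 23:55:35",
    ],
})
sample["created_at"] = pd.to_datetime(sample["created_at"])

# Whole days between the first and last order in the sample
first, last = sample["created_at"].min(), sample["created_at"].max()
span_days = (last - first).days

# Rows inside the window, compared against date strings rather than integers
in_window = sample.query(
    "created_at >= '2017-03-01' and created_at < '2017-03-31'"
)
print(first, last, span_days, len(in_window))
```

On the real data, the same two lines (`min`/`max` and the query) would show whether any order falls outside the expected window.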

df.duplicated().sum()

df.nunique()

df.describe()

# Calculate the AOV for each shop, then take the average across shops.
# Note: revenue divided by total_items is the average value per item sold in each shop.
d = []
group = df.groupby('shop_id')
for key, value in group:
    d.append([key, value['order_amount'].sum() / value['total_items'].sum()])
df1 = pd.DataFrame(d, columns=["shop_id", "AOV Per Shop"])
print(df1)
n = len(pd.unique(df['shop_id']))
df1["AOV Per Shop"].sum() / n
## Double-checking the calculation for shop 1
## a = df.loc[df['shop_id'] == 1, "order_amount"].sum()
## b = df.loc[df['shop_id'] == 1, "total_items"].sum()
## print(a / b)

# Plot the per-shop result as a box plot to visualize the spread
fig = px.box(df1, y="AOV Per Shop")
fig.show()

def find_outliers_IQR(df1):
    # Flag values more than 1.5 * IQR below Q1 or above Q3
    q1 = df1.quantile(0.25)
    q3 = df1.quantile(0.75)
    IQR = q3 - q1
    outliers = df1[((df1 < (q1 - 1.5 * IQR)) | (df1 > (q3 + 1.5 * IQR)))]
    return outliers

outliers = find_outliers_IQR(df1["AOV Per Shop"])
print("number of outliers: " + str(len(outliers)))
print("max outlier value: " + str(outliers.max()))
print("min outlier value: " + str(outliers.min()))
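To make the IQR rule easier to follow, here is a small self-contained sketch of the same idea on hypothetical values. The helper name `iqr_outliers`, the parameter `k`, and the numbers in `toy` are assumptions for illustration only; they are not part of the dataset or the analysis above.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Hypothetical per-shop values, not taken from the dataset
toy = pd.Series([140.0, 150.0, 155.0, 160.0, 165.0, 1000.0])
flagged = iqr_outliers(toy)
print(flagged.tolist())
```

With these toy numbers, Q1 is 151.25 and Q3 is 163.75, so the fences sit at 132.5 and 182.5 and only the value 1000.0 is flagged.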

# Drop the shop whose value is 25725 and recompute the average
df2 = df1[df1['AOV Per Shop'] != 25725]
n = len(pd.unique(df2['shop_id']))
AOV_no = df2["AOV Per Shop"].sum() / n
print(round(AOV_no))

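# Mode of the per-shop AOV without the outlier
mode_aov = df2["AOV Per Shop"].mode()
print(mode_aov)
# Mode and mean of the per-shop AOV with all shops included
mode_aov = df1["AOV Per Shop"].mode()
print(mode_aov)
mean_aov = df1["AOV Per Shop"].mean()
print(mean_aov)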

Conclusion for Part 1

I would report the mode of the per-shop AOV as the metric, since it is far less sensitive to outliers than the mean; a plain mean is pulled upward by extreme values such as the 25725 found above. Alternatively, if we follow the analysis above and eliminate the outliers from the original dataset, we could report the mean of the remaining data. In this case, the value is 153, which gives businesses a good starting point for thinking about how to maximize revenue. I hope you enjoyed my analysis.
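The effect of a single extreme value on each candidate metric can be seen in a short sketch. The numbers in `toy` are hypothetical and chosen only for illustration. The median is included as an additional robust option; it was not part of the analysis above.

```python
import pandas as pd

# Hypothetical per-shop values with one extreme shop (not real data)
toy = pd.Series([120.0, 150.0, 150.0, 180.0, 20000.0])

mean_all = toy.mean()                   # pulled up by the extreme value
median_all = toy.median()               # middle value, robust to the extreme
mode_all = toy.mode().iloc[0]           # most frequent value
mean_trimmed = toy[toy < 1000].mean()   # mean after dropping the extreme shop

print(mean_all, median_all, mode_all, mean_trimmed)
```

Here the plain mean lands at 4120.0, while the median, the mode, and the mean without the extreme shop all sit at 150.0, which is why a robust metric is a safer headline number for data like this.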

Questions from Part 2 SQL

For this question you’ll need to use SQL.

Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.

A. How many orders were shipped by Speedy Express in total?

""" Select Count(OS.OrderID) AS Num_Of_Orders_Shipped_By_SpeedyExpress From Orders AS OS Where (Select SH.ShipperID From Shippers AS SH Where SH.ShipperName = "Speedy Express") = OS.ShipperID; """

B. What is the last name of the employee with the most orders?

""" SELECT Employees.LastName, COUNT(*) AS NumberOfOrders FROM Orders INNER JOIN Employees ON Orders.EmployeeID = Employees.EmployeeID GROUP BY Employees.LastName ORDER BY NumberOfOrders DESC LIMIT 1; """

C. What product was ordered the most by customers in Germany?

""" SELECT Products.ProductName, SUM(OrderDetails.Quantity) AS "TotalOrdered" FROM Orders JOIN Customers ON Customers.CustomerID = Orders.CustomerID JOIN OrderDetails ON OrderDetails.OrderID = Orders.OrderID JOIN Products ON Products.ProductID = OrderDetails.ProductID WHERE Customers.Country = 'Germany' GROUP BY OrderDetails.ProductID ORDER BY TotalOrdered DESC LIMIT 1; """
