Shopify Data Science Intern Challenge
Questions from Part 1
On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.
Think about what could be going wrong with our calculation. Think about a better way to evaluate this data. What metric would you report for this dataset? What is its value?
0
1
53
1
2
92
2
3
44
3
4
18
4
5
18
Converting created_at columns to datetime data type in order to query the data....
4515
4516
1
4657
4658
1
1190
1191
1
347
348
1
2587
2588
1
41
42
1
2759
2760
1
948
949
1
388
389
1
3316
3317
1
2225
2226
1
4347
4348
1
2030
2031
1
1824
1825
1
707
708
1
3276
3277
1
374
375
1
4358
4359
1
4028
4029
1
4908
4909
1
2941
2942
1
4153
4154
1
728
729
1
3110
3111
1
2948
2949
1
4678
4679
1
1597
1598
1
2524
2525
1
588
589
1
2078
2079
1
1167
1168
1
3968
3969
1
3470
3471
1
1290
1291
1
4633
4634
1
1546
1547
1
4982
4983
1
605
606
1
2193
2194
1
4784
4785
1
1020
1021
1
932
933
1
1377
1378
1
4460
4461
1
1589
1590
2
4394
4395
2
2820
2821
2
856
857
2
3569
3570
2
4454
4455
2
count
5000
5000
mean
2500.5
50.0788
std
1443.520003
29.00611766
min
1
1
25%
1250.75
24
50%
2500.5
50
75%
3750.25
75
max
5000
100
shop_id AOV Per Shop
0 1 158.0
1 2 94.0
2 3 148.0
3 4 128.0
4 5 142.0
.. ... ...
95 96 153.0
96 97 162.0
97 98 133.0
98 99 195.0
99 100 111.0
[100 rows x 2 columns]
number of outliers: 2
max outlier value: 25725.0
min outlier value: 352.0
152
0 153.0
dtype: float64
0 153.0
dtype: float64
407.99
Conclusion for Part 1
I would've use mode AOV as the metric as it reduces the effect of outliers. However, if we choose to perform the above analysis and eliminate the outliers from the original dataset, we could also just use the mean calculated from the without outlier dataset. In this case, I would say the value would be 153 and it would be a good starting point for businesses to consider on how to improve their business to maximize revenue. I hope you enjoyed my analysis.
Questions from Part 2 SQL
For this question you’ll need to use SQL.
Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.
A. How many orders were shipped by Speedy Express in total?
B. What is the last name of the employee with the most orders?
C. What product was ordered the most by customers in Germany?