# Shopify Data Science Intern Challenge

## Questions from Part 1

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

Think about what could be going wrong with our calculation. Think about a better way to evaluate this data. What metric would you report for this dataset? What is its value?

0

1

53

1

2

92

2

3

44

3

4

18

4

5

18

Converting created_at columns to datetime data type in order to query the data....

4515

4516

1

4657

4658

1

1190

1191

1

347

348

1

2587

2588

1

41

42

1

2759

2760

1

948

949

1

388

389

1

3316

3317

1

2225

2226

1

4347

4348

1

2030

2031

1

1824

1825

1

707

708

1

3276

3277

1

374

375

1

4358

4359

1

4028

4029

1

4908

4909

1

2941

2942

1

4153

4154

1

728

729

1

3110

3111

1

2948

2949

1

4678

4679

1

1597

1598

1

2524

2525

1

588

589

1

2078

2079

1

1167

1168

1

3968

3969

1

3470

3471

1

1290

1291

1

4633

4634

1

1546

1547

1

4982

4983

1

605

606

1

2193

2194

1

4784

4785

1

1020

1021

1

932

933

1

1377

1378

1

4460

4461

1

1589

1590

2

4394

4395

2

2820

2821

2

856

857

2

3569

3570

2

4454

4455

2

count

5000

5000

mean

2500.5

50.0788

std

1443.5200033252052

29.006117657009987

min

1

1

25%

1250.75

24

50%

2500.5

50

75%

3750.25

75

max

5000

100

```
shop_id AOV Per Shop
0 1 158.0
1 2 94.0
2 3 148.0
3 4 128.0
4 5 142.0
.. ... ...
95 96 153.0
96 97 162.0
97 98 133.0
98 99 195.0
99 100 111.0
[100 rows x 2 columns]
```

```
number of outliers: 2
max outlier value: 25725.0
min outlier value: 352.0
```

```
152
```

```
0 153.0
dtype: float64
0 153.0
dtype: float64
407.99
```

# Conclusion for Part 1

I would've use mode AOV as the metric as it reduces the effect of outliers. However, if we choose to perform the above analysis and eliminate the outliers from the original dataset, we could also just use the mean calculated from the without outlier dataset. In this case, I would say the value would be 153 and it would be a good starting point for businesses to consider on how to improve their business to maximize revenue. I hope you enjoyed my analysis.

# Questions from Part 2 SQL

For this question you’ll need to use SQL.

Follow this link to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.

A. How many orders were shipped by Speedy Express in total?

B. What is the last name of the employee with the most orders?

C. What product was ordered the most by customers in Germany?