Sentient Steps 

by Patrick Noonan, Dec 2, 2020
      1. Background
      2. Plan of Attack
      3. Access and Clean Data
      4. Data Exploration and Visualization
      5. Predicting the Future
      6. That's all folks!

Background

  • Stewie, a filthy rich casino mogul, hires us to analyze his step count and predict his total number of steps for the year 2020.
  • If we predict his final step count accurately, we will be paid $100,000.
  • Stewie is a health nut and wears both a Garmin watch and a Fitbit to count his daily steps.

Plan of Attack

  1. Define what we need.
  2. Access the data.
  3. Clean the data.
  4. Analyze and visualize the data.
  5. Make future predictions using algorithmic models (Data Science).

Access and Clean Data

Import Modules.

import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime as dt
from datetime import timedelta
from dateutil.parser import parse as parse_date
import pdb
import seaborn as sns
from sklearn.linear_model import LinearRegression

%matplotlib inline

Import each JSON file and combine everything into one DataFrame.

folder = 'data/'
json_files = os.listdir(folder)
data = pd.DataFrame()
for file in json_files:
    full_path_file = folder + file
    print(full_path_file)
    file_df = pd.read_json(full_path_file)
    data = pd.concat([file_df, data])
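As a variation on the loop above, here is a minimal sketch that collects the per-file frames in a list and concatenates once at the end. The helper name `load_json_folder`, the `part_*.json` file names, and the demo records are made up for illustration; only `totalSteps` mirrors a field from the notebook. It also uses `ignore_index=True`, which the notebook's loop does not.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_json_folder(folder):
    """Read every .json file in `folder` and return one combined DataFrame."""
    frames = []
    # sorted() makes the load order deterministic; directory listings are not
    for path in sorted(Path(folder).glob("*.json")):
        frames.append(pd.read_json(path))
    if not frames:
        return pd.DataFrame()
    # A single concat avoids re-copying the growing frame on every iteration
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny, made-up files
with tempfile.TemporaryDirectory() as tmp:
    for i, steps in enumerate([[1000, 2000], [3000]]):
        records = [{"totalSteps": s} for s in steps]
        Path(tmp, f"part_{i}.json").write_text(json.dumps(records))
    combined = load_json_folder(tmp)

print(combined)
```

Filtering on `*.json` also skips stray files (such as hidden system files) that `os.listdir` would pass to `pd.read_json`.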

What does the data look like?

data.info()
data.head()

There are a lot of features, but we only care about totalSteps and calendarDate. Let's pull out those fields, convert calendarDate to a proper datetime, and set the date as the index so we can use time series indexing operations.

data['date'] = data['calendarDate'].apply(lambda x: pd.to_datetime(x['date']))
steps_data = data[['date', 'totalSteps']].set_index('date')
steps_data.head()
steps_data.describe()
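To make the date handling concrete, here is a small sketch on made-up records. It assumes, as the code above implies, that each `calendarDate` value is a dict with a `date` key; the sample values themselves are invented.

```python
import pandas as pd

# Made-up records shaped like the notebook's data: calendarDate is a dict
# with a 'date' key (the real export may carry additional keys)
raw = pd.DataFrame({
    "calendarDate": [{"date": "2020-01-02"}, {"date": "2020-01-01"}],
    "totalSteps": [8500, 12000],
})

# Pull the date string out of each dict, then convert the column in one call
raw["date"] = pd.to_datetime(raw["calendarDate"].map(lambda d: d["date"]))
steps = raw[["date", "totalSteps"]].set_index("date").sort_index()

# A DatetimeIndex allows partial-string selection, e.g. all of January 2020
january = steps.loc["2020-01"]
print(january)
```

This kind of partial-string selection is what the later `steps_data.loc['2020']` call relies on.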

Data Exploration and Visualization

Let's explore the data visually.

import matplotlib

ax = steps_data.plot(kind='line', fontsize=14, title="Steps/Day", figsize=(14,7))
ax.set_xlabel('Day')
ax.set_ylabel('# Steps', fontsize=14)
ax.get_legend().remove()
plt.grid(False)
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

ax = steps_data.rolling(window=7).mean().plot(kind='line', fontsize=14, title="Rolling 7 Days - Mean Steps", figsize=(14,7))
ax.set_xlabel('Day')
ax.set_ylabel('# Steps', fontsize=14)
ax.get_legend().remove()
plt.grid(False)
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
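To see what the 7-day rolling mean does to the numbers, here is a small sketch on ten made-up daily counts (the values are invented, not Stewie's data).

```python
import pandas as pd

# Ten made-up daily step counts
daily = pd.Series(
    [4000, 6000, 8000, 10000, 12000, 14000, 16000, 9000, 9000, 9000],
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
    name="totalSteps",
)

# Each value is the mean of that day and the six days before it
rolling = daily.rolling(window=7).mean()

# The first six entries are NaN because a 7-day window is not yet full
print(rolling)
```

If the leading gap in the chart is unwanted, `rolling(window=7, min_periods=1)` averages over however many days are available so far.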

Let's smooth the data further by aggregating it by week and by month.

by_week = steps_data.groupby(pd.Grouper(freq='W')).sum()
ax = by_week.plot(kind='line', fontsize=14, title="Steps per Week", figsize=(14,7))
ax.set_xlabel('Week')
ax.set_ylabel('# Steps', fontsize=14)
ax.get_legend().remove()
plt.grid(False)
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
# Label each day with its year-month, then average the daily steps per month
steps_data['month'] = steps_data.index.strftime('%Y-%m')
monthly_grouped = steps_data.groupby('month')[['totalSteps']].mean()

# Same aggregation on calendar-month bins, with a year-month label column
by_month = steps_data[['totalSteps']].groupby(pd.Grouper(freq='M')).mean()
by_month['month'] = by_month.index.strftime('%Y-%m')

ax = monthly_grouped.plot(kind='bar', fontsize=14, title="Mean Daily Steps by Month", figsize=(14,7))
ax.set_xlabel('Month')
ax.set_ylabel('# Steps', fontsize=14)
ax.get_legend().remove()
plt.grid(False)
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
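The weekly and monthly aggregations can be checked by hand on a tiny example. This sketch uses 14 made-up days of exactly 1,000 steps each and shows that `resample` gives the same weekly totals as the `pd.Grouper` form used above.

```python
import pandas as pd

# 14 made-up days, 1,000 steps each, starting on Monday 2020-01-06
daily = pd.DataFrame(
    {"totalSteps": [1000] * 14},
    index=pd.date_range("2020-01-06", periods=14, freq="D"),
)

# 'W' bins end on Sunday, so these 14 days form exactly two full weeks
by_week = daily.resample("W").sum()
same_by_week = daily.groupby(pd.Grouper(freq="W")).sum()

# Monthly mean of daily steps, keyed by a 'YYYY-MM' label
by_month = daily.groupby(daily.index.strftime("%Y-%m")).mean()

print(by_week)
print(by_month)
```

Note that a partial first or last week produces a smaller weekly sum, which can look like a dip at the edges of the weekly chart.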

Variation of Data: Histograms and Box/Whisker Plots

# Set the seaborn style and figure size (in inches) before drawing
sns.set(style="whitegrid", rc={'figure.figsize': (11, 9)})
chart = sns.histplot(steps_data['totalSteps'], kde=False, color='green', bins=20)
plt.title('Distribution of Steps', fontsize=18)
plt.xlabel('Daily Steps', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
for key, spine in chart.spines.items():
    spine.set_visible(False)
plt.grid(False)
# Thousands separators on the histogram's x-axis
chart.get_xaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

Differences between days of week?

steps_data['day_of_week'] = steps_data.index.day_name()
steps_data['month_name'] = steps_data.index.month_name()

plt.figure(figsize=(10,8))
ax = sns.boxplot(x="day_of_week", y="totalSteps", data=steps_data)
plt.title('Steps / Day of Week', fontsize=18)
plt.xlabel('Day of Week', fontsize=16)
plt.ylabel('Steps', fontsize=16)
sns.set(style="whitegrid")
# Hide the frame around this box plot
for key, spine in ax.spines.items():
    spine.set_visible(False)
plt.grid(False)
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

plt.figure(figsize=(13,7))
ax = sns.boxplot(x="month_name", y="totalSteps", data=steps_data)
plt.title('Steps / Month', fontsize=18)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Steps', fontsize=16)
sns.set(style="whitegrid")
for key, spine in ax.spines.items():
    spine.set_visible(False)
plt.grid(False)
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

Predicting the Future

How many Steps by the end of 2020?

Transform the data to show a running sum by day for the year 2020, then use this cumulative sum column to make a prediction with a simple linear regression. The general concept is to fit a 'best fit' line to the existing data points and extend that line to predict future values.
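The 'best fit' idea can be sketched on a handful of made-up points. This example uses `np.polyfit` for brevity rather than the `LinearRegression` class used in the notebook; for a single feature both perform an ordinary least-squares line fit. The totals are invented so the answer can be checked by hand.

```python
import numpy as np

# Made-up cumulative totals for days 1-5, growing by exactly 10,000 per day
days = np.array([1, 2, 3, 4, 5])
cumulative = np.array([10_000, 20_000, 30_000, 40_000, 50_000])

# Degree-1 least-squares fit: cumulative ≈ slope * day + intercept
slope, intercept = np.polyfit(days, cumulative, deg=1)

# Extend the line to the last day of 2020 (a leap year, so day 366)
predicted_total = slope * 366 + intercept
print(f"{predicted_total:,.0f}")
```

Because the cumulative total is used as the target, the slope of the line is the estimated number of steps per day.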

Let's manipulate the data to get a clean cumulative step total for the year 2020.

steps_data.sort_index(inplace=True)
# .copy() gives an independent frame, so the new columns are not written to a view
cum_sum_2020 = steps_data.loc['2020'].copy()
cum_sum_2020.reset_index(inplace=True)
cum_sum_2020['DayOfYear'] = cum_sum_2020['date'].dt.dayofyear
cum_sum_2020['cumulative_steps'] = cum_sum_2020['totalSteps'].cumsum()
cum_sum_2020[['DayOfYear', 'totalSteps', 'cumulative_steps']].head()
cum_series = cum_sum_2020.set_index('DayOfYear')['cumulative_steps'].dropna()
ax = cum_series.plot(kind='line', fontsize=14, title="Cumulative Steps", figsize=(14,7))
ax.set_xlabel('Day of Year - 2020')
ax.set_ylabel('# Steps', fontsize=14)
plt.grid(False)
for key, spine in ax.spines.items():
    spine.set_visible(False)
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
X = cum_series[-60:].index.values.reshape(-1, 1)
Y = cum_series[-60:].values.reshape(-1, 1)
linear_regressor = LinearRegression()  # create object for the class
linear_regressor.fit(X, Y)  # perform linear regression
start_day = cum_series.index.values.max()
# 2020 is a leap year, so the last day of the year is day 366
predict_days = np.arange(start_day, 367)
prediction = linear_regressor.predict(predict_days.reshape(-1, 1))
cum_series = cum_sum_2020.set_index('DayOfYear')['cumulative_steps'].dropna()
ax = cum_series.plot(kind='line', fontsize=14, title="Cumulative Steps", figsize=(14,7))
ax.set_xlabel('Day of Year - 2020')
ax.set_ylabel('# Steps', fontsize=14)
plt.grid(False)
for key, spine in ax.spines.items():
    spine.set_visible(False)
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
# Overlay the regression line for the remaining days of the year
plt.plot(predict_days, prediction, color='red', linewidth=4)
end_of_year_total = int(round(prediction[-1][0], 0))
# 2020 has 366 days
mean_steps_day_prediction = int(end_of_year_total / 366)
print('Total predicted steps:')
print(f'{end_of_year_total:,}')
print('Mean steps/day prediction:')
print(f'{mean_steps_day_prediction:,}')
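The mean steps per day depends on how many days the year has, and 2020 was a leap year. Here is a small sketch of a helper that looks this up instead of hard-coding it; `days_in_year` is a hypothetical name, not part of the notebook, and the total is a made-up figure used only to show the arithmetic.

```python
import calendar


def days_in_year(year):
    """Return 366 for leap years and 365 otherwise."""
    return 366 if calendar.isleap(year) else 365


# Made-up year-end total, only to show the arithmetic
example_total = 3_660_000
mean_per_day = example_total // days_in_year(2020)
print(f"{mean_per_day:,}")
```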

That's all folks!
