Does temperature affect the number of crimes committed?
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt
import seaborn as sns
weather = pd.read_csv('san_fran_weather.csv')
weather["DATE"]= pd.to_datetime(weather["DATE"],format="%d/%m/%Y")
#dt.datetime.strptime('2018-05-12','%Y-%m-%d').strftime('%Y-%m-%d')
weather.info()
weather.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2093 entries, 0 to 2092
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STATION 2093 non-null object
1 NAME 2093 non-null object
2 LATITUDE 2093 non-null float64
3 LONGITUDE 2093 non-null float64
4 ELEVATION 2093 non-null float64
5 DATE 2093 non-null datetime64[ns]
6 PRCP 2093 non-null float64
7 PRCP_ATTRIBUTES 2093 non-null object
8 SNOW 151 non-null float64
9 SNOW_ATTRIBUTES 151 non-null object
10 SNWD 359 non-null float64
11 SNWD_ATTRIBUTES 359 non-null object
12 TAVG 2093 non-null float64
13 TAVG_ATTRIBUTES 0 non-null float64
14 TMAX 2093 non-null float64
15 TMAX_ATTRIBUTES 2093 non-null object
16 TMIN 2093 non-null float64
17 TMIN_ATTRIBUTES 2093 non-null object
dtypes: datetime64[ns](1), float64(10), object(7)
memory usage: 294.5+ KB
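The DATE parse above passes an explicit `format="%d/%m/%Y"`, which matters for day-first strings. A minimal sketch (with made-up dates, not rows from the CSV) showing why the explicit format avoids silent day/month swaps:

```python
import pandas as pd

# Hypothetical date strings in the same day-first layout as the CSV.
s = pd.Series(["03/04/2018", "27/09/2021"])

# An explicit format avoids silent day/month ambiguity:
# "03/04/2018" parses as 3 April, not 4 March.
parsed = pd.to_datetime(s, format="%d/%m/%Y")
print(parsed.dt.month.tolist())  # [4, 9]
```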
weather_new=weather[weather['DATE']>='2018']
weather_new.info()
weather_new.tail()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1362 entries, 731 to 2092
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STATION 1362 non-null object
1 NAME 1362 non-null object
2 LATITUDE 1362 non-null float64
3 LONGITUDE 1362 non-null float64
4 ELEVATION 1362 non-null float64
5 DATE 1362 non-null datetime64[ns]
6 PRCP 1362 non-null float64
7 PRCP_ATTRIBUTES 1362 non-null object
8 SNOW 60 non-null float64
9 SNOW_ATTRIBUTES 60 non-null object
10 SNWD 268 non-null float64
11 SNWD_ATTRIBUTES 268 non-null object
12 TAVG 1362 non-null float64
13 TAVG_ATTRIBUTES 0 non-null float64
14 TMAX 1362 non-null float64
15 TMAX_ATTRIBUTES 1362 non-null object
16 TMIN 1362 non-null float64
17 TMIN_ATTRIBUTES 1362 non-null object
dtypes: datetime64[ns](1), float64(10), object(7)
memory usage: 202.2+ KB
# comment: just curious, why after 2018?
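The filter above compares a datetime column against the bare string `'2018'`, which pandas quietly parses as a timestamp. A small sketch on a toy frame (illustrative values, not the real weather data) making that cutoff explicit:

```python
import pandas as pd

# Toy weather frame; values are illustrative only.
df = pd.DataFrame({
    "DATE": pd.to_datetime(["2017-12-31", "2018-01-01", "2019-06-15"]),
    "TAVG": [12.0, 13.5, 16.2],
})

# Comparing a datetime column to the string '2018' implicitly parses it
# as 2018-01-01 00:00:00; an explicit Timestamp makes the cutoff visible.
recent = df[df["DATE"] >= pd.Timestamp("2018-01-01")]
print(len(recent))  # 2
```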
crimes = pd.read_csv('san_fran_crime_18_21.csv')
#crimes["Incident Date"]= pd.to_datetime(crimes["Incident Date"])
crimes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 507188 entries, 0 to 507187
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Incident Datetime 507188 non-null object
1 Incident Date 507188 non-null object
2 Incident Time 507188 non-null object
3 Incident Year 507188 non-null int64
4 Incident Day of Week 507188 non-null object
5 Report Datetime 507188 non-null object
6 Row ID 507188 non-null float64
7 Incident ID 507188 non-null int64
8 Incident Number 507188 non-null int64
9 CAD Number 396370 non-null float64
10 Report Type Code 507188 non-null object
11 Report Type Description 507188 non-null object
12 Filed Online 100983 non-null object
13 Incident Code 507188 non-null int64
14 Incident Category 506740 non-null object
15 Incident Subcategory 506740 non-null object
16 Incident Description 507188 non-null object
17 Resolution 507188 non-null object
18 Intersection 481401 non-null object
19 CNN 481401 non-null float64
20 Police District 507188 non-null object
21 Analysis Neighborhood 481299 non-null object
22 Supervisor District 481401 non-null float64
23 Latitude 481401 non-null float64
24 Longitude 481401 non-null float64
25 Point 481401 non-null object
26 Neighborhoods 470970 non-null float64
27 ESNCAG - Boundary File 5465 non-null float64
28 Central Market/Tenderloin Boundary Polygon - Updated 65013 non-null float64
29 Civic Center Harm Reduction Project Boundary 64798 non-null float64
30 HSOC Zones as of 2018-06-05 108172 non-null float64
31 Invest In Neighborhoods (IIN) Areas 0 non-null float64
32 Current Supervisor Districts 481316 non-null float64
33 Current Police Districts 480778 non-null float64
dtypes: float64(14), int64(4), object(16)
memory usage: 131.6+ MB
#crimes.sort_values('Incident Date').tail()
#print(crimes[crimes['Incident Date']=='27/9/2021'])
crimes["Incident Date"]= pd.to_datetime(crimes["Incident Date"],format="%d/%m/%Y")
crimes.sort_values('Incident Date').tail()
crimes_grouped=crimes.groupby('Incident Date').size()
crimes_df=pd.DataFrame(crimes_grouped)
crimes_df = crimes_df.rename(columns={0: 'crimes'})
crimes_df.head()
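The three steps above (groupby, wrap in a DataFrame, rename column `0`) can be collapsed into one chain. A sketch on a hypothetical incident log standing in for the crimes table:

```python
import pandas as pd

# Hypothetical incident log standing in for the crimes table.
toy = pd.DataFrame({
    "Incident Date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-02"]),
    "Incident Category": ["Theft", "Assault", "Theft"],
})

# size() counts rows per date; rename() names the Series directly,
# so no rename(columns={0: 'crimes'}) step is needed.
crimes_df = (
    toy.groupby("Incident Date")
       .size()
       .rename("crimes")
       .to_frame()
)
print(crimes_df["crimes"].tolist())  # [2, 1]
```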
weath_crim = weather_new.merge(crimes_df, left_on='DATE', right_on='Incident Date', how='left')
weath_crim.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1362 entries, 0 to 1361
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STATION 1362 non-null object
1 NAME 1362 non-null object
2 LATITUDE 1362 non-null float64
3 LONGITUDE 1362 non-null float64
4 ELEVATION 1362 non-null float64
5 DATE 1362 non-null datetime64[ns]
6 PRCP 1362 non-null float64
7 PRCP_ATTRIBUTES 1362 non-null object
8 SNOW 60 non-null float64
9 SNOW_ATTRIBUTES 60 non-null object
10 SNWD 268 non-null float64
11 SNWD_ATTRIBUTES 268 non-null object
12 TAVG 1362 non-null float64
13 TAVG_ATTRIBUTES 0 non-null float64
14 TMAX 1362 non-null float64
15 TMAX_ATTRIBUTES 1362 non-null object
16 TMIN 1362 non-null float64
17 TMIN_ATTRIBUTES 1362 non-null object
18 crimes 1362 non-null int64
dtypes: datetime64[ns](1), float64(10), int64(1), object(7)
memory usage: 212.8+ KB
# comment: well done!
weath_crim.head()
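The merge above passes `right_on='Incident Date'` even though that name lives on the index of `crimes_df`; pandas resolves index levels in `merge` since 0.23, but `right_index=True` states the intent directly. A sketch with toy stand-ins for `weather_new` and the per-day counts:

```python
import pandas as pd

# Toy stand-ins for weather_new and the per-day counts.
weather_toy = pd.DataFrame({
    "DATE": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "TAVG": [13.0, 14.0],
})
counts_toy = (
    pd.Series([410, 395],
              index=pd.to_datetime(["2018-01-01", "2018-01-02"]),
              name="crimes")
    .rename_axis("Incident Date")
    .to_frame()
)

# right_index=True joins on the counts' DatetimeIndex explicitly, rather
# than relying on merge resolving 'Incident Date' as an index level.
merged = weather_toy.merge(counts_toy, left_on="DATE",
                           right_index=True, how="left")
print(merged["crimes"].tolist())  # [410, 395]
```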
number_of_missing_fin = weath_crim['crimes'].isna().sum()
print(number_of_missing_fin)
0
# comment: nice check on NaN values!
average_temp = weath_crim['TAVG'].mean()
max_temp=weath_crim['TAVG'].max()
average_crime = weath_crim['crimes'].mean()
print('The average number of crimes per day in SF is', average_crime)
print('The average daily temperature in SF is', average_temp)
print('The max daily average temp in SF over the period is', max_temp)
The average number of crimes per day in SF is 371.6108663729809
The average daily temperature in SF is 14.489170337738619
The max daily average temp in SF over the period is 29.75
# comment: why not use TAVG as mean temperature instead of TMAX?
# this part may be unnecessary for
sns.regplot(data=weath_crim, x='TAVG', y='crimes')
plt.show()
# comment: why use TMAX instead of TAVG hahaha?
# the graph does not show a very strong positive correlation
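The visual impression of a weak relationship can be put to a number with `Series.corr` (Pearson r). A sketch on synthetic data with a deliberately weak positive link; in the notebook itself the real `TAVG` and `crimes` columns of `weath_crim` would be used:

```python
import numpy as np
import pandas as pd

# Synthetic data with a deliberately weak positive link; the real
# TAVG/crimes columns of weath_crim would be used in the notebook.
rng = np.random.default_rng(0)
tavg = rng.normal(14.5, 4.0, size=200)
crimes = 350 + 2.0 * tavg + rng.normal(0.0, 40.0, size=200)
df = pd.DataFrame({"TAVG": tavg, "crimes": crimes})

# Pearson r: values near 0 indicate a weak linear relationship.
r = df["TAVG"].corr(df["crimes"])
print(round(r, 3))
```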
!pip install statsmodels
Successfully installed patsy-0.5.2 statsmodels-0.13.0
import statsmodels.api as sm
# Fit and summarize OLS model (note: no constant is added here, so the
# R-squared reported below is uncentered and overstates the fit)
mod = sm.OLS(weath_crim.TMAX, weath_crim.crimes)
res = mod.fit()
print(res.summary(alpha = 0.01))
OLS Regression Results
=======================================================================================
Dep. Variable: TMAX R-squared (uncentered): 0.928
Model: OLS Adj. R-squared (uncentered): 0.928
Method: Least Squares F-statistic: 1.762e+04
Date: Sun, 03 Oct 2021 Prob (F-statistic): 0.00
Time: 10:05:29 Log-Likelihood: -4100.1
No. Observations: 1362 AIC: 8202.
Df Residuals: 1361 BIC: 8207.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.005 0.995]
------------------------------------------------------------------------------
crimes 0.0469 0.000 132.745 0.000 0.046 0.048
==============================================================================
Omnibus: 162.614 Durbin-Watson: 0.565
Prob(Omnibus): 0.000 Jarque-Bera (JB): 261.270
Skew: 0.817 Prob(JB): 1.84e-57
Kurtosis: 4.390 Cond. No. 1.00
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Comment: Generally very well done! :)