# EDA for NYC Taxi Trip Duration

This notebook contains the code used to conduct exploratory data analysis on the `nyc_taxi-trip_duration.csv` dataset.

## 1. Import the necessary libraries

## 2. Import the Dataset
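The loading step can be sketched as follows, assuming pandas is used. The two-row `SAMPLE_CSV` string and the `load_trips` helper are hypothetical stand-ins so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
SAMPLE_CSV = """id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
id0000001,2,2016-02-29 16:40:21,2016-02-29 16:47:01,1,-73.9539,40.7789,-73.9639,40.7712,N,400
id0000002,1,2016-03-11 23:35:37,2016-03-11 23:53:57,2,-73.9883,40.7317,-73.9948,40.6949,N,1100
"""


def load_trips(source):
    """Read the trip data from a path or file-like object into a DataFrame."""
    return pd.read_csv(source)


df = load_trips(io.StringIO(SAMPLE_CSV))
# With the real file this would be: df = load_trips("nyc_taxi-trip_duration.csv")
print(df.head())
```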

## 3. Data Analysis

In this section we will start analyzing the dataset. We will use several functions and methods to determine the following:

- The size of the dataset
- The datatypes of each variable (feature)
- The number of missing values
- The number of unique values per datatype
- The number of unique values per feature (variable)
- Summary statistics for numerical and non-numerical variables

### 3.1 Variable Identification and Datatype

In this section we will use the `.info()` method to get a concise summary of the dataframe. We will also use the `.nunique()` and `.isnull()` methods to check for unique and missing values, respectively.

#### 3.1.1 Using the `.info()` method

After running the `.info()` method we can determine the following:

- The dataset has **729,322 rows** (entries) and **11 columns** (features).
- The datatypes of these variables are **four (4) floats**, **three (3) integers**, and **four (4) objects**.
- The dataset has **no null values**.
- The dataset has two features, **pickup_datetime** and **dropoff_datetime**, stored with the "object" datatype. We should change the datatype of these variables to "datetime".
- Once we have changed the datatype of the **pickup_datetime** and **dropoff_datetime** variables we can create additional features for this dataset.
- The features **pickup_longitude**, **pickup_latitude**, **dropoff_longitude**, and **dropoff_latitude** can be used to create a new feature containing the distance of each trip.
- Finally, we can state that we have 10 independent variables and 1 dependent variable that we will call the **Target Variable**.
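A minimal sketch of how these facts can be read off a dataframe. The three-row frame below is a hypothetical sample with only a subset of the columns, so its numbers will not match the real dataset.

```python
import pandas as pd

# Hypothetical three-row sample with a subset of the dataset's columns.
df = pd.DataFrame({
    "id": ["id0000001", "id0000002", "id0000003"],
    "vendor_id": [2, 1, 2],
    "pickup_datetime": ["2016-02-29 16:40:21", "2016-03-11 23:35:37", "2016-02-21 17:59:33"],
    "pickup_longitude": [-73.9539, -73.9883, -73.9973],
    "trip_duration": [400, 1100, 1635],
})

df.info()                                # column names, non-null counts, dtypes
n_rows, n_cols = df.shape                # size of the dataset
dtype_counts = df.dtypes.value_counts()  # number of columns per datatype
print(n_rows, n_cols)
print(dtype_counts)
```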

#### 3.1.2 Checking for Null values

After using `.isnull().sum()` we can safely conclude that there are no missing values.
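The null check can be sketched like this, again on a small hypothetical sample rather than the real data:

```python
import pandas as pd

# Hypothetical sample; the real dataframe has 11 columns.
df = pd.DataFrame({
    "id": ["id0000001", "id0000002", "id0000003"],
    "passenger_count": [1, 2, 1],
    "store_and_fwd_flag": ["N", "N", "Y"],
})

missing_per_column = df.isnull().sum()        # one count per column
total_missing = int(missing_per_column.sum()) # overall count
print(missing_per_column)
print("Total missing values:", total_missing)
```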

#### 3.1.3 Checking for Unique Values

After using the `.nunique()` method we can state the following:

- There are no duplicate values in the `id` column.
- There are only two vendors: 1 and 2.
- There are only two values in the `store_and_fwd_flag` column, which we know are "Y" and "N".
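These checks could look roughly like the following. The sample values are hypothetical, and `is_unique` is used here as one convenient way to confirm that `id` has no duplicates.

```python
import pandas as pd

# Hypothetical sample standing in for the real dataframe.
df = pd.DataFrame({
    "id": ["id0000001", "id0000002", "id0000003"],
    "vendor_id": [2, 1, 2],
    "store_and_fwd_flag": ["N", "N", "Y"],
})

unique_counts = df.nunique()                  # distinct values per column
id_is_unique = df["id"].is_unique             # True when no id repeats
flag_values = sorted(df["store_and_fwd_flag"].unique().tolist())
print(unique_counts)
print(id_is_unique, flag_values)
```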

### 3.2 Getting Summary Statistics for Numerical and Non-Numerical Variables

In this subsection we will use the `.describe()` method to calculate basic statistics. These summary statistics include the min, mean, max, quartiles, and standard deviation for the numerical variables, and the number of unique values and the frequency of the most common value for the non-numerical variables. We will also use the `.describe()` method to analyze the target variable `trip_duration`.

#### 3.2.1 Summary Statistics for Numerical Variables

After using the `.describe()` method we can determine the following:

- It confirms that there are no null values in the numerical variables.
- The `vendor_id` column appears to have only two values, which suggests that there are only two vendors.
- The `passenger_count` column shows that most taxi trips carry 1-2 passengers. The maximum number of passengers is 9. We will need to check for outliers when we conduct our univariate analysis.
- The `trip_duration` column gives the duration of each trip in seconds. We will analyze this column separately to check for outliers.

#### 3.2.2 Summary Statistics for Non-Numerical Variables

After using the `.describe()` method on the non-numerical variables we can determine the following:

- There are no null values in the non-numerical columns.
- The `id` column shows a frequency of 1, meaning that every value is unique.
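The summaries discussed in section 3.2 can be sketched as below. The frame is a hypothetical sample, and `exclude=[np.number]` is one way to restrict `.describe()` to the non-numerical columns; the notebook may have selected them differently.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real dataframe.
df = pd.DataFrame({
    "id": ["id0000001", "id0000002", "id0000003"],
    "vendor_id": [2, 1, 2],
    "passenger_count": [1, 2, 9],
    "store_and_fwd_flag": ["N", "N", "Y"],
    "trip_duration": [400, 1100, 1635],
})

numeric_summary = df.describe()                         # count, mean, std, min, quartiles, max
non_numeric_summary = df.describe(exclude=[np.number])  # count, unique, top, freq
target_summary = df["trip_duration"].describe()         # the target variable on its own

print(numeric_summary)
print(non_numeric_summary)
print(target_summary)
```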

### 3.3 Converting Features to the right Datatype

In this section we will convert the `pickup_datetime` and `dropoff_datetime` features to the datetime datatype. Converting these two features to the right datatype will allow us to derive new features using the **datetime** library.
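A minimal sketch of the conversion, assuming `pd.to_datetime` is used (the text above mentions the datetime library, so the notebook may do this differently). The sample timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical sample with timestamps stored as plain strings.
df = pd.DataFrame({
    "pickup_datetime": ["2016-02-29 16:40:21", "2016-03-11 23:35:37"],
    "dropoff_datetime": ["2016-02-29 16:47:01", "2016-03-11 23:53:57"],
})

# Parse both columns into datetime64 values.
for column in ["pickup_datetime", "dropoff_datetime"]:
    df[column] = pd.to_datetime(df[column])

print(df.dtypes)

# Once converted, datetime arithmetic works directly on the columns.
elapsed_seconds = (df["dropoff_datetime"] - df["pickup_datetime"]).dt.total_seconds()
print(elapsed_seconds.tolist())
```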

### 3.4 Creating New Features

In this section we will be creating new features from existing ones. We will:

- Create new features using the `pickup_datetime` and `dropoff_datetime` columns.
- Create a new feature called `distance` using the `pickup_longitude`, `pickup_latitude`, `dropoff_longitude`, and `dropoff_latitude` columns.
- Create a new feature called `average_speed` using the `distance` and `trip_duration` columns.

#### 3.4.1 Creating Features based on Date and Time

In this section we will be creating eight (8) new features using the `pickup_datetime` and `dropoff_datetime` columns.

Now that we have the pickup time and the dropoff time we can determine the part of the day they fall into. We divide the day into four parts:

- Morning, which starts at 6:01 am and ends at 12:00 pm
- Afternoon, which starts at 12:01 pm and ends at 6:00 pm
- Evening, which starts at 6:01 pm and ends at 9:00 pm
- Night, which starts at 9:01 pm and ends at 6:00 am
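The part-of-day boundaries listed above can be sketched as a small helper. The function name `part_of_day` and the choice to ignore seconds are assumptions made for illustration, not necessarily what the notebook does.

```python
import pandas as pd


def part_of_day(ts):
    """Map a timestamp to Morning, Afternoon, Evening, or Night."""
    minutes = ts.hour * 60 + ts.minute  # minutes since midnight; seconds are ignored
    if 6 * 60 < minutes <= 12 * 60:     # 6:01 am to 12:00 pm
        return "Morning"
    if 12 * 60 < minutes <= 18 * 60:    # 12:01 pm to 6:00 pm
        return "Afternoon"
    if 18 * 60 < minutes <= 21 * 60:    # 6:01 pm to 9:00 pm
        return "Evening"
    return "Night"                      # 9:01 pm to 6:00 am


# Hypothetical pickup times covering each boundary case.
pickups = pd.to_datetime(pd.Series([
    "2016-01-01 06:00:00",
    "2016-01-01 06:01:00",
    "2016-01-01 14:30:00",
    "2016-01-01 18:01:00",
    "2016-01-01 23:15:00",
]))
pickup_part_of_day = pickups.apply(part_of_day)
print(pickup_part_of_day.tolist())
```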

#### 3.4.2 Creating a feature based on Location

In this section we will use the `pickup_longitude`, `pickup_latitude`, `dropoff_longitude`, and `dropoff_latitude` columns to create a new feature named **distance**. We will use the Haversine Python library to calculate the distance (in miles) between the pickup coordinates and the dropoff coordinates.
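For reference, the haversine formula itself can be written directly with NumPy. This is a stand-in for the library call, not the notebook's code; the helper name `haversine_miles`, the Earth radius constant, and the sample coordinates are assumptions.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius (assumption)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Hypothetical coordinates: one trip that does not move, one spanning 1 degree of latitude.
df = pd.DataFrame({
    "pickup_latitude": [40.75, 40.0],
    "pickup_longitude": [-73.98, -74.0],
    "dropoff_latitude": [40.75, 41.0],
    "dropoff_longitude": [-73.98, -74.0],
})

df["distance"] = haversine_miles(
    df["pickup_latitude"], df["pickup_longitude"],
    df["dropoff_latitude"], df["dropoff_longitude"],
)
print(df["distance"].round(2).tolist())
```

Because the function works on whole columns at once, it avoids a row-by-row loop over the dataset.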

**Observation:** From this exercise we can see the heavy traffic that New York taxi drivers have to face every day!

#### 3.4.3 Checking the new features

Now that we have finished creating the new features, it is a good idea to inspect the updated dataframe `df`.

### 3.5 Univariate Analysis

Now that we are very familiar with the dataset and have created some useful features it is time to start analyzing each variable.

#### 3.5.1 Univariate Analysis for Target Variable `trip_duration`

The summary statistics for `trip_duration` show us:

- There are no null values.
- On average, a trip takes 9523.23 seconds.
- We see a considerable difference between the maximum trip duration and the 75% quantile. We will need to check for outliers in the `trip_duration` variable.

**Observation:** As we can see, the variable `trip_duration` has high kurtosis and is skewed. These results indicate that the variable is not normally distributed. To support this finding we tested for normality using the Shapiro-Wilk test, which confirmed that the variable `trip_duration` is not normally distributed.
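A sketch of how skewness, kurtosis, and the Shapiro-Wilk test might be computed, assuming SciPy is available. The durations are synthetic (drawn from a log-normal distribution purely for illustration), and the 5,000-row sample is an assumption: SciPy's documentation notes that the p-value may be inaccurate above that size.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed durations; NOT the real trip_duration values.
rng = np.random.default_rng(0)
trip_duration = pd.Series(
    rng.lognormal(mean=6.5, sigma=0.8, size=2000), name="trip_duration"
)

skewness = trip_duration.skew()  # 0 for a symmetric distribution
kurtosis = trip_duration.kurt()  # excess kurtosis; 0 for a normal distribution

# Test at most 5,000 rows so the p-value stays reliable.
sample = trip_duration.sample(n=min(len(trip_duration), 5000), random_state=0)
w_stat, p_value = stats.shapiro(sample)

print(f"skew={skewness:.2f} kurtosis={kurtosis:.2f} W={w_stat:.3f} p={p_value:.3g}")
looks_normal = p_value > 0.05  # small p-value: reject normality
```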

#### 3.5.2 Univariate Analysis for `vendor_id`

**Observation:** We can see that `vendor_id 2` makes more trips than `vendor_id 1`.

#### 3.5.3 Univariate Analysis for `passenger_count`

**Observation:** From the data we can safely conclude that most of the trips will include just one passenger.
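Counts like those behind the `vendor_id` and `passenger_count` observations can be obtained with `value_counts()`. The five-row sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample standing in for the real dataframe.
df = pd.DataFrame({
    "vendor_id": [2, 1, 2, 2, 1],
    "passenger_count": [1, 1, 2, 1, 5],
})

vendor_counts = df["vendor_id"].value_counts()                         # trips per vendor
passenger_share = df["passenger_count"].value_counts(normalize=True)  # fraction of trips per count

print(vendor_counts)
print(passenger_share)
```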

#### 3.5.4 Univariate Analysis for `pickup_by_day` and `dropoff_by_day`

**Observation:** We can see that Friday is the busiest day of the week, followed by Saturday and Thursday.
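One way the `pickup_by_day` feature and its counts could be derived, assuming the column holds day names. The three pickup times are hypothetical.

```python
import pandas as pd

# Hypothetical pickups: two on a Friday, one on a Saturday.
df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-01 08:00:00",
        "2016-01-02 09:30:00",
        "2016-01-08 19:45:00",
    ]),
})

df["pickup_by_day"] = df["pickup_datetime"].dt.day_name()  # e.g. "Friday"
trips_per_day = df["pickup_by_day"].value_counts()         # sorted from busiest to quietest
print(trips_per_day)
```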

#### 3.5.5 Univariate Analysis for `distance`

**Observation:** We can see that there are 2,901 trips that have a distance of zero. We should investigate why. Let's see if we can get an idea of why they have a distance of 0 or close to 0.

**Observation:** We can see that all of the trips with a distance of zero have the `store_and_fwd_flag` set to **No**. We need to inspect this observation further.
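The zero-distance inspection might look like this. The sample rows are hypothetical and only illustrate the filtering and the breakdown by `store_and_fwd_flag`.

```python
import pandas as pd

# Hypothetical sample: two zero-distance trips and two ordinary ones.
df = pd.DataFrame({
    "distance": [0.0, 0.0, 1.2, 3.4],          # miles
    "trip_duration": [120, 900, 400, 1100],    # seconds
    "store_and_fwd_flag": ["N", "N", "N", "Y"],
})

zero_distance = df[df["distance"] == 0]
flag_breakdown = zero_distance["store_and_fwd_flag"].value_counts()
duration_summary = zero_distance["trip_duration"].describe()

print(len(zero_distance), "trips with zero distance")
print(flag_breakdown)
print(duration_summary)
```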

#### 3.5.6 Univariate Analysis for `average_speed`

**Observations:** Again we see that 2,901 trips have an average speed of zero. We also see that the max speed is 3504.84 miles per hour, which is not possible. We need to explore both the minimum and the maximum to see if they are outliers or errors.

**Observations:** Again we see some inconsistencies in the max trip. We can see that the trip had a distance of 771 miles but a trip duration of only one hour. We believe this entry is an error and must be removed from the dataset.
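A sketch of how `average_speed` could be computed in miles per hour and how suspicious rows could be flagged. The threshold `MAX_PLAUSIBLE_MPH` is a hypothetical cut-off chosen for illustration, and the rows are made up.

```python
import pandas as pd

# Hypothetical sample: a zero-distance trip, an ordinary trip, and an implausible one.
df = pd.DataFrame({
    "distance": [0.0, 1.5, 771.0],        # miles
    "trip_duration": [300, 600, 3600],    # seconds
})

# Convert seconds to hours, then divide distance by time.
df["average_speed"] = df["distance"] / (df["trip_duration"] / 3600)

MAX_PLAUSIBLE_MPH = 100  # hypothetical threshold, not taken from the analysis
suspicious = df[(df["average_speed"] == 0) | (df["average_speed"] > MAX_PLAUSIBLE_MPH)]

print(df)
print(suspicious.index.tolist())
```

Whether flagged rows are dropped or kept is a separate decision; the sketch only identifies them.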

### 3.6 Bivariate Analysis

In this section we will see the relationship between the target variable and selected features.

**Observation:** We can see that the afternoon is the busiest time of the day. Most of the demand starts between 2:00 pm and 3:30 pm. Each trip lasts on average around 16.67 minutes.
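A relationship like this can be explored by grouping the target variable by pickup hour. The column name `pickup_hour` and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: two afternoon trips and one morning trip.
df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-01 14:05:00",
        "2016-01-01 14:40:00",
        "2016-01-01 09:10:00",
    ]),
    "trip_duration": [800, 1000, 600],  # seconds
})

df["pickup_hour"] = df["pickup_datetime"].dt.hour  # hypothetical helper column

# Number of trips and mean duration for each pickup hour.
by_hour = df.groupby("pickup_hour")["trip_duration"].agg(trips="count", mean_seconds="mean")
by_hour["mean_minutes"] = by_hour["mean_seconds"] / 60
print(by_hour)
```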

### 3.7 Multivariate Analysis

For this section we will calculate the correlation between features and visualize it using a heat map.
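A minimal sketch of the correlation step, assuming seaborn and matplotlib are used for the plot. The helper `plot_heatmap` is hypothetical and imports the plotting libraries lazily, so the correlation matrix can be computed even where they are not installed. The sample rows are made up.

```python
import pandas as pd

# Hypothetical sample with three numerical features and one non-numerical feature.
df = pd.DataFrame({
    "distance": [1.0, 2.0, 3.0, 4.0],
    "trip_duration": [300, 620, 880, 1250],
    "passenger_count": [1, 2, 1, 3],
    "store_and_fwd_flag": ["N", "N", "Y", "N"],
})

# Pearson correlation between the numerical columns only.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))


def plot_heatmap(corr_matrix):
    """Draw the correlation matrix as an annotated heat map."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_title("Feature correlation")
    plt.tight_layout()
    plt.show()
```

Calling `plot_heatmap(corr)` in a notebook cell renders the figure.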