In [ ]:
Introduction

Automatidata has contracted with the New York Taxi and Limousine Commission (TLC)
to conduct a study of the data collected by the TLC. The goal of this study is to 
provide the analytics that will drive the creation of an TLC application. This 
application will let TLC clients estimate potential fares. Secondary goals are to 
provide a clean dataset to the TLC and provide any insights gained during the analysis.

Milestones 

Planning - We anticipate that the planning stage will take one week. This includes
determining the methods and tools that will be used. It also provides time to 
communicate with stakeholders to understand their needs.

NOTE: It has already been determined that Python will be the primary analysis tool.

Analysis - It is anticipated that the analysis phase will take two weeks. In this 
phase, Automatidata will perform a cross-team exploratory data analysis, then clean and 
structure the data.

Construct - This phase will take two weeks. In this stage, the data will be analyzed to 
uncover the hidden relationships used to make recommendations.

Execute - This final phase will take one week. Automatidata will present the findings 
to stakeholders, then incorporate feedback into the final version.

EDA

The data was received from the TLC in CSV format. Automatidata used the following methods to explore the data:
In [ ]:
The following method was used to import the data into Python:
In [4]:
import numpy as np
import pandas as pd

tlc_data = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')
In [ ]:
The following method was used to view a sample of the data:
In [75]:
tlc_data.head(10)
Out[75]:
Unnamed: 0 VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 99495069 2 11/24/17 4:32 11/24/17 4:32 1 0.01 1 N 263 263 2 2.5 0.5 0.5 0.00 0.00 0.3 3.80
1 67271410 2 8/8/17 23:28 8/8/17 23:29 2 0.01 1 N 132 132 2 2.5 0.5 0.5 0.00 0.00 0.3 3.80
2 26279873 2 3/31/17 5:29 3/31/17 5:29 1 0.01 1 N 249 249 2 2.5 0.5 0.5 0.00 0.00 0.3 3.80
3 98165974 2 11/19/17 7:17 11/19/17 7:17 1 0.01 2 N 264 239 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
4 39498898 2 5/16/17 13:33 5/16/17 13:33 1 0.01 2 N 100 100 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
5 105577859 2 12/13/17 12:19 12/13/17 12:19 1 0.01 2 N 132 132 1 52.0 0.0 0.5 17.57 5.76 0.3 76.13
6 59839344 2 7/14/17 6:09 7/14/17 6:11 1 0.01 2 N 132 132 2 52.0 0.0 0.5 0.00 5.76 0.3 58.56
7 111191930 2 1/14/17 19:04 1/14/17 19:05 2 0.02 1 N 106 106 2 2.5 0.0 0.5 0.00 0.00 0.3 3.30
8 111339767 2 1/15/17 5:04 1/15/17 5:04 1 0.02 1 N 264 42 2 2.5 0.5 0.5 0.00 0.00 0.3 3.80
9 75746755 2 9/9/17 13:29 9/9/17 13:29 3 0.02 1 N 162 162 2 2.5 0.0 0.5 0.00 0.00 0.3 3.30
In [ ]:
The following method was used to discover column names, data types, etc.
In [76]:
tlc_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22534 entries, 0 to 22533
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22534 non-null  int64  
 1   VendorID               22534 non-null  int64  
 2   tpep_pickup_datetime   22534 non-null  object 
 3   tpep_dropoff_datetime  22534 non-null  object 
 4   passenger_count        22534 non-null  int64  
 5   trip_distance          22534 non-null  float64
 6   RatecodeID             22534 non-null  int64  
 7   store_and_fwd_flag     22534 non-null  object 
 8   PULocationID           22534 non-null  int64  
 9   DOLocationID           22534 non-null  int64  
 10  payment_type           22534 non-null  int64  
 11  fare_amount            22534 non-null  float64
 12  extra                  22534 non-null  float64
 13  mta_tax                22534 non-null  float64
 14  tip_amount             22534 non-null  float64
 15  tolls_amount           22534 non-null  float64
 16  improvement_surcharge  22534 non-null  float64
 17  total_amount           22534 non-null  float64
dtypes: float64(8), int64(7), object(3)
memory usage: 3.1+ MB
In [ ]:
The following method was used to calculate descriptive statistics:
In [77]:
tlc_data.describe()
Out[77]:
Unnamed: 0 VendorID passenger_count trip_distance RatecodeID PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
count 2.253400e+04 22534.000000 22534.000000 22534.000000 22534.000000 22534.000000 22534.000000 22534.000000 22534.000000 22534.000000 22534.000000 22534.000000 22534.000000 2.253400e+04 22534.000000
mean 5.676140e+07 1.556803 1.643694 2.933062 1.031907 162.353288 161.453359 1.332653 12.977239 0.333696 0.498669 1.831764 0.309110 3.000000e-01 16.255795
std 3.273716e+07 0.496774 1.286401 3.656285 0.229813 66.584149 70.089471 0.489831 12.666115 0.461663 0.025767 2.767018 1.384127 5.551238e-17 15.558575
min 1.212700e+04 1.000000 0.000000 0.010000 1.000000 1.000000 1.000000 1.000000 2.500000 0.000000 0.000000 0.000000 0.000000 3.000000e-01 3.300000
25% 2.855514e+07 1.000000 1.000000 1.000000 1.000000 114.000000 112.000000 1.000000 6.500000 0.000000 0.500000 0.000000 0.000000 3.000000e-01 8.750000
50% 5.673992e+07 2.000000 1.000000 1.630000 1.000000 162.000000 162.000000 1.000000 9.500000 0.000000 0.500000 1.360000 0.000000 3.000000e-01 11.800000
75% 8.537328e+07 2.000000 2.000000 3.090000 1.000000 233.000000 233.000000 2.000000 14.500000 0.500000 0.500000 2.450000 0.000000 3.000000e-01 17.800000
max 1.134863e+08 2.000000 6.000000 33.960000 5.000000 265.000000 265.000000 4.000000 999.990000 4.500000 0.500000 200.000000 19.100000 3.000000e-01 1200.290000
In [ ]:
The following method sorts the fare amounts in descending order:
In [78]:
tlc_data_sort = tlc_data.sort_values(by='fare_amount', ascending=False)
tlc_data_sort.head(10)
Out[78]:
Unnamed: 0 VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
15740 11157412 1 2/6/17 5:50 2/6/17 5:51 1 2.60 5 N 226 226 1 999.99 0.0 0.0 200.00 0.00 0.3 1200.29
22532 40523668 2 5/19/17 8:20 5/19/17 9:20 1 33.92 5 N 229 265 1 200.01 0.0 0.5 51.64 5.76 0.3 258.21
20427 111653084 1 1/1/17 23:53 1/1/17 23:53 1 7.30 5 N 1 1 1 152.00 0.0 0.0 0.00 0.00 0.3 152.30
22533 51810714 2 6/18/17 23:33 6/19/17 0:12 2 33.96 5 N 132 265 2 150.00 0.0 0.0 0.00 0.00 0.3 150.30
22510 101198443 2 11/30/17 10:41 11/30/17 11:31 1 25.50 5 N 132 265 2 140.00 0.0 0.5 0.00 16.26 0.3 157.06
22530 76319330 2 9/11/17 11:41 9/11/17 12:18 1 31.95 4 N 138 265 2 131.00 0.0 0.5 0.00 0.00 0.3 131.80
21745 51087145 1 6/16/17 18:30 6/16/17 19:18 2 12.50 5 N 211 265 1 120.00 0.0 0.0 5.00 12.50 0.3 137.80
22531 49894023 2 6/13/17 12:30 6/13/17 13:37 1 32.72 3 N 138 1 1 107.00 0.0 0.0 55.50 16.26 0.3 179.06
22389 93330154 1 11/4/17 13:32 11/4/17 14:18 2 19.80 5 N 265 230 1 105.00 0.0 0.0 8.00 2.64 0.3 115.94
353 91660295 2 10/30/17 11:23 10/30/17 11:23 1 0.32 5 N 264 83 1 100.00 0.0 0.5 25.20 0.00 0.3 126.00
In [ ]:
The following method sorts trip distance in descending order:
In [79]:
tlc_data_sort = tlc_data.sort_values(by='trip_distance', ascending=False)
tlc_data_sort.head(10)
Out[79]:
Unnamed: 0 VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
22533 51810714 2 6/18/17 23:33 6/19/17 0:12 2 33.96 5 N 132 265 2 150.00 0.0 0.0 0.00 0.00 0.3 150.30
22532 40523668 2 5/19/17 8:20 5/19/17 9:20 1 33.92 5 N 229 265 1 200.01 0.0 0.5 51.64 5.76 0.3 258.21
22531 49894023 2 6/13/17 12:30 6/13/17 13:37 1 32.72 3 N 138 1 1 107.00 0.0 0.0 55.50 16.26 0.3 179.06
22530 76319330 2 9/11/17 11:41 9/11/17 12:18 1 31.95 4 N 138 265 2 131.00 0.0 0.5 0.00 0.00 0.3 131.80
22529 94052446 2 11/6/17 20:30 11/7/17 0:00 1 30.83 1 N 132 23 1 80.00 0.5 0.5 18.56 11.52 0.3 111.38
22528 90375786 1 10/26/17 14:45 10/26/17 16:12 1 30.50 1 N 132 220 1 90.50 0.0 0.5 19.85 8.16 0.3 119.31
22527 68023798 2 8/11/17 14:14 8/11/17 15:17 1 30.33 2 N 132 158 1 52.00 0.0 0.5 14.64 5.76 0.3 73.20
22526 77309977 2 9/14/17 13:44 9/14/17 14:34 1 28.23 2 N 13 132 1 52.00 0.0 0.5 4.40 5.76 0.3 62.96
22525 43431843 1 5/15/17 8:11 5/15/17 9:03 1 28.20 2 N 90 132 1 52.00 0.0 0.5 11.71 5.76 0.3 70.27
22524 51094874 2 6/16/17 18:51 6/16/17 19:41 1 27.97 2 N 261 132 2 52.00 4.5 0.5 0.00 5.76 0.3 63.06
In [ ]:
From these calculations: 1) The most expensive trips are not necessarily the longest ones; 2) 
The longest trips are about 33 miles; 3) The data may require cleaning for outliers and missing 
data.
In [ ]:
Data Cleaning

The following steps were taken to clean the data set:

Rows with 'trip_distance' values less than zero were removed.
Rows with 'fare_amount' values less than zero were removed.
The data was checked for null values. None were found.
The data was checked for duplicates. None were found.
In [80]:
a = tlc_data['trip_distance'].min()
b = tlc_data['trip_distance'].max()
c = tlc_data['trip_distance'].mean()
d = tlc_data['trip_distance'].std()

print('Minimum trip distance: ' + str(a))
print('Maximum trip distance: ' + str(b))
print('Mean trip distance: ' + str(c))
print('Standard deviation: ' + str(d))
Minimum trip distance: 0.01
Maximum trip distance: 33.96
Mean trip distance: 2.9330620395846276
Standard deviation: 3.6562845207479793
In [81]:
a = tlc_data['fare_amount'].min()
b = tlc_data['fare_amount'].max()
c = tlc_data['fare_amount'].mean()
d = tlc_data['fare_amount'].std()

print('Minimum fare amount: ' + str(a))
print('Maximum fare amount: ' + str(b))
print('Mean fare amount: ' + str(c))
print('Standard deviation: ' + str(d))
Minimum fare amount: 2.5
Maximum fare amount: 999.99
Mean fare amount: 12.9772388390876
Standard deviation: 12.666115277972198
In [30]:
tlc_data.max()
Out[30]:
Unnamed: 0                           113486300
VendorID                                     2
tpep_pickup_datetime     12/31/2017 9:57:40 PM
tpep_dropoff_datetime    12/31/2017 9:55:19 AM
passenger_count                              6
trip_distance                            33.96
RatecodeID                                  99
store_and_fwd_flag                           Y
PULocationID                               265
DOLocationID                               265
payment_type                                 4
fare_amount                             999.99
extra                                      4.5
mta_tax                                    0.5
tip_amount                               200.0
tolls_amount                              19.1
improvement_surcharge                      0.3
total_amount                           1200.29
dtype: object
In [41]:
tlc_data['trip_distance'].std()
Out[41]:
np.float64(3.653171182833924)
In [55]:
correlation = np.corrcoef(tlc_data['trip_distance'],tlc_data['fare_amount'])
print(correlation)
[[1.         0.75659898]
 [0.75659898 1.        ]]
In [66]:
correlation = np.corrcoef(tlc_data['passenger_count'],tlc_data['fare_amount'])
print(correlation)
[[1.         0.01061401]
 [0.01061401 1.        ]]
In [5]:
tlc_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22534 entries, 0 to 22533
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22534 non-null  int64  
 1   VendorID               22534 non-null  int64  
 2   tpep_pickup_datetime   22534 non-null  object 
 3   tpep_dropoff_datetime  22534 non-null  object 
 4   passenger_count        22534 non-null  int64  
 5   trip_distance          22534 non-null  float64
 6   RatecodeID             22534 non-null  int64  
 7   store_and_fwd_flag     22534 non-null  object 
 8   PULocationID           22534 non-null  int64  
 9   DOLocationID           22534 non-null  int64  
 10  payment_type           22534 non-null  int64  
 11  fare_amount            22534 non-null  float64
 12  extra                  22534 non-null  float64
 13  mta_tax                22534 non-null  float64
 14  tip_amount             22534 non-null  float64
 15  tolls_amount           22534 non-null  float64
 16  improvement_surcharge  22534 non-null  float64
 17  total_amount           22534 non-null  float64
dtypes: float64(8), int64(7), object(3)
memory usage: 3.1+ MB
In [7]:
tlc_data.isnull().sum()
Out[7]:
Unnamed: 0               0
VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
store_and_fwd_flag       0
PULocationID             0
DOLocationID             0
payment_type             0
fare_amount              0
extra                    0
mta_tax                  0
tip_amount               0
tolls_amount             0
improvement_surcharge    0
total_amount             0
dtype: int64
In [10]:
tlc_data_dup = tlc_data.duplicated().sum()
print(tlc_data_dup)
0
In [ ]:
Approximately 74.5% of the recorded trips were less than or equal to 3 miles:
In [26]:
sum(tlc_data.trip_distance <= 3) / tlc_data.count()
Out[26]:
Unnamed: 0               0.745318
VendorID                 0.745318
tpep_pickup_datetime     0.745318
tpep_dropoff_datetime    0.745318
passenger_count          0.745318
trip_distance            0.745318
RatecodeID               0.745318
store_and_fwd_flag       0.745318
PULocationID             0.745318
DOLocationID             0.745318
payment_type             0.745318
fare_amount              0.745318
extra                    0.745318
mta_tax                  0.745318
tip_amount               0.745318
tolls_amount             0.745318
improvement_surcharge    0.745318
total_amount             0.745318
dtype: float64
In [29]:
sum(tlc_data.fare_amount <= 15) / tlc_data.count()
Out[29]:
Unnamed: 0               0.765998
VendorID                 0.765998
tpep_pickup_datetime     0.765998
tpep_dropoff_datetime    0.765998
passenger_count          0.765998
trip_distance            0.765998
RatecodeID               0.765998
store_and_fwd_flag       0.765998
PULocationID             0.765998
DOLocationID             0.765998
payment_type             0.765998
fare_amount              0.765998
extra                    0.765998
mta_tax                  0.765998
tip_amount               0.765998
tolls_amount             0.765998
improvement_surcharge    0.765998
total_amount             0.765998
dtype: float64
In [ ]:
Conclusions

Approximately 74.5% of the documented trip were 3 miles or less.
Approximately 76.6% of the fares for documented trips were $15.00 or less.
There is a high positive correlation between 'fare_amount' and 'trip_distance'.
There is a very low positive correlation between 'passenger_count' and 'fare_amount'.

Therefore, we recommend that developers concentrate on accurately predicting fares for short trips.