In [ ]:
Introduction
Automatidata has contracted with the New York Taxi and Limousine Commission (TLC)
to conduct a study of the data collected by the TLC. The goal of this study is to
provide the analytics that will drive the creation of an TLC application. This
application will let TLC clients estimate potential fares. Secondary goals are to
provide a clean dataset to the TLC and provide any insights gained during the analysis.
Milestones
Planning - We anticipate that the planning stage will take one week. This includes
determining the methods and tools that will be used. It also provides time to
communicate with stakeholders to understand their needs.
NOTE: It has already been determined that Python will be the primary analysis tool.
Analysis - It is anticipated that the analysis phase will take two weeks. In this
phase, Automatidata will perform a cross-team exploratory data analysis, then clean and
structure the data.
Construct - This phase will take two weeks. In this stage, the data will be analyzed to
uncover the hidden relationships used to make recommendations.
Execute - This final phase will take one week. Automatidata will present the findings
to stakeholders, then incorporate feedback into the final version.
EDA
The data was received from the TLC in CSV format. Automatidata used the following methods to explore the data:
In [ ]:
The following method was used to import the data into Python:
In [4]:
import numpy as np
import pandas as pd
tlc_data = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')
In [ ]:
The following method was used to view a sample of the data:
In [75]:
tlc_data.head(10)
Out[75]:
| Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99495069 | 2 | 11/24/17 4:32 | 11/24/17 4:32 | 1 | 0.01 | 1 | N | 263 | 263 | 2 | 2.5 | 0.5 | 0.5 | 0.00 | 0.00 | 0.3 | 3.80 |
| 1 | 67271410 | 2 | 8/8/17 23:28 | 8/8/17 23:29 | 2 | 0.01 | 1 | N | 132 | 132 | 2 | 2.5 | 0.5 | 0.5 | 0.00 | 0.00 | 0.3 | 3.80 |
| 2 | 26279873 | 2 | 3/31/17 5:29 | 3/31/17 5:29 | 1 | 0.01 | 1 | N | 249 | 249 | 2 | 2.5 | 0.5 | 0.5 | 0.00 | 0.00 | 0.3 | 3.80 |
| 3 | 98165974 | 2 | 11/19/17 7:17 | 11/19/17 7:17 | 1 | 0.01 | 2 | N | 264 | 239 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 52.80 |
| 4 | 39498898 | 2 | 5/16/17 13:33 | 5/16/17 13:33 | 1 | 0.01 | 2 | N | 100 | 100 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 52.80 |
| 5 | 105577859 | 2 | 12/13/17 12:19 | 12/13/17 12:19 | 1 | 0.01 | 2 | N | 132 | 132 | 1 | 52.0 | 0.0 | 0.5 | 17.57 | 5.76 | 0.3 | 76.13 |
| 6 | 59839344 | 2 | 7/14/17 6:09 | 7/14/17 6:11 | 1 | 0.01 | 2 | N | 132 | 132 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 5.76 | 0.3 | 58.56 |
| 7 | 111191930 | 2 | 1/14/17 19:04 | 1/14/17 19:05 | 2 | 0.02 | 1 | N | 106 | 106 | 2 | 2.5 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 3.30 |
| 8 | 111339767 | 2 | 1/15/17 5:04 | 1/15/17 5:04 | 1 | 0.02 | 1 | N | 264 | 42 | 2 | 2.5 | 0.5 | 0.5 | 0.00 | 0.00 | 0.3 | 3.80 |
| 9 | 75746755 | 2 | 9/9/17 13:29 | 9/9/17 13:29 | 3 | 0.02 | 1 | N | 162 | 162 | 2 | 2.5 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 3.30 |
In [ ]:
The following method was used to discover column names, data types, etc.
In [76]:
tlc_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 22534 entries, 0 to 22533 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 22534 non-null int64 1 VendorID 22534 non-null int64 2 tpep_pickup_datetime 22534 non-null object 3 tpep_dropoff_datetime 22534 non-null object 4 passenger_count 22534 non-null int64 5 trip_distance 22534 non-null float64 6 RatecodeID 22534 non-null int64 7 store_and_fwd_flag 22534 non-null object 8 PULocationID 22534 non-null int64 9 DOLocationID 22534 non-null int64 10 payment_type 22534 non-null int64 11 fare_amount 22534 non-null float64 12 extra 22534 non-null float64 13 mta_tax 22534 non-null float64 14 tip_amount 22534 non-null float64 15 tolls_amount 22534 non-null float64 16 improvement_surcharge 22534 non-null float64 17 total_amount 22534 non-null float64 dtypes: float64(8), int64(7), object(3) memory usage: 3.1+ MB
In [ ]:
The following method was used to calculate descriptive statistics:
In [77]:
tlc_data.describe()
Out[77]:
| Unnamed: 0 | VendorID | passenger_count | trip_distance | RatecodeID | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2.253400e+04 | 22534.000000 | 22534.000000 | 22534.000000 | 22534.000000 | 22534.000000 | 22534.000000 | 22534.000000 | 22534.000000 | 22534.000000 | 22534.000000 | 22534.000000 | 22534.000000 | 2.253400e+04 | 22534.000000 |
| mean | 5.676140e+07 | 1.556803 | 1.643694 | 2.933062 | 1.031907 | 162.353288 | 161.453359 | 1.332653 | 12.977239 | 0.333696 | 0.498669 | 1.831764 | 0.309110 | 3.000000e-01 | 16.255795 |
| std | 3.273716e+07 | 0.496774 | 1.286401 | 3.656285 | 0.229813 | 66.584149 | 70.089471 | 0.489831 | 12.666115 | 0.461663 | 0.025767 | 2.767018 | 1.384127 | 5.551238e-17 | 15.558575 |
| min | 1.212700e+04 | 1.000000 | 0.000000 | 0.010000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000e-01 | 3.300000 |
| 25% | 2.855514e+07 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 114.000000 | 112.000000 | 1.000000 | 6.500000 | 0.000000 | 0.500000 | 0.000000 | 0.000000 | 3.000000e-01 | 8.750000 |
| 50% | 5.673992e+07 | 2.000000 | 1.000000 | 1.630000 | 1.000000 | 162.000000 | 162.000000 | 1.000000 | 9.500000 | 0.000000 | 0.500000 | 1.360000 | 0.000000 | 3.000000e-01 | 11.800000 |
| 75% | 8.537328e+07 | 2.000000 | 2.000000 | 3.090000 | 1.000000 | 233.000000 | 233.000000 | 2.000000 | 14.500000 | 0.500000 | 0.500000 | 2.450000 | 0.000000 | 3.000000e-01 | 17.800000 |
| max | 1.134863e+08 | 2.000000 | 6.000000 | 33.960000 | 5.000000 | 265.000000 | 265.000000 | 4.000000 | 999.990000 | 4.500000 | 0.500000 | 200.000000 | 19.100000 | 3.000000e-01 | 1200.290000 |
In [ ]:
The following method sorts the fare amounts in descending order:
In [78]:
tlc_data_sort = tlc_data.sort_values(by='fare_amount', ascending=False)
tlc_data_sort.head(10)
Out[78]:
| Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15740 | 11157412 | 1 | 2/6/17 5:50 | 2/6/17 5:51 | 1 | 2.60 | 5 | N | 226 | 226 | 1 | 999.99 | 0.0 | 0.0 | 200.00 | 0.00 | 0.3 | 1200.29 |
| 22532 | 40523668 | 2 | 5/19/17 8:20 | 5/19/17 9:20 | 1 | 33.92 | 5 | N | 229 | 265 | 1 | 200.01 | 0.0 | 0.5 | 51.64 | 5.76 | 0.3 | 258.21 |
| 20427 | 111653084 | 1 | 1/1/17 23:53 | 1/1/17 23:53 | 1 | 7.30 | 5 | N | 1 | 1 | 1 | 152.00 | 0.0 | 0.0 | 0.00 | 0.00 | 0.3 | 152.30 |
| 22533 | 51810714 | 2 | 6/18/17 23:33 | 6/19/17 0:12 | 2 | 33.96 | 5 | N | 132 | 265 | 2 | 150.00 | 0.0 | 0.0 | 0.00 | 0.00 | 0.3 | 150.30 |
| 22510 | 101198443 | 2 | 11/30/17 10:41 | 11/30/17 11:31 | 1 | 25.50 | 5 | N | 132 | 265 | 2 | 140.00 | 0.0 | 0.5 | 0.00 | 16.26 | 0.3 | 157.06 |
| 22530 | 76319330 | 2 | 9/11/17 11:41 | 9/11/17 12:18 | 1 | 31.95 | 4 | N | 138 | 265 | 2 | 131.00 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 131.80 |
| 21745 | 51087145 | 1 | 6/16/17 18:30 | 6/16/17 19:18 | 2 | 12.50 | 5 | N | 211 | 265 | 1 | 120.00 | 0.0 | 0.0 | 5.00 | 12.50 | 0.3 | 137.80 |
| 22531 | 49894023 | 2 | 6/13/17 12:30 | 6/13/17 13:37 | 1 | 32.72 | 3 | N | 138 | 1 | 1 | 107.00 | 0.0 | 0.0 | 55.50 | 16.26 | 0.3 | 179.06 |
| 22389 | 93330154 | 1 | 11/4/17 13:32 | 11/4/17 14:18 | 2 | 19.80 | 5 | N | 265 | 230 | 1 | 105.00 | 0.0 | 0.0 | 8.00 | 2.64 | 0.3 | 115.94 |
| 353 | 91660295 | 2 | 10/30/17 11:23 | 10/30/17 11:23 | 1 | 0.32 | 5 | N | 264 | 83 | 1 | 100.00 | 0.0 | 0.5 | 25.20 | 0.00 | 0.3 | 126.00 |
In [ ]:
The following method sorts trip distance in descending order:
In [79]:
tlc_data_sort = tlc_data.sort_values(by='trip_distance', ascending=False)
tlc_data_sort.head(10)
Out[79]:
| Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22533 | 51810714 | 2 | 6/18/17 23:33 | 6/19/17 0:12 | 2 | 33.96 | 5 | N | 132 | 265 | 2 | 150.00 | 0.0 | 0.0 | 0.00 | 0.00 | 0.3 | 150.30 |
| 22532 | 40523668 | 2 | 5/19/17 8:20 | 5/19/17 9:20 | 1 | 33.92 | 5 | N | 229 | 265 | 1 | 200.01 | 0.0 | 0.5 | 51.64 | 5.76 | 0.3 | 258.21 |
| 22531 | 49894023 | 2 | 6/13/17 12:30 | 6/13/17 13:37 | 1 | 32.72 | 3 | N | 138 | 1 | 1 | 107.00 | 0.0 | 0.0 | 55.50 | 16.26 | 0.3 | 179.06 |
| 22530 | 76319330 | 2 | 9/11/17 11:41 | 9/11/17 12:18 | 1 | 31.95 | 4 | N | 138 | 265 | 2 | 131.00 | 0.0 | 0.5 | 0.00 | 0.00 | 0.3 | 131.80 |
| 22529 | 94052446 | 2 | 11/6/17 20:30 | 11/7/17 0:00 | 1 | 30.83 | 1 | N | 132 | 23 | 1 | 80.00 | 0.5 | 0.5 | 18.56 | 11.52 | 0.3 | 111.38 |
| 22528 | 90375786 | 1 | 10/26/17 14:45 | 10/26/17 16:12 | 1 | 30.50 | 1 | N | 132 | 220 | 1 | 90.50 | 0.0 | 0.5 | 19.85 | 8.16 | 0.3 | 119.31 |
| 22527 | 68023798 | 2 | 8/11/17 14:14 | 8/11/17 15:17 | 1 | 30.33 | 2 | N | 132 | 158 | 1 | 52.00 | 0.0 | 0.5 | 14.64 | 5.76 | 0.3 | 73.20 |
| 22526 | 77309977 | 2 | 9/14/17 13:44 | 9/14/17 14:34 | 1 | 28.23 | 2 | N | 13 | 132 | 1 | 52.00 | 0.0 | 0.5 | 4.40 | 5.76 | 0.3 | 62.96 |
| 22525 | 43431843 | 1 | 5/15/17 8:11 | 5/15/17 9:03 | 1 | 28.20 | 2 | N | 90 | 132 | 1 | 52.00 | 0.0 | 0.5 | 11.71 | 5.76 | 0.3 | 70.27 |
| 22524 | 51094874 | 2 | 6/16/17 18:51 | 6/16/17 19:41 | 1 | 27.97 | 2 | N | 261 | 132 | 2 | 52.00 | 4.5 | 0.5 | 0.00 | 5.76 | 0.3 | 63.06 |
In [ ]:
From these calculations: 1) The most expensive trips are not necessarily the longest ones; 2)
The longest trips are about 33 miles; 3) The data may require cleaning for outliers and missing
data.
In [ ]:
Data Cleaning
The following steps were taken to clean the data set:
Rows with 'trip_distance' values less than zero were removed.
Rows with 'fare_amount' values less than zero were removed.
The data was checked for null values. None were found.
The data was checked for duplicates. None were found.
In [80]:
a = tlc_data['trip_distance'].min()
b = tlc_data['trip_distance'].max()
c = tlc_data['trip_distance'].mean()
d = tlc_data['trip_distance'].std()
print('Minimum trip distance: ' + str(a))
print('Maximum trip distance: ' + str(b))
print('Mean trip distance: ' + str(c))
print('Standard deviation: ' + str(d))
Minimum trip distance: 0.01 Maximum trip distance: 33.96 Mean trip distance: 2.9330620395846276 Standard deviation: 3.6562845207479793
In [81]:
a = tlc_data['fare_amount'].min()
b = tlc_data['fare_amount'].max()
c = tlc_data['fare_amount'].mean()
d = tlc_data['fare_amount'].std()
print('Minimum fare amount: ' + str(a))
print('Maximum fare amount: ' + str(b))
print('Mean fare amount: ' + str(c))
print('Standard deviation: ' + str(d))
Minimum fare amount: 2.5 Maximum fare amount: 999.99 Mean fare amount: 12.9772388390876 Standard deviation: 12.666115277972198
In [30]:
tlc_data.max()
Out[30]:
Unnamed: 0 113486300 VendorID 2 tpep_pickup_datetime 12/31/2017 9:57:40 PM tpep_dropoff_datetime 12/31/2017 9:55:19 AM passenger_count 6 trip_distance 33.96 RatecodeID 99 store_and_fwd_flag Y PULocationID 265 DOLocationID 265 payment_type 4 fare_amount 999.99 extra 4.5 mta_tax 0.5 tip_amount 200.0 tolls_amount 19.1 improvement_surcharge 0.3 total_amount 1200.29 dtype: object
In [41]:
tlc_data['trip_distance'].std()
Out[41]:
np.float64(3.653171182833924)
In [55]:
correlation = np.corrcoef(tlc_data['trip_distance'],tlc_data['fare_amount'])
print(correlation)
[[1. 0.75659898] [0.75659898 1. ]]
In [66]:
correlation = np.corrcoef(tlc_data['passenger_count'],tlc_data['fare_amount'])
print(correlation)
[[1. 0.01061401] [0.01061401 1. ]]
In [5]:
tlc_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 22534 entries, 0 to 22533 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 22534 non-null int64 1 VendorID 22534 non-null int64 2 tpep_pickup_datetime 22534 non-null object 3 tpep_dropoff_datetime 22534 non-null object 4 passenger_count 22534 non-null int64 5 trip_distance 22534 non-null float64 6 RatecodeID 22534 non-null int64 7 store_and_fwd_flag 22534 non-null object 8 PULocationID 22534 non-null int64 9 DOLocationID 22534 non-null int64 10 payment_type 22534 non-null int64 11 fare_amount 22534 non-null float64 12 extra 22534 non-null float64 13 mta_tax 22534 non-null float64 14 tip_amount 22534 non-null float64 15 tolls_amount 22534 non-null float64 16 improvement_surcharge 22534 non-null float64 17 total_amount 22534 non-null float64 dtypes: float64(8), int64(7), object(3) memory usage: 3.1+ MB
In [7]:
tlc_data.isnull().sum()
Out[7]:
Unnamed: 0 0 VendorID 0 tpep_pickup_datetime 0 tpep_dropoff_datetime 0 passenger_count 0 trip_distance 0 RatecodeID 0 store_and_fwd_flag 0 PULocationID 0 DOLocationID 0 payment_type 0 fare_amount 0 extra 0 mta_tax 0 tip_amount 0 tolls_amount 0 improvement_surcharge 0 total_amount 0 dtype: int64
In [10]:
tlc_data_dup = tlc_data.duplicated().sum()
print(tlc_data_dup)
0
In [ ]:
Approximately 74.5% of the recorded trips were less than or equal to 3 miles:
In [26]:
sum(tlc_data.trip_distance <= 3) / tlc_data.count()
Out[26]:
Unnamed: 0 0.745318 VendorID 0.745318 tpep_pickup_datetime 0.745318 tpep_dropoff_datetime 0.745318 passenger_count 0.745318 trip_distance 0.745318 RatecodeID 0.745318 store_and_fwd_flag 0.745318 PULocationID 0.745318 DOLocationID 0.745318 payment_type 0.745318 fare_amount 0.745318 extra 0.745318 mta_tax 0.745318 tip_amount 0.745318 tolls_amount 0.745318 improvement_surcharge 0.745318 total_amount 0.745318 dtype: float64
In [29]:
sum(tlc_data.fare_amount <= 15) / tlc_data.count()
Out[29]:
Unnamed: 0 0.765998 VendorID 0.765998 tpep_pickup_datetime 0.765998 tpep_dropoff_datetime 0.765998 passenger_count 0.765998 trip_distance 0.765998 RatecodeID 0.765998 store_and_fwd_flag 0.765998 PULocationID 0.765998 DOLocationID 0.765998 payment_type 0.765998 fare_amount 0.765998 extra 0.765998 mta_tax 0.765998 tip_amount 0.765998 tolls_amount 0.765998 improvement_surcharge 0.765998 total_amount 0.765998 dtype: float64
In [ ]:
Conclusions
Approximately 74.5% of the documented trip were 3 miles or less.
Approximately 76.6% of the fares for documented trips were $15.00 or less.
There is a high positive correlation between 'fare_amount' and 'trip_distance'.
There is a very low positive correlation between 'passenger_count' and 'fare_amount'.
Therefore, we recommend that developers concentrate on accurately predicting fares for short trips.