### Hello World,

TLDR; Principal Component Analysis to find relationship between multiple variables for Chicago Taxi Fare dataset.

This assignment was given by Prof Patrick Healy at the University of Limerick for a Big Data and Visualisation module (CS6502)

### Task

In Lab 11, you have learned how to create and evaluate a machine learning model, then predict using the model. In this project you are asked to practice the skills on a different dataset – Chicago Taxi Trips, this dataset includes taxi trips from 2013 to the present, reported to the City of Chicago.

Your task here is to

1. use the python scikit-learn to investigate the similarities and relationships that PCA can provide. It will not be possible for you to analyze the entire data set in this way so you should perform your analysis on *the first 1000 rows* of the table. Ideally the output of this phase should guide the direction you take in the second, ML, part of the assignment. You will need to decide which columns should be part of your analysis and which should be ignored. See some of the tutorial links we have posted in the lecture slides for help in deciding what are appropriate columns to consider.

2. based on the information you gained from step 1, create a model to predict the taxi fare1(“fare” column in the dataset).

Note that you may need to clean the data, pick a list of features (feature engineering), and then design your model. Please email your zipped solution pack to the lecturer by the end of week 14 with subject “cs6502:ml proj”.

Note that your solution pack should contain

-the query you use to selected the first 1000 rows

-key steps for your PCA (setup, command, etc)

-the sql for model creation

-the sql to evaluate the model

-the SQL to predict using the model above

-link of the BigQuery commands you composed

-screenshot of the model evaluation report

You will be assessed on the quality of the information you glean and feed into the ML part and how you present this to us. Again, see how this is done from the tutorials. (Note, as the tutorials make clear the R statistical package has very good support for this type of analysis. If you wish to take this task on using R then this will be ok, too.)

### Solutions

Link to Google Colab .ipynb file

The following document contains a solution pack:

-the query you use to selected the first 1000 rows

-key steps for your PCA (setup, command, etc)

-the sql for model creation

-the sql to evaluate the model

-the sql to predict using the model above

-link of the BigQuery commands you composed

-screenshot of the model evaluation report

**The query you use to selected the first 1000 rows**

SELECT

fare ,

EXTRACT(DAYOFWEEK from trip_start_timestamp )

AS dayofweek,

EXTRACT(HOUR from trip_start_timestamp ) AS

hourofday,

trip_miles ,

trip_seconds ,

pickup_longitude ,

pickup_latitude ,

dropoff_longitude ,

dropoff_latitude ,

pickup_community_area ,

dropoff_community_area ,

trip_start_timestamp

FROM

`bigquery-public-data.chicago_taxi_trips.taxi_trips`

WHERE

trip_seconds>=60

AND trip_miles>0.5

AND fare>0

AND pickup_longitude IS NOT NULL

AND pickup_latitude IS NOT NULL

AND dropoff_longitude IS NOT NULL

AND dropoff_latitude IS NOT NULL

AND pickup_longitude <>

dropoff_longitude

AND pickup_latitude <> dropoff_latitude

AND pickup_community_area IS NOT NULL

AND dropoff_community_area IS NOT NULL

AND

MOD(ABS(FARM_FINGERPRINT(CAST(trip_start_timestamp AS STRING))), 5000)= 1

LIMIT

1000

**Key steps for your PCA (setup, command, etc)**

PCA and Data exploration, cleaning, feature

engineering– Google Collab link

https://colab.research.google.com/drive/1LdTYfayhNcmg_rcU-DUZWwORD2zQ3Zu1?usp=sharing

Importing

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler

%matplotlib inline

# Pandas display options

pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set random seed

RSEED = 10

# Visualizations

import matplotlib.pyplot as plt

%matplotlib inline

plt.style.use('fivethirtyeight')

import seaborn as sns

palette = sns.color_palette('Paired', 10)

Importing data set (random 1000 rows from Chicago Taxi Fare dataset)

url = "/content/results-20200510-190212.csv"

df = pd.read_csv(url)

df.head()

fare | dayofweek | hourofday | trip_miles | trip_seconds | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | pickup_community_area | dropoff_community_area | trip_start_timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 8.650 | 4 | 23 | 2.600 | 420 | -87.649 | 41.923 | -87.633 | 41.900 | 7 | 8 | 2015-07-22 23:45:00 UTC |

1 | 8.450 | 4 | 23 | 2.300 | 540 | -87.613 | 41.892 | -87.643 | 41.879 | 8 | 28 | 2015-07-22 23:45:00 UTC |

2 | 7.250 | 4 | 23 | 1.800 | 480 | -87.619 | 41.891 | -87.643 | 41.879 | 8 | 28 | 2015-07-22 23:45:00 UTC |

3 | 6.050 | 4 | 23 | 1.300 | 300 | -87.629 | 41.900 | -87.613 | 41.892 | 8 | 8 | 2015-07-22 23:45:00 UTC |

4 | 7.050 | 4 | 23 | 1.600 | 420 | -87.638 | 41.893 | -87.619 | 41.891 | 8 | 8 | 2015-07-22 23:45:00 UTC |

df.describe()

fare | dayofweek | hourofday | trip_miles | trip_seconds | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | pickup_community_area | dropoff_community_area | |
---|---|---|---|---|---|---|---|---|---|---|---|

count | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 | 1000.000 |

mean | 13.257 | 3.217 | 16.230 | 3.926 | 824.164 | -87.656 | 41.894 | -87.643 | 41.896 | 24.643 | 20.044 |

std | 11.697 | 1.318 | 4.679 | 5.090 | 719.886 | 0.072 | 0.040 | 0.042 | 0.037 | 19.623 | 15.238 |

min | 4.450 | 1.000 | 13.000 | 0.510 | 60.000 | -87.914 | 41.713 | -87.914 | 41.706 | 1.000 | 1.000 |

25% | 6.750 | 1.000 | 13.000 | 1.177 | 420.000 | -87.645 | 41.879 | -87.648 | 41.881 | 8.000 | 8.000 |

50% | 8.750 | 4.000 | 13.000 | 1.900 | 633.500 | -87.632 | 41.892 | -87.632 | 41.892 | 28.000 | 8.000 |

75% | 13.300 | 4.000 | 23.000 | 3.900 | 960.000 | -87.621 | 41.900 | -87.621 | 41.901 | 32.000 | 32.000 |

max | 124.250 | 4.000 | 23.000 | 64.400 | 13620.000 | -87.583 | 42.010 | -87.573 | 42.010 | 77.000 | 77.000 |

plt.figure(figsize = (10, 6))

sns.distplot(df['fare']);

plt.title('Distribution of Fare');

Binning

# Bin the fare and convert to string

df['fare-bin'] = pd.cut(df['fare'], bins = list(range(0, 50, 5))).astype(str)

# Uppermost bin

df.loc[df['fare-bin'] == 'nan', 'fare-bin'] = '[45+]'

# Adjust bin so the sorting is correct

df.loc[df['fare-bin'] == '(5.0, 10.0]', 'fare-bin'] = '(05.0, 10.0]'

# Bar plot of value counts

df['fare-bin'].value_counts().sort_index().plot.bar(color = 'b', edgecolor = 'k');

plt.title('Fare Binned');

Lat Long Pickup drop distribution

fig, axes = plt.subplots(1, 2, figsize = (20, 8), sharex=True, sharey=True)

axes = axes.flatten()

# Plot Longitude (x) and Latitude (y)

sns.regplot('pickup_longitude', 'pickup_latitude', fit_reg = False,

data = df.sample(100, random_state = RSEED), ax = axes[0]);

sns.regplot('dropoff_longitude', 'dropoff_latitude', fit_reg = False,

data = df.sample(100, random_state = RSEED), ax = axes[1]);

axes[0].set_title('Pickup Locations')

axes[1].set_title('Dropoff Locations');

# Absolute difference in latitude and longitude

df['abs_lat_diff'] = (df['dropoff_latitude'] - df['pickup_latitude']).abs()

df['abs_lon_diff'] = (df['dropoff_longitude'] - df['pickup_longitude']).abs()

sns.lmplot('abs_lat_diff', 'abs_lon_diff', fit_reg = False,

data = df.sample(900, random_state=RSEED));

plt.title('Absolute latitude difference vs Absolute longitude difference');

sns.lmplot('abs_lat_diff', 'abs_lon_diff', hue = 'fare-bin', size = 8, palette=palette,

fit_reg = False, data = df.sample(900, random_state=RSEED));

plt.title('Absolute latitude difference vs Absolute longitude difference');

It does seem that the rides with a larger absolute difference in both longitude and latitude tend to cost more. To capture both differences in a single variable, we can add up the two differences in latitude and longitude and also find the square root of the sum of differences squared. The former feature would be called the Manhattan distance - or l1 norm - and the latter is called the Euclidean distance - or l2 norm. Both of these distances are specific examples of the general Minkowski distance.

Manhattan and Euclidean Distance

The Minkowski Distance between two points is expressed as:

if p = 1, then this is the Manhattan distance and if p = 2 this is the Euclidean distance. You may also see these referred to as the l1 or l2 norm where the number indicates p in the equation.

I should point out that these equations are only valid for actual distances in a cartesian coordinate system and here we only use them to find relative distances. To find the actual distances in terms of kilometers, we have to work with the latitude and longitude geographical coordinate system. This will be done later using the Haversine formula.

def minkowski_distance(x1, x2, y1, y2, p):

return ((abs(x2 - x1) ** p) + (abs(y2 - y1)) ** p) ** (1 / p)

# Create a color mapping based on fare bins

color_mapping = {fare_bin: palette[i] for i, fare_bin in enumerate(df['fare-bin'].unique())}

color_mapping

df['color'] = df['fare-bin'].map(color_mapping)

plot_data = df.sample(100, random_state = RSEED)

df['manhattan'] = minkowski_distance(df['pickup_longitude'], df['dropoff_longitude'],

df['pickup_latitude'], df['dropoff_latitude'], 1)

# Calculate distribution by each fare bin

plt.figure(figsize = (12, 6))

for f, grouped in df.groupby('fare-bin'):

sns.kdeplot(grouped['manhattan'], label = f'{f}', color = list(grouped['color'])[0]);

plt.xlabel('degrees'); plt.ylabel('density')

plt.title('Manhattan Distance by Fare Amount');

df['euclidean'] = minkowski_distance(df['pickup_longitude'], df['dropoff_longitude'],

df['pickup_latitude'], df['dropoff_latitude'], 2)

# Calculate distribution by each fare bin

plt.figure(figsize = (12, 6))

for f, grouped in df.groupby('fare-bin'):

sns.kdeplot(grouped['euclidean'], label = f'{f}', color = list(grouped['color'])[0]);

plt.xlabel('degrees'); plt.ylabel('density')

plt.title('Euclidean Distance by Fare Amount');

These features do seem to have some differences between the different fare amounts, so they might be helpful in predicting the fare.

grouped = df.groupby('fare-bin')['euclidean'].agg(['mean', 'count'])

grouped.sort_index(ascending=True)

mean | count | |
---|---|---|

fare-bin | ||

(0.0, 5.0] | 0.013 | 21 |

(05.0, 10.0] | 0.020 | 578 |

(10.0, 15.0] | 0.043 | 196 |

(15.0, 20.0] | 0.069 | 54 |

(20.0, 25.0] | 0.087 | 18 |

(25.0, 30.0] | 0.161 | 31 |

(30.0, 35.0] | 0.173 | 21 |

(35.0, 40.0] | 0.236 | 19 |

(40.0, 45.0] | 0.279 | 36 |

[45+] | 0.273 | 26 |

df.groupby('fare-bin')['euclidean'].mean().plot.bar(color = 'b');

plt.title('Average Euclidean Distance by Fare Bin');

There is a very clearly relationship between the fare bin and the average distance of the trip! This should give us confidence that this feature will be useful to a model.

corrs = df.corr()

corrs['fare'].plot.bar(color = 'g', figsize = (10, 10));

plt.title('Correlation with Fare Amount');

corrs = df.corr()

plt.figure(figsize = (14, 14))

sns.heatmap(corrs, annot = True, vmin = -1, vmax = 1, fmt = '.3f', cmap=plt.cm.PiYG_r);

the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings)

from sklearn.preprocessing import StandardScaler

features = ['manhattan', 'dayofweek', 'hourofday', 'trip_seconds', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude', 'pickup_community_area', 'dropoff_community_area']# Separating out the features

x = df.loc[:, features].values# Separating out the target

y = df.loc[:,['fare']].values# Standardizing the features

x = StandardScaler().fit_transform(x)

x

features = x.T

covariance_matrix = np.cov(features)

print(covariance_matrix)

eig_vals, eig_vecs = np.linalg.eig(covariance_matrix)

print('Eigenvectors \n%s' %eig_vecs)

print('\nEigenvalues \n%s' %eig_vals)

eig_vals[0] / sum(eig_vals)

Standardize the Data Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially, if it was measured on different scales. let us continue with the transformation of the data onto unit scale (mean=0 and variance=1), which is a requirement for the optimal performance of many machine learning algorithms.

Ref: https://stackoverflow.com/questions/50796024/feature-variable-importance-after-a-pca-analysis

pca = PCA()

features = pca.fit_transform(x)

explained_variance = pca.explained_variance_ratio_

explained_variance

def myplot(score,coeff,labels=None):

xs = score[:,0]

ys = score[:,1]

n = coeff.shape[0]

scalex = 1.0/(xs.max() - xs.min())

scaley = 1.0/(ys.max() - ys.min())

plt.scatter(xs * scalex,ys * scaley, c = y)

for i inrange(n):

plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)

if labels isNone:

plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')

else:

plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')

plt.xlim(-1,1)

plt.ylim(-1,1)

plt.xlabel("PC{}".format(1))

plt.ylabel("PC{}".format(2))

plt.grid()

#Call the function. Use only the 2 PCs.

myplot(features[:,0:2],np.transpose(pca.components_[0:2, :]))

plt.show()

pca.explained_variance_ratio_

cumsum = np.cumsum(pca.explained_variance_ratio_)

%matplotlib inline

import matplotlib.pyplot as plt

plt.xlabel("number of components")

plt.ylabel("Cumulative explained variance")

plt.plot(cumsum,'--o')

print(abs( pca.components_ ))

Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at the PC1 (First Principal Component) which is the first row: the 1st, 4th, 5th and 9th variables are most important. which in this case are 'manhattan distance', 'trip_seconds', 'pickup_longitude', 'pickup_community_area'.

However, the pickup longitude is something that is a raw unprocessed feature and the manhattan distance is brought to use after feature procesing so we will be considering that.

NOTE: We are not going to be taking the trip miles because a) it is usually not available at the beginning of the journey AND b) it is highly corelated with the manhattan distance, and we should not taken multiple variables that are highly corelated to each other in order to decrease generalisation error.

Now lets move onto the ML part on Big Query in GCP

**The sql for model creation** https://console.cloud.google.com/bigquery?sq=1085073370079:ac74aed7fd1b4bdc8f6b85a60dfa5bc6

**Model Test/Predict**

https://console.cloud.google.com/bigquery?sq=1085073370079:39a02464cfc54a80bd8cfba1d1722440

**EvaluateModel - **

**the sql to evaluate the model**

https://console.cloud.google.com/bigquery?sq=1085073370079:fa04e26003b94cef86773dd3423da54e

References:

https://stackoverflow.com/questions/40120696/trying-to-get-the-euclidean-distance-through-a-query

https://towardsdatascience.com/another-machine-learning-walk-through-and-a-challenge-8fae1e187a64