Xgb


XGBoost

To conclude the estimation part of the project, we use the XGBoost methodology and compare its results with the previous estimations. Once again, due to time constraints and computing restrictions, we continue with the same data-sample size as in the two previous models, keeping only the 14 variables that had the most relative influence.

library(tidyverse)
library(xgboost)
library(caret)
library(dplyr)
trainub %>% select('price','distance', 'windSpeed','temperature','visibility','humidity',
'precipIntensity','destination_Back.Bay','source_Beacon.Hill','source_North.End',
'destination_Fenway', 'source_Back.Bay','source_Northeastern.University',
'destination_South.Station','source_Financial.District') -> trainub2
testub %>% select('price','distance', 'windSpeed','temperature','visibility','humidity',
    'precipIntensity','destination_Back.Bay','source_Beacon.Hill','source_North.End',
      'destination_Fenway', 'source_Back.Bay','source_Northeastern.University',
      'destination_South.Station','source_Financial.District') -> testub2

trainly %>% select('price','surge_multiplier','distance','windSpeed','visibility','temperature',
    'humidity','source_North.End','precipIntensity','destination_Back.Bay',
    'source_Boston.University','source_Beacon.Hill','source_North.Station',
    'source_Financial.District','source_Back.Bay') -> trainly2
testly %>% select('price','surge_multiplier','distance','windSpeed','visibility','temperature',
    'humidity','source_North.End','precipIntensity','destination_Back.Bay',
     'source_Boston.University','source_Beacon.Hill','source_North.Station',
     'source_Financial.District','source_Back.Bay') -> testly2
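
For reference, the 14 predictors above come from the relative-influence rankings of the previous models. A minimal sketch of how such a ranking can be pulled from a fitted caret model (the object name gbm_ub_prev is an assumption standing in for any train object from the earlier sections):

# Hypothetical: 'gbm_ub_prev' is a caret::train object fitted in a previous section
imp <- caret::varImp(gbm_ub_prev)$importance              # relative importance per predictor
head(imp[order(-imp$Overall), , drop = FALSE], 14)        # the 14 most influential variables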

We use the xgboost and caret libraries, with the xgbTree and gbm methods. As control parameters we use only 5-fold cross-validation, but in the grid options we are more generous in trying to tune the booster parameters: a maximum of 200 iterations, tree depths of 5, 10 and 15, a learning rate of eta = 0.1, and three values (0.5, 0.7 and 0.9) for the fraction of variables supplied to each tree (colsample_bytree).

1.) We estimate the model with XGBoost via caret's xgbTree method

Matrix sample for Uber
xub_train = xgb.DMatrix(as.matrix(trainub2 %>% select(-price)))
yub_train = trainub$price
xub_test = xgb.DMatrix(as.matrix(testub2 %>% select(-price)))
yub_test = testub$price

Matrix sample for Lyft

xly_train = xgb.DMatrix(as.matrix(trainly2 %>% select(-price)))
yly_train = trainly$price
xly_test = xgb.DMatrix(as.matrix(testly2 %>% select(-price)))
yly_test = testly$price

Control

xgb_trcontrol = trainControl( method = "cv",
    number = 5, allowParallel = T, verboseIter = F, returnData = F)

Tune grid

xgbGrid <- expand.grid(nrounds = c(100,200), max_depth = c(5,10,15),
     colsample_bytree = seq(0.5, 0.9, length.out = 3), eta = 0.1,
    gamma=0,  min_child_weight = 1, subsample = 1)
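
For reference, this grid spans 2 x 3 x 3 = 18 candidate parameter combinations, each refit under 5-fold cross-validation; a one-line check (not part of the original code):

# 18 combinations x 5 CV folds = 90 model fits per firm
nrow(xgbGrid)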

Here’s the table for the best tune – xgbTree:

[Table: best xgbTree tune for Uber and Lyft]

Running the model

set.seed(45) 
xgbub_model = train(xub_train, yub_train,  trControl = xgb_trcontrol,
    tuneGrid = xgbGrid,  method = "xgbTree")

xgbly_model = train(xly_train, yly_train,  trControl = xgb_trcontrol,
    tuneGrid = xgbGrid, method = "xgbTree")

xgbub_model$bestTune
xgbly_model$bestTune

Evaluation

predi_ub = predict(xgbub_model, xub_test)
resid_ub = yub_test - predi_ub
RMSE = sqrt(mean(resid_ub^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')

predi_ly = predict(xgbly_model, xly_test)
resid_ly = yly_test - predi_ly
RMSE = sqrt(mean(resid_ly^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')

R2

yub_test_mean = mean(yub_test)
# Calculate total sum of squares
tss_ub =  sum((yub_test - yub_test_mean)^2 )
# Calculate residual sum of squares
rss_ub =  sum(resid_ub^2)
# Calculate R-squared
rsq_ub  =  1 - (rss_ub/tss_ub)
cat('The R-square of the test data is ', round(rsq_ub,3), '\n')

yly_test_mean = mean(yly_test)
# Calculate total sum of squares
tss_ly =  sum((yly_test - yly_test_mean)^2 )
# Calculate residual sum of squares
rss_ly =  sum(resid_ly^2)
# Calculate R-squared
rsq_ly  =  1 - (rss_ly/tss_ly)
cat('The R-square of the test data is ', round(rsq_ly,3), '\n')
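
Since the same RMSE and R-squared arithmetic is repeated for each model, the two calculations can be condensed into a small helper; a sketch (eval_metrics is not part of the original code):

# Hypothetical helper: test-set RMSE and R-squared from observed and predicted prices
eval_metrics <- function(observed, predicted) {
  resid <- observed - predicted
  c(RMSE = sqrt(mean(resid^2)),
    R2   = 1 - sum(resid^2) / sum((observed - mean(observed))^2))
}
round(eval_metrics(yub_test, predi_ub), 3)
round(eval_metrics(yly_test, predi_ly), 3)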

The best tune for both models (Uber and Lyft) is the same: 100 iterations, a tree depth of 5, and colsample_bytree equal to 0.5. Below, Table 6B presents the results on RMSE and R2: the error is smaller for Uber, and so is its R2; the model for Lyft, on the other hand, shows a higher R2.

RMSE & R2

|      | Uber  | Lyft  |
|------|-------|-------|
| RMSE | 7.396 | 8.55  |
| R2   | 0.042 | 0.093 |

Finally, we present the prediction figures for Uber and Lyft.

Plot predictions vs test data: Uber

options(repr.plot.width=8, repr.plot.height=4)
ub_data = as.data.frame(cbind(predicted_ub = predi_ub,
                              observed_ub = yub_test))

ggplot(ub_data, aes(predicted_ub, observed_ub)) + geom_point(color = "darkred", alpha = 0.5) + 
    geom_smooth(method = lm) + ggtitle("Extreme Gradient Boosting: Uber") +
    xlab("Predicted") + ylab("Observed") + 
    theme(plot.title = element_text(color="darkgreen",size=16,hjust = 0.5),
          axis.text.y = element_text(size=12), axis.text.x = element_text(size=12,hjust=.5),
          axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))

Plot predictions vs test data: Lyft

options(repr.plot.width=8, repr.plot.height=4)
ly_data = as.data.frame(cbind(predicted_ly = predi_ly,
                              observed_ly = yly_test))

ggplot(ly_data, aes(predicted_ly, observed_ly)) + geom_point(color = "darkred", alpha = 0.5) + 
    geom_smooth(method = lm) + ggtitle("Extreme Gradient Boosting: Lyft") +
    xlab("Predicted") + ylab("Observed") + 
    theme(plot.title = element_text(color="darkgreen",size=16,hjust = 0.5),
          axis.text.y = element_text(size=12), axis.text.x = element_text(size=12,hjust=.5),
          axis.title.x = element_text(size=14), axis.title.y = element_text(size=14))

Here are the plots for the XGBoost model predictions:

[Figure: predicted vs. observed prices, XGBoost models for Uber and Lyft]

Extra! GBM with caret

To conclude, although we already presented the GBM estimations in the section above, we found it interesting to repeat the exercise with another routine, if only to compare methodologies and, most importantly, results.

gbmGrid <-  expand.grid(interaction.depth = c(1,3,5,7,9), 
   n.trees = c(50,100,150), shrinkage = 0.01, n.minobsinnode = 20)
              
gbm_ub <- caret::train(price ~ ., data = trainub2,
  method = "gbm", trControl = trainControl(method = "cv", 
      number = 5,  verboseIter = FALSE), verbose = 0, tuneGrid = gbmGrid)

gbm_ly <- caret::train(price ~ ., data = trainly2,
     method = "gbm", trControl = trainControl(method = "cv", 
      number = 5,  verboseIter = FALSE), verbose = 0, tuneGrid = gbmGrid)

gbm_ub$bestTune
gbm_ly$bestTune

Table for the best tune – caret::gbm

| n.trees | interaction.depth | shrinkage | n.minobsinnode |
|---------|-------------------|-----------|----------------|
| 150     | 9                 | 0.01      | 20             |

The best tune for both models (Uber and Lyft) is the same: 150 trees, an interaction depth of 9, and n.minobsinnode equal to 20. Regarding this last parameter, higher values mean smaller trees, so the algorithm runs faster and uses less memory, which may be a consideration. We believe the results are not very sensitive to this parameter and, given the stochastic nature of GBM performance, it might actually be difficult to determine exactly which value is ‘the best’. The smallest cross-validated RMSE is achieved with an interaction depth of 9 and 150 iterations; Lyft’s results are similar to Uber’s.
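
If one wanted to verify this, n.minobsinnode could be added to the tuning grid at the selected tree settings; a minimal sketch of such a sensitivity check (this extra run is an assumption, not part of the original exercise):

# Hypothetical sensitivity check: vary n.minobsinnode while holding the other parameters fixed
gbmGrid_sens <- expand.grid(interaction.depth = 9, n.trees = 150,
    shrinkage = 0.01, n.minobsinnode = c(5, 10, 20, 40))
gbm_ub_sens <- caret::train(price ~ ., data = trainub2, method = "gbm",
    trControl = trainControl(method = "cv", number = 5, verboseIter = FALSE),
    verbose = 0, tuneGrid = gbmGrid_sens)
gbm_ub_sens$results[, c("n.minobsinnode", "RMSE")]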

Here are the plots for caret::gbm – best tune:

[Figure: caret::gbm tuning results]

Evaluation of the models

predi_ub1 = predict(gbm_ub, testub2)
resid_ub1 = testub2$price - predi_ub1
RMSE = sqrt(mean(resid_ub1^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')

predi_ly1 = predict(gbm_ly, testly2)
resid_ly1 = testly2$price - predi_ly1
RMSE = sqrt(mean(resid_ly1^2))
cat('The root mean square error of the test data is ', round(RMSE,3),'\n')
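
The table below also reports R2 for the GBM fits; it can be computed with the same formula used for xgbTree above (a short sketch; the rsq_ub1 and rsq_ly1 names are new):

# R-squared on the test data, same formula as in the xgbTree evaluation
rsq_ub1 = 1 - sum(resid_ub1^2) / sum((testub2$price - mean(testub2$price))^2)
rsq_ly1 = 1 - sum(resid_ly1^2) / sum((testly2$price - mean(testly2$price))^2)
cat('The R-square of the test data is ', round(rsq_ub1,3), '\n')
cat('The R-square of the test data is ', round(rsq_ly1,3), '\n')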

Below, the table presents the results on RMSE and R2; we find that the GBM errors are smaller than those of xgbTree, so we could select this model’s predictions, since it is the one with the lower RMSE figures.

RMSE & R2

|      | Uber  | Lyft  |
|------|-------|-------|
| RMSE | 7.292 | 8.391 |
| R2   | 0.069 | 0.126 |

trellis.par.set(caretTheme())
plot(gbm_ly)  
plot(xgbub_model)

plot(gbm_ub, metric = "RMSE", plotType = "level", scales = list(x = list(rot = 90)))
plot(gbm_ly, metric = "RMSE", plotType = "level", scales = list(x = list(rot = 90)))

Finally, we gather all the metrics obtained during the exercise, which helps us select the model with the least error; of course, this depends on the observations and the selected number of predictors. The results show that the model with the smallest RMSE is GBM! The selection is consistent, since GBM is chosen in almost all the exercises and for both firms. However, for Uber, considering the subsample of 14 predictors, Random Forest appears to be the best selection.

Selected metrics all models

[Figure: selected metrics for all models]
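
As a compact stand-in for the boosted-model part of that comparison, the test-set metrics from this section can be collected into a small table; a sketch (values transcribed from the tables above, not recomputed, and the name boost_metrics is an assumption):

# Test-set metrics for the two boosted models in this section (transcribed, not recomputed)
boost_metrics <- tibble::tribble(
  ~model,            ~firm,  ~RMSE, ~R2,
  "xgbTree (caret)", "Uber", 7.396, 0.042,
  "xgbTree (caret)", "Lyft", 8.55,  0.093,
  "gbm (caret)",     "Uber", 7.292, 0.069,
  "gbm (caret)",     "Lyft", 8.391, 0.126
)
boost_metrics %>% arrange(firm, RMSE)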

This is the end of part 6.
