Graph

26 minute read

Extra and last post!

  1. Graph netowrk: a brief analysis.
  2. Privacy policy for Uber and Lyft
  3. Conclusions

1. Graph netowrk: a brief analysis.

library(knitr)
library(igraph)
library(dplyr)

For this exercise, we decide to take a small sample of 200 observations to reduce the plot analysis, especially for the visual part of the graph figures.

sample_n(rider_net, 200) -> rider

After that we decided to take split the data for Uber and Lyft, in that way we could perform the graph exercise and some metrics for both samples, basic statistics and degree indexes.

rider %>% filter(cab_type == 'Lyft') -> lyft
rider %>% filter(cab_type == 'Uber') -> uber

rider <-within(rider, rm('cab_type', 'id'))
uber <-within(uber, rm('cab_type', 'id'))
lyft <-within(lyft, rm('cab_type', 'id'))
FN=graph_from_data_frame(d=rider, directed=F)
ub=graph_from_data_frame(d=uber, directed=F)
ly=graph_from_data_frame(d=lyft, directed=F)

Here is a sample of the code for Uber, that it’s easily replicable for Lyft and for the whole sample.

par(mar=c(1.1,1.1,1.1,1.1))
LO = layout.fruchterman.reingold(ub,niter=100)

plot(FN, layout=LO, vertex.size=degree(ly, mode='out')*0.8, vertex.frame.color="black", asp=0, vertex.color='cyan',
     edge.color="lightgray", edge.arrow.size=0.069, rescale=TRUE, edge.width=.06, 
     edge.color='darkgreen', vertex.label.cex=0.8

Here’s the plot for the whole network -Uber & Lyft included-:

alt

plot(ub, layout=LO, vertex.size=degree(ly, mode='out')*0.8, vertex.frame.color="black", asp=0, vertex.color='cyan',
     edge.color="lightgray", edge.arrow.size=0.069, rescale=TRUE, edge.width=.06, 
     edge.color='darkgreen', vertex.label.cex=0.8

Basic statistics for the graph

# Creating the function for the mode
moda <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

r = range(degree(FN))
ra = r[2]-r[1]
mo = moda(degree(FN))
me = mean(degree(FN))
md = median(degree(FN))

Basic statistics

  Full set Uber Lyft
Range 20 13 10
Mode 38 25 15
Mean 33.3 19.8 13.5
Median 36 22 14

From the above Table we can see range is a little higher for Uber than Lyft, we also see that the mode is higher than Lyft, this clearly reflects that Uber is more connected to the network. Finally, the mean and median are also higher in Uber.

Here’s the network plot for Uber:

alt

plot(ly, layout=LO, vertex.size=degree(ly, mode='out')*0.8, vertex.frame.color="black", asp=0, vertex.color='cyan',
     edge.color="lightgray", edge.arrow.size=0.069, rescale=TRUE, edge.width=.06, 
     edge.color='darkgreen', vertex.label.cex=0.8

Here’s the network plot for Lyft:

alt

Last, we make some plots for both firms, this time giving more weight to the degree of each source place.

Here’s the network plot for Uber:

alt

Here’s the network plot for Lyft:

alt

Degree indexes for the newtork

DEG_0 = centr_degree(f,loops=FALSE)$centralization
CLO_0 = centr_clo(f,normalized=TRUE)$centralization
POW_0 = mean(power_centrality(f,exponent=0.45))
BET_0 = centr_betw(f,normalized=TRUE)$centralization
EIG_0 = centr_eigen(f)$centralization
Centralities_0=cbind(DEG_0,CLO_0,BET_0,EIG_0,POW_0)
round(Centralities_0,4)
  Centr. deg. Closeness Power Eigen Betwen.
Full set 0.72 0.029 -0.99 0.19 0.0235
Uber 0.56 0.084 -0.97 0.24 0.036
Lyft 0.38 0.084 -0.98 0.24 0.033

From the centrality degree we can affirm our previous statement, we found that Uber is more connected than Lyft by the number of links of each source place, however, the Closeness degree, Power degree and Eigenvector centrality are similar. Finally, the Betweenness Centrality measures the importance of a node based in the shortest path between all pairs of nodes in the graph, again Uber has a small advantage communicating the distinct places in the network.

2. Privacy policy for Uber and Lyft

Privacy Policy is an integral part of the General Terms of Use. The protection of users personal data is one of Uber and Lyft’s priorities, and any processing of personal data is carried out conscientiously, legally, and in accordance with all relevant legislation in their respective location.

For this part we will only discuss the visitor data, that is collected when using the Website and browsing it’s content.

According to the firms, this data is not collected and do not process personal data whose collection is prohibited by law, such as data relating to racial or ethnic origin, political opinions, religious or other beliefs, trade union memberships and other data the collection of which the law expressly prohibits.

Uber collects the following data:

  • Full name (first name, last name, middle name, title)
  • Headline (job title)
  • Skype name
  • Profile image
  • Phone number
  • Email address
  • Company name
  • Company address

Uber collects this information for the sole purpose of carrying out activities which are subject of their business. The collected data can be used for further development and improvement of the Website, to adapt and improve the content and services provided, as well as to inform and offer the User other contents and services that can be used through the Website.

Data security

Customers data is stored in multi-tenant data-stores, both firms don’t have individual data-stores for each customer

Uber

Uber uses a background check and identity verification: they collect background check and identity verification information for drivers and delivery partners. This may include information such as driver history or criminal record (where permitted by law), and right to work. Ubel also collects identity verification from ‘Uber Eats’ users who request alcohol delivery.

According to demographic data: they may collect demographic data about users, including user surveys. In some countries, they also receive demographic data about users from third parties.

  • Location data: Uber collects precise location data from a user’s mobile device if enabled by the user. For drivers and delivery partners, Uber collects this data when the Uber app is running in the foreground (app open and on-screen) or background (app open but not on-screen) of their mobile device. For riders, delivery recipients, and renters, Uber collects this data when the Uber app is running in the foreground.

Uber collects and uses data to enable reliable and convenient transportation, delivery, and other products and services. Uber uses the data we collect:

  • To enhance the safety and security of our users and services
  • For customer support
  • For research and development
  • To enable communications between users.
  • Track and share the progress of rides or deliveries.
  • Enable features that allow users to share information with other people, such as when riders submit a compliment about a driver.

On the other hand, Uber uses personal data to make automated decisions relating to use of our services. This includes:

  • Enabling dynamic pricing, in which the price of a ride, is determined based on constantly varying factors such as the estimated time and distance, the predicted route, estimated traffic, and the number of riders and drivers using Uber at a given moment.

A review for Lyft Privacy Policy

At Lyft, they need to collect, use, and share some of our personal information. Lyft’s privacy homepage provides additional information about our commitment to respecting your personal information, including ways for you to access and delete that information.

When you use the Lyft Platform, they collect the information you provide to them, usage information, and information about your device. They also collect information about you from other sources like third party services, and optional programs in which you participate, which they may combine with other information they have about you. Here are some types of information they collect about us.

  • Information You Provide:
  • Your name,
  • email address,
  • phone number,
  • birth date, and payment information.

You may choose to share additional info for your Rider profile, like your photo or saved addresses (e.g., home or work), and set up other preferences (such as your preferred pronouns).

Ratings and Feedback. When you rate and provide feedback about Riders or Drivers, they collect all of the information you provide in your feedback.

  • Information collected when you use the Lyft Platform Location Information. The Lyft Platform collects location information (including GPS and WiFi data) differently depending on your Lyft app settings and device permissions as well as whether you are using the platform as a Rider or Driver.

Riders: Lyft collects your device’s precise location when you open and use the Lyft app, including while the app is running in the background from the time you request a ride until it ends. Lyft also tracks the precise location of scooters and e-bikes at all times.

Usage Information. They collect information about your use of the Lyft Platform, including ride information like the date, time, destination, distance, route, payment, and whether you used a promotional or referral code. Lyft also collect information about your interactions with the Lyft Platform like our apps and websites, including the pages and content you view and the dates and times of your use.

Device Information. They collect information about the devices you use to access the Lyft Platform, including device model, IP address, type of browser, version of operating system, identity of carrier and manufacturer, radio type (such as 4G), preferences and settings (such as preferred language), application installations, device identifiers, advertising identifiers, and push notification tokens. If you are a Driver, we also collect mobile sensor data from your device (such as speed, direction, height, acceleration, deceleration, and other technical data).

Communications between Riders and Drivers. Lyft work with a third party to facilitate phone calls and text messages between Riders and Drivers without sharing either party’s actual phone number with the other. But while we use a third party to provide the communication service, we collect information about these communications, including the participants’ phone numbers, the date and time, and the contents of SMS messages. For security purposes, may also monitor or record the contents of phone calls made through the Lyft Platform, and will always let you know we are about to do so before the call begins.

Information collected from Third Parties. Third party services provide us with information needed for core aspects of the Lyft Platform, as well as for additional services, programs, loyalty benefits, and promotions that can enhance your Lyft experience. These third party services include background check providers, insurance partners, financial service providers, marketing providers, and other businesses. Lyft obtains the following information about you from these third party services:

Information to make the Lyft Platform safer, like background check information for drivers; Information about your participation in third party programs that provide things like insurance coverage and financial instruments, such as insurance, payment, transaction, and fraud detection information; Information to operationalize loyalty and promotional programs, such as information about your use of such programs; and Information about you provided by specific services, such as demographic and market segment information. Enterprise Programs. If you use Lyft through your employer or other organization that participates in one of our Lyft Business enterprise programs, we will collect information about you from those parties, such as your name and contact information.

Referral Programs. Friends help friends use the Lyft Platform. If someone refers you to Lyft, we will collect information about you from that referral including your name and contact information.

Lyft use your personal information to:

Provide the Lyft Platform; Maintain the security and safety of the Lyft Platform and its users; Build and maintain the Lyft community; Provide customer support; Improve the Lyft Platform; and respond to legal proceedings and obligations.

Maintaining the Security and Safety of the Lyft Platform and its Users. Providing you a secure and safe experience drives our platform, both on the road and on our apps. To do this, Lyft uses your personal information to:

Authenticate users; Verify that Drivers and their vehicles meet safety requirements; Investigate and resolve incidents, accidents, and insurance claims; Encourage safe driving behavior and avoid unsafe activities; Find and prevent fraud; and Block and remove unsafe or fraudulent users from the Lyft Platform.

Improving the Lyft Platform. To improve your experience and provide you with new and helpful features. To do this, they use your personal information to:

Perform research, testing, and analysis; Develop new products, features, partnerships, and services; Prevent, find, and resolve software or hardware bugs and issues; and monitor and improve our operations and processes, including security practices, algorithms, and other modeling. Responding to Legal Proceedings and Requirements. Sometimes the law, government entities, or other regulatory bodies impose demands and obligations on us with respect to the services we seek to provide. In such a circumstance, we may use your personal information to respond to those demands or obligations.

Concluding, as you may see below the data provided has an identificator for the user who is totally anonimized and it is not possible to see, or determine any particular charactristics or information of the user.

head(rider_net)

                    id                   cab_type      source         destination
1 424553bb-7174-41ea-aeb4-fe06d4f4b9d7     Lyft  "Haymarket Square"  "North Station"
2 4bd23055-6827-41c6-b23b-3c491f24e74d     Lyft  "Haymarket Square"  "North Station"
3 981a3613-77af-4620-a42a-0c0866077d1e     Uber  "Haymarket Square"  "North Station"
4 c2d88af2-d278-4bfd-a8d0-29ca77cc5512     Lyft  "Haymarket Square"  "North Station"
5 e0126e1f-8ca9-4f2e-82b3-50505a09db9a     Uber  "Haymarket Square"  "North Station"
6 f6f6d7e4-3e18-4922-a5f5-181cdd3fa6f2     Lyft  "Haymarket Square"  "North Station"

3. Conclusions and recommendations

Due to the high volume of the database, my basic (free) accounts in RStudio Cloud and Google Cloud were unable to make the calculations for the whole dataset, both crashed every time. Especially in Random Forest, and Gradient Boosting Machine. We believe that even hiring more cores at the cloud, it would certainly reduce the computing time but maybe the results (selection variables) would be the same. But still, this limitation is a clear sign to work with cloud machines in the near future. The exercise was computed with an Intel i3-4170 processor with 16GB RAM. Personally, we really liked this final project, even the selected database were not the correct for a beginner, but we learned from these issues, and most important we learnt all the characteristics for each model, in particular, tunning the algorithm. The size of the data set imposed a hard issue on this exercise, and in the low number of good predictors. We also believe that the possible problem of dimensionality, affects the database, making it sparser as we included all the number of features considered, which usually leads to overfitting, that possibly happened in the tree classification.

Things that were attempted but not included in the project.

The following features of time hours were thought to have some validity but when run through the models were found to have no predictive power, maybe it would require an interaction with other variables. Finally, there are a few figures on best tune parameters for xgTree, but we do not include in the project, because we already put tables. Also, there is a routine to select and made statistical differences between RMSE models, but due to programming mistakes in the Rdata environment we prefer not to include them, because there would include some results for the last part, and not for all modes, this is an immediate correction for the next weeks, that will replace final Table in the Xgboost part. Although we attempted to perform Xgboost using xgboost library, it was not possible to adjust the data to the format of the package, in particular, adjusting the data to ‘labels’. So, momentarily we prefer not to include this code routine.

Next steps

While we were writing the report and analysis, we realize that there were other alternatives that certainly would enrich the exercise, the first one, applying classification tree considering as target variables destinations or sources, definitively it could have enriched the document, and of course extract all the benefits from the database. Regarding, this point, we include a graph network model, on how these destinations and sources –locations- are connected and some relevant statistics. About the gbm library, although there is a better option than manually tweaking hyperparameters one at a time is to perform a grid search which iterates over every combination of hyperparameter values and allows us to assess which combination tends to perform well. We know is possible to do it with a local machine with our small sub-sample (260,000 observations), and we didn’t perform it due to time restrictions, maybe it could take hours! On the other hand, while surfing on the web I read that Python is less time consuming than R, definitly this is the next step, trying to replicate these results using Python libraries. Finally, mentioned in the last classes, apply an ensemble learning methodology where outputs are combined by to derive the final output, and last but not least improve experience on Tensorflow and Keras. Google Colab were a very important instrument in this part.

This is the end!

Tags:

Categories:

Updated: