Advance Research in Sciences
[ ISSN : 2837-5777 ]
Predicting Travel Time and Distance for Statistics Netherlands Interviewers with Machine Learning
Statistics Netherlands, Division of Data Services, Research and Innovation, Heerlen, the Netherlands
Corresponding Authors
Keywords
Abstract
Statistics Netherlands surveys are usually conducted via the Internet, with follow-ups for nonrespondents by telephone or face-to-face interview. The face-to-face interviewers work from their homes and receive sample addresses every month. The approach strategy allows a maximum of six visits per address, spread evenly over the days of the month in question and the parts of the day. The interviewers schedule their visits themselves, so it is not obvious in advance how much travel time and distance are needed to complete an interviewer’s work package. This paper provides a model to estimate the travel time and distance of these work packages by applying machine learning techniques to interviewers’ travel declarations. According to the Mean Absolute Percentage Error, travel distance is best predicted by a Support Vector Regression model with a log-log plus one transformation. The log-log transformation ensures homoscedasticity. The plus one is needed for cyclists, who declare zero kilometres because they do not receive a per-kilometre reimbursement. The explanatory variables in the model are road distance, distance to the ideal interviewer’s residence, radius of the circle containing the addresses, number of addresses, urbanity of the interviewer’s residence, the interviewer’s means of transport, region, province and month. According to the Mean Absolute Percentage Error, travel time is also best predicted by a Support Vector Regression model, this time with a plain log-log transformation. Again, the log-log transformation is used to remove heteroscedasticity from the model. The plus one is not necessary here, because travel time is always declared. The same explanatory variables are used as in the model for travel distance, with road distance replaced by road time. Both road distance and road time are computed by an offline routing server.
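The log-log plus one transformation and the Mean Absolute Percentage Error can be illustrated with a minimal sketch. It is not the authors' implementation. To stay dependency-free, it uses an ordinary least-squares line on a single predictor (road distance) as a stand-in for the Support Vector Regression model. The data are invented for illustration, all function names are hypothetical, and skipping zero declarations when computing the error is an assumption, since the abstract does not say how they are handled.

```python
import math


def log1p_transform(values):
    """Apply log(x + 1), which stays defined for zero declarations."""
    return [math.log1p(v) for v in values]


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x.

    Stand-in for the Support Vector Regression named in the abstract.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_km(a, b, road_km):
    """Predict on the log scale, then invert with exp(z) - 1."""
    return [math.expm1(a + b * math.log1p(r)) for r in road_km]


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent.

    Assumption: observations with a zero actual value are skipped,
    because the percentage error is undefined for them.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs(a - p) / a for a, p in pairs) / len(pairs)


# Invented example data: routed road distance and declared distance (km).
road_km = [120.0, 250.0, 400.0, 80.0, 600.0]
declared_km = [150.0, 310.0, 520.0, 0.0, 700.0]  # 0.0 mimics a cyclist

# Transform both sides, fit on the log scale, back-transform predictions.
a, b = fit_line(log1p_transform(road_km), log1p_transform(declared_km))
predicted_km = predict_km(a, b, road_km)
print("MAPE (%):", mape(declared_km, predicted_km))
```

For travel time, where zero values do not occur, the same sketch would apply with `math.log` and `math.exp` in place of `math.log1p` and `math.expm1`.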