Our research project centers on the details of flight offers. We are interested in how different variables affect the price of a flight, which in turn affects the customer’s decision when looking for a ticket. We will implement different models that describe and predict flight ticket prices.
From our own experience, airline ticket prices often seem somewhat arbitrary. Should you wait to book, or are you already too late? When is the moment for a bargain? What downsides must you accept for a cheap ticket? The factors influencing the price are rarely explained transparently, so we were excited to see through this system a bit, or at least to get to know it better.
Before working more closely with the dataset, we asked ourselves whether we could answer the following questions:
The following dataset was downloaded from Kaggle and was originally scraped from ‘Easemytrip’, a platform for booking flight tickets. It contains price information for flights operated by six airlines, departing from India’s largest cities at different times. The variable names are largely self-explanatory: besides the price (originally in Indian Rupees; we converted it to EUR), the original dataset contains an index, the airline, the flight number (which we did not use), origin and destination city (6 major Indian cities), departure and arrival time, flight duration, number of stopovers, travel class, and days left (from booking offer to departure day). There are 300,153 observations and 12 variables in this dataset.
flightdata <- read.csv("flightprice.csv")
set.seed(1)
library("corrplot")
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
library("caret")
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: ggplot2
## Loading required package: lattice
library("randomForest")
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library("psych")
## Warning: package 'psych' was built under R version 4.1.3
##
## Attaching package: 'psych'
## The following object is masked from 'package:randomForest':
##
## outlier
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
any(is.na(flightdata))
## [1] FALSE
The any(is.na()) call determines whether our dataset contains any missing values, which is not the case here.
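A per-column breakdown can be obtained with colSums(); a small sketch (illustrated on a toy data frame, since flightdata has no missing values):

```r
# Count missing values per column rather than returning a single TRUE/FALSE.
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
colSums(is.na(toy))
# colSums(is.na(flightdata)) would give the per-column counts for our dataset.
```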
head(flightdata)
## X airline flight source_city departure_time stops arrival_time
## 1 0 SpiceJet SG-8709 Delhi Evening zero Night
## 2 1 SpiceJet SG-8157 Delhi Early_Morning zero Morning
## 3 2 AirAsia I5-764 Delhi Early_Morning zero Early_Morning
## 4 3 Vistara UK-995 Delhi Morning zero Afternoon
## 5 4 Vistara UK-963 Delhi Morning zero Morning
## 6 5 Vistara UK-945 Delhi Morning zero Afternoon
## destination_city class duration days_left price
## 1 Mumbai Economy 2.17 1 5953
## 2 Mumbai Economy 2.33 1 5953
## 3 Mumbai Economy 2.17 1 5956
## 4 Mumbai Economy 2.25 1 5955
## 5 Mumbai Economy 2.33 1 5955
## 6 Mumbai Economy 2.33 1 5955
With the head() command, we can inspect the first rows of the dataset.
str(flightdata)
## 'data.frame': 300153 obs. of 12 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ airline : chr "SpiceJet" "SpiceJet" "AirAsia" "Vistara" ...
## $ flight : chr "SG-8709" "SG-8157" "I5-764" "UK-995" ...
## $ source_city : chr "Delhi" "Delhi" "Delhi" "Delhi" ...
## $ departure_time : chr "Evening" "Early_Morning" "Early_Morning" "Morning" ...
## $ stops : chr "zero" "zero" "zero" "zero" ...
## $ arrival_time : chr "Night" "Morning" "Early_Morning" "Afternoon" ...
## $ destination_city: chr "Mumbai" "Mumbai" "Mumbai" "Mumbai" ...
## $ class : chr "Economy" "Economy" "Economy" "Economy" ...
## $ duration : num 2.17 2.33 2.17 2.25 2.33 2.33 2.08 2.17 2.17 2.25 ...
## $ days_left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 5953 5953 5956 5955 5955 5955 6060 6060 5954 5954 ...
summary(flightdata)
## X airline flight source_city
## Min. : 0 Length:300153 Length:300153 Length:300153
## 1st Qu.: 75038 Class :character Class :character Class :character
## Median :150076 Mode :character Mode :character Mode :character
## Mean :150076
## 3rd Qu.:225114
## Max. :300152
## departure_time stops arrival_time destination_city
## Length:300153 Length:300153 Length:300153 Length:300153
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## class duration days_left price
## Length:300153 Min. : 0.83 Min. : 1 Min. : 1105
## Class :character 1st Qu.: 6.83 1st Qu.:15 1st Qu.: 4783
## Mode :character Median :11.25 Median :26 Median : 7425
## Mean :12.22 Mean :26 Mean : 20890
## 3rd Qu.:16.17 3rd Qu.:38 3rd Qu.: 42521
## Max. :49.83 Max. :49 Max. :123071
Judging by the structure and summary, we can observe that all of our categorical variables were loaded as character variables, which means that the data needs further pre-processing.
flightdata <- flightdata[sample(nrow(flightdata), 10000), ] # Random subsample of 10,000 observations.
flightdata <- flightdata[, -c(1,3)] # Drop the index (X) and flight number columns.
# Convert the categorical variables to factors.
flightdata$airline <- as.factor(flightdata$airline)
flightdata$source_city <- as.factor(flightdata$source_city)
flightdata$departure_time <- as.factor(flightdata$departure_time)
flightdata$stops <- as.factor(flightdata$stops)
flightdata$arrival_time <- as.factor(flightdata$arrival_time)
flightdata$destination_city <- as.factor(flightdata$destination_city)
flightdata$class <- as.factor(flightdata$class)
flightdata$price <- flightdata$price / 83.89 # Convert INR to EUR.
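The repeated as.factor() calls above can be collapsed into one pass over the character columns; a sketch of an equivalent approach (to_factors is a hypothetical helper name):

```r
# Convert every character column of a data frame to a factor in one pass.
to_factors <- function(df) {
  char_cols <- vapply(df, is.character, logical(1))
  df[char_cols] <- lapply(df[char_cols], as.factor)
  df
}

# Demonstration on a toy frame; flightdata would be handled the same way.
toy <- data.frame(city = c("Delhi", "Mumbai"), price = c(70, 80),
                  stringsAsFactors = FALSE)
str(to_factors(toy))  # 'city' becomes a factor, 'price' stays numeric
```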
head(flightdata)
## airline source_city departure_time stops arrival_time destination_city
## 24388 Vistara Delhi Night one Evening Kolkata
## 124413 GO_FIRST Kolkata Early_Morning one Afternoon Delhi
## 142643 Air_India Kolkata Morning one Night Hyderabad
## 25173 Indigo Delhi Afternoon one Evening Kolkata
## 294762 Air_India Chennai Early_Morning one Night Bangalore
## 116487 Air_India Bangalore Early_Morning one Evening Chennai
## class duration days_left price
## 24388 Economy 20.17 25 78.41221
## 124413 Economy 10.83 31 75.38443
## 142643 Economy 11.92 3 141.18488
## 25173 Economy 4.17 30 56.74097
## 294762 Business 13.92 27 718.32161
## 116487 Economy 12.83 13 188.15115
plot(flightdata$duration, flightdata$price, col=flightdata$class, ylab = "Price, EUR", xlab = "Duration of Trip, h")
legend('topright', legend = levels(flightdata$class), col = 1:2, cex = 1, pch = 1)
We have plotted the ticket prices over the flight duration in hours. Based on this scatter plot, we observe no clear relationship between the two variables. Yet we can clearly see two clusters: the black dots are “Business” class flights and the red dots are “Economy”. While there may be no difference in trip duration between the two clusters, “Business” class tickets are on average more expensive than “Economy” ones.
plot(price ~ departure_time, col = 2:6, data = flightdata, xlab = "Departure Time", ylab = "Price")
Judging by this boxplot, we cannot draw firm conclusions about whether tickets are more expensive at particular departure times. However, the median for ‘Late_Night’ tickets is considerably lower than the rest, and the ‘Night’ category has the highest upper quartile.
plot(price ~ airline, col = 2:7, ylab = "Price", xlab = "Airline", data = flightdata)
Here we plotted the prices by airline. We can see that “Air_India” and “Vistara” prices are higher on average than those of the other airlines.
plot(price ~ source_city, col = 2:7, ylab = "Price", xlab = "Source City", data = flightdata)
In this boxplot, we cannot spot any major difference in prices between the different cities from which the flights depart.
plot(price ~ class, col = 2:3, ylab = "Price", xlab = "Class", data = flightdata)
In this boxplot, we can observe a clear difference in prices between the flight tickets of the “Economy” and “Business” classes. “Business” class tickets have a higher cost, with a median of nearly 620 EUR, while the “Economy” class has a median of 150 EUR. In both instances we can observe a large number of outliers.
corr.data <- cor(flightdata[,8:10])
corrplot(corr.data, method="circle")
Here we have graphed the correlation plot between the three numeric variables, namely “price”, “days_left” and “duration”. There is a positive correlation (~0.3) between price and flight duration, which may mean that longer flights are more expensive. The “days_left” variable, i.e. the number of days between booking and flight date, is negatively correlated with “price” (~ -0.2), suggesting that tickets purchased closer to the flight date tend to be more expensive.
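With 10,000 observations, even correlations of this size are statistically significant; cor.test() makes that concrete. A minimal sketch on simulated data (for the real check one would call cor.test(flightdata$price, flightdata$duration)):

```r
set.seed(42)
# Simulate a weak positive relationship, mimicking price vs. duration.
duration <- runif(10000, 1, 50)
price <- 100 + 2 * duration + rnorm(10000, sd = 100)
ct <- cor.test(price, duration)
ct$estimate           # sample correlation, roughly 0.27 here
ct$p.value < 0.05     # TRUE: clearly distinguishable from zero
```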
Now we turn to our first question: predicting flight ticket prices. We start with a multiple linear regression, in order to observe the effects of the various variables on ticket prices. We will estimate several models and test which performs best, based on criteria such as the adjusted R-squared and the AIC. In all regression models, we use a 95% confidence level to identify significant variables and to decide whether there is enough evidence to reject the null hypothesis \(H_0\) that a coefficient equals zero.
regAll <- lm(price ~ (relevel(as.factor(flightdata$class), ref="Economy")) + airline + source_city +
departure_time + stops + arrival_time + destination_city + duration + days_left, data = flightdata)
summary(regAll)
##
## Call:
## lm(formula = price ~ (relevel(as.factor(flightdata$class), ref = "Economy")) +
## airline + source_city + departure_time + stops + arrival_time +
## destination_city + duration + days_left, data = flightdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -364.28 -37.71 -4.63 36.21 582.71
##
## Coefficients:
## Estimate
## (Intercept) 99.14165
## relevel(as.factor(flightdata$class), ref = "Economy")Business 535.58397
## airlineAirAsia -2.09458
## airlineGO_FIRST 18.35166
## airlineIndigo 22.23502
## airlineSpiceJet 23.22496
## airlineVistara 46.70121
## source_cityChennai -0.22193
## source_cityDelhi -18.15947
## source_cityHyderabad -19.10516
## source_cityKolkata 19.98255
## source_cityMumbai -2.70147
## departure_timeEarly_Morning 6.83121
## departure_timeEvening 5.31027
## departure_timeLate_Night 25.03907
## departure_timeMorning 9.25449
## departure_timeNight 10.48042
## stopstwo_or_more 18.06302
## stopszero -89.15850
## arrival_timeEarly_Morning -17.65285
## arrival_timeEvening 10.69180
## arrival_timeLate_Night 14.51292
## arrival_timeMorning 4.47077
## arrival_timeNight 14.80507
## destination_cityChennai -7.57747
## destination_cityDelhi -22.78900
## destination_cityHyderabad -27.01614
## destination_cityKolkata 14.70637
## destination_cityMumbai -5.91493
## duration 0.41830
## days_left -1.53358
## Std. Error
## (Intercept) 5.44487
## relevel(as.factor(flightdata$class), ref = "Economy")Business 1.95146
## airlineAirAsia 4.09470
## airlineGO_FIRST 3.55754
## airlineIndigo 3.07697
## airlineSpiceJet 5.19317
## airlineVistara 2.01159
## source_cityChennai 3.02855
## source_cityDelhi 2.72529
## source_cityHyderabad 2.97799
## source_cityKolkata 2.90121
## source_cityMumbai 2.72055
## departure_timeEarly_Morning 2.70139
## departure_timeEvening 2.76340
## departure_timeLate_Night 11.29609
## departure_timeMorning 2.64686
## departure_timeNight 2.97815
## stopstwo_or_more 4.05699
## stopszero 3.01670
## arrival_timeEarly_Morning 4.29275
## arrival_timeEvening 2.80675
## arrival_timeLate_Night 4.48158
## arrival_timeMorning 2.98180
## arrival_timeNight 2.77167
## destination_cityChennai 2.97584
## destination_cityDelhi 2.81270
## destination_cityHyderabad 3.00811
## destination_cityKolkata 2.84974
## destination_cityMumbai 2.75893
## duration 0.15306
## days_left 0.05908
## t value Pr(>|t|)
## (Intercept) 18.208 < 2e-16
## relevel(as.factor(flightdata$class), ref = "Economy")Business 274.453 < 2e-16
## airlineAirAsia -0.512 0.608988
## airlineGO_FIRST 5.159 2.54e-07
## airlineIndigo 7.226 5.33e-13
## airlineSpiceJet 4.472 7.83e-06
## airlineVistara 23.216 < 2e-16
## source_cityChennai -0.073 0.941585
## source_cityDelhi -6.663 2.82e-11
## source_cityHyderabad -6.415 1.47e-10
## source_cityKolkata 6.888 6.01e-12
## source_cityMumbai -0.993 0.320740
## departure_timeEarly_Morning 2.529 0.011462
## departure_timeEvening 1.922 0.054679
## departure_timeLate_Night 2.217 0.026672
## departure_timeMorning 3.496 0.000474
## departure_timeNight 3.519 0.000435
## stopstwo_or_more 4.452 8.59e-06
## stopszero -29.555 < 2e-16
## arrival_timeEarly_Morning -4.112 3.95e-05
## arrival_timeEvening 3.809 0.000140
## arrival_timeLate_Night 3.238 0.001206
## arrival_timeMorning 1.499 0.133814
## arrival_timeNight 5.342 9.42e-08
## destination_cityChennai -2.546 0.010901
## destination_cityDelhi -8.102 6.03e-16
## destination_cityHyderabad -8.981 < 2e-16
## destination_cityKolkata 5.161 2.51e-07
## destination_cityMumbai -2.144 0.032063
## duration 2.733 0.006288
## days_left -25.960 < 2e-16
##
## (Intercept) ***
## relevel(as.factor(flightdata$class), ref = "Economy")Business ***
## airlineAirAsia
## airlineGO_FIRST ***
## airlineIndigo ***
## airlineSpiceJet ***
## airlineVistara ***
## source_cityChennai
## source_cityDelhi ***
## source_cityHyderabad ***
## source_cityKolkata ***
## source_cityMumbai
## departure_timeEarly_Morning *
## departure_timeEvening .
## departure_timeLate_Night *
## departure_timeMorning ***
## departure_timeNight ***
## stopstwo_or_more ***
## stopszero ***
## arrival_timeEarly_Morning ***
## arrival_timeEvening ***
## arrival_timeLate_Night **
## arrival_timeMorning
## arrival_timeNight ***
## destination_cityChennai *
## destination_cityDelhi ***
## destination_cityHyderabad ***
## destination_cityKolkata ***
## destination_cityMumbai *
## duration **
## days_left ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.27 on 9969 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.9134
## F-statistic: 3515 on 30 and 9969 DF, p-value: < 2.2e-16
In the regression above, we have included all of the variables that could explain changes in price. We have releveled the “class” variable so that “Economy” is the baseline. The fitted regression line for the model is: \[ \widehat{Price} = 99.14165 + 535.58397 \cdot classBusiness - 2.09458 \cdot airlineAirAsia + 18.35166 \cdot airlineGO\_First + ... - 1.53358 \cdot days\_left \] Interpretation: holding everything else constant, a Business-class ticket costs on average about 535.58 EUR more than an Economy ticket, and each additional day between booking and departure reduces the expected price by about 1.53 EUR.
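As a worked example of reading the fitted line, consider a hypothetical one-stop Vistara Economy flight from Delhi to Mumbai, departing in the morning and arriving in the afternoon (all other dummies at their baselines), with a 10-hour duration, booked 20 days before departure. Plugging the coefficients from the summary above into the fitted line gives:

```r
# Baseline levels (Economy, Air_India, Bangalore, Afternoon arrival, one stop)
# contribute nothing; only the deviating dummies and numeric terms enter.
price_hat <- 99.14165 +      # intercept
  46.70121 +                 # airlineVistara
  (-18.15947) +              # source_cityDelhi
  9.25449 +                  # departure_timeMorning
  (-5.91493) +               # destination_cityMumbai
  0.41830 * 10 +             # duration = 10 hours
  (-1.53358) * 20            # days_left = 20
round(price_hat, 2)          # about 104.53 EUR
```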
Now, we will proceed with an adjusted model, where we are going to select the variables that we think might explain prices the best.
regAdj <- lm(price ~ (relevel(as.factor(flightdata$class), ref="Economy")) + stops + duration + destination_city +
days_left, data = flightdata)
summary(regAdj)
##
## Call:
## lm(formula = price ~ (relevel(as.factor(flightdata$class), ref = "Economy")) +
## stops + duration + destination_city + days_left, data = flightdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -378.29 -36.27 -6.04 36.57 622.51
##
## Coefficients:
## Estimate
## (Intercept) 134.5560
## relevel(as.factor(flightdata$class), ref = "Economy")Business 543.7162
## stopstwo_or_more 21.9959
## stopszero -95.3687
## duration 0.2858
## destination_cityChennai -8.1186
## destination_cityDelhi -18.6953
## destination_cityHyderabad -23.4356
## destination_cityKolkata 9.6827
## destination_cityMumbai -7.5207
## days_left -1.5367
## Std. Error
## (Intercept) 3.2776
## relevel(as.factor(flightdata$class), ref = "Economy")Business 1.8346
## stopstwo_or_more 4.1823
## stopszero 3.0676
## duration 0.1398
## destination_cityChennai 3.0584
## destination_cityDelhi 2.8217
## destination_cityHyderabad 3.0739
## destination_cityKolkata 2.8967
## destination_cityMumbai 2.7921
## days_left 0.0618
## t value Pr(>|t|)
## (Intercept) 41.053 < 2e-16
## relevel(as.factor(flightdata$class), ref = "Economy")Business 296.370 < 2e-16
## stopstwo_or_more 5.259 1.48e-07
## stopszero -31.089 < 2e-16
## duration 2.044 0.040962
## destination_cityChennai -2.654 0.007956
## destination_cityDelhi -6.626 3.64e-11
## destination_cityHyderabad -7.624 2.68e-14
## destination_cityKolkata 3.343 0.000833
## destination_cityMumbai -2.694 0.007080
## days_left -24.866 < 2e-16
##
## (Intercept) ***
## relevel(as.factor(flightdata$class), ref = "Economy")Business ***
## stopstwo_or_more ***
## stopszero ***
## duration *
## destination_cityChennai **
## destination_cityDelhi ***
## destination_cityHyderabad ***
## destination_cityKolkata ***
## destination_cityMumbai **
## days_left ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 84.14 on 9989 degrees of freedom
## Multiple R-squared: 0.9049, Adjusted R-squared: 0.9048
## F-statistic: 9506 on 10 and 9989 DF, p-value: < 2.2e-16
We have included the following variables:
We can observe that this model performs worse than the model above, since the Adjusted R-Squared value is 0.9048, as opposed to the previous value of 0.9134.
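Because the adjusted model is nested in the full model, anova() can formalize this comparison with a partial F-test (anova(regAdj, regAll) on our data; sketched here on simulated data):

```r
set.seed(1)
# Toy nested-model comparison: does adding x2 significantly improve the fit?
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
small <- lm(y ~ x1)
full  <- lm(y ~ x1 + x2)
anova(small, full)$`Pr(>F)`[2] < 0.05  # TRUE: the extra predictor matters
```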
Now, we will perform a forward stepwise regression, where we start with the model that contains an intercept but no predictors. Next, we add to the null model the variables that result in the lowest AIC, and conclude which model performs better. Stepwise output is hidden.
startmodel = lm(price ~ 1, data = flightdata)
regStep = step(startmodel, direction="forward", scope = formula(regAll))
summary(regStep)
##
## Call:
## lm(formula = price ~ relevel(as.factor(flightdata$class), ref = "Economy") +
## stops + days_left + airline + destination_city + source_city +
## arrival_time + departure_time + duration, data = flightdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -364.28 -37.71 -4.63 36.21 582.71
##
## Coefficients:
## Estimate
## (Intercept) 99.14165
## relevel(as.factor(flightdata$class), ref = "Economy")Business 535.58397
## stopstwo_or_more 18.06302
## stopszero -89.15850
## days_left -1.53358
## airlineAirAsia -2.09458
## airlineGO_FIRST 18.35166
## airlineIndigo 22.23502
## airlineSpiceJet 23.22496
## airlineVistara 46.70121
## destination_cityChennai -7.57747
## destination_cityDelhi -22.78900
## destination_cityHyderabad -27.01614
## destination_cityKolkata 14.70637
## destination_cityMumbai -5.91493
## source_cityChennai -0.22193
## source_cityDelhi -18.15947
## source_cityHyderabad -19.10516
## source_cityKolkata 19.98255
## source_cityMumbai -2.70147
## arrival_timeEarly_Morning -17.65285
## arrival_timeEvening 10.69180
## arrival_timeLate_Night 14.51292
## arrival_timeMorning 4.47077
## arrival_timeNight 14.80507
## departure_timeEarly_Morning 6.83121
## departure_timeEvening 5.31027
## departure_timeLate_Night 25.03907
## departure_timeMorning 9.25449
## departure_timeNight 10.48042
## duration 0.41830
## Std. Error
## (Intercept) 5.44487
## relevel(as.factor(flightdata$class), ref = "Economy")Business 1.95146
## stopstwo_or_more 4.05699
## stopszero 3.01670
## days_left 0.05908
## airlineAirAsia 4.09470
## airlineGO_FIRST 3.55754
## airlineIndigo 3.07697
## airlineSpiceJet 5.19317
## airlineVistara 2.01159
## destination_cityChennai 2.97584
## destination_cityDelhi 2.81270
## destination_cityHyderabad 3.00811
## destination_cityKolkata 2.84974
## destination_cityMumbai 2.75893
## source_cityChennai 3.02855
## source_cityDelhi 2.72529
## source_cityHyderabad 2.97799
## source_cityKolkata 2.90121
## source_cityMumbai 2.72055
## arrival_timeEarly_Morning 4.29275
## arrival_timeEvening 2.80675
## arrival_timeLate_Night 4.48158
## arrival_timeMorning 2.98180
## arrival_timeNight 2.77167
## departure_timeEarly_Morning 2.70139
## departure_timeEvening 2.76340
## departure_timeLate_Night 11.29609
## departure_timeMorning 2.64686
## departure_timeNight 2.97815
## duration 0.15306
## t value Pr(>|t|)
## (Intercept) 18.208 < 2e-16
## relevel(as.factor(flightdata$class), ref = "Economy")Business 274.453 < 2e-16
## stopstwo_or_more 4.452 8.59e-06
## stopszero -29.555 < 2e-16
## days_left -25.960 < 2e-16
## airlineAirAsia -0.512 0.608988
## airlineGO_FIRST 5.159 2.54e-07
## airlineIndigo 7.226 5.33e-13
## airlineSpiceJet 4.472 7.83e-06
## airlineVistara 23.216 < 2e-16
## destination_cityChennai -2.546 0.010901
## destination_cityDelhi -8.102 6.03e-16
## destination_cityHyderabad -8.981 < 2e-16
## destination_cityKolkata 5.161 2.51e-07
## destination_cityMumbai -2.144 0.032063
## source_cityChennai -0.073 0.941585
## source_cityDelhi -6.663 2.82e-11
## source_cityHyderabad -6.415 1.47e-10
## source_cityKolkata 6.888 6.01e-12
## source_cityMumbai -0.993 0.320740
## arrival_timeEarly_Morning -4.112 3.95e-05
## arrival_timeEvening 3.809 0.000140
## arrival_timeLate_Night 3.238 0.001206
## arrival_timeMorning 1.499 0.133814
## arrival_timeNight 5.342 9.42e-08
## departure_timeEarly_Morning 2.529 0.011462
## departure_timeEvening 1.922 0.054679
## departure_timeLate_Night 2.217 0.026672
## departure_timeMorning 3.496 0.000474
## departure_timeNight 3.519 0.000435
## duration 2.733 0.006288
##
## (Intercept) ***
## relevel(as.factor(flightdata$class), ref = "Economy")Business ***
## stopstwo_or_more ***
## stopszero ***
## days_left ***
## airlineAirAsia
## airlineGO_FIRST ***
## airlineIndigo ***
## airlineSpiceJet ***
## airlineVistara ***
## destination_cityChennai *
## destination_cityDelhi ***
## destination_cityHyderabad ***
## destination_cityKolkata ***
## destination_cityMumbai *
## source_cityChennai
## source_cityDelhi ***
## source_cityHyderabad ***
## source_cityKolkata ***
## source_cityMumbai
## arrival_timeEarly_Morning ***
## arrival_timeEvening ***
## arrival_timeLate_Night **
## arrival_timeMorning
## arrival_timeNight ***
## departure_timeEarly_Morning *
## departure_timeEvening .
## departure_timeLate_Night *
## departure_timeMorning ***
## departure_timeNight ***
## duration **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80.27 on 9969 degrees of freedom
## Multiple R-squared: 0.9136, Adjusted R-squared: 0.9134
## F-statistic: 3515 on 30 and 9969 DF, p-value: < 2.2e-16
AIC(regAll, regStep)
## df AIC
## regAll 32 116120.1
## regStep 32 116120.1
The model selected by step() includes all of the variables from the complete model. Although this is relatively uncommon, it suggests that each variable in the dataset has a meaningful effect on price, so the procedure kept them all. Consequently, the AIC and adjusted R-squared values are unchanged.
AIC(regStep, regAdj)
## df AIC
## regStep 32 116120.1
## regAdj 12 117042.2
However, compared to our adjusted model, the stepwise model performs notably better, since its AIC is lower.
Now we will test the predictive power of the models by splitting the data into training and test sets. First we estimate the models on the training sample, then we evaluate their performance on the test sample.
random = sample(1:nrow(flightdata), 0.8*nrow(flightdata))
train_data = flightdata[random,]
test_data = flightdata[-random,]
We split the data into 80% training data and 20% test data.
fullmodel = lm(price ~ class + airline + source_city + departure_time + stops + arrival_time +
destination_city + duration + days_left, data = train_data)
adjmodel = lm(price ~ + stops + duration + destination_city + class + days_left, data=train_data)
startmodel1 = lm(price ~ 1, data=train_data)
regStep1 = step(startmodel1, direction="forward", scope=formula(fullmodel))
Re-executing the regressions, but now on the training data. Stepwise output hidden.
y_hat_full = predict(fullmodel, newdata = test_data)
y_hat_adjmodel = predict(adjmodel, newdata = test_data)
y_hat_regstep = predict(regStep1, newdata = test_data)
Computing predictions on the test data; the resulting prediction errors are used to calculate the Root Mean Squared Error (RMSE).
sqrt(mean((y_hat_full - test_data$price)^2)) # Full model.
## [1] 79.35446
sqrt(mean((y_hat_adjmodel - test_data$price)^2)) # Adjusted model.
## [1] 82.58473
sqrt(mean((y_hat_regstep - test_data$price)^2)) # Forward selected model.
## [1] 79.35446
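The three lines above repeat the same formula; a small helper makes this reusable (a sketch with hypothetical numbers):

```r
# Root Mean Squared Error between predicted and observed values.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# e.g. rmse(y_hat_full, test_data$price) reproduces the first value above.
rmse(c(100, 200, 300), c(110, 190, 305))  # sqrt(75), about 8.66
```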
The full model and the stepwise model achieve the same, lowest RMSE. This is expected, since the stepwise procedure retained all variables, so the two models are identical; the out-of-sample ranking thus matches the in-sample comparison. We may now proceed to random forests.
We also decided to use random forests to predict flight prices. With 9 predictors, the regression default would be mtry = floor(9/3) = 3. However, we noticed that increasing the hyperparameter to mtry = 4 yielded noticeably better results (lower RMSE) without increasing computation time too much.
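A sketch of how such an mtry comparison can be run, using the out-of-bag (OOB) error that randomForest reports (illustrated on the built-in mtcars data to keep it fast; the same loop over flightdata motivated our choice of mtry = 4):

```r
library(randomForest)
set.seed(1)
# Compare out-of-bag MSE across candidate mtry values.
for (m in 2:4) {
  fit <- randomForest(mpg ~ ., data = mtcars, mtry = m, ntree = 300)
  cat("mtry =", m, " OOB MSE =", round(tail(fit$mse, 1), 2), "\n")
}
```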
rf <- randomForest(price ~ ., data=flightdata, importance=TRUE, mtry=4)
rf
##
## Call:
## randomForest(formula = price ~ ., data = flightdata, importance = TRUE, mtry = 4)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 2287.793
## % Var explained: 96.92
varImpPlot(rf)
Above we can see that the proportion of variance explained (96.92%) is higher than in any of our linear regression models. We can also visualize the importance of the variables.
In order to be able to evaluate which model performed better between random forests and linear regression, we use 10-fold cross validation to compute the metrics of performance.
fitControl <- trainControl(method = "cv",number = 10)
rfFit1 <- train(price ~ .,
data = flightdata,
method = "rf",
tuneGrid = data.frame(mtry=4),
trControl = fitControl)
rfFit1
## Random Forest
##
## 10000 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 9000, 9001, 9001, 9001, 8999, 8999, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 65.11511 0.9558858 44.03098
##
## Tuning parameter 'mtry' was held constant at a value of 4
We may also check the variables that are most important in determining prices on the model built with the train function.
varImp(rfFit1)
## rf variable importance
##
## only 20 most important variables shown (out of 30)
##
## Overall
## classEconomy 100.0000
## airlineVistara 6.8982
## duration 5.3286
## airlineIndigo 3.6531
## days_left 2.5399
## stopszero 2.1408
## airlineGO_FIRST 2.0158
## airlineAirAsia 1.5688
## stopstwo_or_more 0.5562
## airlineSpiceJet 0.5405
## destination_cityDelhi 0.3550
## source_cityDelhi 0.3198
## destination_cityKolkata 0.3097
## source_cityKolkata 0.2889
## destination_cityMumbai 0.2589
## destination_cityHyderabad 0.2450
## arrival_timeNight 0.2429
## source_cityMumbai 0.2314
## arrival_timeEvening 0.2128
## arrival_timeEarly_Morning 0.2046
In the table above, we confirm that the three most important variables are class, whether the airline is Vistara, and the duration of the flight.
We can see that the model has a cross-validated R-squared of 0.956 and an RMSE of 65.12. Now we can test the performance of the linear model using 10-fold cross-validation as well.
lmFit1 <- train(price ~ .,
data = flightdata,
method = "lm",
trControl = fitControl)
lmFit1
## Linear Regression
##
## 10000 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 9000, 9000, 8999, 8999, 9000, 9002, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 80.33955 0.9134614 54.64637
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Comparing the random forest to the linear regression, both evaluated with 10-fold cross-validation, we conclude that the random forest is the better model: all three performance measures (RMSE, R-squared, MAE) are markedly better for the random forest than for the linear regression.
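To quantify the gap, we can compute the relative improvement from the cross-validated metrics reported above:

```r
# Cross-validated metrics from the two train() fits above.
rmse_lm <- 80.33955; rmse_rf <- 65.11511
mae_lm  <- 54.64637; mae_rf  <- 44.03098
100 * (1 - rmse_rf / rmse_lm)  # about 19% lower RMSE
100 * (1 - mae_rf / mae_lm)    # about 19% lower MAE
```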
In order to find out whether a flight can be considered affordable for the average Indian consumer, we create a new column in the dataset, derived from the price column. Our assumption is that everything at or below the median weekly wage is affordable.
Through some research we found that the current median weekly wage in India is roughly 90 EUR, so we use this threshold to classify the flights as affordable (TRUE or FALSE), obtaining a Boolean variable.
We can then train our models on this variable and predict whether a given flight ticket will be affordable given its route, flight time, airline, and so on. “Affordable” here means affordable for half of India’s wage earners (by the definition of the median, half earn at least this amount), which is a large number of people and hence a big market.
To this end, let’s test the logistic regression and k-NN models.
First, we set up the categorical “affordable” variable to be predicted and remove the “price” variable, because we do not want to use it to predict affordability.
med_weekly_salary = 90 # In Euro for median Indian person.
bool = flightdata$price<=med_weekly_salary
# New column with TRUE/FALSE.
flightdata$affordable = bool
flightdata$affordable = as.factor(flightdata$affordable)
flightdata = flightdata[,-10] # Remove prices.
head(flightdata)
## airline source_city departure_time stops arrival_time destination_city
## 24388 Vistara Delhi Night one Evening Kolkata
## 124413 GO_FIRST Kolkata Early_Morning one Afternoon Delhi
## 142643 Air_India Kolkata Morning one Night Hyderabad
## 25173 Indigo Delhi Afternoon one Evening Kolkata
## 294762 Air_India Chennai Early_Morning one Night Bangalore
## 116487 Air_India Bangalore Early_Morning one Evening Chennai
## class duration days_left affordable
## 24388 Economy 20.17 25 TRUE
## 124413 Economy 10.83 31 TRUE
## 142643 Economy 11.92 3 FALSE
## 25173 Economy 4.17 30 TRUE
## 294762 Business 13.92 27 FALSE
## 116487 Economy 12.83 13 FALSE
Now we’ll build the logistic model and then try to remove some of the predictor variables with step(). But, as in the linear regression, all variables are retained. We also try the “backward” and “both” (backward and forward) directions, yet, as we can see, they yield the same model.
fit_logit_all <- glm(affordable ~ . , data = flightdata,family = binomial())
fit_logit_step <- step(fit_logit_all, direction = "both")
fit_logit_step_b <- step(fit_logit_all, direction = "backward")
Below, we see that the three models are the same:
formula(fit_logit_step)==formula(fit_logit_step_b)
## [1] TRUE
formula(fit_logit_step)==formula(fit_logit_all)
## [1] TRUE
summary(fit_logit_step)
##
## Call:
## glm(formula = affordable ~ airline + source_city + departure_time +
## stops + arrival_time + destination_city + class + duration +
## days_left, family = binomial(), data = flightdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8494 -0.0001 0.0000 0.3348 2.3405
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.366463 255.206645 -0.099 0.92082
## airlineAirAsia 2.896392 0.251284 11.526 < 2e-16 ***
## airlineGO_FIRST 1.291127 0.159124 8.114 4.90e-16 ***
## airlineIndigo 0.728799 0.132923 5.483 4.19e-08 ***
## airlineSpiceJet 0.642757 0.200101 3.212 0.00132 **
## airlineVistara -0.543992 0.095862 -5.675 1.39e-08 ***
## source_cityChennai -0.098120 0.138095 -0.711 0.47738
## source_cityDelhi 0.015890 0.126563 0.126 0.90009
## source_cityHyderabad 0.348407 0.140966 2.472 0.01345 *
## source_cityKolkata -0.659732 0.132935 -4.963 6.95e-07 ***
## source_cityMumbai 0.203546 0.128167 1.588 0.11226
## departure_timeEarly_Morning 0.138778 0.124309 1.116 0.26425
## departure_timeEvening 0.337278 0.128155 2.632 0.00849 **
## departure_timeLate_Night -0.065860 0.635199 -0.104 0.91742
## departure_timeMorning -0.159617 0.117832 -1.355 0.17554
## departure_timeNight 0.234316 0.143948 1.628 0.10357
## stopstwo_or_more -1.945816 0.155729 -12.495 < 2e-16 ***
## stopszero 2.142629 0.189396 11.313 < 2e-16 ***
## arrival_timeEarly_Morning 0.567084 0.213141 2.661 0.00780 **
## arrival_timeEvening -0.233631 0.128573 -1.817 0.06920 .
## arrival_timeLate_Night -0.287932 0.212202 -1.357 0.17482
## arrival_timeMorning 0.260944 0.142134 1.836 0.06637 .
## arrival_timeNight -0.008834 0.128745 -0.069 0.94530
## destination_cityChennai -0.006227 0.139634 -0.045 0.96443
## destination_cityDelhi -0.120066 0.134065 -0.896 0.37047
## destination_cityHyderabad 0.149540 0.142695 1.048 0.29465
## destination_cityKolkata -0.642112 0.131379 -4.887 1.02e-06 ***
## destination_cityMumbai -0.003401 0.130615 -0.026 0.97923
## classEconomy 23.405643 255.206510 0.092 0.92693
## duration -0.019510 0.006827 -2.858 0.00427 **
## days_left 0.140398 0.003893 36.064 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13862.7 on 9999 degrees of freedom
## Residual deviance: 4555.4 on 9969 degrees of freedom
## AIC: 4617.4
##
## Number of Fisher Scoring iterations: 19
We see that the stepwise logistic regression model achieves an AIC of 4617.4.
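Since the coefficients above are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret. A minimal sketch using the reported days_left coefficient (applied to the fitted model, this would simply be exp(coef(fit_logit_step))):

```r
# The reported coefficient for days_left is 0.140398 (log-odds scale).
# Exponentiating gives the multiplicative change in the odds of a flight
# being affordable per additional day between booking and departure:
beta_days_left <- 0.140398
exp(beta_days_left) # about 1.15: each extra day raises the odds by ~15%
```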
Now, let’s see how this model performs in predicting classes. For that, we again make a random 80%-20% split of the data into training and test sets.
random = sample(1:nrow(flightdata), 0.8*nrow(flightdata))
train_data = flightdata[random,]
test_data = flightdata[-random,]
Now, we’ll use the caret package to train the model that we obtained from the stepwise algorithm. We then predict whether the test-data flights will be affordable or not and check our predictions with the confusionMatrix()
function.
fitControl <- trainControl(method = "cv", number = 5) # Cross Validation.
logitFit <- train(formula(fit_logit_step), # We use the variables that we obtained from the stepwise algorithm.
data = train_data,
method = "glm",
trControl = fitControl)
# Result:
confusionMatrix(predict(logitFit, test_data), test_data$affordable, positive = "TRUE")
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 892 71
## TRUE 104 933
##
## Accuracy : 0.9125
## 95% CI : (0.8993, 0.9245)
## No Information Rate : 0.502
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.825
##
## Mcnemar's Test P-Value : 0.01556
##
## Sensitivity : 0.9293
## Specificity : 0.8956
## Pos Pred Value : 0.8997
## Neg Pred Value : 0.9263
## Prevalence : 0.5020
## Detection Rate : 0.4665
## Detection Prevalence : 0.5185
## Balanced Accuracy : 0.9124
##
## 'Positive' Class : TRUE
##
Here we can see that the model performs reasonably well, with an accuracy of 91%, meaning that around 91% of the model’s predictions are correct. There are somewhat more false positives (104) than false negatives (71), which might suggest a slight bias of the model towards predicting “affordable”, but it could also just be chance.
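As a quick sanity check, the headline statistics can be reproduced directly from the four cell counts of the confusion matrix above:

```r
# Cell counts from the confusion matrix (positive class = TRUE):
tp <- 933; tn <- 892; fp <- 104; fn <- 71
accuracy    <- (tp + tn) / (tp + tn + fp + fn) # 0.9125
sensitivity <- tp / (tp + fn)                  # 0.9293
specificity <- tn / (tn + fp)                  # 0.8956
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 4)
```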
For the k-NN classification, we predict the categorical variable affordable using all the other variables in the data set.
We first prepare the data set. Because k-NN involves calculating distances between data points, all predictors must be numeric, and features measured on different scales must be standardized so that no single variable dominates the distance. We therefore scale the numeric variables and dummy code the categorical ones.
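To illustrate why scaling matters, scale() standardizes each column to mean 0 and standard deviation 1, so a variable measured in larger units cannot dominate the distance purely because of its scale. A toy sketch with made-up values:

```r
# Two made-up features on very different scales:
x <- data.frame(duration = c(2, 10, 20), days_left = c(1, 15, 45))
z <- scale(x)    # standardize each column: (value - mean) / sd
colMeans(z)      # approximately 0 for both columns
apply(z, 2, sd)  # exactly 1 for both columns
```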
First, we determine which variables are already numeric:
str(flightdata)
## 'data.frame': 10000 obs. of 10 variables:
## $ airline : Factor w/ 6 levels "Air_India","AirAsia",..: 6 3 1 4 1 1 4 6 5 6 ...
## $ source_city : Factor w/ 6 levels "Bangalore","Chennai",..: 3 5 5 3 2 1 3 5 3 4 ...
## $ departure_time : Factor w/ 6 levels "Afternoon","Early_Morning",..: 6 2 5 1 2 2 6 6 2 5 ...
## $ stops : Factor w/ 3 levels "one","two_or_more",..: 1 1 1 1 1 1 1 1 3 1 ...
## $ arrival_time : Factor w/ 6 levels "Afternoon","Early_Morning",..: 3 1 6 3 6 3 5 6 5 5 ...
## $ destination_city: Factor w/ 6 levels "Bangalore","Chennai",..: 5 3 4 5 1 2 1 4 5 5 ...
## $ class : Factor w/ 2 levels "Business","Economy": 2 2 2 2 1 2 2 1 2 1 ...
## $ duration : num 20.17 10.83 11.92 4.17 13.92 ...
## $ days_left : int 25 31 3 30 27 13 21 5 31 32 ...
## $ affordable : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 1 1 2 1 2 1 ...
We scale the already numeric variables first:
flightdata[,c("duration", "days_left")]<- scale(flightdata[,c("duration", "days_left")])
Then we proceed to dummy code variables that have three or more levels:
library("psych") # dummy.code() comes from the psych package
airline          <- as.data.frame(dummy.code(flightdata$airline))
source_city      <- as.data.frame(dummy.code(flightdata$source_city))
departure_time   <- as.data.frame(dummy.code(flightdata$departure_time))
stops            <- as.data.frame(dummy.code(flightdata$stops))
arrival_time     <- as.data.frame(dummy.code(flightdata$arrival_time))
destination_city <- as.data.frame(dummy.code(flightdata$destination_city))
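For reference, the same one-hot encoding can be produced in base R with model.matrix(); like dummy.code(), this keeps an indicator column for every level when the intercept is removed, rather than dropping a reference level. A small sketch on a toy factor:

```r
# Toy factor with the same levels as the stops variable:
f <- factor(c("one", "zero", "two_or_more", "one"))
# Removing the intercept (~ f - 1) yields one indicator column per level:
model.matrix(~ f - 1)
```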
We combine the new dummy variables with the original data set:
flightdata <- cbind(flightdata, airline, source_city, departure_time, stops, arrival_time, destination_city)
We remove the original categorical variables (columns 1 to 7, which also drops class), then extract the outcome variable affordable into its own data frame and remove it from the predictor set.
flightdata <- flightdata[,-(1:7)] # drop the original factor columns
outcome <- as.data.frame(flightdata[,3]) # affordable is now column 3
flightdata <- flightdata[,-3] # remove the outcome from the predictors
We split the data into training and test sets. We partition 80% of the data into the training set and the remaining 20% into the test set.
# 80% of the sample size:
smp_size <- floor(0.8 * nrow(flightdata))
train_ind <- sample(seq_len(nrow(flightdata)), size = smp_size)
# Creating test and training sets that contain all of the predictors.
class_pred_train <- flightdata[train_ind, ]
class_pred_test <- flightdata[-train_ind, ]
We split the outcome variable into training and test sets using the same partition as above.
outcome_train <- outcome[train_ind, ]
outcome_train<-factor(outcome_train)
outcome_test <- outcome[-train_ind, ]
outcome_test <- data.frame(outcome_test)
We use the caret package to run the k-NN classification. The train() function picks the optimal number of neighbors for us via resampling:
pred_caret <- caret::train(class_pred_train, outcome_train, method = "knn", preProcess = c("center","scale"))
Looking at the output of the k-NN model, we can see that it chose k = 9:
pred_caret
## k-Nearest Neighbors
##
## 8000 samples
## 35 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## Pre-processing: centered (35), scaled (35)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 8000, 8000, 8000, 8000, 8000, 8000, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6842963 0.3685325
## 7 0.6912754 0.3824386
## 9 0.6956270 0.3910731
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
We can visualize the accuracy for different numbers of neighbors:
plot(pred_caret)
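By default, train() only evaluates k = 5, 7, 9. A wider search could be requested explicitly through the tuneGrid argument; below is a self-contained sketch on the built-in iris data (whether a larger k would actually improve accuracy on the flight data is not guaranteed):

```r
library(caret)
set.seed(1)
# Search over odd k from 5 to 25 with 5-fold cross-validation:
fit_knn_grid <- train(iris[, 1:4], iris$Species, method = "knn",
                      preProcess = c("center", "scale"),
                      trControl = trainControl(method = "cv", number = 5),
                      tuneGrid = data.frame(k = seq(5, 25, by = 2)))
fit_knn_grid$bestTune # the k with the highest cross-validated accuracy
```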
Next, we compare the predicted affordability to the actual values in the test set. The confusion matrix indicates how well our model predicted the actual values and also reports overall model statistics:
knnPredict <- predict(pred_caret, newdata = class_pred_test)
outcome_test$outcome_test<- factor(outcome_test$outcome_test)
confusionMatrix(knnPredict, outcome_test$outcome_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 771 352
## TRUE 234 643
##
## Accuracy : 0.707
## 95% CI : (0.6865, 0.7269)
## No Information Rate : 0.5025
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4136
##
## Mcnemar's Test P-Value : 1.343e-06
##
## Sensitivity : 0.7672
## Specificity : 0.6462
## Pos Pred Value : 0.6866
## Neg Pred Value : 0.7332
## Prevalence : 0.5025
## Detection Rate : 0.3855
## Detection Prevalence : 0.5615
## Balanced Accuracy : 0.7067
##
## 'Positive' Class : FALSE
##
The model did not perform very well: it classified only 71% of the cases correctly. Its performance can also be evaluated with a variety of other metrics (sensitivity, specificity, etc.) included in the output.
To summarize, we used the caret package to perform a k-NN classification predicting the affordability of a flight. This model may not have matched the accuracy of the logistic model for a number of reasons; in particular, the majority of our predictors were dummy-coded categorical variables, which are not ideally suited to a distance-based method such as k-NN.
Based on the various measures and results, we have seen that the Random Forest model outperformed the linear regression model when predicting a quantitative response, while in the classification task the logistic regression outperformed the k-NN model on all measures (accuracy, recall, precision).
What we may conclude from this research is that customers in India (and elsewhere) should consider purchasing a ticket as early as possible, choose a flight that departs late at night, and look for airline offers other than those provided by ‘Air India’ or ‘Vistara’.