Project Description:

The topic of our research project is centered around flight options details. We’re interested in analyzing how different variables affect the price of a flight option, which eventually affects the customer’s decision when looking for a ticket. We’re going to implement different models that describe and predict the prices of the flight tickets.

Motivation:

From our own experience, we can say that the prices of airline tickets often seem a bit arbitrary. Should you wait to book or are you already too late? When is the moment for a bargain? What downsides do you have to accept for a cheap ticket? Well, the factors influencing the price are not really explained transparently. Therefore, there was a certain excitement to see through this system a bit, or at least to get to know it better.

Research Questions:

Before we worked more closely with the data set, we asked ourselves whether we could answer the following questions:

Dataset Description:

The following dataset was downloaded from Kaggle and originally scraped from the ‘Easemytrip’ platform for booking flight tickets. It contains information on prices of flights that are operated by six airlines, departing from India’s greatest cities - at different times. The names of the variables are quite self-explanatory: the original dataset contained, besides the price (originally in Indian Rupees, we converted it into EUR), the airline and the index, the flight number (which we didn’t use), origin and destination city (6 main Indian cities), departure and arrival time, flight duration, number of stopovers, class and days left (from offer to departure day). There are 300,153 observations and 12 variables in this dataset.

Project Research:

Loading the data:

flightdata <- read.csv("flightprice.csv")

Seed & Libraries:

set.seed(1)
library("corrplot")
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
library("caret")
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: ggplot2
## Loading required package: lattice
library("randomForest")
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library("psych")
## Warning: package 'psych' was built under R version 4.1.3
## 
## Attaching package: 'psych'
## The following object is masked from 'package:randomForest':
## 
##     outlier
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Data Inspection:

any(is.na(flightdata))
## [1] FALSE

The is.na command helps us determine whether our dataset contains any missing values, which is not the case here.

head(flightdata)
##   X  airline  flight source_city departure_time stops  arrival_time
## 1 0 SpiceJet SG-8709       Delhi        Evening  zero         Night
## 2 1 SpiceJet SG-8157       Delhi  Early_Morning  zero       Morning
## 3 2  AirAsia  I5-764       Delhi  Early_Morning  zero Early_Morning
## 4 3  Vistara  UK-995       Delhi        Morning  zero     Afternoon
## 5 4  Vistara  UK-963       Delhi        Morning  zero       Morning
## 6 5  Vistara  UK-945       Delhi        Morning  zero     Afternoon
##   destination_city   class duration days_left price
## 1           Mumbai Economy     2.17         1  5953
## 2           Mumbai Economy     2.33         1  5953
## 3           Mumbai Economy     2.17         1  5956
## 4           Mumbai Economy     2.25         1  5955
## 5           Mumbai Economy     2.33         1  5955
## 6           Mumbai Economy     2.33         1  5955

With the help of the head() command, we can analyze the top rows of the dataset.

str(flightdata)
## 'data.frame':    300153 obs. of  12 variables:
##  $ X               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ airline         : chr  "SpiceJet" "SpiceJet" "AirAsia" "Vistara" ...
##  $ flight          : chr  "SG-8709" "SG-8157" "I5-764" "UK-995" ...
##  $ source_city     : chr  "Delhi" "Delhi" "Delhi" "Delhi" ...
##  $ departure_time  : chr  "Evening" "Early_Morning" "Early_Morning" "Morning" ...
##  $ stops           : chr  "zero" "zero" "zero" "zero" ...
##  $ arrival_time    : chr  "Night" "Morning" "Early_Morning" "Afternoon" ...
##  $ destination_city: chr  "Mumbai" "Mumbai" "Mumbai" "Mumbai" ...
##  $ class           : chr  "Economy" "Economy" "Economy" "Economy" ...
##  $ duration        : num  2.17 2.33 2.17 2.25 2.33 2.33 2.08 2.17 2.17 2.25 ...
##  $ days_left       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ price           : int  5953 5953 5956 5955 5955 5955 6060 6060 5954 5954 ...
summary(flightdata)
##        X            airline             flight          source_city       
##  Min.   :     0   Length:300153      Length:300153      Length:300153     
##  1st Qu.: 75038   Class :character   Class :character   Class :character  
##  Median :150076   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :150076                                                           
##  3rd Qu.:225114                                                           
##  Max.   :300152                                                           
##  departure_time        stops           arrival_time       destination_city  
##  Length:300153      Length:300153      Length:300153      Length:300153     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     class              duration       days_left      price       
##  Length:300153      Min.   : 0.83   Min.   : 1   Min.   :  1105  
##  Class :character   1st Qu.: 6.83   1st Qu.:15   1st Qu.:  4783  
##  Mode  :character   Median :11.25   Median :26   Median :  7425  
##                     Mean   :12.22   Mean   :26   Mean   : 20890  
##                     3rd Qu.:16.17   3rd Qu.:38   3rd Qu.: 42521  
##                     Max.   :49.83   Max.   :49   Max.   :123071

Judging by the structure and summary, we can observe that all of our categorical variables were loaded as character variables, which means that the data needs further pre-processing.

Data Pre-processing:

flightdata <- flightdata[sample(nrow(flightdata), 10000), ] 
flightdata <- flightdata[, -c(1,3)]
flightdata$airline <- as.factor(flightdata$airline)
flightdata$source_city <- as.factor(flightdata$source_city)
flightdata$departure_time <- as.factor(flightdata$departure_time)
flightdata$stops <- as.factor(flightdata$stops)
flightdata$arrival_time <- as.factor(flightdata$arrival_time)
flightdata$destination_city <- as.factor(flightdata$destination_city)
flightdata$class <- as.factor(flightdata$class)
flightdata$price <- flightdata$price / 83.89
head(flightdata)
##          airline source_city departure_time stops arrival_time destination_city
## 24388    Vistara       Delhi          Night   one      Evening          Kolkata
## 124413  GO_FIRST     Kolkata  Early_Morning   one    Afternoon            Delhi
## 142643 Air_India     Kolkata        Morning   one        Night        Hyderabad
## 25173     Indigo       Delhi      Afternoon   one      Evening          Kolkata
## 294762 Air_India     Chennai  Early_Morning   one        Night        Bangalore
## 116487 Air_India   Bangalore  Early_Morning   one      Evening          Chennai
##           class duration days_left     price
## 24388   Economy    20.17        25  78.41221
## 124413  Economy    10.83        31  75.38443
## 142643  Economy    11.92         3 141.18488
## 25173   Economy     4.17        30  56.74097
## 294762 Business    13.92        27 718.32161
## 116487  Economy    12.83        13 188.15115
  • Firstly, we have randomly subsetted the data to 10,000 rows, because of limited computational power.
  • Next, we have factored all of the categorical variables that had ‘chr’ as datatype.
  • Lastly, we divided the price category by the currency exchange to get prices in EUR (as of 01/04/22), for easier interpretation.

Descriptive Analysis & Visual Interpretation:

plot(flightdata$duration, flightdata$price,  col=flightdata$class, ylab = "Price, EUR", xlab = "Duration of Trip, h")
legend('topright', legend = levels(flightdata$class), col = 1:2, cex = 1, pch = 1)

We have plotted the prices of the tickets over duration of flights in hours. Based on this scatter plot, we can observe no correlation between the two variables. Yet, we can clearly observe two clusters, where the black-dotted observations are “Business” class flights, and red-dotted are “Economy”. While there may be no difference in duration of trips between the two clusters, the “Business” class tickets are on average more expensive than “Economy” ones.

plot(price ~ departure_time, col = 2:6, data = flightdata, xlab = "Departure Time", ylab = "Price")

Judging by this boxplot, we cannot derive any conclusions whether the tickets are more expensive during different times of flight. However, we may see that the median for ‘Late Night’ tickets is highly lower in comparison to the rest, and that the Night category has the highest upper quartile.

plot(price ~ airline, col = 2:7, ylab = "Price", xlab = "Airline", data = flightdata)

Here we plotted the prices by each airline. We can see that “Air India” and “Vistara”’s prices are are higher on average in comparison to the other airlines.

plot(price ~ source_city, col = 2:7, ylab = "Price", xlab = "Source City", data = flightdata)

In the following boxplot, we cannot spot any major difference in prices between the different cities where planes are commencing their flights.

plot(price ~ class, col = 2:3, ylab = "Price", xlab = "Class", data = flightdata)

In this boxplot, we can observe a clear difference in prices between the flight tickets of the “Economy” and “Business” classes. “Business” class tickets have a higher cost, with a median of nearly 620 EUR, while the “Economy” class has a median of 150 EUR. In both instances we can observe a large number of outliers.

corr.data <- cor(flightdata[,8:10])
corrplot(corr.data, method="circle")

Here we have graphed the correlation plot between the three numeric variables, mainly “price”, “days_left” and “duration”. We can observe that there is a positive correlation (~ 0.3) between the price and duration of the flight, which may mean that longer distance flights are more expensive. The “days_left” variable, which relates to the number of days between the booking and flight date, has a negative correlation with “price” (~ -0.2), which may mean that if the customer purchases a ticket which is closer to the flight date, it will be more expensive.

Prediction:

Linear Regression:

Now, we are going to tackle our first question of predicting flight ticket prices. We will start by performing a Multiple Linear Regression, in order to observe the effects of various variables upon ticket prices. We are going to execute different models and test which one performed best, based on criteria such as the Adjusted R-Squared and AIC. In all of our regression models, we will use a 95% confidence level to identify significant variables, and conclude whether there is enough evidence to reject the null hypothesis \(H_0\):

  • \(H_0\): The variables do not have an effect over price.
  • \(H_1\): The variables do have an effect over price.

Full Model:

regAll <- lm(price ~ (relevel(as.factor(flightdata$class), ref="Economy")) + airline + source_city + 
               departure_time + stops + arrival_time + destination_city + duration + days_left, data = flightdata)
summary(regAll)
## 
## Call:
## lm(formula = price ~ (relevel(as.factor(flightdata$class), ref = "Economy")) + 
##     airline + source_city + departure_time + stops + arrival_time + 
##     destination_city + duration + days_left, data = flightdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -364.28  -37.71   -4.63   36.21  582.71 
## 
## Coefficients:
##                                                                Estimate
## (Intercept)                                                    99.14165
## relevel(as.factor(flightdata$class), ref = "Economy")Business 535.58397
## airlineAirAsia                                                 -2.09458
## airlineGO_FIRST                                                18.35166
## airlineIndigo                                                  22.23502
## airlineSpiceJet                                                23.22496
## airlineVistara                                                 46.70121
## source_cityChennai                                             -0.22193
## source_cityDelhi                                              -18.15947
## source_cityHyderabad                                          -19.10516
## source_cityKolkata                                             19.98255
## source_cityMumbai                                              -2.70147
## departure_timeEarly_Morning                                     6.83121
## departure_timeEvening                                           5.31027
## departure_timeLate_Night                                       25.03907
## departure_timeMorning                                           9.25449
## departure_timeNight                                            10.48042
## stopstwo_or_more                                               18.06302
## stopszero                                                     -89.15850
## arrival_timeEarly_Morning                                     -17.65285
## arrival_timeEvening                                            10.69180
## arrival_timeLate_Night                                         14.51292
## arrival_timeMorning                                             4.47077
## arrival_timeNight                                              14.80507
## destination_cityChennai                                        -7.57747
## destination_cityDelhi                                         -22.78900
## destination_cityHyderabad                                     -27.01614
## destination_cityKolkata                                        14.70637
## destination_cityMumbai                                         -5.91493
## duration                                                        0.41830
## days_left                                                      -1.53358
##                                                               Std. Error
## (Intercept)                                                      5.44487
## relevel(as.factor(flightdata$class), ref = "Economy")Business    1.95146
## airlineAirAsia                                                   4.09470
## airlineGO_FIRST                                                  3.55754
## airlineIndigo                                                    3.07697
## airlineSpiceJet                                                  5.19317
## airlineVistara                                                   2.01159
## source_cityChennai                                               3.02855
## source_cityDelhi                                                 2.72529
## source_cityHyderabad                                             2.97799
## source_cityKolkata                                               2.90121
## source_cityMumbai                                                2.72055
## departure_timeEarly_Morning                                      2.70139
## departure_timeEvening                                            2.76340
## departure_timeLate_Night                                        11.29609
## departure_timeMorning                                            2.64686
## departure_timeNight                                              2.97815
## stopstwo_or_more                                                 4.05699
## stopszero                                                        3.01670
## arrival_timeEarly_Morning                                        4.29275
## arrival_timeEvening                                              2.80675
## arrival_timeLate_Night                                           4.48158
## arrival_timeMorning                                              2.98180
## arrival_timeNight                                                2.77167
## destination_cityChennai                                          2.97584
## destination_cityDelhi                                            2.81270
## destination_cityHyderabad                                        3.00811
## destination_cityKolkata                                          2.84974
## destination_cityMumbai                                           2.75893
## duration                                                         0.15306
## days_left                                                        0.05908
##                                                               t value Pr(>|t|)
## (Intercept)                                                    18.208  < 2e-16
## relevel(as.factor(flightdata$class), ref = "Economy")Business 274.453  < 2e-16
## airlineAirAsia                                                 -0.512 0.608988
## airlineGO_FIRST                                                 5.159 2.54e-07
## airlineIndigo                                                   7.226 5.33e-13
## airlineSpiceJet                                                 4.472 7.83e-06
## airlineVistara                                                 23.216  < 2e-16
## source_cityChennai                                             -0.073 0.941585
## source_cityDelhi                                               -6.663 2.82e-11
## source_cityHyderabad                                           -6.415 1.47e-10
## source_cityKolkata                                              6.888 6.01e-12
## source_cityMumbai                                              -0.993 0.320740
## departure_timeEarly_Morning                                     2.529 0.011462
## departure_timeEvening                                           1.922 0.054679
## departure_timeLate_Night                                        2.217 0.026672
## departure_timeMorning                                           3.496 0.000474
## departure_timeNight                                             3.519 0.000435
## stopstwo_or_more                                                4.452 8.59e-06
## stopszero                                                     -29.555  < 2e-16
## arrival_timeEarly_Morning                                      -4.112 3.95e-05
## arrival_timeEvening                                             3.809 0.000140
## arrival_timeLate_Night                                          3.238 0.001206
## arrival_timeMorning                                             1.499 0.133814
## arrival_timeNight                                               5.342 9.42e-08
## destination_cityChennai                                        -2.546 0.010901
## destination_cityDelhi                                          -8.102 6.03e-16
## destination_cityHyderabad                                      -8.981  < 2e-16
## destination_cityKolkata                                         5.161 2.51e-07
## destination_cityMumbai                                         -2.144 0.032063
## duration                                                        2.733 0.006288
## days_left                                                     -25.960  < 2e-16
##                                                                  
## (Intercept)                                                   ***
## relevel(as.factor(flightdata$class), ref = "Economy")Business ***
## airlineAirAsia                                                   
## airlineGO_FIRST                                               ***
## airlineIndigo                                                 ***
## airlineSpiceJet                                               ***
## airlineVistara                                                ***
## source_cityChennai                                               
## source_cityDelhi                                              ***
## source_cityHyderabad                                          ***
## source_cityKolkata                                            ***
## source_cityMumbai                                                
## departure_timeEarly_Morning                                   *  
## departure_timeEvening                                         .  
## departure_timeLate_Night                                      *  
## departure_timeMorning                                         ***
## departure_timeNight                                           ***
## stopstwo_or_more                                              ***
## stopszero                                                     ***
## arrival_timeEarly_Morning                                     ***
## arrival_timeEvening                                           ***
## arrival_timeLate_Night                                        ** 
## arrival_timeMorning                                              
## arrival_timeNight                                             ***
## destination_cityChennai                                       *  
## destination_cityDelhi                                         ***
## destination_cityHyderabad                                     ***
## destination_cityKolkata                                       ***
## destination_cityMumbai                                        *  
## duration                                                      ** 
## days_left                                                     ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.27 on 9969 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.9134 
## F-statistic:  3515 on 30 and 9969 DF,  p-value: < 2.2e-16

In the regression above, we have included all of the variables that could explain the changes in price. We have releveled the “class” variable, so that the “Economy” value would be taken as baseline. The fitted regression line for the model is: \[ \widehat{Price} = 99.14165 + 535.58397 \cdot classBusiness - 2.09458 \cdot airlineAirAsia + 18.35166 \cdot airlineGO\_First + ... - 1.53358 \cdot days\_left \] Interpretation:

  • We can observe how most of our variables are significant, since the p-values are < 0.05, and the t-values are > |1.96|.
  • \(\beta_0 = 99.14165\), the intercept, is the average value of ticket price at the baseline, when all the other variables are equal to 0. Since most of our variables are categorical, the regression has taken some values as the baseline. If we go from the baseline to our values, the price of the ticket will either increase or decrease by the \(\beta_i\) coefficients.
  • As an example for one of the betas: \(\beta_1 = 535.58397\), the slope for “Business” class, is the marginal effect of the upper class ticket on price, keeping all things constant. If one wants to purchase a business class ticket instead of the economy one, the price will increase on average by 535.58 EUR, ceteris paribus.
  • The \(R^2\) value is 0.9136, which means that the model explains 91.36% of the variability in price.

Adjusted Model:

Now, we will proceed with an adjusted model, where we are going to select the variables that we think might explain prices the best.

regAdj <- lm(price ~ (relevel(as.factor(flightdata$class), ref="Economy")) + stops + duration + destination_city + 
               days_left, data = flightdata)
summary(regAdj)
## 
## Call:
## lm(formula = price ~ (relevel(as.factor(flightdata$class), ref = "Economy")) + 
##     stops + duration + destination_city + days_left, data = flightdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -378.29  -36.27   -6.04   36.57  622.51 
## 
## Coefficients:
##                                                               Estimate
## (Intercept)                                                   134.5560
## relevel(as.factor(flightdata$class), ref = "Economy")Business 543.7162
## stopstwo_or_more                                               21.9959
## stopszero                                                     -95.3687
## duration                                                        0.2858
## destination_cityChennai                                        -8.1186
## destination_cityDelhi                                         -18.6953
## destination_cityHyderabad                                     -23.4356
## destination_cityKolkata                                         9.6827
## destination_cityMumbai                                         -7.5207
## days_left                                                      -1.5367
##                                                               Std. Error
## (Intercept)                                                       3.2776
## relevel(as.factor(flightdata$class), ref = "Economy")Business     1.8346
## stopstwo_or_more                                                  4.1823
## stopszero                                                         3.0676
## duration                                                          0.1398
## destination_cityChennai                                           3.0584
## destination_cityDelhi                                             2.8217
## destination_cityHyderabad                                         3.0739
## destination_cityKolkata                                           2.8967
## destination_cityMumbai                                            2.7921
## days_left                                                         0.0618
##                                                               t value Pr(>|t|)
## (Intercept)                                                    41.053  < 2e-16
## relevel(as.factor(flightdata$class), ref = "Economy")Business 296.370  < 2e-16
## stopstwo_or_more                                                5.259 1.48e-07
## stopszero                                                     -31.089  < 2e-16
## duration                                                        2.044 0.040962
## destination_cityChennai                                        -2.654 0.007956
## destination_cityDelhi                                          -6.626 3.64e-11
## destination_cityHyderabad                                      -7.624 2.68e-14
## destination_cityKolkata                                         3.343 0.000833
## destination_cityMumbai                                         -2.694 0.007080
## days_left                                                     -24.866  < 2e-16
##                                                                  
## (Intercept)                                                   ***
## relevel(as.factor(flightdata$class), ref = "Economy")Business ***
## stopstwo_or_more                                              ***
## stopszero                                                     ***
## duration                                                      *  
## destination_cityChennai                                       ** 
## destination_cityDelhi                                         ***
## destination_cityHyderabad                                     ***
## destination_cityKolkata                                       ***
## destination_cityMumbai                                        ** 
## days_left                                                     ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84.14 on 9989 degrees of freedom
## Multiple R-squared:  0.9049, Adjusted R-squared:  0.9048 
## F-statistic:  9506 on 10 and 9989 DF,  p-value: < 2.2e-16

We have included the following variables:

  • Class - releveled by “Economy”, since it is known that there is a discrepancy in ticket prices between “Economy” and “Business” class tickets.
  • Stops - If a flight has more stops, it may mean that the trip duration is longer.
  • Duration - If a flight has longer hours, it can highly influence price.
  • Destination City - Flying to specific cities during holidays or when the demand is high for the destination may mean that tickets are more expensive.
  • Days Left - If a person buys a ticket in advance, they’re more likely to buy a cheaper ticket, compared to a couple days before the flight.

We can observe that this model performs worse than the model above, since the Adjusted R-Squared value is 0.9048, as opposed to the previous value of 0.9134.

Forward Stepwise:

Now, we will perform a forward stepwise regression, where we start with the model that contains an intercept but no predictors. Next, we add to the null model the variables that result in the lowest AIC, and conclude which model performs better. Stepwise output is hidden.

startmodel = lm(price ~ 1, data = flightdata)
regStep = step(startmodel, direction="forward", scope = formula(regAll))
summary(regStep)
## 
## Call:
## lm(formula = price ~ relevel(as.factor(flightdata$class), ref = "Economy") + 
##     stops + days_left + airline + destination_city + source_city + 
##     arrival_time + departure_time + duration, data = flightdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -364.28  -37.71   -4.63   36.21  582.71 
## 
## Coefficients:
##                                                                Estimate
## (Intercept)                                                    99.14165
## relevel(as.factor(flightdata$class), ref = "Economy")Business 535.58397
## stopstwo_or_more                                               18.06302
## stopszero                                                     -89.15850
## days_left                                                      -1.53358
## airlineAirAsia                                                 -2.09458
## airlineGO_FIRST                                                18.35166
## airlineIndigo                                                  22.23502
## airlineSpiceJet                                                23.22496
## airlineVistara                                                 46.70121
## destination_cityChennai                                        -7.57747
## destination_cityDelhi                                         -22.78900
## destination_cityHyderabad                                     -27.01614
## destination_cityKolkata                                        14.70637
## destination_cityMumbai                                         -5.91493
## source_cityChennai                                             -0.22193
## source_cityDelhi                                              -18.15947
## source_cityHyderabad                                          -19.10516
## source_cityKolkata                                             19.98255
## source_cityMumbai                                              -2.70147
## arrival_timeEarly_Morning                                     -17.65285
## arrival_timeEvening                                            10.69180
## arrival_timeLate_Night                                         14.51292
## arrival_timeMorning                                             4.47077
## arrival_timeNight                                              14.80507
## departure_timeEarly_Morning                                     6.83121
## departure_timeEvening                                           5.31027
## departure_timeLate_Night                                       25.03907
## departure_timeMorning                                           9.25449
## departure_timeNight                                            10.48042
## duration                                                        0.41830
##                                                               Std. Error
## (Intercept)                                                      5.44487
## relevel(as.factor(flightdata$class), ref = "Economy")Business    1.95146
## stopstwo_or_more                                                 4.05699
## stopszero                                                        3.01670
## days_left                                                        0.05908
## airlineAirAsia                                                   4.09470
## airlineGO_FIRST                                                  3.55754
## airlineIndigo                                                    3.07697
## airlineSpiceJet                                                  5.19317
## airlineVistara                                                   2.01159
## destination_cityChennai                                          2.97584
## destination_cityDelhi                                            2.81270
## destination_cityHyderabad                                        3.00811
## destination_cityKolkata                                          2.84974
## destination_cityMumbai                                           2.75893
## source_cityChennai                                               3.02855
## source_cityDelhi                                                 2.72529
## source_cityHyderabad                                             2.97799
## source_cityKolkata                                               2.90121
## source_cityMumbai                                                2.72055
## arrival_timeEarly_Morning                                        4.29275
## arrival_timeEvening                                              2.80675
## arrival_timeLate_Night                                           4.48158
## arrival_timeMorning                                              2.98180
## arrival_timeNight                                                2.77167
## departure_timeEarly_Morning                                      2.70139
## departure_timeEvening                                            2.76340
## departure_timeLate_Night                                        11.29609
## departure_timeMorning                                            2.64686
## departure_timeNight                                              2.97815
## duration                                                         0.15306
##                                                               t value Pr(>|t|)
## (Intercept)                                                    18.208  < 2e-16
## relevel(as.factor(flightdata$class), ref = "Economy")Business 274.453  < 2e-16
## stopstwo_or_more                                                4.452 8.59e-06
## stopszero                                                     -29.555  < 2e-16
## days_left                                                     -25.960  < 2e-16
## airlineAirAsia                                                 -0.512 0.608988
## airlineGO_FIRST                                                 5.159 2.54e-07
## airlineIndigo                                                   7.226 5.33e-13
## airlineSpiceJet                                                 4.472 7.83e-06
## airlineVistara                                                 23.216  < 2e-16
## destination_cityChennai                                        -2.546 0.010901
## destination_cityDelhi                                          -8.102 6.03e-16
## destination_cityHyderabad                                      -8.981  < 2e-16
## destination_cityKolkata                                         5.161 2.51e-07
## destination_cityMumbai                                         -2.144 0.032063
## source_cityChennai                                             -0.073 0.941585
## source_cityDelhi                                               -6.663 2.82e-11
## source_cityHyderabad                                           -6.415 1.47e-10
## source_cityKolkata                                              6.888 6.01e-12
## source_cityMumbai                                              -0.993 0.320740
## arrival_timeEarly_Morning                                      -4.112 3.95e-05
## arrival_timeEvening                                             3.809 0.000140
## arrival_timeLate_Night                                          3.238 0.001206
## arrival_timeMorning                                             1.499 0.133814
## arrival_timeNight                                               5.342 9.42e-08
## departure_timeEarly_Morning                                     2.529 0.011462
## departure_timeEvening                                           1.922 0.054679
## departure_timeLate_Night                                        2.217 0.026672
## departure_timeMorning                                           3.496 0.000474
## departure_timeNight                                             3.519 0.000435
## duration                                                        2.733 0.006288
##                                                                  
## (Intercept)                                                   ***
## relevel(as.factor(flightdata$class), ref = "Economy")Business ***
## stopstwo_or_more                                              ***
## stopszero                                                     ***
## days_left                                                     ***
## airlineAirAsia                                                   
## airlineGO_FIRST                                               ***
## airlineIndigo                                                 ***
## airlineSpiceJet                                               ***
## airlineVistara                                                ***
## destination_cityChennai                                       *  
## destination_cityDelhi                                         ***
## destination_cityHyderabad                                     ***
## destination_cityKolkata                                       ***
## destination_cityMumbai                                        *  
## source_cityChennai                                               
## source_cityDelhi                                              ***
## source_cityHyderabad                                          ***
## source_cityKolkata                                            ***
## source_cityMumbai                                                
## arrival_timeEarly_Morning                                     ***
## arrival_timeEvening                                           ***
## arrival_timeLate_Night                                        ** 
## arrival_timeMorning                                              
## arrival_timeNight                                             ***
## departure_timeEarly_Morning                                   *  
## departure_timeEvening                                         .  
## departure_timeLate_Night                                      *  
## departure_timeMorning                                         ***
## departure_timeNight                                           ***
## duration                                                      ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.27 on 9969 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.9134 
## F-statistic:  3515 on 30 and 9969 DF,  p-value: < 2.2e-16
AIC(regAll, regStep)
##         df      AIC
## regAll  32 116120.1
## regStep 32 116120.1

The stepwise() model includes all of the variables that we’ve had in the complete model. Albeit a rare occasion, this may mean that each of our variables in the dataset has a great effect on price, hence the function decided that it’s best to include them all. Therefore, the AIC and Adjusted R-Squared values haven’t changed.

AIC(regStep, regAdj)
##         df      AIC
## regStep 32 116120.1
## regAdj  12 117042.2

However, in comparison to our adjusted model, the stepwise model performs far better, since the AIC value is lower.

Out-of-sample Performance:

Now, we will test the predictive power of the models, by segmenting the data into a training and test data. Firstly, we will estimate the models on the training sample, after which we will evaluate their performances on the test sample.

Segmentation:

random = sample(1:nrow(flightdata), 0.8*nrow(flightdata))
train_data = flightdata[random,]
test_data = flightdata[-random,]

We’re going to split the data into 80% for the training data and 20% for the test data.

Regressions:

fullmodel = lm(price ~ class + airline + source_city + departure_time + stops + arrival_time + 
                 destination_city + duration + days_left, data = train_data)

adjmodel = lm(price ~ + stops + duration + destination_city + class + days_left, data=train_data)

startmodel1 = lm(price ~ 1, data=train_data)
regStep1 = step(startmodel1, direction="forward", scope=formula(fullmodel))

Re-executing the regressions, but now on the training data. Stepwise output hidden.

Prediction errors:

y_hat_full = predict(fullmodel, newdata = test_data)
y_hat_adjmodel = predict(adjmodel, newdata = test_data)
y_hat_regstep = predict(regStep1, newdata = test_data)

Computing the prediction errors on the test data. They will be used for calculating the Root Mean Squared Error (RMSE).

RMSE:

sqrt(mean((y_hat_full - test_data$price)^2)) # Full model.
## [1] 79.35446
sqrt(mean((y_hat_adjmodel - test_data$price)^2)) # Adjusted model.
## [1] 82.58473
sqrt(mean((y_hat_regstep - test_data$price)^2)) # Forward selected model.
## [1] 79.35446

We can observe that the lowest RMSE is achieved by both the full model & stepwise models. The in-sample performance showed the exact same results as the out-of-sample performance. We may now proceed with the Random Forests exercise.

Random Forests:

We also decided to use random forests to predict the flight prices. Since there are 9 predictors, we could have set the hyper parameter mtry=3. However, we noticed that increasing it to 4 yielded significantly better results (lower RMSE) without increasing too much computation time.

rf <- randomForest(price ~ ., data=flightdata, importance=TRUE, mtry=4)
rf
## 
## Call:
##  randomForest(formula = price ~ ., data = flightdata, importance = TRUE,      mtry = 4) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 2287.793
##                     % Var explained: 96.92
varImpPlot(rf)

Above we can see that proportion of variance explained is higher than the ones we had in the linear regression models. We may also visualize the importance of the variables.

Model Comparisons:

In order to be able to evaluate which model performed better between random forests and linear regression, we use 10-fold cross validation to compute the metrics of performance.

fitControl <- trainControl(method = "cv",number = 10)
rfFit1 <- train(price ~ .,
data = flightdata,
method = "rf",
tuneGrid = data.frame(mtry=4),
trControl = fitControl)
rfFit1
## Random Forest 
## 
## 10000 samples
##     9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9000, 9001, 9001, 9001, 8999, 8999, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   65.11511  0.9558858  44.03098
## 
## Tuning parameter 'mtry' was held constant at a value of 4

We may also check the variables that are most important in determining prices on the model built with the train function.

varImp(rfFit1)
## rf variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##                            Overall
## classEconomy              100.0000
## airlineVistara              6.8982
## duration                    5.3286
## airlineIndigo               3.6531
## days_left                   2.5399
## stopszero                   2.1408
## airlineGO_FIRST             2.0158
## airlineAirAsia              1.5688
## stopstwo_or_more            0.5562
## airlineSpiceJet             0.5405
## destination_cityDelhi       0.3550
## source_cityDelhi            0.3198
## destination_cityKolkata     0.3097
## source_cityKolkata          0.2889
## destination_cityMumbai      0.2589
## destination_cityHyderabad   0.2450
## arrival_timeNight           0.2429
## source_cityMumbai           0.2314
## arrival_timeEvening         0.2128
## arrival_timeEarly_Morning   0.2046

On the table shown above, we confirm that the three most important variables are: class, whether the airline is Vistara or not, and the duration of the flight.

We can see that the model has a Pseudo R-squared of 95.58% and a RMSE of 65.11. Now, we can proceed to test the performance of the linear model also by using 10-fold cross validation.

lmFit1 <- train(price ~ .,
data = flightdata,
method = "lm",
trControl = fitControl)
lmFit1
## Linear Regression 
## 
## 10000 samples
##     9 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 9000, 9000, 8999, 8999, 9000, 9002, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   80.33955  0.9134614  54.64637
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Prediction Summary:

Comparing the random forest model to the linear regression, both using 10-fold cross validation, we conclude that random forests is a better model. The three performance measures are significantly better on random forests than on the linear regression model.

Classification:

In order to find out whether a flight can be seen as affordable by the every-day Indian, we create a new column in the dataset which we derive from the price column. Our assumption is that everything below or equal to the median weekly wage is affordable.

By conducting some research we found that the current median weekly wage in India is 90€, hence we used this to classify the flights as affordable (TRUE or FALSE), thus obtaining a boolean variable.

We can then use this variable to train our models with, and finally predict whether a certain flight ticket will be affordable or not given its route, flight time, airline etc. And affordable will mean for half of India’s wage earners (from definition of median half will be above the median wage), which is quite a large amount of people and hence a big market.

To this end, let’s test the logistic regression and k-NN models.

Logistic Regression:

First, we set up the categorical “affordable” variable to be predicted and remove the “prices” variable, because we don’t want to use them to predict the affordable variable.

med_weekly_salary = 90 # In Euro for median Indian person.

bool = flightdata$price<=med_weekly_salary 
# New column with TRUE/FALSE.
flightdata$affordable = bool
flightdata$affordable = as.factor(flightdata$affordable)
flightdata = flightdata[,-10] # Remove prices.
head(flightdata)
##          airline source_city departure_time stops arrival_time destination_city
## 24388    Vistara       Delhi          Night   one      Evening          Kolkata
## 124413  GO_FIRST     Kolkata  Early_Morning   one    Afternoon            Delhi
## 142643 Air_India     Kolkata        Morning   one        Night        Hyderabad
## 25173     Indigo       Delhi      Afternoon   one      Evening          Kolkata
## 294762 Air_India     Chennai  Early_Morning   one        Night        Bangalore
## 116487 Air_India   Bangalore  Early_Morning   one      Evening          Chennai
##           class duration days_left affordable
## 24388   Economy    20.17        25       TRUE
## 124413  Economy    10.83        31       TRUE
## 142643  Economy    11.92         3      FALSE
## 25173   Economy     4.17        30       TRUE
## 294762 Business    13.92        27      FALSE
## 116487  Economy    12.83        13      FALSE

Now we’ll build the logistic model. Then we will try to remove some of the predictor variables with step(). But as we have seen in the linear regression, all variables are retained here. Moreover, we try the “backward” and “both” (backward and forward) options, yet as we can see, it yields the same model.

fit_logit_all <- glm(affordable ~ . , data = flightdata,family = binomial())
fit_logit_step <- step(fit_logit_all, direction = "both")
fit_logit_step_b <- step(fit_logit_all, direction = "backward")

Below, we see that the three models are the same:

formula(fit_logit_step)==formula(fit_logit_step_b)
## [1] TRUE
formula(fit_logit_step)==formula(fit_logit_all)
## [1] TRUE
summary(fit_logit_step)
## 
## Call:
## glm(formula = affordable ~ airline + source_city + departure_time + 
##     stops + arrival_time + destination_city + class + duration + 
##     days_left, family = binomial(), data = flightdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8494  -0.0001   0.0000   0.3348   2.3405  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -25.366463 255.206645  -0.099  0.92082    
## airlineAirAsia                2.896392   0.251284  11.526  < 2e-16 ***
## airlineGO_FIRST               1.291127   0.159124   8.114 4.90e-16 ***
## airlineIndigo                 0.728799   0.132923   5.483 4.19e-08 ***
## airlineSpiceJet               0.642757   0.200101   3.212  0.00132 ** 
## airlineVistara               -0.543992   0.095862  -5.675 1.39e-08 ***
## source_cityChennai           -0.098120   0.138095  -0.711  0.47738    
## source_cityDelhi              0.015890   0.126563   0.126  0.90009    
## source_cityHyderabad          0.348407   0.140966   2.472  0.01345 *  
## source_cityKolkata           -0.659732   0.132935  -4.963 6.95e-07 ***
## source_cityMumbai             0.203546   0.128167   1.588  0.11226    
## departure_timeEarly_Morning   0.138778   0.124309   1.116  0.26425    
## departure_timeEvening         0.337278   0.128155   2.632  0.00849 ** 
## departure_timeLate_Night     -0.065860   0.635199  -0.104  0.91742    
## departure_timeMorning        -0.159617   0.117832  -1.355  0.17554    
## departure_timeNight           0.234316   0.143948   1.628  0.10357    
## stopstwo_or_more             -1.945816   0.155729 -12.495  < 2e-16 ***
## stopszero                     2.142629   0.189396  11.313  < 2e-16 ***
## arrival_timeEarly_Morning     0.567084   0.213141   2.661  0.00780 ** 
## arrival_timeEvening          -0.233631   0.128573  -1.817  0.06920 .  
## arrival_timeLate_Night       -0.287932   0.212202  -1.357  0.17482    
## arrival_timeMorning           0.260944   0.142134   1.836  0.06637 .  
## arrival_timeNight            -0.008834   0.128745  -0.069  0.94530    
## destination_cityChennai      -0.006227   0.139634  -0.045  0.96443    
## destination_cityDelhi        -0.120066   0.134065  -0.896  0.37047    
## destination_cityHyderabad     0.149540   0.142695   1.048  0.29465    
## destination_cityKolkata      -0.642112   0.131379  -4.887 1.02e-06 ***
## destination_cityMumbai       -0.003401   0.130615  -0.026  0.97923    
## classEconomy                 23.405643 255.206510   0.092  0.92693    
## duration                     -0.019510   0.006827  -2.858  0.00427 ** 
## days_left                     0.140398   0.003893  36.064  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13862.7  on 9999  degrees of freedom
## Residual deviance:  4555.4  on 9969  degrees of freedom
## AIC: 4617.4
## 
## Number of Fisher Scoring iterations: 19

We see that the stepwise logistic regression model achieves an AIC of 4617.4.

Out-of-sample Performance:

Now, let’s see how this model performs in predicting classes. For that we again make an 80%-20% random split of the data into test and train.

random = sample(1:nrow(flightdata), 0.8*nrow(flightdata))
train_data = flightdata[random,]
test_data = flightdata[-random,]

Now, we’ll use the caret package to train the model that we obtained from the stepwise algorithm. Then we predict whether the test data flights will be affordable or not and check our predictions with the confusionMatrix function.

fitControl <- trainControl(method = "cv", number = 5) # Cross Validation.
logitFit <- train(formula(fit_logit_step), # We use the variables that we obtained from the stepwise algorithm.
                  data = train_data, 
                  method = "glm", 
                  trControl = fitControl)

# Result:
confusionMatrix(predict(logitFit, test_data), test_data$affordable, positive = "TRUE")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   892   71
##      TRUE    104  933
##                                           
##                Accuracy : 0.9125          
##                  95% CI : (0.8993, 0.9245)
##     No Information Rate : 0.502           
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.825           
##                                           
##  Mcnemar's Test P-Value : 0.01556         
##                                           
##             Sensitivity : 0.9293          
##             Specificity : 0.8956          
##          Pos Pred Value : 0.8997          
##          Neg Pred Value : 0.9263          
##              Prevalence : 0.5020          
##          Detection Rate : 0.4665          
##    Detection Prevalence : 0.5185          
##       Balanced Accuracy : 0.9124          
##                                           
##        'Positive' Class : TRUE            
## 

Here we can see that the model performs reasonably well with an Accuracy of 91% meaning that around 91% of the model’s predictions are correct. There are around 30% more False Positives than False Negatives, so this might suggest a small bias of the model but it can also just be by chance.

k-NN:

In implementing the k-NN classification, we are going to predict the categorical variable affordable using all the other variables within the data set.

Data Pre-processing:

We make a copy of our data set to prepare it for our k-NN classification. Next we derive the outcome variable, from the price data. Because k-NN involves calculating distances between data points, we must use numeric variables only. First, we scale the data in case our features are on different metrics.

Determining which values are numerical:

str(flightdata)
## 'data.frame':    10000 obs. of  10 variables:
##  $ airline         : Factor w/ 6 levels "Air_India","AirAsia",..: 6 3 1 4 1 1 4 6 5 6 ...
##  $ source_city     : Factor w/ 6 levels "Bangalore","Chennai",..: 3 5 5 3 2 1 3 5 3 4 ...
##  $ departure_time  : Factor w/ 6 levels "Afternoon","Early_Morning",..: 6 2 5 1 2 2 6 6 2 5 ...
##  $ stops           : Factor w/ 3 levels "one","two_or_more",..: 1 1 1 1 1 1 1 1 3 1 ...
##  $ arrival_time    : Factor w/ 6 levels "Afternoon","Early_Morning",..: 3 1 6 3 6 3 5 6 5 5 ...
##  $ destination_city: Factor w/ 6 levels "Bangalore","Chennai",..: 5 3 4 5 1 2 1 4 5 5 ...
##  $ class           : Factor w/ 2 levels "Business","Economy": 2 2 2 2 1 2 2 1 2 1 ...
##  $ duration        : num  20.17 10.83 11.92 4.17 13.92 ...
##  $ days_left       : int  25 31 3 30 27 13 21 5 31 32 ...
##  $ affordable      : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 1 1 2 1 2 1 ...

We scale the already numeric variables first:

flightdata[,c("duration", "days_left")]<- scale(flightdata[,c("duration", "days_left")])

Then we proceed to dummy code variables that have three or more levels:

airline         <-as.data.frame(dummy.code(flightdata$airline))
source_city     <-as.data.frame(dummy.code(flightdata$source_city))
departure_time  <-as.data.frame(dummy.code(flightdata$departure_time))
stops           <-as.data.frame(dummy.code(flightdata$stops))
arrival_time    <-as.data.frame(dummy.code(flightdata$arrival_time))
destination_city<-as.data.frame(dummy.code(flightdata$destination_city))

We combine the new dummy variables with the original data set:

flightdata <- cbind(flightdata, airline, source_city,departure_time ,stops ,arrival_time ,destination_city)

We remove the original variables that had to be dummy coded as well as the outcome variable.

flightdata <- flightdata[,-(1:7)]
outcome<-as.data.frame(flightdata[,3])
flightdata <- flightdata[,-3]

Building the Model:

We split the data into training and test sets. We partition 80% of the data into the training set and the remaining 20% into the test set.

# 80% of the sample size:
smp_size <- floor(0.8 * nrow(flightdata))

train_ind <- sample(seq_len(nrow(flightdata)), size = smp_size)

# Creating test and training sets that contain all of the predictors.
class_pred_train <- flightdata[train_ind, ]
class_pred_test <- flightdata[-train_ind, ]

Split outcome variable into training and test sets using the same partition as above.

outcome_train <- outcome[train_ind, ]
outcome_train<-factor(outcome_train)
outcome_test <- outcome[-train_ind, ]
outcome_test <- data.frame(outcome_test)

We use the caret package to run the k-NN classification. We can make use of the function which picks the optimal number of neighbors for us:

pred_caret <- caret::train(class_pred_train, outcome_train, method = "knn", preProcess = c("center","scale"))

Looking at the output of the k-NN model, we can see that it chose k = 9:

pred_caret
## k-Nearest Neighbors 
## 
## 8000 samples
##   35 predictor
##    2 classes: 'FALSE', 'TRUE' 
## 
## Pre-processing: centered (35), scaled (35) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 8000, 8000, 8000, 8000, 8000, 8000, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6842963  0.3685325
##   7  0.6912754  0.3824386
##   9  0.6956270  0.3910731
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

We can visualize the accuracy for different number of neighbors:

plot(pred_caret)

Next, we compare our predicted values of the affordability of the price to our actual values. The confusion matrix gives an indication of how well our model predicted the actual values.

The confusion matrix output also shows overall model statistics:

knnPredict <- predict(pred_caret, newdata = class_pred_test) 
outcome_test$outcome_test<- factor(outcome_test$outcome_test)

confusionMatrix(knnPredict, outcome_test$outcome_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE   771  352
##      TRUE    234  643
##                                           
##                Accuracy : 0.707           
##                  95% CI : (0.6865, 0.7269)
##     No Information Rate : 0.5025          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4136          
##                                           
##  Mcnemar's Test P-Value : 1.343e-06       
##                                           
##             Sensitivity : 0.7672          
##             Specificity : 0.6462          
##          Pos Pred Value : 0.6866          
##          Neg Pred Value : 0.7332          
##              Prevalence : 0.5025          
##          Detection Rate : 0.3855          
##    Detection Prevalence : 0.5615          
##       Balanced Accuracy : 0.7067          
##                                           
##        'Positive' Class : FALSE           
## 

The model did not perform very well, it only successfully classified 71% of the cases correctly. The success of the model can also be evaluated with a variety of other metrics (sensitivity, specificity, etc.) included here.

Classification Summary:

To summarize, we utilized the caret package to perform k-NN classification, predicting the affordability of a flight. This model may not have yielded the accuracy of the logistic model for a number of reasons. The majority of our predictor variables were dummy-coded categorical variables, which are not necessarily the most suited for binary k-NN classification problems.

Summary:

Based on various measures and results, we have seen that when predicting a quantitative response, the Random Forests model performed better than the Linear Regression model. According to all of the measures (accuracy, recall, precision) in the classification task, the Logistic Regression has performed better than the k-NN model.

What we may conclude from this research, is that customers from India (and not only), should consider purchasing a ticket as early as possible; choose a flight that operates late at night, and look for airline offers other than those provided by ‘Air India’ or ‘Vistara’.