# Introduction

This post addresses two questions: How should you add dummy variables to a regression? And how should you interpret the results?

These questions are easier to understand with an example, so let's create one by generating our own data. You may ask: why not use a real-world example? The reason is that when data is collected from the real world, it is hard (sometimes impossible) to know the true data-generating process. If we generate the data ourselves, we know exactly how it was produced. Of course, once the data is generated, we pretend we don't know the data-generating process and use regression techniques to recover the true relationships. (You could also ask a colleague to generate the data and send it to you without revealing how it was made.)

# Generate data

Let's go to our playground and generate the data. Say we want to study how salary is determined by education, experience, and whether someone holds a management position. Let's play God and set up some rules:

1. The world is big enough for all of us. Everyone starts at 40k per year. That is, as long as you are breathing, you get 40k per year.

2. Practice makes perfect. For each additional year of experience, salary increases by 5k.

3. The more you learn, the more you earn. The salary premium for high school, university, and PhD graduates is 0, 10k, and 20k, respectively.

4. Anyone can hold the helm when the sea is calm. Those holding a management position are paid 20k more.

5. Born to be great leaders. Those who only went to high school but hold a management position get an additional 30k.

6. S**t happens. Random factors will affect the salary with a mean of 0 and standard deviation of 5k.
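Taken together, these rules define the data-generating process (my notation, not from the original post; 1[·] is an indicator for group membership):

$$
\text{salary} = 40000 + 5000\cdot\text{exp} + 10000\cdot 1[\text{university}] + 20000\cdot 1[\text{phd}] + 20000\cdot 1[\text{mngt}] + 30000\cdot 1[\text{high school}]\cdot 1[\text{mngt}] + \varepsilon, \qquad \varepsilon \sim N(0,\,5000^2)
$$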

library(tidyverse)
library(lmtest)   # dwtest()
library(car)      # vif()

set.seed(1)
# generate data
set.edu <- c("high school", "university", "phd")
set.edu <- as.factor(set.edu)
# Careful: assigning to levels() renames levels in place rather than
# reordering them (here it swaps the "university" and "phd" labels).
# The simulation is unaffected because premiums are applied by label after
# sampling, but factor(x, levels = ...) is the idiomatic way to set the order.
levels(set.edu) <- c("high school", "university", "phd")

set.mngt<-c("yes","no")

n <- 5000
d <- data.frame(edu  = replicate(n, sample(set.edu, 1)),
                exp  = runif(n, min = 0, max = 10),
                mngt = replicate(n, sample(set.mngt, 1)),
                stringsAsFactors = TRUE)  # keep mngt a factor on R >= 4.0

# error term is normal
e<-rnorm(n,0,5000)
# error term is not normal
# e<-rbeta(n,shape1 = 1,shape2 = 5)*5000
# hist(e)

d$salary<-40000+5000*d$exp+e

# education premium
d <- d %>% mutate(salary = case_when(
  edu == "phd"         ~ salary + 20000,
  edu == "university"  ~ salary + 10000,
  edu == "high school" ~ salary + 0
))

# management premium
d <- d %>% mutate(salary = case_when(
  mngt == "yes" ~ salary + 20000,
  mngt == "no"  ~ salary + 0
))

# born-to-be-leader bonus
d <- d %>% mutate(salary = case_when(
  mngt == "yes" & edu == "high school" ~ salary + 30000,
  TRUE ~ salary
))


Here is the head and summary of the data.

head(d); summary(d)
##           edu       exp mngt    salary
## 1 high school 0.3549953  yes  87800.96
## 2         phd 3.5570365   no  80815.61
## 3         phd 2.4887294   no  67131.31
## 4  university 8.7918038  yes 119055.02
## 5 high school 3.1797888  yes 106786.99
## 6  university 3.2112050  yes  80901.15
##           edu            exp            mngt          salary
##  high school:1675   Min.   :0.001064   no :2499   Min.   : 31550
##  university :1685   1st Qu.:2.577482   yes:2501   1st Qu.: 73543
##  phd        :1640   Median :5.035378              Median : 90160
##                     Mean   :5.032461              Mean   : 90044
##                     3rd Qu.:7.520791              3rd Qu.:106648
##                     Max.   :9.989099              Max.   :146679

# Plot the data

Relationship between salary and education, for people with and without a management position:

p <- ggplot(d, aes(edu, salary)) +
  geom_boxplot() +
  geom_jitter(alpha = 0.25, color = "brown") +
  facet_wrap(~mngt)
p

Relationship between salary and experience, colored by education, for people with and without a management position:

p <- ggplot(d, aes(exp, salary, color = edu)) +
  geom_point(alpha = 0.3) +
  stat_smooth(method = "lm") +
  facet_wrap(~mngt)
p

# Use regression to uncover the true coefficients

## Ignore interaction between education and management

Let's pretend we don't know the true data-generating process and simply regress salary on education, experience, and management position. The results are:

md <- lm(salary ~ edu + exp + mngt, data = d)
summary(md)
##
## Call:
## lm(formula = salary ~ edu + exp + mngt, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -25836.8  -6701.8    -14.2   6501.9  25925.7
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   50093.69     320.36  156.37   <2e-16 ***
## eduuniversity -4770.52     296.96  -16.07   <2e-16 ***
## eduphd         5130.94     298.95   17.16   <2e-16 ***
## exp            4954.33      42.22  117.33   <2e-16 ***
## mngtyes       29873.56     243.42  122.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8605 on 4995 degrees of freedom
## Multiple R-squared:  0.8579, Adjusted R-squared:  0.8577
## F-statistic:  7537 on 4 and 4995 DF,  p-value: < 2.2e-16

Although the parameters are statistically significant, the estimates do not make sense. How could a university degree, compared to high school, reduce your salary by 4,770.52? Since we know how the data were generated, the coefficient for a university education should be about 10k.
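Why negative? With the interaction omitted, the university dummy is forced to a single value, roughly an average of its effect among non-managers (+10k) and among managers (10k - 30k = -20k, because managers with only high school get the 30k bonus that university graduates forgo). Since about half the sample is in each group, a back-of-the-envelope calculation (not the exact omitted-variable-bias formula) gives

$$
\hat{\beta}_{\text{university}} \approx \tfrac{1}{2}(10000) + \tfrac{1}{2}(10000 - 30000) = -5000,
$$

close to the estimated -4,770.52.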

The correct model must include an interaction between education and management position.

## Add interaction between education and management

Now, let’s add the interaction term between education and management and see what happens:

md <- lm(salary ~ edu + exp + mngt + edu * mngt, data = d)
summary(md)
##
## Call:
## lm(formula = salary ~ edu + exp + mngt + edu * mngt, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -21223.1  -3314.8    -56.7   3334.6  18612.4
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)            39923.21     209.40  190.65   <2e-16 ***
## eduuniversity          10070.22     239.26   42.09   <2e-16 ***
## eduphd                 20116.12     243.72   82.54   <2e-16 ***
## exp                     5008.52      24.21  206.86   <2e-16 ***
## mngtyes                49735.74     241.04  206.34   <2e-16 ***
## eduuniversity:mngtyes -29965.51     340.51  -88.00   <2e-16 ***
## eduphd:mngtyes        -29782.87     342.73  -86.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4932 on 4993 degrees of freedom
## Multiple R-squared:  0.9533, Adjusted R-squared:  0.9533
## F-statistic: 1.7e+04 on 6 and 4993 DF,  p-value: < 2.2e-16

# Interpretation of the results

The results now make sense.

• The intercept of 39,923.21 (close to 40k) is the guaranteed base salary.

• The base level for education is high school. Compared to high school, a university education increases salary by 10,070.22 (close to 10k) on average, and a Ph.D. increases it by 20,116.12 (close to 20k).

• One more year of experience increases salary by 5,008.52 (close to 5k).

• High school graduates holding a management position have a premium of 49,735.74 (close to 50k). These are the born-to-be leaders.

• Compared to high school graduates in management, the premium for university graduates in management is 29,965.51 lower, i.e. 19,770.23 (49,735.74 - 29,965.51, close to 20k).

• Likewise, the premium for Ph.D. graduates in management is 29,782.87 lower, i.e. 19,952.87 (49,735.74 - 29,782.87, close to 20k). Alternatively, you can say a management position generates a basic premium of 20k regardless of education level; on top of this 20k, high school graduates get another 30k, bringing their total premium to 50k.
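As a sanity check, the interpretation above can be reproduced directly from the true rules. A minimal sketch (the function name and structure are mine, not part of the original analysis):

```r
# Expected salary under the true rules (no random shock)
expected_salary <- function(edu, exp, mngt) {
  edu_premium <- c("high school" = 0, "university" = 10000, "phd" = 20000)
  40000 + 5000 * exp +                             # base pay + experience
    edu_premium[[edu]] +                           # education premium
    ifelse(mngt == "yes", 20000, 0) +              # management premium
    ifelse(mngt == "yes" & edu == "high school",   # born-to-be-leader bonus
           30000, 0)
}

# Management premium by education level, holding experience fixed:
expected_salary("high school", 5, "yes") - expected_salary("high school", 5, "no")  # 50000
expected_salary("university", 5, "yes")  - expected_salary("university", 5, "no")   # 20000
```

These match the fitted values: 49,735.74 for high school and 19,770.23 for university.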

# Test for violations of model assumptions

For our model to be valid, several assumptions must hold.

• Errors should follow a normal distribution

The normal Q-Q plot looks linear, so this assumption is satisfied.

par(mfrow = c(2,2))
plot(md)

• No autocorrelation

The D-W statistic is 2.0237, which is close to 2, and the p-value is large, so this assumption is also satisfied.

dwtest(md)
##
##  Durbin-Watson test
##
## data:  md
## DW = 2.0237, p-value = 0.7984
## alternative hypothesis: true autocorrelation is greater than 0

• No multicollinearity

Because edu and the interaction term have more than one degree of freedom, the adjusted values GVIF^(1/(2*Df)) are the ones to inspect; they are all well below the common threshold of sqrt(5) ≈ 2.24, so this assumption is satisfied. (The raw GVIF of the interaction term is high, but interaction terms are correlated with their main effects by construction.)

vif(md)
##              GVIF Df GVIF^(1/(2*Df))
## edu      4.008199  2        1.414938
## exp      1.001219  1        1.000609
## mngt     2.985803  1        1.727948
## edu:mngt 8.008466  2        1.682238

# Regression with subsets of the data

You can get the same results by running the model on subsets of the data: instead of using dummy variables for education, split the data by education level and run the regression on each subset.

With data only from high school graduates, you get this:

d.sub <- d %>% filter(edu == "high school")

md <- lm(salary ~ exp + mngt, data = d.sub)
summary(md)
##
## Call:
## lm(formula = salary ~ exp + mngt, data = d.sub)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -21223.7  -3336.5    -87.6   3293.5  17774.6
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39919.95     273.48   146.0   <2e-16 ***
## exp          5009.17      42.35   118.3   <2e-16 ***
## mngtyes     49735.84     242.22   205.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4955 on 1672 degrees of freedom
## Multiple R-squared:  0.9704, Adjusted R-squared:  0.9704
## F-statistic: 2.745e+04 on 2 and 1672 DF,  p-value: < 2.2e-16

With data only from university graduates, you get this:

d.sub <- d %>% filter(edu == "university")

md <- lm(salary ~ exp + mngt, data = d.sub)
summary(md)
##
## Call:
## lm(formula = salary ~ exp + mngt, data = d.sub)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15617.5  -3200.8    -65.7   3315.3  18649.9
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  49805.9      266.2  187.11   <2e-16 ***
## exp           5046.1       41.3  122.20   <2e-16 ***
## mngtyes      19761.1      241.5   81.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4952 on 1682 degrees of freedom
## Multiple R-squared:  0.9304, Adjusted R-squared:  0.9303
## F-statistic: 1.125e+04 on 2 and 1682 DF,  p-value: < 2.2e-16

With data only from Ph.D. graduates, you get this:

d.sub <- d %>% filter(edu == "phd")

md <- lm(salary ~ exp + mngt, data = d.sub)
summary(md)
##
## Call:
## lm(formula = salary ~ exp + mngt, data = d.sub)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -17231.0  -3396.4    -18.2   3378.1  15739.5
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60245.70     273.98  219.89   <2e-16 ***
## exp          4967.56      42.18  117.76   <2e-16 ***
## mngtyes     19952.33     241.48   82.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4888 on 1637 degrees of freedom
## Multiple R-squared:  0.9266, Adjusted R-squared:  0.9265
## F-statistic: 1.033e+04 on 2 and 1637 DF,  p-value: < 2.2e-16
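The subset results line up with the interaction model. Each intercept is the expected salary of a non-manager with zero experience at that education level, so differencing the intercepts recovers the education premiums. A quick check using the estimates printed above:

```r
# Education premiums implied by the subset intercepts
intercepts <- c(high_school = 39919.95, university = 49805.90, phd = 60245.70)
intercepts["university"] - intercepts["high_school"]  # close to the true 10k premium
intercepts["phd"] - intercepts["high_school"]         # close to the true 20k premium
```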

##### Max Shang

Research Associate, University of Guelph (Ridgetown Campus)