Regression with Categorical Explanatory Variables
So far, we have used linear regression to evaluate the relationship between a numerical response variable and a numerical explanatory variable.
Now let us see how to fit a regression model when the response variable is numerical but the explanatory variable is categorical rather than numerical.
Consider an example: we want to predict the poverty percentage of a state (the response variable) from the region the state is in (the explanatory variable). Let us say the variable, called region west, is 0 if the state is in the east region and 1 if it is in the west region.
Let us say the fitted model has an intercept of 11.17 and a slope of 0.38. For an eastern state, we set x (the region variable) to 0. For a western state, we set x to 1. East is coded as 0 because we treat it as the reference level.
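As a minimal sketch of how this model produces predictions, the snippet below plugs the 0/1 value into the fitted line. The function name predict_poverty and the lowercase region labels are hypothetical choices made for illustration; only the intercept 11.17 and the slope 0.38 come from the example.

```python
INTERCEPT = 11.17  # predicted average poverty percentage for the reference level (east)
SLOPE = 0.38       # predicted change when moving from east (x = 0) to west (x = 1)


def predict_poverty(region: str) -> float:
    """Return the predicted poverty percentage for 'east' or 'west'."""
    if region not in ("east", "west"):
        raise ValueError(f"unknown region: {region!r}")
    # Indicator variable: 1 for a western state, 0 for an eastern state.
    region_west = 1 if region == "west" else 0
    return INTERCEPT + SLOPE * region_west


print(round(predict_poverty("east"), 2))
print(round(predict_poverty("west"), 2))
```

For an eastern state the slope term drops out, so the prediction is just the intercept. For a western state the slope is added once.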
What do the slope and intercept tell us?
The intercept says that the model predicts an average poverty percentage of 11.17% in the east region. Remember, we calculated this by plugging in zero for the explanatory variable: the variable is called region west, and an eastern state is not in the west, so we plug in a zero for it.
We need this trick of plugging in a number because we cannot simply plug a level of a categorical variable into a mathematical equation and solve it. So we make do by labeling one level a success and the other a failure, and denoting failures with zeros and successes with ones.
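The coding step can be sketched in a few lines. The helper name encode_indicator and the sample labels are made up for illustration, and the sketch assumes a variable with exactly two levels, as in the east/west example.

```python
def encode_indicator(levels, reference):
    """Map each level to 0 if it is the reference level, otherwise 1."""
    return [0 if level == reference else 1 for level in levels]


# Made-up region labels, for illustration only.
regions = ["east", "west", "west", "east"]

# East is the reference level, so it becomes 0 and west becomes 1.
region_west = encode_indicator(regions, reference="east")
print(region_west)  # [0, 1, 1, 0]
```

Choosing the other level as the reference would flip the zeros and ones, which changes what the intercept and slope refer to but not the predictions themselves.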
The slope describes the relationship between the explanatory and the response variables. The model predicts that the average poverty percentage in western states is 0.38 percentage points higher than in eastern states, that is, 11.17 + 0.38 = 11.55%.
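To see why the intercept and slope can be read this way, here is a small sketch that fits a least-squares line to a 0/1 explanatory variable. The poverty values below are invented purely for illustration (they are not the data behind 11.17 and 0.38), and fit_line is a hypothetical helper that uses the standard simple linear regression formulas.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x for one explanatory variable."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Invented poverty percentages, NOT real data.
east = [10.0, 12.0, 11.0]
west = [11.5, 12.5, 12.0, 10.0]

# Code east as 0 (reference level) and west as 1.
x = [0] * len(east) + [1] * len(west)
y = east + west

intercept, slope = fit_line(x, y)

# The intercept matches the east average; the slope matches the west-minus-east difference.
print(round(intercept, 6), round(mean(east), 6))
print(round(slope, 6), round(mean(west) - mean(east), 6))
```

With a single 0/1 explanatory variable, the fitted intercept equals the average response in the reference group and the slope equals the difference between the two group averages, which is exactly the interpretation given above.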