A 5-Step Checklist for Multiple Linear Regression
Multiple regression analysis is an extension of simple linear regression. It’s useful for describing and making predictions based on linear relationships between predictor variables (i.e., independent variables) and a response variable (i.e., a dependent variable). Although multiple regression analysis is simpler than many other statistical modeling methods, there are still crucial steps that must be taken to ensure the validity of your results.
When using this checklist for multiple linear regression analysis, it’s critical to check that the model assumptions are not violated, to fix or minimize any violations you find, and to validate the predictive accuracy of your model. Since the internet provides so few plain-language explanations of this process, I decided to simplify things and walk you through the basic steps. Please keep in mind that this is a brief summary checklist of steps and considerations; an entire statistics book could probably be written about each of these steps alone. Use this as a basic roadmap, but please investigate the nuances of each step to avoid making errors. Google is your friend. Lastly, in all instances, use your common sense: if the results you see don’t make sense against what you know to be true, there is a problem that should not be ignored.
Before getting into any of the model investigations, inspect and prepare your data. Check it for errors, treat any missing values, and inspect outliers to determine their validity. After you’re comfortable that your data is correct, go ahead and proceed through the following five-step process.
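A minimal inspection sketch in base R appears below; the data frame name, file name, and column y are placeholders for your own data.

```r
# Load and inspect the data; "mydata.csv" and the column "y" are placeholders
mydata <- read.csv("mydata.csv")

summary(mydata)          # ranges and quartiles help flag data-entry errors
colSums(is.na(mydata))   # count of missing values per column

# Look at candidate outliers before deciding whether they are valid
boxplot(mydata$y, main = "Response variable")
```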
STEP 1. SELECTING YOUR VARIABLES
To pick the right variables, you’ve got to have a basic understanding of your dataset, enough to know that your data is relevant, high quality, and of adequate volume. As part of your model-building efforts, you’ll be working to select the best predictor variables for your model (i.e., the variables that have the most direct relationships with your chosen response variable). When selecting predictor variables, a good rule of thumb is to gather the maximum amount of information from the minimum number of variables, remembering that you’re working within the confines of a linear prediction equation.
The following two methods will be helpful in the variable selection process; an R sketch of both follows the list.
- Try out an automatic search procedure and let R decide which variables are best. Stepwise regression analysis is a quick way to do this. (Make sure to check the output and confirm that it makes sense.)
- Use all-possible-regressions to test all possible subsets of potential predictor variables. With the all-possible-regressions method, you pick the numerical criteria by which you’d like the models ranked. Popular numerical criteria are as follows:
- R2 – The set of variables with the highest R2 value is the best fit for the model.
- note: R2 values always fall between 0 and 1.
- Adjusted R2 – Sets of variables with larger adjusted R2 values are a better fit for the model.
- Cp – Mallows’ Cp; the smaller the Cp value, the less total mean squared error and the less regression bias there is. Values of Cp near the number of model parameters suggest little bias.
- PRESSp – The smaller the prediction sum of squares (PRESSp) value, the better the predictive ability of the model.
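Here is a sketch of both approaches in R, assuming a data frame mydata with a response y and candidate predictors x1 through x4 (all names are illustrative). Note that R’s built-in step() function ranks candidate models by AIC rather than the criteria above; the leaps package (assumed installed) reports several of the criteria directly.

```r
# Automatic search: stepwise selection with step(), which ranks models by AIC
full_model <- lm(y ~ x1 + x2 + x3 + x4, data = mydata)
step_model <- step(full_model, direction = "both")
summary(step_model)    # sanity-check the surviving predictors

# All-possible-regressions with the leaps package
library(leaps)
subsets <- regsubsets(y ~ x1 + x2 + x3 + x4, data = mydata, nbest = 1)
summary(subsets)$rsq     # R2 for the best model of each size
summary(subsets)$adjr2   # adjusted R2
summary(subsets)$cp      # Mallows' Cp
```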
STEP 2. REFINING YOUR MODEL
Check the utility of the model by examining the following criteria; an R sketch for reading them off a fitted model follows the list.
- Global F test: Test the significance of your predictor variables, as a group, for predicting your response variable.
- Adjusted R2: Check the proportion of sample variation in the response variable that is explained by the model, after adjusting for the sample size and the number of parameters. Adjusted R2 values indicate how well your prediction equation fits your data; larger values indicate a better fit.
- Root mean squared error (RMSE): The RMSE estimates the standard deviation of the random error. An interval of roughly ±2 RMSE around a prediction approximates the accuracy with which the model predicts the response variable.
- Coefficient of variation (CV): As a rule of thumb, a model with a CV of 10% or less is more likely to provide accurate predictions.
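All four quantities can be read off a fitted model in R; the sketch below continues with the illustrative step_model and mydata from Step 1.

```r
fit_summary <- summary(step_model)

fit_summary$fstatistic        # global F statistic and its degrees of freedom
pf(fit_summary$fstatistic[1], fit_summary$fstatistic[2],
   fit_summary$fstatistic[3], lower.tail = FALSE)   # p-value of the F test

fit_summary$adj.r.squared     # adjusted R2
fit_summary$sigma             # residual standard error, i.e. the RMSE estimate

# Coefficient of variation: RMSE as a percentage of the mean response
100 * fit_summary$sigma / mean(mydata$y)
```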
STEP 3. TESTING MODEL ASSUMPTIONS
Now it’s time to check that your data meets the seven assumptions of a linear regression model. If you want valid results from multiple regression analysis, these assumptions must be satisfied. An R sketch for checking several of them follows the list.
- You must have three or more variables (a response plus at least two predictors) that are of metric scale (interval or ratio variables) and can be measured on a continuous scale.
- Your data cannot have any major outliers, or data points that exhibit excessive influence on the rest of the dataset.
- Variable relationships exhibit (1) linearity – your response variable has a linear relationship with each of the predictor variables, and (2) additivity – the expected value of your response variable is based on the additive effects of the different predictor variables.
- Your data shows an independence of observations; in other words, there is no autocorrelation among the residuals.
- Your data demonstrates an absence of multicollinearity (i.e., your predictor variables are not highly correlated with one another).
- Your data is homoscedastic: the variance of the residuals is constant across the range of predicted values.
- Your residuals must be normally distributed.
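Several of these assumptions can be checked directly in R. The sketch below assumes the car and lmtest packages are installed and continues with the illustrative step_model.

```r
library(car)      # provides vif()
library(lmtest)   # provides dwtest() and bptest()

plot(step_model)    # residuals vs. fitted, Q-Q, scale-location, leverage plots

vif(step_model)     # variance inflation factors; values above roughly 5-10
                    # are a common warning sign of multicollinearity

dwtest(step_model)  # Durbin-Watson test for autocorrelated residuals
bptest(step_model)  # Breusch-Pagan test for heteroscedasticity

shapiro.test(residuals(step_model))   # normality of the residuals
```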
STEP 4. ADDRESSING POTENTIAL PROBLEMS WITH THE MODEL
Most of the time, at least one of the model assumptions will be violated. In these cases, if you’re careful, you may be able to fix or at least minimize the problems that conflict with the assumptions. An R sketch of the first two fixes follows the list.
- If your data is heteroscedastic, you can try transforming your response variable (a log or Box-Cox transformation is a common starting point).
- If your residuals are non-normal, you can either (1) check whether your data could be broken into subsets that share more similar statistical distributions, and on which you could build separate models, OR (2) check whether the problem is related to a few large outliers. If so, and if these are caused by a simple error or some sort of explainable, non-repeating event, then you may be able to remove these outliers to correct the non-normality in your residuals.
- If you are seeing correlation between your predictor variables, try removing one of them.
- If your model is generating errors due to the presence of missing values, try treating the missing values. You can also add dummy (indicator) variables that flag where values were missing.
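The first two fixes might look like the sketch below, which reuses the illustrative names from earlier steps; MASS ships with standard R distributions, and boxcox() requires a strictly positive response.

```r
library(MASS)   # provides boxcox()

# Heteroscedastic residuals: try a transformed response, e.g. a log transform
log_model <- lm(log(y) ~ x1 + x2 + x3, data = mydata)

# ...or let a Box-Cox profile suggest a power transformation (y must be > 0);
# the lambda at the peak of the curve is the suggested power
boxcox(lm(y ~ x1 + x2 + x3, data = mydata))

# Correlated predictors: refit with one of the offending variables removed
reduced_model <- lm(y ~ x1 + x3, data = mydata)
```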
STEP 5. VALIDATING YOUR MODEL
Now it’s time to find out whether the model you’ve chosen is valid. The following three methods will be helpful, and an R sketch of the split-sample approach follows the list.
- Check the predicted values by collecting new data and checking it against results that are predicted by your model.
- Check the results predicted by your model against your own common sense. If they clash, you’ve got a problem.
- Cross-validate your results by splitting your data into two randomly selected samples. Use one half of the data to estimate the model parameters, and use the other half to check the predictive results of your model.
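A minimal split-sample sketch in R, mirroring the 50/50 split described above; all object names are illustrative.

```r
set.seed(42)   # make the random split reproducible

train_idx <- sample(nrow(mydata), floor(nrow(mydata) / 2))
train <- mydata[train_idx, ]
test  <- mydata[-train_idx, ]

fit   <- lm(y ~ x1 + x2 + x3, data = train)   # estimate on one half
preds <- predict(fit, newdata = test)         # predict on the other half

sqrt(mean((test$y - preds)^2))   # out-of-sample RMSE
```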