Tutorial on Linear Regression Model Building
Ten points to consider before you begin
It is always good practice to document the model building process before you start analysing data. During model building there will invariably be situations where you must choose to go one way or the other, and if you have decided a priori which path to follow (and documented it), it makes life a lot easier. There will be different requirements for different situations, but in general you should make the following decisions before commencing analyses:
1. Which statistical procedure do you plan to use?
Often this will be decided up front based on whether the outcome is categorical or continuous and will be a part of study design. Sometimes, however, we have to make decisions as we go, especially if we detect clustering in the data or some other unexpected feature. In this context, we assume that you have already decided to conduct linear regression analysis.
2. What will be the univariable cut-off p-value to shortlist variables for multivariable analyses?
Most people use a liberal p-value of <0.25 or <0.2 but this value can change depending on your circumstances. While you do not want to shortlist any unnecessary variable, you also do not want to leave out an important variable. You might even want to test all variables, especially if the number of variables in your study is limited.
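As a minimal sketch of the shortlisting step, assume the univariable p-values have already been calculated and stored by variable name. The variable names, p-values and the `shortlist` helper below are hypothetical and serve only to illustrate the cut-off.

```python
# Hypothetical univariable p-values, one per candidate explanatory variable.
univariable_p = {
    "age": 0.003,
    "weight": 0.18,
    "breed": 0.31,
    "herd_size": 0.24,
}


def shortlist(p_values, cutoff=0.25):
    """Return the variables whose univariable p-value is below the cut-off."""
    return sorted(name for name, p in p_values.items() if p < cutoff)


# A liberal cut-off keeps more variables for the multivariable stage.
shortlisted = shortlist(univariable_p, cutoff=0.25)
print(shortlisted)
```

Recording the cut-off as an explicit argument makes it easy to document the value chosen a priori and to see how the shortlist would change under a stricter value such as 0.2.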
3. How will you detect multicollinearity amongst explanatory variables?
We use the Pearson correlation coefficient for continuous variables, Spearman rank correlation for ordinal variables and the chi-square test for nominal variables. Usually correlation is only a problem if it is very high (>0.9), but if the number of variables is large you might consider lowering the cut-off to 0.7 or 0.8 for excluding one of a pair of highly correlated/associated variables from multivariable analyses.
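The pairwise check for continuous variables could be sketched as follows. This covers only the Pearson case; the data, the variable names and the `correlated_pairs` helper are hypothetical, and the cut-off is passed in so that the value decided beforehand is explicit.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(data, cutoff=0.9):
    """List every pair of variables whose absolute correlation exceeds the cut-off."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(data.items(), 2):
        r = pearson(a, b)
        if abs(r) > cutoff:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical continuous explanatory variables.
data = {
    "height_cm": [150, 160, 165, 172, 180, 185],
    "weight_kg": [52, 58, 63, 70, 79, 84],
    "age_years": [34, 21, 45, 30, 52, 27],
}
flagged = correlated_pairs(data, cutoff=0.9)
print(flagged)
```

For each flagged pair, only one of the two variables would go forward to the multivariable analyses; which one to keep is a subject-matter decision that the code does not make.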
4. Which variable selection approach will you follow?
The choice of variable selection approach will depend partly on the number of variables in the study and partly on your preference. In general, a backward approach is considered superior but may not be possible to implement if the study has a large number of explanatory variables. Stepwise selection is also very powerful but a bit more time consuming. Forward selection might give similar results to the others but is considered a bit less powerful.
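To make the backward approach concrete, here is a minimal sketch that starts from the full model and repeatedly drops the variable whose removal most improves the model. It assumes adjusted R-squared as the selection criterion (a p-value criterion would follow the same loop), fits ordinary least squares through the normal equations, and uses hypothetical data in which `x3` is unrelated to the outcome. All function names are illustrative.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (no perfectly collinear variables).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def adjusted_r2(y, columns):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])  # number of coefficients, intercept included
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - k)


def backward_eliminate(y, data):
    """Drop one variable at a time while adjusted R-squared keeps improving."""
    selected = list(data)
    best = adjusted_r2(y, [data[v] for v in selected])
    while len(selected) > 1:
        # Score every model that has exactly one variable removed.
        trials = [
            (adjusted_r2(y, [data[v] for v in selected if v != drop]), drop)
            for drop in selected
        ]
        score, drop = max(trials)
        if score <= best:
            break  # no removal improves the model, so stop
        selected.remove(drop)
        best = score
    return selected, best


# Hypothetical data: y depends on x1 and x2; x3 is unrelated to y.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
    "x3": [2, 7, 1, 8, 2, 8, 1, 8, 2, 8],
}
y = [11.1, 6.8, 18.05, 11.15, 24.9, 39.2, 19.85, 34.05, 32.95, 29.1]
selected, best = backward_eliminate(y, data)
print(selected, round(best, 4))
```

The sketch also shows why the backward approach becomes impractical with many variables: every pass refits one model per remaining variable, and the first model must contain all of them.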
5. What will be the variable selection criterion and cut-off p-value?
You need to specify the variable selection criterion (p-value, adjusted R2, etc.) and, if you are using tests of hypothesis, the cut-off p-value. A p-value <0.05 is used as a cut-off in most studies, but you can make your criterion more stringent (such as 0.01 or 0.001) or less stringent (such as 0.1), according to the specific conditions of the study.
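If adjusted R2 is the chosen criterion, it helps to recall how it penalises extra variables. The sketch below uses the standard formula for a model with an intercept; the two models and their R2 values are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors terms."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical example: two candidate models fitted to the same 51 observations.
small_model = adjusted_r2(r2=0.50, n_obs=51, n_predictors=5)
large_model = adjusted_r2(r2=0.51, n_obs=51, n_predictors=10)
print(round(small_model, 4), round(large_model, 4))
```

In this example the larger model has the higher raw R2 but the lower adjusted R2, so under this criterion the five extra variables would not be retained.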
6. What are the potential confounders?
Ideally, thinking about potential confounders should start during the design phase of the study, well before the collection of data. However, before the analytical stage you need to decide which potential confounders will be forced into the model and which will be tested.
7. Will you test interactions?
This will depend on the objective of the study and whether the sample size is sufficient to allow testing of interactions.
8. At which stage and between which variables will you test interactions?
The interactions can be tested either before testing for significance of explanatory variables (ideal) or after you have selected variables (more practical). You also need to specify one of the four options for inclusion of variables in interaction terms.
9. What will be the cut-off p-value for selecting an interaction term?
Usually this is specified to be the same as for selecting a variable, but sometimes a more stringent criterion is used in evaluating interactions. This is because interactions are considered a nuisance and make the interpretation of the model difficult (though if there really is an interaction between two variables, the model without the interaction term will be inferior).
10. How will you test assumptions of the final model?
Although there are standard methods for testing model assumptions, it is good practice to think about which method/methods you will use to:
- test the assumption of linearity of the continuous variables in the model
- test the assumption of homoscedasticity
- test the assumption of normality of the residuals
- check outliers/leverage/influence of a particular observation/covariate pattern
- test overall fit of the model.
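As an illustration of the outlier/leverage/influence checks in the list above, the following sketch computes residuals, leverage, standardized residuals and Cook's distance for a regression with a single explanatory variable. The data are hypothetical, and the 2p/n leverage threshold is a common rule of thumb rather than a fixed requirement; in practice you would decide on your own thresholds in advance.

```python
from math import sqrt


def simple_regression_diagnostics(x, y):
    """Return residuals, leverage, standardized residuals and Cook's distance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual variance with n - 2 degrees of freedom (intercept and slope).
    s2 = sum(e ** 2 for e in residuals) / (n - 2)
    # Leverage of each observation in a one-variable model.
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    # Internally standardized residuals.
    standardized = [e / sqrt(s2 * (1 - h)) for e, h in zip(residuals, leverage)]
    # Cook's distance with p = 2 estimated coefficients.
    cooks = [r ** 2 * h / (2 * (1 - h)) for r, h in zip(standardized, leverage)]
    return residuals, leverage, standardized, cooks


# Hypothetical data: the last observation lies far from the other x values.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 35.0]
residuals, leverage, standardized, cooks = simple_regression_diagnostics(x, y)

# Rule-of-thumb flag: leverage above 2p/n, with p = 2 coefficients.
threshold = 2 * 2 / len(x)
high_leverage = [i for i, h in enumerate(leverage) if h > threshold]
print(high_leverage)
```

The same residuals can feed the other checks: plotting them against the fitted values helps assess linearity and homoscedasticity, and a normal probability plot of them helps assess normality.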
If you think about these 10 points before starting the model building process, modelling will not only be more logical and objective, it will also be easier to implement and a lot more fun! After building a model you will feel a sense of achievement that is very satisfying and highly motivating.