Linear Regression Model Building

Introduction

A major role of epidemiologic research is to identify and quantify associations between explanatory (independent) and outcome (dependent) variables. For example, you may be interested in determining the effect of:

  • management practices (explanatory) on disease status (outcome);
  • milk yield (explanatory) on mastitis incidence (outcome);
  • retention of placenta (explanatory) on reproductive performance (outcome).

Regression analysis is the preferred technique to evaluate these associations. Linear regression analysis is used when the outcome variable is numerical or quantitative and the association between outcome and explanatory variables is approximately linear.

Linear regression model

A linear regression model can be represented as:

y = α + β1x1 + β2x2 + β3x3 + ε


  • y is the quantitative outcome variable and x's are various explanatory variables that can be either quantitative or categorical.
  • β1 is the change in the expected value of y for a one-unit increase in x1, holding all other variables in the model constant; β2 is the corresponding change for a one-unit increase in x2 (and so on).
  • Positive and negative values of β indicate, respectively, an increase and a decrease in the expected value of y as the corresponding x increases, whereas a β of zero indicates no linear association between y and that x.
  • α is the expected value of y when all x's are equal to zero; it is usually not biologically meaningful.
  • ε is the random error, which is assumed to be normally distributed with mean zero and variance σ².
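The interpretation of α and the β's can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `expected_y` and the coefficient values are hypothetical and are not estimates from any real data set.

```python
def expected_y(alpha, betas, xs):
    """Expected value of y: alpha + beta1*x1 + beta2*x2 + ...

    The error term is left out because its mean is assumed to be zero.
    """
    return alpha + sum(b * x for b, x in zip(betas, xs))


# Hypothetical coefficients, chosen only for illustration.
alpha = 2.0
betas = [1.5, -0.8, 0.0]  # beta1 > 0, beta2 < 0, beta3 = 0

baseline = expected_y(alpha, betas, [1.0, 1.0, 1.0])

# Raise one x at a time by one unit while holding the others constant.
change_x1 = expected_y(alpha, betas, [2.0, 1.0, 1.0]) - baseline
change_x2 = expected_y(alpha, betas, [1.0, 2.0, 1.0]) - baseline
change_x3 = expected_y(alpha, betas, [1.0, 1.0, 2.0]) - baseline

# With every x equal to zero, the expected value of y is alpha.
at_zero = expected_y(alpha, betas, [0.0, 0.0, 0.0])

print(change_x1, change_x2, change_x3, at_zero)
```

Each change equals the matching β (up to floating-point rounding): the expected value of y rises with x1, falls with x2 and does not move with x3.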

Parameter estimates and their significance

F-test and adjusted R²

While the significance of an individual term can be tested using a t-test, an F-test is needed to evaluate the joint significance of all the terms in the model (null hypothesis: all β's are equal to zero; alternative hypothesis: at least one β is not equal to zero). This F-test is based on an analysis of variance (ANOVA) approach: the F-statistic is calculated by dividing the Regression mean square by the Residual mean square.
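The F-statistic can be worked through by hand on a small example. The sketch below assumes a model with an intercept fitted by least squares, so that the Regression mean square has k degrees of freedom and the Residual mean square has n − k − 1 (k explanatory variables, n observations). The data and the helper names `simple_ols` and `overall_f` are hypothetical, and a single explanatory variable is used only to keep the fit in closed form.

```python
def simple_ols(x, y):
    """Closed-form least-squares fit of y = alpha + beta * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def overall_f(y, fitted, n_predictors):
    """Return (F, df_regression, df_residual) from observed and fitted values."""
    n = len(y)
    mean_y = sum(y) / n
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    ss_residual = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
    df_regression = n_predictors
    df_residual = n - n_predictors - 1
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression / ms_residual, df_regression, df_residual


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha, beta = simple_ols(x, y)
fitted = [alpha + beta * xi for xi in x]
f_stat, df_reg, df_res = overall_f(y, fitted, n_predictors=1)

print(f"F = {f_stat:.1f} on {df_reg} and {df_res} degrees of freedom")
```

The resulting F-statistic is compared with an F distribution on the two degrees of freedom shown; obtaining the P-value needs a statistical table or library, which this sketch leaves out.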

We can calculate R², the proportion of variability in the outcome explained by the model, by dividing the Regression sum of squares by the Total sum of squares. Because R² never decreases when another variable is added to the model, adjusted R² is often a better indicator: it is estimated by dividing the Residual mean square by the Total mean square and then subtracting the result from 1, so it takes the number of explanatory variables into account. Adjusted R² is widely used in variable selection.
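Both quantities follow directly from the sums of squares, as the short sketch below shows. It assumes the fitted values come from a least-squares fit that includes an intercept (otherwise the Regression and Residual sums of squares do not add up to the Total). The observed and fitted values are hypothetical, and the function names are illustrative.

```python
def r_squared(y, fitted):
    """Regression sum of squares divided by Total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    return ss_regression / ss_total


def adjusted_r_squared(y, fitted, n_predictors):
    """1 - (Residual mean square / Total mean square)."""
    n = len(y)
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
    ms_residual = ss_residual / (n - n_predictors - 1)
    ms_total = ss_total / (n - 1)
    return 1 - ms_residual / ms_total


# Hypothetical observed values and the fitted values from a least-squares
# line through them (intercept 0.05, slope 1.99, x = 1..5).
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.00]

r2 = r_squared(y, fitted)
adj_r2 = adjusted_r_squared(y, fitted, n_predictors=1)

print(f"R-squared = {r2:.4f}, adjusted R-squared = {adj_r2:.4f}")
```

Adjusted R² is never larger than R² here because the Residual mean square uses fewer degrees of freedom than the Total mean square; the gap widens as explanatory variables are added without reducing the residual variation.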