# Glossary

###### Akaike's Information Criterion (AIC)

The criterion proposed by Akaike for comparing models, which adjusts the -2 log-likelihood for the number of terms in the model. The model with the lower AIC is considered better.

AIC can be used to compare both nested and non-nested models.
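
As a rough illustration of the adjustment described above, the usual textbook form is AIC = -2 log-likelihood + 2k, where k is the number of estimated parameters. The sketch below assumes that form; the function name and the numeric values are hypothetical and not taken from the macros on this website.

```python
def aic(neg2_log_lik: float, n_params: int) -> float:
    """Textbook AIC: -2 log-likelihood plus 2 per estimated parameter."""
    return neg2_log_lik + 2 * n_params


# Hypothetical values: model B fits slightly better but uses two more terms.
aic_a = aic(neg2_log_lik=210.4, n_params=3)
aic_b = aic(neg2_log_lik=208.9, n_params=5)

# The model with the lower AIC is preferred.
preferred = "A" if aic_a < aic_b else "B"
```

Here the small gain in fit for model B does not offset the penalty for its two extra terms, so model A would be preferred.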

###### Bar chart

A chart that displays categorical data using one bar for each category with height of the bar being proportional to the frequency or relative frequency of the category. Sometimes the height of the bar is kept proportional to the mean or some other summary statistic of a quantitative variable.

###### Bayesian Information Criterion or Schwarz Criterion (BIC or SC)

A model selection criterion similar to AIC that uses a different adjustment of the -2 log-likelihood statistic for the number of terms in the model. As with AIC, a lower value of SC is considered better.
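
A minimal sketch of the usual textbook form, BIC = -2 log-likelihood + k ln(n), where k is the number of estimated parameters and n the number of observations. The function name is hypothetical, and individual software packages may count parameters or observations differently.

```python
import math


def bic(neg2_log_lik: float, n_params: int, n_obs: int) -> float:
    """Textbook BIC/SC: -2 log-likelihood plus ln(n) per estimated parameter."""
    return neg2_log_lik + n_params * math.log(n_obs)


# Hypothetical comparison on the same 200 observations.
bic_small = bic(neg2_log_lik=210.4, n_params=3, n_obs=200)
bic_large = bic(neg2_log_lik=208.9, n_params=5, n_obs=200)
```

Because ln(n) exceeds 2 once n is 8 or more, BIC penalises extra terms more heavily than AIC in most practical datasets.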

###### Binary variable

A categorical variable that has only two categories, such as sex (male/female) and disease status (yes/no).

###### Box-and-whisker plot or boxplot

A plot used to summarise quantitative data. The first and third quartiles form the boundaries of the box, and the median is displayed as a bar within the box. The whiskers extend from the first and third quartiles, usually to the most extreme value within 1.5 times the interquartile range of the box. Observations beyond the whiskers are shown as outliers.
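
The quantities behind a boxplot can be sketched with the Python standard library. This is an illustration only: the function name and data are hypothetical, and quartile conventions differ between software packages, so the values may not match a particular plotting tool exactly.

```python
import statistics


def boxplot_summary(values):
    """Return the quartiles, whisker ends and outliers used to draw a boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical data with one unusually large observation.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

With these data the value 30 falls beyond the upper fence and would be drawn as an outlier, while the upper whisker ends at 9.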

###### Clustered data

Observations that are not independent but related. For example, cows in a herd or children in a family are more likely to be similar to each other and therefore statistical methods to account for clustering should be used to analyse such data.

###### Coefficient of determination (R²)

The proportion of the total variation in the outcome variable explained by the model. It is calculated by dividing the Model sum of squares by the Total sum of squares.
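
The calculation can be sketched as follows. The function name is hypothetical, and the sketch assumes the fitted values come from a least-squares model with an intercept, for which the Model and Residual sums of squares add up to the Total sum of squares.

```python
def r_squared(observed, fitted):
    """Model sum of squares divided by Total sum of squares."""
    mean_y = sum(observed) / len(observed)
    total_ss = sum((y - mean_y) ** 2 for y in observed)
    model_ss = sum((f - mean_y) ** 2 for f in fitted)
    return model_ss / total_ss


observed = [1.0, 2.0, 3.0, 4.0]
perfect_fit = r_squared(observed, fitted=[1.0, 2.0, 3.0, 4.0])
mean_only = r_squared(observed, fitted=[2.5, 2.5, 2.5, 2.5])
```

A model that reproduces every observation explains all of the variation, whereas a model that predicts only the mean explains none of it.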

###### 95% Confidence intervals

A range of plausible values of the population parameter (e.g. regression coefficient, β) that are consistent with the observed sample statistic (e.g. parameter estimate, b). Technically speaking, if an experiment were repeated an infinite number of times, 95% of the resulting 95% confidence intervals would, on average, include the true population parameter.
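
The repeated-experiment interpretation can be illustrated with a small simulation. This is a sketch under simplifying assumptions: normally distributed data, a known standard deviation, and an interval for the mean of the form estimate ± 1.96 standard errors. The function name and default values are hypothetical.

```python
import math
import random


def ci_coverage(true_mean=10.0, sd=2.0, n=25, n_repeats=2000, seed=1):
    """Fraction of simulated 95% confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    se = sd / math.sqrt(n)  # standard error of the mean, sd assumed known
    covered = 0
    for _ in range(n_repeats):
        sample_mean = sum(rng.gauss(true_mean, sd) for _ in range(n)) / n
        lower = sample_mean - 1.96 * se
        upper = sample_mean + 1.96 * se
        if lower <= true_mean <= upper:
            covered += 1
    return covered / n_repeats


coverage = ci_coverage()
```

The returned fraction should be close to 0.95, although it varies slightly with the seed and the number of repeats.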

###### Confounder

A variable that distorts the association between another explanatory variable and the outcome. A confounder can either mask the association or make the effect look stronger than it really is. To be considered a confounder, a variable must be associated with both the explanatory variable and the outcome, and should not lie on the causal pathway from the explanatory variable to the outcome.

###### Contingency tables

A table of frequencies of a categorical variable classified based on another categorical variable.

The macros on this website create contingency tables between categorical explanatory variables and the categorical outcome to give you an idea about the association between them.
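
As a small illustration of what such a table contains, the sketch below cross-classifies two categorical variables with the Python standard library. It does not reproduce the website's macros; the function name and the data are hypothetical.

```python
from collections import Counter


def contingency_table(explanatory, outcome):
    """Count how often each (explanatory level, outcome level) pair occurs."""
    return Counter(zip(explanatory, outcome))


# Hypothetical records: breed of each animal and its disease status.
breed = ["Jersey", "Jersey", "Holstein", "Holstein", "Holstein", "Jersey"]
disease = ["yes", "no", "yes", "yes", "no", "no"]

table = contingency_table(breed, disease)
for b in sorted(set(breed)):
    row = [table[(b, d)] for d in ("yes", "no")]
    print(b, row)
```

Each cell holds the frequency of one combination of categories, which is the raw material for judging whether the two variables are associated.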

###### Continuous data

Continuous data is a type of quantitative data that can have any value (usually within a valid range). Given two values, however close together they are, a sensible middle value can be calculated.

For example, soil iron content, age, blood glucose level etc.

###### Descriptive analyses

The analyses undertaken to describe a variable, for example, calculation of frequencies and relative frequencies for categorical variables and summary statistics for quantitative variables. Descriptive analyses also include graphical presentation of data using bar charts, histograms, box-and-whisker plots, etc.

###### Discrete data

Quantitative data that can have only some values (usually integer numbers) within a range. It usually represents counts or numbers. For example, number of ticks, instances of mastitis per lactation, years since vaccination started in the flock, etc.

###### Explanatory variable

A variable on the right side of a regression equation that is used to estimate expected or predicted value of the outcome. This is also called an independent or predictor variable.

###### Exposure variable

A variable that is of prime interest in the study. This term is particularly used in an observational cohort study or a clinical trial where we aim to test the association of an exposure variable after adjusting for other variables and confounders.

For example, dry cow therapy will be an exposure or study variable if we are interested in its effect on mastitis.

###### Forward, backward and stepwise variable selection procedures

Procedures for selecting variables during the model building process.

Macros available on this website can use any of these three procedures to select variables during multivariable model building. During forward variable selection, the macro adds one variable at a time to the initial model, whereas during backward variable selection, the macro drops one variable at a time from the initial model. The stepwise procedure combines the two: at each step a variable can be added to or dropped from the model.

###### Goodness-of-fit tests

Goodness-of-fit examines the agreement between an observed set of values and another set of values which are derived under some theory or hypothesis.

###### Histogram

A histogram is a type of bar chart for quantitative data in which the bases of the bars represent the classes into which the data are binned (usually 5-15 classes) and the area of each bar is proportional to the frequency in the respective class. If all bars have equal width, the height of each bar is also proportional to the respective frequency. In contrast to bar charts for categorical data, a histogram has no spaces between bars.

###### Interaction terms

An interaction is present when the association of a variable with the outcome differs across the levels of another variable. In epidemiological research this is usually termed effect modification, and the other variable is called an effect modifier.

###### Likelihood-ratio chi-square test

A procedure to test the hypothesis of no association between a categorical outcome and one or more explanatory variables. The likelihood-ratio chi-square test statistic is the difference between the -2 log-likelihoods of the simpler and the more complex model, provided that the two models are nested.
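
The calculation can be sketched for the simplest case, in which the complex model has exactly one more parameter than the simpler one, so the statistic is compared with a chi-square distribution with 1 degree of freedom. The function name and input values are hypothetical; for more degrees of freedom a statistics library would be needed for the p-value.

```python
import math


def likelihood_ratio_test_1df(neg2ll_simple: float, neg2ll_complex: float):
    """Likelihood-ratio statistic and p-value for nested models differing by one term."""
    statistic = neg2ll_simple - neg2ll_complex
    if statistic < 0:
        raise ValueError("the complex model cannot fit worse than a model nested in it")
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical -2 log-likelihoods of two nested models.
statistic, p_value = likelihood_ratio_test_1df(103.841, 100.0)
```

A difference of about 3.84 in -2 log-likelihood corresponds to a p-value of about 0.05 with 1 degree of freedom.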

###### Linear regression

A procedure that tests associations between a quantitative outcome and one or more explanatory variables. The explanatory variables in linear regression can be both categorical and quantitative.

###### Logistic regression

A modified form of linear regression that tests the association of a categorical outcome (usually binary) with categorical or quantitative explanatory variables. The probability of the event is logit-transformed and used as the outcome.
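
The logit transformation mentioned above is the natural log of the odds, ln(p / (1 - p)). A minimal sketch, with hypothetical function names, of the transformation and its inverse:

```python
import math


def logit(p: float) -> float:
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(x: float) -> float:
    """Convert a value on the logit scale back to a probability."""
    return 1 / (1 + math.exp(-x))


log_odds = logit(0.2)
probability = inverse_logit(log_odds)
```

The transformation maps probabilities between 0 and 1 onto the whole real line, which is what allows a linear combination of explanatory variables to be used on the right side of the model.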

###### Model

Algorithms for computing outputs from inputs.

###### Multinomial data

Data with more than 2 categories, such as districts, states, breeds, species. Multinomial data can be either ordinal or nominal (see entries).

###### Multivariable models

Models constructed to assess the association of more than one explanatory variable with an outcome variable.

###### Nominal data

Qualitative data without any inherent order, such as breed, sex, and district.

###### Normal distribution

A continuous, bell-shaped probability distribution characterised by fixed areas under the curve. For example, 95% of the area under the curve lies within the mean ± 1.96 standard deviations and 99% within the mean ± 2.58 standard deviations.
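
These areas can be checked with `statistics.NormalDist` from the Python standard library. The helper name `area_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(z: float) -> float:
    """Area under the standard normal curve between -z and +z standard deviations."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


area_95 = area_within(1.96)
area_99 = area_within(2.58)
```

Both areas agree with the quoted 95% and 99% to within rounding of the multipliers 1.96 and 2.58.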

###### Odds ratios

The ratio of the odds of disease in the exposed to the odds of disease in the unexposed (in a cohort study); alternatively, the ratio of the odds of exposure in the diseased to the odds of exposure in the non-diseased (in a case-control study).
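
Both formulations give the same number for a 2×2 table, as the sketch below illustrates. The cell labels a, b, c, d, the function name and the counts are hypothetical.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio from a 2x2 table.

    a: exposed and diseased      b: exposed and non-diseased
    c: unexposed and diseased    d: unexposed and non-diseased
    """
    return (a * d) / (b * c)


a, b, c, d = 20, 80, 10, 90

# Odds of disease in the exposed relative to the unexposed.
disease_odds_ratio = (a / b) / (c / d)
# Odds of exposure in the diseased relative to the non-diseased.
exposure_odds_ratio = (a / c) / (b / d)
cross_product = odds_ratio(a, b, c, d)
```

All three expressions reduce to the cross-product ratio ad/bc, which is why the odds ratio can be estimated from either a cohort or a case-control design.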

###### Ordinal data

Categorical data having categories with some inherent order. For example, high, medium and low prevalence of a disease.

###### Outcome variable

The variable on the left side of a regression model that can be predicted by explanatory variables on the right side of the model. The outcome variable is usually represented by y and explanatory variables by x.

###### Outlier

In general, observations that are very different from other observations in the data are called outliers. In a box-and-whisker plot, an outlier is an observation that lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. In residual analysis, standardised residuals with absolute values greater than 3 are considered outliers.

###### Parameter estimates

A number that is calculated from the sample is called a statistic while a number that describes the population is called a parameter. We usually do not know exact values of the population parameters; rather we try to estimate them based on sample statistics. Thus any estimate of parameter obtained from a sample is called a parameter estimate.

###### Pearson correlation coefficient

The Pearson correlation coefficient indicates how closely two variables are linearly related. It can take values from -1, indicating a perfect negative linear association, to +1, indicating a perfect positive linear association.
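
A minimal sketch of the calculation, using the standard formula (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

The first pair lies exactly on an increasing straight line and the second on a decreasing one, so the coefficients sit at the two ends of the range.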

###### p-value

The probability of obtaining the observed results, or more extreme results, if the null hypothesis is true. Note that it is not the probability of the null hypothesis being true, although it is sometimes incorrectly interpreted this way.

###### Qualitative/ categorical data

Describes the qualitative property or characteristic of an animal or a group. For example, data about sex, breed, disease status or serological status of an animal.

###### Quantitative or numerical data

Data that have meaningful numerical values or represent some quantity, for example age, parity, number of cows, etc.

###### Random variable

A variable that can take any value from a given distribution.

###### Spearman rank correlation coefficient

A non-parametric equivalent to Pearson's correlation coefficient that measures the association (not necessarily linear) between two variables that may be ordinal.
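
One common way to compute it, sketched below, is to replace each value by its rank (tied values share the average rank) and then apply the Pearson formula to the ranks. The function names and data are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# A curved but strictly increasing relationship.
rho = spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
```

Because only the ordering matters, a curved but consistently increasing relationship still gives a coefficient of 1, whereas the Pearson coefficient for the same data would be below 1.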

###### Standard errors

Standard deviation of the sampling distribution of the parameter estimate. For the sample mean, it is calculated by dividing the population standard deviation by the square root of the sample size (in practice, the sample standard deviation is used as an estimate). This is why halving the standard error requires a fourfold increase in sample size.
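
The square-root relationship can be seen in a short sketch; the function name and values are hypothetical.

```python
import math


def standard_error_of_mean(sd: float, n: int) -> float:
    """Standard error of the sample mean: standard deviation / sqrt(sample size)."""
    return sd / math.sqrt(n)


se_25 = standard_error_of_mean(sd=10.0, n=25)
se_100 = standard_error_of_mean(sd=10.0, n=100)
```

Going from 25 to 100 observations, a fourfold increase, cuts the standard error in half.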

###### Study variable

See Exposure variable.

###### Test statistic

A statistic calculated from sample data used to evaluate a test of hypothesis. The choice of a test statistic will depend on the assumed probability model and the hypotheses under question.

###### Unconditional/ univariable association

Association of an explanatory variable with an outcome without adjusting for other variables or confounders. Evaluation of unconditional associations is important before conducting multivariable analyses.

###### Variable

A characteristic that varies from one subject to the other, e.g. breed, age, parity, etc.