# Statistics

## Measures of central tendency

In the example of measuring the length of a frond (leaf) in a bracken fern population, measurements were made of the length of 100 bracken fronds. Ideally the 100 fronds represent a random sample of the bracken population. The data are presented as the frequency of measurements in a series of 10 cm size classes so that measurements are to the nearest 0.1 metre.

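| Size class (m) | No. of fronds in size class | Cumulative frequency |
|---|---|---|
| 0.7 | 4 | 4 |
| 0.8 | 7 | 11 |
| 0.9 | 6 | 17 |
| 1.0 | 8 | 25 |
| 1.1 | 16 | 41 |
| 1.2 | 18 | 59 |
| 1.3 | 20 | 79 |
| 1.4 | 12 | 91 |
| 1.5 | 5 | 96 |
| 1.6 | 4 | 100 |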

There is considerable natural variability in the length of bracken fronds. There are three different measures of central tendency we could report for these data:

The mode: the size class that had the highest frequency (the most popular size) = 1.3 m

The median: the middle value (half the measurements are above it and half below) = 1.2 m

The mean: the sum of all the readings divided by the number of readings = 1.18 m (117.7 m / 100), or 1.2 m to the nearest 0.1 m
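As a check on the three measures, the frequency table can be expanded into individual readings and passed to Python's standard `statistics` module. This is a minimal sketch: the variable names are illustrative, and each frond is assumed to have exactly the length of its size class.

```python
from statistics import mean, median, mode

# Frequency table from the text: size class (m) -> number of fronds
frequencies = {0.7: 4, 0.8: 7, 0.9: 6, 1.0: 8, 1.1: 16,
               1.2: 18, 1.3: 20, 1.4: 12, 1.5: 5, 1.6: 4}

# Expand the table into one reading per frond (100 values in total)
readings = [size for size, count in frequencies.items() for _ in range(count)]

frond_mode = mode(readings)      # most frequent size class
frond_median = median(readings)  # middle of the sorted readings
frond_mean = mean(readings)      # sum of readings / number of readings

print(f"n = {len(readings)}")
print(f"mode = {frond_mode} m, median = {frond_median} m, mean = {frond_mean:.2f} m")
```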

Another method that is used in conjunction with the median is to divide the ordered measurements into groups, each containing a fixed proportion of the readings. The median divides the readings into two groups with half the measurements in each. Quartiles divide them into four groups with one quarter of the readings in each, deciles into ten groups with one tenth in each, and percentiles into one hundred groups with 1% in each. The values of the readings in each of these groups give you some idea of how spread out the readings are.
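For grouped data such as the frond table, one simple way to locate these dividing points is to read them off the cumulative frequency column. The sketch below uses a hypothetical helper, `class_at_percent`, which returns the first size class whose cumulative frequency reaches the requested percentage of the readings. This is only one of several conventions, and other methods can give slightly different values.

```python
from itertools import accumulate

size_classes = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
counts = [4, 7, 6, 8, 16, 18, 20, 12, 5, 4]

# Running total of the counts, i.e. the cumulative frequency column
cumulative = list(accumulate(counts))
n = cumulative[-1]

def class_at_percent(percent):
    """Return the first size class whose cumulative frequency reaches percent of n."""
    target = percent * n / 100
    for size, running_total in zip(size_classes, cumulative):
        if running_total >= target:
            return size
    return size_classes[-1]

lower_quartile = class_at_percent(25)
median_class = class_at_percent(50)
upper_quartile = class_at_percent(75)

print(f"lower quartile = {lower_quartile} m")
print(f"median = {median_class} m")
print(f"upper quartile = {upper_quartile} m")
```

The distance between the lower and upper quartiles gives a rough idea of how widely the middle half of the readings is spread.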

The mean value is obtained by taking the sum of all measurements and dividing by the sample size (n).

e.g. For the following set of measurements (10, 20, 15, 35, 1, 30, 39, 20, 25, 5), the mean is

$$\bar{x} = \frac{\sum x_i}{n} = \frac{200}{10} = 20$$

A mean value of 20 would also be obtained for the measurements (18, 20, 19, 23, 19, 22, 21, 20, 21, 17). It is clear that while each set of numbers has the same mean value there is quite a difference in the way the numbers are spread around each mean value. This demonstrates that it is not sufficient to describe a set of measurements by providing just the mean value: it is essential to provide some description (or statistic) of the distribution of the individual values around the mean.
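The comparison can be reproduced in a few lines of Python. This sketch uses the two sets of measurements quoted above, with illustrative variable names, and reports the range (largest minus smallest value) as a first, crude indication of spread.

```python
from statistics import mean

first_set = [10, 20, 15, 35, 1, 30, 39, 20, 25, 5]
second_set = [18, 20, 19, 23, 19, 22, 21, 20, 21, 17]

first_mean = mean(first_set)
second_mean = mean(second_set)

# Range = largest value minus smallest value
first_range = max(first_set) - min(first_set)
second_range = max(second_set) - min(second_set)

print(f"first set:  mean = {first_mean}, range = {first_range}")
print(f"second set: mean = {second_mean}, range = {second_range}")
```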

## Measures of variability

A number of statistics have been derived to describe the spread or variation about the mean value. The most commonly used are the variance, the standard deviation and standard error.

It is possible to calculate how much each measurement varies from the mean value: these values are referred to as deviations. A simple sum of the deviations would not give a true picture, because some are positive and some are negative; in fact, deviations from the mean always sum to zero. To overcome this, each deviation is squared before the values are summed, and to relate the total to the individual variation it is divided by n − 1 (see the note below). The result is called the variance ($s^2$) (or, more correctly, the sample variance):

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

where $x_i$ are the individual measurements, $\bar{x}$ is their mean and $n$ is the sample size.

The standard deviation (s) is the square root of the variance.
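The steps described above can be followed one at a time in code. This is a minimal sketch using the first example set of measurements, with illustrative variable names.

```python
from math import sqrt

# First example set of measurements from the text
measurements = [10, 20, 15, 35, 1, 30, 39, 20, 25, 5]

n = len(measurements)
sample_mean = sum(measurements) / n

# Deviation of each measurement from the mean (some positive, some negative)
deviations = [x - sample_mean for x in measurements]
sum_of_deviations = sum(deviations)

# Square each deviation so that negatives no longer cancel positives
sum_of_squares = sum(d ** 2 for d in deviations)

# Sample variance divides by n - 1; standard deviation is its square root
variance = sum_of_squares / (n - 1)
standard_deviation = sqrt(variance)

print(f"sum of deviations = {sum_of_deviations}")
print(f"sum of squared deviations = {sum_of_squares}")
print(f"variance = {variance:.1f}, standard deviation = {standard_deviation:.2f}")
```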

N.B. The term n − 1 is used as the divisor to provide an unbiased estimate for the population from the sample available. The values we calculate in most experimental situations should more accurately be called sample means, sample variances, etc. In cases where the complete population has been measured, it is appropriate to divide by n rather than n − 1. Note that many calculators provide you with this choice when calculating standard deviations. You should use n − 1 in all cases, except for the extremely rare case where you have data from an entire population.
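Software offers the same choice of divisor as a calculator. In Python's standard `statistics` module, `stdev` divides by n − 1 and `pstdev` divides by n. The sketch below applies both to the first example set of measurements, with illustrative variable names, to show the size of the difference for a sample of ten.

```python
from statistics import pstdev, stdev

sample = [10, 20, 15, 35, 1, 30, 39, 20, 25, 5]

sd_sample = stdev(sample)       # divisor n - 1: use for a sample
sd_population = pstdev(sample)  # divisor n: only for a complete population

print(f"n - 1 divisor: {sd_sample:.2f}")
print(f"n divisor:     {sd_population:.2f}")
```

The n − 1 version is always the larger of the two, and the gap narrows as the sample size grows.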

When you use a calculator to calculate standard deviation, it uses an equivalent formula based on accumulated sums and sums of squares. Equations for the two common statistics are:

$$s = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}}$$

$$\mathrm{se} = \frac{s}{\sqrt{n}}$$
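The accumulated-sums approach can be sketched as follows. Each value is entered once, only three running totals are kept (the count, the sum and the sum of squares), and the standard deviation is computed from those totals at the end. The function name `running_sums_sd` is hypothetical, and this illustrates the general idea rather than how any particular calculator is built.

```python
from math import sqrt

def running_sums_sd(values):
    """Sample standard deviation from a count, a sum and a sum of squares."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:  # one pass, as when keying values into a calculator
        n += 1
        total += x
        total_sq += x * x
    variance = (total_sq - total ** 2 / n) / (n - 1)
    return n, total, total_sq, sqrt(variance)

n, total, total_sq, sd = running_sums_sd([10, 20, 15, 35, 1, 30, 39, 20, 25, 5])
print(f"n = {n}, sum = {total}, sum of squares = {total_sq}, sd = {sd:.2f}")
```

This form is convenient because the individual values need not be stored, although it can lose precision when the values are very large relative to their spread.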

Although the variance is a mathematically correct parameter, it has become the convention to use standard deviation. The standard error (se) is often useful for biological measures as it takes sample size into account.

Exercise: Calculate these two statistics for each of the sets of data in Table 1 in the Intermediate Skills Manual. Use the appropriate keys on your calculator.

You should find that the standard deviations are 12.6 and 1.8 respectively for columns C and D, and the standard errors are 4.0 and 0.6.

The correct way to fully describe each of the sets of measurements would be:

| | Length (mm) of beetles in population C | Length (mm) of beetles in population D |
|---|---|---|
| mean ± standard deviation | 20.0 ± 12.6 | 20.0 ± 1.83 |
| mean ± standard error | 20.0 ± 3.97 | 20.0 ± 0.58 |

Both expressions indicate how the individual values are distributed around the mean value. Standard deviation is the most commonly quoted statistic. However, where your purpose is to allow comparisons between mean values for populations or samples from those populations, the most appropriate statistic is the standard error of the (sample) mean, commonly referred to simply as standard error (se).
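The figures quoted for populations C and D can be reproduced with a short script. This sketch assumes that columns C and D are the two sets of ten measurements used earlier in this section (their means and standard deviations match the values quoted), and takes the standard error as the standard deviation divided by the square root of the sample size. The helper name `describe` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def describe(values):
    """Return (mean, standard deviation, standard error) for a sample."""
    sd = stdev(values)            # sample standard deviation, divisor n - 1
    se = sd / sqrt(len(values))   # standard error of the mean
    return mean(values), sd, se

# Assumed to correspond to columns C and D of the table
population_c = [10, 20, 15, 35, 1, 30, 39, 20, 25, 5]
population_d = [18, 20, 19, 23, 19, 22, 21, 20, 21, 17]

mean_c, sd_c, se_c = describe(population_c)
mean_d, sd_d, se_d = describe(population_d)

for name, m, sd, se in (("C", mean_c, sd_c, se_c), ("D", mean_d, sd_d, se_d)):
    print(f"population {name}: mean ± sd = {m:.1f} ± {sd:.2f}, "
          f"mean ± se = {m:.1f} ± {se:.2f}")
```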