### Grafen & Hails: Modern Statistics for the Life Sciences

# 03.09.04 Andrew Stoehr, Department of Biology, University of California-Riverside, USA

**Chapter 9, page 161**

**Q:** In your chapter on model assumptions you discuss homogeneity of variance and normality of error. Early in the chapter, you show diagrams (e.g. Figs. 9.1, 9.2) that suggest that what are important are the distributions of these things in each group or cell of an analysis. However, in some examples, for example Box 9.2, Fig. 9.6, you show one distribution of residuals for a data set that is in fact made up of data from two tree species. Shouldn't one check the assumptions for each treatment group (and combination), not for the entire pooled data set? In other words, shouldn't Fig. 9.6 show two distributions, one for each tree species, and shouldn't each of these be evaluated separately?

**A:** Andrew raises a very interesting question about how to aggregate datapoints when checking for Normality of error and homogeneity of variance. There are different possible views about what is appropriate in various circumstances. Let's review the logic. The assumption is about the error for each datapoint, and the error is the deviation of the y-value from the expected value based on the true values of the parameters and the model. For each datapoint, therefore, we would ideally like to know the true parameters and to have many, many repeats of the dataset, so we could check whether the error for that one datapoint comes from a Normal distribution, and then whether the variances of the error distributions for each datapoint are equal.

There are two very serious limitations here. First, we don't know the true parameters, and second, we have only one dataset and so only one example of each datapoint. (Except in some textbook exercises, of course.) In response to the first limitation, we use our estimates of the parameters, and use the known residuals in place of the unknown errors. This means that some properties of the aggregate of residuals would not hold for the errors. For example, the residuals will always sum to zero, while the errors would not sum to zero except by infinite coincidence; indeed the residuals for the datapoints sharing the same level of any categorical variable in the model will also sum to zero. So the residuals have some properties not possessed by the errors, and we must take care over how we look at the data in order not to be misled by this.
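The difference between errors and residuals can be seen in a small simulation. This is only a sketch: the data are invented, and the group sizes, coefficients and variable names are assumptions chosen for illustration, not values from the book. Because the data are simulated, the true errors are known here, which is never the case with real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two groups of 25 datapoints plus one continuous covariate.
group = np.repeat([0, 1], 25)
x = rng.uniform(0.0, 10.0, size=50)
true_errors = rng.normal(0.0, 1.0, size=50)  # known only because we simulate
y = 2.0 + 1.5 * group + 0.8 * x + true_errors

# Design matrix: intercept, group indicator, covariate.
X = np.column_stack([np.ones(50), group, x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

print("sum of true errors:        ", true_errors.sum())
print("sum of residuals:          ", residuals.sum())
print("sum of residuals, group 0: ", residuals[group == 0].sum())
print("sum of residuals, group 1: ", residuals[group == 1].sum())
```

The three residual sums are zero up to floating-point rounding, whereas the sum of the true errors is not.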

The second limitation is of major importance for answering Andrew's question. We have only one example of each datapoint, and so only one residual. We can't check Normality with one datapoint, and so we need to aggregate: we need to check a whole set of residuals together to ask if they look Normal, thus supplying by repetition over datapoints what in logic we should supply by repetition over datasets. Andrew's question can now be phrased as “How should we choose the subsets of datapoints over which to check Normality?”

It is important to see that the question has no fully principled answer -- we are definitely in a logically tentative position for the reasons already explained. Let's look at two kinds of situation. First, analysis of variance, so all the variables are categorical. With a one-way ANOVA, it is natural (as Andrew suggests) to consider looking at each level of the categorical variable separately. But even here there is a trade-off. If there are thirty datapoints, and ten levels, with three datapoints per level, then none of the ten tests will be very helpful. Testing for Normality is a tricky enough business with thirty datapoints, never mind three. So if there are enough datapoints at each level, then test separately, bearing in mind the danger of multiple tests.
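The danger of multiple tests can be made concrete with a little arithmetic. The sketch below assumes the tests are independent and each is run at the 5% level; tests on residuals from the same fitted model are not strictly independent, so the figures are approximate. The function names are hypothetical, and the Bonferroni correction is shown as one common strategy, not as the approach taken in the book.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / n_tests


for n_tests in (1, 2, 10):
    rate = familywise_error_rate(0.05, n_tests)
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:2d} tests: family-wise rate {rate:.4f}, "
          f"Bonferroni per-test level {threshold:.4f}")
```

With ten separate tests, the chance of at least one spurious rejection is about 40% even when every assumption holds.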

Now think of a one-way blocked ANOVA. One could test at each level of the experimental factor, and one could also test at each level of the blocking factor. The latter would be helpful if we thought that the blocks might differ in their variance. Perhaps one block is very uniform and another block is very variable. But the problems of multiplicity start to mount here. And if we want to look at combinations of experimental factor and block, then in a randomized complete block design, we're back down to one datapoint per combination again and can't do it.

So issues of sample size and multiplicity dominate decisions in ANOVA. If we have a multiple regression situation, then there are no natural divisions anyway. The natural thing to do here is to plot all the residuals at once. But we could divide between low values of an x-variable and high values.
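One way to divide a regression dataset is at the median of an x-variable, comparing the spread of the residuals in the two halves. The following is a minimal sketch on invented data in which the error standard deviation is made to grow with x; the sample size, coefficients and the choice of the median as the cut point are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical regression data whose error SD is proportional to x.
n = 200
x = rng.uniform(1.0, 10.0, size=n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)

# Ordinary least-squares straight line; polyfit returns the slope first.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Split at the median of x and compare residual variances.
low = x <= np.median(x)
var_low = residuals[low].var(ddof=1)
var_high = residuals[~low].var(ddof=1)

print("residual variance, low x: ", var_low)
print("residual variance, high x:", var_high)
```

Here the upper half shows the larger residual variance, which is the same pattern a plot of residuals against fitted values would reveal as a widening funnel.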

With all these choices, and the pragmatic knowledge that what we really need in principle is not possible, what is a sensible approach? There is no single right answer. But here are some suggestions:

**(1)** The question of where to look depends on what subdivisions we are most worried about. If we have reason to suspect that one set of datapoints has different error variance from another, then we should certainly look within those two sets. On the other hand, if we don't have a special reason, we must beware of problems of multiplicity -- if we test enough times for heterogeneity of variance, we will find it!

**(2)** The standard plot of fitted values against residuals takes care of the most common reason to doubt homogeneity of variance, that higher values have higher variance. Whether we have categorical or continuous variables, this is the most likely pattern to be concerned about. Hence this plot is recommended as a standard part of model criticism.

**(3)** If each group is Normal, but has a different variance, then this will come out as non-Normality in the whole dataset. It would be peculiar data if the assumptions were broken seriously, but the whole dataset's residuals looked Normal, and the plot of fitted values against residuals looked as it should.
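Point (3) can be checked with a short calculation. Pooling zero-mean Normal groups with different variances gives a mixture whose excess kurtosis is positive, that is, heavier tails than a single Normal distribution. The sketch below is idealised: it assumes each group's errors are exactly Normal with mean zero, and the function name and example standard deviations are hypothetical.

```python
def pooled_excess_kurtosis(sds, weights=None):
    """Excess kurtosis of a mixture of zero-mean Normal groups.

    sds: standard deviation of the error within each group.
    weights: proportion of datapoints in each group (equal if omitted).
    """
    if weights is None:
        weights = [1.0 / len(sds)] * len(sds)
    # Mixture variance is the weighted mean of the group variances.
    variance = sum(w * s ** 2 for w, s in zip(weights, sds))
    # The fourth moment of a zero-mean Normal with SD s is 3 * s**4.
    fourth_moment = sum(w * 3.0 * s ** 4 for w, s in zip(weights, sds))
    return fourth_moment / variance ** 2 - 3.0


print("equal variances:  ", pooled_excess_kurtosis([1.0, 1.0]))
print("unequal variances:", pooled_excess_kurtosis([1.0, 3.0]))
```

With equal variances the pooled residuals have zero excess kurtosis, as a Normal distribution does; with standard deviations of 1 and 3 the pooled distribution is clearly heavy-tailed.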

We hope this rather long reply clarifies our position in general terms.

In the example on page 161, it is a moot point. Testing with 25 datapoints each rather than 50 gives two chances for false positives, and has rather few datapoints in each test. We're happy with the test we've done, but quite accept that someone else might decide to test twice, provided they also had a strategy for multiplicity. We tend to prefer single tests in the book because we are emphasising methods that apply in the same way for categorical and continuous variables, where there wouldn't be separate groups.