Grafen & Hails: Modern Statistics for the Life Sciences
08.04.08 Claire Standby
Q: I am a complete beginner to GLM and a relative newcomer to SPSS as well. I was wondering whether you explicitly explain at any point the precise difference between the usages of the 'fixed factor' and 'covariate(s)' boxes within the univariate GLM window? As far as I could ascertain, the former is used for categorical data and the latter for continuous data - is this correct? Why are the results so different if a variable is placed in one rather than the other? Thank you very much for the help, the website is fantastically helpful.
A: Claire is right to be concerned about the difference between 'fixed factors' and 'covariates', as this is one of the fundamental distinctions in GLM, and is indeed a question of whether the x-variable is to be treated as categorical or continuous. The basics of an analysis with a categorical x-variable are the subject of Chapter 1, and the basics of an analysis with a continuous x-variable are the subject of Chapter 2. How both fit into the same overall framework is discussed in Chapter 3. The question of whether an x-variable should be treated as categorical or continuous is covered in Section 6.4, "Treating variables as continuous or categorical", while Section 10.5 shows how the distinction can be usefully blurred on occasions.
But it is useful here to pull out the main point of Claire's question. Why does it make such a difference whether we treat an x-variable as categorical or continuous? First, it certainly does! If we have a case in which no two datapoints share the same value of the x-variable, then treating it as categorical will produce very strange output -- packages vary in exactly how they explain that there is nothing useful they can say.
The explanation is simple and lies in the question being asked. With a categorical x-variable we are asking if there are differences between groups, whereas with a continuous x-variable we are asking if there is a linear relationship between Y and X. Let's consider this in a little more detail.
With a categorical x-variable, every distinct value is treated as completely separate, and all the x-variable does is to divide the dataset into subsets that share the same value of x. The analysis then asks whether these subsets might all come from the same population (in which case the x-variable does not affect y), or whether there is evidence that the subsets come from different populations (in which case the x-variable does predict y). There is no sense in which x=2 is closer to x=3 than x=1 is to x=10: all that matters is whether x is identical. No consistency of effect (e.g. a higher x gives a higher y) is looked for. So in the extreme case in which every datapoint has a different value of x, the dataset is divided into subsets all of size 1. And there is no way to estimate the variability of the population from which they're drawn, because we'd need at least two datapoints in one subset to do that.
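To make the size-1 subsets problem concrete, here is a minimal sketch in Python rather than SPSS. The function name `categorical_summary` and the numbers are made up for illustration, and this is not a description of how any particular package does the calculation. It splits y into subsets by identical x value and reports the degrees of freedom left over for estimating within-subset variability.

```python
from collections import defaultdict


def categorical_summary(x, y):
    """Treat x as categorical: only equality of x values matters."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)

    n, k = len(y), len(groups)
    grand_mean = sum(y) / n
    means = {level: sum(vals) / len(vals) for level, vals in groups.items()}

    ss_between = sum(len(groups[g]) * (means[g] - grand_mean) ** 2 for g in groups)
    ss_within = sum((yi - means[g]) ** 2 for g in groups for yi in groups[g])

    df_between, df_within = k - 1, n - k
    # The F-ratio needs a within-subset variance estimate, which requires
    # at least one subset with two or more datapoints (df_within > 0).
    if df_between > 0 and df_within > 0 and ss_within > 0:
        f_ratio = (ss_between / df_between) / (ss_within / df_within)
    else:
        f_ratio = None
    return {"df_between": df_between, "df_within": df_within, "f_ratio": f_ratio}


y = [2.0, 3.0, 5.0, 6.0, 4.0, 5.0]  # made-up responses, for illustration only

# Three x values, each shared by two datapoints.
replicated = categorical_summary([1, 1, 2, 2, 3, 3], y)
# Every datapoint has its own x value: six subsets of size 1.
all_distinct = categorical_summary([1, 2, 3, 4, 5, 6], y)

print(replicated)
print(all_distinct)
```

With replicated x values there are 3 degrees of freedom for the within-subset variability, so an F-ratio can be formed. With all x values distinct, none are left and no F-ratio is available, which is the situation packages report in their various ways.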
With a continuous x-variable, on the other hand, we are definitely looking for consistent effects. The difference in y between x=1 and x=3 must be exactly the same as the difference in y between x=8 and x=10 --- this just means we're looking for a linear relationship. So the actual values of the x-variable then matter a great deal.
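A small Python sketch may help show what "consistent effects" means here. It fits an ordinary least-squares line to made-up data (the helper names `fit_line` and `predict` are hypothetical, chosen only for this illustration) and checks that the fitted difference in y is the same for any two pairs of x values that are equally far apart.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x_value):
    """Fitted value of y at a given x."""
    return intercept + slope * x_value


# Made-up data in which no two datapoints share an x value.
x = [1, 3, 5, 8, 10]
y = [2.1, 5.9, 10.2, 15.8, 20.1]

intercept, slope = fit_line(x, y)

# Fitted change in y over two different intervals of equal width.
diff_low = predict(intercept, slope, 3) - predict(intercept, slope, 1)
diff_high = predict(intercept, slope, 10) - predict(intercept, slope, 8)

# A straight line uses two parameters, so n - 2 degrees of freedom
# remain for the residual even though every x value is distinct.
residual_df = len(x) - 2

print(slope, diff_low, diff_high, residual_df)
```

Both differences equal twice the slope, which is exactly the linearity assumption. Because the line only uses two parameters, there is still information left to estimate variability, even in a dataset where treating x as categorical would leave none.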
Two final niggles. First, it is just the x-variable that is continuous or categorical in this case - not "the data" as a whole. With a categorical y-variable, different methods are needed (see Chapters 13 and 14). Second, the 'fixed' in 'fixed factor' is a reference to 'fixed effects' as opposed to 'random effects'. This is a distinction not usefully made for beginners like Claire, but is covered in Chapter 12 when she feels ready for it!