Most readers will be familiar with straightforward linear regression. For example, one might be interested in predicting the height of school children from their age. You can go out and measure a couple of hundred kids, plot the data and get something like the graph below. If you do a linear regression on these data you'll see that height can be predicted using the equation: height = 7.873*age+59.8 and that about 86% of the variation in height is explained by age (Adjusted r-squared = 0.8558).
The above example works because the relationship and range of ages are such that you are unlikely to end up predicting negative heights. Clearly if the relationship was such that predicting negative heights was likely, then something is amiss with your linear regression. This is one of the reasons why we end up using Generalized Linear Models.
Generalized Linear Models
I won't go into all the boring mathematical background of Generalized Linear Models (GLMs), and I won't even go into all the technical details of how link functions and error distributions work. You can get quite a detailed overview here. All I will do is give you a very quick overview of when and how to use them, giving examples in R.
Basically you use a GLM rather than linear regression whenever your dependent (response) variable isn't a continuous range of non-integer values. Examples include count data, which are always non-negative integers (whole numbers), presence / absence data, which can only take on two values (often coded as 0 = absent or 1 = present), or size, which is always positive. Often, as in the example above, you get away with using linear regression if your data are continuous but bounded in some way (e.g. restricted to positive values), but there are better ways of dealing with such data. One thing you'll notice is that almost all ecological data are not a continuous range of non-integer values, which is why GLMs are so useful.
GLMs allow you to specify an error distribution and a link function to cope with such data. The link function allows the response variable to be non-linearly related to your independent variables, and the error distribution allows your response variable to be non-normally distributed. I'm not talking about skewed data here, which can often be transformed to normal, but the weird data mentioned above, like count data or presence / absence data, that can never be made normally distributed. To get an idea of how these work, I'll go through the age and height example above. To generate these data in R do the following:
age<-runif(200,4,12)
hgt<-7.873*age+59.8; r<-rnorm(200,0,8); hgt<-hgt+r
You can then do a normal linear regression as follows:
summary(lm(hgt~age))
which should give you an output something like that below (the exact numbers will differ as the age data and scatter are generated randomly).
You can actually get virtually the same results using a GLM by specifying family=gaussian:
summary(glm(hgt~age,family=gaussian))
This is the special case, in which an identity link function (i.e. no transform) is used and the error distribution is assumed to be normal as with linear regression. You'll notice the R code is pretty similar to that for a simple linear regression, so it's easy enough to switch from using one to the other without getting too overwhelmed. You'll notice the output does look slightly different though. The coefficient estimates are the same, but it gives you an AIC value and null and residual deviances instead of an R-squared value. Don't worry about this. The important thing to know is how you report your results when you write them up and I'll give you an example below.
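If you want to convince yourself that the two fits really are the same, you can compare them directly. This is just a minimal sketch: the set.seed call and the object names fit_lm and fit_glm are my own additions for illustration, not part of the example above.

```r
# Regenerate the example data (the seed is arbitrary, only for repeatability)
set.seed(1)
age <- runif(200, 4, 12)
hgt <- 7.873 * age + 59.8 + rnorm(200, 0, 8)

fit_lm  <- lm(hgt ~ age)
fit_glm <- glm(hgt ~ age, family = gaussian)

# Coefficient estimates agree to within numerical tolerance
all.equal(coef(fit_lm), coef(fit_glm))

# For a gaussian GLM the residual deviance is the residual sum of squares
all.equal(deviance(fit_glm), sum(residuals(fit_lm)^2))
```

Both comparisons should come back TRUE, which is a handy reminder that the deviance reported by the GLM is nothing mysterious in this special case.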
Firstly there's quite a good guide to reporting biological results here, although I'd probably suggest some differences. You will notice that for a GLM you want to know the F value, which is also what's given when doing straightforward linear regression. You can get this using the anova function, which uses the anova.glm method when you give it a GLM:
m1<-glm(hgt~age,family=gaussian)
anova(m1,test="F")
I would report the results like this: "The height of school children was strongly linearly related to age (F1,198 = 1182.2, P<0.001)". Test statistics (e.g. F or r2) are italicised, but the P isn't. The two subscripts are the degrees of freedom (the first basically the number of independent variables and the second the number of data points minus the number of independent variables minus one). The P value is the overall P value for the F statistic, not the P value for each term in the model. If you're a purist, you could also report the test: "The height of school children was strongly linearly related to age (GLM with identity link and gaussian error, F1,198 = 1182.2, P<0.001)" and if using a normal regression you might also report the adjusted r-squared value (F1,198 = 1182.2, P<0.001, r2 = 0.856). You may also choose to report the estimates and standard errors of the coefficients, particularly if you have lots of independent variables, but in this case a plot is probably more informative.
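Rather than copying numbers off the screen, you can pull the values you need for the write-up out of the analysis of deviance table. The sketch below assumes the simulated height data from above; the seed and the names tab, f_value, df_num, df_den and p_value are mine, and your F value will differ from the one quoted above because the data are random.

```r
set.seed(1)
age <- runif(200, 4, 12)
hgt <- 7.873 * age + 59.8 + rnorm(200, 0, 8)
m1 <- glm(hgt ~ age, family = gaussian)

# Analysis of deviance table with an F test
tab <- anova(m1, test = "F")

f_value <- tab["age", "F"]                       # F statistic for the age term
df_num  <- as.integer(tab["age", "Df"])          # numerator degrees of freedom
df_den  <- as.integer(tab["age", "Resid. Df"])   # denominator degrees of freedom
p_value <- tab["age", "Pr(>F)"]                  # P value for the F statistic

# Build the text that goes inside the brackets of the results sentence
sprintf("F%d,%d = %.1f", df_num, df_den, f_value)
```

With one independent variable and 200 data points the degrees of freedom come out as 1 and 198, matching the subscripts in the example sentence.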
The following gives an overview of when to use which family.
Gaussian error distribution with an identity link function (family=gaussian)
Use this when you could just as well use a linear regression.
Poisson error distribution with a log link function (family=poisson)
This works quite well with some count data. Basically it assumes that your dependent variable is made up of non-negative integers. However, it also assumes the variance in your data is approximately equal to the mean. Some examples of the Poisson distribution with different means are shown below. With most ecological count data the variance is actually greater than the mean (this is called over-dispersion) and you should really use either a quasi-Poisson or Negative Binomial distribution. It is also often the case that you have many more zeros than would be typical of a Poisson distribution, in which case you might wish to use a zero-inflated distribution.
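A quick way to check for over-dispersion is to fit the Poisson model and look at the dispersion estimate, which should be close to one if the Poisson assumption holds. The sketch below uses made-up data: the variable names x and counts, the coefficients and the choice of a negative binomial simulation are all my own assumptions, chosen just to produce counts whose variance exceeds the mean.

```r
set.seed(1)
x <- runif(200, 0, 2)
# Simulated over-dispersed counts (negative binomial, so variance > mean)
counts <- rnbinom(200, mu = exp(0.5 + x), size = 1)

m_pois <- glm(counts ~ x, family = poisson)

# Rough dispersion estimate: Pearson chi-squared divided by residual df
dispersion <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
dispersion

# Quasi-Poisson keeps the same coefficient estimates but scales the
# standard errors by the estimated dispersion
m_quasi <- glm(counts ~ x, family = quasipoisson)
summary(m_quasi)$dispersion
```

A value well above one suggests the Poisson standard errors are too optimistic. The quasi-Poisson family is the smallest change you can make; for a proper negative binomial fit, glm.nb in the MASS package is one option.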
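Count data are often collected from areas of different sizes, in which case what you really want to model is density (count per unit area) while still using a Poisson error distribution on the raw counts. You can do this by exploiting a simple bit of maths. Basically log(A/B) = log(A) - log(B), so with a log link, modelling count/area is the same as modelling the count with log(area) added to the model with its coefficient fixed at one. To do this in R, use the raw counts as the response and specify the log of area as an offset. For example:
m1<-glm(count~treeheight,family=poisson,offset=log(woodlandarea))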
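Here is a self-contained sketch of the offset idea with simulated data. The response is the raw count, the family is Poisson and the offset is the log of the area, which is what the log(A/B) identity requires. The variable names treeheight and woodlandarea follow the example above; count, the seed and the coefficients (-1 and 0.08) are invented for illustration.

```r
set.seed(1)
treeheight   <- runif(100, 5, 30)
woodlandarea <- runif(100, 1, 20)   # area of each woodland, arbitrary units

# Simulated counts whose expected value is area times density,
# where log(density) = -1 + 0.08 * treeheight
count <- rpois(100, woodlandarea * exp(-1 + 0.08 * treeheight))

# offset(log(area)) fixes the coefficient of log(area) at one,
# so the remaining coefficients describe log(density)
m_off <- glm(count ~ treeheight + offset(log(woodlandarea)), family = poisson)
coef(m_off)
```

The estimated slope for treeheight should land close to the 0.08 used in the simulation. Writing the offset inside the formula, as here, is equivalent to passing it through the offset argument of glm.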
Binomial error distribution with a logit link function (family=binomial)
A binomial family is usually used for presence / absence data, where your response variable can be coded as zero or one. It can also be used for proportions, such as the number of successes out of a known number of trials. Note that if your response takes on more than two categories (e.g. absent, present in low numbers and abundant), a binomial GLM in R will lump everything other than the first category together, so you would need a multinomial or ordinal model instead. A GLM with a binomial error distribution and logit link function is often called a logistic regression. One minor complication is interpreting what the values predicted by the model mean, as these are non-integers ranging between zero and one. Basically this type of model predicts e.g. the probability of occurrence for given values of your independent variable.
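To see what those predicted values look like, here is a minimal logistic regression sketch on simulated presence / absence data. Everything in it is an assumption for illustration: the variable names treeheight and present, the seed and the true coefficients (-4 and 0.25).

```r
set.seed(1)
treeheight <- runif(200, 5, 30)

# True probability of presence rises with tree height
p_true  <- plogis(-4 + 0.25 * treeheight)
present <- rbinom(200, size = 1, prob = p_true)   # 0 = absent, 1 = present

m_bin <- glm(present ~ treeheight, family = binomial)

# type = "response" back-transforms from the logit scale to probabilities
new_data <- data.frame(treeheight = c(10, 20, 30))
p_hat <- predict(m_bin, newdata = new_data, type = "response")
p_hat
```

The predictions are probabilities of occurrence, so they always fall between zero and one. If you leave out type = "response" you get values on the logit scale instead, which is a common source of confusion.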
Gamma error distribution with inverse link function (family=Gamma)
The gamma distribution assumes a continuous range of numbers (i.e. not all your data are integers), but the values are all positive. This is actually what we should have used with the height from age model. A few examples of the Gamma distribution are shown below.
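As a rough sketch of what that would look like, the height data can be refitted with a Gamma family. The seed and the object name m_gam are my own; note that family = Gamma (capital G) uses the inverse link by default in R, so the coefficients are on the 1/height scale and the sign of the age effect is reversed.

```r
set.seed(1)
age <- runif(200, 4, 12)
hgt <- 7.873 * age + 59.8 + rnorm(200, 0, 8)

# Gamma error distribution with the default inverse link
m_gam <- glm(hgt ~ age, family = Gamma)
coef(m_gam)

# Predictions back on the original height scale
pred <- predict(m_gam, newdata = data.frame(age = c(5, 10)), type = "response")
pred
```

Because the Gamma distribution only covers positive values, this model cannot treat a negative height as a plausible observation. If the inverse link feels awkward to interpret, Gamma(link = "log") and Gamma(link = "identity") are also available.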
In addition to these usual families of error distributions and link functions, there are also a number of other useful error distributions. These are often slightly more difficult to implement in R and I will blog about them another time. It is also worth spending a bit of time running diagnostics on your data and carefully interpreting the results of your model. Again, I'll blog about this another time.