Wednesday, 9 July 2014

When and how to use Generalized Linear Models

What's wrong with linear regression?
Most readers will be familiar with straightforward linear regression. For example, one might be interested in predicting the height of school children from their age. You can go out and measure a couple of hundred kids, plot the data and get something like the graph below. If you do a linear regression on these data you'll see that height can be predicted using the equation: height = 7.873*age + 59.8, and that about 86% of the variation in height is explained by age (adjusted r-squared = 0.8558).




The above example works because the relationship and the range of ages are such that you are unlikely to end up predicting negative heights. Clearly, if the relationship were such that predicting negative heights was likely, then something is amiss with your linear regression. This is one of the reasons why we end up using Generalized Linear Models.

Generalized Linear Models
I won't go into all the boring mathematical background of Generalized Linear Models (GLMs), and I won't even go into all the technical details of how link functions and error distributions work. You can get quite a detailed overview here. All I will do is give you a very quick overview of when and how to use them, giving examples in R.

Basically, you use a GLM rather than linear regression whenever your dependent (response) variable isn't a continuous range of non-integer values. Examples include count data, which are always positive integers (whole numbers); presence / absence data, which can only take on two values (often coded as 0 = absent and 1 = present); or size, which is always positive. Often, as in the example above, you can get away with using linear regression if your data are continuous but bounded in some way (e.g. restricted to positive values), but there are better ways of dealing with such data. One thing you'll notice is that almost all ecological data are not a continuous range of non-integer values, which is why GLMs are so useful.

GLMs allow you to specify an error distribution and a link function to cope with such data. The link function allows the response variable to be non-linearly related to your independent variables, while the error distribution allows your response variable to be non-normally distributed. I'm not talking about skewed data here, which can easily be transformed to normal, but the weird data mentioned above, like count data or presence / absence data, which can never be made normally distributed. To get an idea of how these work, I'll go through the age and height example above. To generate these data in R do the following:

age <- runif(200, 4, 12)                      # 200 ages between 4 and 12 years
hgt <- 7.873 * age + 59.8 + rnorm(200, 0, 8)  # heights with random normal scatter
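
To reproduce something like the plot above (a quick sketch, assuming the heights are in centimetres), you can do:

plot(age, hgt, xlab = "Age (years)", ylab = "Height (cm)")   # scatter plot of the simulated data
abline(lm(hgt ~ age))                                        # add the fitted regression line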

You can then do a normal linear regression as follows:

summary(lm(hgt ~ age))

which should give you an output something like that below (the exact numbers will differ as the age data and scatter are generated randomly).



You can actually get virtually the same results using a GLM by specifying family=gaussian:

summary(glm(hgt ~ age, family = gaussian))

This is the special case in which an identity link function (i.e. no transformation) is used and the error distribution is assumed to be normal, as with linear regression. You'll notice the R code is pretty similar to that for a simple linear regression, so it's easy enough to switch from one to the other without getting too overwhelmed. You'll notice the output does look slightly different, though. The coefficient estimates are the same, but it gives you an AIC value and the null and residual deviances instead of an R-squared value. Don't worry about this. The important thing to know is how to report your results when you write them up, and I'll give you an example below.





Firstly, there's quite a good guide to reporting biological results here, although I'd probably suggest some differences. You will notice that for a GLM you want to know the F value, which is also what's given when doing straightforward linear regression. You can get this by calling anova on the fitted model (which uses the anova.glm method):

m1 <- glm(hgt ~ age, family = gaussian)
anova(m1, test = "F")   # anova() dispatches to anova.glm for a fitted GLM

I would report the results like this: "The height of school children was strongly linearly related to age (F1,198 = 1182.2, P < 0.001)". Test statistics (e.g. F or r2) are italicised, but the P isn't. The two subscripts are the degrees of freedom (the first is basically the number of independent variables and the second the number of data points minus the number of independent variables minus one). The P value is the overall P value for the F statistic, not the P value for each term in the model. If you're a purist, you could also report the test: "The height of school children was strongly linearly related to age (GLM with identity link and gaussian error, F1,198 = 1182.2, P < 0.001)", and if using a normal regression you might also report the adjusted r-squared value (F1,198 = 1182.2, P < 0.001, r2 = 0.856). You may also choose to report the estimates and standard errors of the coefficients, particularly if you have lots of independent variables, but in this case a plot is probably more informative.
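
If you want to see exactly where those numbers come from, you can pull them out of the table returned by anova (a quick sketch using the column names of the F-test table):

fstats <- anova(m1, test = "F")
fstats$F[2]                              # the F statistic for the age term
c(fstats$Df[2], fstats$`Resid. Df`[2])   # the two degrees of freedom (1 and 198 here)
fstats$`Pr(>F)`[2]                       # the overall P value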


The usual families of error distributions and link functions
The following gives an overview of when to use what families.

Gaussian error distribution with an identity link function (family=gaussian)
Use this when you could just as well use linear regression.

Poisson error distribution with an log link function (family=poisson)
This works quite well with some count data. Basically, it assumes that your dependent variable is made up of positive integers. However, it also assumes that the variance in your data is approximately equal to the mean. Some examples of the Poisson distribution with different means are shown below. With most ecological count data the variance is actually greater than the mean (this is called over-dispersion) and you should really use either a quasi-Poisson or a negative binomial distribution. It is also often the case that you have many more zeros than would be typical of a Poisson distribution, in which case you might wish to use a zero-inflated distribution.
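
As a rough sketch (with a hypothetical count variable counts and predictor treeheight), fitting a Poisson model and checking for over-dispersion might look like this:

m_pois <- glm(counts ~ treeheight, family = poisson)
summary(m_pois)   # a residual deviance much larger than the residual degrees of
                  # freedom suggests over-dispersion
m_qp <- glm(counts ~ treeheight, family = quasipoisson)   # quasi-Poisson alternative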

Another thing worth mentioning is that it's often better to use a Poisson error distribution and log link function when dealing with density data, even though these data are not integers. The reason for this is that density is actually abundance per unit area, and abundance is a positive integer. You may, for example, wish to relate the density of a given bird species to mean tree height in patches of forest.

You can do this by exploiting a simple bit of maths: because the link is a log, log(A/B) = log(A) - log(B), so you can model abundance directly and specify the log of the area as an offset. For example:

m1 <- glm(abundance ~ treeheight, family = poisson, offset = log(woodlandarea))   # abundance = count per patch, woodlandarea = patch area

Binomial error distribution with a logit link function (family=binomial)
A binomial family is usually used for presence / absence data, where your response variable can be coded as zero or one. However, it can actually be used when your response variable takes on more than one category (e.g. absent, present in low numbers, and abundant). A GLM with a binomial error distribution and logit link function is often called a logistic regression. One minor complication is interpreting what the values predicted by the model mean, as these are non-integers ranging between zero and one. Basically, this type of model predicts, for example, the probability of occurrence for given values of your independent variable.
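
As a minimal sketch (with a hypothetical 0/1 variable presence and predictor treeheight):

m_bin <- glm(presence ~ treeheight, family = binomial)
predict(m_bin, type = "response")   # predicted probabilities of occurrence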

Gamma error distribution with inverse link function
The gamma distribution assumes a continuous range of numbers (i.e. your data need not be integers), but the values are all positive. This is actually what we should have used with the height-from-age model. A few examples of the gamma distribution are shown below.
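
As a rough sketch, refitting the height example with a gamma error distribution looks like this (note that R spells the family with a capital G; the inverse link is the default, and a log or identity link can be specified instead):

m_gamma <- glm(hgt ~ age, family = Gamma(link = "inverse"))
summary(m_gamma)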





In addition to these usual families of error distributions and link functions, there are a number of other useful error distributions. These are often slightly more difficult to implement in R and I will blog about them another time. It is also worth spending a bit of time running diagnostics on your data and carefully interpreting the results of your model. Again, I'll blog about this another time.



Getting Ordnance Survey grid references to work in ArcGIS

How the 100 km squares work
Ordnance Survey (OS) divides Great Britain into 100 km by 100 km squares, each with a two-letter code. The two-letter codes can be found printed in faint blue capitals on OS maps and are also given when using a GPS set to the British National Grid. The first letter, for example ‘S’, denotes a 500 km by 500 km square, and this is subdivided into 25 squares of 100 km by 100 km within it, making ‘ST’, ‘SU’, ‘SO’ and so on. There are four main first letters covering Great Britain: ‘S’, ‘T’, ‘N’ and ‘H’, plus an ‘O’ square covering a tiny part of North Yorkshire that is usually below the tide. A unique National Grid reference should have this two-letter descriptor followed by the grid reference numbers within that square. ArcGIS, however, can’t cope with the two-letter code, and instead divides Britain up into an x-y coordinate grid in metres (a so-called Cartesian coordinate system), starting in the bottom-left corner as shown below.



National Grid reference numbers
The numbers going across the map from left to right are called eastings, and go up in value eastwards, and the numbers going up the map from bottom to top are called northings, because they go up in a northward direction. In ArcGIS the eastings are the x-coordinates and northings the y-coordinates. Eastings are always given before northings, much as one gives x-coordinates before y-coordinates.

However, it is important to understand that grid references can be taken to varying precision (eastings and northings are invariably taken to the same precision). Ignoring the two-letter code, a two-figure grid reference (one figure being an easting and one a northing) has 10 km precision, a four-figure grid reference has 1 km precision, and so forth. Typically, a GPS gives a 10-figure grid reference, which has a precision of 1 m (although it is not uncommon for modern differential GPSs to give grid references with decimal places, allowing sub-metre precision).

There is a step-by-step guide to reading and interpreting OS grid references here, but I'll explain briefly. First decide on the precision you want, then work out the two-letter code, then take the eastings and then the northings. For example, look at the map below of the shores of Loch Fyne in Scotland. This actually encompasses four 100 km squares. The top-left is in square NM, the top-right in NN, the bottom-left in NR and the bottom-right in NS. The map is further divided into 1 km squares, with a numbering system that begins at zero, goes to 99 and repeats in every 100 km square.





If one wanted to work out the grid reference of the most easterly building in the hamlet of Cumlodden Cott, then one first decides on the precision. If to 1 km precision, then one simply needs the two-letter code followed by a four-figure grid number. The two-letter code is NS, the easting is quite close to the 01 line and the northing is closer to the 99 line than the 00 line, so the grid reference is NS 01 99. If you want to take it to 100 m precision, then divide (by eye or using a ruler) each 1 km square into 100 (10 x 10) little squares. The grid reference becomes NS 009 993. Often grid references are given without spaces, so they would become NS0199 (1 km precision) or NS009993 (100 m precision).

Converting National Grid references to numbers that can be used by ArcGIS

Here is a 3-step guide:
(1) Take the numbers of the grid reference and split them down the middle. The first half are the eastings and the second half are the northings. In the example above (NS009993) the eastings are 009 and the northings are 993.

(2) Work out the precision based on the total number of figures given, excluding the letters (2 figures = 10 km, 4 figures = 1 km, 6 figures = 100 m, 8 figures = 10 m, 10 figures = 1 m). NS009993 has six numbers, so the precision is 100 m. Then multiply the eastings and northings by the precision in metres. In the NS009993 example, the eastings (i.e. x-coordinates) become 00900 and the northings become 99300.

(3) Add the appropriate amount to take account of the two-letter code. In the NS009993 example, you can see from the very top figure that you need to add 200,000 to the eastings and 600,000 to the northings. In GIS-ready format the eastings (x-coordinates) become 200,900 and the northings (y-coordinates) become 699,300.
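
To give a flavour of how these three steps translate into R, here is a minimal sketch for a single grid reference (the function name and the small lookup table of 100 km squares are just for illustration; the downloadable code below is more complete):

osgb_to_xy <- function(gridref) {
  gridref <- gsub(" ", "", toupper(gridref))
  letters2 <- substr(gridref, 1, 2)                 # the two-letter 100 km square code
  digits <- substr(gridref, 3, nchar(gridref))
  n <- nchar(digits) / 2                            # figures per easting/northing
  precision <- 10^(5 - n)                           # 2 figures = 10 km, 6 figures = 100 m, etc.
  east <- as.numeric(substr(digits, 1, n)) * precision
  north <- as.numeric(substr(digits, n + 1, nchar(digits))) * precision
  # offsets (in metres) of the bottom-left corner of each 100 km square;
  # only the four squares from the Loch Fyne map are included here
  offsets <- list(NM = c(100000, 700000), NN = c(200000, 700000),
                  NR = c(100000, 600000), NS = c(200000, 600000))
  c(x = east + offsets[[letters2]][1], y = north + offsets[[letters2]][2])
}

osgb_to_xy("NS009993")   # gives x = 200900, y = 699300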

Often you may need to convert a whole bunch of grid references at once, so I've included some R code and example datasets that should do the job for you here. The link is to a single zip file that includes the code, example datasets, and a readme.txt file that gives instructions for use.

An introduction to this blog

So why am I creating this blog?
I’m a lecturer in the Environment and Sustainability Institute at the University of Exeter and could describe myself as a spatial ecologist who likes finding practical solutions to conservation problems. Every year I supervise about eight or so students. Over the years I’ve noticed that students encounter the same types of problems when analysing their data. This is partly by design: I tend to give students similar types of project, as past experience has taught me that these types of project work. This means that I get lots of requests for help, often concentrated at the same time of year. It also means that I end up explaining the same types of issues over and over again.

This blog has been created for several reasons. On a personal note, I hope it will cut down on the number of meetings I have, while at the same time doing a better job of explaining the issues and how to get round them. On a more general note, I think many academics can forget that our main legacy is not our research, but the students we teach, inspire and motivate. Unfortunately, not all of them are equipped with the skills most wanted in the workplace. This blog is designed to go some way towards redressing that balance by providing simple, easy-to-understand advice for dealing with commonly encountered problems when carrying out statistical analyses or using geographical information systems (GIS).



Student projects have ranged from work on chough (top) to rare plants like this pigmy rush (bottom)

How it works
Each blog post covers a separate issue and is tagged using keywords. I’ve also created a website to facilitate the organisation of the blog (you can see an index of all blog posts, ordered by topic, here). The website also hosts example datasets and R code, referred to in the blog posts, which I’m happy for students to borrow.

What types of analyses is it for?

The types of project I supervise tend to involve collecting presence / absence and/or abundance data on a species, often of some conservation concern, and relating its occurrence or abundance to environmental variables. Basically, the projects have the aim of working out why a species occurs where it does, what it “likes” in terms of habitat (or climate), and what conservation could do to benefit it. Examples include working out the microhabitat preferences of Marsh Fritillaries on the Lizard Peninsula, guiding habitat management for Chough across Cornwall, and working out whether habitat management can be used to offset the effects of climate change on some of Britain’s rarest plants, plus a bunch of other projects which I aim to host on the website in due course. The blog deals with (or will in due course) some of the commonly encountered problems when working with ecological data: My data have loads of zeros! When should I use a mixed model, and how? What is AIC and multi-model inference?