For a current research project, I am developing an occupancy model for horseshoe crabs in Florida. We are working with data from FWC, which is awesome long-term random stratified sampling data, but since it isn't specifically targeted at horseshoe crabs (it is for fish inventories), we are dealing with highly zero-inflated data ‚Äì lots of non-detection. Hence, we can't in good conscience treat the data in its current form as a measure of abundance of the crabs. Cue occupancy modeling: this type of model considers two processes: occupancy and detectability. Occupancy refers to the presence or absence of a species during the period of sampling / season, and detectability refers to whether or not a species is detected. Thus, there is a possibility that a species is present but not detected! This method also allows you to incorporate variables that could potentially impact either occupancy or detectability.
I am really excited about this approach! I first heard about it at a 2016 seminar talk by Dr Bliznyuk (UF, see here for his arXiv paper), and have been itching to apply it ever since! And now I finally have a chance.
However, I am new to this type of modeling, and also have not done much binomial modeling in the past. As I was going through a book, the R code and its manuals, making little examples to clarify formulas, I realized this might also benefit others ‚Äì which is the reason I am blogging about this. I will post several installments as I progress from simple to more complex models. I will make R code available too.
First and foremost; all credit for theoretical aspects goes to the following publications:
- MacKenzie, D. I., Nichols, J. D., Lachman, G. B., Droege, S., Royle, A., & Langtimm, C. A. (2002). Estimating site occupancy rates when detection probabilities are less than one. Ecology, 83(8), 2248‚Äì2255.
- MacKenzie, D. I., Nichols, J. D., Royle, A., Pollock, K.H., Bailey, L.L. & Hines, J.E. (2005). Occupancy Estimation and Modeling: Inferring Patterns and Dynamics of Species Occurrence. Elsevier.
Especially the latter, the book, has been incredibly helpful.
For coding, I am using the package ‚Äúunmarked‚Äù in R:
Fiske, I., & Chandler, R. (2011). unmarked: An RPackage for Fitting Hierarchical Models of Wildlife Occurrence and Abundance. Journal of Statistical Software, 43(10), 1‚Äì23. http://doi.org/10.18637/jss.v043.i10
Let‚Äôs get started!
First a refresher: the binomial process shows the probability of either success (1) or failure (0).
The normal binomial distribution formula:
with n=number of trials, x=number of successes, p=probability of success and b=binomial probability. The first part, the binomial coefficient expresses how many ways k successes can be distributed over n trials. We then multiply with the probability of success (p), then the probability of failure (p-1).
The expected mean is and the variance is for a normal distribution.
So if for 29 trials (n), you get 13 successes (x), and you know the probability of success is 0.5 this formula tells you that the probability of getting this result (13 out of 29 is a success) is 0.13. There is a 0.14 probability of getting 14 successes. There is only a 0.02 probability of getting 20 successes. These values are the area under the probability distribution plot (pdf). So if you know p, you can calculate b, and vice versa.
For occupancy modeling, we use the following terminology:
- The 'area' is the larger spatial area for which we want to draw an inference;
- It is divided into sampling units (S) ‚Äì which could be grid cells or discrete units (like a pond);
- We select s units ‚Äì sites ‚Äì as random samples: the assumption is that S is very large in comparison to the sites s, and we are trying to say something about occupancy of S (not s);
- Surveys are done repeatedly at each site, K times, to record presence (1) or absence (0) of a species, creating a detection history (hi).
There are a few assumptions we need to take into account:
1) We assume occupancy for the period of analysis to be 'closed', i.e. it does not change, a species occupies a site or it does not. In practice this often means you have to develop a model for a 'season', which can differ in length for different species.
2) The probability of occupancy is the same for all sites;
3) The probability of detection is the same for all sites;
4) Detections in each survey are independent; and
5) Detection histories are independent.
Note: some of these assumptions are/can be modified with more complex modeling, but for now we are starting with simple situations!
The starting point for our first simple occupancy model is that we assume there is a common probability of sites being occupied by a species, e.g. 0.7. With formula (with species detected perfectly), this would result in 70 detections (x) at 100 sites (s), for instance. Note that this is the same formula as in the refresher earlier, just with different symbols (p= and n=s).
But then we realize, detection is not perfect! If the probability of detecting a species at an occupied site is p, after k surveys the probability of detecting at least once is p*=1-(1-p)k. The second part (1-p)k is the probability that the species is not detected at all after k surveys.
High p and high k: p*=1-(1-0.7)30=1
High p and low k: p*=1-(1-0.7)3=0.973
Low p and high k: p*=1-(1-0.3)30=1
Low p and low k: p*=1-(1-0.3)3=0.66 (i.e. the probability 1-0.66 that species go undetected in all surveys)
This illustrates that with low probabilities of detection, you need more trials to detect at least once (obviously).
Thus the probability of the species being present and being detected becomes and thus the estimator for the proportion of sites occupied (if we know p) is
with sD = sites where species is detected. Similarly as above
With 30 surveys and 10 detections, if p=0.7 and k=30, p*=1 (see above), so should be close to the real number, e.g. 10/30. In this case, even if p is low, the number of k makes up for it, so this is also close to the 'real' probability.
But let's say k=3 and p=0.3 (so a low probability of detection, which makes p*=0.66, as above), and there is 1 detection, this makes (instead of 0.333).
Generally though, we do not know what the detection probability is - so occupancy modeling is based on figuring out what the detection probability (p) and the occupancy probability () is.
So, what are we calculating?
1) expresses the probability of getting this particular detection history (absence-presence-absence-presence) with being the probability that the species occupies the site and pj the probability of detecting the species during survey j: to repeat, p is the probability of success and p-1 is the probability of failure.
2) It gets a little bit more complex if the species is not detected at all since this does not necessarily imply absence. We add the possibility of non-detection to absence:
Here we add the probability that the species is present but undetected (probability of occupancy multiplied by the product of all absence probabilities) to the probability of non-occupancy
Then, if these detection histories are constructed for all sites, we assume they‚Äôre independent and use the model likelihood for the observed data: the likelihood of occupancy and detection given the available data, is the product of all site probabilities:
In words this is 'the likelihood that we observe these detection histories, given occupancy probability and detection probability p)'.
In a more extensive form, using the formulas defined earlier, this becomes
Essentially, the second part represents the sites where there was no detection, so where we need to take absence and non-detection into account. This part is raised to the number of sites without detection.
The first part is for where there is detection: sj is the number of sites where the species was detected at the jth survey. It again includes a calculation for success (presence) and for failure (absence). We first take the occupancy probability and raise it to the power of the number of sites where there was detection. We then multiply this with the product of the detection probabilities based on K surveys.
This started to look like alphabet soup to me too, so I made an example with known probabilities.
We make a detection survey with 4 sites (s, rows), 5 surveys (j=1 to K, columns), =0.7 and p=0.6. For:
Site 1 1 1 1 0 0
Site 2 0 0 0 0 0
Site 3 1 0 1 0 1
Site 4 1 1 0 0 0
sD=3 (number of sites where species was detected at least once
We start with second half of the detection histories (h1 through h5), the parts with p's in them:
survey j=1: =0.216
survey j=2: =0.144
survey j=3: =0.144
survey j=4: =0.7 (this one is easy because there is only one survey with zeros, so no product calculation needed)
survey j=5: =0.096
Then putting it all together gives
for which the log-likelihood is -4.141
Since this has become a longer post than expected, I will leave more detail on log-likelihood and Maximum Likelihood Estimation (MLE) for the next post. I hope to also include a bit of R code in the next post ‚Äì though admittedly I have found implementation a little bit challenging so far.
If you want to do some more reading on occupancy models, the USGS created a nice straightforward (short) document: https://fresc.usgs.gov/products/fs/fs2005-3096.pdf