# [R] cumsum function to determine plankton phenology

Heather Anne Wright heather.wright at szn.it
Tue Feb 14 17:34:48 CET 2012

```Apologies for the empty email earlier!

I have species abundance data sampled at a weekly frequency or
sometimes monthly depending on the year.
The goal is to identify the dates in an annual cycle in which the
cumulative abundance of a species reaches some threshold.

Here's an example of the data for 1 species over an annual period:
"mc_pheno" is the object created from this data:

zoo_date sp
1/8/1988 13.2
1/20/1988 11.6
2/3/1988 8.7
2/17/1988 10.7
3/3/1988 2.5
3/21/1988 2.2
4/12/1988 0.6
5/2/1988 0.3
5/10/1988 0.6
5/25/1988 0.8
6/8/1988 0.8
6/28/1988 3.8
7/5/1988 11.2
7/20/1988 9.3
8/3/1988 31.1
8/17/1988 50.9
9/6/1988 225
9/15/1988 172.6
10/18/1988 752.1
11/3/1988 28
11/15/1988 32.8
11/30/1988 10.6
12/14/1988 2.6

I want to obtain the start, middle and end dates of cumulative
abundance (for those ecologically minded folks the reference is in a
Greve, et al., 2005 paper) to determine phenology of a species or
functional group. I previously applied this method in Excel and want
to replicate the formula in R using cumsum.

Easy right?
Excel method:
1. calculate cumulative sum
2. divide the cumulative sum for each time point by the annual
cumulative sum or maximum value
3. Calculate three different percentile thresholds corresponding to
the start (when a population reaches 15% of cumulative annual
abundance), the middle (50% of cumulative abundance) or end (85%).
This is done by taking an average between the original sampling dates
that most closely correspond to the threshold values. The function in
excel is INDEX and in R is which.min.

R method:
# convert data
zoo_date <- strptime(mc_pheno\$zoo_date,"%m/%d/%Y")

# check dates
str(zoo_date)

# prepare data frame
data<-na.omit(data.frame(date=zoo_date,sp=mc_pheno[,2])

# start the series
datayear<-data[format(data\$date,"%Y")==1984,]

# calculate cumulative sum
datayear\$cumsum<-cumsum(datayear\$sp)
# calculate cumsum standardised
datayear\$cumsum<-cumsum(datayear\$sp)/max(cumsum(datayear\$sp))

# take index of minimum value of cumsum
idq15<-which.min(mean(.15-datayear\$cumsum))
idq50<-which.min(mean(.50-datayear\$cumsum))
idq85<-which.min(mean(.85-datayear\$cumsum))

#take corresponding date
dateq15<-datayear\$date[idq15]
dateq50<-datayear\$date[idq50]
dateq85<-datayear\$date[idq85]

# output results in day in year form
print(format(c(dateq15,dateq50,dateq85),"%j")
# results should be similar to this:
[1] "2010-03-02 CET"  "2010-06-29 CEST" "2010-10-27 CEST"
with dates formatted as a day in year corresponding to each threshold

The problem lies in the comparison between methods between Excel and
R. Using the original method in R, the dates obtained are always
earlier than with Excel due to the which.min calculation:
idq15<-which.min(abs(.15-datayear\$cumsum))
idq50<-which.min(abs(.50-datayear\$cumsum))
idq85<-which.min(abs(.85-datayear\$cumsum))

If I change the formula to
idq15<-which.min(mean(.15-datayear\$cumsum))  how do I take an average
between the dates so each threshold is consistent between the results
I obtain in R and excel?
Adding the mean to this line instead of abs resulted in obtaining the
exact same output dates.

Happy to provide more code but this is long enough.