[R] Counting/processing a character vector

Wed Feb 18 14:37:06 CET 2009

Dear List,

I have a data set stored in the following format:

> head(dat, n = 10)
      id  sppcode abundance
1  10307 10000000         1
2  10307 16220602         2
3  10307 20000000         5
4  10307 20110000         2
5  10307 24000000         1
6  10307 40210000        83
7  10307 40210102        45
8  10307 45140000         1
9  10307 45630000         1
10 10307 45630600        41
> str(dat)
'data.frame':	111 obs. of  3 variables:
 $ id       : Factor w/ 3 levels "10307","10719",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sppcode  : chr  "10000000" "16220602" "20000000" "20110000" ...
 $ abundance: num  1 2 5 2 1 83 45 1 1 41 ...

that represent counts of species, recorded with a particular coding
system. The abundance column is not needed for this particular
operation, but is present in the data files.

I am interested in counting entries (rows) in the sppcode component of
dat. The sppcode takes a particular format: Order Family Genus Species,
with two alphanumeric characters allocated for each level of the hierarchy. I
want to know how many species there are in each site (the id factor),
but I should only count a higher level entry if there are no lower
levels present.

For example, for the above data excerpt (just the headed rows), I would
count the following rows:

10000000
16220602
20110000
24000000
40210102
45140000
45630600 == 7 "species" present.

To be more specific, I don't count 45630000 (row 9) because this 'id'
also has a sppcode with the same leading digits (45630600, row 10) in
which one of the next two pairs of digits is not 00.

In words, writing each code as WWXXYYZZ, I want to count all rows where
ZZ != 00, then rows where ZZ == 00 only if that WWXXYY combination has
not been counted yet, and likewise for the higher levels.
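
The rule above could be sketched in R roughly as follows. This is only
an illustration under stated assumptions: count_species() is a
hypothetical helper name, codes are assumed to be 8-character strings,
and a level is assumed to be "absent" exactly when its pair is "00".

```r
## count codes that have no more specific code beneath them
## (count_species is a hypothetical name, not an existing function)
count_species <- function(codes) {
    codes <- unique(codes)
    ## drop trailing "00" pairs, e.g. "45630000" -> "4563"
    prefix <- sub("(00)+$", "", codes)
    ## a code is superseded if a different code starts with its prefix
    superseded <- vapply(seq_along(codes), function(i) {
        any(startsWith(codes, prefix[i]) & codes != codes[i])
    }, logical(1))
    sum(!superseded)
}

## the ten headed rows shown earlier
excerpt <- c("10000000", "16220602", "20000000", "20110000", "24000000",
             "40210000", "40210102", "45140000", "45630000", "45630600")
count_species(excerpt)
```

On the headed rows this drops 20000000, 40210000 and 45630000 and
counts the remaining 7 codes.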

An example data set has been placed in my University web space and can
be read into R with the following:

## read example csv data
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
                colClasses = c("factor","character","numeric"))
## show the data
head(dat, n = 10)

And the sppcode variable can be broken out into the 4 levels if required via:

## split out the four levels of categorisation:
dat2 <- data.frame(dat,
                   order = with(dat, substr(sppcode, 1, 2)),
                   family = with(dat, substr(sppcode, 3, 4)),
                   genus = with(dat, substr(sppcode, 5, 6)),
                   species = with(dat, substr(sppcode, 7, 8)))

The actual data set contains several hundred different ids.
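
With that many ids, one possible approach is to work on the whole data
frame at once: build the zero-padded higher-level codes implied by each
row, then count only the rows that are not the parent of another row in
the same id. The sketch below is an assumption-laden illustration, not
a tested solution for the full data; count_by_site() is a hypothetical
name and codes are again assumed to be 8-character strings.

```r
## per-id count of codes that are not a parent of another code
## (count_by_site is a hypothetical name, not an existing function)
count_by_site <- function(id, sppcode) {
    ## drop repeated id/code combinations
    keep <- !duplicated(paste(id, sppcode))
    id <- id[keep]
    sppcode <- sppcode[keep]
    ## higher-level codes implied by each row, padded with zeros to
    ## 8 characters, e.g. "45630600" -> "45000000", "45630000"
    parents <- unlist(lapply(c(2, 4, 6), function(n) {
        p <- paste0(substr(sppcode, 1, n), strrep("0", 8 - n))
        paste(id, p)[p != sppcode]
    }))
    ## count a row only if it is not the parent of another row
    counted <- !(paste(id, sppcode) %in% parents)
    table(id[counted])
}

## small made-up example with two ids
ex_id <- factor(c("a", "a", "a", "a", "b", "b", "b"))
ex_code <- c("20000000", "20110000", "24000000", "10000000",
             "20000000", "45630600", "45630000")
count_by_site(ex_id, ex_code)
```

For the example data this would be called as
count_by_site(dat$id, dat$sppcode), returning one count per level of id.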

I can't see an efficient way of processing these data in the manner
described. Any help would be most gratefully received.

Many thanks,

Gavin
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
