[R] Counting/processing a character vector
Gavin Simpson
gavin.simpson at ucl.ac.uk
Wed Feb 18 14:37:06 CET 2009
Dear List,
I have a data set stored in the following format:
> head(dat, n = 10)
      id  sppcode abundance
1  10307 10000000         1
2  10307 16220602         2
3  10307 20000000         5
4  10307 20110000         2
5  10307 24000000         1
6  10307 40210000        83
7  10307 40210102        45
8  10307 45140000         1
9  10307 45630000         1
10 10307 45630600        41
> str(dat)
'data.frame':   111 obs. of  3 variables:
 $ id       : Factor w/ 3 levels "10307","10719",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sppcode  : chr  "10000000" "16220602" "20000000" "20110000" ...
 $ abundance: num  1 2 5 2 1 83 45 1 1 41 ...
The data represent counts of species, recorded with a particular coding
system. The abundance column is not needed for this particular
operation, but is present in the data files.
I am interested in counting entries (rows) in the sppcode component of
dat. The sppcode takes a particular format: Order Family Genus Species,
with two alphanumeric characters allocated to each level of the hierarchy.
I want to know how many species there are at each site (the id factor),
but I should only count a higher-level entry if there are no lower-level
entries present beneath it.
For example, for the above data excerpt (just the headed rows), I would
count the following rows:
    10000000
    16220602
    20110000
    24000000
    40210102
    45140000
    45630600
which gives 7 "species" present.
To be more specific, I don't count 45630000 (row 9) because there exists
another sppcode for this 'id' (45630600, row 10) that shares the same
leading digits and has a later pair of digits that is not "00".
In words, for a code of the form WWXXYYZZ, I want to count all rows
where ZZ != 00, then rows where ZZ == 00 only if the WWXXYY combination
has not been counted yet, and likewise on up through the higher levels.
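A literal sketch of that rule, applied to the ten rows shown above, is
given here only to pin down what is being asked. It assumes every code
is exactly 8 characters long and that a "00" pair always means "not
identified at this level". The name count_terminal() is made up for
illustration and does not come from any package. It compares every code
with every other code within a site, so it is not an efficient
solution.

```r
## Hypothetical helper: count the codes that have no more specific
## code beneath them in the same vector.
count_terminal <- function(codes) {
    codes <- unique(codes)
    ## drop trailing "00" pairs to leave the significant prefix,
    ## e.g. "45630000" -> "4563", "10000000" -> "10"
    prefix <- sub("(00)+$", "", codes)
    ## a code is counted only if it is the sole code starting with
    ## its own prefix, i.e. nothing lower in the hierarchy exists
    keep <- vapply(prefix, function(p) {
        sum(substr(codes, 1, nchar(p)) == p) == 1L
    }, logical(1))
    sum(keep)
}

## the ten rows from head(dat, n = 10) above
head_dat <- data.frame(
    id = factor(rep("10307", 10)),
    sppcode = c("10000000", "16220602", "20000000", "20110000",
                "24000000", "40210000", "40210102", "45140000",
                "45630000", "45630600"),
    stringsAsFactors = FALSE)

## one count per site
with(head_dat, tapply(sppcode, id, count_terminal))
```

For these ten rows the call returns 7 for site 10307, matching the
hand count above.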
An example data set has been placed in my University web space and can
be read into R with the following:
## read example csv data
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
                colClasses = c("factor", "character", "numeric"))
## show the data
head(dat, n = 10)
And the sppcode variable can be broken out into the 4 levels if required via:
## split out the four levels of categorisation:
dat2 <- data.frame(dat,
                   order   = with(dat, substr(sppcode, 1, 2)),
                   family  = with(dat, substr(sppcode, 3, 4)),
                   genus   = with(dat, substr(sppcode, 5, 6)),
                   species = with(dat, substr(sppcode, 7, 8)))
The actual data set contains several hundred different ids.
I can't see an efficient way of processing these data in the manner
described. Any help would be most gratefully received.
Many thanks,
Gavin
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Dr. Gavin Simpson [t] +44 (0)20 7679 0522
ECRC, UCL Geography, [f] +44 (0)20 7679 0565
Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%