[R] Counting/processing a character vector
Gavin Simpson
gavin.simpson at ucl.ac.uk
Wed Feb 18 18:44:24 CET 2009
To answer my own post, and for the archives (hopefully no one else has
to repeat what I had to do ;-): after much hair-pulling, frowning at
the screen and general dumb-headedness, the following slab of R code
achieves the results I wanted. It isn't elegant, but it does the job.
msr <- function(x) {
    res <- numeric(length = nlevels(x$id))
    names(res) <- levels(x$id)
    for(site in levels(x$id)) {
        ## subset just data for this site
        DAT <- x[x$id == site, ]
        ## count the rows resolved to species (last pair not 00).
        ## Logical indexing is used throughout: DAT[-which(...), ] would
        ## drop *every* row whenever which() matches nothing
        is.spp <- substr(DAT$sppcode, 7, 8) != "00"
        spp <- DAT$sppcode[is.spp]
        SPP <- sum(is.spp)
        DAT <- DAT[!is.spp, ]
        ## drop genera for spp already counted
        DAT <- DAT[!(substr(DAT$sppcode, 1, 6) %in% substr(spp, 1, 6)), ]
        ## now count genera remaining not 00
        is.gen <- substr(DAT$sppcode, 5, 6) != "00"
        gen <- DAT$sppcode[is.gen]
        GEN <- sum(is.gen)
        DAT <- DAT[!is.gen, ]
        ## drop families already in spp, or genera that we already caught
        DAT <- DAT[!(substr(DAT$sppcode, 1, 4) %in%
                     substr(c(spp, gen), 1, 4)), ]
        ## count remaining families != 00
        is.fam <- substr(DAT$sppcode, 3, 4) != "00"
        fam <- DAT$sppcode[is.fam]
        FAM <- sum(is.fam)
        DAT <- DAT[!is.fam, ]
        ## drop orders already covered by a counted spp, genus or family
        DAT <- DAT[!(substr(DAT$sppcode, 1, 2) %in%
                     substr(c(spp, gen, fam), 1, 2)), ]
        ## count the orders remaining
        ORD <- nrow(DAT)
        ## populate return vector
        res[site] <- SPP + GEN + FAM + ORD
    }
    return(res)
}
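A side note on indexing in code of this kind: dropping rows with a negative index built from which() is only safe when which() finds at least one match. The short sketch below uses a made-up vector x (not part of the data set) to show the difference between negative and logical indexing when nothing matches.

```r
## toy vector, purely illustrative
x <- c("a", "b", "c")

## nothing matches, so which() returns integer(0)
idx <- which(x == "z")

## -integer(0) is still integer(0), so this selects nothing at all
dropped_negative <- x[-idx]

## the logical form keeps every element when nothing matches
dropped_logical <- x[!(x == "z")]

length(dropped_negative)  # 0
length(dropped_logical)   # 3
```

The same applies to the rows of a data frame, which is why an explicit length check or a logical index is needed before dropping rows.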
## read example csv data
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
                colClasses = c("factor","character","numeric"))
## show the data
head(dat, n = 10)
## split out the four levels of categorisation:
dat2 <- data.frame(dat,
                   order   = with(dat, substr(sppcode, 1, 2)),
                   family  = with(dat, substr(sppcode, 3, 4)),
                   genus   = with(dat, substr(sppcode, 5, 6)),
                   species = with(dat, substr(sppcode, 7, 8)))
msr(dat)
Yields:
> msr(dat)
10307 10719 10786
   15    40    35
Which are correct.
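For the archives, the counting rule can also be read as "count a code only if no other code at the same site is a more specific version of it". The following is a minimal sketch of that reading, not the code used above. The helper name count_leaves is hypothetical, and it assumes every code is 8 characters made of 2-character pairs, with unused lower levels padded by 00 and no all-zero code. The ten rows are the ones shown by head(dat, n = 10) in the original post.

```r
## hypothetical helper: count the most specific codes in one site
count_leaves <- function(codes) {
    ## strip trailing "00" pairs, e.g. "45630000" -> "4563"
    prefix <- sub("(00)+$", "", codes)
    ## a code counts if no other prefix strictly extends its prefix
    is_leaf <- vapply(prefix, function(p) {
        !any(startsWith(prefix, p) & nchar(prefix) > nchar(p))
    }, logical(1))
    sum(is_leaf)
}

## the ten rows printed by head(dat, n = 10) in the original post
ex <- data.frame(
    id = factor(rep("10307", 10)),
    sppcode = c("10000000", "16220602", "20000000", "20110000",
                "24000000", "40210000", "40210102", "45140000",
                "45630000", "45630600"),
    stringsAsFactors = FALSE
)

## one count per site
res <- tapply(ex$sppcode, ex$id, count_leaves)
res  # 7 for site 10307
```

This avoids the explicit loop over the four levels, at the cost of comparing every code with every other code within a site, so it is a sketch to check against msr() rather than a guaranteed faster replacement. startsWith() needs R 3.3.0 or later.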
G
On Wed, 2009-02-18 at 13:37 +0000, Gavin Simpson wrote:
> Dear List,
>
> I have a data set stored in the following format:
>
> > head(dat, n = 10)
>       id  sppcode abundance
> 1  10307 10000000         1
> 2  10307 16220602         2
> 3  10307 20000000         5
> 4  10307 20110000         2
> 5  10307 24000000         1
> 6  10307 40210000        83
> 7  10307 40210102        45
> 8  10307 45140000         1
> 9  10307 45630000         1
> 10 10307 45630600        41
> > str(dat)
> 'data.frame': 111 obs. of 3 variables:
> $ id : Factor w/ 3 levels "10307","10719",..: 1 1 1 1 1 1 1 1 1 1 ...
> $ sppcode : chr "10000000" "16220602" "20000000" "20110000" ...
> $ abundance: num 1 2 5 2 1 83 45 1 1 41 ...
>
> that represent counts of species, recorded with a particular coding
> system. The abundance column is not needed for this particular
> operation, but is present in the data files.
>
> I am interested in counting entries (rows) in the sppcode component of
> dat. The sppcode takes a particular format: Order Family Genus Species,
> with 2 alphanumeric digits allocated for each level of the hierarchy. I
> want to know how many species there are in each site (the id factor),
> but I should only count a higher level entry if there are no lower
> levels present.
>
> For example, for the above data excerpt (just the headed rows), I would
> count the following rows:
>
> 10000000
> 16220602
> 20110000
> 24000000
> 40210102
> 45140000
> 45630600 == 7 "species" present.
>
> To be more specific, I don't count 45630000 (row 9) because there exists
> a sppcode for this 'id' where either of the next two pairs of digits are
> not all 0's.
>
> In words, for codes of the form WWXXYYZZ, I want to count all rows
> where ZZ != 00, and then rows where ZZ == 00 only if the WWXXYY
> combination has not been counted yet.
>
> An example data set has been placed in my University web space and can
> be read into R with the following:
>
> ## read example csv data
> dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
>                 colClasses = c("factor","character","numeric"))
> ## show the data
> head(dat, n = 10)
>
> And the sppcode variable can be broken out into the 4 levels if required via:
>
> ## split out the four levels of categorisation:
> dat2 <- data.frame(dat,
>                    order   = with(dat, substr(sppcode, 1, 2)),
>                    family  = with(dat, substr(sppcode, 3, 4)),
>                    genus   = with(dat, substr(sppcode, 5, 6)),
>                    species = with(dat, substr(sppcode, 7, 8)))
>
> The actual data set/problem contains several hundred different id's.
>
> I can't see an efficient way of processing these data in the manner
> described. Any help would be most gratefully received.
>
> Many thanks,
>
> Gavin
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Dr. Gavin Simpson [t] +44 (0)20 7679 0522
ECRC, UCL Geography, [f] +44 (0)20 7679 0565
Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%