[R] Counting/processing a character vector

Gavin Simpson gavin.simpson at ucl.ac.uk
Wed Feb 18 18:44:24 CET 2009


To answer my own post, and for the archives (hopefully nobody else has
to repeat what I had to do ;-): after much hair-pulling, frowning at
the screen and general dumb-headedness, the following slab of R code
achieves the results I wanted. It isn't elegant but it does the job.

msr <- function(x) {
    res <- numeric(length = length(levels(x$id)))
    names(res) <- levels(x$id)
    for(site in levels(x$id)) {
        ## subset just data for this site
        DAT <- x[x$id == site, ]

        ## split out the spp and count the ones not 00
        spp <- with(DAT, substr(sppcode, 7, 8))
        spp.counted <- which(spp != "00")
        spp <- with(DAT[spp.counted, ], sppcode)
        SPP <- length(spp.counted)
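        ## setdiff() keeps all rows when nothing was counted; DAT[-integer(0), ]
        ## would instead return zero rows
        DAT <- DAT[setdiff(seq_len(nrow(DAT)), spp.counted), ]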

        ## drop genera for spp already counted
        want <- with(DAT, which(substr(sppcode, 1, 6) %in% substr(spp, 1, 6)))
        if(length(want) >= 1) {
            DAT <- DAT[-want, ]
        }

        ## now count genera remaining not 00
        gen <- with(DAT, substr(sppcode, 5, 6))
        gen.counted <- which(gen != "00")
        gen <- with(DAT[gen.counted, ], sppcode)
        GEN <- length(gen.counted)
        ## as above, safe when no genera were counted
        DAT <- DAT[setdiff(seq_len(nrow(DAT)), gen.counted), ]

        ## drop families already in spp, or genera that we already caught
        want1 <- with(DAT, which(substr(sppcode, 1, 4) %in% substr(spp, 1, 4)))
        want2 <- with(DAT, which(substr(sppcode, 1, 4) %in% substr(gen, 1, 4)))
        if(length(want <- unique(c(want1, want2))) >= 1) {
            DAT <- DAT[-want, ]
        }

        ## count remaining families != 00
        fam <- with(DAT, substr(sppcode, 3, 4))
        fam.counted <- which(fam != "00")
        fam <- with(DAT[fam.counted, ], sppcode)
        FAM <- length(fam.counted)
        ## as above, safe when no families were counted
        DAT <- DAT[setdiff(seq_len(nrow(DAT)), fam.counted), ]

        ## drop orders for families already counted
        want1 <- with(DAT, which(substr(sppcode, 1, 2) %in% substr(spp, 1, 2)))
        want2 <- with(DAT, which(substr(sppcode, 1, 2) %in% substr(gen, 1, 2)))
        want3 <- with(DAT, which(substr(sppcode, 1, 2) %in% substr(fam, 1, 2)))
        if(length(want <- unique(c(want1, want2, want3))) >= 1) {
            DAT <- DAT[-want, ]
        }

        ## count the orders remaining
        ORD <- nrow(DAT)

        ## populate return vector
        res[site] <- SPP + GEN + FAM + ORD
    }
    return(res)
}
## read example csv data
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
                colClasses = c("factor","character","numeric"))
## show the data
head(dat, n = 10)
## split out the four levels of categorisation:
dat2 <- data.frame(dat,
                   order = with(dat, substr(sppcode, 1, 2)),
                   family = with(dat, substr(sppcode, 3, 4)),
                   genus = with(dat, substr(sppcode, 5, 6)),
                   species = with(dat, substr(sppcode, 7, 8)))

msr(dat)

Yields:
> msr(dat)
10307 10719 10786 
   15    40    35

Which are correct.
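
For the archives, the counting rule from my original post (count a row
only if nothing more specific sits below it within the same site) can
also be sketched without the level-by-level bookkeeping. This is only a
sketch under assumptions: the helper names is_leaf() and count_leaves()
are made up for illustration, codes are assumed to be 8 characters with
"00" pairs appearing only at the end, and I have only checked it by hand
against the ten-row excerpt quoted below, not the full example file.

```r
## Hypothetical helper (not part of msr() above): TRUE for rows that
## have no deeper entry below them among the supplied codes.
is_leaf <- function(codes) {
    ## strip trailing "00" pairs, e.g. "45630000" -> "4563"
    prefix <- sub("(00)+$", "", codes)
    vapply(seq_along(codes), function(i) {
        n <- nchar(prefix[i])
        ## a deeper entry starts with this prefix and has a longer
        ## non-zero prefix of its own
        !any(substr(codes, 1, n) == prefix[i] & nchar(prefix) > n)
    }, logical(1))
}

## Hypothetical wrapper: number of leaf rows per site (the id factor)
count_leaves <- function(x) {
    vapply(split(x$sppcode, x$id),
           function(codes) sum(is_leaf(codes)),
           integer(1))
}

## the ten rows shown by head(dat, n = 10) in the original post
excerpt <- data.frame(
    id = factor(rep("10307", 10)),
    sppcode = c("10000000", "16220602", "20000000", "20110000",
                "24000000", "40210000", "40210102", "45140000",
                "45630000", "45630600"),
    stringsAsFactors = FALSE)

count_leaves(excerpt)   ## 7 for site 10307
```

Like msr(), this counts rows rather than distinct codes, so a code that
appears twice within a site is counted twice. Each row is compared with
every other row in its site, which is fine for a few hundred ids but is
not tuned for speed.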

G

On Wed, 2009-02-18 at 13:37 +0000, Gavin Simpson wrote:
> Dear List,
> 
> I have a data set stored in the following format:
> 
> > head(dat, n = 10)
>       id  sppcode abundance
> 1  10307 10000000         1
> 2  10307 16220602         2
> 3  10307 20000000         5
> 4  10307 20110000         2
> 5  10307 24000000         1
> 6  10307 40210000        83
> 7  10307 40210102        45
> 8  10307 45140000         1
> 9  10307 45630000         1
> 10 10307 45630600        41
> > str(dat)
> 'data.frame':	111 obs. of  3 variables:
>  $ id       : Factor w/ 3 levels "10307","10719",..: 1 1 1 1 1 1 1 1 1 1 ...
>  $ sppcode  : chr  "10000000" "16220602" "20000000" "20110000" ...
>  $ abundance: num  1 2 5 2 1 83 45 1 1 41 ...
> 
> which represents counts of species, recorded with a particular coding
> system. The abundance column is not needed for this particular
> operation, but is present in the data files.
> 
> I am interested in counting entries (rows) in the sppcode component of
> dat. The sppcode takes a particular format: Order Family Genus Species,
> with 2 alphanumeric digits allocated for each level of the hierarchy. I
> want to know how many species there are in each site (the id factor),
> but I should only count a higher level entry if there are no lower
> levels present.
> 
> For example, for the above data excerpt (just the headed rows), I would
> count the following rows:
> 
> 10000000
> 16220602
> 20110000
> 24000000
> 40210102
> 45140000
> 45630600 == 7 "species" present.
> 
> To be more specific, I don't count 45630000 (row 9) because there exists
> a sppcode for this 'id' with the same order and family (4563) in which
> one of the next two pairs of digits is not 00 (45630600, row 10).
> 
> In words, for codes of the form WWXXYYZZ, I want to count all rows where
> ZZ != 00, then rows where ZZ == 00 only if the WWXXYY combination has not
> been counted yet, and so on up the hierarchy.
> 
> An example data set has been placed in my University web space and can
> be read into R with the following:
> 
> ## read example csv data
> dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
>                 colClasses = c("factor","character","numeric"))
> ## show the data
> head(dat, n = 10)
> 
> And the sppcode variable can be broken out into the 4 levels if required via:
> 
> ## split out the four levels of categorisation:
> dat2 <- data.frame(dat,
>                    order = with(dat, substr(sppcode, 1, 2)),
>                    family = with(dat, substr(sppcode, 3, 4)),
>                    genus = with(dat, substr(sppcode, 5, 6)),
>                    species = with(dat, substr(sppcode, 7, 8)))
> 
> The actual data set/problem contains several hundred different id's.
> 
> I can't see an efficient way of processing these data in the manner
> described. Any help would be most gratefully received.
> 
> Many thanks,
> 
> Gavin
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
