[R] Calculation of group summaries

Seeliger.Curt@epamail.epa.gov Seeliger.Curt at epamail.epa.gov
Fri Jul 15 02:31:51 CEST 2005


Several people suggested specific functions (by, tapply, sapply and
others); thanks for not blowing off a simple question regarding how to
do the following SQL in R:
>   select year,
>          site_id,
>          visit_no,
>          mean(undercut) AS meanUndercut,
>          count(undercut) AS nUndercut,
>          std(undercut) AS stdUndercut
>   from channelMorphology
>   group by year, site_id, visit_no
>   ;

I'd spent quite a bit of time with the suggested functions earlier but
had no luck as I'd misread the docs and put the entire dataframe where
it only wants the columns to be processed.  Sometimes it's the simplest
of things.

This has lead to another confoundment-- sd() acts differently than
mean() for some reason, at least with R 1.9.0.  For some reason, means
generate NA results and a warning message for each group:

  argument is not numeric or logical: returning NA in:
mean.default(data[x, ], ...)

Of course, the argument is numeric, or there'd be no sd value.  Or more
likely, I'm still missing something really basic. If I wrap the value in
as.numeric() things work fine.  Why should I have to do this for mean
and median, but not sd? The code below should reproduce this error

  # Fake data for demo:
  nsites<-6
  yearList<-1999:2001
  fakesub<-as.data.frame(cbind(
                 year     =rep(yearList,nsites/length(yearList),each=11)
                ,site_id  =rep(c('site1','site2'),each=11*nsites)
                ,visit_no =rep(1,11*2*nsites)
                ,transect =rep(LETTERS[1:11],nsites,each=2)
                ,transdir =rep(c('LF','RT'),11*nsites)
                ,undercut =abs(rnorm(11*2*nsites,10))
                ,angle    =runif(11*2*nsites,0,180)
                ))

  # Create group summaries:
  sdmets<-by(fakesub$undercut
            ,list(fakesub$year,fakesub$site_id,fakesub$visit_no)
            ,sd
            )
  nmets<-by(fakesub$undercut
           ,list(fakesub$year,fakesub$site_id,fakesub$visit_no)
           ,length
           )
  xmets<-by(fakesub$undercut
           ,list(fakesub$year,fakesub$site_id,fakesub$visit_no)
           ,mean
           )
   xmets<-by(as.numeric(fakesub$undercut)
           ,list(fakesub$year,fakesub$site_id,fakesub$visit_no)
           ,mean
           )

  # Put site id values (year, site_id and visit_no) into results:
  # List unique id combinations as a list of lists.  Then
  # reorganize that into 3 vectors for final results.
  # Certainly, there MUST be a better way...
  foo<-strsplit(unique(paste(fakesub$year
                            ,fakesub$site_id
                            ,fakesub$visit_no
                            ,sep='#'))
               ,split='#'
               )
  year<-list()
  for(i in 1:length(foo)) {year<-rbind(year,foo[[i]][1])}
  site_id<-list()
  for(i in 1:length(foo)) {site_id<-rbind(site_id,foo[[i]][2])}
  visit_no<-list()
  for(i in 1:length(foo)) {visit_no<-rbind(visit_no,foo[[i]][3])}

  # Final result, more or less
  data.frame(cbind(a=year,b=site_id,c=visit_no,sdmets,nmets,xmets))


cur

--
Curt Seeliger, Data Ranger
CSC, EPA/WED contractor
541/754-4638
seeliger.curt at epa.gov




More information about the R-help mailing list