[BioC] subsetting SummarizedExperiments - a proposal (or a hack?)

Wed Feb 5 23:11:28 CET 2014

Martin,

Google says: 'Your search - "subset.SummarizedExperiment" - did not match any documents.'

So I offer the below for your consideration, along with a few examples and a quick benchmark comparison with using the more explicit `[` extraction operator.

subset.SummarizedExperiment<-function(x                                                                                                                                           
                                      ,rowSubset=TRUE                                                                                                                             
                                      ,colSubset=TRUE                                                                                                                             
                                      ,assaySubset=TRUE                                                                                                                           
                                      ,drop=FALSE                                                                                                                                 
                                      ,provenanceTrack=FALSE                                                                                                                      
                                      ,...) {                                                                                                                                     
    ## PURPOSE: implement subsetting of SummarizedExperiments,                                                                                                                    
    ## allowing expressions in terms of:                                                                                                                                          
    ##                                                                                                                                                                            
    ##  + the GRanges (and its mcols meta-data) held in rowData                                                                                                                   
    ##  + the experimental meta-data held in the colData DataFrame                                                                                                                
    ##  + the names of the assays                                                                                                                                                 
    x<-x[                                                                                                                                                                         
         ##eval(as.expression(substitute(row)),mcols(rowData(x)),.GlobalEnv)                                                                                                      
         eval(as.expression(substitute(rowSubset)),as.data.frame(rowData(x)),.GlobalEnv)                                                                                          
         ,eval(as.expression(substitute(colSubset)),colData(x),.GlobalEnv)                                                                                                        
         ,drop=drop,...]                                                                                                                                                          
    if (! identical(TRUE,assaySubset)) assays(x)<-assays(x)[assaySubset]                                                                                                          
    if(provenanceTrack) {                                                                                                                                                         
        exptData(x)$rowSubset<-c(exptData(x)$rowSubset,as.character(substitute(rowSubset)))                                                                                       
        exptData(x)$colSubset<-c(exptData(x)$colSubset,as.character(substitute(colSubset)))                                                                                       
    }                                                                                                                                                                             
    x                                                                                                                                                                             
}                                                                                                                                                                                 
attr(subset,'ex')<-function() {                                                                                                                                                   
    example(SummarizedExperiment)                                                                                                                                                 
    assays(se1)$a2<-assays(se1)$counts*2                                                                                                                                          
    assays(se1)$a3<-assays(se1)$counts*3                                                                                                                                          
    benchmark(replications=100                                                                                                                                                    
              ,se1.ss1<-se1[start(rowData(se1))==344757,se1$Treatment=='ChIP']                                                                                                    
              ,se1.ss2<-subset(se1,start==344757,Treatment=='ChIP')                                                                                                               
              )                                                                                                                                                                   
    stopifnot(identical(assays(se1.ss1),assays(se1.ss2)))                                                                                                                         
    se1.ss3<-subset(se1,strand=='+',Treatment=='ChIP',c('a2','a3'))                                                                                                               
    stopifnot(identical(c('a2','a3'),names(assays(se1.ss3))))                                                                                                                     
}                                                             

The rbenchmarks shows the cost of overhead in this contrived minimal example.  YMMV:

                                                                    test replications elapsed relative user.self sys.self user.child sys.child                                   
 3                                                          c("a2", "a3")          100   0.001        1     0.001    0.000          0         0                                   
 1 se1.ss1 <- se1[start(rowData(se1)) == 344757, se1$Treatment == "ChIP"]          100   3.113     3113     3.111    0.001          0         0                                   
 2           se1.ss2 <- subset(se1, start == 344757, Treatment == "ChIP")          100   3.486     3486     3.486    0.000          0         0  

Is all this in the spirit of things as you see it?   It sure makes my use of SE "scan" (erhm, like poetry;)

if you like, I also have castLong and castWide functions  to reshape a (subsetted) SE as a data.table for use in ggplot and friends.

You like?

Cheers,

~Malcolm