[Bioc-devel] Using integrated containers in Bioconductor packages

Tue Nov 5 15:12:29 CET 2013

Hi package developers --

I found this article pretty interesting reading

   http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html?WT.ec_id=NBT-201310

especially of course the comments of Robert Gentleman and the reasons for 
success of R (external packages written by domain experts) and Bioconductor 
(interoperability between different analysis capabilities enabled by using 
similar data structures). It's also very important to provide 'integrated' 
containers that couple, say, a matrix of expression count data with the 
annotations of the genes / gene regions (rows) and sample phenotypic data (columns).

With these ideas in mind, I want to emphasize that new and existing Bioconductor 
packages should be re-using established data structures. With omics data it is 
very important to offer users a way to easily work with data across Bioconductor 
packages. While you might implement 'internal' functions that perform numerical 
calculations on an R `matrix`, say, the major input functions should really 
support GenomicRanges::SummarizedExperiment objects, rather than (in addition 
to?) plain old matrix objects.
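
One way to support both kinds of input is an S4 generic with a method for each. 
The following is only a minimal sketch: regionTotals() is a hypothetical function 
name, and the code assumes a current Bioconductor release, where the class is 
provided by the SummarizedExperiment package rather than GenomicRanges.

```r
library(SummarizedExperiment)

# Hypothetical generic, for illustration only.
setGeneric("regionTotals", function(x, ...) standardGeneric("regionTotals"))

# 'Internal' numerical work happens on a plain matrix.
setMethod("regionTotals", "matrix", function(x, ...) rowSums(x))

# The SummarizedExperiment method extracts the first assay and delegates.
setMethod("regionTotals", "SummarizedExperiment",
          function(x, ...) regionTotals(assay(x), ...))

# Toy data: the same counts as a matrix and as a SummarizedExperiment.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("s1", "s2", "s3")))
se <- SummarizedExperiment(assays = list(counts = m))

fromMatrix <- regionTotals(m)
fromSE <- regionTotals(se)
```

Both calls go through the same numerical code, so existing matrix-based users 
are not affected by adding the SummarizedExperiment method.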

The rowData of summarized experiments can minimally contain names like the 
rownames() of a matrix, but can typically contain much more useful information, 
e.g., the genomic coordinates of the regions of interest (as GRanges 
or GRangesList objects) and / or other attributes that are useful to your own 
analysis (GC content of each region?) or to the user (p-values from previous 
analysis?). Similarly the colData can be simple identifiers like colnames() of a 
matrix, but it's much more informative to tightly couple the phenotypic data 
about the samples. This makes it easy and error-free for the user to do things 
like subset both the phenotype and expression data by some phenotype of 
interest, e.g., se[, colData(se)$Gender %in% "Female"].
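
As a rough illustration of the coupling described above, here is a small sketch 
with made-up data. It assumes a current Bioconductor release, where the class 
lives in the SummarizedExperiment package and the genomic coordinates of the rows 
are supplied through the rowRanges argument; the region names, GC values and 
Gender labels are invented for the example.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges and S4Vectors

# Toy data: 4 regions (rows) by 3 samples (columns); all values are made up.
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("region", 1:4), paste0("sample", 1:3)))

# Row annotation: genomic coordinates plus an extra per-region attribute.
regions <- GRanges("chr1", IRanges(start = c(1, 101, 201, 301), width = 50),
                   GC = c(0.41, 0.52, 0.38, 0.60))
names(regions) <- rownames(counts)

# Column annotation: phenotypic data about the samples.
pheno <- DataFrame(Gender = c("Female", "Male", "Female"),
                   row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
                           rowRanges = regions, colData = pheno)

# Subsetting by phenotype keeps the counts and the sample annotation in sync.
females <- se[, colData(se)$Gender %in% "Female"]
```

Here females holds all four regions but only the two female samples, and 
colData(females) has been subset in the same step.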

Return values should respect the row and column indices of the inputs as 
appropriate, so that it's easy for the user to add a matrix of results 
(assays(se)[["foo"]] <- foo(se, ...)), or a vector or data.frame (preferably 
DataFrame) of results (mcols(rowData(se))$bar <- bar(se, ...) for per-row values, 
colData(se)$bar <- bar(se, ...) for per-sample values), to their summarized 
experiment. It may often be appropriate to do this work for the user, returning 
a SummarizedExperiment annotated with your additional results.
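
A minimal sketch of this pattern follows. foo(), bar() and annotate() are 
hypothetical stand-ins for a package's analysis functions, and the counts are 
made up. The code assumes a current Bioconductor release, in which the class is 
provided by the SummarizedExperiment package and rowData() returns the per-row 
annotation columns as a DataFrame.

```r
library(SummarizedExperiment)

# Hypothetical analysis functions, used only for illustration.
# foo(): a matrix with the same dimensions and dimnames as the input assay.
foo <- function(se) log2(assay(se, "counts") + 1)
# bar(): one value per row, in the row order of the input.
bar <- function(se) rowMeans(assay(se, "counts"))

counts <- matrix(c(0, 1, 3, 7, 15, 31), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
se <- SummarizedExperiment(
    assays = list(counts = counts),
    colData = DataFrame(Gender = c("Female", "Male"),
                        row.names = colnames(counts)))

# Because the results respect the input's rows and columns, they slot in directly.
assays(se)[["logCounts"]] <- foo(se)
rowData(se)$meanCount <- bar(se)

# Alternatively, do this work for the user and return the annotated object.
annotate <- function(se) {
    assays(se)[["logCounts"]] <- foo(se)
    rowData(se)$meanCount <- bar(se)
    se
}
annotated <- annotate(SummarizedExperiment(assays = list(counts = counts)))
```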

There are similar data structures for other types of data, e.g., 
Biobase::ExpressionSet for microarrays, and the classes used in the flow 
cytometry packages. Feel free to ask on this list if you're looking for guidance.

Not all return values are as simple as a vector, matrix, or data.frame, and of 
course one should not try to force these into an inappropriate data structure.

Martin
-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793