[Bioc-devel] Using integrated containers in Bioconductor packages
Martin Morgan
mtmorgan at fhcrc.org
Tue Nov 5 15:12:29 CET 2013
Hi package developers --
I found this article pretty interesting reading:
http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html?WT.ec_id=NBT-201310
especially of course the comments of Robert Gentleman and the reasons for
success of R (external packages written by domain experts) and Bioconductor
(interoperability between different analysis capabilities enabled by using
similar data structures). It's also very important to provide 'integrated'
containers that couple, say, a matrix of expression count data with the
annotations of the genes / gene regions (rows) and sample phenotypic data (columns).
With these ideas in mind, I want to emphasize that new and existing Bioconductor
packages should be re-using established data structures. With omics data it is
very important to offer users a way to easily work with data across Bioconductor
packages. While you might implement 'internal' functions that perform numerical
calculations on an R `matrix`, say, the major input functions should really
support GenomicRanges::SummarizedExperiment objects, rather than (in addition
to?) plain old matrix objects.
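As a rough sketch of that layering (not code from any existing package): the
worker .rowVariance() and the generic rowVariance() below are hypothetical
names made up for illustration. The sketch also assumes a current Bioconductor
installation, where the class lives in the SummarizedExperiment package rather
than in GenomicRanges.

```r
library(SummarizedExperiment)

# Hypothetical 'internal' worker: plain numerical code on an R matrix.
.rowVariance <- function(m) {
    apply(m, 1, var)
}

# Hypothetical user-facing generic; the methods only unpack their input
# and hand a matrix to the internal worker.
setGeneric("rowVariance", function(x, ...) standardGeneric("rowVariance"))

# Plain matrix input is still accepted.
setMethod("rowVariance", "matrix", function(x, ...) {
    .rowVariance(x)
})

# The integrated container is the main entry point: pull out the first
# assay (with its dimnames) and reuse the same worker.
setMethod("rowVariance", "SummarizedExperiment", function(x, ...) {
    .rowVariance(assay(x))
})

# Small made-up data set to exercise both methods.
m <- matrix(c(1, 2, 3, 4, 5, 9), nrow = 2,
            dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
se <- SummarizedExperiment(assays = list(counts = m))

rowVariance(m)
rowVariance(se)
```

Both calls go through the same worker, so the matrix and container interfaces
cannot drift apart.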
The rowData of summarized experiments can minimally contain names like the
rownames() of a matrix, but can typically contain much more useful information,
e.g., the genomic coordinates of the regions of interest (as GRanges
or GRangesList objects) and / or other attributes that are useful to your own
analysis (GC content of each region?) or to the user (p-values from previous
analysis?). Similarly the colData can be simple identifiers like colnames() of a
matrix, but it's much more informative to tightly couple the phenotypic data
about the samples. This makes it easy and error-free for the user to do things
like subset both the phenotype and expression data by some phenotype of
interest, e.g., se[, colData(se)$Gender %in% "Female"].
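A minimal sketch of such an object follows, with made-up regions, GC content,
and phenotypes. It assumes a current Bioconductor installation, where the
class is provided by the SummarizedExperiment package and the genomic ranges
of the rows are supplied and accessed as rowRanges (this message calls them
rowData).

```r
library(GenomicRanges)
library(SummarizedExperiment)

# Made-up count matrix: 3 regions (rows) by 4 samples (columns).
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(paste0("region", 1:3),
                                 paste0("sample", 1:4)))

# Row annotation: genomic coordinates plus an extra per-region attribute
# (illustrative GC content values).
regions <- GRanges("chr1",
                   IRanges(start = c(100, 500, 900), width = 200),
                   gc = c(0.41, 0.55, 0.38))
names(regions) <- rownames(counts)

# Column annotation: phenotypic data about each sample.
pheno <- DataFrame(Gender = c("Female", "Male", "Female", "Male"),
                   row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
                           rowRanges = regions,
                           colData = pheno)

# One subsetting operation keeps counts and phenotype data in sync.
female <- se[, colData(se)$Gender %in% "Female"]
dim(female)
colnames(female)
colData(female)
```

Because the assay and the sample annotation are subset together, there is no
opportunity for the two to get out of step.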
Return values should respect the row and column indices of the inputs as
appropriate, so that it's easy for the user to add a matrix
(assays(se)[["foo"]] <- foo(se, ...)), or a vector or data.frame (preferably a
DataFrame; colData(se)$bar <- bar(se, ...)), of results to their summarized
experiment. It may often be appropriate to do this work for the user, returning
a SummarizedExperiment annotated with your additional results.
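One way this could look is sketched below. The functions logCounts(),
librarySize(), and annotate() are hypothetical stand-ins for a package's own
analysis code, the data are made up, and a current installation of the
SummarizedExperiment package is assumed.

```r
library(SummarizedExperiment)

# Made-up input: 2 genes by 3 samples, with one phenotype column.
se <- SummarizedExperiment(
    assays = list(counts = matrix(c(10, 20, 30, 40, 50, 60), nrow = 2,
                                  dimnames = list(c("g1", "g2"),
                                                  c("s1", "s2", "s3")))),
    colData = DataFrame(Gender = c("Female", "Male", "Female"),
                        row.names = c("s1", "s2", "s3")))

# Hypothetical analysis returning a matrix with the same dimensions and
# dimnames as the input assay.
logCounts <- function(se) {
    log2(assay(se, "counts") + 1)
}

# Hypothetical analysis returning one value per sample (column).
librarySize <- function(se) {
    colSums(assay(se, "counts"))
}

# Do the bookkeeping for the user: return the input object annotated
# with the additional results.
annotate <- function(se) {
    assays(se)[["logCounts"]] <- logCounts(se)
    colData(se)$librarySize <- librarySize(se)
    se
}

se2 <- annotate(se)
assayNames(se2)
colData(se2)
```

The results stay aligned with the original rows and columns, so later
subsetting of se2 carries them along.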
There are similar data structures for other types of data, e.g.,
Biobase::ExpressionSet for microarrays, and others in the flow cell packages.
Feel free to ask on this list if you're looking for guidance.
Not all return values are as simple as a vector, matrix, or data.frame, and of
course one should not try to force them into an inappropriate data structure.
Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793