[Bioc-devel] Help in designing class based on SummarizedExperiment

Sat Feb 15 01:25:14 CET 2014

Apologies for the bump, but is anyone able to help me on this? Or are these questions more appropriate for the general Bioconductor mailing list rather than Bioc-Devel?

Many thanks,
Pete

----- Original Message -----
Date: Mon, 10 Feb 2014 13:20:47 +1100
From: Peter Hickey <hickey at wehi.EDU.AU>
To: bioc-devel at r-project.org
Subject: [Bioc-devel] Help in designing class based on
	SummarizedExperiment
Message-ID: <E3127FD3-87E5-43DA-B056-A633525EEC71 at wehi.edu.au>
Content-Type: text/plain

Hi all,

Apologies up front for the rather long post.

I'm designing a class to store what I call co-methylation m-tuples. These are based on a very simple tab-delimited file format. 
For example, here are 1-tuples (m = 1):
chr		pos1	M	U
chr1	57691	0	1
chr1	59276	1	0
chr1	60408	1	0
chr1	63495	1	0
chr1	63568	2	0
chr1	63627	3	0

2-tuples (m = 2):
chr		pos1	pos2	MM	MU	UM	UU
chr1	567438	567570	0	0	0	2
chr1	567501	567549	0	0	0	35
chr1	567549	567558	0	1	0	139

3-tuples (m = 3):
chr		pos1	pos2	pos3	MMM	MMU	MUM	MUU	UMM	UMU	UUM	UUU
chr1	13644	13823	13828	1		0		0		0		0		0		0		0
chr1	14741	14747 	14773	1		0		0		0		0		0		0		0

etc.

1-tuples are basically the standard input to an analysis of BS-seq data. 

I think of these files as being comprised of 3 parts: the 'chr' column (chr), the 'pos' matrix (pos1, pos2, pos3) and the 'counts' matrix (MMM, MMU, MUM, MUU, UMM, UMU, UUM, UUU), when m = 3. For a given value of 'm' there is one 'chr' column, m 'pos' columns and 2^m 'counts' columns.

I want to implement a class for these objects as I'm writing a package for the analysis of this type of data. I'd like a GRanges-type object storing the genomic information and a matrix-like object storing the counts. After tinkering around for a while, and doing some reading of the code in packages such as GenomicRanges and bsseq, I decided to extend the SummarizedExperiment class.  I now have a prototype but I have some questions and would appreciate feedback on some of my design choices before I translate my existing functions to work with this class of object.

Here is the code for the prototype:
#####################################################################
library(GenomicRanges)

setClass("CoMeth", contains = "SummarizedExperiment")

CoMeth <- function(seqnames, pos, counts, m, methylation_type, sample_name, strand = "*", seqlengths = NULL, seqinfo = NULL){

  # Argument checks, etc. go here #

  gr <- GRanges(seqnames = seqnames, ranges = IRanges(start = pos[[1]], end = pos[[length(pos)]]), strand = strand, seqlengths = seqlengths, seqinfo = seqinfo) # The width of each element is defined by the first and last 'pos', e.g. for 3-tuples it is defined by pos1 and pos3.
  # Need to store the "extra" positions if m > 2. Each additional position is stored as a separate assay
  if (m > 2){
    extra_pos <- lapply(seq(2, m - 1, 1), function(i, pos){
      pos[[i]]
    }, pos = pos)
    names(extra_pos) <- names(pos)[2:(m-1)]
  } else {
    extra_pos <- NULL
  }
  assays <- SimpleList(c(counts, extra_pos))
  colData <- DataFrame(sample_name = sample_name, m = m, methylation_type = paste0(sort(methylation_type), collapse = '/'))
  cometh <- SummarizedExperiment(assays = assays, rowData = gr, colData = colData)
  cometh <- as(cometh, "CoMeth")

  return(cometh)
}

And here's some example data:
# A function that roughly imitates the output of a call to scan() to read in BS-seq m-tuple data
# m is the size of the m-tuples
# n is the number of m-tuples
# z is the proportion of each column of 'counts' that is zero
make_test_data <- function(m, n, z){
  seqnames <- list(seqnames = rep('chr1', n))
  pos <- lapply(1:m, function(x, n){matrix(seq(from = 1 + x - 1, to = n + x - 1, by = 1), ncol = 1)}, n = n) # Need these to be matrices rather than vectors
  names(pos) <- paste0('pos', 1:m)
  # A rough hack to simulate counts where a proportion (z) are 0 and the rest are sampled from Poisson(lambda). Small values of lambda will inflate the zero-count.
  counts <- mapply(FUN = function(i, z, n, lambda){
    nz <- floor(n * (1 - z))
    matrix(sample(c(rpois(nz, lambda), rep(0, n - nz))), ncol = 1)
    }, i = 1:(2 ^ m), z = z, n = n, lambda = 4, SIMPLIFY = FALSE) # Need these to be matrices rather than vectors
  names(counts) <- sort(do.call(paste0, expand.grid(lapply(seq_len(m), function(x){c('M', 'U')}))))
  return(c(seqnames, pos, counts))
}

m <- 3 # An example using 3-tuples
n <- 1000 # A typical value for 3-tuples from a methylC-seq experiment is n = 17,000,000
z <- c(0.2, 0.6, 0.6, 0.7, 0.6, 0.8, 0.8, 0.7) # Typical proportions of each column of 'counts' that are zero when using 3-tuples for a methylC-seq experiment
test_data <- make_test_data(n = n, m = m, z = z)
cometh <- CoMeth(seqnames = test_data[['seqnames']], pos = test_data[grepl('pos', names(test_data))], counts = test_data[grepl('[MU]', names(test_data))], m = m, methylation_type = 'CG', sample_name = 'test_data')

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicRanges_1.14.4 XVector_0.2.0        IRanges_1.20.6       BiocGenerics_0.8.0  

loaded via a namespace (and not attached):
[1] stats4_3.0.2 tools_3.0.2 
#####################################################################

Questions
	1. How can I move the 'extra_pos' columns from the assay slot but keep the copy-on-change behaviour? From a design perspective, I think it would make more sense for the 'extra_pos' columns, i.e. ('pos2') for 3-tuples and ('pos2', 'pos3') for 4-tuples etc., to be in their own slot rather than in the assays slot, after all, they aren't assays but rather are additional genomic co-ordinates. The 'extra_pos' fields are fixed (at least until I start subsetting or combining multiple CoMeth objects). My understanding of the the SummarizedExperiment class is that the assays slot is a reference class to avoid excessive copying when changing other slots of a SummarizedExperiment object. So if the 'extra_pos' columns were stored outside of the assays slot then these would have to be copied when any changes are made to the other slots of a CoMeth object, correct? Is there a way to avoid this, i.e. so that these 'extra_pos' columns are stored separately from the assays slot but with the !
 copy-on-change behaviour of the assays slot?
	2. Is the correct to compute something based on the 'counts' data via the assay() accessor? For example, I might want a helper function getCounts(cometh) that does the equivalent of sapply(X = 1:(2^m), function(i, cometh){assay(cometh, i)}, cometh = cometh). Similarly, I might want to compute the coverage of an m-tuple, which would be the equivalent of rowSums(getCounts(cometh)). Is this the correct way to do this sort of thing?
	3. How do I measure the size of a SummarizedExperiment/CoMeth object? For example, with the test data, print(object.size(cometh), units = "auto") < print(object.size(assays(cometh)), units = "auto"), so it seems that the size of the assays slot isn't counted by object.size(). 
	4. Is it possible to store an Rle-type object in the assays slot of a SummarizedExperiment? 20-80% of the entries in each column of 'counts' are zero and there are often runs of zeros. So I thought that perhaps an Rle representation (column-wise) might be more (memory) efficient. But I can't seem to get an Rle object in the assays slot (I tried via DataFrame); is it even possible?
	5. Are there matrix-like objects with Rle columns? I found this thread started by Kasper Hansen (https://stat.ethz.ch/pipermail/bioconductor/2012-June/046473.html) discussing the idea of matrix-like object where the columns are Rle's; I could imagine using such an object for a CoMeth object containing multiple samples, i.e. MMM is a matrix-like object with ncol = # of samples, MMU is matrix-like object with ncol = # of samples, etc. Was anything like this ever implemented? My reading of the previous thread was to use a DataFrame but the "matrix API", e.g. rowSums, doesn't work with DataFrames (and see (4) as to whether it's even possible to store such objects in the assays slot).

Many thanks for your help in answering these questions. Any other suggestions on the design of the CoMeth class are appreciated.

Thanks,
Pete
--------------------------------
Peter Hickey,
PhD Student/Research Assistant,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Ph: +613 9345 2324

hickey at wehi.edu.au
http://www.wehi.edu.au

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}