[Bioc-devel] Reading and storing single cell ATAC data

Hervé Pagès hpages at fredhutch.org
Sat Sep 24 02:06:50 CEST 2016


Yes, I would also encourage you to explore the SummarizedExperiment
route rather than ExpressionSet one. To be more precise,
RangedSummarizedExperiment is probably what you need: you can put
a GRanges object along the 1st dimension to describe your peaks and
a DataFrame object along the 2nd dimension to describe your samples.

For the matrix-like object that you use to store the assay data, it
can be anything (e.g. sparse matrix, on-disk matrix, etc...) as long
as it supports dim, dimnames, and 2D-style subsetting. That's the
bare minimum I think. It can support more (e.g. cbind, rbind), in
which case you'll be able to cbind and/or rbind 2 SummarizedExperiment
objects together. We've tried to keep the requirements for what can
be used to store the assay data as minimalist as possible.

One matrix-like container that you might want to consider is
DelayedArray defined in the HDF5Array package. Right now it uses
hdf5 for on-disk storage but other backends could be implemented
(something I'm planning to work on after the upcoming release).
You can stick a huge DelayedArray object that wouldn't fit in memory
in a SummarizedExperiment and manipulate it *almost* like you would
do with a regular SummarizedExperiment object. See ?HDF5Array for
an example of such an hdf5-backed SummarizedExperiment.

Hope this helps,

On 09/23/2016 03:46 PM, Andrew McDavid wrote:
> Hi Caleb,
> Hopefully Herve will chime in regarding SummarizedExperiment, but yes, I
> think you can and should inherit from that. The `assays` slot must be an
> object of type `Assays`, but that does appear to include a sparse Matrix.
> See the comments at the top of Assays-class.R in the tarball for
> SummarizedExperiment.  For example:
> library(SummarizedExperiment)
> library(Matrix)
> library(GenomicRanges)
> Nrow=1e6
> Ncol=1e4
> assay=Matrix::Matrix(0, nrow=Nrow, ncol=Ncol, sparse=TRUE)
> gr <- GRanges(Rle("chr2", Nrow),
>               IRanges(seq_len(Nrow), width=10))
> se <- SummarizedExperiment(assays=assay, rowRanges=gr)
> As far as out-of-core storage of sparse matrices, I do not know of any good
> (portable) solutions.   If it makes more sense to chunk the matrix along
> some dimension, you could always pickle the chunked, (sparse) Matrix
> objects. In my experience, the decision to adopt sparse vs out-of-core
> dense arrays has often required empirical testing to determine what is
> fastest/most scalable, since you lose caching benefits from sequential
> memory access once you go sparse.  I know there has been talk of extending
> SummarizedExperiment to easily permit the Assays to be hdf5-based.
> Is disk space really going to be a limiting factor? If so, then you will
> probably be IO-bound, so you will need to distribute the data across
> computing nodes for your analysis to scale anyways, which suggests some
> sort of map-reduce formalism.  Which to my knowledge no one has considered
> yet in Bioconductor.  But unless you are generating > 1 TB of
> semi-processed data, maybe you don't need to go there?
> -Andrew
>> Message: 3
>> Date: Fri, 23 Sep 2016 17:37:04 -0400
>> From: Caleb Lareau <caleblareau at g.harvard.edu>
>> To: bioc-devel at r-project.org
>> Subject: [Bioc-devel] Reading and storing single cell ATAC data
>> Message-ID: <83B8A778-4DAA-4A16-9DA8-22DEF9AD1252 at g.harvard.edu>
>> Content-Type: text/plain; charset="UTF-8"
>> Hi everyone?
>> I?m working with a team that?s generating single cell ATAC data in large
>> amounts and am designing the framework of an S4 object to facilitate
>> analyses in R. I have a couple of high-level questions that I wanted to
>> pose early to hopefully attain some community guidance in the
>> implementation of these data structures.
>> Question on S4 scATAC Structure--
>> It?s easy to imagine scATAC data as a matrix where the rows are particular
>> peaks and the columns are individual samples. We already have such an
>> impressive volume of data, such that if stored in an ordinary matrix, we
>> run into ~20 GB objects. As these data are very sparse, we store the peak
>> values in a sparse matrix (through the Matrix library). I wanted to collate
>> the peak information (probably in GRanges object) and sample information
>> (in a data frame) as well as some potential meta data in an S4 object.
>> Easy enough, sure, but after looking at the scRNA structure (e.g. scater <
>> https://bioconductor.org/packages/devel/bioc/vignettes/
>> scater/inst/doc/vignette.html>), I feel like I should be considering how
>> to inherit some of the nice properties from the canonical `ExpressionSet`
>> structure. However, since my constraints aren?t directly compatible (namely
>> the featureData slot really needs to be a GRanges and the exprs slot must
>> be an object from Matrix), it wasn?t clear to me how to maximize the
>> inheritance properties while adjusting to my unique constraints. Also, it
>> wasn?t clear to me whether or not I could inherent `SummarizedExperiment`
>> due to the different nature of the sparse matrix. Does anyone have any
>> advice on this structure?
>> Question on reading sparse matrices from disk--
>> I?m trying to work out the best to selectively read certain rows and
>> columns from a sparse matrix on disk into memory. I anticipate a time
>> fairly soon that loading our full scATAC data, even in sparse matrices, is
>> going to be untenable. Any matrix reading/slicing implementations that I?ve
>> seen don?t play friendly with sparse matrices. So, I hacked together two
>> solutions? 1) reads and subsets a gzipped matrix with 3 columns (row index;
>> column index; non-zero value) through a system call to awk. 2) converts
>> that same 3 column matrix into an SQLite object and send queries to read
>> values based on indices. The hiccups are that 1) doesn?t play friendly on
>> non-unix platforms and always scans the full file, and 2) is faster for
>> querying, but the binary object is ~7x larger than the gzipped object. I?ve
>> played around with hdf5 as well, but it didn?t seem to give me much back in
>> terms of speed or storage benefits comparatively. Has anyone found an
>> implementation that achieves a decent!
>>   lookup time and compression, or am I essentially needing to choose
>> between the two?
>> Thanks and have a great weekend!
>> -Caleb
>>         [[alternative HTML version deleted]]
>> ------------------------------
>> Subject: Digest Footer
>> _______________________________________________
>> Bioc-devel mailing list
>> Bioc-devel at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> ------------------------------
>> End of Bioc-devel Digest, Vol 150, Issue 41
>> *******************************************
> 	[[alternative HTML version deleted]]
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

More information about the Bioc-devel mailing list