[Bioc-devel] Base class for interaction data - expressions of interest

Aaron Lun alun at wehi.edu.au
Mon Nov 16 12:46:57 CET 2015

While I'm on this point, there's another, more subtle issue with using 
sparseMatrix(). Specifically, there's a distinction between zeros and 
missing values when considering a ContactMatrix. For example, in Hi-C 
data, a zero in the matrix means there aren't any read pairs mapping 
between the corresponding bins. A missing value means that the count for 
the bin pair is unknown, e.g., because that particular pairwise 
interaction was missing from the InteractionSet during conversion.

This difference may be important in calculating correct statistics; one 
can imagine situations where assuming all missing values are zero would 
not be appropriate. In general, I would expect that missing values would 
take up most of the matrix entries after conversion from an 
InteractionSet. sparseMatrix() doesn't seem to support using NA as the 
implicit value for entries that aren't stored; it's fixed at zero, which 
makes mathematical sense but isn't quite right for our purposes.
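
To make the distinction concrete, here's a minimal sketch in R. The bin 
indices and counts are made up for illustration, and the NA-filled dense 
matrix is just one possible workaround, not what ContactMatrix does 
internally.

```r
library(Matrix)

# Hypothetical pairwise data: three observed bin pairs in a 4-bin region.
anchor1 <- c(1L, 2L, 4L)
anchor2 <- c(2L, 3L, 4L)
counts  <- c(5, 0, 2)   # the second pair was observed, with a count of zero
nbins   <- 4L

# Sparse representation: any entry that isn't stored reads back as 0.
sp <- sparseMatrix(i = anchor1, j = anchor2, x = counts,
                   dims = c(nbins, nbins))

# Dense representation using NA as the fill value for unknown bin pairs.
dn <- matrix(NA_real_, nrow = nbins, ncol = nbins)
dn[cbind(anchor1, anchor2)] <- counts

sp[1, 3]   # 0: indistinguishable from an observed zero
dn[1, 3]   # NA: explicitly unknown
dn[2, 3]   # 0: a genuinely observed zero
```

In the sparse version, the unobserved pair (1, 3) and the observed-zero 
pair (2, 3) look the same; only the NA-filled matrix keeps them apart.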

Now, this might not be so bad for count data, depending on how you 
counted the reads into bin pairs; converting all NA's to zeros might be 
okay in such circumstances, if the occurrence of those NA's in the first 
place was due to the lack of reads. However, if you fill the contact 
matrix with other metrics (e.g., log-FCs, average log-CPMs), assuming 
that all missing values are zero would probably be incorrect.
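
As a rough illustration of why this matters for non-count metrics, 
consider averaging log-FCs across one row of a contact matrix. The values 
below are invented for the example.

```r
# Hypothetical log-fold changes for the bin pairs in one matrix row;
# NA marks bin pairs that were absent from the pairwise data.
logfc <- c(1.5, NA, NA, -0.5, NA, 2.0)

# Average over the known entries only.
mean_known <- mean(logfc, na.rm = TRUE)

# Average after treating every missing entry as "no change" (zero).
zero_filled <- ifelse(is.na(logfc), 0, logfc)
mean_zero <- mean(zero_filled)
```

Filling with zeros pulls the average towards zero in proportion to the 
number of missing entries, which is not a statement the data actually 
support.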

Anyway, food for thought.

- Aaron

On 16/11/15 10:31, Aaron Lun wrote:
> Thanks for the comment Nadhir.
> I had considered the use of a sparse matrix class. The reason I didn't
> implement it originally is because truly sparse interaction data would
> be better represented by just working with the pairwise format in the
> InteractionSet. You need the row/column indices to pass to the
> sparseMatrix constructor anyway; a memory-efficient algorithm to do, for
> example, compartment identification could just use that directly.
> Most existing algorithms for doing this (e.g., k-means/hierarchical
> clustering) won't operate natively from a sparseMatrix, and I suspect
> they'll just run as.matrix() and convert it to a full matrix. Obviously,
> this would defeat the purpose of using a sparse matrix. So, if you have
> to rewrite the algorithms anyway, you might as well rewrite them in a
> manner that avoids needing the sparseMatrix() as a middleman.
> Nonetheless, it's a good point about memory usage. I'll have a think
> about it; sparseMatrix() would help a bit, but as coverage increases for
> these experiments, the matrix will probably become fairly dense (even if
> it's just counts of 1 for some bin pairs). Even now, for compartment
> detection, the bins involved are large enough that sparseness usually
> isn't observed. Perhaps big.matrix() might be a better choice.
> Cheers,
> Aaron
> On 16/11/15 09:58, DJEKIDEL MOHAMED NADHIR wrote:
>> Hi Aaron,
>> Sounds like a great initiative.
>> I just have some comments about the ContactMatrix-Class.
>> I think with increasing Hi-C resolution the usage of the matrix class
>> will consume a lot of memory.
>> Maybe using sparseMatrix from the Matrix package would have a smaller
>> footprint.
>> It can also be manipulated in C++ using RcppEigen, if for example you
>> plan some functionalities such as AB domains or insulation scores, etc.
>> Regards,
>> - Nadhir
>> On Mon, Nov 16, 2015 at 5:33 PM, Aaron Lun <alun at wehi.edu.au> wrote:
>>     Hello all,
>>     I thought I might give an update on the state of affairs for the
>>     InteractionSet package. Currently, there are three classes:
>>     - the GInteractions class, inheriting from Vector and intended to
>>     represent pairwise interactions between genomic regions (based on
>>     suggestions from Malcolm Perry and Liz Ing-Simmons).
>>     - the InteractionSet class, inheriting from SummarizedExperiment0
>>     and containing a GInteractions object; intended to store
>>     experimental data about pairwise interactions (one interaction per
>>     row).
>>     - the ContactMatrix class, inheriting from Annotated and storing
>>     data in matrix form (where rows/columns represent genomic regions).
>>     Getters, setters, conversion methods between classes, distance
>>     calculation methods and overlap methods have been implemented. Man
>>     pages and "testthat" scripts have also been written. Still missing a
>>     vignette, though it should be easy enough to write one.
>>     All in all, I think it's a solid first draft. Any comments would be
>>     appreciated.
>>     Cheers,
>>     Aaron
>>     On 08/11/15 19:31, Aaron Lun wrote:
>>         Okay, some meat and bones are on GitHub now:
>>         https://github.com/LTLA/InteractionSet
>>         The idea is to represent genomic interactions as pairs of genomic
>>         regions, using indices to point to a common GRanges object (a la
>>         Hits, though I haven't used that explicitly due to the presence of
>>         additional constraints on the indices). Data for each interaction
>>         is stored using a SummarizedExperiment framework (one row per
>>         interaction).
>>         With regards to the methods, most of the low-hanging fruit has been
>>         implemented, courtesy of inheriting from SummarizedExperiment0.
>>         I'll add proper unit tests over the coming week. It currently
>>         passes through R CMD check okay, except for a warning about ":::"
>>         in the cbind/rbind definitions (callNextMethod() didn't seem to
>>         work inside those methods, and I didn't want to rewrite the SE0
>>         'binding methods).
>>         Any thoughts appreciated.
>>         - Aaron
>>         On 07/11/15 19:33, Morgan, Martin wrote:
>>             Just to say that this is a great idea. If this starts as a
>>             github package (or in svn, we can create a location for you
>>             if you'd like) I and others would I am sure be happy to try
>>             to provide any guidance / insight. The main design principles
>>             are probably to reuse as much as possible from existing
>>             classes, especially the S4Vectors / GRanges world, and to
>>             integrate metadata as appropriate (like SummarizedExperiment,
>>             for instance).
>>             Martin
>>             ________________________________________
>>             From: Bioc-devel [bioc-devel-bounces at r-project.org] on
>>             behalf of Aaron Lun [alun at wehi.edu.au]
>>             Sent: Thursday, November 05, 2015 12:27 PM
>>             To: bioc-devel at r-project.org
>>             Subject: Re: [Bioc-devel] Base class for interaction data -
>>             expressions of interest
>>             There's a growing number of Bioconductor packages dealing with
>>             interaction data; diffHic, GenomicInteractions, HiTC, to name
>>             a few (and probably more in the future). Each of these
>>             packages defines its own class to store interaction data -
>>             DIList for diffHic, GenomicInteractions for
>>             GenomicInteractions, and HTClist for HiTC.
>>             These classes seem to share a lot of features, which suggests
>>             that they can be (easily?) replaced with a common class. This
>>             would have two advantages - one, developers of new and
>>             existing packages don't have to continually write and maintain
>>             new classes; and two, it provides users with a consistent
>>             user experience across the relevant packages.
>>             My question is, does anybody have anything in the pipeline
>>             with respect to a base package for an interaction class? If
>>             not, I'm planning to put something together for the next BioC
>>             release. To this end, I'd welcome any ideas/input/code; the
>>             aim is to make a drop-in replacement (insofar as that's
>>             possible) for the existing classes in each package.
>>             Cheers,
>>             Aaron
>>             _______________________________________________
>>             Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>             mailing list
>>             https://stat.ethz.ch/mailman/listinfo/bioc-devel
