[Bioc-devel] plyranges group_by

Thu Oct 17 10:24:17 CEST 2019

Thank you Stuart and Michael for your feedback.

Stuart, in response to your request for more context regarding my use case, I have updated my recent BioC support post<https://support.bioconductor.org/p/125623/>, now providing all use-case details.

Michael, I haven't tried selfmatch yet, but Stuart's reply seems to suggest that it would not match the data.table performance (which is literally instantaneous).

As a general question, do you think it would be useful to add data.table-based split-apply-combine functionality to plyranges (such that end-user operations remain GRanges-only)? I wouldn't mind writing a function to do that (on GitHub), but I first need your feedback as to whether you think that would be useful :-)

Aditya

________________________________
From: Stuart Lee [lee.s using wehi.edu.au]
Sent: Thursday, October 17, 2019 3:01 AM
To: Michael Lawrence
Cc: Bhagwat, Aditya; bioc-devel using r-project.org
Subject: Re: plyranges group_by

Currently, the way grouping indices are generated is pretty slow if you're doing stuff rowwise. Michael's suggestion for using selfmatch should speed things up a bit. What are you planning to do after grouping? I've found there's usually a way to do stuff without rowwise grouping, but it really depends on what you're after. Re your other issue, would you mind filing it as a GitHub issue?
--
Stuart Lee
Visiting PhD Student - Ritchie Lab

On 16 Oct 2019, at 22:54, Michael Lawrence <lawrence.michael using gene.com<mailto:lawrence.michael using gene.com>> wrote:

Just a note that in this particular case, selfmatch(annotatedsrf) would be a fast way to generate a grouping vector, like plyranges::group_by(annotatedsrf, selfmatch(annotatedsrf)).
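As a rough illustration of the selfmatch idea (a toy sketch only, not run on the SRF data; the ranges and gene_id values below are made up, and the collapsing step uses base R tapply rather than plyranges):

```r
library(GenomicRanges)  # also attaches S4Vectors, which provides selfmatch()

# Hypothetical toy ranges: the first two are identical, so they share a group
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2", "chr1"),
    ranges   = IRanges(start = c(10, 10, 5, 20), width = 5),
    strand   = "+",
    gene_id  = c("geneA", "geneB", "geneC", NA)
)

# Each element gets the index of the first range identical to it
grp <- selfmatch(gr)

# Collapse gene ids per group, dropping NAs (mirrors the data.table workaround)
collapsed <- tapply(gr$gene_id, grp, function(ids) {
    paste(ids[!is.na(ids)], collapse = ";")
})
```

Here grp serves as the grouping vector, so the seqnames/start/end/strand comparison happens once inside selfmatch instead of once per grouping column.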

Michael

On Wed, Oct 16, 2019 at 2:48 AM Bhagwat, Aditya <Aditya.Bhagwat using mpi-bn.mpg.de<mailto:Aditya.Bhagwat using mpi-bn.mpg.de>> wrote:
Hi Stuart, Michael,

Your plyranges package is really cool - now I am using it for left joining GRanges (I am facing a minor issue there<https://support.bioconductor.org/p/125623/>, but that is not the topic of this email - I have been asked by Lori not to double-post :-)).

This email is about the plyranges functionality for grouping GRanges.
That is cool, but I found it to be not so performant for large numbers of ranges.
My R session hangs when I do:

bedfile <- paste0('https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis',
                      '/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
srfranges <- rtracklayer::import.bed(bedfile, genome = 'mm10')
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
generanges <- GenomicFeatures::genes(txdb)
annotatedsrf <- plyranges::join_overlap_left(srfranges, generanges)
plyranges::group_by(annotatedsrf, seqnames, start, end, strand)

For my purposes, I worked around it by performing a groupby in data.table:

data.table::as.data.table(annotatedsrf)[
    !is.na(gene_id),
    gene_id := paste0(gene_id, collapse = ';'),
    by = c('seqnames', 'start', 'end', 'strand')]

I was wondering, in general, whether it would be useful to have a data.table-based backend for plyranges::group_by(),
and whether all of this is actually a non-issue caused by my improper use of plyranges::group_by.

Thank you for your feedback :-)

Aditya

--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
michafla using gene.com<mailto:michafla using gene.com>
