[Bioc-devel] plyranges group_by

Bhagwat, Aditya Aditya.Bhagwat using mpi-bn.mpg.de
Thu Oct 17 14:55:58 CEST 2019


Thank you Michael,

Attached is the example file, since I noticed you were unable to download it from GitLab.
Will continue the discussion there, then :-)

Aditya

________________________________
From: Michael Lawrence [lawrence.michael using gene.com]
Sent: Thursday, October 17, 2019 11:45 AM
To: Bhagwat, Aditya
Cc: Stuart Lee; Michael Lawrence; bioc-devel using r-project.org
Subject: Re: plyranges group_by

I replied on the support site. Let's move the discussion there.

On Thu, Oct 17, 2019 at 1:24 AM Bhagwat, Aditya <Aditya.Bhagwat using mpi-bn.mpg.de> wrote:
Thank you Stuart and Michael for your feedback.

Stuart, in response to your request for more context regarding my use case, I have updated my recent BioC support post (https://support.bioconductor.org/p/125623/), which now provides all use-case details.

Michael, I haven't tried selfmatch yet, but Stuart's reply seems to suggest that it would not match the data.table performance (which is practically instantaneous).

As a general question, do you think it would be useful to add data.table-based split-apply-combine functionality to plyranges (such that end users keep operating on GRanges only)? I wouldn't mind writing a function to do that (on GitHub), but I first need your feedback as to whether you think that would be useful :-)

Aditya


________________________________
From: Stuart Lee [lee.s using wehi.edu.au]
Sent: Thursday, October 17, 2019 3:01 AM
To: Michael Lawrence
Cc: Bhagwat, Aditya; bioc-devel using r-project.org
Subject: Re: plyranges group_by

Currently, the way grouping indices are generated is pretty slow if you’re doing stuff rowwise. Michael’s suggestion of using selfmatch should speed things up a bit. What are you planning to do after grouping? I’ve found there’s usually a way to do things without rowwise grouping, but it really depends on what you’re after. Regarding your other issue, would you mind filing it as a GitHub issue?
—
Stuart Lee
Visiting PhD Student - Ritchie Lab



On 16 Oct 2019, at 22:54, Michael Lawrence <lawrence.michael using gene.com> wrote:

Just a note that in this particular case, selfmatch(annotatedsrf) would be a fast way to generate a grouping vector, like plyranges::group_by(annotatedsrf, selfmatch(annotatedsrf)).
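To illustrate what selfmatch returns, here is a minimal sketch on a toy GRanges. The object `gr` and its `gene_id` values are made up for illustration and are not taken from the SRF data. Each range is mapped to the index of its first identical occurrence, so duplicated ranges share one group id; the collapsing step uses base R tapply rather than plyranges, purely as an assumption about what one might do with the grouping vector.

```r
library(GenomicRanges)

# Hypothetical toy data: ranges 1 and 2 are identical
gr <- GRanges(
    seqnames = c('chr1', 'chr1', 'chr1', 'chr2'),
    ranges   = IRanges(start = c(1, 1, 20, 1), end = c(10, 10, 30, 10)),
    gene_id  = c('geneA', 'geneB', 'geneC', 'geneD'))

# For each range, the index of the first range identical to it
grp <- S4Vectors::selfmatch(gr)

# Collapse gene ids within each group of identical ranges (base R)
collapsed <- tapply(gr$gene_id, grp, paste, collapse = ';')
```

Here `grp` serves as the grouping vector, and `collapsed` holds one semicolon-separated string per distinct range.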

Michael

On Wed, Oct 16, 2019 at 2:48 AM Bhagwat, Aditya <Aditya.Bhagwat using mpi-bn.mpg.de> wrote:
Hi Stuart, Michael,

Your plyranges package is really cool - now I am using it for left joining GRanges (I am facing a minor issue there, see https://support.bioconductor.org/p/125623/, but that is not the topic of this email - I have been asked by Lori not to double-post :-)).

This email is about the plyranges functionality for grouping GRanges.
That is cool, but I found it to be not so performant for large numbers of ranges.
My R session hangs when I do:

bedfile <- paste0('https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis',
                  '/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
srfranges <- rtracklayer::import.bed(bedfile, genome = 'mm10')
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
generanges <- GenomicFeatures::genes(txdb)
annotatedsrf <- plyranges::join_overlap_left(srfranges, generanges)
plyranges::group_by(annotatedsrf, seqnames, start, end, strand)

For my purposes, I worked around it by performing a groupby in data.table:

data.table::as.data.table(annotatedsrf)[
    !is.na(gene_id),
    gene_id := paste0(gene_id, collapse = ';'),
    by = c('seqnames', 'start', 'end', 'strand')]
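A self-contained sketch of this kind of workaround on toy data, including a round trip back to GRanges, might look as follows. The toy `gr` is made up, and the explicit `as.data.frame` step, the `unique` call, and the conversion back with `makeGRangesFromDataFrame` are assumptions about how one could finish the job, not something stated in this thread.

```r
library(GenomicRanges)
library(data.table)

# Hypothetical toy data: ranges 1 and 2 are identical, range 3 has no gene
gr <- GRanges(
    seqnames = c('chr1', 'chr1', 'chr2'),
    ranges   = IRanges(start = c(1, 1, 5), end = c(10, 10, 15)),
    gene_id  = c('geneA', 'geneB', NA))

# Convert via data.frame explicitly, then drop the redundant width column
dt <- as.data.table(as.data.frame(gr))
dt[, width := NULL]

# Collapse gene ids per identical range, leaving NA rows untouched
dt[!is.na(gene_id),
   gene_id := paste0(gene_id, collapse = ';'),
   by = c('seqnames', 'start', 'end', 'strand')]

# Keep one row per range and go back to GRanges
dt <- unique(dt)
collapsed <- makeGRangesFromDataFrame(dt, keep.extra.columns = TRUE)
```

The end user sees only GRanges in and GRanges out, with data.table doing the split-apply-combine in between.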

I was wondering, in general, whether it would be useful to have a data.table-based backend for plyranges::group_by(),
and whether all of this is actually a non-issue caused by my using plyranges::group_by improperly.

Thank you for your feedback :-)

Aditya




--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
michafla using gene.com






