[Bioc-devel] plyranges group_by

Michael Lawrence lawrence.michael using gene.com
Thu Oct 17 11:45:28 CEST 2019


I replied on the support site. Let's move the discussion there.
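
For anyone following along here, the selfmatch() suggestion quoted below can be sketched roughly as follows. This is only a minimal sketch on made-up toy data, not the real annotatedsrf object: the ranges, the gene IDs, and the helper collapse_ids() are all assumptions for illustration. It builds the grouping vector with selfmatch() and then collapses gene IDs per distinct range using base R, without any row-wise group_by.

```r
library(GenomicRanges)  # also attaches S4Vectors, which provides selfmatch()

# Toy stand-in for annotatedsrf (hypothetical data): the first range
# overlaps two genes, so it appears twice after a left overlap join.
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2", "chr3"),
    ranges   = IRanges(start = c(100, 100, 500, 900), width = 20),
    strand   = c("+", "+", "-", "*"),
    gene_id  = c("geneA", "geneB", "geneC", NA)
)

# One integer per range: the index of the first identical range
# (same seqnames, start, end and strand).
grp <- selfmatch(gr)

# Hypothetical helper: collapse the gene IDs of one group,
# leaving ranges without any annotation as NA.
collapse_ids <- function(ids) {
    ids <- ids[!is.na(ids)]
    if (length(ids) == 0L) NA_character_ else paste(ids, collapse = ";")
}
gr$gene_id <- ave(gr$gene_id, grp, FUN = collapse_ids)

# Keep one row per distinct range.
collapsed <- gr[!duplicated(grp)]
collapsed$gene_id
```

The same grp vector is what would be handed to plyranges::group_by() in the form suggested in the quoted message. Whether this gets close to data.table speed on the full SRF data is something I have not measured.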

On Thu, Oct 17, 2019 at 1:24 AM Bhagwat, Aditya <
Aditya.Bhagwat using mpi-bn.mpg.de> wrote:

> Thank you Stuart and Michael for your feedback.
>
> Stuart, in response to your request for more context regarding my use
> case, I have updated my recent BioC support post
> <https://support.bioconductor.org/p/125623/>, now providing all use-case
> details.
>
> Michael, I haven't tried selfmatch yet, but Stuart's reply seems to suggest
> that it would not match the data.table performance (which is practically
> instantaneous).
>
> As a general question, do you think it would be useful to add a
> data.table-based split-apply-combine functionality to plyranges (so that
> end-user operations stay on GRanges only)? I wouldn't mind writing a
> function to do that (on GitHub), but I first need your feedback on whether
> you think that would be useful :-)
>
> Aditya
>
>
> ------------------------------
> *From:* Stuart Lee [lee.s using wehi.edu.au]
> *Sent:* Thursday, October 17, 2019 3:01 AM
> *To:* Michael Lawrence
> *Cc:* Bhagwat, Aditya; bioc-devel using r-project.org
> *Subject:* Re: plyranges group_by
>
> Currently, the way grouping indices are generated is pretty slow if you’re
> doing stuff rowwise. Michael’s suggestion for using selfmatch should speed
> things up a bit. What are you planning to do after grouping? I’ve found
> there’s usually a way to do stuff without rowwise grouping, but it really
> depends on what you’re after. Re your other issue, would you mind filing it
> as a GitHub issue?
>
> Stuart Lee
> Visiting PhD Student - Ritchie Lab
>
>
>
> On 16 Oct 2019, at 22:54, Michael Lawrence <lawrence.michael using gene.com>
> wrote:
>
> Just a note that in this particular case, selfmatch(annotatedsrf) would
> be a fast way to generate a grouping vector, like
> plyranges::group_by(annotatedsrf, selfmatch(annotatedsrf)).
>
> Michael
>
> On Wed, Oct 16, 2019 at 2:48 AM Bhagwat, Aditya <
> Aditya.Bhagwat using mpi-bn.mpg.de> wrote:
>
>> Hi Stuart, Michael,
>>
>> Your plyranges package is really cool - now I am using it for left
>> joining GRanges (I am facing a minor issue there
>> <https://support.bioconductor.org/p/125623/>, but that is not the topic
>> of this email - I have been asked by Lori not to double-post :-)).
>>
>> This email is about the plyranges functionality for grouping GRanges.
>> That is cool, but I found it to be not so performant for large numbers of
>> ranges.
>> My R session hangs when I do:
>>
>> bedfile <- paste0('https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis',
>>                   '/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
>> srfranges <- rtracklayer::import.bed(bedfile, genome = 'mm10')
>> txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
>> generanges <- GenomicFeatures::genes(txdb)
>> annotatedsrf <- plyranges::join_overlap_left(srfranges, generanges)
>> plyranges::group_by(annotatedsrf, seqnames, start, end, strand)
>>
>> For my purposes, I worked around it by performing a groupby in data.table:
>>
>> data.table::as.data.table(annotatedsrf)[
>>     !is.na(gene_id),
>>     gene_id := paste0(gene_id, collapse = ';'),
>>     by = c('seqnames', 'start', 'end', 'strand')]
>>
>> I was wondering whether, in general, it would be useful to have a
>> data.table-based backend for plyranges::group_by(),
>> and whether all of this is actually a non-issue caused by my improper use
>> of plyranges::group_by.
>>
>> Thank you for your feedback :-)
>>
>> Aditya
>>
>>
>>
>
> --
> Michael Lawrence
> Scientist, Bioinformatics and Computational Biology
> Genentech, A Member of the Roche Group
> Office +1 (650) 225-7760
> michafla using gene.com
>
> Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube
>
>
> _______________________________________________
>
> The information in this email is confidential and inte...{{dropped:26}}