[Bioc-devel] plyranges group_by

Thu Oct 17 03:01:27 CEST 2019

Currently, the way grouping indices are generated is pretty slow if you’re doing stuff rowwise. Michael’s suggestion for using selfmatch should speed things up a bit. What are you planning to do after grouping? I’ve found there’s usually to do stuff without rowwise grouping but really depends on what you’re after. Re your other issue would you mind putting it on as a GitHub issue.
—
Stuart Lee
Visiting PhD Student - Ritchie Lab

On 16 Oct 2019, at 22:54, Michael Lawrence <lawrence.michael using gene.com<mailto:lawrence.michael using gene.com>> wrote:

Just a note that in this particular case, selfmatch(annotatedsrf) would be a fast way to generate a grouping vector, like plyranges::group_by(annotatedsrf, selfmatch(annotatedsrf)).

Michael

On Wed, Oct 16, 2019 at 2:48 AM Bhagwat, Aditya <Aditya.Bhagwat using mpi-bn.mpg.de<mailto:Aditya.Bhagwat using mpi-bn.mpg.de>> wrote:
Hi Stuart, Michael,

Your plyranges package is really cool - now I am using it for left joining GRanges (I am facing a minor issue there<https://support.bioconductor.org/p/125623/>, but that is not the topic of this email - I have been asked by Lori not to double-post :-)).

This email is about the plyranges functionality for grouping GRanges.
That is cool, but I found it to be not so performant for large numbers of ranges.
My R session hangs when I do:

bedfile <- paste0('https://gitlab.gwdg.de/loosolab/software/multicrispr/wikis',
                      '/uploads/a51e98516c1e6b71441f5b5a5f741fa1/SRF.bed')
srfranges <- rtracklayer::import.bed(bedfile, genome = 'mm10')
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene
    generanges <- GenomicFeatures::genes(txdb)
annotatedsrf <- plyranges::join_overlap_left(srfranges, generanges)
plyranges::group_by(annotatedsrf, seqnames, start, end, strand)

For my purposes, I worked around it by performing a groupby in data.table:

data.table::as.data.table(annotatedsrf)[
    !is.na<http://is.na/>(gene_id),
    gene_id := paste0(gene_id, collapse = ';'),
    by = c('seqnames', 'start', 'end', 'strand'))

And was wondering, in general, whether it would be useful to have a data.table-based backend for plyranges::groupby()
And, whether all of this is actually a on-issue due to my improper use of plyranges::group_by properly.

Thank you for feebdack :-)

Aditya

--
Michael Lawrence
Scientist, Bioinformatics and Computational Biology
Genentech, A Member of the Roche Group
Office +1 (650) 225-7760
michafla using gene.com<mailto:michafla using gene.com>

Join Genentech on LinkedIn | Twitter | Facebook | Instagram | YouTube

_______________________________________________

The information in this email is confidential and intended solely for the addressee.
You must not disclose, forward, print or use it without the permission of the sender.

The Walter and Eliza Hall Institute acknowledges the Wurundjeri people of the Kulin
Nation as the traditional owners of the land where our campuses are located and
the continuing connection to country and community.
_______________________________________________

	[[alternative HTML version deleted]]