[Bioc-devel] From Biostring matching to short read mapping

Sat Nov 9 17:12:21 CET 2019

Hi,

It might be worthwhile to note that the concern about different chromosome sizes mainly applies if you have about as many workers as chromosomes (or more). If you're running on 2-8 threads, a long chromosome might occupy one thread while another thread processes two short ones, so the load largely evens out.
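
To make this concrete, here is a minimal sketch with made-up numbers. It simulates handing chromosomes, one at a time, to whichever worker becomes free first, which is roughly what a load-balancing scheduler such as `parallel::parLapplyLB()` or `parallel::mclapply(..., mc.preschedule = FALSE)` does. The chromosome sizes are toy values, `simulate_makespan()` is a hypothetical helper, and the work per chromosome is assumed to be proportional to its size.

```r
# Toy chromosome sizes in Mb (illustrative values, not a real genome).
chrom_sizes <- c(chr1 = 250, chr2 = 240, chr3 = 200, chr4 = 190,
                 chr5 = 180, chr6 = 170, chr7 = 160, chr8 = 145)

# Hypothetical helper: give each chromosome, in order, to the worker that
# becomes free first, and return the finishing time of the slowest worker.
# Assumes the work per chromosome is proportional to its size.
simulate_makespan <- function(sizes, n_workers) {
  busy_until <- numeric(n_workers)
  for (s in sizes) {
    w <- which.min(busy_until)          # first worker to become free
    busy_until[w] <- busy_until[w] + s
  }
  max(busy_until)
}

for (k in c(2, 4, 8)) {
  ideal <- sum(chrom_sizes) / k         # perfectly even split
  cat(sprintf("%d workers: slowest worker %g, ideal %g\n",
              k, simulate_makespan(chrom_sizes, k), ideal))
}
```

With these toy sizes, two workers finish within about 1% of the ideal even split, whereas eight workers (one per chromosome) are bound by the largest chromosome and end up roughly 30% above the ideal.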

Cheers,
-Eric

________________________________

Date: Fri, 8 Nov 2019 18:19:27 +0000
From: "Pages, Herve" <hpages using fredhutch.org>
To: "Bhagwat, Aditya" <Aditya.Bhagwat using mpi-bn.mpg.de>,
        "bioc-devel using r-project.org" <bioc-devel using r-project.org>
Subject: Re: [Bioc-devel] From Biostring matching to short read
        mapping
Message-ID: <84550bd2-9ded-04a3-6ef6-52746c66f35c using fredhutch.org>
Content-Type: text/plain; charset="windows-1252"

Hi Aditya,

Should not be too hard to parallelize. With some gotchas: using one
worker per chromosome (which is the easy way to go) wouldn't be optimal
because of the size differences between the chromosomes. So a better
approach is to try to give each worker the same amount of work by
splitting the set of chromosomes in groups of more or less equal sizes.
The split can either preserve full chromosomes or break them in smaller
pieces. The latter will allow using a lot more workers than the former.
I'll try to come up with some code that I'll share here.
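
As a rough illustration of the first option (keeping chromosomes whole), the following base-R sketch groups chromosomes greedily so that each group has a similar total size. It is not the code promised above, only a minimal sketch: the sizes are toy values, `group_chromosomes()` is a hypothetical helper, and the largest-first greedy rule is one simple heuristic among several.

```r
# Toy chromosome sizes in Mb (illustrative values, not a real genome).
chrom_sizes <- c(chr1 = 250, chr2 = 240, chr3 = 200, chr4 = 190,
                 chr5 = 180, chr6 = 170, chr7 = 160, chr8 = 145,
                 chr9 = 60, chr10 = 50)

# Hypothetical helper: visit chromosomes from largest to smallest and put
# each one into the group with the smallest running total so far.
group_chromosomes <- function(sizes, n_groups) {
  totals <- numeric(n_groups)
  group <- integer(length(sizes))
  for (i in order(sizes, decreasing = TRUE)) {
    g <- which.min(totals)              # currently lightest group
    group[i] <- g
    totals[g] <- totals[g] + sizes[[i]]
  }
  split(names(sizes), group)            # chromosome names per group
}

groups <- group_chromosomes(chrom_sizes, n_groups = 3)
group_totals <- sapply(groups, function(chroms) sum(chrom_sizes[chroms]))
print(groups)
print(group_totals)
```

Each group could then be handed to one worker, for example with `parallel::mclapply()` or `BiocParallel::bplapply()`. If chromosomes are instead broken into smaller pieces, adjacent pieces need to overlap by the pattern length minus one so that matches spanning a boundary are not lost.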

BTW the *PDict() family in Biostrings is for finding the matches of a
collection of patterns. You say you want to find "all genomic
(mis)matches of a 23-bp candidate Cas9 sequence". Any reason you're not
using vmatchPattern() (or vcountPattern()) for that?
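
For reference, a minimal sketch of what that could look like on a toy example. The two short sequences stand in for chromosomes and the 4-bp pattern for the 23-bp candidate; both are made up for illustration.

```r
library(Biostrings)

# Toy "genome": two short sequences standing in for chromosomes.
genome <- DNAStringSet(c(chrA = "ACGTTTACGT", chrB = "TTTTTTTT"))
pattern <- DNAString("ACGT")

# Number of exact matches of the pattern in each sequence.
exact_counts <- vcountPattern(pattern, genome)

# Same, but allowing up to one mismatch per match.
fuzzy_counts <- vcountPattern(pattern, genome, max.mismatch = 1)

# Match positions rather than counts; the result is an MIndex object
# holding one set of ranges per sequence.
hits <- vmatchPattern(pattern, genome, max.mismatch = 1)
print(exact_counts)
print(elementNROWS(hits))               # matches per sequence
```

Note that these functions search the given strand only; covering the reverse strand would take a second call with `reverseComplement(pattern)`.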

Cheers,
H.

On 11/7/19 02:11, Bhagwat, Aditya wrote:
> Dear bioc-devel,
>
> multicrispr
> <https://gitlab.gwdg.de/loosolab/software/multicrispr> provides
> functions for Crispr/Cas9 gRNA design (and is being prepared for BioC).
> One task involves finding all genomic (mis)matches of a 23-bp candidate
> Cas9 sequence. Currently this is done with `Biostrings::vcountPDict`, an
> approach that is successful, though not fast. An alternative would be to
> switch to short read mapping rather than (Bio)string matching, which
> involves a one-time indexing effort, but subsequent fast alignment.
>
> `Rsubread::align` seems to be limited to max. 16 `nBestLocations`,
> whereas I know from vcountPDict that some Cas9 candidates have hundreds
> of genomic matches.
>
> `QuasR::qAlign` (connecting to Bowtie) does not mention an upper limit
> on `maxHits`.
>
> Feedback request:
>
> Michael, would QuasR/(R)bowtie be a good approach to do this?
>
> Wei, did I overlook a way to do this with Rsubread?
>
> Herve, is there an elegant way to speed up vcountPDict (parallelize?)
>
> Thank you :)
>
> Aditya
>

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
