[Bioc-devel] From Biostring matching to short read mapping

Sat Nov 9 23:22:06 CET 2019

On 11/9/19 08:12, Éric Fournier wrote:
> Hi,
> 
> it might be worthwhile to note that the concern about different chromosome sizes only applies if you have more workers than chromosomes. If you're running on 2-8 threads, the longer chromosome might hold up a thread while another processes two short ones.

Exactly. This is why a naive 'bplapply(seq_along(chromosomes), ...)' 
strategy might be OK when a small number of workers is used but won't 
scale well if we want to use dozens of workers. A good parallelization 
strategy should be able to break down big chromosomes into smaller pieces.
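
To make the idea concrete, here is a minimal base-R sketch of one way to
break chromosomes into pieces. It is only an illustration under stated
assumptions, not the code promised earlier in the thread: the function name
make_chunks(), the chunk size, and the chromosome lengths are all made up.
Consecutive chunks overlap by (pattern width - 1) bp so that a match
spanning a chunk boundary still falls entirely inside exactly one chunk.

```r
# Hypothetical chromosome lengths in bp (illustration only); in practice
# they could be taken from seqlengths() on a genome object.
chrom_lengths <- c(chrA = 1000L, chrB = 250L, chrC = 90L)

# Split each chromosome into chunks of at most `chunk_size` bp.
# Consecutive chunks overlap by `pattern_width - 1` bp, so every possible
# match of width `pattern_width` is fully contained in exactly one chunk.
make_chunks <- function(chrom_lengths, chunk_size, pattern_width) {
  stopifnot(chunk_size >= pattern_width)
  step <- chunk_size - (pattern_width - 1L)
  chunks <- lapply(names(chrom_lengths), function(chrom) {
    len <- chrom_lengths[[chrom]]
    # Chunk starts: stop once the last possible match start is covered.
    starts <- seq(1L, max(1L, len - pattern_width + 1L), by = step)
    data.frame(chrom = chrom,
               start = starts,
               end   = pmin(starts + chunk_size - 1L, len))
  })
  do.call(rbind, chunks)
}

# 23-bp pattern, chunks of at most 300 bp: chrA is cut into 4 chunks,
# chrB and chrC stay whole.
chunks <- make_chunks(chrom_lengths, chunk_size = 300L, pattern_width = 23L)
print(chunks)
```

Each row could then be handed to a worker (e.g. via bplapply() over
seq_len(nrow(chunks))), with match positions shifted by 'start - 1' to get
back to chromosome coordinates. Because the chunks have a bounded size, the
work per task is roughly even no matter how many workers are used.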

H.

> 
> Cheers,
> -Eric
> 
> 
> ________________________________
> 
> Date: Fri, 8 Nov 2019 18:19:27 +0000
> From: "Pages, Herve" <hpages using fredhutch.org>
> To: "Bhagwat, Aditya" <Aditya.Bhagwat using mpi-bn.mpg.de>,
>          "bioc-devel using r-project.org" <bioc-devel using r-project.org>
> Subject: Re: [Bioc-devel] From Biostring matching to short read
>          mapping
> Message-ID: <84550bd2-9ded-04a3-6ef6-52746c66f35c using fredhutch.org>
> Content-Type: text/plain; charset="windows-1252"
> 
> Hi Aditya,
> 
> Should not be too hard to parallelize. With some gotchas: using one
> worker per chromosome (which is the easy way to go) wouldn't be optimal
> because of the size differences between the chromosomes. So a better
> approach is to try to give each worker the same amount of work by
> splitting the set of chromosomes into groups of more or less equal sizes.
> The split can either preserve full chromosomes or break them into smaller
> pieces. The latter will allow using a lot more workers than the former.
> I'll try to come up with some code that I'll share here.
> 
> BTW the *PDict() family in Biostrings is for finding the matches of a
> collection of patterns. You say you want to find "all genomic
> (mis)matches of a 23-bp candidate Cas9 sequence". Any reason you're not
> using vmatchPattern() (or vcountPattern()) for that?
> 
> Cheers,
> H.
> 
> 
> On 11/7/19 02:11, Bhagwat, Aditya wrote:
>> Dear bioc-devel,
>>
>> multicrispr
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__gitlab.gwdg.de_loosolab_software_multicrispr&d=DwMFAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=B3ZdDoy-Ur4VIfZr68ORA8dplv90DuCcehJEWpkwWUU&s=UsUGsKc2SVyrBHDWnEJS0FVy1wIhoeq2WA4nlLmtmfo&e=> provides
>> functions for Crispr/Cas9 gRNA design (and is being prepared for BioC).
>> One task involves finding all genomic (mis)matches of a 23-bp candidate
>> Cas9 sequence. Currently this is done with `Biostrings::vcountPDict`, an
>> approach that is successful, though not fast. An alternative would be to
>> switch to short read mapping rather than (Bio)string matching, which
>> involves a one-time indexing effort, but subsequent fast alignment.
>>
>> `Rsubread::align` seems to be limited to max. 16 `nBestLocations`,
>> whereas I know from vcountPDict that some Cas9 candidates have hundreds
>> of genomic matches.
>>
>> `QuasR::qAlign` (connecting to Bowtie) does not mention an upper limit
>> on `maxHits`.
>>
>> Feedback request:
>>
>> Michael, would QuasR/(R)bowtie be a good approach to do this?
>>
>> Wei, did I overlook a way to do this with Rsubread?
>>
>> Herve, is there an elegant way to speed up vcountPDict (parallelize?)
>>
>> Thank you :)
>>
>> Aditya
>>
> 
> --
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages using fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
> 
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319