[BioC] BioStrings questions

hpages at fhcrc.org hpages at fhcrc.org
Sat Oct 18 20:17:35 CEST 2008


Hi Raffaele,

Quoting rcaloger <raffaele.calogero at gmail.com>:

> Hi Herve,
> many thanks for the informations.
> After digging into the man pages of the Biostrings package I found the
> countPatter function.
> It is nice because it returns me the simple presence of the pattern
> allowing a certain number of mismatches
>
> score.tmp <- countPattern(PSR.seq ,
> DNAString(as.character(exs.seq[j])), max.mismatch=3)
>
> I found it very useful if a limited number of sequences are queried,
> however, when I work with the full refseq fasta file blast is much
> faster.

If you have a big number of query sequences, using countPattern() in a loop
won't be very efficient I agree. In that case it's better to use countPDict().
countPDict() is to matchPDict() what countPattern() is to matchPattern().
Use countPDict() the same way you would use matchPDict(). The returned
object is different though: it returns an integer vector instead of an
MIndex object.
With a lot of query sequences, chances are that it will be faster than
BLAST: it takes less than 1 minute to count all the exact matches of a
set of tens of thousands of 25-mers in Human chormosome 1.

Cheers,
H.



> Cheers
> Raffaele
>
> hpages at fhcrc.org wrote:
>> Hi Raffaele,
>>
>> The PSR do not need to be of equal length with PDict/matchPDit.
>> An important limitation though is that indels are not supported
>> for now (but will be in the very near future).
>> Another limitation is that you need to define a Trusted Band on
>> your PDict object i.e. a vertical band of constant width where
>> mismatches are not allowed. For example:
>>
>>  psrPDict <- PDict(PSRDNAStringSet, start=5, end=15)
>>
>> will set a Trusted Band of width 11 that covers nucleotides
>> 5 to 15 of each PSR.
>> Then use the 'max.mismatch' arg of matchPDict() to control the max
>> number of mismatches that you want to allow *outside* the Trusted
>> Band for each PSR (the nucleotides that are *inside* must match
>> exactly).
>> Note that even if you want to perform exact matching, you need
>> to define a Trusted Band if your PSR are not of equal length.
>> It's important to try to define the widest possible Trusted Band
>> since the performance of the underlying algo depends directly
>> on how wide it is. Typically the performance will be good if the
>> width is >= 12 and will start to drop significantly for smaller
>> values.
>> Of course if you want to use you PDict object for exact matching
>> only, then you want to set the Trusted Band to the max possible
>> width i.e. to min(width(PSRDNAStringSet)).
>>
>> Hope this helps,
>> H.
>>
>>
>> Quoting rcaloger <raffaele.calogero at gmail.com>:
>>
>>> Dear Patrick,
>>> many thanks for the quick answer. Unfortunately PSR can be located in
>>> any position of the refseq and the sequence length can be very
>>> different. Therefore, I cannot apply your suggestion.
>>> Cheers
>>> Raffaele
>>>
>>> Patrick Aboyoun wrote:
>>>> Raffaele,
>>>> The pairwiseAlignment function uses an O(nm) {with n and m being   
>>>>  the length of the two sequences being aligned} dynamic   
>>>> programming  algorithm that is designed to find an optimal   
>>>> alignment and as you  have discovered isn't intended for use with  
>>>>  a long reference  sequence. Do your PSR sequences map nearly   
>>>> exactly to a location on  your reference sequence and are these   
>>>> sequences of equal length? If  so, see the matchPDict function.   
>>>> It matches a pattern dictionary  consisting of equal length   
>>>> fragments to a reference sequence. The  pseudo code looks   
>>>> something like:
>>>>
>>>> psrPDict <- PDict(PSRDNAStringSet)
>>>> matchPDict(psrPDict, refseq)
>>>>
>>>> To answer your second question, the append function should get   
>>>> you  what you want:
>>>>
>>>>> append(DNAStringSet(c("AAA", "GA")), DNAStringSet(c("ACTG", "TTTACCC")))
>>>> A DNAStringSet instance of length 4
>>>>  width seq
>>>> [1]     3 AAA
>>>> [2]     2 GA
>>>> [3]     4 ACTG
>>>> [4]     7 TTTACCC
>>>>
>>>>
>>>> Patrick
>>>>
>>>>
>>>> rcaloger wrote:
>>>>> Hi,
>>>>> In my onechannelGUI package I am developing a section related to  
>>>>>   Affymetrix exon array analysis, creating few functions that   
>>>>> allow  the association of exon-level Probe Selection Region   
>>>>> (PSR) to refseq
>>>>>
>>>>> 1st question:
>>>>> I have implemented a function that blast a list of PSR sequences  
>>>>>   over all refseq.
>>>>> However, I would like to know if there is any way of doing    
>>>>> something similar using the Biostring package.
>>>>> I tried the pairwiseAlignment function but it is quite slow    
>>>>> compared to blast.
>>>>>
>>>>> 2nd question:
>>>>> there is any way of merging two DNAStringSets ?
>>>>>
>>>>> Cheers
>>>>> Raffaele
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> -- 
>>>
>>> ----------------------------------------
>>> Prof. Raffaele A. Calogero
>>> Bioinformatics and Genomics Unit
>>> Dipartimento di Scienze Cliniche e Biologiche
>>> c/o Az. Ospedaliera S. Luigi
>>> Regione Gonzole 10, Orbassano
>>> 10043 Torino
>>> tel.   ++39 0116705417
>>> Lab.   ++39 0116705408
>>> Fax    ++39 0119038639
>>> Mobile ++39 3333827080
>>> email: raffaele.calogero at unito.it
>>>      raffaele[dot]calogero[at]gmail[dot]com
>>> www:   http://www.bioinformatica.unito.it
>>> Info: http://publicationslist.org/raffaele.calogero
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>>
>
>
> -- 
>
> ----------------------------------------
> Prof. Raffaele A. Calogero
> Bioinformatics and Genomics Unit
> Dipartimento di Scienze Cliniche e Biologiche
> c/o Az. Ospedaliera S. Luigi
> Regione Gonzole 10, Orbassano
> 10043 Torino
> tel.   ++39 0116705417
> Lab.   ++39 0116705408
> Fax    ++39 0119038639
> Mobile ++39 3333827080
> email: raffaele.calogero at unito.it
>       raffaele[dot]calogero[at]gmail[dot]com
> www:   http://www.bioinformatica.unito.it
> Info: http://publicationslist.org/raffaele.calogero



More information about the Bioconductor mailing list