[BioC] BioStrings questions
rcaloger
raffaele.calogero at gmail.com
Sat Oct 18 18:56:20 CEST 2008
Hi Herve,
many thanks for the informations.
After digging into the man pages of the Biostrings package I found the
countPatter function.
It is nice because it returns me the simple presence of the pattern
allowing a certain number of mismatches
score.tmp <- countPattern(PSR.seq , DNAString(as.character(exs.seq[j])),
max.mismatch=3)
I found it very useful if a limited number of sequences are queried,
however, when I work with the full refseq fasta file blast is much faster.
Cheers
Raffaele
hpages at fhcrc.org wrote:
> Hi Raffaele,
>
> The PSR do not need to be of equal length with PDict/matchPDit.
> An important limitation though is that indels are not supported
> for now (but will be in the very near future).
> Another limitation is that you need to define a Trusted Band on
> your PDict object i.e. a vertical band of constant width where
> mismatches are not allowed. For example:
>
> psrPDict <- PDict(PSRDNAStringSet, start=5, end=15)
>
> will set a Trusted Band of width 11 that covers nucleotides
> 5 to 15 of each PSR.
> Then use the 'max.mismatch' arg of matchPDict() to control the max
> number of mismatches that you want to allow *outside* the Trusted
> Band for each PSR (the nucleotides that are *inside* must match
> exactly).
> Note that even if you want to perform exact matching, you need
> to define a Trusted Band if your PSR are not of equal length.
> It's important to try to define the widest possible Trusted Band
> since the performance of the underlying algo depends directly
> on how wide it is. Typically the performance will be good if the
> width is >= 12 and will start to drop significantly for smaller
> values.
> Of course if you want to use you PDict object for exact matching
> only, then you want to set the Trusted Band to the max possible
> width i.e. to min(width(PSRDNAStringSet)).
>
> Hope this helps,
> H.
>
>
> Quoting rcaloger <raffaele.calogero at gmail.com>:
>
>> Dear Patrick,
>> many thanks for the quick answer. Unfortunately PSR can be located in
>> any position of the refseq and the sequence length can be very
>> different. Therefore, I cannot apply your suggestion.
>> Cheers
>> Raffaele
>>
>> Patrick Aboyoun wrote:
>>> Raffaele,
>>> The pairwiseAlignment function uses an O(nm) {with n and m being
>>> the length of the two sequences being aligned} dynamic programming
>>> algorithm that is designed to find an optimal alignment and as you
>>> have discovered isn't intended for use with a long reference
>>> sequence. Do your PSR sequences map nearly exactly to a location on
>>> your reference sequence and are these sequences of equal length?
>>> If so, see the matchPDict function. It matches a pattern
>>> dictionary consisting of equal length fragments to a reference
>>> sequence. The pseudo code looks something like:
>>>
>>> psrPDict <- PDict(PSRDNAStringSet)
>>> matchPDict(psrPDict, refseq)
>>>
>>> To answer your second question, the append function should get you
>>> what you want:
>>>
>>>> append(DNAStringSet(c("AAA", "GA")), DNAStringSet(c("ACTG",
>>>> "TTTACCC")))
>>> A DNAStringSet instance of length 4
>>> width seq
>>> [1] 3 AAA
>>> [2] 2 GA
>>> [3] 4 ACTG
>>> [4] 7 TTTACCC
>>>
>>>
>>> Patrick
>>>
>>>
>>> rcaloger wrote:
>>>> Hi,
>>>> In my onechannelGUI package I am developing a section related to
>>>> Affymetrix exon array analysis, creating few functions that allow
>>>> the association of exon-level Probe Selection Region (PSR) to refseq
>>>>
>>>> 1st question:
>>>> I have implemented a function that blast a list of PSR sequences
>>>> over all refseq.
>>>> However, I would like to know if there is any way of doing
>>>> something similar using the Biostring package.
>>>> I tried the pairwiseAlignment function but it is quite slow
>>>> compared to blast.
>>>>
>>>> 2nd question:
>>>> there is any way of merging two DNAStringSets ?
>>>>
>>>> Cheers
>>>> Raffaele
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>> ----------------------------------------
>> Prof. Raffaele A. Calogero
>> Bioinformatics and Genomics Unit
>> Dipartimento di Scienze Cliniche e Biologiche
>> c/o Az. Ospedaliera S. Luigi
>> Regione Gonzole 10, Orbassano
>> 10043 Torino
>> tel. ++39 0116705417
>> Lab. ++39 0116705408
>> Fax ++39 0119038639
>> Mobile ++39 3333827080
>> email: raffaele.calogero at unito.it
>> raffaele[dot]calogero[at]gmail[dot]com
>> www: http://www.bioinformatica.unito.it
>> Info: http://publicationslist.org/raffaele.calogero
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
--
----------------------------------------
Prof. Raffaele A. Calogero
Bioinformatics and Genomics Unit
Dipartimento di Scienze Cliniche e Biologiche
c/o Az. Ospedaliera S. Luigi
Regione Gonzole 10, Orbassano
10043 Torino
tel. ++39 0116705417
Lab. ++39 0116705408
Fax ++39 0119038639
Mobile ++39 3333827080
email: raffaele.calogero at unito.it
raffaele[dot]calogero[at]gmail[dot]com
www: http://www.bioinformatica.unito.it
Info: http://publicationslist.org/raffaele.calogero
More information about the Bioconductor
mailing list