[BioC] consensusString Function
Patrick Aboyoun
paboyoun at fhcrc.org
Wed Apr 7 18:41:34 CEST 2010
Erik,
The consensusString weights each input string equally and assigns an
equal probability to each of the base letters represented by an
ambiguity letter. So to use the examples you mentioned the pseudo
arithmetic results in
G + R => R
0.5 G + 0.5 R = 0.5 G + 0.5 (0.5 A + 0.5 G) = 0.5 G + 0.25 A + 0.25 G =
0.75 G + 0.25 A => R
G + R + G => G
2/3 G + 1/3 R = 2/3 G + 1/3 (1/2 A + 1/2 G) = 2/3 G + 1/6 A + 1/6 G =
5/6 G + 1/6 A => G
A + N => A
0.5 A + 0.5 N = 0.5 A + 0.5 (0.25 A + 0.25 C + 0.25 G + 0.25 T) = 0.625
A + 0.125 C + 0.125 G + 0.125 T => A
There a little room for growth in the consensusString function. For
example, it could accept a non-negative vector of weights so the input
strings are not all equally weighted. Anything more complicated than
that, however, and I would recommend representing each string in PWM
form and then performing weighted/unweighted averages across each of the
cells in the PWMs.
Patrick
On 4/7/10 9:06 AM, Erik Wright wrote:
> Hi Patrick,
>
> Thanks for closing the loop on this issue. I have a question about the outputs you show below:
>
> If two strings are ACAG and ACAR where R can be A or G, then it makes sense to me that the consensus is ACAR.
>
> Why then is an N treated differently than an R? If two strings are NNNN and ACTG, where N can be A, C, T, or G, then based on the prior example the output should be NNNN, but the output you show is ACTG.
>
> If there are three sequences, ACAG, ACAR, and ACAG, and the threshold is set at the default 25% (for a DNAStringSet) then why is the output ACAG? If an R is a C or G, and the other two codons in the final position are G's then the sum is two G's and one C. One in three (33.33...%) being C is greater than the default threshold of 25%, so I would expect the consensus character to be R. Therefore the output should be ACAR.
>
> Thanks for helping me understand the output.
>
> Smiles,
> Erik
>
>
>
> On Apr 6, 2010, at 5:29 PM, Patrick Aboyoun wrote:
>
>
>> Erik, Heidi, and Wolfgang,
>> To bring this thread full circle, Biostrings::consensusString didn't support ambiguity letters in input strings for BioC<= 2.5 (R<= 2.10). This support has been added to BioC 2.6 (R 2.11), but as Wolfgang mentioned there was a bug in the code. This bug has now been fixed in BioC 2.6 and will be available for download from bioconductor.org within 36 hours. Here are some examples of consensusString operating on DNAStringSet objects containing ambiguity letters:
>>
>>
>>> consensusString(DNAStringSet(c("NNNN","ACTG")))
>>>
>> [1] "ACTG"
>>
>>> consensusString(DNAStringSet(c("AANN","ACTG")))
>>>
>> [1] "AMTG"
>>
>>> consensusString(DNAStringSet(c("ACAG","ACAR")))
>>>
>> [1] "ACAR"
>>
>>> consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG")))
>>>
>> [1] "ACAG"
>>
>>> sessionInfo()
>>>
>> R version 2.11.0 alpha (2010-04-04 r51591)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] Biostrings_2.15.27 IRanges_1.5.74
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.7.5
>>
>>
>>
>> On 4/6/10 2:36 PM, Wolfgang Huber wrote:
>>
>>> Hi Erik, Herv'e
>>>
>>> please always provide the output of sessionInfo(), and a complete reproducible example (you let Heidi and the others guess that you're talking about the Biostrings package).
>>>
>>> With a more recent version of Biostrings, I get:
>>>
>>> library("Biostrings")
>>> consensusString( DNAStringSet(c("AAAA","ACTG")) )
>>> # [1] "AMWR"
>>>
>>> consensusString( DNAStringSet(c("AAAB","ACTG")) )
>>> # Error in FUN(newX[, i], ...) :
>>> 'ambiguityMap' is missing some combinations of row names
>>>
>>> And going into the debugger where the error is caused, i.e. within the function "consensusLetter", the expression
>>>
>>> i<- paste(all_letters[col>= threshold], collapse = "")
>>>
>>> is evaluated where
>>> Browse[1]> all_letters
>>> [1] "A" "C" "G" "T" "B" ## length is 5
>>> Browse[1]> col
>>> [1] 1 0 0 0 ## length is 4
>>> Browse[1]> i
>>> [1] "AB" ## recycling rule was applied
>>>
>>> which seems unintended and with some more insight will probably lead to the fixing of a bug.
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>
>>>
>>>> sessionInfo()
>>>>
>>> R version 2.12.0 Under development (unstable) (2010-04-06 r51617)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>> [1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C
>>> [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=C LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices datasets utils methods base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.15.26 IRanges_1.5.74 fortunes_1.3-7
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.7.5
>>>
>>> Heidi Dvinge ha scritto:
>>>
>>>> Hello Erik,
>>>>
>>>> unfortunately I'm not familiar with the Biostrings package, so I can't
>>>> tell you why this doesn't work, but until someone else can answer, this
>>>> seems to be a work-around.
>>>>
>>>> Apparently, consensusString doesn't handle Ns.
>>>>
>>>>
>>>>> test<- DNAStringSet(c("AANN","ACTG"))
>>>>> consensusString(test)
>>>>>
>>>> Error in .local(x, ...) :
>>>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x)> 0)]
>>>>
>>>> If there are no Ns things are okay though.
>>>>
>>>>
>>>>> test2<- DNAStringSet(c("AAAA","ACTG"))
>>>>> consensusString(test2)
>>>>>
>>>> [1] "AMWR"
>>>>
>>>> However, Ns seem acceptable if the consensus matrix is calculated first,
>>>> although they might result in ?s where no consensus could be found.
>>>>
>>>>
>>>>> test3<- consensusMatrix(test)
>>>>> consensusString(test3)
>>>>>
>>>> [1] "A???"
>>>>
>>>> HTH
>>>> \Heidi
>>>>
>>>>
>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to get a consensus string for a DNAStringSet, but I am getting
>>>>> an error. The documentation for consensusString says the argument "x" is
>>>>> either a consensus matrix or an XStringSet. So this should work, right?:
>>>>>
>>>>>> myDNAStringSet<- DNAStringSet(c("NNNN","ACTG"))
>>>>>> consensusString(myDNAStringSet)
>>>>>>
>>>>> Error in .local(x, ...) :
>>>>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x)> 0)]
>>>>>
>>>>> Specifying a threshold in the arguments doesn't seem to make a difference.
>>>>>
>>>>> Thanks!,
>>>>> Erik
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
More information about the Bioconductor
mailing list