[BioC] consensusString Function
Patrick Aboyoun
paboyoun at fhcrc.org
Wed Apr 7 19:17:20 CEST 2010
Erik,
I should point out that the threshold parameter supplies some
flexibility in how unambiguous a base needs to be in the consensus:
> consensusString(DNAStringSet(c("A", "N")), threshold = 1e-6)
[1] "N"
> consensusString(DNAStringSet(c("G", "R", "G")), threshold = 1e-6)
[1] "R"
> sessionInfo()
R version 2.11.0 alpha (2010-04-04 r51591)
i386-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.15.27 IRanges_1.5.74 GEOquery_2.11.3
[4] Biobase_2.7.5 RCurl_1.3-1 bitops_1.0-4.1
[7] SRAdb_1.1.10 RSQLite_0.8-4 DBI_0.2-5
[10] codetoolsBioC_0.0.14
loaded via a namespace (and not attached):
[1] codetools_0.2-2
Patrick
On 4/7/10 9:41 AM, Patrick Aboyoun wrote:
> Erik,
> The consensusString weights each input string equally and assigns an
> equal probability to each of the base letters represented by an
> ambiguity letter. So to use the examples you mentioned the pseudo
> arithmetic results in
>
> G + R => R
> 0.5 G + 0.5 R = 0.5 G + 0.5 (0.5 A + 0.5 G) = 0.5 G + 0.25 A + 0.25 G
> = 0.75 G + 0.25 A => R
>
> G + R + G => G
> 2/3 G + 1/3 R = 2/3 G + 1/3 (1/2 A + 1/2 G) = 2/3 G + 1/6 A + 1/6 G =
> 5/6 G + 1/6 A => G
>
> A + N => A
> 0.5 A + 0.5 N = 0.5 A + 0.5 (0.25 A + 0.25 C + 0.25 G + 0.25 T) =
> 0.625 A + 0.125 C + 0.125 G + 0.125 T => A
>
> There a little room for growth in the consensusString function. For
> example, it could accept a non-negative vector of weights so the input
> strings are not all equally weighted. Anything more complicated than
> that, however, and I would recommend representing each string in PWM
> form and then performing weighted/unweighted averages across each of
> the cells in the PWMs.
>
>
> Patrick
>
>
> On 4/7/10 9:06 AM, Erik Wright wrote:
>> Hi Patrick,
>>
>> Thanks for closing the loop on this issue. I have a question about
>> the outputs you show below:
>>
>> If two strings are ACAG and ACAR where R can be A or G, then it makes
>> sense to me that the consensus is ACAR.
>>
>> Why then is an N treated differently than an R? If two strings are
>> NNNN and ACTG, where N can be A, C, T, or G, then based on the prior
>> example the output should be NNNN, but the output you show is ACTG.
>>
>> If there are three sequences, ACAG, ACAR, and ACAG, and the threshold
>> is set at the default 25% (for a DNAStringSet) then why is the output
>> ACAG? If an R is a C or G, and the other two codons in the final
>> position are G's then the sum is two G's and one C. One in three
>> (33.33...%) being C is greater than the default threshold of 25%, so
>> I would expect the consensus character to be R. Therefore the output
>> should be ACAR.
>>
>> Thanks for helping me understand the output.
>>
>> Smiles,
>> Erik
>>
>>
>>
>> On Apr 6, 2010, at 5:29 PM, Patrick Aboyoun wrote:
>>
>>> Erik, Heidi, and Wolfgang,
>>> To bring this thread full circle, Biostrings::consensusString didn't
>>> support ambiguity letters in input strings for BioC<= 2.5 (R<=
>>> 2.10). This support has been added to BioC 2.6 (R 2.11), but as
>>> Wolfgang mentioned there was a bug in the code. This bug has now
>>> been fixed in BioC 2.6 and will be available for download from
>>> bioconductor.org within 36 hours. Here are some examples of
>>> consensusString operating on DNAStringSet objects containing
>>> ambiguity letters:
>>>
>>>> consensusString(DNAStringSet(c("NNNN","ACTG")))
>>> [1] "ACTG"
>>>> consensusString(DNAStringSet(c("AANN","ACTG")))
>>> [1] "AMTG"
>>>> consensusString(DNAStringSet(c("ACAG","ACAR")))
>>> [1] "ACAR"
>>>> consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG")))
>>> [1] "ACAG"
>>>> sessionInfo()
>>> R version 2.11.0 alpha (2010-04-04 r51591)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.15.27 IRanges_1.5.74
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.7.5
>>>
>>>
>>>
>>> On 4/6/10 2:36 PM, Wolfgang Huber wrote:
>>>> Hi Erik, Herv'e
>>>>
>>>> please always provide the output of sessionInfo(), and a complete
>>>> reproducible example (you let Heidi and the others guess that
>>>> you're talking about the Biostrings package).
>>>>
>>>> With a more recent version of Biostrings, I get:
>>>>
>>>> library("Biostrings")
>>>> consensusString( DNAStringSet(c("AAAA","ACTG")) )
>>>> # [1] "AMWR"
>>>>
>>>> consensusString( DNAStringSet(c("AAAB","ACTG")) )
>>>> # Error in FUN(newX[, i], ...) :
>>>> 'ambiguityMap' is missing some combinations of row names
>>>>
>>>> And going into the debugger where the error is caused, i.e. within
>>>> the function "consensusLetter", the expression
>>>>
>>>> i<- paste(all_letters[col>= threshold], collapse = "")
>>>>
>>>> is evaluated where
>>>> Browse[1]> all_letters
>>>> [1] "A" "C" "G" "T" "B" ## length is 5
>>>> Browse[1]> col
>>>> [1] 1 0 0 0 ## length is 4
>>>> Browse[1]> i
>>>> [1] "AB" ## recycling rule was applied
>>>>
>>>> which seems unintended and with some more insight will probably
>>>> lead to the fixing of a bug.
>>>>
>>>> Best wishes
>>>> Wolfgang
>>>>
>>>>
>>>>> sessionInfo()
>>>> R version 2.12.0 Under development (unstable) (2010-04-06 r51617)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C
>>>> [4] LC_COLLATE=C LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8
>>>> [7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
>>>> [10] LC_TELEPHONE=C LC_MEASUREMENT=C
>>>> LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices datasets utils methods base
>>>>
>>>> other attached packages:
>>>> [1] Biostrings_2.15.26 IRanges_1.5.74 fortunes_1.3-7
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.7.5
>>>>
>>>> Heidi Dvinge ha scritto:
>>>>> Hello Erik,
>>>>>
>>>>> unfortunately I'm not familiar with the Biostrings package, so I
>>>>> can't
>>>>> tell you why this doesn't work, but until someone else can answer,
>>>>> this
>>>>> seems to be a work-around.
>>>>>
>>>>> Apparently, consensusString doesn't handle Ns.
>>>>>
>>>>>> test<- DNAStringSet(c("AANN","ACTG"))
>>>>>> consensusString(test)
>>>>> Error in .local(x, ...) :
>>>>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x)> 0)]
>>>>>
>>>>> If there are no Ns things are okay though.
>>>>>
>>>>>> test2<- DNAStringSet(c("AAAA","ACTG"))
>>>>>> consensusString(test2)
>>>>> [1] "AMWR"
>>>>>
>>>>> However, Ns seem acceptable if the consensus matrix is calculated
>>>>> first,
>>>>> although they might result in ?s where no consensus could be found.
>>>>>
>>>>>> test3<- consensusMatrix(test)
>>>>>> consensusString(test3)
>>>>> [1] "A???"
>>>>>
>>>>> HTH
>>>>> \Heidi
>>>>>
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to get a consensus string for a DNAStringSet, but I
>>>>>> am getting
>>>>>> an error. The documentation for consensusString says the
>>>>>> argument "x" is
>>>>>> either a consensus matrix or an XStringSet. So this should work,
>>>>>> right?:
>>>>>>> myDNAStringSet<- DNAStringSet(c("NNNN","ACTG"))
>>>>>>> consensusString(myDNAStringSet)
>>>>>> Error in .local(x, ...) :
>>>>>> 'threshold' must be a numeric in (0, 1/sum(rowSums(x)> 0)]
>>>>>>
>>>>>> Specifying a threshold in the arguments doesn't seem to make a
>>>>>> difference.
>>>>>>
>>>>>> Thanks!,
>>>>>> Erik
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list