[BioC] consensusString Function

Wed Apr 7 19:17:20 CEST 2010

Erik,
I should point out that the threshold parameter supplies some 
flexibility in how unambiguous a base needs to be in the consensus:

 > consensusString(DNAStringSet(c("A", "N")), threshold = 1e-6)
[1] "N"

 > consensusString(DNAStringSet(c("G", "R", "G")), threshold = 1e-6)
[1] "R"

 > sessionInfo()
R version 2.11.0 alpha (2010-04-04 r51591)
i386-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
  [1] Biostrings_2.15.27   IRanges_1.5.74       GEOquery_2.11.3
  [4] Biobase_2.7.5        RCurl_1.3-1          bitops_1.0-4.1
  [7] SRAdb_1.1.10         RSQLite_0.8-4        DBI_0.2-5
[10] codetoolsBioC_0.0.14

loaded via a namespace (and not attached):
[1] codetools_0.2-2


Patrick


On 4/7/10 9:41 AM, Patrick Aboyoun wrote:
> Erik,
> The consensusString weights each input string equally and assigns an 
> equal probability to each of the base letters represented by an 
> ambiguity letter. So to use the examples you mentioned the pseudo 
> arithmetic results in
>
> G + R => R
> 0.5 G + 0.5 R = 0.5 G + 0.5 (0.5 A + 0.5 G) = 0.5 G + 0.25 A + 0.25 G 
> = 0.75 G + 0.25 A => R
>
> G + R + G => G
> 2/3 G + 1/3 R = 2/3 G + 1/3 (1/2 A + 1/2 G) = 2/3 G + 1/6 A + 1/6 G = 
> 5/6 G + 1/6 A => G
>
> A + N => A
> 0.5 A + 0.5 N = 0.5 A + 0.5 (0.25 A + 0.25 C + 0.25 G + 0.25 T) = 
> 0.625 A + 0.125 C + 0.125 G + 0.125 T => A
>
> There a little room for growth in the consensusString function. For 
> example, it could accept a non-negative vector of weights so the input 
> strings are not all equally weighted. Anything more complicated than 
> that, however, and I would recommend representing each string in PWM 
> form and then performing weighted/unweighted averages across each of 
> the cells in the PWMs.
>
>
> Patrick
>
>
> On 4/7/10 9:06 AM, Erik Wright wrote:
>> Hi Patrick,
>>
>> Thanks for closing the loop on this issue.  I have a question about 
>> the outputs you show below:
>>
>> If two strings are ACAG and ACAR where R can be A or G, then it makes 
>> sense to me that the consensus is ACAR.
>>
>> Why then is an N treated differently than an R?  If two strings are 
>> NNNN and ACTG, where N can be A, C, T, or G, then based on the prior 
>> example the output should be NNNN, but the output you show is ACTG.
>>
>> If there are three sequences, ACAG, ACAR, and ACAG, and the threshold 
>> is set at the default 25% (for a DNAStringSet) then why is the output 
>> ACAG?  If an R is a C or G, and the other two codons in the final 
>> position are G's then the sum is two G's and one C.  One in three 
>> (33.33...%) being C is greater than the default threshold of 25%, so 
>> I would expect the consensus character to be R.  Therefore the output 
>> should be ACAR.
>>
>> Thanks for helping me understand the output.
>>
>> Smiles,
>> Erik
>>
>>
>>
>> On Apr 6, 2010, at 5:29 PM, Patrick Aboyoun wrote:
>>
>>> Erik, Heidi, and Wolfgang,
>>> To bring this thread full circle, Biostrings::consensusString didn't 
>>> support ambiguity letters in input strings for BioC<= 2.5 (R<= 
>>> 2.10). This support has been added to BioC 2.6 (R 2.11), but as 
>>> Wolfgang mentioned there was a bug in the code. This bug has now 
>>> been fixed in BioC 2.6 and will be available for download from 
>>> bioconductor.org within 36 hours. Here are some examples of 
>>> consensusString operating on DNAStringSet objects containing 
>>> ambiguity letters:
>>>
>>>>    consensusString(DNAStringSet(c("NNNN","ACTG")))
>>> [1] "ACTG"
>>>>    consensusString(DNAStringSet(c("AANN","ACTG")))
>>> [1] "AMTG"
>>>>    consensusString(DNAStringSet(c("ACAG","ACAR")))
>>> [1] "ACAR"
>>>>    consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG")))
>>> [1] "ACAG"
>>>> sessionInfo()
>>> R version 2.11.0 alpha (2010-04-04 r51591)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] Biostrings_2.15.27 IRanges_1.5.74
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.7.5
>>>
>>>
>>>
>>> On 4/6/10 2:36 PM, Wolfgang Huber wrote:
>>>> Hi Erik, Herv'e
>>>>
>>>> please always provide the output of sessionInfo(), and a complete 
>>>> reproducible example (you let Heidi and the others guess that 
>>>> you're talking about the Biostrings package).
>>>>
>>>> With a more recent version of Biostrings, I get:
>>>>
>>>> library("Biostrings")
>>>> consensusString( DNAStringSet(c("AAAA","ACTG")) )
>>>> # [1] "AMWR"
>>>>
>>>> consensusString( DNAStringSet(c("AAAB","ACTG")) )
>>>> # Error in FUN(newX[, i], ...) :
>>>>   'ambiguityMap' is missing some combinations of row names
>>>>
>>>> And going into the debugger where the error is caused, i.e. within 
>>>> the function "consensusLetter", the expression
>>>>
>>>> i<- paste(all_letters[col>= threshold], collapse = "")
>>>>
>>>> is evaluated where
>>>> Browse[1]>  all_letters
>>>> [1] "A" "C" "G" "T" "B"    ## length is 5
>>>> Browse[1]>  col
>>>> [1] 1 0 0 0                ## length is 4
>>>> Browse[1]>  i
>>>> [1] "AB"                   ## recycling rule was applied
>>>>
>>>> which seems unintended and with some more insight will probably 
>>>> lead to the fixing of a bug.
>>>>
>>>>     Best wishes
>>>>     Wolfgang
>>>>
>>>>
>>>>> sessionInfo()
>>>> R version 2.12.0 Under development (unstable) (2010-04-06 r51617)
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=C              LC_NUMERIC=C            LC_TIME=C
>>>> [4] LC_COLLATE=C            LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8
>>>> [7] LC_PAPER=C              LC_NAME=C               LC_ADDRESS=C
>>>> [10] LC_TELEPHONE=C          LC_MEASUREMENT=C        
>>>> LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices datasets  utils     methods   base
>>>>
>>>> other attached packages:
>>>> [1] Biostrings_2.15.26 IRanges_1.5.74     fortunes_1.3-7
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.7.5
>>>>
>>>> Heidi Dvinge ha scritto:
>>>>> Hello Erik,
>>>>>
>>>>> unfortunately I'm not familiar with the Biostrings package, so I 
>>>>> can't
>>>>> tell you why this doesn't work, but until someone else can answer, 
>>>>> this
>>>>> seems to be a work-around.
>>>>>
>>>>> Apparently, consensusString doesn't handle Ns.
>>>>>
>>>>>> test<- DNAStringSet(c("AANN","ACTG"))
>>>>>> consensusString(test)
>>>>> Error in .local(x, ...) :
>>>>>   'threshold' must be a numeric in (0, 1/sum(rowSums(x)>  0)]
>>>>>
>>>>> If there are no Ns things are okay though.
>>>>>
>>>>>> test2<- DNAStringSet(c("AAAA","ACTG"))
>>>>>> consensusString(test2)
>>>>> [1] "AMWR"
>>>>>
>>>>> However, Ns seem acceptable if the consensus matrix is calculated 
>>>>> first,
>>>>> although they might result in ?s where no consensus could be found.
>>>>>
>>>>>> test3<- consensusMatrix(test)
>>>>>> consensusString(test3)
>>>>> [1] "A???"
>>>>>
>>>>> HTH
>>>>> \Heidi
>>>>>
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to get a consensus string for a DNAStringSet, but I 
>>>>>> am getting
>>>>>> an error.  The documentation for consensusString says the 
>>>>>> argument "x" is
>>>>>> either a consensus matrix or an XStringSet.  So this should work, 
>>>>>> right?:
>>>>>>> myDNAStringSet<- DNAStringSet(c("NNNN","ACTG"))
>>>>>>> consensusString(myDNAStringSet)
>>>>>> Error in .local(x, ...) :
>>>>>>   'threshold' must be a numeric in (0, 1/sum(rowSums(x)>  0)]
>>>>>>
>>>>>> Specifying a threshold in the arguments doesn't seem to make a 
>>>>>> difference.
>>>>>>
>>>>>> Thanks!,
>>>>>> Erik
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: 
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor