[BioC] Biostrings DNAMultipleAlignment vs AlignedXStringSet

Martin Morgan mtmorgan at fhcrc.org
Fri Aug 31 21:26:28 CEST 2012


Hi Nathan --

On 08/31/2012 08:10 AM, Nathan Sheffield [guest] wrote:
>
> Hi, I have what I thought should be simple procedure...
>
> I have a multiple alignment. I want to find the locations (and
> sequences) of all the differences.
>
> I can read the file into a DNAMultipleAlignment class:
>
> aln = read.DNAMultipleAlignment(filepath=alnFiles[i],
> format="fasta")
>
> This is a 4-species alignment, 300bp long. I finally figured out a
> way find all the "differences" (or, bases where the 4 species are not
> identical), like so:
>
> which(apply(consensusMatrix(aln), 2, max) !=4)
>
> It seems like there should be a more straightforward way, but this
> just looks at the count matrix and finds all the places that don't
> have all the species in the same nucleotide...

For a reproducible example, I grabbed

      aln <-
        readDNAMultipleAlignment(filepath =
                                 system.file("extdata",
                                             "msx2_mRNA.aln",
                                             package="Biostrings"),
                                 format="clustal")

from the help page, ?MultipleAlignment. This has 8 rows.

I did

   cidx = colSums(consensusMatrix(aln) == nrow(aln)) == 1

to get a logical vector, but same idea as your 'apply' solution. I ended 
up with

 > table(cidx)
cidx
FALSE  TRUE
  2024   319

so 2024 columns that are not consistent across all rows.

> Now, I found no easy way to subset the aln object to pull out the
> sequences at those positions. Is there one?

I applied a mask to the columns

   colmask(aln, "replace") = cidx

and then coerced my aln to a DNAStringSet

 > as(aln, "DNAStringSet")
   A DNAStringSet instance of length 8
     width seq                                     names
[1]  2024 -----TCCCGTCTCCGCA...TAAAAAAAAAAAAAAAAA gi|84452153|ref|N...
[2]  2024 ------------------...------------------ gi|208431713|ref|...
[3]  2024 ------------------...------------------ gi|118601823|ref|...
[4]  2024 ------------------...------------------ gi|114326503|ref|...
[5]  2024 ------------------...------------------ gi|119220589|ref|...
[6]  2024 ------------------...------------------ gi|148540149|ref|...
[7]  2024 --------------CGGC...------------------ gi|45383056|ref|N...
[8]  2024 GGGGGAGACTTCAGAAGT...------------------ gi|213515133|ref|...

Also, I coerced the logical vector to an IRanges

 > (crng = as(!cidx, "IRanges"))
IRanges of length 166
       start  end width
[1]       1   98    98
[2]     100  109    10
[3]     111  120    10
[4]     122  122     1
[5]     124  127     4
[6]     129  142    14
[7]     144  155    12
[8]     157  159     3
[9]     161  166     6
...     ...  ...   ...
[158]   876  876     1
[159]   878  882     5
[160]   885  886     2
[161]   888  889     2
[162]   892  892     1
[163]   894  896     3
  [ reached getOption("max.print") -- omitted 3 rows ]

This allowed me to look at one range of non-consensus at a time, e.g.,

 > colmask(aln, "replace", invert=TRUE) = crng[1]
 > as(aln, "DNAStringSet")
   A DNAStringSet instance of length 8
     width seq                                     names
[1]    98 -----TCCCGTCTCCGCA...GGGCAGAGAAGTCA-TGG gi|84452153|ref|N...
[2]    98 ------------------...-------------A-TGG gi|208431713|ref|...
[3]    98 ------------------...---GAGAGAAGTCA-TGG gi|118601823|ref|...
[4]    98 ------------------...GCGCAGA-AAGTCA-TGG gi|114326503|ref|...
[5]    98 ------------------...-------------A-TGG gi|119220589|ref|...
[6]    98 ------------------...-------------A-TGG gi|148540149|ref|...
[7]    98 --------------CGGC...GGGCGGCCCCGCTC-CAG gi|45383056|ref|N...
[8]    98 GGGGGAGACTTCAGAAGT...GAATGTGTTCGTCAACAT gi|213515133|ref|...

I haven't worked with these much, so I might not be doing things in the 
most efficient way; mostly I'm going from ?MultipleAlignment; there is 
also vignette(package="Biostrings", "MultipleAlignments").

>
> I think maybe I want something like "mismatchTable", but according to
> the documentation, this won't work on a DNAMultipleAlignment...
>
> mismatchTable(aln) Error in function (classes, fdef, mtable)  :
> unable to find an inherited method for function "mismatchTable", for
> signature "DNAMultipleAlignment"
>
> Instead, it works on a AlignedXStringSet. Well, I can't figure out
> the difference between an AlignedXStringSet and a
> DNAMultipleAlignment, which both appear to hold aligned strings, even
> XStringSets...and I cannot convert one to another.
>
> Can anyone tell me where I should start to figure out the differences
> between these objects? And, how to solve my original problem of
> getting the the sequences at the locations that are different in an
> alignment?

I think the AlignedXStringSet is strictly for pairwise alignments 
returned by the ?pairwiseAlignment function.

Hope that helps,

Martin


>
>
> Thanks for your help.
>
>
>
> -- output of sessionInfo():
>
> R version 2.13.0 (2011-04-13) Platform: x86_64-redhat-linux-gnu
> (64-bit)
>
> locale: [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C [3]
> LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C
> LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C [9]
> LC_ADDRESS=C               LC_TELEPHONE=C [11]
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages: [1] grid      stats     graphics  grDevices
> utils     datasets  methods [8] base
>
> other attached packages: [1] seqLogo_1.18.0      MotIV_1.6.0
> rtracklayer_1.12.2 [4] RCurl_1.6-4         bitops_1.0-4.1
> GenomicRanges_1.4.3 [7] Biostrings_2.20.1   IRanges_1.10.3
>
> loaded via a namespace (and not attached): [1] BSgenome_1.20.0
> lattice_0.19-23 rGADEM_1.4.0    tools_2.13.0 [5] XML_3.4-0
>
>
> -- Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________ Bioconductor mailing
> list Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list