[BioC] Pairwise Alignment on Large Protein Sequence Data Set

Hervé Pagès hpages at fhcrc.org
Fri Dec 20 00:15:26 CET 2013


Hi Thomas,

The projected runtime for your code is more likely to be 5 or 6 months
(but I don't know the average length of your protein sequences so I
can't really say).

Let's start with a couple of easy things you can do to reduce it to 3
months :-)

First, be aware that loops in R tend to very inefficient so one should
try hard to avoid them. One way to avoid them is by taking advantage of
vectorization. Lots of functions that expect vector-like input are
vectorized. pairwiseAlignment() is one of them: if 'pattern' contains
more than 1 pattern (i.e. is an AAStringSet object of length > 1), it
will return a PairwiseAlignments object "parallel" to 'pattern' (i.e.
of the same length as pattern).

With 1 pattern:

   > pwa1 <- pairwiseAlignment(aa1[1], aa2[1])
   > length(pwa1)
   [1] 1
   > score(pwa1)
   [1] -3804.555
   > pid(pwa1)
   [1] 20.9826

With 15 patterns:

   > pwa15 <- pairwiseAlignment(aa1[1:15], aa2[1])
   > length(pwa15)
   [1] 15
   > score(pwa15)
    [1] -3804.555 -4449.021 -4030.539 -3694.222 -3970.066 -4586.336 
-3769.805
    [8] -4644.705 -3674.032 -3742.815 -4550.244 -3827.098 -4003.654 
-4186.814
   [15] -3705.548
   > pid(pwa15)
    [1] 20.982600 20.379147 20.679012 13.903743 20.707071 20.114943 
4.797980
    [8] 18.921476 14.270833 16.474292 21.348315 19.088937 20.777892 
20.320320
   [15]  8.839191

The score and pid vectors are the same as if you were doing something like:

   sapply(1:15, function(i) score(pairwiseAlignment(aa1[i], aa2[1])))

except that vectorization gives you an interesting speed-up.

The 2nd thing that you should definitely avoid is to grow an object
in a loop, starting with an empty object. This is very inefficient.
In your case, you're going to write your results to a file at the end
anyway so you don't need to keep them in memory.
Instead you could call write.table() (or write.csv()) in the loop to
grow a file. These functions have an 'append' argument that you could
set to TRUE for that (it's FALSE by default).

But I'm going to take a different approach here. I'll keep the results
in memory but they will be presented as a list that is parallel to the
2nd set of proteins ('SeqData2'). Each list element will be a character
vector containing the names of the proteins in the 1st set that are
similar. No need to store the protein sequences in that object. The
idea is to represent the results in an object that is as light as
possible. Then, in a second step, we can use that object to make a
data.frame and write that data.frame to a file.

So this gives you something like this:

   ## Returns a list parallel to 'y'. Each list element is a character
   ## vector containing names from 'x'.
   findSimilarProteins <- function(x, y, min.pid=90,
                                   substitutionMatrix="BLOSUM100",
                                   gapOpening=-10, gapExtension=-4)
   {
     if (!is(x, "AAStringSet"))
         stop("'x' must be an AAStringSet object")
     if (!is(y, "AAStringSet"))
         stop("'y' must be an AAStringSet object")
     x_names <- names(x)
     if (is.null(x_names) || any(duplicated(x_names)))
         stop("'x' must have unique names")
     y_names <- names(y)
     if (is.null(y_names) || any(duplicated(y_names)))
         stop("'y' must have unique names")

     ## Only one level of looping. No easy way to avoid that one :-/
     lapply(y,
            function(yi) {
                pwa <- pairwiseAlignment(x, yi,
 
substitutionMatrix=substitutionMatrix,
                                         gapOpening=gapOpening,
                                         gapExtension=gapExtension)
                names(x)[pid(pwa) >= min.pid]
            })
   }

Then:

   hits <- findSimilarProteins(SeqData1, SeqData2,
                             gapOpening=0, gapExtension=-5)

This is still a long computation... about 3-4 months on your data!

Once you have 'hits' it's easy (and fast) to put the results in a
data.frame (including the sequences) and to send them to a file.
Those steps take only a few seconds:

   id1 <- unlist(hits, use.names=FALSE)
   id2 <- rep.int(names(hits), elementLengths(hits))
   df <- data.frame(id1=id1, id2=id2, seq1=SeqData1[id1], 
seq2=SeqData2[id2])
   write.table(df, file="similar_proteins.txt", quote=FALSE, sep="\t")

Finally you can reduce findSimilarProteins() running time by
parallelizing it. For this, install and load the BiocParallel package
and replace lapply() by bplapply() in findSimilarProteins(). If you
have access to a cluster with hundreds of nodes, you might be able to
run this in a couple of days only.

Otherwise, maybe you could just use BLAST which is way faster than
the Needleman-Wunsch and Smith-Waterman algos implemented by
pairwiseAlignment().

Cheers,
H.


On 12/17/2013 03:22 PM, Thomas Atanasov [guest] wrote:
>
> Basically, I have a runtime/performance issue question. I wrote the following nested while-loop using the "Biostrings" package (biocLite) in order to link protein sequences from two species if they have a >90% alignment score.
>
> I input two protein genomes (15,000-85,000 protein sequences each), compare each protein's amino acid sequence in SeqData1 to each amino acid sequence from SeqData2, compute an alignment score, and if the score is >90% I concatenate a list of the protein names that match and the sequence of the SeqData2 protein.
>
> Is there a more efficient solution to my currently-projected runtime of 1 month? Maybe a solution involving a growing data.table? Im not sure what the most efficient implementation of it is in R (I am a python programmer by trade).
>
> Thank you in advance.
>
>   -- output of sessionInfo():
>
> SeqScore <- function()
> {
>
> source("http://bioconductor.org/biocLite.R")
> biocLite()
> require("Biostrings")
> data(BLOSUM100)
>
>
> SeqData1 <- readDNAStringSet("SeqData1.fasta")
> proNum1 = 84390     # number of proteins in Seq1
>
> SeqData2 <- readDNAStringSet("SeqData2.fasta")
> proNum2 = 15194     # number of proteins in Seq
>
>
> #Create empty list to fill with percent scores and matching sequences:
> DList=NULL
> QueSeqList = NULL
> TotList = NULL
>
> #initiating the counters:
> i=1
> j=1
> c=0
>
> #Perform alignment and generate percent identity scores:
> while(i<=proNum1)
> {
>   while(j<=proNum2)
> {
>    SeqAlign <- pairwiseAlignment(SeqData1[i], SeqData2[j], substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)
> PercAlign <- pid(SeqAlign)
> if(PercAlign>=90)
> {
> DList = c(DList, names(SeqData1[i]), names(SeqData2[j]))
> QueSeqList = c(QueSeqList, toString(SeqData2[j]))
> c=c+1
> }
> else{c=c+1; print(c)
> }
> j=j+1
> }
> i=i+1
> j=1 #to reset the inner while loop
> }
> unlist<-t(sapply(DList, unlist));
> outputMatrix<-cbind(DList,QueSeqList)
> outputMatrix<-as.matrix(outputMatrix, ncol=3)
> write.csv(outputMatrix, "outputMatrix.csv")
> }
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list