[BioC] Pairwise Alignment on Large Protein Sequence Data Set

Thomas Atanasov [guest] guest at bioconductor.org
Wed Dec 18 00:22:04 CET 2013


Basically, I have a runtime/performance issue question. I wrote the following nested while-loop using the "Biostrings" package (biocLite) in order to link protein sequences from two species if they have a >90% alignment score.

I input two protein genomes (15,000-85,000 protein sequences each), compare each protein's amino acid sequence in SeqData1 to each amino acid sequence from SeqData2, compute an alignment score, and if the score is >90% I concatenate a list of the protein names that match and the sequence of the SeqData2 protein.

Is there a more efficient solution to my currently-projected runtime of 1 month? Maybe a solution involving a growing data.table? Im not sure what the most efficient implementation of it is in R (I am a python programmer by trade).

Thank you in advance.

 -- output of sessionInfo(): 

SeqScore <- function()
{

source("http://bioconductor.org/biocLite.R")
biocLite()
require("Biostrings")
data(BLOSUM100)


SeqData1 <- readDNAStringSet("SeqData1.fasta")
proNum1 = 84390     # number of proteins in Seq1

SeqData2 <- readDNAStringSet("SeqData2.fasta")
proNum2 = 15194     # number of proteins in Seq


#Create empty list to fill with percent scores and matching sequences:
DList=NULL
QueSeqList = NULL
TotList = NULL

#initiating the counters:
i=1
j=1
c=0

#Perform alignment and generate percent identity scores:
while(i<=proNum1)
{
 while(j<=proNum2)
{
  SeqAlign <- pairwiseAlignment(SeqData1[i], SeqData2[j], substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)
PercAlign <- pid(SeqAlign)
if(PercAlign>=90)
{
DList = c(DList, names(SeqData1[i]), names(SeqData2[j]))
QueSeqList = c(QueSeqList, toString(SeqData2[j]))
c=c+1
}
else{c=c+1; print(c)
}
j=j+1
}
i=i+1
j=1 #to reset the inner while loop
}
unlist<-t(sapply(DList, unlist));
outputMatrix<-cbind(DList,QueSeqList)
outputMatrix<-as.matrix(outputMatrix, ncol=3)
write.csv(outputMatrix, "outputMatrix.csv")
}

--
Sent via the guest posting facility at bioconductor.org.



More information about the Bioconductor mailing list