[BioC] Pairwise Alignment on Large Protein Sequence Data Set
Thomas Atanasov [guest]
guest at bioconductor.org
Wed Dec 18 00:22:04 CET 2013
Basically, I have a runtime/performance issue question. I wrote the following nested while-loop using the "Biostrings" package (biocLite) in order to link protein sequences from two species if they have a >90% alignment score.
I input two protein genomes (15,000-85,000 protein sequences each), compare each protein's amino acid sequence in SeqData1 to each amino acid sequence from SeqData2, compute an alignment score, and if the score is >90% I concatenate a list of the protein names that match and the sequence of the SeqData2 protein.
Is there a more efficient solution to my currently-projected runtime of 1 month? Maybe a solution involving a growing data.table? Im not sure what the most efficient implementation of it is in R (I am a python programmer by trade).
Thank you in advance.
-- output of sessionInfo():
SeqScore <- function()
{
source("http://bioconductor.org/biocLite.R")
biocLite()
require("Biostrings")
data(BLOSUM100)
SeqData1 <- readDNAStringSet("SeqData1.fasta")
proNum1 = 84390 # number of proteins in Seq1
SeqData2 <- readDNAStringSet("SeqData2.fasta")
proNum2 = 15194 # number of proteins in Seq
#Create empty list to fill with percent scores and matching sequences:
DList=NULL
QueSeqList = NULL
TotList = NULL
#initiating the counters:
i=1
j=1
c=0
#Perform alignment and generate percent identity scores:
while(i<=proNum1)
{
while(j<=proNum2)
{
SeqAlign <- pairwiseAlignment(SeqData1[i], SeqData2[j], substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)
PercAlign <- pid(SeqAlign)
if(PercAlign>=90)
{
DList = c(DList, names(SeqData1[i]), names(SeqData2[j]))
QueSeqList = c(QueSeqList, toString(SeqData2[j]))
c=c+1
}
else{c=c+1; print(c)
}
j=j+1
}
i=i+1
j=1 #to reset the inner while loop
}
unlist<-t(sapply(DList, unlist));
outputMatrix<-cbind(DList,QueSeqList)
outputMatrix<-as.matrix(outputMatrix, ncol=3)
write.csv(outputMatrix, "outputMatrix.csv")
}
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list