[Bioc-sig-seq] Fast inject into XStringSet objects

Tue May 5 01:48:28 CEST 2009

Dear Developers,

Is there an efficient method available for replacing base calls in 
sequences by Ns (or any other character) at positions where the quality 
score drops below some arbitrary threshold. So far we have been using 
the following code for this purpose. In general this approach works fine. 
However, the R loop for applying this function makes it slow for very large 
data sets with millions of sequences. 

Is there a method available for achieving the same result more efficiently? 

Considering the available resources in Biostrings, I am wondering if it wouldn't 
make a lot of sense of using Biostrings' sequence masking method along with the 
injectHardMask function for this purpose. However, at this point I am only aware 
of methods for applying masks efficiently to a single sequence in a XString object, 
but not to many sequences stored in a XStringSet object. 

#################
## Sample Code ##
#################
## Create example data set with random data
library(Biostrings)
dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) 
myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create random Phred score list.
myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) 
myqual <- PhredQuality(myqual) 
dsetq1 <- QualityScaledDNAStringSet(dset, myqual) 

## Define in character inject function XStringSetInject     
XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], cutoff=20, mychar="N") { 
        mypos <- which(as.integer(myqual)<cutoff)
        return(toString(replaceLetterAt(myseq, which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
}
## Apply XStringSetInject function to all sequences.
dvec <- sapply(1:length(dset), function(x) XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
dset2 <- DNAStringSet(dvec)
names(dset2) <- paste("d", 1:length(dvec), sep="")
names(myqual) <- paste("d", 1:length(dvec), sep="")
dsetq2 <- QualityScaledDNAStringSet(dset2, myqual) 

## Compare dsetq1 and dsetq2
dsetq1[1:2] 
dsetq2[1:2] 

> sessionInfo()
R version 2.9.0 (2009-04-17) 
x86_64-unknown-linux-gnu 

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.12.1 IRanges_1.2.0    

loaded via a namespace (and not attached):
[1] Biobase_2.4.0

Thanks in advance,

Thomas