[Bioc-sig-seq] Fast inject into XStringSet objects

Thomas Girke thomas.girke at ucr.edu
Tue May 19 20:47:53 CEST 2009


Dear Herve,

Thanks a lot for your suggestions and efforts to optimize this step!

We are often using sequences of variable length (e.g. after adaptor 
trimming), but this shouldn't be a problem because we can simply loop 
over every length interval (usually less than 100 iterations!).

One short question:
Are there any plans to support masks in the future on XStringSet
objects? If one could define and apply masks to sequences of variable
length in XStringSet objects (without loops) then this would be be very
useful for many applications.

Many thanks again, 

Thomas

On Tue, May 19, 2009 at 11:17:16AM -0700, Hervé Pagès wrote:
> Hi Thomas,
> 
> Sorry for the slow response on this.
> 
> Here is a much faster solution to your problem. Depending on the
> size of the 'dset' and 'myqual' rectangles, it will be 100x or 1000x
> faster than the sapply-based solution. Let's assume 'dset' is the
> constant width DNAStringSet object containing your reads and 'myqual'
> is the constant width PhredQuality object containing the qualities
> of your reads.
> The following solution takes advantage of the fact that 'dset' and
> therefore 'myqual' are rectangular, hence 'myqual' can be turned into
> a raw matrix pretty efficiently:
> 
>   myqual_mat <- matrix(charToRaw(as.character(unlist(myqual))),
>                        nrow=length(myqual), byrow=TRUE)
> 
> This raw matrix itself can be turned into a boolean matrix using some
> cutoff value:
> 
>   at <- myqual_mat < charToRaw(as.character(PhredQuality(20L)))
> 
> Also I've just added a "replaceLetterAt" method for DNAStringSet objects
> to Biostrings (in addition to the existing method for DNAString objects).
> Currently, it has the limitation that 'x' and 'at' must be rectangular
> i.e. 'x' must have a constant width and 'at' must be a logical matrix
> with the same dimensions as 'x'. That's of course just what's needed to
> handle your usecase:
> 
>   letter_subject <- DNAString(paste(rep.int("N", width(dset)[1]), 
>   collapse=""))
>   letter <- as(Views(letter_subject, start=1, end=rowSums(at)), 
>   "DNAStringSet")
>   dset3 <- replaceLetterAt(dset, at, letter)
> 
> 'dset3' is the same as the 'dset2' object you obtained with the sapply-based
> solution.
> 
> This improved replaceLetterAt() function will be available in Biostrings
> release (v. 2.12.3) and devel (v. 2.13.8) when they propagate to the public
> repositories (in the next 24 hours).
> 
> Please let me know if you have any other concerns.
> 
> Cheers,
> H.
> 
> 
> Thomas Girke wrote:
> >Dear Developers,
> >
> >Is there an efficient method available for replacing base calls in 
> >sequences by Ns (or any other character) at positions where the quality 
> >score drops below some arbitrary threshold. So far we have been using 
> >the following code for this purpose. In general this approach works fine. 
> >However, the R loop for applying this function makes it slow for very 
> >large data sets with millions of sequences. 
> >
> >Is there a method available for achieving the same result more 
> >efficiently? 
> >Considering the available resources in Biostrings, I am wondering if it 
> >wouldn't make a lot of sense of using Biostrings' sequence masking method 
> >along with the injectHardMask function for this purpose. However, at this 
> >point I am only aware of methods for applying masks efficiently to a 
> >single sequence in a XString object, but not to many sequences stored in a 
> >XStringSet object. 
> >
> >#################
> >## Sample Code ##
> >#################
> >## Create example data set with random data
> >library(Biostrings)
> >dset <- DNAStringSet(sapply(1:100, function(x) 
> >paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) myqlist <- 
> >lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create random 
> >Phred score list.
> >myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) 
> >myqual <- PhredQuality(myqual) 
> >dsetq1 <- QualityScaledDNAStringSet(dset, myqual) 
> >
> >## Define in character inject function XStringSetInject     
> >XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], cutoff=20, 
> >mychar="N") { mypos <- which(as.integer(myqual)<cutoff)
> >        return(toString(replaceLetterAt(myseq, 
> >        which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
> >}
> >## Apply XStringSetInject function to all sequences.
> >dvec <- sapply(1:length(dset), function(x) 
> >XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
> >dset2 <- DNAStringSet(dvec)
> >names(dset2) <- paste("d", 1:length(dvec), sep="")
> >names(myqual) <- paste("d", 1:length(dvec), sep="")
> >dsetq2 <- QualityScaledDNAStringSet(dset2, myqual) 
> >
> >## Compare dsetq1 and dsetq2
> >dsetq1[1:2] 
> >dsetq2[1:2] 
> >
> >
> >>sessionInfo()
> >R version 2.9.0 (2009-04-17) 
> >x86_64-unknown-linux-gnu 
> >
> >locale:
> >LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> >
> >attached base packages:
> >[1] stats     graphics  grDevices utils     datasets  methods   base     
> >
> >other attached packages:
> >[1] Biostrings_2.12.1 IRanges_1.2.0    
> >
> >loaded via a namespace (and not attached):
> >[1] Biobase_2.4.0
> >
> >
> >Thanks in advance,
> >
> >Thomas
> >
> >_______________________________________________
> >Bioc-sig-sequencing mailing list
> >Bioc-sig-sequencing at r-project.org
> >https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>



More information about the Bioc-sig-sequencing mailing list