[Bioc-sig-seq] Fast inject into XStringSet objects

Wed May 20 18:45:10 CEST 2009

Dear Herve and Patrick,

The proposed approaches (and existing solutions!) are nearly perfect for
efficiently processing many small RNA-Seq and multiplexing projects that
require adaptor trimming and quality editing before they can be aligned
in any meaningful manner against their reference. I also anticipate that
similar quality editing/trimming steps will become important for
analyzing long NG-Seq runs (e.g. 75 cycles) where the quality tails off
rapidly toward the end of the reads.  Removing low quality regions of
long reads will allow to 'rescue' many of the good sequence regions for
de novo assemblies and other applications. 

The possibility of using 'soft' masks for these editing and trimming
steps along with functions like trimLRPatterns and replaceLetterAt would
be ideal. However, I certainly understand that it may not be trivial to
implement masking support for XStringSet objects. Nevertheless, it may
be something worth to consider in the future. If this is not an option
right now, then please don't worry. The existing resources in Biostrings
and related packages work great for everything we need! 

Thanks a lot for your clarifications and help on this topic. 

Thomas

On Tue, May 19, 2009 at 04:06:54PM -0700, Hervé Pagès wrote:
> Thomas,
> 
> Thomas Girke wrote:
> >Dear Herve,
> >
> >Thanks a lot for your suggestions and efforts to optimize this step!
> >
> >We are often using sequences of variable length (e.g. after adaptor 
> >trimming), but this shouldn't be a problem because we can simply loop 
> >over every length interval (usually less than 100 iterations!).
> 
> Yes I could extend replaceLetterAt() so it supports DNAStringSet objects
> of arbitrary shape. In that case, two formats for 'at' would be natural:
> "list of logical vectors" or "list of integer vectors". Then you could do
> something like this:
> 
> (1) Trim the adaptor:
> 
>       trimming <- trimLRPatterns(..., dset, ranges=TRUE)
>       trimmed_dset <- narrow(dset, start=start(trimming), end=end(trimming))
>       trimmed_myqual <- narrow(myqual,
>                                start=start(trimming),
>                                end=end(trimming))
> 
> (2) Build the 'at' object to be used in replaceLetterAt():
> 
>       flat_quals <- charToRaw(as.character(unlist(trimmed_myqual)))
>       flat_at <- flat_quals < charToRaw(as.character(PhredQuality(20L)))
>       at <- split(flat_at,
>                   rep.int(seq_len(length(trimmed_myqual)),
>                   width(trimmed_myqual)))
> 
> (3) Build the 'letter' object to be used in replaceLetterAt():
> 
>       letter_subject <-
>           DNAString(paste(rep.int("N", max(width(trimmed_dset))), 
>           collapse=""))
>       letter <- as(Views(letter_subject, start=1, end=sapply(at, sum)),
>                    "DNAStringSet")
> 
> (4) Call replaceLetterAt():
> 
>       dset4 <- replaceLetterAt(trimmed_dset, at, letter)
> 
> Also some convenience could be added to Biostrings to support this use case
> by wrapping some of the steps above into utility functions e.g.
>  o (1) could be done in a single call if we had a "trimLRPatterns"
>    method for QualityScaledDNAStringSet objects (which is easy
>    to add),
>  o (2) and (3) could be wrapped too but I don't have good names for
>    the wrappers and am not sure about their arguments either. I would
>    need to think a little bit about it.
> 
> What do you think? Would that work?
> 
> >
> >One short question:
> >Are there any plans to support masks in the future on XStringSet
> >objects? If one could define and apply masks to sequences of variable
> >length in XStringSet objects (without loops) then this would be be very
> >useful for many applications.
> 
> No plans to support *soft* masking of XStringSet objects in the near
> future because it's a too complicated thing to put in place, at least
> for now. However *hard* masking (i.e. alteration of an XStringSet object
> by injection of the "-" letter at some locations) is much easier to
> support because 99% of the times you don't need to touch the implementation
> of the functions that operate on XStringSet objects to make them work
> on a hard-masked XStringSet object. They just keep working just fine.
> The downside of hard masking is that every time the hard mask needs to be
> modified, you need to make a new copy of the XStringSet object. But we
> first need to see use cases where this becomes really an issue before
> we start working on *soft* masking of XStringSet objects.
> 
> Cheers,
> H.
> 
> 
> >
> >Many thanks again, 
> >
> >Thomas
> >
> >On Tue, May 19, 2009 at 11:17:16AM -0700, Hervé Pagès wrote:
> >>Hi Thomas,
> >>
> >>Sorry for the slow response on this.
> >>
> >>Here is a much faster solution to your problem. Depending on the
> >>size of the 'dset' and 'myqual' rectangles, it will be 100x or 1000x
> >>faster than the sapply-based solution. Let's assume 'dset' is the
> >>constant width DNAStringSet object containing your reads and 'myqual'
> >>is the constant width PhredQuality object containing the qualities
> >>of your reads.
> >>The following solution takes advantage of the fact that 'dset' and
> >>therefore 'myqual' are rectangular, hence 'myqual' can be turned into
> >>a raw matrix pretty efficiently:
> >>
> >>  myqual_mat <- matrix(charToRaw(as.character(unlist(myqual))),
> >>                       nrow=length(myqual), byrow=TRUE)
> >>
> >>This raw matrix itself can be turned into a boolean matrix using some
> >>cutoff value:
> >>
> >>  at <- myqual_mat < charToRaw(as.character(PhredQuality(20L)))
> >>
> >>Also I've just added a "replaceLetterAt" method for DNAStringSet objects
> >>to Biostrings (in addition to the existing method for DNAString objects).
> >>Currently, it has the limitation that 'x' and 'at' must be rectangular
> >>i.e. 'x' must have a constant width and 'at' must be a logical matrix
> >>with the same dimensions as 'x'. That's of course just what's needed to
> >>handle your usecase:
> >>
> >>  letter_subject <- DNAString(paste(rep.int("N", width(dset)[1]), 
> >>  collapse=""))
> >>  letter <- as(Views(letter_subject, start=1, end=rowSums(at)), 
> >>  "DNAStringSet")
> >>  dset3 <- replaceLetterAt(dset, at, letter)
> >>
> >>'dset3' is the same as the 'dset2' object you obtained with the 
> >>sapply-based
> >>solution.
> >>
> >>This improved replaceLetterAt() function will be available in Biostrings
> >>release (v. 2.12.3) and devel (v. 2.13.8) when they propagate to the 
> >>public
> >>repositories (in the next 24 hours).
> >>
> >>Please let me know if you have any other concerns.
> >>
> >>Cheers,
> >>H.
> >>
> >>
> >>Thomas Girke wrote:
> >>>Dear Developers,
> >>>
> >>>Is there an efficient method available for replacing base calls in 
> >>>sequences by Ns (or any other character) at positions where the quality 
> >>>score drops below some arbitrary threshold. So far we have been using 
> >>>the following code for this purpose. In general this approach works 
> >>>fine. However, the R loop for applying this function makes it slow for 
> >>>very large data sets with millions of sequences. 
> >>>
> >>>Is there a method available for achieving the same result more 
> >>>efficiently? 
> >>>Considering the available resources in Biostrings, I am wondering if it 
> >>>wouldn't make a lot of sense of using Biostrings' sequence masking 
> >>>method along with the injectHardMask function for this purpose. However, 
> >>>at this point I am only aware of methods for applying masks efficiently 
> >>>to a single sequence in a XString object, but not to many sequences 
> >>>stored in a XStringSet object. 
> >>>
> >>>#################
> >>>## Sample Code ##
> >>>#################
> >>>## Create example data set with random data
> >>>library(Biostrings)
> >>>dset <- DNAStringSet(sapply(1:100, function(x) 
> >>>paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) myqlist 
> >>><- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create 
> >>>random Phred score list.
> >>>myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) 
> >>>myqual <- PhredQuality(myqual) 
> >>>dsetq1 <- QualityScaledDNAStringSet(dset, myqual) 
> >>>
> >>>## Define in character inject function XStringSetInject     
> >>>XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], 
> >>>cutoff=20, mychar="N") { mypos <- which(as.integer(myqual)<cutoff)
> >>>       return(toString(replaceLetterAt(myseq, 
> >>>       which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
> >>>}
> >>>## Apply XStringSetInject function to all sequences.
> >>>dvec <- sapply(1:length(dset), function(x) 
> >>>XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
> >>>dset2 <- DNAStringSet(dvec)
> >>>names(dset2) <- paste("d", 1:length(dvec), sep="")
> >>>names(myqual) <- paste("d", 1:length(dvec), sep="")
> >>>dsetq2 <- QualityScaledDNAStringSet(dset2, myqual) 
> >>>
> >>>## Compare dsetq1 and dsetq2
> >>>dsetq1[1:2] 
> >>>dsetq2[1:2] 
> >>>
> >>>
> >>>>sessionInfo()
> >>>R version 2.9.0 (2009-04-17) 
> >>>x86_64-unknown-linux-gnu 
> >>>
> >>>locale:
> >>>LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> >>>
> >>>attached base packages:
> >>>[1] stats     graphics  grDevices utils     datasets  methods   base     
> >>>
> >>>other attached packages:
> >>>[1] Biostrings_2.12.1 IRanges_1.2.0    
> >>>
> >>>loaded via a namespace (and not attached):
> >>>[1] Biobase_2.4.0
> >>>
> >>>
> >>>Thanks in advance,
> >>>
> >>>Thomas
> >>>
> >>>_______________________________________________
> >>>Bioc-sig-sequencing mailing list
> >>>Bioc-sig-sequencing at r-project.org
> >>>https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>-- 
> >>Hervé Pagès
> >>
> >>Program in Computational Biology
> >>Division of Public Health Sciences
> >>Fred Hutchinson Cancer Research Center
> >>1100 Fairview Ave. N, M2-B876
> >>P.O. Box 19024
> >>Seattle, WA 98109-1024
> >>
> >>E-mail: hpages at fhcrc.org
> >>Phone:  (206) 667-5791
> >>Fax:    (206) 667-1319
> >>
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>