[BioC] ShortRead 3' trimming negative length vectors are not allowed
Martin Morgan
mtmorgan at fhcrc.org
Tue May 21 15:44:20 CEST 2013
On 05/17/2013 12:55 PM, Sam McInturf wrote:
> A few follow up questions.
> I wan to trim my reads to keep everything up until the first base with a quality
> below 20. Is trimEnds() the right function for this, or is trimTail(k = 1, a =
> "see below") the way to go?
>
> In your reply, you used "V" as the cut off, what is that Q score? My reads use
> an offset of 33 for the qual scores, so ! (ascii = 33) is the lowest score? Is
> that consistent with what ShortRead is doing? I would like a cutoff of 20, so
> 33+20 = 53, which is "5", is this what I should be telling trimEnds?
Sorry not to have answered sooner. Yes, it sounds like trimTail is what you're
after. The letter / quality score encoding depends on class(quality(reads)), with
FastqQuality:
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 :
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
; < = > ? @ A B C D E F G H I J
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
SFastqQuality:
; < = > ? @ A B C D E F G H I J K L M N O P Q R S T
-5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
U V W X Y Z [ \\ ] ^ _ ` a b c d e f g h i
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
In response to your question, I made it so that in the 'devel' version of
Bioconductor / ShortRead
encoding(FastqQuality())
returns this information. I also added a function filterFastq that allows
space-efficient filtering from one fastq file to another; there are also
trimTails and friends that use this to transform a character vector of 'object's
to new files given by the argument 'destinations'. Again this is available in
the devel branch of ShortRead.
Thanks for the questions...
Martin
>
> Best
>
>
> On Fri, May 17, 2013 at 1:38 PM, Sam McInturf <smcinturf at gmail.com
> <mailto:smcinturf at gmail.com>> wrote:
>
> Martin,
> I was unaware of the trimEnds() function, it worked wonderfully without
> having to use your modified trimEnds(). I imagine it is because the cluster
> has a lot of memory available. I ran it a second time using the
> modification, but it was much slower.
>
> Thanks again!
>
>
> On Fri, May 17, 2013 at 12:20 PM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
> On 05/17/2013 09:19 AM, Sam McInturf wrote:
>
> Hello everyone,
> I am trying to do some quality trimming on some RNA seq reads
> (Illumina,
> 100 bp, single end). I have been following the pipeline from the UCR
> manuals<http://manuals.__bioinformatics.ucr.edu/home/__ht-seq#TOC-Trimming-Low-__Quality-3-Tails-from-R
> <http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Trimming-Low-Quality-3-Tails-from-R>>
>
> but
> I have run into an error I can't seem to resolve.
>
> library(ShortRead)
> reads <- readFastq("myFile.fastq")
>
> qualityCutoff <- 10 # remove read tails with quality lower than this
>
> seqs <- sread(reads) # get sequence list
>
> qual <- PhredQuality(quality(quality(__reads))) # get quality score
> list as
> PhredQuality
>
> length(qual) #my length is positive
>
> [1] 39145018
>
> myqual_mat <- matrix(charToRaw(as.character(__unlist(qual))),
>
> nrow=length(qual), byrow=TRUE)
>
>
> I'm guessing that what's happening is that as.character(unlist(qual)) is
> making a very large vector sum(width(qual)), larger than can be
> represented in (your) version of R (the output of sessionInfo() would be
> helpful...).
>
> But I don't think you need to go this route; have you looked at
> ShortRead::trimEnds & friends? For instance, after running
>
> exmample(trimEnds)
>
> I have an object 'rfq' with some qualities
>
> > quality(rfq)
> class: SFastqQuality
> quality:
> A BStringSet instance of length 256
> width seq
> [1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]__VCHVMPLAS
> [2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][__PZPICCK
> [3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]__TMJEUXEFLA
> [4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]__RJRZTQLOA
> [5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]__MJUJVLSS
> ... ... ...
> [252] 36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]__MJQNSAOC
> [253] 36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]__VTVVRVMSM
> [254] 36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]__YYVVTSZUOOHH
> [255] 36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[__OVXEJSJ
> [256] 36 ]]]]]]]]]]]Y]]T]T]]]]__TRYVMEVVRSRHHNH
>
> And I can simultaneously trim reads and sequences with
>
> trimEnds(rfq, "V")
>
> which you can see in action as
>
> > quality(trimEnds(rfq, "V"))
> class: SFastqQuality
> quality:
> A BStringSet instance of length 256
> width seq
> [1] 27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]
> [2] 31 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][__PZ
> [3] 32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]__TMJEUX
> [4] 31 ]]]]]]]]]]]]]]]]]]]]]]T]]]]__RJRZ
> [5] 28 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]
> ... ... ...
> [252] 28 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]
> [253] 27 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]
> [254] 31 ]]]]Y]Y]]]]]]]OYY]]]Y]]]__YYVVTSZ
> [255] 32 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[__OVX
> [256] 24 ]]]]]]]]]]]Y]]T]T]]]]TRY
>
> The following might be helpful to transform a large file --
>
> setMethod(trimEnds, "character",
> function(object, a, left = TRUE, right = TRUE,
> relation = c("<=", "=="), ...,
> destination, yieldSize=1000000, ranges = FALSE)
> {
> if (missing('destination'))
> stop("'destination' required")
> strm <- FastqStreamer(object, yieldSize)
> tot <- nNuc <- nTrimNuc <- 0L
> while (length(fq <- yield(strm))) {
> tot <- tot + length(fq)
> nNuc <- nNuc + sum(width(fq))
> fq <- trimEnds(fq, a, left, right, relation, ..., ranges=ranges)
> nTrimNuc <- nTrimNuc + sum(width(fq))
> writeFastq(fq, destination, "a")
> }
> list(destination=destination,
> c(TotalReads=tot, TotalNucleotides=nNuc,
> TrimmedNucleotides=nNuc - nTrimNuc))
> })
>
> provide the first argument to trimEnds as a simple character vector, and
> provide a 'destination' file name. For example
>
> ## just a convenient path to a fastq file
> fl <- file.path(analysisPath(sp), "s_1_sequence.txt")
> trimEnds(fl, "V", destination=tempfile())
>
> A variation of this will be added to the 'devel' version of ShortRead,
> so if there are suggestions for improvements please let me know.
>
> Martin
>
>
>
>
> Error in .Call(.NAME, ..., PACKAGE = PACKAGE) :
> negative length vectors are not allowed
>
> as(qual, matrix) gives the same error, and as.matrix(qual) does as well,
> not sure how those two commands are different, but tried anyway.
>
> If I run the exact same commands, instead of the full 39145018
> reads, I use
> the first 100, it works like a charm. I am working on a linux
> cluster, if
> that is important
>
> Does anyone have an idea?
>
> Cheers!
>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>
>
>
>
> --
> Sam McInturf
>
>
>
>
> --
> Sam McInturf
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list