[BioC] ShortRead 3' trimming negative length vectors are not allowed
Martin Morgan
mtmorgan at fhcrc.org
Fri May 17 19:20:56 CEST 2013
On 05/17/2013 09:19 AM, Sam McInturf wrote:
> Hello everyone,
> I am trying to do some quality trimming on some RNA seq reads (Illumina,
> 100 bp, single end). I have been following the pipeline from the UCR
> manuals<http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Trimming-Low-Quality-3-Tails-from-R>
> but
> I have run into an error I can't seem to resolve.
>
> library(ShortRead)
> reads <- readFastq("myFile.fastq")
>
> qualityCutoff <- 10 # remove read tails with quality lower than this
>
> seqs <- sread(reads) # get sequence list
>
> qual <- PhredQuality(quality(quality(reads))) # get quality score list as
> PhredQuality
>
>> length(qual) #my length is positive
> [1] 39145018
>
>> myqual_mat <- matrix(charToRaw(as.character(unlist(qual))),
> nrow=length(qual), byrow=TRUE)
I'm guessing that what's happening is that as.character(unlist(qual)) is making
a very large vector sum(width(qual)), larger than can be represented in (your)
version of R (the output of sessionInfo() would be helpful...).
But I don't think you need to go this route; have you looked at
ShortRead::trimEnds & friends? For instance, after running
exmample(trimEnds)
I have an object 'rfq' with some qualities
> quality(rfq)
class: SFastqQuality
quality:
A BStringSet instance of length 256
width seq
[1] 36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
[2] 36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
[3] 36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
[4] 36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
[5] 36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
... ... ...
[252] 36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAOC
[253] 36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254] 36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOOHH
[255] 36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJSJ
[256] 36 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHNH
And I can simultaneously trim reads and sequences with
trimEnds(rfq, "V")
which you can see in action as
> quality(trimEnds(rfq, "V"))
class: SFastqQuality
quality:
A BStringSet instance of length 256
width seq
[1] 27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]
[2] 31 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZ
[3] 32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUX
[4] 31 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZ
[5] 28 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]
... ... ...
[252] 28 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]
[253] 27 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]
[254] 31 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZ
[255] 32 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVX
[256] 24 ]]]]]]]]]]]Y]]T]T]]]]TRY
The following might be helpful to transform a large file --
setMethod(trimEnds, "character",
function(object, a, left = TRUE, right = TRUE,
relation = c("<=", "=="), ...,
destination, yieldSize=1000000, ranges = FALSE)
{
if (missing('destination'))
stop("'destination' required")
strm <- FastqStreamer(object, yieldSize)
tot <- nNuc <- nTrimNuc <- 0L
while (length(fq <- yield(strm))) {
tot <- tot + length(fq)
nNuc <- nNuc + sum(width(fq))
fq <- trimEnds(fq, a, left, right, relation, ..., ranges=ranges)
nTrimNuc <- nTrimNuc + sum(width(fq))
writeFastq(fq, destination, "a")
}
list(destination=destination,
c(TotalReads=tot, TotalNucleotides=nNuc,
TrimmedNucleotides=nNuc - nTrimNuc))
})
provide the first argument to trimEnds as a simple character vector, and provide
a 'destination' file name. For example
## just a convenient path to a fastq file
fl <- file.path(analysisPath(sp), "s_1_sequence.txt")
trimEnds(fl, "V", destination=tempfile())
A variation of this will be added to the 'devel' version of ShortRead, so if
there are suggestions for improvements please let me know.
Martin
> Error in .Call(.NAME, ..., PACKAGE = PACKAGE) :
> negative length vectors are not allowed
>
> as(qual, matrix) gives the same error, and as.matrix(qual) does as well,
> not sure how those two commands are different, but tried anyway.
>
> If I run the exact same commands, instead of the full 39145018 reads, I use
> the first 100, it works like a charm. I am working on a linux cluster, if
> that is important
>
> Does anyone have an idea?
>
> Cheers!
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list