[BioC] Trouble creating a ShortReadQ object and using writeFastq

Thu Feb 21 18:48:18 CET 2013

On 02/21/2013 08:47 AM, NOAH LEE DOWELL wrote:
> Thank you Martin!
>
> Once again you caught a simple mistake I was making with regards to using the
> wrong argument (filepath instead of file). Changing it to 'file' allowed
> writeFastq to work but then reading the newly written fastq file threw the same
> "line error" when I tried to read it in with readFastq.
>
> It is hard to find anything wrong with that specific line.  It is a line
> containing DNA sequence that is 23728 nt long.  I opened the file in vi and went

The line length itself might have caused problems; ShortRead was written when 
short reads were 36 nt (back in the day...) though it could accommodate short 
reads of up to about 20k. I made a temporary patch that should handle much 
longer reads (200k) available via biocLite as 1.16.4 probably noon tomorrow, PST.

> to that line and looked for anything odd but did not "see" anything to edit.
>   The exact same data in fasta format can be read in using readBStringSet(
> format = "fasta") in the BioStrings package.  So length of sequence does not
> seem to be a problem and maybe this particular file is just corrupted.  Sorry
> for bothering the list with this issue.

They're using different input mechanisms, so have different limits.
>
> Moving forward the utilities in Biostrings and ShortRead for manipulating fasta
> files are very powerful for people with PacBio long reads or assembling genomes
> and extracting contigs/scaffolds associated with specific genes.  This was not
> the intended user function but I thought it might be useful to provide another
> user scenario for the developers.
>
>   Extending the FASTQ functions to work on reads of non-uniform length might be
> a worthwhile discussion for your team. Alternatively, allowing slots within

Generally, uneven read lengths in ShortRead are not (meant to be) a problem. 
Maybe you can point to some specific issues you're having?

Martin

> SequenceSummary objects (FASTQSummary in qrqc package) to be accessed by the
> same methods as a XString-Set object could leverage the power of both BioStrings
> and qrqc packages. I recognize these are non-trivial requests.
>
> Thank you for all the great work you and your Bioconductor team do.
>
> Best,
> Noah
>
>
> On Wed, Feb 20, 2013 at 5:08 PM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
>     Hi Noah --
>
>
>     On 2/20/2013 2:47 PM, Noah Dowell wrote:
>
>         Hello All,
>
>         I am trying to do some preprocessing quality analysis on PacBio reads.
>           To make any plots of quality score or nucleotide frequency one should
>         only look at reads of the same length.  One output from PacBio filtering
>         of the raw data is a FASTQ file.  The nature of the PacBio reads is that
>         they lack uniformity in their length dimension.   I am pretty sure the
>         file is okay because the readSeqFile() function in the QRQC package is
>         able to read the data in and the FASTQSummary object can be used to make
>         plots.  Given the non-uniform length of reads though those plots show
>         artifacts.  As far as can tell the FASTQSummary object can not be
>         manipulated like a ShortRead or BioStrings object.
>
>         So I turn to the ShortRead package.  I have also successfully used the
>         readFastq() function in  ShortRead package to read in Illumina reads of
>         different length and then preprocess or slice-n-dice those reads.  I am
>         getting the following error though when I try this on PacBio fastq files:
>         ##############################__##############################__#########################
>
>             dirPath =
>             "/Users/ndowell/Documents/__PacBio/130204_NLD1_XL_12cells___255/017249_12cells_filtered/__data/"
>             pat1 = "filteredSTD_12cells.fastq"
>             seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file =
>             "filteredSTD_12cells.fastq")
>
>         Error: Input/Output
>             file(s):
>
>           /Users/ndowell/Documents/__PacBio/130204_NLD1_XL_12cells___255/017249_12cells_filtered/__data//filteredSTD_12cells.__fastq
>             message: line too long
>         /Users/ndowell/Documents/__PacBio/130204_NLD1_XL_12cells___255/017249_12cells_filtered/__data//filteredSTD_12cells.__fastq:17521
>
>
>     is there something unusual about this line of this file, e.g., blank or
>     otherwise? How long are reads, anyway? Maybe you can trigger this in a
>     subset of the file with just one or two records, and you'd be willing to
>     share that with me? The error could come from one of two places in the C
>     code, and both are preceded by a comment 'this should never happen'
>
>
>
>
>         ##############################__##############################__#########################
>
>         So I tried to manually build a ShortReadQ file using the following
>         (there are generally 4 distinct lines in a FASTQ file):
>         #workaround to error with too long lines
>         seqTemp <- readLines("filteredSTD___12cells.fastq")
>         seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)])
>         ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)])
>         qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)])
>
>         SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids)
>
>         #This worked!!:
>         length(SeqClean) #484232
>         summary(width(SeqClean))
>         Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>         50    1368    2153    2785    3567   26940
>
>         #But I get this error when I try to write the file (changing withIds to
>         TRUE or full to TRUE gives same error):
>
>             writeFastq(SeqClean, filepath = "cleanReads.fastq", format =
>             "fastq", mode = "w", full = FALSE, withIds = FALSE)
>
>         Error in function (classes, fdef, mtable)  :
>             unable to find an inherited method for function ‘writeFastq’ for
>         signature ‘"ShortReadQ", "missing"’
>
>
>      > args(writeFastq)
>     function (object, file, mode = "w", full = FALSE, ...)
>     NULL
>
>     so I think your 'filepath' should be file="cleanReads.fastq".
>
>
>
>         #I think something is not quite right with my construction of the
>         ShortReadQ object and specifically the read IDs
>         #I wonder if it is because it is slotting the ids into the wrong place:
>
>             head(ids)
>
>             A BStringSet instance of length 6
>               width seq
>         [1]    72
>         @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/21/0_5104
>         [2]    75
>         @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/53/2430_3193
>         [3]    75
>         @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/53/3237_6580
>         [4]    75
>         @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/53/6626_9948
>         [5]    72
>         @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/54/0_4929
>         [6]    74
>         @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/81/276_2420
>         ##############################__##############################__#########################
>
>         My questions:
>         1. Is there any way to leverage the ability to slice-n-dice a FASTQ data
>         set in the ShortRead package and then write to a new FASTQ file?
>
>         2 Should I just try to add in names to my sequences manually (to the
>         ShortReadQ object) to alleviate the 'mssing' error above?
>
>         3. Is there any desire from the developers to merge the id() and names()
>         functions?
>
>
>     would have been a good choice, in retrospect ;)
>
>     Martin
>
>
>
>
>         Any help/hints would be appreciated.  I apologize for the lack of a
>         toy-reproducible example; I was unsuccessful in my attempt to create a
>         toy quality score.  Thank you in advance for your assistance and time.
>
>         Best,
>         Noah
>
>
>
>
>
>
>
>
>         ##############################__##############################__#########################
>
>
>             sessionInfo()
>
>         R version 2.15.2 (2012-10-26)
>         Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit)
>
>         locale:
>         [1] en_US.UTF-8/en_US.UTF-8/en_US.__UTF-8/C/en_US.UTF-8/en_US.UTF-__8
>
>         attached base packages:
>         [1] grid      stats     graphics  grDevices utils     datasets  methods
>            base
>
>         other attached packages:
>            [1] qrqc_1.12.0          testthat_0.7         xtable_1.7-0
>         brew_1.0-6           biovizBase_1.6.2     ggplot2_0.9.3
>            [7] reshape_0.8.4        plyr_1.8             org.Hs.eg.db_2.8.0
>         RSQLite_0.11.2       DBI_0.2-5            AnnotationDbi_1.20.3
>         [13] Biobase_2.18.0       BiocInstaller_1.8.3  ShortRead_1.16.3
>         latticeExtra_0.6-24  RColorBrewer_1.0-5   Rsamtools_1.10.2
>         [19] lattice_0.20-13      Biostrings_2.26.3    GenomicRanges_1.10.6
>         IRanges_1.16.5       BiocGenerics_0.4.0
>
>         loaded via a namespace (and not attached):
>            [1] biomaRt_2.14.0         bitops_1.0-4.2         BSgenome_1.26.1
>             cluster_1.14.3         colorspace_1.2-1
>            [6] dichromat_2.0-0        digest_0.6.2           evaluate_0.4.3
>              GenomicFeatures_1.10.1 gtable_0.1.2
>         [11] Hmisc_3.10-1           hwriter_1.3            labeling_0.1
>            MASS_7.3-23            munsell_0.4
>         [16] parallel_2.15.2        proto_0.3-10           RCurl_1.95-3
>            reshape2_1.2.2         rtracklayer_1.18.2
>         [21] scales_0.2.3           stats4_2.15.2          stringr_0.6.2
>           tools_2.15.2           XML_3.95-0.1
>         [26] zlibbioc_1.4.0
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>     --
>     Dr. Martin Morgan, PhD
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N.
>     PO Box 19024 Seattle, WA 98109
>
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793