[BioC] Trouble creating a ShortReadQ object and using writeFastq
Martin Morgan
mtmorgan at fhcrc.org
Thu Feb 21 18:48:18 CET 2013
On 02/21/2013 08:47 AM, NOAH LEE DOWELL wrote:
> Thank you Martin!
>
> Once again you caught a simple mistake I was making with regards to using the
> wrong argument (filepath instead of file). Changing it to 'file' allowed
> writeFastq to work but then reading the newly written fastq file threw the same
> "line error" when I tried to read it in with readFastq.
>
> It is hard to find anything wrong with that specific line. It is a line
> containing DNA sequence that is 23728 nt long. I opened the file in vi and went
The line length itself might have caused problems; ShortRead was written when
short reads were 36 nt (back in the day...) though it could accommodate short
reads of up to about 20k. I made a temporary patch that should handle much
longer reads (200k) available via biocLite as 1.16.4 probably noon tomorrow, PST.
> to that line and looked for anything odd but did not "see" anything to edit.
> The exact same data in fasta format can be read in using readBStringSet(
> format = "fasta") in the BioStrings package. So length of sequence does not
> seem to be a problem and maybe this particular file is just corrupted. Sorry
> for bothering the list with this issue.
They're using different input mechanisms, so have different limits.
>
> Moving forward the utilities in Biostrings and ShortRead for manipulating fasta
> files are very powerful for people with PacBio long reads or assembling genomes
> and extracting contigs/scaffolds associated with specific genes. This was not
> the intended user function but I thought it might be useful to provide another
> user scenario for the developers.
>
> Extending the FASTQ functions to work on reads of non-uniform length might be
> a worthwhile discussion for your team. Alternatively, allowing slots within
Generally, uneven read lengths in ShortRead are not (meant to be) a problem.
Maybe you can point to some specific issues you're having?
Martin
> SequenceSummary objects (FASTQSummary in qrqc package) to be accessed by the
> same methods as a XString-Set object could leverage the power of both BioStrings
> and qrqc packages. I recognize these are non-trivial requests.
>
> Thank you for all the great work you and your Bioconductor team do.
>
> Best,
> Noah
>
>
> On Wed, Feb 20, 2013 at 5:08 PM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
> Hi Noah --
>
>
> On 2/20/2013 2:47 PM, Noah Dowell wrote:
>
> Hello All,
>
> I am trying to do some preprocessing quality analysis on PacBio reads.
> To make any plots of quality score or nucleotide frequency one should
> only look at reads of the same length. One output from PacBio filtering
> of the raw data is a FASTQ file. The nature of the PacBio reads is that
> they lack uniformity in their length dimension. I am pretty sure the
> file is okay because the readSeqFile() function in the QRQC package is
> able to read the data in and the FASTQSummary object can be used to make
> plots. Given the non-uniform length of reads though those plots show
> artifacts. As far as can tell the FASTQSummary object can not be
> manipulated like a ShortRead or BioStrings object.
>
> So I turn to the ShortRead package. I have also successfully used the
> readFastq() function in ShortRead package to read in Illumina reads of
> different length and then preprocess or slice-n-dice those reads. I am
> getting the following error though when I try this on PacBio fastq files:
> ##############################__##############################__#########################
>
> dirPath =
> "/Users/ndowell/Documents/__PacBio/130204_NLD1_XL_12cells___255/017249_12cells_filtered/__data/"
> pat1 = "filteredSTD_12cells.fastq"
> seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file =
> "filteredSTD_12cells.fastq")
>
> Error: Input/Output
> file(s):
>
> /Users/ndowell/Documents/__PacBio/130204_NLD1_XL_12cells___255/017249_12cells_filtered/__data//filteredSTD_12cells.__fastq
> message: line too long
> /Users/ndowell/Documents/__PacBio/130204_NLD1_XL_12cells___255/017249_12cells_filtered/__data//filteredSTD_12cells.__fastq:17521
>
>
> is there something unusual about this line of this file, e.g., blank or
> otherwise? How long are reads, anyway? Maybe you can trigger this in a
> subset of the file with just one or two records, and you'd be willing to
> share that with me? The error could come from one of two places in the C
> code, and both are preceded by a comment 'this should never happen'
>
>
>
>
> ##############################__##############################__#########################
>
> So I tried to manually build a ShortReadQ file using the following
> (there are generally 4 distinct lines in a FASTQ file):
> #workaround to error with too long lines
> seqTemp <- readLines("filteredSTD___12cells.fastq")
> seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)])
> ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)])
> qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)])
>
> SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids)
>
> #This worked!!:
> length(SeqClean) #484232
> summary(width(SeqClean))
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 50 1368 2153 2785 3567 26940
>
> #But I get this error when I try to write the file (changing withIds to
> TRUE or full to TRUE gives same error):
>
> writeFastq(SeqClean, filepath = "cleanReads.fastq", format =
> "fastq", mode = "w", full = FALSE, withIds = FALSE)
>
> Error in function (classes, fdef, mtable) :
> unable to find an inherited method for function ‘writeFastq’ for
> signature ‘"ShortReadQ", "missing"’
>
>
> > args(writeFastq)
> function (object, file, mode = "w", full = FALSE, ...)
> NULL
>
> so I think your 'filepath' should be file="cleanReads.fastq".
>
>
>
> #I think something is not quite right with my construction of the
> ShortReadQ object and specifically the read IDs
> #I wonder if it is because it is slotting the ids into the wrong place:
>
> head(ids)
>
> A BStringSet instance of length 6
> width seq
> [1] 72
> @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/21/0_5104
> [2] 75
> @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/53/2430_3193
> [3] 75
> @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/53/3237_6580
> [4] 75
> @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/53/6626_9948
> [5] 72
> @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/54/0_4929
> [6] 74
> @m130205_030957_42152___c10045396255000000152306890320__1342_s1_p0/81/276_2420
> ##############################__##############################__#########################
>
> My questions:
> 1. Is there any way to leverage the ability to slice-n-dice a FASTQ data
> set in the ShortRead package and then write to a new FASTQ file?
>
> 2 Should I just try to add in names to my sequences manually (to the
> ShortReadQ object) to alleviate the 'mssing' error above?
>
> 3. Is there any desire from the developers to merge the id() and names()
> functions?
>
>
> would have been a good choice, in retrospect ;)
>
> Martin
>
>
>
>
> Any help/hints would be appreciated. I apologize for the lack of a
> toy-reproducible example; I was unsuccessful in my attempt to create a
> toy quality score. Thank you in advance for your assistance and time.
>
> Best,
> Noah
>
>
>
>
>
>
>
>
> ##############################__##############################__#########################
>
>
> sessionInfo()
>
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.__UTF-8/C/en_US.UTF-8/en_US.UTF-__8
>
> attached base packages:
> [1] grid stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] qrqc_1.12.0 testthat_0.7 xtable_1.7-0
> brew_1.0-6 biovizBase_1.6.2 ggplot2_0.9.3
> [7] reshape_0.8.4 plyr_1.8 org.Hs.eg.db_2.8.0
> RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.3
> [13] Biobase_2.18.0 BiocInstaller_1.8.3 ShortRead_1.16.3
> latticeExtra_0.6-24 RColorBrewer_1.0-5 Rsamtools_1.10.2
> [19] lattice_0.20-13 Biostrings_2.26.3 GenomicRanges_1.10.6
> IRanges_1.16.5 BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.14.0 bitops_1.0-4.2 BSgenome_1.26.1
> cluster_1.14.3 colorspace_1.2-1
> [6] dichromat_2.0-0 digest_0.6.2 evaluate_0.4.3
> GenomicFeatures_1.10.1 gtable_0.1.2
> [11] Hmisc_3.10-1 hwriter_1.3 labeling_0.1
> MASS_7.3-23 munsell_0.4
> [16] parallel_2.15.2 proto_0.3-10 RCurl_1.95-3
> reshape2_1.2.2 rtracklayer_1.18.2
> [21] scales_0.2.3 stats4_2.15.2 stringr_0.6.2
> tools_2.15.2 XML_3.95-0.1
> [26] zlibbioc_1.4.0
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/__listinfo/bioconductor
> <https://stat.ethz.ch/mailman/listinfo/bioconductor>
> Search the archives:
> http://news.gmane.org/gmane.__science.biology.informatics.__conductor
> <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
> --
> Dr. Martin Morgan, PhD
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the Bioconductor
mailing list