[Bioc-sig-seq] adapter removal

Martin Morgan mtmorgan at fhcrc.org
Sat Jan 10 04:19:18 CET 2009


"David A.G" <dasolexa at hotmail.com> writes:

> Jim, thanks for pointing that out, I hadn´t faced that issue yet. The "reads" object I created only had the sequences and not the qualities, and after editing the adapters it is interesting to only have the quality calls for the remaining bases. 
>
> Tyler and Martin, thanks for the information on writing back to fastq.

There is a now an initial version of writeFastq in the 'devel' version
of ShortRead. It is memory and time efficient; try

> example(writeFastq)
> ?writeFastq

please let me know if there are problems, or other missing parts of
ShortRead.

Martin

> Dave
>
>> Date: Thu, 8 Jan 2009 15:12:01 -0800
>> From: casioqv at usermail.com
>> To: bioc-sig-sequencing at r-project.org
>> Subject: Re: [Bioc-sig-seq] adapter removal
>> 
>> Here's another "quick and dirty" ShortReadQ-fastq exporter. It's very slow...
>> 
>> exportFastaQ <- function(shortReadQObject, fileName) {
>>     ## dump ShortReadQ object data to FASTQ format
>>     entries <- length(shortReadQObject)
>>     i <- 1
>>     while (i <= entries){
>>         cat(append("@", as.character(id(shortReadQObject)[[i]])), sep="",
>> file=fileName, append=TRUE)
>>         cat("\n", file=fileName, append=TRUE)
>>         cat(as.character(sread(shortReadQObject)[i]), sep="",
>> file=fileName, append=TRUE)
>>         cat("\n", file=fileName, append=TRUE)
>>         cat(append("+", as.character(id(shortReadQObject)[[i]])), sep="",
>> file=fileName, append=TRUE)
>>         cat("\n", file=fileName, append=TRUE)
>>         cat(as.character(quality(shortReadQObject)[[i]]), sep="",
>> file=fileName, append=TRUE)
>>         cat("\n", file=fileName, append=TRUE)
>>                 i <- i + 1
>>         }
>> }
>> 
>> Sincerely,
>> Tyler William H Backman
>> Bioinformatics Analyst
>> Institute for Integrative Genome Biology (IIGB)
>> Department of Botany and Plant Sciences
>> E-mail: tyler.backman at ucr.edu
>> 1007 Noel T. Keen Hall
>> University of California
>> Riverside, CA 92521
>> 
>> > Hi Jim --
>> >
>> > "James W. MacDonald" <jmacdon at med.umich.edu> writes:
>> >
>> >> As a follow-up question, once you have removed the adapter sequences
>> >> is it then possible to write the ShortReadQ object back out in a fastq
>> >> format for aligning with e.g., bowtie? I know I can use
>> >> write.XstringSet() to output in fasta format, but I would like to feed
>> >> bowtie the quality scores as well.
>> >
>> > Actually a 'writeFastq' is on my 'do this week' list. In the mean time
>> > my best (untested) suggestion is along the lines of
>> >
>> > n <- length(fq)
>> > lines <- character(n)
>> > lines[(seq_len(n)) * 4 - 3] <-
>> >    paste("@", as.character(id(fq)), sep="")
>> > lines[(seq_len(n)) * 4 - 2] <-
>> >    as.character(sread(rfq1))
>> > lines[(seq_len(n)) * 4 - 1] <-
>> >     paste("+", as.character(id(fq)), sep="")
>> > lines[(seq_len(n)) * 4] <-
>> >     as.character(quality(quality(fq)))
>> > writeLines(lines, "my.fastq")
>> >
>> > This will be unnecessarily memory intensive.
>> >
>> >> Or is the speed reasonably comparable to use the PDict functions in
>> >> Biostrings to do the aligning?
>> >
>> > Bowtie is very fast. One potential 'gotcha' is the orientation of
>> > reads throughout the workflow -- Eland _sequence.txt files are the
>> > reads as they come from the machine; MAQ (and I think Bowtie) aligned
>> > files report reads that align to the '-' strand in their reverse
>> > complement sequence, with qualities reversed to match the reverse
>> > compelement read, and with the alignment position recorded at the
>> > 'left-most' (rather than 5', say) end. Not that it will be a problem
>> > in the current use case, just something to be aware of as you move
>> > between tools.
>> >
>> > Martin
>> >
>> >> Best,
>> >>
>> >> Jim
>> >>
>> >>
>> >>
>> >> Martin Morgan wrote:
>> >>> "David A.G" <dasolexa at hotmail.com> writes:
>> >>>
>> >>>> Dear list,
>> >>>>
>> >>>> I have some experience with Bioconductor but am newbie to this list
>> >>>> and to NGS. I am trying to remove some adapters from my solexa
>> >>>> s_N_sequence.txt file using Biostrings and ShortRead packages and the
>> >>>> vignettes.  I managed to read in the text file and got to save the
>> >>>> reads as follows
>> >>>>
>> >>>> fqpattern <- "s_4_sequence.txt"
>> >>>> f4 <- file.path(analysisPath(sp), fqpattern)
>> >>>> fq4 <- readFastq(sp, fqpattern)
>> >>>> reads <- sread(fq4)  #"reads" contains more than 4 million 34-length
>> >>>> fragments
>> >>>>
>> >>>> Having the following adapter sequence:
>> >>>>
>> >>>> adapter <- DNAString("ACGGATTGTTCAGT")
>> >>>>
>> >>>> I tried to mimic the example in the Biostring vignette as follows:
>> >>>>
>> >>>>
>> >>>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap")
>> >>>>
>> >>>> but after more than two hours the process is still running.
>> >>>>
>> >>>> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with
>> >>>> 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread on
>> >>>> adapter removal but does not clear things much to me, since as far as
>> >>>> I understood the option mentioned in the thread is not appropriate
>> >>>> (quote :(though apparently this is not entirely satisfactory, see the
>> >>>> second entry!)).
>> >>>>
>> >>>> Is this just a memory issue or am I doing something wrong? Shall I
>> >>>> leave the process to run for longer?
>> >>> As a bit of a follow-up, a reasonable strategy is to figure out how
>> >>> long / how much memory the calculation takes on smaller chunks of
>> >>> data.
>> >>> Also, here I 'clean' (remove reads with 'N's) and then calculate the
>> >>> srdistance to your adapter on a lane with 5978132 reads in about a
>> >>> minute
>> >>>
>> >>>> fq <- readFastq(sp, "s_1_sequence.txt")
>> >>>> fq
>> >>> class: ShortReadQ
>> >>> length: 5978132 reads; width: 53 cycles
>> >>>> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))
>> >>>    user  system elapsed  60.960   0.000  60.961 I might then
>> >>> tabulate the distances
>> >>>
>> >>>> table(dist[[1]])
>> >>> and subset fq based on some criterion
>> >>> fq[dist[[1]] > 4]
>> >>> There's a second interface to this, too, using filters, e.g.,
>> >>> fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]
>> >>> Martin
>> >>>
>> >>>> TIA for your help,
>> >>>>
>> >>>> Dave
>> >>>>
>> >>>> _________________________________________________________________
>> >>>> Show them the way! Add maps and directions to your party invites.
>> >>>>
>> >>>> 	[[alternative HTML version deleted]]
>> >>>>
>> >>>> _______________________________________________
>> >>>> Bioc-sig-sequencing mailing list
>> >>>> Bioc-sig-sequencing at r-project.org
>> >>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >>>
>> >>
>> >> --
>> >> James W. MacDonald, M.S.
>> >> Biostatistician
>> >> Hildebrandt Lab
>> >> 8220D MSRB III
>> >> 1150 W. Medical Center Drive
>> >> Ann Arbor MI 48109-5646
>> >> 734-936-8662
>> >
>> > --
>> > Martin Morgan
>> > Computational Biology / Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N.
>> > PO Box 19024 Seattle, WA 98109
>> >
>> > Location: Arnold Building M2 B169
>> > Phone: (206) 667-2793
>> >
>> > _______________________________________________
>> > Bioc-sig-sequencing mailing list
>> > Bioc-sig-sequencing at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >
>> >
>> 
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> _________________________________________________________________
> Show them the way! Add maps and directions to your party invites. 
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list