[Bioc-sig-seq] adapter removal

Wed Jan 14 15:03:26 CET 2009

Hi Dave --

"David A.G" <dasolexa at hotmail.com> writes:

> Hi Martin and list,
> going back to the adapter topic, I tried the srdistance() function and it
> works really quick!

yeah it's great; Herve Pages & Patrick Aboyoun deserve credit for this.

> I have a question regarding the edit distance. After having used tables() I
> find that the second top element is CTGGACTTTGTAGGATACCCTCGCTTTCCTGCTC with
> 890 occurrences. The edit distance for this sequence is of course 0. But if I
> try only part of it as an adapter, CTGGACTTTGTAGGATA, the lowest edit distance
> is 17 and there are 14758 reads with this sequence.
> Would you just use 17 as the distance cutoff?

If I wanted to find just those 890 occurrences in an object 'fq' with
the adapter at the start, I'd use srdistance on a DNAStringSet created
from just those nucleotides, along the lines of

> d <- srdistance(DNAStringSet(sread(fq), 1, 17), "CTGGACTTTGTAGGATA")
> fg[d != 0]

It's not really clear that the adapters will always be exactly at the
beginning, etc., so this might just be a starting point. Creating a
new DNAStringSet is not particularly memory- or CPU-intensive, so
creating it 'on the fly' as in the above example should be
reasonable. I think Patrick is working on a one-stop-shopping adapter
removal function in Biostrings.

> Also,is it possible to merge two ShortReadQ objects, or, having subset an
> ShortReadQ object in two by indexing, how do I put them back together?

The ability to 'append' two ShortReadQ objects was added just
yesterday

> append(fq, fq)

It should be available in the 'devel' branch using biocLite after
about 1pm Seattle time, all being well; look for ShortRead version >=
1.1.33. It's possible to do this with the released version of
ShortRead, and I can provide off-line hints if you'd like, but I'd
recommend working with the development version of R and these
packages.

I'm curious to know the use case here; the need to append hasn't come
up in my own work.

Martin

> Cheers,
> Dave
>> To: dasolexa at hotmail.com
>> CC: casioqv at usermail.com; bioc-sig-sequencing at r-project.org
>> Subject: Re: [Bioc-sig-seq] adapter removal
>> From: mtmorgan at fhcrc.org
>> Date: Fri, 9 Jan 2009 19:19:18 -0800
>>
>> "David A.G" <dasolexa at hotmail.com> writes:
>>
>> > Jim, thanks for pointing that out, I hadn´t faced that issue yet. The
> "reads" object I created only had the sequences and not the qualities, and
> after editing the adapters it is interesting to only have the quality calls
> for the remaining bases.
>> >
>> > Tyler and Martin, thanks for the information on writing back to fastq.
>>
>> There is a now an initial version of writeFastq in the 'devel' version
>> of ShortRead. It is memory and time efficient; try
>>
>> > example(writeFastq)
>> > ?writeFastq
>>
>> please let me know if there are problems, or other missing parts of
>> ShortRead.
>>
>> Martin
>>
>> > Dave
>> >
>> >> Date: Thu, 8 Jan 2009 15:12:01 -0800
>> >> From: casioqv at usermail.com
>> >> To: bioc-sig-sequencing at r-project.org
>> >> Subject: Re: [Bioc-sig-seq] adapter removal
>> >>
>> >> Here's another "quick and dirty" ShortReadQ-fastq exporter. It's very
> slow...
>> >>
>> >> exportFastaQ <- function(shortReadQObject, fileName) {
>> >> ## dump ShortReadQ object data to FASTQ format
>> >> entries <- length(shortReadQObject)
>> >> i <- 1
>> >> while (i <= entries){
>> >> cat(append("@", as.character(id(shortReadQObject)[[i]])), sep="",
>> >> file=fileName, append=TRUE)
>> >> cat("\n", file=fileName, append=TRUE)
>> >> cat(as.character(sread(shortReadQObject)[i]), sep="",
>> >> file=fileName, append=TRUE)
>> >> cat("\n", file=fileName, append=TRUE)
>> >> cat(append("+", as.character(id(shortReadQObject)[[i]])), sep="",
>> >> file=fileName, append=TRUE)
>> >> cat("\n", file=fileName, append=TRUE)
>> >> cat(as.character(quality(shortReadQObject)[[i]]), sep="",
>> >> file=fileName, append=TRUE)
>> >> cat("\n", file=fileName, append=TRUE)
>> >> i <- i + 1
>> >> }
>> >> }
>> >>
>> >> Sincerely,
>> >> Tyler William H Backman
>> >> Bioinformatics Analyst
>> >> Institute for Integrative Genome Biology (IIGB)
>> >> Department of Botany and Plant Sciences
>> >> E-mail: tyler.backman at ucr.edu
>> >> 1007 Noel T. Keen Hall
>> >> University of California
>> >> Riverside, CA 92521
>> >>
>> >> > Hi Jim --
>> >> >
>> >> > "James W. MacDonald" <jmacdon at med.umich.edu> writes:
>> >> >
>> >> >> As a follow-up question, once you have removed the adapter sequences
>> >> >> is it then possible to write the ShortReadQ object back out in a fastq
>> >> >> format for aligning with e.g., bowtie? I know I can use
>> >> >> write.XstringSet() to output in fasta format, but I would like to feed
>> >> >> bowtie the quality scores as well.
>> >> >
>> >> > Actually a 'writeFastq' is on my 'do this week' list. In the mean time
>> >> > my best (untested) suggestion is along the lines of
>> >> >
>> >> > n <- length(fq)
>> >> > lines <- character(n)
>> >> > lines[(seq_len(n)) * 4 - 3] <-
>> >> > paste("@", as.character(id(fq)), sep="")
>> >> > lines[(seq_len(n)) * 4 - 2] <-
>> >> > as.character(sread(rfq1))
>> >> > lines[(seq_len(n)) * 4 - 1] <-
>> >> > paste("+", as.character(id(fq)), sep="")
>> >> > lines[(seq_len(n)) * 4] <-
>> >> > as.character(quality(quality(fq)))
>> >> > writeLines(lines, "my.fastq")
>> >> >
>> >> > This will be unnecessarily memory intensive.
>> >> >
>> >> >> Or is the speed reasonably comparable to use the PDict functions in
>> >> >> Biostrings to do the aligning?
>> >> >
>> >> > Bowtie is very fast. One potential 'gotcha' is the orientation of
>> >> > reads throughout the workflow -- Eland _sequence.txt files are the
>> >> > reads as they come from the machine; MAQ (and I think Bowtie) aligned
>> >> > files report reads that align to the '-' strand in their reverse
>> >> > complement sequence, with qualities reversed to match the reverse
>> >> > compelement read, and with the alignment position recorded at the
>> >> > 'left-most' (rather than 5', say) end. Not that it will be a problem
>> >> > in the current use case, just something to be aware of as you move
>> >> > between tools.
>> >> >
>> >> > Martin
>> >> >
>> >> >> Best,
>> >> >>
>> >> >> Jim
>> >> >>
>> >> >>
>> >> >>
>> >> >> Martin Morgan wrote:
>> >> >>> "David A.G" <dasolexa at hotmail.com> writes:
>> >> >>>
>> >> >>>> Dear list,
>> >> >>>>
>> >> >>>> I have some experience with Bioconductor but am newbie to this list
>> >> >>>> and to NGS. I am trying to remove some adapters from my solexa
>> >> >>>> s_N_sequence.txt file using Biostrings and ShortRead packages and
> the
>> >> >>>> vignettes. I managed to read in the text file and got to save the
>> >> >>>> reads as follows
>> >> >>>>
>> >> >>>> fqpattern <- "s_4_sequence.txt"
>> >> >>>> f4 <- file.path(analysisPath(sp), fqpattern)
>> >> >>>> fq4 <- readFastq(sp, fqpattern)
>> >> >>>> reads <- sread(fq4) #"reads" contains more than 4 million 34-length
>> >> >>>> fragments
>> >> >>>>
>> >> >>>> Having the following adapter sequence:
>> >> >>>>
>> >> >>>> adapter <- DNAString("ACGGATTGTTCAGT")
>> >> >>>>
>> >> >>>> I tried to mimic the example in the Biostring vignette as follows:
>> >> >>>>
>> >> >>>>
>> >> >>>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type =
> "overlap")
>> >> >>>>
>> >> >>>> but after more than two hours the process is still running.
>> >> >>>>
>> >> >>>> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with
>> >> >>>> 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread
> on
>> >> >>>> adapter removal but does not clear things much to me, since as far
> as
>> >> >>>> I understood the option mentioned in the thread is not appropriate
>> >> >>>> (quote :(though apparently this is not entirely satisfactory, see
> the
>> >> >>>> second entry!)).
>> >> >>>>
>> >> >>>> Is this just a memory issue or am I doing something wrong? Shall I
>> >> >>>> leave the process to run for longer?
>> >> >>> As a bit of a follow-up, a reasonable strategy is to figure out how
>> >> >>> long / how much memory the calculation takes on smaller chunks of
>> >> >>> data.
>> >> >>> Also, here I 'clean' (remove reads with 'N's) and then calculate the
>> >> >>> srdistance to your adapter on a lane with 5978132 reads in about a
>> >> >>> minute
>> >> >>>
>> >> >>>> fq <- readFastq(sp, "s_1_sequence.txt")
>> >> >>>> fq
>> >> >>> class: ShortReadQ
>> >> >>> length: 5978132 reads; width: 53 cycles
>> >> >>>> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))
>> >> >>> user system elapsed 60.960 0.000 60.961 I might then
>> >> >>> tabulate the distances
>> >> >>>
>> >> >>>> table(dist[[1]])
>> >> >>> and subset fq based on some criterion
>> >> >>> fq[dist[[1]] > 4]
>> >> >>> There's a second interface to this, too, using filters, e.g.,
>> >> >>> fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]
>> >> >>> Martin
>> >> >>>
>> >> >>>> TIA for your help,
>> >> >>>>
>> >> >>>> Dave
>> >> >>>>
>> >> >>>> _________________________________________________________________
>> >> >>>> Show them the way! Add maps and directions to your party invites.
>> >> >>>>
>> >> >>>> [[alternative HTML version deleted]]
>> >> >>>>
>> >> >>>> _______________________________________________
>> >> >>>> Bioc-sig-sequencing mailing list
>> >> >>>> Bioc-sig-sequencing at r-project.org
>> >> >>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >> >>>
>> >> >>
>> >> >> --
>> >> >> James W. MacDonald, M.S.
>> >> >> Biostatistician
>> >> >> Hildebrandt Lab
>> >> >> 8220D MSRB III
>> >> >> 1150 W. Medical Center Drive
>> >> >> Ann Arbor MI 48109-5646
>> >> >> 734-936-8662
>> >> >
>> >> > --
>> >> > Martin Morgan
>> >> > Computational Biology / Fred Hutchinson Cancer Research Center
>> >> > 1100 Fairview Ave. N.
>> >> > PO Box 19024 Seattle, WA 98109
>> >> >
>> >> > Location: Arnold Building M2 B169
>> >> > Phone: (206) 667-2793
>> >> >
>> >> > _______________________________________________
>> >> > Bioc-sig-sequencing mailing list
>> >> > Bioc-sig-sequencing at r-project.org
>> >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >> >
>> >> >
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> Bioc-sig-sequencing at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >
>> > _________________________________________________________________
>> > Show them the way! Add maps and directions to your party invites.
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioc-sig-sequencing mailing list
>> > Bioc-sig-sequencing at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>> --
>> Martin Morgan
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M2 B169
>> Phone: (206) 667-2793
>
> ------------------------------------------------------------------------------
>
> Get news, entertainment and everything you care about at Live.com. [[Check it
> out!]]

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793