[Bioc-sig-seq] adapter removal

Martin Morgan mtmorgan at fhcrc.org
Thu Jan 8 17:59:16 CET 2009


Hi Dave --

"David A.G" <dasolexa at hotmail.com> writes:

> Dear list,
>
> I have some experience with Bioconductor but am a newbie to this list
> and to NGS. I am trying to remove some adapters from my solexa
> s_N_sequence.txt file using Biostrings and ShortRead packages and the
> vignettes.  I managed to read in the text file and got to save the
> reads as follows
>
> fqpattern <- "s_4_sequence.txt"
> f4 <- file.path(analysisPath(sp), fqpattern)
> fq4 <- readFastq(sp, fqpattern)
> reads <- sread(fq4)
> # "reads" contains more than 4 million 34-length fragments
>
> Having the following adapter sequence:
>
> adapter <- DNAString("ACGGATTGTTCAGT")
>
> I tried to mimic the example in the Biostrings vignette as follows:
>
>
> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap")
>
> but after more than two hours the process is still running.

A couple of suggestions. The 'srdistance' function in ShortRead might
be your friend -- it calculates the edit distance between each read
and your adapter. You can then choose an appropriate threshold
(usually the distances are distinctly bimodal; use 'table' to
summarize the return value of srdistance).
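
A minimal sketch of that workflow, for illustration only. The toy reads
stand in for sread(fq4), and the cutoff of 4 is an arbitrary assumption,
not a recommendation; pick your own from the tabulated distances.

```r
library(ShortRead)

# Hypothetical stand-in for sread(fq4): a few 34-base toy reads
reads <- DNAStringSet(c(
    "ACGGATTGTTCAGTAAAAAAAAAAAAAAAAAAAA",
    "ACGGATTGTTCAGTCCCCCCCCCCCCCCCCCCCC",
    "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG",
    "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"))
adapter <- DNAString("ACGGATTGTTCAGT")

# srdistance returns a list with one element per subject; each element
# holds one distance per read. The adapter is passed as a plain
# character string here.
dist <- srdistance(reads, as.character(adapter))[[1]]

# Tabulate the distances to look for a natural break between
# adapter-like reads and everything else
print(table(dist))

# Assumed cutoff, for illustration only: keep reads farther than
# 'cutoff' from the adapter
cutoff <- 4
keep <- reads[dist > cutoff]
```

On the real data, 'reads' would simply be sread(fq4), and the table
should suggest where to put the cutoff.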

A little bit more fine-grained is to look at how srdistance does its
work, which is in ShortRead:::.srdistance; the key to performance is
to choose appropriate parameters to pairwiseAlignment, especially
returning the edit distance alone and excluding indels from
consideration.
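
As a rough sketch of what such a call could look like (this is not a
copy of ShortRead:::.srdistance; the substitution scores and gap
penalties below are my own assumptions, chosen only to make indels
unattractive), asking for scores only avoids building a full alignment
object for every read. Note that this returns an alignment score, not
an edit distance.

```r
library(Biostrings)
# Assumption: newer Bioconductor releases moved the pairwise alignment
# functions into the 'pwalign' package, so attach it when available
if (requireNamespace("pwalign", quietly = TRUE)) library(pwalign)

# Hypothetical toy reads standing in for sread(fq4)
reads <- DNAStringSet(c(
    "ACGGATTGTTCAGTAAAAAAAAAAAAAAAAAAAA",
    "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"))
adapter <- DNAString("ACGGATTGTTCAGT")

# Simple match/mismatch scoring (values are illustrative assumptions)
mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -1)

# scoreOnly = TRUE returns just a numeric score per read; the large
# gap penalties effectively rule out indels
scores <- pairwiseAlignment(reads, adapter, type = "overlap",
                            substitutionMatrix = mat,
                            gapOpening = 100, gapExtension = 100,
                            scoreOnly = TRUE)

# Summarize the scores, as one would with the srdistance output
print(table(scores))
```

Higher scores mean a longer or cleaner overlap with the adapter, so the
threshold works in the opposite direction from a distance.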

Martin

> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with
> 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread on
> adapter removal, but it does not clarify things much for me since, as
> far as I understood, the option mentioned in the thread is not
> appropriate (quote: "though apparently this is not entirely
> satisfactory, see the second entry!").
>
> Is this just a memory issue or am I doing something wrong? Shall I
> leave the process to run for longer?
>
> TIA for your help,
>
> Dave

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793


