[Bioc-sig-seq] adapter removal

Martin Morgan mtmorgan at fhcrc.org
Thu Jan 8 18:22:30 CET 2009


"David A.G" <dasolexa at hotmail.com> writes:

> Dear list,
>
> I have some experience with Bioconductor but am newbie to this list and to NGS. I am trying to remove some adapters from my solexa s_N_sequence.txt file using Biostrings and ShortRead packages and the vignettes.  I managed to read in the text file and got to save the reads as follows
>
> fqpattern <- "s_4_sequence.txt"
> f4 <- file.path(analysisPath(sp), fqpattern)
> fq4 <- readFastq(sp, fqpattern)
> reads <- sread(fq4)  #"reads" contains more than 4 million 34-length fragments
>
> Having the following adapter sequence:
>
> adapter <- DNAString("ACGGATTGTTCAGT")
>
> I tried to mimic the example in the Biostring vignette as follows:
>
>
> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap")
>
> but after more than two hours the process is still running.
>
> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread on adapter removal but does not clear things much to me, since as far as I understood the option mentioned in the thread is not appropriate (quote :(though apparently this is not entirely satisfactory, see the second entry!)).
>
> Is this just a memory issue or am I doing something wrong? Shall I leave the process to run for longer?

As a bit of a follow-up, a reasonable strategy is to figure out how
long / how much memory the calculation takes on smaller chunks of
data.

Also, here I 'clean' (remove reads with 'N's) and then calculate the
srdistance to your adapter on a lane with 5978132 reads in about a
minute

> fq <- readFastq(sp, "s_1_sequence.txt")
> fq
class: ShortReadQ
length: 5978132 reads; width: 53 cycles
> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))
   user  system elapsed 
 60.960   0.000  60.961 

I might then tabulate the distances

> table(dist[[1]])

and subset fq based on some criterion

fq[dist[[1]] > 4]

There's a second interface to this, too, using filters, e.g.,

fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]

Martin

> TIA for your help,
>
> Dave
>
> _________________________________________________________________
> Show them the way! Add maps and directions to your party invites. 
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793



More information about the Bioc-sig-sequencing mailing list