[Bioc-sig-seq] adapter removal

James W. MacDonald jmacdon at med.umich.edu
Thu Jan 8 22:26:56 CET 2009


As a follow-up question, once you have removed the adapter sequences is 
it then possible to write the ShortReadQ object back out in a fastq 
format for aligning with e.g., bowtie? I know I can use 
write.XstringSet() to output in fasta format, but I would like to feed 
bowtie the quality scores as well.

Or is the speed reasonably comparable to use the PDict functions in 
Biostrings to do the aligning?

Best,

Jim



Martin Morgan wrote:
> "David A.G" <dasolexa at hotmail.com> writes:
> 
>> Dear list,
>>
>> I have some experience with Bioconductor but am newbie to this list and to NGS. I am trying to remove some adapters from my solexa s_N_sequence.txt file using Biostrings and ShortRead packages and the vignettes.  I managed to read in the text file and got to save the reads as follows
>>
>> fqpattern <- "s_4_sequence.txt"
>> f4 <- file.path(analysisPath(sp), fqpattern)
>> fq4 <- readFastq(sp, fqpattern)
>> reads <- sread(fq4)  #"reads" contains more than 4 million 34-length fragments
>>
>> Having the following adapter sequence:
>>
>> adapter <- DNAString("ACGGATTGTTCAGT")
>>
>> I tried to mimic the example in the Biostring vignette as follows:
>>
>>
>> myAdapterAligns <- pairwiseAlignment(reads, adapter, type = "overlap")
>>
>> but after more than two hours the process is still running.
>>
>> I am running R 2.8.0 on a 64bit linux machine (Kubuntu 2.6.24) with 4Gb RAM, and I only have some 30Mb free RAM left. I found a thread on adapter removal but does not clear things much to me, since as far as I understood the option mentioned in the thread is not appropriate (quote :(though apparently this is not entirely satisfactory, see the second entry!)).
>>
>> Is this just a memory issue or am I doing something wrong? Shall I leave the process to run for longer?
> 
> As a bit of a follow-up, a reasonable strategy is to figure out how
> long / how much memory the calculation takes on smaller chunks of
> data.
> 
> Also, here I 'clean' (remove reads with 'N's) and then calculate the
> srdistance to your adapter on a lane with 5978132 reads in about a
> minute
> 
>> fq <- readFastq(sp, "s_1_sequence.txt")
>> fq
> class: ShortReadQ
> length: 5978132 reads; width: 53 cycles
>> system.time(dist <- srdistance(clean(fq), "ACGGATTGTTCAGT"))
>    user  system elapsed 
>  60.960   0.000  60.961 
> 
> I might then tabulate the distances
> 
>> table(dist[[1]])
> 
> and subset fq based on some criterion
> 
> fq[dist[[1]] > 4]
> 
> There's a second interface to this, too, using filters, e.g.,
> 
> fq[srdistanceFilter("ACGGATTGTTCAGT", 4)(fq)]
> 
> Martin
> 
>> TIA for your help,
>>
>> Dave
>>
>> _________________________________________________________________
>> Show them the way! Add maps and directions to your party invites. 
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-5646
734-936-8662



More information about the Bioc-sig-sequencing mailing list