[BioC] Trimming of partial adaptor sequences
Michael Stadler
michael.stadler at fmi.ch
Tue Jul 23 08:45:14 CEST 2013
Hi Sean,
I agree with Hervé: I think trimLRPatterns() will do the job.
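For your scenario (3), where only the 3' end needs trimming, a minimal sketch could pass the adaptor as Rpattern alone. The sequences below are toy data, not your real ~80 nt adaptor: the adaptor and the second read come from Hervé's example, and the first read is made up to stand in for a read with no read-through.

```r
library(Biostrings)

# Toy adaptor from Hervé's example (stand-in for the real custom adaptor)
adaptor <- DNAString("ACCAGGACAA")

reads <- DNAStringSet(c(
    "TTTATTTGCTTT",  # scenario (1): no read-through (made-up read)
    "TTTATTTACCAG"   # scenario (3): read ends with a prefix of the adaptor
))

# Trim only on the right (3') side; the defaults allow no mismatches
trimmed <- trimLRPatterns(Rpattern = adaptor, subject = reads)

as.character(trimmed)  # "TTTATTTGCTTT" "TTTATTT"
width(trimmed)         # 12 7
```

As far as I understand the defaults, even a very short adaptor prefix at the end of a read counts as a match, so it may be worth checking ?trimLRPatterns for how max.Rmismatch can be used to restrict or relax the matching.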
An alternative could be to use preprocessReads() from the QuasR package,
which itself is based on trimLRPatterns(), but has a few convenience
features, such as dealing with paired-end files.
Best wishes,
Michael
On 23.07.2013 02:36, Hervé Pagès wrote:
> Hi Sean,
>
> On 07/22/2013 01:02 PM, Taylor, Sean D wrote:
>> We have been experimenting with a NGS protocol in which we insert
>> sheared genomic fragments into a custom plasmid for sequencing on an
>> Illumina MiSeq instrument. The insertion site of this plasmid is flanked
>> by our own custom barcodes (N7) and ~80 nt Illumina-based adaptor
>> sequence. We then PCR out the insert with barcodes and adaptors for
>> sequencing. Our adaptor sequence is similar to the Illumina adaptor, but
>> we use custom primer binding sites. We are not sure if the Illumina
>> software will be able to recognize and trim our custom adaptors. We are
>> trying to figure out the best way to trim read through into the 3’
>> adaptor ourselves. We have roughly three scenarios:
>>
>> (1) The insert is long enough that we have no read through
>>
>> (2) The vector is empty, in which case the entire adaptor sequence is
>> present
>>
>> (3) The insert is long enough to have useful data, but we get
>> read-through into the 3’ adaptor sequence that must be trimmed.
>>
>> The solution we are currently working on is to identify the minimal
>> sequence that is recognizable as the adaptor sequence and trim that
>> using trimLRPatterns() in the Biostrings package. Ideally we would like
>> it if we could give trimLRPatterns() the entire adaptor sequence and
>> have it recognize it on our reads even if it is only partially present.
>
> Maybe I misunderstand what you are trying to do exactly, but yes, you
> can give the entire adaptor sequence to trimLRPatterns() and it will
> recognize it on your reads even if it's only partially present:
>
> library(Biostrings)
>
> adaptor <- DNAString("ACCAGGACAA")  # entire adaptor
> reads <- DNAStringSet(c(
>     "GACAATTTATTT",  # adaptor partially present on the left
>     "CAATTTATTTGC",  # adaptor partially present on the left
>     "TTTATTTACCAG",  # adaptor partially present on the right
>     "CAATTTTTTACC"   # adaptor partially present on both ends
> ))
>
> Then:
>
> > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads)
>   A DNAStringSet instance of length 4
>     width seq
> [1]     7 TTTATTT
> [2]     9 TTTATTTGC
> [3]     7 TTTATTT
> [4]     6 TTTTTT
>
> Note that trimLRPatterns() expects that, when the adaptor is partially
> present on the left (resp. right), what's present is a suffix (resp.
> prefix) of the adaptor, and not an arbitrary substring of it. Is it
> what you expect too?
>
> Thanks,
> H.
>
>> However, in my experimenting it did not seem to be able to do this. I
>> thought I would ask the Bioconductor community if there are any better
>> solutions to recognizing and trimming partial adaptor sequences.
>>
>> Thanks in advance for any input.
>>
>> Sean Taylor
>>
>> Post-doctoral Fellow
>>
>> Fred Hutchinson Cancer Research Center
>>
>> 206-667-5544
>>
>