[BioC] Trimming of partial adaptor sequences

Michael Stadler michael.stadler at fmi.ch
Tue Jul 23 08:45:14 CEST 2013


Hi Sean,

I agree with Hervé; I think trimLRPatterns() will do the job.
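
For the 3' read-through case only, a minimal sketch could pass just Rpattern. The adaptor is the toy sequence from Hervé's example quoted below and the three reads are made up for illustration; real data would use the full custom adaptor. The default of zero mismatches is assumed.

```r
library(Biostrings)

adaptor <- DNAString("ACCAGGACAA")  # toy adaptor from Hervé's example, not a real one
reads <- DNAStringSet(c(
  "TTTATTTGCGTT",  # no read-through
  "TTTATTTACCAG",  # ends with the first 5 nt of the adaptor
  "TTTATTTCAGG"    # ends with an internal piece of the adaptor (not a prefix)
))

# Trim only on the right (3') side; Lpattern is left at its empty default.
trimmed <- trimLRPatterns(Rpattern = adaptor, subject = reads)

# Reads whose width changed are the ones where adaptor sequence was removed.
was_trimmed <- width(trimmed) < width(reads)
```

Only the second read should be shortened (to TTTATTT). The third stays untouched because, as Hervé notes below, only a prefix of the adaptor is recognized at the right end.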

An alternative could be to use preprocessReads() from the QuasR package,
which itself is based on trimLRPatterns(), but has a few convenience
features, such as dealing with paired-end files.
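
A rough sketch of what that could look like on a FASTQ file. The reads, the read names and the temporary file paths are invented for illustration, and the adaptor is again the toy sequence from Hervé's example; all filtering arguments are left at their defaults.

```r
library(QuasR)

adaptor <- "ACCAGGACAA"  # toy adaptor, stands in for the real custom adaptor

# Two made-up 34-nt reads written to a temporary FASTQ file.
insert <- "TTGACGTTAGCCATGCAAGTCGATTCGT"
seqs <- c(paste0(insert, "ACCAGG"),              # read-through: first 6 nt of adaptor
          "GCTTAGGCATCGATCGGATTCAGCTAGCTTGCAT")  # no read-through
infile  <- tempfile(fileext = ".fastq")
outfile <- tempfile(fileext = ".fastq")          # must not exist yet
writeLines(as.vector(rbind(paste0("@read", 1:2), seqs, "+",
                           strrep("I", nchar(seqs)))), infile)

# Trim the 3' adaptor; the return value holds per-file filtering statistics.
stats <- preprocessReads(filename = infile, outputFilename = outfile,
                         Rpattern = adaptor)

# Read the processed sequences back to inspect them.
trimmed <- as.character(Biostrings::readDNAStringSet(outfile, format = "fastq"))
```

For paired-end data, the second read file and its output would go into the filenameMate and outputFilenameMate arguments.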

Best wishes,
Michael


On 23.07.2013 02:36, Hervé Pagès wrote:
> Hi Sean,
> 
> On 07/22/2013 01:02 PM, Taylor, Sean D wrote:
>> We have been experimenting with a NGS protocol in which we insert
>> sheared genomic fragments into a custom plasmid for sequencing on an
>> Illumina MiSeq instrument. The insertion site of this plasmid is flanked
>> by our own custom barcodes (N7) and ~80 nt Illumina-based adaptor
>> sequence. We then PCR out the insert with barcodes and adaptors for
>> sequencing. Our adaptor sequence is similar to the Illumina adaptor, but
>> we use custom primer binding sites. We are not sure if the Illumina
>> software will be able to recognize and trim our custom adaptors. We are
>> trying to figure out the best way to trim read through into the 3’
>> adaptor ourselves.  We have roughly three scenarios:
>>
>> (1) The insert is long enough that we have no read through
>>
>> (2) The vector is empty, in which case the entire adaptor sequence is
>> present
>>
>> (3) The insert is long enough to have useful data, but we get
>> read-through into the 3’ adaptor sequence that must be trimmed.
>>
>> The solution we are currently working on is to identify the minimal
>> sequence that is recognizable as the adaptor sequence and trim that
>> using trimLRPatterns() in the Biostrings package.  Ideally we would like
>> it if we could give trimLRPatterns() the entire adaptor sequence and
>> have it recognize it on our reads even if it is only partially present.
> 
> Maybe I misunderstand what you are trying to do exactly, but yes, you
> can give the entire adaptor sequence to trimLRPatterns() and it will
> recognize it on your reads even if it's only partially present:
> 
>   library(Biostrings)
> 
>   adaptor <- DNAString("ACCAGGACAA")  # entire adaptor
>   reads <- DNAStringSet(c(
>     "GACAATTTATTT", # adaptor partially present on the left
>     "CAATTTATTTGC", # adaptor partially present on the left
>     "TTTATTTACCAG", # adaptor partially present on the right
>     "CAATTTTTTACC"  # adaptor partially present on both ends
>   ))
> 
> Then:
> 
>   > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads)
>     A DNAStringSet instance of length 4
>       width seq
>   [1]     7 TTTATTT
>   [2]     9 TTTATTTGC
>   [3]     7 TTTATTT
>   [4]     6 TTTTTT
> 
> Note that trimLRPatterns() expects that, when the adaptor is partially
> present on the left (resp. right), what's present is a suffix (resp.
> prefix) of the adaptor, and not an arbitrary substring of it. Is that
> what you expect too?
> 
> Thanks,
> H.
> 
>> However, in my experimenting it did not seem to be able to do this. I
>> thought I would ask the Bioconductor community if there are any better
>> solutions to recognizing and trimming partial adaptor sequences.
>>
>> Thanks in advance for any input.
>>
>> Sean Taylor
>>
>> Post-doctoral Fellow
>>
>> Fred Hutchinson Cancer Research Center
>>
>> 206-667-5544
>>
>


