[BioC] Trimming of partial adaptor sequences

Hervé Pagès hpages at fhcrc.org
Tue Jul 23 02:36:44 CEST 2013


Hi Sean,

On 07/22/2013 01:02 PM, Taylor, Sean D wrote:
> We have been experimenting with a NGS protocol in which we insert
> sheared genomic fragments into a custom plasmid for sequencing on an
> Illumina MiSeq instrument. The insertion site of this plasmid is flanked
> by our own custom barcodes (N7) and ~80 nt Illumina-based adaptor
> sequence. We then PCR out the insert with barcodes and adaptors for
> sequencing. Our adaptor sequence is similar to the Illumina adaptor, but
> we use custom primer binding sites. We are not sure if the Illumina
> software will be able to recognize and trim our custom adaptors. We are
> trying to figure out the best way to trim read through into the 3’
> adaptor ourselves.  We have roughly three scenarios:
>
> (1) The insert is long enough that we have no read through
>
> (2) The vector is empty, in which case the entire adaptor sequence is
> present
>
> (3) The insert is long enough to have useful data, but we get
> read-through into the 3’ adaptor sequence that must be trimmed.
>
> The solution we are currently working on is to identify the minimal
> sequence that is recognizable as the adaptor sequence and trim that
> using trimLRPatterns() in the Biostrings package.  Ideally we would like
> it if we could give trimLRPatterns() the entire adaptor sequence and
> have it recognize it on our reads even if it is only partially present.

Maybe I misunderstand exactly what you are trying to do, but yes, you
can give the entire adaptor sequence to trimLRPatterns() and it will
recognize it on your reads even if it's only partially present:

   library(Biostrings)

   adaptor <- DNAString("ACCAGGACAA")  # entire adaptor
   reads <- DNAStringSet(c(
     "GACAATTTATTT", # adaptor partially present on the left
     "CAATTTATTTGC", # adaptor partially present on the left
     "TTTATTTACCAG", # adaptor partially present on the right
     "CAATTTTTTACC"  # adaptor partially present on both ends
   ))

Then:

   > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads)
     A DNAStringSet instance of length 4
       width seq
   [1]     7 TTTATTT
   [2]     9 TTTATTTGC
   [3]     7 TTTATTT
   [4]     6 TTTTTT
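
If only the 3' read-through matters, as in your scenario (3), a minimal
sketch is to pass Rpattern alone. The two reads below are made up for
illustration, and the variable names are mine. ranges=TRUE asks for the
ranges to keep instead of the trimmed sequences, which makes it easy to
see which reads were touched:

```r
library(Biostrings)

adaptor <- DNAString("ACCAGGACAA")   # same toy adaptor as above
reads3 <- DNAStringSet(c(
  "TTTATTTGCGCT",  # made-up read with no read-through
  "TTTATTTACCAG"   # made-up read ending with the first 5 nt of the adaptor
))

# Trim on the right only; no Lpattern is given, so the left end is not touched.
trimmed <- trimLRPatterns(Rpattern=adaptor, subject=reads3)

# Same call, but returning the ranges to keep rather than the sequences.
keep <- trimLRPatterns(Rpattern=adaptor, subject=reads3, ranges=TRUE)

# TRUE for reads that lost at least one nucleotide on the right.
was_trimmed <- width(keep) < width(reads3)
```

Here only the second read should be flagged in was_trimmed.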

Note that trimLRPatterns() expects that, when the adaptor is partially
present on the left of a read, what's present is a suffix of the
adaptor, and when it is partially present on the right, what's present
is a prefix of the adaptor. It will not trim an arbitrary substring of
the adaptor. Is that what you expect too?
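
To make that concrete, here is a small sketch with two made-up reads and
the default settings (no mismatches allowed). The first read ends with
ACCAG, a prefix of the adaptor, so it gets trimmed. The second ends with
CAGG, which occurs inside the adaptor but is not a prefix of it, so it
is returned unchanged:

```r
library(Biostrings)

adaptor <- DNAString("ACCAGGACAA")   # toy adaptor, for illustration only
reads_r <- DNAStringSet(c(
  "TTTATTTACCAG",  # ends with a prefix of the adaptor (ACCAG)
  "TTTATTTCAGG"    # ends with an internal substring of the adaptor (CAGG)
))

# Right-end trimming with the default max.Rmismatch=0.
res <- trimLRPatterns(Rpattern=adaptor, subject=reads_r)

as.character(res)
# "TTTATTT"     : the ACCAG prefix was removed
# "TTTATTTCAGG" : nothing removed, CAGG is not a prefix of the adaptor
```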

Thanks,
H.

> However, in my experimenting it did not seem to be able to do this. I
> thought I would ask the Bioconductor community if there are any better
> solutions to recognizing and trimming partial adaptor sequences.
>
> Thanks in advance for any input.
>
> Sean Taylor
>
> Post-doctoral Fellow
>
> Fred Hutchinson Cancer Research Center
>
> 206-667-5544
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


