[BioC] Trimming of partial adaptor sequences

Wed Jul 24 00:26:32 CEST 2013

Thanks all for your input. I will give some of these solutions a try and will probably go with whichever is faster and integrates well with the remainder of our analyses in R.

Herve, thanks for the clarification. Looking back I see that I hadn't explored the function fully. I think it will work, but I have a few follow-up questions. Here is the sample data set that I'm playing around with:

library(Rsamtools)
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what=c("seq", "qual"))
gal <- readGappedAlignments(bamfile, use.names=TRUE, param=param)

These sequences are all relatively short, about 35nts each. I set the first read as my "adaptor" and pulled out all the reads that have a start position overlapping the 'adaptor' sequence:

gal<-gal[which(start(gal)<=width(gal[1]))]
reads <- setNames(mcols(gal)$seq, names(gal))
seqnames<-seqnames(gal)
adaptor<-reads[[1]]

The data are from two 'chromosomes', seq1 and seq2. My adaptor is from seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me approximate best and worst case scenarios:

seq1<-reads[seqnames=='seq1']
seq2<-reads[seqnames=='seq2']

> adaptor
  36-letter "DNAString" instance
seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG

> seq1
  A DNAStringSet instance of length 16
     width seq                                                names               
 [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG               B7_591:4:96:693:509
 [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT                EAS54_65:7:152:36...
 [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC                EAS51_64:8:5:734:57
 [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT               B7_591:1:289:587:906
 [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG                EAS56_59:8:38:671...
 ...   ... ...
[12]    36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC               B7_591:3:188:662:155
[13]    35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT                EAS56_59:2:225:60...
[14]    35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG                EAS51_66:7:328:39...
[15]    35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA                EAS51_64:5:257:96...
[16]    35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG                EAS54_61:4:143:69...

> seq2
  A DNAStringSet instance of length 25
     width seq                                                names               
 [1]    36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA               B7_591:8:4:841:340
 [2]    35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA                EAS54_67:4:142:94...
 [3]    35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA                EAS54_67:6:43:859...
 [4]    35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA                EAS1_93:2:286:923...
 [5]    35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA                EAS1_99:8:117:578...
 ...   ... ...
[21]    35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA                EAS54_65:8:305:81...
[22]    35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG                EAS114_26:7:13:17...
[23]    35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT                EAS56_63:7:34:334...
[24]    35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC                EAS114_45:3:32:13...
[25]    40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA           EAS139_19:5:70:31...

If I use the basic parameters for trimLRPatterns on seq1:

> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1)

My expected sequence widths after trimming are:
> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1)
 [1]  0  1  3  5  7 11 12 13 16 20 20 23 26 27 29 34

Actually trimmed sequence widths:
> width(trimmed_reads1)
 [1]  0  1  3 35  7 11 12 13 16 20 20 23 26 27 29 34

Here, the 4th sequence isn't trimmed because of some mismatches in the sequence. I expect that my data will have some mismatches, so I need to be able to work with those. So I tried setting the max.Lmismatch=5 to allow for up to 5 mismatches, but got no trimming:

> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=5)
> width(trimmed_reads2)
 [1]  0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35

In my experimenting, I had gone straight to this and saw that it didn't trim partial sequence with these settings, so I had concluded that it wouldn't do the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow 50% mismatches (if I understand that correctly) and this time get too much trimming on the last sequence:

> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.5)
> width(trimmed_reads3)
 [1]  0  1  3  5  7 11 12 13 16 20 20 23 26 27 29 27

I tried setting max.Lmismatch =0.1 but got too little trimming again:

> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.1)
> width(trimmed_reads4)
 [1]  0  1  3 35  7 11 12 13 16 20 20 23 26 27 29 34

 I could probably optimize until I got it just right, but with a large data set that might not be too practical.
 If I try these same settings against seq2, where there is not supposed to be overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a few of them:

> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.5)
> trimmed_reads6<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.1)
> width(seq2)
 [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 40
> width(trimmed_reads5)
 [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 33 27 40
> width(trimmed_reads6)
 [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 35 35 40

So what is the best way to optimize this? Obviously you are trying to balance recognizing the whole adaptor sequence without trimming off too many false positives. Also, can I set a limit that requires to have at least n bases overlap with the adaptor suffix?

Thanks for your help (again)!
Sean

> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Monday, July 22, 2013 5:37 PM
> To: Taylor, Sean D
> Cc: bioconductor at r-project.org
> Subject: Re: Trimming of partial adaptor sequences
> 
> Hi Sean,
> 
> On 07/22/2013 01:02 PM, Taylor, Sean D wrote:
> > We have been experimenting with a NGS protocol in which we insert
> > sheared genomic fragments into a custom plasmid for sequencing on an
> > Illumina MiSeq instrument. The insertion site of this plasmid is
> > flanked by our own custom barcodes (N7) and ~80 nt Illumina-based
> > adaptor sequence. We then PCR out the insert with barcodes and
> > adaptors for sequencing. Our adaptor sequence is similar to the
> > Illumina adaptor, but we use custom primer binding sites. We are not
> > sure if the Illumina software will be able to recognize and trim our
> > custom adaptors. We are trying to figure out the best way to trim read
> through into the 3'
> > adaptor ourselves.  We have roughly three scenarios:
> >
> > (1) The insert is long enough that we have no read through
> >
> > (2) The vector is empty, in which case the entire adaptor sequence is
> > present
> >
> > (3) The insert is long enough to have useful data, but we get
> > read-through into the 3' adaptor sequence that must be trimmed.
> >
> > The solution we are currently working on is to identify the minimal
> > sequence that is recognizable as the adaptor sequence and trim that
> > using trimLRPatterns() in the Biostrings package.  Ideally we would
> > like it if we could give trimLRPatterns() the entire adaptor sequence
> > and have it recognize it on our reads even if it is only partially present.
> 
> May be I misunderstand what you are trying to do exactly but yes, you can
> give the entire adaptor sequence to trimLRPatterns() and it will recognize it
> on our reads even if it's only partially present:
> 
>    library(Biostrings)
> 
>    adaptor <- DNAString("ACCAGGACAA")  # entire adaptor
>    reads <- DNAStringSet(c(
>      "GACAATTTATTT", # adaptor partially present on the left
>      "CAATTTATTTGC", # adaptor partially present on the left
>      "TTTATTTACCAG", # adaptor partially present on the right
>      "CAATTTTTTACC"  # adaptor partially present on both ends
>    ))
> 
> Then:
> 
>    > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads)
>      A DNAStringSet instance of length 4
>        width seq
>    [1]     7 TTTATTT
>    [2]     9 TTTATTTGC
>    [3]     7 TTTATTT
>    [4]     6 TTTTTT
> 
> Note that trimLRPatterns() expects that, when the adaptor is partially
> present on the left (resp. right), what's present is a suffix (resp.
> prefix) of the adaptor, and not an arbitrary substring of it. Is it what you
> expect too?
> 
> Thanks,
> H.
> 
> > However, in my experimenting it did not seem to be able to this. I
> > thought I would ask the Bioconductor community if there are any better
> > solutions to recognizing and trimming partial adaptor sequences.
> >
> > Thanks in advance for any input.
> >
> > Sean Taylor
> >
> > Post-doctoral Fellow
> >
> > Fred Hutchinson Cancer Research Center
> >
> > 206-667-5544
> >
> 
> --
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319