[BioC] Trimming of partial adaptor sequences

Wed Jul 24 18:26:39 CEST 2013

Hi Harris,

No you are right, I will mostly do 3' trimming. But for my test data it just seemed more convenient to test using Lpattern and I assumed (correctly I hope) that all the same things apply to Rpattern so long as you keep your prefixes and suffixes straight. 

It took me a little while to figure out what was going on with the max.Lmismatch vector and the negative values (@Herve, I'm glad I'm not the only one who finds the language on the man page to be a bit opaque at times!) but with some playing around and with the help of your example I think I understand how it's working now. Just to clarify, the low spots are always set at the beginning of the vector, regardless of if you are doing L or R patterns, correct?

Also thanks for the tip on finding the internal adaptors. I tried it out and it worked pretty well, although I discovered that it is easy to overkill with a bunch of Ns, so I'll have to use it conservatively. My adaptor sequence is pretty big though (~80nts) so I should be able to tolerate a decent number of N's on the end, but I will experiment and see. When you use this trick, are the N's being counted as mismatches and should I alter  max.L/Rmismatch?

Otherwise, I think I'm pretty happy with these solutions. Thanks so much for your help!
Sean

> -----Original Message-----
> From: Harris A. Jaffee [mailto:hj at jhu.edu]
> Sent: Tuesday, July 23, 2013 9:23 PM
> To: Taylor, Sean D
> Cc: Pages, Herve; bioconductor at r-project.org; James W. MacDonald;
> dpryan79 at gmail.com
> Subject: Re: [BioC] Trimming of partial adaptor sequences
> 
> Sorry to chime in late; I've been on the road.  I'm still not clear on your
> specifications.
> Herve's answer showed you how to do trimming of the same adaptor, so to
> speak, at both ends of your reads, but I took your original question to be
> about trimming at the 3' end only (in which case your adaptor goes as
> 'Rpattern', with nothing in 'Lpattern').  But I could be all wrong.  Please
> explain in more detail what you want, if you aren't happy yet.
> 
> You are right about a given "max.L/Rmismatch" value not being suitable for
> all your data.
> There is no way to beat that except to subdivide your data.  You may be able
> to gather some intuition using neditStartingAt() or neditEndingAt().
> 
> To prevent (tiny) "false positive" trimming, construct a max.mismatch
> *vector* of the same length as your adaptor string, with -1 in the low spots.
> By "low", I mean as many letters as you want never to be trimmed.  For
> example, to set a trimming lower limit to 3 letters, do something along these
> lines:
> 
> 	rate <- .25	# pretty big rate
> 	max.Lmismatch (or *R*, whichever you really want) <- rate *
> 1:nchar(adaptor)
> 	max.Lmismatch[1:2] <- -1
> 
> Equivalently, send a shorter vector ("missing the low spots"):
> 
> 	max.Lmismatch <- (rate * 1:nchar(adaptor)) [-(1:2)]
> 
> There isn't an argument to do this for you.  Maybe there should be.
> 
> There is also a trick to locate adaptors in "accidental places" in your data, such
> as 3'
> adaptors in the middle of reads, like
> 
> 	[good_data][adaptor]AAAAAAAAAAAAAAAAAAAAA
> 
> For your Rpattern value, append many N's to your adaptor string, and set
> Rfixed="subject"
> (or Rfixed=FALSE, just *not* Rfixed="pattern").  I suppose, if you knew that
> the only garbage that might occur to the right of such an accident was always
> poly-*A*, then you could just append A's instead of N's, and then you
> wouldn't need to relax Rfixed.  This line of attack, with the N's, is due to
> Patrick Aboyoun, the original author.
> 
> Let me know if I'm way off.
> 
> On Jul 23, 2013, at 6:26 PM, Taylor, Sean D wrote:
> 
> > Thanks all for your input. I will give some of these solutions a try and will
> probably go with whichever is faster and integrates well with the remainder
> of our analyses in R.
> >
> > Herve, thanks for the clarification. Looking back I see that I hadn't explored
> the function fully. I think it will work, but I have a few follow-up questions.
> Here is the sample data set that I'm playing around with:
> >
> > library(Rsamtools)
> > bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
> > param <- ScanBamParam(what=c("seq", "qual")) gal <-
> > readGappedAlignments(bamfile, use.names=TRUE, param=param)
> >
> > These sequences are all relatively short, about 35nts each. I set the first
> read as my "adaptor" and pulled out all the reads that have a start position
> overlapping the 'adaptor' sequence:
> >
> > gal<-gal[which(start(gal)<=width(gal[1]))]
> > reads <- setNames(mcols(gal)$seq, names(gal))
> > seqnames<-seqnames(gal)
> > adaptor<-reads[[1]]
> >
> > The data are from two 'chromosomes', seq1 and seq2. My adaptor is from
> seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me
> approximate best and worst case scenarios:
> >
> > seq1<-reads[seqnames=='seq1']
> > seq2<-reads[seqnames=='seq2']
> >
> >> adaptor
> >  36-letter "DNAString" instance
> > seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
> >
> >> seq1
> >  A DNAStringSet instance of length 16
> >     width seq                                                names
> > [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
> B7_591:4:96:693:509
> > [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
> EAS54_65:7:152:36...
> > [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
> EAS51_64:8:5:734:57
> > [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
> B7_591:1:289:587:906
> > [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
> EAS56_59:8:38:671...
> > ...   ... ...
> > [12]    36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC
> B7_591:3:188:662:155
> > [13]    35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT
> EAS56_59:2:225:60...
> > [14]    35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG
> EAS51_66:7:328:39...
> > [15]    35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA
> EAS51_64:5:257:96...
> > [16]    35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG
> EAS54_61:4:143:69...
> >
> >> seq2
> >  A DNAStringSet instance of length 25
> >     width seq                                                names
> > [1]    36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA
> B7_591:8:4:841:340
> > [2]    35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA
> EAS54_67:4:142:94...
> > [3]    35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA
> EAS54_67:6:43:859...
> > [4]    35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA
> EAS1_93:2:286:923...
> > [5]    35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA
> EAS1_99:8:117:578...
> > ...   ... ...
> > [21]    35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA
> EAS54_65:8:305:81...
> > [22]    35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG
> EAS114_26:7:13:17...
> > [23]    35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT
> EAS56_63:7:34:334...
> > [24]    35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC
> EAS114_45:3:32:13...
> > [25]    40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA
> EAS139_19:5:70:31...
> >
> > If I use the basic parameters for trimLRPatterns on seq1:
> >
> >> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1)
> >
> > My expected sequence widths after trimming are:
> >> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1)
> > [1]  0  1  3  5  7 11 12 13 16 20 20 23 26 27 29 34
> >
> > Actually trimmed sequence widths:
> >> width(trimmed_reads1)
> > [1]  0  1  3 35  7 11 12 13 16 20 20 23 26 27 29 34
> >
> > Here, the 4th sequence isn't trimmed because of some mismatches in the
> sequence. I expect that my data will have some mismatches, so I need to be
> able to work with those. So I tried setting the max.Lmismatch=5 to allow for
> up to 5 mismatches, but got no trimming:
> >
> >> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1,
> >> max.Lmismatch=5)
> >> width(trimmed_reads2)
> > [1]  0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35
> >
> > In my experimenting, I had gone straight to this and saw that it didn't trim
> partial sequence with these settings, so I had concluded that it wouldn't do
> the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow
> 50% mismatches (if I understand that correctly) and this time get too much
> trimming on the last sequence:
> >
> >> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1,
> >> max.Lmismatch=0.5)
> >> width(trimmed_reads3)
> > [1]  0  1  3  5  7 11 12 13 16 20 20 23 26 27 29 27
> >
> > I tried setting max.Lmismatch =0.1 but got too little trimming again:
> >
> >> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1,
> >> max.Lmismatch=0.1)
> >> width(trimmed_reads4)
> > [1]  0  1  3 35  7 11 12 13 16 20 20 23 26 27 29 34
> >
> > I could probably optimize until I got it just right, but with a large data set
> that might not be too practical.
> > If I try these same settings against seq2, where there is not supposed to be
> overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With
> mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a
> few of them:
> >
> >> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2,
> >> max.Lmismatch=0.5) trimmed_reads6<-
> trimLRPatterns(Lpattern=adaptor,
> >> subject=seq2, max.Lmismatch=0.1)
> >> width(seq2)
> > [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35
> > 35 35 40
> >> width(trimmed_reads5)
> > [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29
> > 33 27 40
> >> width(trimmed_reads6)
> > [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35
> > 35 35 40
> >
> > So what is the best way to optimize this? Obviously you are trying to
> balance recognizing the whole adaptor sequence without trimming off too
> many false positives. Also, can I set a limit that requires to have at least n
> bases overlap with the adaptor suffix?
> >
> > Thanks for your help (again)!
> > Sean
> >
> >
> >> -----Original Message-----
> >> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> >> Sent: Monday, July 22, 2013 5:37 PM
> >> To: Taylor, Sean D
> >> Cc: bioconductor at r-project.org
> >> Subject: Re: Trimming of partial adaptor sequences
> >>
> >> Hi Sean,
> >>
> >> On 07/22/2013 01:02 PM, Taylor, Sean D wrote:
> >>> We have been experimenting with a NGS protocol in which we insert
> >>> sheared genomic fragments into a custom plasmid for sequencing on an
> >>> Illumina MiSeq instrument. The insertion site of this plasmid is
> >>> flanked by our own custom barcodes (N7) and ~80 nt Illumina-based
> >>> adaptor sequence. We then PCR out the insert with barcodes and
> >>> adaptors for sequencing. Our adaptor sequence is similar to the
> >>> Illumina adaptor, but we use custom primer binding sites. We are not
> >>> sure if the Illumina software will be able to recognize and trim our
> >>> custom adaptors. We are trying to figure out the best way to trim
> >>> read
> >> through into the 3'
> >>> adaptor ourselves.  We have roughly three scenarios:
> >>>
> >>> (1) The insert is long enough that we have no read through
> >>>
> >>> (2) The vector is empty, in which case the entire adaptor sequence
> >>> is present
> >>>
> >>> (3) The insert is long enough to have useful data, but we get
> >>> read-through into the 3' adaptor sequence that must be trimmed.
> >>>
> >>> The solution we are currently working on is to identify the minimal
> >>> sequence that is recognizable as the adaptor sequence and trim that
> >>> using trimLRPatterns() in the Biostrings package.  Ideally we would
> >>> like it if we could give trimLRPatterns() the entire adaptor
> >>> sequence and have it recognize it on our reads even if it is only partially
> present.
> >>
> >> May be I misunderstand what you are trying to do exactly but yes, you
> >> can give the entire adaptor sequence to trimLRPatterns() and it will
> >> recognize it on our reads even if it's only partially present:
> >>
> >>   library(Biostrings)
> >>
> >>   adaptor <- DNAString("ACCAGGACAA")  # entire adaptor
> >>   reads <- DNAStringSet(c(
> >>     "GACAATTTATTT", # adaptor partially present on the left
> >>     "CAATTTATTTGC", # adaptor partially present on the left
> >>     "TTTATTTACCAG", # adaptor partially present on the right
> >>     "CAATTTTTTACC"  # adaptor partially present on both ends
> >>   ))
> >>
> >> Then:
> >>
> >>> trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads)
> >>     A DNAStringSet instance of length 4
> >>       width seq
> >>   [1]     7 TTTATTT
> >>   [2]     9 TTTATTTGC
> >>   [3]     7 TTTATTT
> >>   [4]     6 TTTTTT
> >>
> >> Note that trimLRPatterns() expects that, when the adaptor is
> >> partially present on the left (resp. right), what's present is a suffix (resp.
> >> prefix) of the adaptor, and not an arbitrary substring of it. Is it
> >> what you expect too?
> >>
> >> Thanks,
> >> H.
> >>
> >>> However, in my experimenting it did not seem to be able to this. I
> >>> thought I would ask the Bioconductor community if there are any
> >>> better solutions to recognizing and trimming partial adaptor sequences.
> >>>
> >>> Thanks in advance for any input.
> >>>
> >>> Sean Taylor
> >>>
> >>> Post-doctoral Fellow
> >>>
> >>> Fred Hutchinson Cancer Research Center
> >>>
> >>> 206-667-5544
> >>>
> >>
> >> --
> >> Hervé Pagès
> >>
> >> Program in Computational Biology
> >> Division of Public Health Sciences
> >> Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N, M1-B514
> >> P.O. Box 19024
> >> Seattle, WA 98109-1024
> >>
> >> E-mail: hpages at fhcrc.org
> >> Phone:  (206) 667-5791
> >> Fax:    (206) 667-1319
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> > http://news.gmane.org/gmane.science.biology.informatics.conductor