[BioC] Trimming of partial adaptor sequences

Thu Jul 25 11:04:35 CEST 2013

Hi Harris,

On 07/24/2013 10:41 AM, Harris A. Jaffee wrote:
>
> On Jul 24, 2013, at 12:26 PM, Taylor, Sean D wrote:
>
>> Hi Harris,
>>
>> No you are right, I will mostly do 3' trimming. But for my test data it just seemed more convenient to test using Lpattern and I assumed (correctly I hope) that all the same things apply to Rpattern so long as you keep your prefixes and suffixes straight.
>
> Yes.
>
>> It took me a little while to figure out what was going on with the max.Lmismatch vector and the negative values (@Herve, I'm glad I'm not the only one who finds the language on the man page to be a bit opaque at times!)
>
> Opaqueness due to my writing skills and an effort not to overload with detail.  It needs work and examples.
>
>> but with some playing around and with the help of your example I think I understand how it's working now. Just to clarify, the low spots are always set at the beginning of the vector, regardless of if you are doing L or R patterns, correct?
>
> Yes, it (and supporting code underneath) is symmetric, at least intended to be.
>
>> Also thanks for the tip on finding the internal adaptors. I tried it out and it worked pretty well, although I discovered that it is easy to overkill with a bunch of Ns, so I'll have to use it conservatively. My adaptor sequence is pretty big though (~80nts) so I should be able to tolerate a decent number of N's on the end, but I will experiment and see. When you use this trick, are the N's being counted as mismatches and should I alter  max.L/Rmismatch?
>
> No.  The "non-fixed" matching of IUPAC wild-card letters is free.  (Use neditAt to see this.)
>
> I point out 'with.indels=TRUE' in case you missed it.  They count as 1 mismatch per indel needed.
> Previously (and still in the release branch, I think), you could not get indels and non-fixed at
> the same time.  But it was just implemented last week, in Biostrings devel anyway, by Herve.

Yes the matching functions in Biostrings (in devel) finally support
indels and IUPAC ambiguities *together*. You get the credit for
requesting this feature off-list and sending me an initial patch.
Thanks!

H.

>  In
> my limited view, you always want to allow indels and almost always some degree of non-fixed (i.e.
> fixed="pattern" or, if you add N's to your adaptor, fixed=FALSE), because there will be N's in
> your data.  Although, you have to decide whether these N's should match your adaptor freely or by
> using your allowed mismatch limit.
>
>> Otherwise, I think I'm pretty happy with these solutions. Thanks so much for your help!
>> Sean
>>
>>> -----Original Message-----
>>> From: Harris A. Jaffee [mailto:hj at jhu.edu]
>>> Sent: Tuesday, July 23, 2013 9:23 PM
>>> To: Taylor, Sean D
>>> Cc: Pages, Herve; bioconductor at r-project.org; James W. MacDonald;
>>> dpryan79 at gmail.com
>>> Subject: Re: [BioC] Trimming of partial adaptor sequences
>>>
>>> Sorry to chime in late; I've been on the road.  I'm still not clear on your
>>> specifications.
>>> Herve's answer showed you how to do trimming of the same adaptor, so to
>>> speak, at both ends of your reads, but I took your original question to be
>>> about trimming at the 3' end only (in which case your adaptor goes as
>>> 'Rpattern', with nothing in 'Lpattern').  But I could be all wrong.  Please
>>> explain in more detail what you want, if you aren't happy yet.
>>>
>>> You are right about a given "max.L/Rmismatch" value not being suitable for
>>> all your data.
>>> There is no way to beat that except to subdivide your data.  You may be able
>>> to gather some intuition using neditStartingAt() or neditEndingAt().
>>>
>>> To prevent (tiny) "false positive" trimming, construct a max.mismatch
>>> *vector* of the same length as your adaptor string, with -1 in the low spots.
>>> By "low", I mean as many letters as you want never to be trimmed.  For
>>> example, to set a trimming lower limit to 3 letters, do something along these
>>> lines:
>>>
>>> 	rate <- .25	# pretty big rate
>>> 	max.Lmismatch (or *R*, whichever you really want) <- rate *
>>> 1:nchar(adaptor)
>>> 	max.Lmismatch[1:2] <- -1
>>>
>>> Equivalently, send a shorter vector ("missing the low spots"):
>>>
>>> 	max.Lmismatch <- (rate * 1:nchar(adaptor)) [-(1:2)]
>>>
>>> There isn't an argument to do this for you.  Maybe there should be.
>>>
>>> There is also a trick to locate adaptors in "accidental places" in your data, such
>>> as 3'
>>> adaptors in the middle of reads, like
>>>
>>> 	[good_data][adaptor]AAAAAAAAAAAAAAAAAAAAA
>>>
>>> For your Rpattern value, append many N's to your adaptor string, and set
>>> Rfixed="subject"
>>> (or Rfixed=FALSE, just *not* Rfixed="pattern").  I suppose, if you knew that
>>> the only garbage that might occur to the right of such an accident was always
>>> poly-*A*, then you could just append A's instead of N's, and then you
>>> wouldn't need to relax Rfixed.  This line of attack, with the N's, is due to
>>> Patrick Aboyoun, the original author.
>>>
>>> Let me know if I'm way off.
>>>
>>> On Jul 23, 2013, at 6:26 PM, Taylor, Sean D wrote:
>>>
>>>> Thanks all for your input. I will give some of these solutions a try and will
>>> probably go with whichever is faster and integrates well with the remainder
>>> of our analyses in R.
>>>>
>>>> Herve, thanks for the clarification. Looking back I see that I hadn't explored
>>> the function fully. I think it will work, but I have a few follow-up questions.
>>> Here is the sample data set that I'm playing around with:
>>>>
>>>> library(Rsamtools)
>>>> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
>>>> param <- ScanBamParam(what=c("seq", "qual")) gal <-
>>>> readGappedAlignments(bamfile, use.names=TRUE, param=param)
>>>>
>>>> These sequences are all relatively short, about 35nts each. I set the first
>>> read as my "adaptor" and pulled out all the reads that have a start position
>>> overlapping the 'adaptor' sequence:
>>>>
>>>> gal<-gal[which(start(gal)<=width(gal[1]))]
>>>> reads <- setNames(mcols(gal)$seq, names(gal))
>>>> seqnames<-seqnames(gal)
>>>> adaptor<-reads[[1]]
>>>>
>>>> The data are from two 'chromosomes', seq1 and seq2. My adaptor is from
>>> seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me
>>> approximate best and worst case scenarios:
>>>>
>>>> seq1<-reads[seqnames=='seq1']
>>>> seq2<-reads[seqnames=='seq2']
>>>>
>>>>> adaptor
>>>> 36-letter "DNAString" instance
>>>> seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
>>>>
>>>>> seq1
>>>> A DNAStringSet instance of length 16
>>>>    width seq                                                names
>>>> [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
>>> B7_591:4:96:693:509
>>>> [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
>>> EAS54_65:7:152:36...
>>>> [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
>>> EAS51_64:8:5:734:57
>>>> [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
>>> B7_591:1:289:587:906
>>>> [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
>>> EAS56_59:8:38:671...
>>>> ...   ... ...
>>>> [12]    36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC
>>> B7_591:3:188:662:155
>>>> [13]    35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT
>>> EAS56_59:2:225:60...
>>>> [14]    35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG
>>> EAS51_66:7:328:39...
>>>> [15]    35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA
>>> EAS51_64:5:257:96...
>>>> [16]    35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG
>>> EAS54_61:4:143:69...
>>>>
>>>>> seq2
>>>> A DNAStringSet instance of length 25
>>>>    width seq                                                names
>>>> [1]    36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA
>>> B7_591:8:4:841:340
>>>> [2]    35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA
>>> EAS54_67:4:142:94...
>>>> [3]    35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA
>>> EAS54_67:6:43:859...
>>>> [4]    35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA
>>> EAS1_93:2:286:923...
>>>> [5]    35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA
>>> EAS1_99:8:117:578...
>>>> ...   ... ...
>>>> [21]    35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA
>>> EAS54_65:8:305:81...
>>>> [22]    35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG
>>> EAS114_26:7:13:17...
>>>> [23]    35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT
>>> EAS56_63:7:34:334...
>>>> [24]    35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC
>>> EAS114_45:3:32:13...
>>>> [25]    40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA
>>> EAS139_19:5:70:31...
>>>>
>>>> If I use the basic parameters for trimLRPatterns on seq1:
>>>>
>>>>> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1)
>>>>
>>>> My expected sequence widths after trimming are:
>>>>> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1)
>>>> [1]  0  1  3  5  7 11 12 13 16 20 20 23 26 27 29 34
>>>>
>>>> Actually trimmed sequence widths:
>>>>> width(trimmed_reads1)
>>>> [1]  0  1  3 35  7 11 12 13 16 20 20 23 26 27 29 34
>>>>
>>>> Here, the 4th sequence isn't trimmed because of some mismatches in the
>>> sequence. I expect that my data will have some mismatches, so I need to be
>>> able to work with those. So I tried setting the max.Lmismatch=5 to allow for
>>> up to 5 mismatches, but got no trimming:
>>>>
>>>>> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1,
>>>>> max.Lmismatch=5)
>>>>> width(trimmed_reads2)
>>>> [1]  0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35
>>>>
>>>> In my experimenting, I had gone straight to this and saw that it didn't trim
>>> partial sequence with these settings, so I had concluded that it wouldn't do
>>> the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow
>>> 50% mismatches (if I understand that correctly) and this time get too much
>>> trimming on the last sequence:
>>>>
>>>>> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1,
>>>>> max.Lmismatch=0.5)
>>>>> width(trimmed_reads3)
>>>> [1]  0  1  3  5  7 11 12 13 16 20 20 23 26 27 29 27
>>>>
>>>> I tried setting max.Lmismatch =0.1 but got too little trimming again:
>>>>
>>>>> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1,
>>>>> max.Lmismatch=0.1)
>>>>> width(trimmed_reads4)
>>>> [1]  0  1  3 35  7 11 12 13 16 20 20 23 26 27 29 34
>>>>
>>>> I could probably optimize until I got it just right, but with a large data set
>>> that might not be too practical.
>>>> If I try these same settings against seq2, where there is not supposed to be
>>> overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With
>>> mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a
>>> few of them:
>>>>
>>>>> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2,
>>>>> max.Lmismatch=0.5) trimmed_reads6<-
>>> trimLRPatterns(Lpattern=adaptor,
>>>>> subject=seq2, max.Lmismatch=0.1)
>>>>> width(seq2)
>>>> [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35
>>>> 35 35 40
>>>>> width(trimmed_reads5)
>>>> [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29
>>>> 33 27 40
>>>>> width(trimmed_reads6)
>>>> [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35
>>>> 35 35 40
>>>>
>>>> So what is the best way to optimize this? Obviously you are trying to
>>> balance recognizing the whole adaptor sequence without trimming off too
>>> many false positives. Also, can I set a limit that requires to have at least n
>>> bases overlap with the adaptor suffix?
>>>>
>>>> Thanks for your help (again)!
>>>> Sean
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>>>>> Sent: Monday, July 22, 2013 5:37 PM
>>>>> To: Taylor, Sean D
>>>>> Cc: bioconductor at r-project.org
>>>>> Subject: Re: Trimming of partial adaptor sequences
>>>>>
>>>>> Hi Sean,
>>>>>
>>>>> On 07/22/2013 01:02 PM, Taylor, Sean D wrote:
>>>>>> We have been experimenting with a NGS protocol in which we insert
>>>>>> sheared genomic fragments into a custom plasmid for sequencing on an
>>>>>> Illumina MiSeq instrument. The insertion site of this plasmid is
>>>>>> flanked by our own custom barcodes (N7) and ~80 nt Illumina-based
>>>>>> adaptor sequence. We then PCR out the insert with barcodes and
>>>>>> adaptors for sequencing. Our adaptor sequence is similar to the
>>>>>> Illumina adaptor, but we use custom primer binding sites. We are not
>>>>>> sure if the Illumina software will be able to recognize and trim our
>>>>>> custom adaptors. We are trying to figure out the best way to trim
>>>>>> read
>>>>> through into the 3'
>>>>>> adaptor ourselves.  We have roughly three scenarios:
>>>>>>
>>>>>> (1) The insert is long enough that we have no read through
>>>>>>
>>>>>> (2) The vector is empty, in which case the entire adaptor sequence
>>>>>> is present
>>>>>>
>>>>>> (3) The insert is long enough to have useful data, but we get
>>>>>> read-through into the 3' adaptor sequence that must be trimmed.
>>>>>>
>>>>>> The solution we are currently working on is to identify the minimal
>>>>>> sequence that is recognizable as the adaptor sequence and trim that
>>>>>> using trimLRPatterns() in the Biostrings package.  Ideally we would
>>>>>> like it if we could give trimLRPatterns() the entire adaptor
>>>>>> sequence and have it recognize it on our reads even if it is only partially
>>> present.
>>>>>
>>>>> May be I misunderstand what you are trying to do exactly but yes, you
>>>>> can give the entire adaptor sequence to trimLRPatterns() and it will
>>>>> recognize it on our reads even if it's only partially present:
>>>>>
>>>>> library(Biostrings)
>>>>>
>>>>> adaptor <- DNAString("ACCAGGACAA")  # entire adaptor
>>>>> reads <- DNAStringSet(c(
>>>>>    "GACAATTTATTT", # adaptor partially present on the left
>>>>>    "CAATTTATTTGC", # adaptor partially present on the left
>>>>>    "TTTATTTACCAG", # adaptor partially present on the right
>>>>>    "CAATTTTTTACC"  # adaptor partially present on both ends
>>>>> ))
>>>>>
>>>>> Then:
>>>>>
>>>>>> trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads)
>>>>>    A DNAStringSet instance of length 4
>>>>>      width seq
>>>>> [1]     7 TTTATTT
>>>>> [2]     9 TTTATTTGC
>>>>> [3]     7 TTTATTT
>>>>> [4]     6 TTTTTT
>>>>>
>>>>> Note that trimLRPatterns() expects that, when the adaptor is
>>>>> partially present on the left (resp. right), what's present is a suffix (resp.
>>>>> prefix) of the adaptor, and not an arbitrary substring of it. Is it
>>>>> what you expect too?
>>>>>
>>>>> Thanks,
>>>>> H.
>>>>>
>>>>>> However, in my experimenting it did not seem to be able to this. I
>>>>>> thought I would ask the Bioconductor community if there are any
>>>>>> better solutions to recognizing and trimming partial adaptor sequences.
>>>>>>
>>>>>> Thanks in advance for any input.
>>>>>>
>>>>>> Sean Taylor
>>>>>>
>>>>>> Post-doctoral Fellow
>>>>>>
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>
>>>>>> 206-667-5544
>>>>>>
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319