[BioC] Trimming of partial adaptor sequences
Taylor, Sean D
sdtaylor at fhcrc.org
Wed Jul 24 00:26:32 CEST 2013
Thanks all for your input. I will give some of these solutions a try and will probably go with whichever is faster and integrates well with the remainder of our analyses in R.
Herve, thanks for the clarification. Looking back I see that I hadn't explored the function fully. I think it will work, but I have a few follow-up questions. Here is the sample data set that I'm playing around with:
library(Rsamtools)
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what=c("seq", "qual"))
gal <- readGappedAlignments(bamfile, use.names=TRUE, param=param)
These sequences are all relatively short, about 35nts each. I set the first read as my "adaptor" and pulled out all the reads that have a start position overlapping the 'adaptor' sequence:
gal<-gal[which(start(gal)<=width(gal[1]))]
reads <- setNames(mcols(gal)$seq, names(gal))
seqnames<-seqnames(gal)
adaptor<-reads[[1]]
The data are from two 'chromosomes', seq1 and seq2. My adaptor is from seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me approximate best and worst case scenarios:
seq1<-reads[seqnames=='seq1']
seq2<-reads[seqnames=='seq2']
> adaptor
36-letter "DNAString" instance
seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
> seq1
A DNAStringSet instance of length 16
width seq names
[1] 36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG B7_591:4:96:693:509
[2] 35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT EAS54_65:7:152:36...
[3] 35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC EAS51_64:8:5:734:57
[4] 36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT B7_591:1:289:587:906
[5] 35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG EAS56_59:8:38:671...
... ... ...
[12] 36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC B7_591:3:188:662:155
[13] 35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT EAS56_59:2:225:60...
[14] 35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG EAS51_66:7:328:39...
[15] 35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA EAS51_64:5:257:96...
[16] 35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG EAS54_61:4:143:69...
> seq2
A DNAStringSet instance of length 25
width seq names
[1] 36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA B7_591:8:4:841:340
[2] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:4:142:94...
[3] 35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA EAS54_67:6:43:859...
[4] 35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA EAS1_93:2:286:923...
[5] 35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA EAS1_99:8:117:578...
... ... ...
[21] 35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA EAS54_65:8:305:81...
[22] 35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG EAS114_26:7:13:17...
[23] 35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT EAS56_63:7:34:334...
[24] 35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC EAS114_45:3:32:13...
[25] 40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA EAS139_19:5:70:31...
If I use the basic parameters for trimLRPatterns on seq1:
> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1)
My expected sequence widths after trimming are:
> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1)
[1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 34
Actually trimmed sequence widths:
> width(trimmed_reads1)
[1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34
Here, the 4th sequence isn't trimmed because of some mismatches in the sequence. I expect that my data will have some mismatches, so I need to be able to work with those. So I tried setting the max.Lmismatch=5 to allow for up to 5 mismatches, but got no trimming:
> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=5)
> width(trimmed_reads2)
[1] 0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35
In my experimenting, I had gone straight to this and saw that it didn't trim partial sequence with these settings, so I had concluded that it wouldn't do the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow 50% mismatches (if I understand that correctly) and this time get too much trimming on the last sequence:
> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.5)
> width(trimmed_reads3)
[1] 0 1 3 5 7 11 12 13 16 20 20 23 26 27 29 27
I tried setting max.Lmismatch =0.1 but got too little trimming again:
> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.1)
> width(trimmed_reads4)
[1] 0 1 3 35 7 11 12 13 16 20 20 23 26 27 29 34
I could probably optimize until I got it just right, but with a large data set that might not be too practical.
If I try these same settings against seq2, where there is not supposed to be overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a few of them:
> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.5)
> trimmed_reads6<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.1)
> width(seq2)
[1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 40
> width(trimmed_reads5)
[1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 33 27 40
> width(trimmed_reads6)
[1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 35 35 40
So what is the best way to optimize this? Obviously you are trying to balance recognizing the whole adaptor sequence without trimming off too many false positives. Also, can I set a limit that requires to have at least n bases overlap with the adaptor suffix?
Thanks for your help (again)!
Sean
> -----Original Message-----
> From: Hervé Pagès [mailto:hpages at fhcrc.org]
> Sent: Monday, July 22, 2013 5:37 PM
> To: Taylor, Sean D
> Cc: bioconductor at r-project.org
> Subject: Re: Trimming of partial adaptor sequences
>
> Hi Sean,
>
> On 07/22/2013 01:02 PM, Taylor, Sean D wrote:
> > We have been experimenting with a NGS protocol in which we insert
> > sheared genomic fragments into a custom plasmid for sequencing on an
> > Illumina MiSeq instrument. The insertion site of this plasmid is
> > flanked by our own custom barcodes (N7) and ~80 nt Illumina-based
> > adaptor sequence. We then PCR out the insert with barcodes and
> > adaptors for sequencing. Our adaptor sequence is similar to the
> > Illumina adaptor, but we use custom primer binding sites. We are not
> > sure if the Illumina software will be able to recognize and trim our
> > custom adaptors. We are trying to figure out the best way to trim read
> through into the 3'
> > adaptor ourselves. We have roughly three scenarios:
> >
> > (1) The insert is long enough that we have no read through
> >
> > (2) The vector is empty, in which case the entire adaptor sequence is
> > present
> >
> > (3) The insert is long enough to have useful data, but we get
> > read-through into the 3' adaptor sequence that must be trimmed.
> >
> > The solution we are currently working on is to identify the minimal
> > sequence that is recognizable as the adaptor sequence and trim that
> > using trimLRPatterns() in the Biostrings package. Ideally we would
> > like it if we could give trimLRPatterns() the entire adaptor sequence
> > and have it recognize it on our reads even if it is only partially present.
>
> May be I misunderstand what you are trying to do exactly but yes, you can
> give the entire adaptor sequence to trimLRPatterns() and it will recognize it
> on our reads even if it's only partially present:
>
> library(Biostrings)
>
> adaptor <- DNAString("ACCAGGACAA") # entire adaptor
> reads <- DNAStringSet(c(
> "GACAATTTATTT", # adaptor partially present on the left
> "CAATTTATTTGC", # adaptor partially present on the left
> "TTTATTTACCAG", # adaptor partially present on the right
> "CAATTTTTTACC" # adaptor partially present on both ends
> ))
>
> Then:
>
> > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads)
> A DNAStringSet instance of length 4
> width seq
> [1] 7 TTTATTT
> [2] 9 TTTATTTGC
> [3] 7 TTTATTT
> [4] 6 TTTTTT
>
> Note that trimLRPatterns() expects that, when the adaptor is partially
> present on the left (resp. right), what's present is a suffix (resp.
> prefix) of the adaptor, and not an arbitrary substring of it. Is it what you
> expect too?
>
> Thanks,
> H.
>
> > However, in my experimenting it did not seem to be able to this. I
> > thought I would ask the Bioconductor community if there are any better
> > solutions to recognizing and trimming partial adaptor sequences.
> >
> > Thanks in advance for any input.
> >
> > Sean Taylor
> >
> > Post-doctoral Fellow
> >
> > Fred Hutchinson Cancer Research Center
> >
> > 206-667-5544
> >
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioconductor
mailing list