[BioC] Filtering BAM files by start position for VariantTools
Taylor, Sean D
sdtaylor at fhcrc.org
Fri Aug 2 16:08:33 CEST 2013
Martin, it always makes me feel better when I see you scratching your head too! ;)
Sean
Sent from my iPod
On Aug 1, 2013, at 6:02 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:
> On 07/16/2013 05:40 PM, Michael Lawrence wrote:
>> The necessary update to VariantTools will propagate soon to devel. Or you
>> can grab it from svn.
>
> This isn't building in devel; does it require something special with gmapR? Dan and I looked at this (??) for the conference, but were not quite able to master the automake foo required to get off the ground.
>
> Martin
>
>>
>> Michael
>>
>>
>> On Tue, Jul 16, 2013 at 3:27 PM, Taylor, Sean D <sdtaylor at fhcrc.org> wrote:
>>
>>> In order to work through some of the code, I installed the devel version
>>> of R and updated all the packages. Now when I run tallyVariants() I get the
>>> following error message:****
>>>
>>> ** **
>>>
>>> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : ****
>>>
>>> object 'castList' not found****
>>>
>>> ** **
>>>
>>>> traceback()****
>>>
>>> 12: get(name, envir = asNamespace(pkg), inherits = FALSE)****
>>>
>>> 11: IRanges:::castList****
>>>
>>> 10: safe_mclapply(ind, function(i, ...) {****
>>>
>>> FUN(gr[i], ...)****
>>>
>>> }, ...)****
>>>
>>> 9: applyByChromosome(si, bam_tally_region, mc.cores = mc.cores)****
>>>
>>> 8: .local(x, ...)****
>>>
>>> 7: tallyVariants(x, tally.param)****
>>>
>>> 6: tallyVariants(x, tally.param)****
>>>
>>> 5: .local(x, ...)****
>>>
>>> 4: callVariants(BamFile(x), ...)****
>>>
>>> 3: callVariants(BamFile(x), ...)****
>>>
>>> 2: callVariants(destination, tally.param)****
>>>
>>> 1: callVariants(destination, tally.param)****
>>>
>>> ** **
>>>
>>> Perhaps this is something missing/changed in the devel version of IRanges?
>>> ****
>>>
>>> ** **
>>>
>>> Updated sessionInfo() below:****
>>>
>>> ** **
>>>
>>> R version 3.0.1 (2013-05-16)****
>>>
>>> Platform: x86_64-unknown-linux-gnu (64-bit)****
>>>
>>> ** **
>>>
>>> locale:****
>>>
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ****
>>>
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ****
>>>
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ****
>>>
>>> [7] LC_PAPER=C LC_NAME=C ****
>>>
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C ****
>>>
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ****
>>>
>>> ** **
>>>
>>> attached base packages:****
>>>
>>> [1] grid parallel stats graphics grDevices utils datasets *
>>> ***
>>>
>>> [8] methods base ****
>>>
>>> ** **
>>>
>>> other attached packages:****
>>>
>>> [1] BiocInstaller_1.11.3 latticeExtra_0.6-24
>>> lattice_0.20-15 ****
>>>
>>> [4] RColorBrewer_1.0-5 genoPlotR_0.8
>>> ade4_1.5-2 ****
>>>
>>> [7] gmapR_1.3.2 VariantTools_1.3.2
>>> VariantAnnotation_1.7.34****
>>>
>>> [10] Rsamtools_1.13.24 Biostrings_2.29.13
>>> GenomicRanges_1.13.33 ****
>>>
>>> [13] XVector_0.1.0 IRanges_1.19.18
>>> BiocGenerics_0.7.3 ****
>>>
>>> ** **
>>>
>>> loaded via a namespace (and not attached):****
>>>
>>> [1] AnnotationDbi_1.23.16 Biobase_2.20.0 biomaRt_2.17.2
>>> ****
>>>
>>> [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7
>>> ****
>>>
>>> [7] GenomicFeatures_1.13.19 graph_1.39.3
>>> Matrix_1.0-12 ****
>>>
>>> [10] RBGL_1.37.2 RCurl_1.95-4.1
>>> RSQLite_0.11.4 ****
>>>
>>> [13] rtracklayer_1.20.2 stats4_3.0.1
>>> tools_3.0.1 ****
>>>
>>> [16] XML_3.96-1.1 zlibbioc_1.6.0 ****
>>>
>>>> ** **
>>>
>>> ** **
>>>
>>> Thanks again,****
>>>
>>> Sean****
>>>
>>> ** **
>>>
>>> *From:* Michael Lawrence [mailto:lawrence.michael at gene.com]
>>> *Sent:* Saturday, July 13, 2013 1:59 PM
>>> *To:* Taylor, Sean D
>>> *Cc:* Michael Lawrence; Pages, Herve; Obenchain, Valerie J;
>>> bioconductor at r-project.org
>>>
>>> *Subject:* Re: [BioC] Filtering BAM files by start position for
>>> VariantTools****
>>>
>>> ** **
>>>
>>> ** **
>>>
>>> ** **
>>>
>>> On Sat, Jul 13, 2013 at 1:50 PM, Taylor, Sean D <sdtaylor at fhcrc.org>
>>> wrote:****
>>>
>>> Thanks Michael, ****
>>>
>>> This is an interesting idea. Usually, we resolve PCR/optical duplicates
>>> with the Picard MarkDuplicates command, which just chooses the read with
>>> the highest average quality. This approximates what you want, I think. ***
>>> *
>>>
>>> Is the Picard MarkDuplicates command run by default? Or is this a separate
>>> command from a different package?****
>>>
>>> ** **
>>>
>>> It's a separate package, basically the Java implementation of samtools;
>>> you might that first and see how it improves your error rates, before
>>> taking a more complicated approach.****
>>>
>>> Also, I find it hard to believe that sequence errors would be causing
>>> you trouble at 10%, as long as you're filtering for quality and have decent
>>> coverage. PCR might in limited cases, and Picard effectively takes care of
>>> that.****
>>>
>>> 10% is fine, but we have problems at 1% and lower. I think even the
>>> defaults in tallyVariants() use 1% as a cutoff if I’m not mistaken.****
>>>
>>> ****
>>>
>>> ** **
>>>
>>> It uses a ~4% cutoff. And yes, you'll start running into issues around 1%.
>>> Calling at such a low frequency is kind of ambitious.****
>>>
>>> ****
>>>
>>> What you want is to specify the cycleBreaks argument to
>>> VariantTallyParam. It allows you to define bins by cycle. So if you made
>>> bins for all 100bp, you could do things like generate a consensus by read
>>> family. A family corresponding to position X would be cycle bin #1 at X,
>>> cycle bin #2 at X+1, etc. Then just pick the alt (or ref) with the highest
>>> count. Might be tricky to implement efficiently for the whole genome. Of
>>> course, this assumes you've got single end reads, otherwise this won't
>>> work, because the mate is not considered.****
>>>
>>> Cool, I will try that out, that may be just what I was looking for. It
>>> might be tricky for the whole genome, but we are restricting ourselves to
>>> just the mitochondrial genome. Still gives us several thousand start sites
>>> to work from but should be easier than the whole genome. As for single end
>>> reads, can I just filter on the strand (i.e. ‘+’ or ‘-‘)? Or will I have to
>>> perform separate alignments for each of the paired reads?****
>>>
>>> ****
>>>
>>> ** **
>>>
>>> Now that I think about it, you should be fine just considering the
>>> alignments individually (as if they were unpaired).****
>>>
>>> ****
>>>
>>> ** **
>>
>> [[alternative HTML version deleted]]
>>
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
More information about the Bioconductor
mailing list