[BioC] Filtering BAM files by start position for VariantTools
Dan Tenenbaum
dtenenba at fhcrc.org
Fri Aug 2 20:36:23 CEST 2013
On Fri, Aug 2, 2013 at 11:12 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
> The problem is pretty simple, the mtimes are arbitrary from any VCS, and
> autotools of course relies on those. When the same version of the tools are
> not available, things break. Ideally, the user would not need autotools AT
> ALL to build gmapR, but without mtime preservation, we can't avoid it.
>
> There are some sub-optimal solutions to this problem, like telling svn on
> the client side to use commit times for the mtimes (requires everyone to
> modify their svn config), or tell configure to ignore the mtimes for
> automake-generated files (would be a pain for the maintainers at least, who
> need the files regenerated when changed). The final one is to pick some
> version of autotools that everyone has and generate everything with that
> version, living with the autotools requirement and inefficiencies of
> arbitrary regeneration.
>
> We can try the last one. The cluster has automake 1.10; how about you guys?
>
>
1.11.3. Somehow in the last go-round I thought you were using a higher version.
Dan
>
>
>
> On Fri, Aug 2, 2013 at 10:18 AM, Dan Tenenbaum <dtenenba at fhcrc.org> wrote:
>>
>> On Fri, Aug 2, 2013 at 7:27 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>> > I will fix this today out of necessity to get it building on one of our
>> > clusters. It's not going to be fun.
>> >
>>
>> Thanks! If the solution could work with standard versions of automake
>> available on Ubuntu 12.04LTS that would be great, and mean that our
>> build systems would not need any special tweaking.
>>
>> Dan
>>
>>
>> >
>> > On Fri, Aug 2, 2013 at 7:08 AM, Taylor, Sean D <sdtaylor at fhcrc.org>
>> > wrote:
>> >>
>> >> Martin, it always makes me feel better when I see you scratching your
>> >> head
>> >> too! ;)
>> >>
>> >> Sean
>> >>
>> >> Sent from my iPod
>> >>
>> >> On Aug 1, 2013, at 6:02 PM, "Martin Morgan" <mtmorgan at fhcrc.org> wrote:
>> >>
>> >> > On 07/16/2013 05:40 PM, Michael Lawrence wrote:
>> >> >> The necessary update to VariantTools will propagate soon to devel.
>> >> >> Or
>> >> >> you
>> >> >> can grab it from svn.
>> >> >
>> >> > This isn't building in devel; does it require something special with
>> >> > gmapR? Dan and I looked at this (??) for the conference, but were not
>> >> > quite
>> >> > able to master the automake foo required to get off the ground.
>> >> >
>> >> > Martin
>> >> >
>> >> >>
>> >> >> Michael
>> >> >>
>> >> >>
>> >> >> On Tue, Jul 16, 2013 at 3:27 PM, Taylor, Sean D <sdtaylor at fhcrc.org>
>> >> >> wrote:
>> >> >>
>> >> >>> In order to work through some of the code, I installed the devel
>> >> >>> version
>> >> >>> of R and updated all the packages. Now when I run tallyVariants() I
>> >> >>> get the
>> >> >>> following error message:****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
>> >> >>> ****
>> >> >>>
>> >> >>> object 'castList' not found****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>>> traceback()****
>> >> >>>
>> >> >>> 12: get(name, envir = asNamespace(pkg), inherits = FALSE)****
>> >> >>>
>> >> >>> 11: IRanges:::castList****
>> >> >>>
>> >> >>> 10: safe_mclapply(ind, function(i, ...) {****
>> >> >>>
>> >> >>> FUN(gr[i], ...)****
>> >> >>>
>> >> >>> }, ...)****
>> >> >>>
>> >> >>> 9: applyByChromosome(si, bam_tally_region, mc.cores = mc.cores)****
>> >> >>>
>> >> >>> 8: .local(x, ...)****
>> >> >>>
>> >> >>> 7: tallyVariants(x, tally.param)****
>> >> >>>
>> >> >>> 6: tallyVariants(x, tally.param)****
>> >> >>>
>> >> >>> 5: .local(x, ...)****
>> >> >>>
>> >> >>> 4: callVariants(BamFile(x), ...)****
>> >> >>>
>> >> >>> 3: callVariants(BamFile(x), ...)****
>> >> >>>
>> >> >>> 2: callVariants(destination, tally.param)****
>> >> >>>
>> >> >>> 1: callVariants(destination, tally.param)****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> Perhaps this is something missing/changed in the devel version of
>> >> >>> IRanges?
>> >> >>> ****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> Updated sessionInfo() below:****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> R version 3.0.1 (2013-05-16)****
>> >> >>>
>> >> >>> Platform: x86_64-unknown-linux-gnu (64-bit)****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> locale:****
>> >> >>>
>> >> >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C ****
>> >> >>>
>> >> >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 ****
>> >> >>>
>> >> >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ****
>> >> >>>
>> >> >>> [7] LC_PAPER=C LC_NAME=C ****
>> >> >>>
>> >> >>> [9] LC_ADDRESS=C LC_TELEPHONE=C ****
>> >> >>>
>> >> >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> attached base packages:****
>> >> >>>
>> >> >>> [1] grid parallel stats graphics grDevices utils
>> >> >>> datasets *
>> >> >>> ***
>> >> >>>
>> >> >>> [8] methods base ****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> other attached packages:****
>> >> >>>
>> >> >>> [1] BiocInstaller_1.11.3 latticeExtra_0.6-24
>> >> >>> lattice_0.20-15 ****
>> >> >>>
>> >> >>> [4] RColorBrewer_1.0-5 genoPlotR_0.8
>> >> >>> ade4_1.5-2 ****
>> >> >>>
>> >> >>> [7] gmapR_1.3.2 VariantTools_1.3.2
>> >> >>> VariantAnnotation_1.7.34****
>> >> >>>
>> >> >>> [10] Rsamtools_1.13.24 Biostrings_2.29.13
>> >> >>> GenomicRanges_1.13.33 ****
>> >> >>>
>> >> >>> [13] XVector_0.1.0 IRanges_1.19.18
>> >> >>> BiocGenerics_0.7.3 ****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> loaded via a namespace (and not attached):****
>> >> >>>
>> >> >>> [1] AnnotationDbi_1.23.16 Biobase_2.20.0 biomaRt_2.17.2
>> >> >>> ****
>> >> >>>
>> >> >>> [4] bitops_1.0-5 BSgenome_1.28.0 DBI_0.2-7
>> >> >>> ****
>> >> >>>
>> >> >>> [7] GenomicFeatures_1.13.19 graph_1.39.3
>> >> >>> Matrix_1.0-12 ****
>> >> >>>
>> >> >>> [10] RBGL_1.37.2 RCurl_1.95-4.1
>> >> >>> RSQLite_0.11.4 ****
>> >> >>>
>> >> >>> [13] rtracklayer_1.20.2 stats4_3.0.1
>> >> >>> tools_3.0.1 ****
>> >> >>>
>> >> >>> [16] XML_3.96-1.1 zlibbioc_1.6.0 ****
>> >> >>>
>> >> >>>> ** **
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> Thanks again,****
>> >> >>>
>> >> >>> Sean****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> *From:* Michael Lawrence [mailto:lawrence.michael at gene.com]
>> >> >>> *Sent:* Saturday, July 13, 2013 1:59 PM
>> >> >>> *To:* Taylor, Sean D
>> >> >>> *Cc:* Michael Lawrence; Pages, Herve; Obenchain, Valerie J;
>> >> >>> bioconductor at r-project.org
>> >> >>>
>> >> >>> *Subject:* Re: [BioC] Filtering BAM files by start position for
>> >> >>> VariantTools****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> On Sat, Jul 13, 2013 at 1:50 PM, Taylor, Sean D
>> >> >>> <sdtaylor at fhcrc.org>
>> >> >>> wrote:****
>> >> >>>
>> >> >>> Thanks Michael, ****
>> >> >>>
>> >> >>> This is an interesting idea. Usually, we resolve PCR/optical
>> >> >>> duplicates
>> >> >>> with the Picard MarkDuplicates command, which just chooses the read
>> >> >>> with
>> >> >>> the highest average quality. This approximates what you want, I
>> >> >>> think.
>> >> >>> ***
>> >> >>> *
>> >> >>>
>> >> >>> Is the Picard MarkDuplicates command run by default? Or is this a
>> >> >>> separate
>> >> >>> command from a different package?****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> It's a separate package, basically the Java implementation of
>> >> >>> samtools;
>> >> >>> you might that first and see how it improves your error rates,
>> >> >>> before
>> >> >>> taking a more complicated approach.****
>> >> >>>
>> >> >>> Also, I find it hard to believe that sequence errors would be
>> >> >>> causing
>> >> >>> you trouble at 10%, as long as you're filtering for quality and
>> >> >>> have
>> >> >>> decent
>> >> >>> coverage. PCR might in limited cases, and Picard effectively takes
>> >> >>> care of
>> >> >>> that.****
>> >> >>>
>> >> >>> 10% is fine, but we have problems at 1% and lower. I think even the
>> >> >>> defaults in tallyVariants() use 1% as a cutoff if I’m not
>> >> >>> mistaken.****
>> >> >>>
>> >> >>> ****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> It uses a ~4% cutoff. And yes, you'll start running into issues
>> >> >>> around
>> >> >>> 1%.
>> >> >>> Calling at such a low frequency is kind of ambitious.****
>> >> >>>
>> >> >>> ****
>> >> >>>
>> >> >>> What you want is to specify the cycleBreaks argument to
>> >> >>> VariantTallyParam. It allows you to define bins by cycle. So if you
>> >> >>> made
>> >> >>> bins for all 100bp, you could do things like generate a consensus
>> >> >>> by
>> >> >>> read
>> >> >>> family. A family corresponding to position X would be cycle bin #1
>> >> >>> at
>> >> >>> X,
>> >> >>> cycle bin #2 at X+1, etc. Then just pick the alt (or ref) with the
>> >> >>> highest
>> >> >>> count. Might be tricky to implement efficiently for the whole
>> >> >>> genome.
>> >> >>> Of
>> >> >>> course, this assumes you've got single end reads, otherwise this
>> >> >>> won't
>> >> >>> work, because the mate is not considered.****
>> >> >>>
>> >> >>> Cool, I will try that out, that may be just what I was looking for.
>> >> >>> It
>> >> >>> might be tricky for the whole genome, but we are restricting
>> >> >>> ourselves
>> >> >>> to
>> >> >>> just the mitochondrial genome. Still gives us several thousand
>> >> >>> start
>> >> >>> sites
>> >> >>> to work from but should be easier than the whole genome. As for
>> >> >>> single
>> >> >>> end
>> >> >>> reads, can I just filter on the strand (i.e. ‘+’ or ‘-‘)? Or will I
>> >> >>> have to
>> >> >>> perform separate alignments for each of the paired reads?****
>> >> >>>
>> >> >>> ****
>> >> >>>
>> >> >>> ** **
>> >> >>>
>> >> >>> Now that I think about it, you should be fine just considering the
>> >> >>> alignments individually (as if they were unpaired).****
>> >> >>>
>> >> >>> ****
>> >> >>>
>> >> >>> ** **
>> >> >>
>> >> >> [[alternative HTML version deleted]]
>> >> >>
>> >> >>
>> >> >>
>> >> >> _______________________________________________
>> >> >> Bioconductor mailing list
>> >> >> Bioconductor at r-project.org
>> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >> >> Search the archives:
>> >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> >> >
>> >> >
>> >> > --
>> >> > Computational Biology / Fred Hutchinson Cancer Research Center
>> >> > 1100 Fairview Ave. N.
>> >> > PO Box 19024 Seattle, WA 98109
>> >> >
>> >> > Location: Arnold Building M1 B861
>> >> > Phone: (206) 667-2793
>> >
>> >
>
>
More information about the Bioconductor
mailing list