[BioC] Complete variant toolbox: gmapR/VariantTools/VariantAnnotation

Thomas Girke thomas.girke at ucr.edu
Tue Dec 10 18:09:37 CET 2013


Hi Valerie,

Adding a 'REFLOC' column to the output of locateVariants() would address
this need. Thanks for looking into this.
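
To make the request concrete, here is a rough base R sketch of what such a
column might hold. It assumes that 'REFLOC' would be the 1-based offset of the
variant within the feature range it hits, counted from the feature's 5' end.
The toy table and its column names (POS, START, END, STRAND) are made up for
illustration and are not the actual locateVariants() output.

```r
## Hypothetical toy data: variant positions and the feature ranges they hit.
hits <- data.frame(
  VARID  = c("chr1:1033_A/T", "chr1:5020_G/C"),
  POS    = c(1033L, 5020L),   # variant position on the chromosome
  START  = c(1000L, 5000L),   # feature start (e.g. a 5'UTR)
  END    = c(1100L, 5050L),   # feature end
  STRAND = c("+", "-"),
  stringsAsFactors = FALSE
)

## Assumed definition: offset from the 5' end of the feature, so minus-strand
## features are counted from END rather than START.
hits$REFLOC <- ifelse(hits$STRAND == "-",
                      hits$END - hits$POS + 1L,
                      hits$POS - hits$START + 1L)
print(hits)
```

With this definition the first toy variant sits at position 34 of its feature
and the second at position 31.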

As for the need for a summary_var_report IN ADDITION TO a
complete_var_report, the primitive approach, used to create the results
shown on the slides, is here:
http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rvarseq/Rvarseq_Fct.R
Right now this is just a pointer to show students how this could be done
rather than something I would consider even remotely a finished solution
for a package. To achieve the latter, one definitely should look into
how to get rid of some of the tapply steps. As an expert on the VCF and
related classes, you might have much more elegant and efficient solutions
for this? Also, to address some of Julian's concerns related to ambiguous
annotations, in case of overlapping genes one would append/prepend (but
only for those) the GENEID to the annotation feature names, e.g.
coding_GENE1__coding_GENE2. The result will end up being a gene-centric
rather than a transcript-centric report, meaning we are losing the
assignment to specific transcript variants. In 90% of the use cases of
our discovery-oriented VAR-Seq projects, gene resolution is sufficient
here (e.g. supplement tables for publications or grant applications). If
transcript resolution is needed then users are usually happy to look the
results up in the complete variant report. Alternatively, one could
easily do the same on the transcript level, but here a summary report
may quickly become too complex to be useful for practitioners. Perhaps a
well-designed Var Summary Report function would include a summary_mode
argument where the user could decide whether to output a gene- or
transcript-centric summary_var_report.
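
The collapsing idea described above can be sketched in base R roughly as
follows. This is only an illustration under stated assumptions, not the code
from Rvarseq_Fct.R: the function name summarizeVariants(), its idcol and sep
arguments, and the toy table are all hypothetical, and a real implementation
would start from the locateVariants()/predictCoding() results. The idcol
argument stands in for the summary_mode idea, since pointing it at a
transcript ID column instead of GENEID would give a transcript-centric
summary.

```r
## Hypothetical toy annotation table: one row per variant/annotation hit.
ann <- data.frame(
  VARID    = c("chr1:1033_A/T", "chr1:1033_A/T",
               "chr1:2050_G/C", "chr1:2050_G/C", "chr2:77_C/T"),
  GENEID   = c("GENE1", "GENE2", "GENE3", "GENE3", "GENE4"),
  LOCATION = c("coding", "coding", "intron", "coding", "promoter"),
  stringsAsFactors = FALSE
)

## Hypothetical helper: collapse all annotations of a variant onto one line.
summarizeVariants <- function(ann, idcol = "GENEID", sep = "__") {
  ann <- unique(ann[, c("VARID", idcol, "LOCATION")])
  ids <- ann[[idcol]]
  ## Number of distinct features (genes by default) hit by each variant
  nids <- vapply(split(ids, ann$VARID),
                 function(x) length(unique(x)), integer(1))
  ## Append the ID to the feature name only for variants hitting > 1 feature
  ambiguous <- nids[ann$VARID] > 1
  label <- ifelse(ambiguous, paste(ann$LOCATION, ids, sep = "_"), ann$LOCATION)
  ## Keep variants in their order of first appearance
  f <- factor(ann$VARID, levels = unique(ann$VARID))
  collapse <- function(x) {
    unname(vapply(split(x, f),
                  function(v) paste(unique(v), collapse = sep), character(1)))
  }
  out <- data.frame(VARID = levels(f), ID = collapse(ids),
                    LOCATION = collapse(label), stringsAsFactors = FALSE)
  names(out)[2] <- idcol
  out
}

res <- summarizeVariants(ann)
print(res)
```

In the toy data the first variant overlaps two genes and is reported as
coding_GENE1__coding_GENE2, while the second variant has two annotations in a
single gene and is reported as intron__coding without the gene ID attached.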

In general, this is obviously one of those tasks where it will be hard
to reach consensus among biologists on what exactly the ideal VAR summary
report should look like. However, tackling this problem at least somehow
is extremely important because for biologists this may be one of the most
crucial features of any variant annotation tool. Most of them will not
know how to get things from VRanges/GRanges/VCF objects into a file
containing fewer than 100K lines that they can easily digest in a
spreadsheet program and that is also accepted in the supplement section
of most scientific journals (usually limited to Excel).

Best,

Thomas


On Mon, Dec 09, 2013 at 08:07:34PM +0000, Valerie Obenchain wrote:
> Hi Thomas,
> 
> On 12/08/2013 09:08 AM, Thomas Girke wrote:
> > Dear Michael and Valerie,
> >
> > VariantTools and VariantAnnotation are awesome packages. To the best of my
> > knowledge, VariantTools is currently the only Bioc/R package that performs
> > variant calling and it does this in a very nice way. With the available
> > resources it is now straightforward to set up complete workflows for variant
> > calling projects: (1) variant aware read alignments with GSNAP from gmapR ->
> > (2) variant calling/filtering with VariantTools -> (3) adding genomic context
> > with VariantAnnotation. This is really amazing!!!
> >
> > Here are a few questions related to both packages:
> >
> > (1) For teaching purposes and other obvious reasons it would be useful if a
> > Windows version of VariantTools were available (and perhaps for gmapR too).
> > Installing the package (includes gmapR) from source works fine on both Linux
> > and OS X, but not on Windows.
> >
> > (2) The VRanges class is another great resource for filtering variant calls.
> > What I was not able to locate though is a description/definition of the content
> > of its different columns/components. Is something like this available
> > somewhere?
> >
> > (3) When annotating variants with utilities from VariantAnnotation, it would
> > be useful to provide a convenience Summary Report function at the end of the
> > workflow that exports the annotations to a file. A very common need here is to
> > collapse the annotations for each variant on a single line so that one doesn't
> > end up with annotation results of millions of lines as is typical for many
> > variant discovery projects. This also simplifies joins among different
> > annotation instances because it maintains uniqueness among variant identifiers.
> > This approach is often useful when comparing (joining) the variants among
> > different genotypes (e.g. which variants are identical or unique among
> > different mutants). An example solution is shown on slides 34-35 of this
> > presentation:
> > http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_12_16_2013/Rvarseq/Rvarseq.pdf
> >
> 
> The variantReport() and codingReport() functions look great. Would you 
> be willing to contribute them to VariantAnnotation?
> 
> > (4) predictCoding() reports the relative location where exactly a variant maps
> > to an annotation range. It would be nice if locateVariants() could report the
> > exact relative mapping locations too, e.g. variant chr1:1033_A/T maps to
> > position x of 5'UTR. Perhaps this is already possible but I couldn't figure
> > out how to do it without reaching too far into my own hacking toolbox.
> >
> 
> I could add a 'REFLOC' column to the output of locateVariants() that 
> would essentially be the "equivalent" to 'CDSLOC' from predictCoding().
> 
> Valerie
> 
> 
> > Thanks for providing these excellent resources and most importantly your patience
> > listening to these unsolicited questions.
> >
> > Best,
> >
> >
> > Thomas
> >
> >
> >
> >> sessionInfo()
> > R version 3.0.2 (2013-09-25)
> > Platform: x86_64-apple-darwin10.8.0 (64-bit)
> >
> > locale:
> > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> >
> > attached base packages:
> > [1] parallel  stats     graphics  grDevices utils     datasets  methods
> > [8] base
> >
> > other attached packages:
> > [1] VariantTools_1.4.5      VariantAnnotation_1.8.7 Rsamtools_1.14.2
> > [4] Biostrings_2.30.1       GenomicRanges_1.14.3    XVector_0.2.0
> > [7] IRanges_1.20.6          BiocGenerics_0.8.0
> >
> > loaded via a namespace (and not attached):
> >   [1] AnnotationDbi_1.24.0   BatchJobs_1.1-1135     BBmisc_1.4
> >   [4] Biobase_2.22.0         BiocParallel_0.4.1     biomaRt_2.18.0
> >   [7] bitops_1.0-6           brew_1.0-6             BSgenome_1.30.0
> > [10] codetools_0.2-8        DBI_0.2-7              digest_0.6.3
> > [13] fail_1.2               foreach_1.4.1          GenomicFeatures_1.14.2
> > [16] gmapR_1.4.2            grid_3.0.2             iterators_1.0.6
> > [19] lattice_0.20-24        Matrix_1.1-0           plyr_1.8
> > [22] RCurl_1.95-4.1         RSQLite_0.11.4         rtracklayer_1.22.0
> > [25] sendmailR_1.1-2        stats4_3.0.2           tools_3.0.2
> > [28] XML_3.95-0.2           zlibbioc_1.8.0
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> 
> 
> -- 
> Valerie Obenchain
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B155
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: vobencha at fhcrc.org
> Phone:  (206) 667-3158
> Fax:    (206) 667-1319


