[BioC] ensemblVEP, variant_effect_predictor versions and release schedule

Mon Jan 13 19:41:58 CET 2014

OK, this is ready to go. The changes are checked into ensemblVEP version 1.3.6.

Called with no arguments, the constructor creates a VEPParam compatible with the current API.

> param73 <- VEPParam()
> param73
class: VEPParam73
identifier(0):
colocatedVariants(0):
dataformat(0):
basic(0):
input(1): species
cache(3): dir, dir_cache, dir_plugins
output(1): terms
filterqc(0):
database(2): host, database
advanced(1): buffer_size
version(2): 73, 74
scriptPath(0):

To create a VEPParam for an archived version, supply the version to the 
constructor.

> param67 <- VEPParam(67)
> param67
class: VEPParam67
basic(0):
input(1): species
cache(1): dir
output(1): terms
filterqc(0):
database(1): host
advanced(1): buffer_size
version(1): 67
scriptPath(0):

supportedVEP() lists all classes and supported versions. The idea is to 
only create a new subclass when a substantial change is made to the API. 
You can see that VEPParam73 supports both 73 and 74. I'll keep adding 
versions to this class until the interface requires a major change.

> supportedVEP()
$VEPParam67
[1] 67

$VEPParam73
[1] 73 74
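
To illustrate the mapping, here is a minimal base-R sketch of how a version number could be resolved to its class from a list shaped like the supportedVEP() output. The `supported` list and the `classForVersion()` helper are hypothetical names used only for illustration; they are not part of ensemblVEP, and the package's own dispatch may work differently.

```r
# Hypothetical stand-in for the list returned by supportedVEP().
supported <- list(VEPParam67 = 67, VEPParam73 = c(73, 74))

# Hypothetical helper: return the name of the class that covers `version`.
classForVersion <- function(version, supported) {
  # One TRUE/FALSE per class: does this class list the requested version?
  hit <- vapply(supported, function(v) version %in% v, logical(1))
  if (!any(hit)) stop("VEP version ", version, " is not supported")
  names(supported)[hit][1]
}

classForVersion(74, supported)  # "VEPParam73"
```

Adding a new version to an existing class then only means extending one vector, which is the reason for not creating a subclass per release.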

To specify a non-standard location of your .pl script use the 
scriptPath<- setter. This was added to handle the case where multiple 
versions are installed locally.

scriptPath(param67) <- "fullPathToScript/variant_effect_predictor.pl"
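
Putting the pieces together, a run against an archived installation might look like the sketch below. The script location is a placeholder to replace with your own, the example VCF from VariantAnnotation is only a convenient input, and the guard is my addition so the code is skipped when the package or the script is not available.

```r
# Placeholder path: point this at your local archived script.
script <- "fullPathToScript/variant_effect_predictor.pl"

# Only attempt the run when ensemblVEP is installed and the script exists.
can_run <- requireNamespace("ensemblVEP", quietly = TRUE) && file.exists(script)

if (can_run) {
  library(ensemblVEP)
  param67 <- VEPParam(67)          # parameters for the archived API
  scriptPath(param67) <- script    # use the non-standard script location
  # Assumed input: the ex2.vcf example file shipped with VariantAnnotation.
  vcf <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
  res <- ensemblVEP(vcf, param = param67)
  print(res)
}
```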

These examples and more are on the ?VEPParam help page. Let me know how it goes.

Valerie

On 01/07/2014 02:01 AM, Thomas Sandmann wrote:
> Hi Valerie,
>
> thanks a lot for supporting legacy versions of the ensembl database /
> variant_effect_predictor.pl script.
>
>     I assume you're still using version 67 and have the data cached.
>
>
> Yes, that's right. We use ensembl release 67 together with the
> corresponding variant_effect_predictor.pl script version 2.5.
>
>     How are you calling the script right now?
>
>
> As a temporary fix, I am using the ensemblVEP method from ensemblVEP
> version 1.1.3 (BioC svn revision r76970). I think this is the last
> version that worked with ensembl release 67 for me.
>
> I modified the default parameters in the VEPParam object by creating a
> temporary "gVEPParam" class for use with our in-house ensembl release
> 67. This object is passed to the ensemblVEP method with the default
> parameters listed below. (Please note that our installation of
> variant_effect_predictor.pl by
> default connects to our in-house database.)
>
> Formal class 'gVEPParam' with 6 slots
>    ..@ basic   :List of 5
>    .. ..$ verbose    : logi FALSE
>    .. ..$ quiet      : logi FALSE
>    .. ..$ no_progress: logi TRUE
>    .. ..$ config     : chr(0)
>    .. ..$ everything : logi FALSE
>    ..@ input   :List of 4
>    .. ..$ species        : chr "homo_sapiens"
>    .. ..$ format         : chr(0)
>    .. ..$ output_file    : chr(0)
>    .. ..$ force_overwrite: logi FALSE
>    ..@ output  :List of 24
>    .. ..$ terms      : chr "so"
>    .. ..$ sift       : chr "b"
>    .. ..$ polyphen   : chr "b"
>    .. ..$ regulatory : logi FALSE
>    .. ..$ cell_type  : chr(0)
>    .. ..$ hgvs       : logi TRUE
>    .. ..$ hgnc       : logi TRUE
>    .. ..$ gene       : logi TRUE
>    .. ..$ protein    : logi TRUE
>    .. ..$ ccds       : logi TRUE
>    .. ..$ canonical  : logi TRUE
>    .. ..$ xref_refseq: logi FALSE
>    .. ..$ numbers    : logi TRUE
>    .. ..$ domains    : logi TRUE
>    .. ..$ most_severe: logi FALSE
>    .. ..$ summary    : logi FALSE
>    .. ..$ per_gene   : logi FALSE
>    .. ..$ convert    : chr(0)
>    .. ..$ fields     : chr(0)
>    .. ..$ vcf        : logi FALSE
>    .. ..$ gvf        : logi FALSE
>    .. ..$ original   : logi FALSE
>    .. ..$ custom     : chr(0)
>    .. ..$ plugin     : chr "GNECondel,/Plugins/config/Condel/config"
>    ..@ filterqc:List of 17
>    .. ..$ check_ref        : logi FALSE
>    .. ..$ coding_only      : logi FALSE
>    .. ..$ check_existing   : logi TRUE
>    .. ..$ check_alleles    : logi FALSE
>    .. ..$ check_svs        : logi FALSE
>    .. ..$ individual       : chr(0)
>    .. ..$ chr              : chr(0)
>    .. ..$ no_intergenic    : logi FALSE
>    .. ..$ filter_common    : logi FALSE
>    .. ..$ check_frequency  : logi FALSE
>    .. ..$ freq_pop         : chr(0)
>    .. ..$ freq_freq        : logi FALSE
>    .. ..$ freq_gt_lt       : chr(0)
>    .. ..$ freq_filter      : chr(0)
>    .. ..$ filter           : chr(0)
>    .. ..$ failed           : logi FALSE
>    .. ..$ allow_non_variant: logi FALSE
>    ..@ database:List of 9
>    .. ..$ database  : logi FALSE
>    .. ..$ host      : chr "useastdb.ensembl.org"
>    .. ..$ user      : chr(0)
>    .. ..$ password  : chr(0)
>    .. ..$ port      : num(0)
>    .. ..$ genomes   : logi FALSE
>    .. ..$ refseq    : logi FALSE
>    .. ..$ db_version: num(0)
>    .. ..$ registry  : chr(0)
>    ..@ advanced:List of 4
>    .. ..$ no_whole_genome: logi FALSE
>    .. ..$ buffer_size    : num 5000
>    .. ..$ compress       : chr(0)
>    .. ..$ skip_db_check  : logi FALSE
>
>     Do you use the --cache flag or --offline flag?
>
>
> I am not using the --cache flag right now, because version 2.5 of the
> variant_effect_predictor.pl script does not allow me to specify the
> plugin directory and the cache directory separately. (This was only
> introduced in a later version of the Perl script.)
>
> The --offline flag does not seem to be available in
> variant_effect_predictor.pl version 2.5; at least I cannot find it in
> the listed arguments (provided below for reference).
>
> version 2.5
>
> Options
> =======
>
> --help                 Display this message and quit
> --verbose              Display verbose output as the script runs
> [default: off]
> --quiet                Suppress status and warning messages [default: off]
> --no_progress          Suppress progress bars [default: off]
>
> --config               Load configuration from file. Any command line
> options
>                         specified overwrite those in the file [default: off]
> --everything           Shortcut switch to turn on commonly used options.
> See web
>                         documentation for details [default: off]
>
> -i | --input_file      Input file - if not specified, reads from STDIN.
> Files
>                         may be gzip compressed.
> --format               Specify input file format - one of "ensembl",
> "pileup",
>                         "vcf", "hgvs", "id" or "guess" to try and work
> out format.
> -o | --output_file     Output file. Write to STDOUT by specifying -o
> STDOUT - this
>                         will force --quiet [default:
> "variant_effect_output.txt"]
> --force_overwrite      Force overwriting of output file [default: quit
> if file
>                         exists]
> --original             Writes output as it was in input - must be used
> with --filter
>                         since no consequence data is added [default: off]
> --vcf                  Write output as VCF [default: off]
> --gvf                  Write output as GVF [default: off]
> --fields [field list]  Define a custom output format by specifying a
> comma-separated
>                         list of field names. Field names normally
> present in the
>                         "Extra" field may also be specified, including
> those added by
>                         plugin modules. Can also be used to configure
> VCF output
>                         columns [default: off]
> --species [species]    Species to use [default: "human"]
>
> -t | --terms           Type of consequence terms to output - one of
> "ensembl", "SO",
>                         "NCBI" [default: ensembl]
> --sift=[p|s|b]         Add SIFT [p]rediction, [s]core or [b]oth
> [default: off]
> --polyphen=[p|s|b]     Add PolyPhen [p]rediction, [s]core or [b]oth
> [default: off]
> --regulatory           Look for overlaps with regulatory regions. The
> script can
>                         also call if a variant falls in a high
> information position
>                         within a transcription factor binding site.
> Output lines have
>                         a Feature type of RegulatoryFeature or MotifFeature
>                         [default: off]
> --cell_type [types]    Report only regulatory regions that are found in
> the given cell
>                         type(s). Can be a single cell type or a
> comma-separated list.
>                         The functional type in each cell type is
> reported under
>                         CELL_TYPE in the output. To retrieve a list of
> cell types, use
>                         "--cell_type list" [default: off]
> --custom [file list]   Add custom annotations from tabix-indexed files. See
>                         documentation for full details [default: off]
> --plugin [plugin_name] Use named plugin module [default: off]
> --hgnc                 Add HGNC gene identifiers to output [default: off]
> --hgvs                 Output HGVS identifiers (coding and protein).
> Requires database
>                         connection [default: off]
> --ccds                 Output CCDS transcript identifiers [default: off]
> --xref_refseq          Output aligned RefSeq mRNA identifier for
> transcript. NB: the
>                         RefSeq and Ensembl transcripts aligned in this
> way MAY NOT, AND
>                         FREQUENTLY WILL NOT, match exactly in sequence,
> exon structure
>                         and protein product [default: off]
> --protein              Output Ensembl protein identifer [default: off]
> --gene                 Force output of Ensembl gene identifer - disabled
> by default
>                         unless using --cache or --no_whole_genome
> [default: off]
> --canonical            Indicate if the transcript for this consequence
> is the canonical
>                         transcript for this gene [default: off]
> --domains              Include details of any overlapping protein
> domains [default: off]
> --numbers              Include exon & intron numbers [default: off]
>
> --no_intergenic        Excludes intergenic consequences from the output
> [default: off]
> --coding_only          Only return consequences that fall in the coding
> region of
>                         transcripts [default: off]
> --most_severe          Ouptut only the most severe consequence per
> variation.
>                         Transcript-specific columns will be left blank.
> [default: off]
> --summary              Output only a comma-separated list of all
> consequences per
>                         variation. Transcript-specific columns will be
> left blank.
>                         [default: off]
> --per_gene             Output only the most severe consequence per gene.
> Where more
>                         than one transcript has the same consequence,
> the transcript
>                         chosen is arbitrary. [default: off]
> --check_ref            If specified, checks supplied reference allele
> against stored
>                         entry in Ensembl Core database [default: off]
> --check_existing       If specified, checks for existing co-located
> variations in the
>                         Ensembl Variation database [default: off]
> --failed [0|1]         Include (1) or exclude (0) variants that have
> been flagged as
>                         failed by Ensembl when checking for existing
> variants.
>                         [default: exclude]
> --check_alleles        If specified, the alleles of existing co-located
> variations
>                         are compared to the input; an existing variation
> will only
>                         be reported if no novel allele is in the input
> (strand is
>                         accounted for) [default: off]
> --check_svs            Report overlapping structural variants [default: off]
>
> --filter [filters]     Filter output by consequence type. Use this to
> output only
>                         variants that have at least one consequence type
> matching the
>                         filter. Multiple filters can be used separated
> by ",". By
>                         combining this with --original it is possible to
> run the VEP
>                         iteratively to progressively filter a set of
> variants. See
>                         documentation for full details [default: off]
>
> --check_frequency      Turns on frequency filtering. Use this to include
> or exclude
>                         variants based on the frequency of co-located
> existing
>                         variants in the Ensembl Variation database. You
> must also
>                         specify all of the following --freq flags
> [default: off]
> --freq_pop [pop]       Name of the population to use e.g. hapmap_ceu for
> CEU HapMap,
>                         1kg_yri for YRI 1000 genomes. See documentation
> for more
>                         details
> --freq_freq [freq]     Frequency to use in filter. Must be a number
> between 0 and 0.5
> --freq_gt_lt [gt|lt]   Specify whether the frequency should be greater
> than (gt) or
>                         less than (lt) --freq_freq
> --freq_filter          Specify whether variants that pass the above
> should be included
>    [exclude|include]    or excluded from analysis
> --individual [id]      Consider only alternate alleles present in the
> genotypes of the
>                         specified individual(s). May be a single
> individual, a comma-
>                         separated list or "all" to assess all
> individuals separately.
>                         Each individual and variant combination is given
> on a separate
>                         line of output. Only works with VCF files
> containing individual
>                         genotype data; individual IDs are taken from
> column headers.
> --allow_non_variant    Prints out non-variant lines when using VCF input
> --chr [list]           Select a subset of chromosomes to analyse from
> your file. Any
>                         data not on this chromosome in the input will be
> skipped. The
>                         list can be comma separated, with "-" characters
> representing
>                         a range e.g. 1-5,8,15,X [default: off]
> --gp                   If specified, tries to read GRCh37 position from
> GP field in the
>                         INFO column of a VCF file. Only applies when VCF
> is the input
>                         format and human is the species [default: off]
> --convert              Convert the input file to the output format
> specified.
>    [ensembl|vcf|pileup] Converted output is written to the file specified in
>                         --output_file. No consequence calculation is
> carried out when
>                         doing file conversion. [default: off]
>
> --refseq               Use the otherfeatures database to retrieve
> transcripts - this
>                         database contains RefSeq transcripts (as well as
> CCDS and
>                         Ensembl EST alignments) [default: off]
> --host                 Manually define database host [default:
> "ensembldb.ensembl.org"]
> -u | --user            Database username [default: "anonymous"]
> --port                 Database port [default: 5306]
> --password             Database password [default: no password]
> --genomes              Sets DB connection params for Ensembl Genomes
> [default: off]
> --registry             Registry file to use defines DB connections
> [default: off]
>                         Defining a registry file overrides above
> connection settings.
> --db_version=[number]  Force script to load DBs from a specific Ensembl
> version. Not
>                         advised due to likely incompatibilities between
> API and DB
>
> --no_whole_genome      Run in old-style, non-whole genome mode [default:
> off]
> --buffer_size          Sets the number of variants sent in each batch
> [default: 5000]
>                         Increasing buffer size can retrieve results more
> quickly
>                         but requires more memory. Only applies to whole
> genome mode.
> --cache                Enables read-only use of cache [default: off]
> --dir [directory]      Specify the base cache directory to use [default:
> "$HOME/.vep/"]
> --write_cache          Enable writing to cache [default: off]
> --build [all|list]     Build a complete cache for the selected species.
> Build for all
>                         chromosomes with --build all, or a list of
> chromosomes (see
>                         --chr). DO NOT USE WHEN CONNECTED TO PUBLIC DB
> SERVERS AS THIS
>                         VIOLATES OUR FAIR USAGE POLICY [default: off]
> --compress             Specify utility to decompress cache files - may
> be "gzcat" or
>                         "gzip -dc" Only use if default does not work
> [default: zcat]
> --skip_db_check        ADVANCED! Force the script to use a cache built
> from a different
>                         database than specified with --host. Only use
> this if you are
>                         sure the hosts are compatible (e.g.
> ensembldb.ensembl.org and useastdb.ensembl.org) [default: off]
> --cache_region_size    ADVANCED! The size in base-pairs of the region
> covered by one
>                         file in the cache. [default: 1MB]
>
>     Also, please remind me of (point me to) the plug-in you're using so
>     I can test that.
>
>
> We are using a single plugin that returns the Condel scores. The *Condel
> plugin* can be found on github here:
> https://github.com/ensembl-variation/VEP_plugins
>
> Again, thanks a lot for your support. Please let me know if there is
> anything I can do to help, e.g. with testing the package.
>
> Best,
> Thomas
>

-- 
Valerie Obenchain

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: vobencha at fhcrc.org
Phone:  (206) 667-3158
Fax:    (206) 667-1319