[BioC] ensemblVEP, variant_effect_predictor versions and release schedule

Valerie Obenchain vobencha at fhcrc.org
Mon Jan 13 19:47:35 CET 2014


One caveat worth mentioning: the east coast mirror supports only the
most current version of the API. When using archived versions you need
to use either the cache (much preferred) or a live query against the
European mirror. I've set the 'host' default to 'ensembldb.ensembl.org'
for archived versions.
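
To make the host choice concrete, here is a minimal base R sketch of the
rule described above. The helper name pick_host and its 'current'
argument are hypothetical and not part of ensemblVEP. The use of
'useastdb.ensembl.org' for the current version is an assumption based on
the mirror name that appears elsewhere in this thread.

```r
# Hypothetical helper, not part of ensemblVEP: choose a database host
# for a live query against a given Ensembl/VEP version.
pick_host <- function(version, current) {
    # Assumption: only the single most current version may use the
    # east coast mirror; every archived version goes to the European one.
    if (version == current) {
        "useastdb.ensembl.org"
    } else {
        "ensembldb.ensembl.org"
    }
}

pick_host(74, current = 74)
pick_host(67, current = 74)
```

With a local cache neither host is contacted, which is why the cache
remains the preferred route for archived versions.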

I know you have local data so this isn't an issue for you - just wanted 
to mention it for the wider audience.

Valerie

On 01/13/2014 10:41 AM, Valerie Obenchain wrote:
> OK, this is ready to go. Changes are checked into version 1.3.6.
>
> The default creates a VEPParam compatible with the current API.
>
>     param73 <- VEPParam()
>     param73
>     class: VEPParam73
>     identifier(0):
>     colocatedVariants(0):
>     dataformat(0):
>     basic(0):
>     input(1): species
>     cache(3): dir, dir_cache, dir_plugins
>     output(1): terms
>     filterqc(0):
>     database(2): host, database
>     advanced(1): buffer_size
>     version(2): 73, 74
>     scriptPath(0):
>
> To create a VEPParam for an archived version supply the version to the
> constructor.
>
>     param67 <- VEPParam(67)
>     param67
>     class: VEPParam67
>     basic(0):
>     input(1): species
>     cache(1): dir
>     output(1): terms
>     filterqc(0):
>     database(1): host
>     advanced(1): buffer_size
>     version(1): 67
>     scriptPath(0):
>
> supportedVEP() lists all classes and supported versions. The idea is to
> only create a new subclass when a substantial change is made to the API.
> You can see that VEPParam73 supports both 73 and 74. I'll keep adding
> versions to this class until the interface requires a major change.
>
>     supportedVEP()
>     $VEPParam67
>     [1] 67
>
>     $VEPParam73
>     [1] 73 74
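
The class-to-version mapping shown by supportedVEP() can be searched
with a few lines of base R. This is only a sketch: the list below is a
hard-coded copy of the output above so that it runs without ensemblVEP
installed, and the helper class_for_version is hypothetical, not a
function from the package.

```r
# Hard-coded copy of the supportedVEP() output quoted in this message.
supported <- list(VEPParam67 = 67, VEPParam73 = c(73, 74))

# Hypothetical helper: return the name of the class that lists
# 'version', or NA_character_ when no class covers it.
class_for_version <- function(version, supported) {
    hit <- vapply(supported, function(v) version %in% v, logical(1))
    if (any(hit)) names(supported)[which(hit)[1]] else NA_character_
}

class_for_version(74, supported)
class_for_version(70, supported)
```

A version that no class lists (70 in this example) yields NA, which a
caller could turn into an informative error.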
>
>
> To specify a non-standard location of your .pl script use the
> scriptPath<- setter. This was added to handle the case where multiple
> versions are installed locally.
>
> scriptPath(param67) <- "fullPathToScript/variant_effect_predictor.pl"
>
> These examples and more are on ?VEPParam. Let me know how it goes.
>
> Valerie
>
>
>
> On 01/07/2014 02:01 AM, Thomas Sandmann wrote:
>> Hi Valerie,
>>
>> thanks a lot for supporting legacy versions of the ensembl database /
>> variant_effect_predictor.pl script.
>>
>>     I assume you're still using version 67 and have the data cached.
>>
>>
>> Yes, that's right. We use ensembl release 67 together with the
>> corresponding variant_effect_predictor.pl script version 2.5.
>>
>>     How are you calling the script right now?
>>
>>
>> As a temporary fix, I am using the ensemblVEP method from ensemblVEP
>> version 1.1.3 (BioC svn revision r76970). I think this is the last
>> version that worked with ensembl release 67 for me.
>>
>> I modified the default parameters in the VEPParam object by creating a
>> temporary "gVEPParam" class for use with our in-house ensembl release
>> 67. This object is passed to the ensemblVEP method with the default
>> parameters listed below. (Please note that our installation of
>> variant_effect_predictor.pl connects to our in-house database by
>> default.)
>>
>> Formal class 'gVEPParam' with 6 slots
>>    ..@ basic   :List of 5
>>    .. ..$ verbose    : logi FALSE
>>    .. ..$ quiet      : logi FALSE
>>    .. ..$ no_progress: logi TRUE
>>    .. ..$ config     : chr(0)
>>    .. ..$ everything : logi FALSE
>>    ..@ input   :List of 4
>>    .. ..$ species        : chr "homo_sapiens"
>>    .. ..$ format         : chr(0)
>>    .. ..$ output_file    : chr(0)
>>    .. ..$ force_overwrite: logi FALSE
>>    ..@ output  :List of 24
>>    .. ..$ terms      : chr "so"
>>    .. ..$ sift       : chr "b"
>>    .. ..$ polyphen   : chr "b"
>>    .. ..$ regulatory : logi FALSE
>>    .. ..$ cell_type  : chr(0)
>>    .. ..$ hgvs       : logi TRUE
>>    .. ..$ hgnc       : logi TRUE
>>    .. ..$ gene       : logi TRUE
>>    .. ..$ protein    : logi TRUE
>>    .. ..$ ccds       : logi TRUE
>>    .. ..$ canonical  : logi TRUE
>>    .. ..$ xref_refseq: logi FALSE
>>    .. ..$ numbers    : logi TRUE
>>    .. ..$ domains    : logi TRUE
>>    .. ..$ most_severe: logi FALSE
>>    .. ..$ summary    : logi FALSE
>>    .. ..$ per_gene   : logi FALSE
>>    .. ..$ convert    : chr(0)
>>    .. ..$ fields     : chr(0)
>>    .. ..$ vcf        : logi FALSE
>>    .. ..$ gvf        : logi FALSE
>>    .. ..$ original   : logi FALSE
>>    .. ..$ custom     : chr(0)
>>    .. ..$ plugin     : chr "GNECondel,/Plugins/config/Condel/config"
>>    ..@ filterqc:List of 17
>>    .. ..$ check_ref        : logi FALSE
>>    .. ..$ coding_only      : logi FALSE
>>    .. ..$ check_existing   : logi TRUE
>>    .. ..$ check_alleles    : logi FALSE
>>    .. ..$ check_svs        : logi FALSE
>>    .. ..$ individual       : chr(0)
>>    .. ..$ chr              : chr(0)
>>    .. ..$ no_intergenic    : logi FALSE
>>    .. ..$ filter_common    : logi FALSE
>>    .. ..$ check_frequency  : logi FALSE
>>    .. ..$ freq_pop         : chr(0)
>>    .. ..$ freq_freq        : logi FALSE
>>    .. ..$ freq_gt_lt       : chr(0)
>>    .. ..$ freq_filter      : chr(0)
>>    .. ..$ filter           : chr(0)
>>    .. ..$ failed           : logi FALSE
>>    .. ..$ allow_non_variant: logi FALSE
>>    ..@ database:List of 9
>>    .. ..$ database  : logi FALSE
>>    .. ..$ host      : chr "useastdb.ensembl.org"
>>    .. ..$ user      : chr(0)
>>    .. ..$ password  : chr(0)
>>    .. ..$ port      : num(0)
>>    .. ..$ genomes   : logi FALSE
>>    .. ..$ refseq    : logi FALSE
>>    .. ..$ db_version: num(0)
>>    .. ..$ registry  : chr(0)
>>    ..@ advanced:List of 4
>>    .. ..$ no_whole_genome: logi FALSE
>>    .. ..$ buffer_size    : num 5000
>>    .. ..$ compress       : chr(0)
>>    .. ..$ skip_db_check  : logi FALSE
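
As an illustration of how a parameter list like the one above might map
onto the script's command line, here is a minimal base R sketch. It is
not the ensemblVEP implementation. The helper opts_to_flags is
hypothetical, and the "--name value" form is an assumption about what
the script accepts. Options that take 0 or 1 on the command line, such
as --failed, would need special handling that is not shown here.

```r
# Hypothetical helper, not the ensemblVEP implementation: turn a named
# list of options into a character vector of command-line flags.
opts_to_flags <- function(opts) {
    flags <- character(0)
    for (name in names(opts)) {
        value <- opts[[name]]
        if (length(value) == 0L) next    # unset option, e.g. chr(0)
        if (is.logical(value)) {
            # A logical switch is emitted only when TRUE.
            if (isTRUE(value)) flags <- c(flags, paste0("--", name))
        } else {
            # Any other option is emitted as a flag followed by its value.
            flags <- c(flags, paste0("--", name), as.character(value))
        }
    }
    flags
}

opts <- list(terms = "so", sift = "b", hgvs = TRUE, regulatory = FALSE,
             cell_type = character(0), buffer_size = 5000)
opts_to_flags(opts)
```

Unset entries and FALSE switches are dropped, so only options that
differ from the script's defaults reach the command line.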
>>
>>     Do you use the --cache flag or --offline flag?
>>
>>
>> I am not using the --cache flag right now, because version 2.5 of the
>> variant_effect_predictor.pl script does not allow me to specify the
>> Plugin directory and the cache directory separately. (This was only
>> introduced in a later version of the perl script.)
>>
>> The --offline flag does not seem to be available in
>> variant_effect_predictor.pl version 2.5; at least I cannot find it in
>> the listed arguments (provided below for reference).
>>
>> version 2.5
>>
>> Options
>> =======
>>
>> --help                 Display this message and quit
>> --verbose              Display verbose output as the script runs [default: off]
>> --quiet                Suppress status and warning messages [default: off]
>> --no_progress          Suppress progress bars [default: off]
>>
>> --config               Load configuration from file. Any command line options
>>                        specified overwrite those in the file [default: off]
>> --everything           Shortcut switch to turn on commonly used options. See web
>>                        documentation for details [default: off]
>>
>> -i | --input_file      Input file - if not specified, reads from STDIN. Files
>>                        may be gzip compressed.
>> --format               Specify input file format - one of "ensembl", "pileup",
>>                        "vcf", "hgvs", "id" or "guess" to try and work out format.
>> -o | --output_file     Output file. Write to STDOUT by specifying -o STDOUT - this
>>                        will force --quiet [default: "variant_effect_output.txt"]
>> --force_overwrite      Force overwriting of output file [default: quit if file
>>                        exists]
>> --original             Writes output as it was in input - must be used with --filter
>>                        since no consequence data is added [default: off]
>> --vcf                  Write output as VCF [default: off]
>> --gvf                  Write output as GVF [default: off]
>> --fields [field list]  Define a custom output format by specifying a comma-separated
>>                        list of field names. Field names normally present in the
>>                        "Extra" field may also be specified, including those added by
>>                        plugin modules. Can also be used to configure VCF output
>>                        columns [default: off]
>> --species [species]    Species to use [default: "human"]
>>
>> -t | --terms           Type of consequence terms to output - one of "ensembl", "SO",
>>                        "NCBI" [default: ensembl]
>> --sift=[p|s|b]         Add SIFT [p]rediction, [s]core or [b]oth [default: off]
>> --polyphen=[p|s|b]     Add PolyPhen [p]rediction, [s]core or [b]oth [default: off]
>> --regulatory           Look for overlaps with regulatory regions. The script can
>>                        also call if a variant falls in a high information position
>>                        within a transcription factor binding site. Output lines have
>>                        a Feature type of RegulatoryFeature or MotifFeature
>>                        [default: off]
>> --cell_type [types]    Report only regulatory regions that are found in the given cell
>>                        type(s). Can be a single cell type or a comma-separated list.
>>                        The functional type in each cell type is reported under
>>                        CELL_TYPE in the output. To retrieve a list of cell types, use
>>                        "--cell_type list" [default: off]
>> --custom [file list]   Add custom annotations from tabix-indexed files. See
>>                        documentation for full details [default: off]
>> --plugin [plugin_name] Use named plugin module [default: off]
>> --hgnc                 Add HGNC gene identifiers to output [default: off]
>> --hgvs                 Output HGVS identifiers (coding and protein). Requires database
>>                        connection [default: off]
>> --ccds                 Output CCDS transcript identifiers [default: off]
>> --xref_refseq          Output aligned RefSeq mRNA identifier for transcript. NB: the
>>                        RefSeq and Ensembl transcripts aligned in this way MAY NOT, AND
>>                        FREQUENTLY WILL NOT, match exactly in sequence, exon structure
>>                        and protein product [default: off]
>> --protein              Output Ensembl protein identifer [default: off]
>> --gene                 Force output of Ensembl gene identifer - disabled by default
>>                        unless using --cache or --no_whole_genome [default: off]
>> --canonical            Indicate if the transcript for this consequence is the canonical
>>                        transcript for this gene [default: off]
>> --domains              Include details of any overlapping protein domains [default: off]
>> --numbers              Include exon & intron numbers [default: off]
>>
>> --no_intergenic        Excludes intergenic consequences from the output [default: off]
>> --coding_only          Only return consequences that fall in the coding region of
>>                        transcripts [default: off]
>> --most_severe          Ouptut only the most severe consequence per variation.
>>                        Transcript-specific columns will be left blank. [default: off]
>> --summary              Output only a comma-separated list of all consequences per
>>                        variation. Transcript-specific columns will be left blank.
>>                        [default: off]
>> --per_gene             Output only the most severe consequence per gene. Where more
>>                        than one transcript has the same consequence, the transcript
>>                        chosen is arbitrary. [default: off]
>> --check_ref            If specified, checks supplied reference allele against stored
>>                        entry in Ensembl Core database [default: off]
>> --check_existing       If specified, checks for existing co-located variations in the
>>                        Ensembl Variation database [default: off]
>> --failed [0|1]         Include (1) or exclude (0) variants that have been flagged as
>>                        failed by Ensembl when checking for existing variants.
>>                        [default: exclude]
>> --check_alleles        If specified, the alleles of existing co-located variations
>>                        are compared to the input; an existing variation will only
>>                        be reported if no novel allele is in the input (strand is
>>                        accounted for) [default: off]
>> --check_svs            Report overlapping structural variants [default: off]
>>
>> --filter [filters]     Filter output by consequence type. Use this to output only
>>                        variants that have at least one consequence type matching the
>>                        filter. Multiple filters can be used separated by ",". By
>>                        combining this with --original it is possible to run the VEP
>>                        iteratively to progressively filter a set of variants. See
>>                        documentation for full details [default: off]
>>
>> --check_frequency      Turns on frequency filtering. Use this to include or exclude
>>                        variants based on the frequency of co-located existing
>>                        variants in the Ensembl Variation database. You must also
>>                        specify all of the following --freq flags [default: off]
>> --freq_pop [pop]       Name of the population to use e.g. hapmap_ceu for CEU HapMap,
>>                        1kg_yri for YRI 1000 genomes. See documentation for more
>>                        details
>> --freq_freq [freq]     Frequency to use in filter. Must be a number between 0 and 0.5
>> --freq_gt_lt [gt|lt]   Specify whether the frequency should be greater than (gt) or
>>                        less than (lt) --freq_freq
>> --freq_filter          Specify whether variants that pass the above should be included
>>   [exclude|include]    or excluded from analysis
>> --individual [id]      Consider only alternate alleles present in the genotypes of the
>>                        specified individual(s). May be a single individual, a
>>                        comma-separated list or "all" to assess all individuals
>>                        separately. Each individual and variant combination is given
>>                        on a separate line of output. Only works with VCF files
>>                        containing individual genotype data; individual IDs are taken
>>                        from column headers.
>> --allow_non_variant    Prints out non-variant lines when using VCF input
>> --chr [list]           Select a subset of chromosomes to analyse from your file. Any
>>                        data not on this chromosome in the input will be skipped. The
>>                        list can be comma separated, with "-" characters representing
>>                        a range e.g. 1-5,8,15,X [default: off]
>> --gp                   If specified, tries to read GRCh37 position from GP field in the
>>                        INFO column of a VCF file. Only applies when VCF is the input
>>                        format and human is the species [default: off]
>> --convert              Convert the input file to the output format specified.
>>   [ensembl|vcf|pileup] Converted output is written to the file specified in
>>                        --output_file. No consequence calculation is carried out when
>>                        doing file conversion. [default: off]
>>
>> --refseq               Use the otherfeatures database to retrieve transcripts - this
>>                        database contains RefSeq transcripts (as well as CCDS and
>>                        Ensembl EST alignments) [default: off]
>> --host                 Manually define database host [default: "ensembldb.ensembl.org"]
>> -u | --user            Database username [default: "anonymous"]
>> --port                 Database port [default: 5306]
>> --password             Database password [default: no password]
>> --genomes              Sets DB connection params for Ensembl Genomes [default: off]
>> --registry             Registry file to use defines DB connections [default: off]
>>                        Defining a registry file overrides above connection settings.
>> --db_version=[number]  Force script to load DBs from a specific Ensembl version. Not
>>                        advised due to likely incompatibilities between API and DB
>>
>> --no_whole_genome      Run in old-style, non-whole genome mode [default: off]
>> --buffer_size          Sets the number of variants sent in each batch [default: 5000]
>>                        Increasing buffer size can retrieve results more quickly
>>                        but requires more memory. Only applies to whole genome mode.
>> --cache                Enables read-only use of cache [default: off]
>> --dir [directory]      Specify the base cache directory to use [default: "$HOME/.vep/"]
>> --write_cache          Enable writing to cache [default: off]
>> --build [all|list]     Build a complete cache for the selected species. Build for all
>>                        chromosomes with --build all, or a list of chromosomes (see
>>                        --chr). DO NOT USE WHEN CONNECTED TO PUBLIC DB SERVERS AS THIS
>>                        VIOLATES OUR FAIR USAGE POLICY [default: off]
>> --compress             Specify utility to decompress cache files - may be "gzcat" or
>>                        "gzip -dc" Only use if default does not work [default: zcat]
>> --skip_db_check        ADVANCED! Force the script to use a cache built from a different
>>                        database than specified with --host. Only use this if you are
>>                        sure the hosts are compatible (e.g. ensembldb.ensembl.org and
>>                        useastdb.ensembl.org) [default: off]
>> --cache_region_size    ADVANCED! The size in base-pairs of the region covered by one
>>                        file in the cache. [default: 1MB]
>>
>>     Also, please remind me of (point me to) the plug-in you're using so
>>     I can test that.
>>
>>
>> We are using a single plugin that returns the Condel scores. The *Condel
>> plugin* can be found on github here:
>> https://github.com/ensembl-variation/VEP_plugins
>>
>> Again, thanks a lot for your support. Please let me know if there is
>> anything I can do to help, e.g. with testing the package.
>>
>> Best,
>> Thomas
>>
>
>


-- 
Valerie Obenchain

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: vobencha at fhcrc.org
Phone:  (206) 667-3158
Fax:    (206) 667-1319


