[BioC] ensemblVEP, variant_effect_predictor versions and release schedule
Valerie Obenchain
vobencha at fhcrc.org
Mon Jan 13 19:47:35 CET 2014
One caveat I should mention: the east coast mirror (useastdb.ensembl.org)
only supports the most current version of the API. When using archived
versions you need to use the cache (much preferred) or run a live query
against the European mirror. I've set the 'host' default to
'ensembldb.ensembl.org' for archived versions.
I know you have local data so this isn't an issue for you - just wanted
to mention it for the wider audience.
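
As a rough sketch of that rule in base R (the helper name choose_vep_host
and its current_version argument are made up for illustration and are not
part of ensemblVEP; the package already sets the default for you):

```r
# Hypothetical helper, NOT part of ensemblVEP: pick a database host for a
# live query, given the requested API version and the current API version.
choose_vep_host <- function(version, current_version) {
  stopifnot(is.numeric(version), length(version) == 1L,
            is.numeric(current_version), length(current_version) == 1L)
  if (version < current_version) {
    # Archived version: a live query has to go to the European mirror.
    "ensembldb.ensembl.org"
  } else {
    # Current version: the east coast mirror is an option.
    "useastdb.ensembl.org"
  }
}

choose_vep_host(67, current_version = 74)  # "ensembldb.ensembl.org"
choose_vep_host(74, current_version = 74)  # "useastdb.ensembl.org"
```

The current version is passed in rather than hard-coded because it changes
with every Ensembl release.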
Valerie
On 01/13/2014 10:41 AM, Valerie Obenchain wrote:
> OK, this is ready to go. Changes are checked into v 1.3.6.
>
> The default creates a VEPParam compatible with the current API.
>
>     > param73 <- VEPParam()
>     > param73
>     class: VEPParam73
>     identifier(0):
>     colocatedVariants(0):
>     dataformat(0):
>     basic(0):
>     input(1): species
>     cache(3): dir, dir_cache, dir_plugins
>     output(1): terms
>     filterqc(0):
>     database(2): host, database
>     advanced(1): buffer_size
>     version(2): 73, 74
>     scriptPath(0):
>
> To create a VEPParam for an archived version supply the version to the
> constructor.
>
>     > param67 <- VEPParam(67)
>     > param67
>     class: VEPParam67
>     basic(0):
>     input(1): species
>     cache(1): dir
>     output(1): terms
>     filterqc(0):
>     database(1): host
>     advanced(1): buffer_size
>     version(1): 67
>     scriptPath(0):
>
> supportedVEP() lists all classes and supported versions. The idea is to
> only create a new subclass when a substantial change is made to the API.
> You can see that VEPParam73 supports both 73 and 74. I'll keep adding
> versions to this class until the interface requires a major change.
>
>     > supportedVEP()
>     $VEPParam67
>     [1] 67
>
>     $VEPParam73
>     [1] 73 74
>
>
> To specify a non-standard location of your .pl script use the
> scriptPath<- setter. This was added to handle the case where multiple
> versions are installed locally.
>
> scriptPath(param67) <- "fullPathToScript/variant_effect_predictor.pl"
>
> These examples and more are on ?VEPParam. Let me know how it goes.
>
> Valerie
>
>
>
> On 01/07/2014 02:01 AM, Thomas Sandmann wrote:
>> Hi Valerie,
>>
>> thanks a lot for supporting legacy versions of the ensembl database /
>> variant_effect_predictor.pl script.
>>
>>> I assume you're still using version 67 and have the data cached.
>>
>>
>> Yes, that's right. We use ensembl release 67 together with the
>> corresponding variant_effect_predictor.pl script version 2.5.
>>
>>> How are you calling the script right now?
>>
>>
>> As a temporary fix, I am using the ensemblVEP method from ensemblVEP
>> version 1.1.3 (BioC svn revision r76970). I think this is the last
>> version that worked with ensembl release 67 for me.
>>
>> I modified the default parameters in the VEPParam object by creating a
>> temporary "gVEPParam" class for use with our in-house ensembl release
>> 67. This object is passed to the ensemblVEP method with the default
>> parameters listed below. (Please note that our installation of
>> variant_effect_predictor.pl by default connects to our in-house
>> database.)
>>
>> Formal class 'gVEPParam' with 6 slots
>> ..@ basic :List of 5
>> .. ..$ verbose : logi FALSE
>> .. ..$ quiet : logi FALSE
>> .. ..$ no_progress: logi TRUE
>> .. ..$ config : chr(0)
>> .. ..$ everything : logi FALSE
>> ..@ input :List of 4
>> .. ..$ species : chr "homo_sapiens"
>> .. ..$ format : chr(0)
>> .. ..$ output_file : chr(0)
>> .. ..$ force_overwrite: logi FALSE
>> ..@ output :List of 24
>> .. ..$ terms : chr "so"
>> .. ..$ sift : chr "b"
>> .. ..$ polyphen : chr "b"
>> .. ..$ regulatory : logi FALSE
>> .. ..$ cell_type : chr(0)
>> .. ..$ hgvs : logi TRUE
>> .. ..$ hgnc : logi TRUE
>> .. ..$ gene : logi TRUE
>> .. ..$ protein : logi TRUE
>> .. ..$ ccds : logi TRUE
>> .. ..$ canonical : logi TRUE
>> .. ..$ xref_refseq: logi FALSE
>> .. ..$ numbers : logi TRUE
>> .. ..$ domains : logi TRUE
>> .. ..$ most_severe: logi FALSE
>> .. ..$ summary : logi FALSE
>> .. ..$ per_gene : logi FALSE
>> .. ..$ convert : chr(0)
>> .. ..$ fields : chr(0)
>> .. ..$ vcf : logi FALSE
>> .. ..$ gvf : logi FALSE
>> .. ..$ original : logi FALSE
>> .. ..$ custom : chr(0)
>> .. ..$ plugin : chr "GNECondel,/Plugins/config/Condel/config"
>> ..@ filterqc:List of 17
>> .. ..$ check_ref : logi FALSE
>> .. ..$ coding_only : logi FALSE
>> .. ..$ check_existing : logi TRUE
>> .. ..$ check_alleles : logi FALSE
>> .. ..$ check_svs : logi FALSE
>> .. ..$ individual : chr(0)
>> .. ..$ chr : chr(0)
>> .. ..$ no_intergenic : logi FALSE
>> .. ..$ filter_common : logi FALSE
>> .. ..$ check_frequency : logi FALSE
>> .. ..$ freq_pop : chr(0)
>> .. ..$ freq_freq : logi FALSE
>> .. ..$ freq_gt_lt : chr(0)
>> .. ..$ freq_filter : chr(0)
>> .. ..$ filter : chr(0)
>> .. ..$ failed : logi FALSE
>> .. ..$ allow_non_variant: logi FALSE
>> ..@ database:List of 9
>> .. ..$ database : logi FALSE
>> .. ..$ host : chr "useastdb.ensembl.org"
>> .. ..$ user : chr(0)
>> .. ..$ password : chr(0)
>> .. ..$ port : num(0)
>> .. ..$ genomes : logi FALSE
>> .. ..$ refseq : logi FALSE
>> .. ..$ db_version: num(0)
>> .. ..$ registry : chr(0)
>> ..@ advanced:List of 4
>> .. ..$ no_whole_genome: logi FALSE
>> .. ..$ buffer_size : num 5000
>> .. ..$ compress : chr(0)
>> .. ..$ skip_db_check : logi FALSE
>>
>>> Do you use the --cache flag or --offline flag?
>>
>>
>> I am not using the --cache flag right now, because version 2.5 of the
>> variant_effect_predictor.pl script does not allow me to specify the
>> plugin directory and the cache directory separately. (This was only
>> introduced in a later version of the Perl script.)
>>
>> The --offline flag does not seem to be available in
>> variant_effect_predictor.pl version 2.5; at least, I cannot find it in
>> the listed arguments (provided below for reference).
>>
>> version 2.5
>>
>> Options
>> =======
>>
>> --help Display this message and quit
>> --verbose Display verbose output as the script runs
>> [default: off]
>> --quiet Suppress status and warning messages [default:
>> off]
>> --no_progress Suppress progress bars [default: off]
>>
>> --config Load configuration from file. Any command line
>> options
>> specified overwrite those in the file
>> [default: off]
>> --everything Shortcut switch to turn on commonly used options.
>> See web
>> documentation for details [default: off]
>>
>> -i | --input_file Input file - if not specified, reads from STDIN.
>> Files
>> may be gzip compressed.
>> --format Specify input file format - one of "ensembl",
>> "pileup",
>> "vcf", "hgvs", "id" or "guess" to try and work
>> out format.
>> -o | --output_file Output file. Write to STDOUT by specifying -o
>> STDOUT - this
>> will force --quiet [default:
>> "variant_effect_output.txt"]
>> --force_overwrite Force overwriting of output file [default: quit
>> if file
>> exists]
>> --original Writes output as it was in input - must be used
>> with --filter
>> since no consequence data is added [default: off]
>> --vcf Write output as VCF [default: off]
>> --gvf Write output as GVF [default: off]
>> --fields [field list] Define a custom output format by specifying a
>> comma-separated
>> list of field names. Field names normally
>> present in the
>> "Extra" field may also be specified, including
>> those added by
>> plugin modules. Can also be used to configure
>> VCF output
>> columns [default: off]
>> --species [species] Species to use [default: "human"]
>>
>> -t | --terms Type of consequence terms to output - one of
>> "ensembl", "SO",
>> "NCBI" [default: ensembl]
>> --sift=[p|s|b] Add SIFT [p]rediction, [s]core or [b]oth
>> [default: off]
>> --polyphen=[p|s|b] Add PolyPhen [p]rediction, [s]core or [b]oth
>> [default: off]
>> --regulatory Look for overlaps with regulatory regions. The
>> script can
>> also call if a variant falls in a high
>> information position
>> within a transcription factor binding site.
>> Output lines have
>> a Feature type of RegulatoryFeature or
>> MotifFeature
>> [default: off]
>> --cell_type [types] Report only regulatory regions that are found in
>> the given cell
>> type(s). Can be a single cell type or a
>> comma-separated list.
>> The functional type in each cell type is
>> reported under
>> CELL_TYPE in the output. To retrieve a list of
>> cell types, use
>> "--cell_type list" [default: off]
>> --custom [file list] Add custom annotations from tabix-indexed
>> files. See
>> documentation for full details [default: off]
>> --plugin [plugin_name] Use named plugin module [default: off]
>> --hgnc Add HGNC gene identifiers to output [default: off]
>> --hgvs Output HGVS identifiers (coding and protein).
>> Requires database
>> connection [default: off]
>> --ccds Output CCDS transcript identifiers [default: off]
>> --xref_refseq Output aligned RefSeq mRNA identifier for
>> transcript. NB: the
>> RefSeq and Ensembl transcripts aligned in this
>> way MAY NOT, AND
>> FREQUENTLY WILL NOT, match exactly in sequence,
>> exon structure
>> and protein product [default: off]
>> --protein Output Ensembl protein identifer [default: off]
>> --gene Force output of Ensembl gene identifer - disabled
>> by default
>> unless using --cache or --no_whole_genome
>> [default: off]
>> --canonical Indicate if the transcript for this consequence
>> is the canonical
>> transcript for this gene [default: off]
>> --domains Include details of any overlapping protein
>> domains [default: off]
>> --numbers Include exon & intron numbers [default: off]
>>
>> --no_intergenic Excludes intergenic consequences from the output
>> [default: off]
>> --coding_only Only return consequences that fall in the coding
>> region of
>> transcripts [default: off]
>> --most_severe Ouptut only the most severe consequence per
>> variation.
>> Transcript-specific columns will be left blank.
>> [default: off]
>> --summary Output only a comma-separated list of all
>> consequences per
>> variation. Transcript-specific columns will be
>> left blank.
>> [default: off]
>> --per_gene Output only the most severe consequence per gene.
>> Where more
>> than one transcript has the same consequence,
>> the transcript
>> chosen is arbitrary. [default: off]
>> --check_ref If specified, checks supplied reference allele
>> against stored
>> entry in Ensembl Core database [default: off]
>> --check_existing If specified, checks for existing co-located
>> variations in the
>> Ensembl Variation database [default: off]
>> --failed [0|1] Include (1) or exclude (0) variants that have
>> been flagged as
>> failed by Ensembl when checking for existing
>> variants.
>> [default: exclude]
>> --check_alleles If specified, the alleles of existing co-located
>> variations
>> are compared to the input; an existing variation
>> will only
>> be reported if no novel allele is in the input
>> (strand is
>> accounted for) [default: off]
>> --check_svs Report overlapping structural variants
>> [default: off]
>>
>> --filter [filters] Filter output by consequence type. Use this to
>> output only
>> variants that have at least one consequence type
>> matching the
>> filter. Multiple filters can be used separated
>> by ",". By
>> combining this with --original it is possible to
>> run the VEP
>> iteratively to progressively filter a set of
>> variants. See
>> documentation for full details [default: off]
>>
>> --check_frequency Turns on frequency filtering. Use this to include
>> or exclude
>> variants based on the frequency of co-located
>> existing
>> variants in the Ensembl Variation database. You
>> must also
>> specify all of the following --freq flags
>> [default: off]
>> --freq_pop [pop] Name of the population to use e.g. hapmap_ceu for
>> CEU HapMap,
>> 1kg_yri for YRI 1000 genomes. See documentation
>> for more
>> details
>> --freq_freq [freq] Frequency to use in filter. Must be a number
>> between 0 and 0.5
>> --freq_gt_lt [gt|lt] Specify whether the frequency should be greater
>> than (gt) or
>> less than (lt) --freq_freq
>> --freq_filter Specify whether variants that pass the above
>> [exclude|include] should be included or excluded from analysis
>> --individual [id] Consider only alternate alleles present in the
>> genotypes of the
>> specified individual(s). May be a single
>> individual, a comma-
>> separated list or "all" to assess all
>> individuals separately.
>> Each individual and variant combination is given
>> on a separate
>> line of output. Only works with VCF files
>> containing individual
>> genotype data; individual IDs are taken from
>> column headers.
>> --allow_non_variant Prints out non-variant lines when using VCF input
>> --chr [list] Select a subset of chromosomes to analyse from
>> your file. Any
>> data not on this chromosome in the input will be
>> skipped. The
>> list can be comma separated, with "-" characters
>> representing
>> a range e.g. 1-5,8,15,X [default: off]
>> --gp If specified, tries to read GRCh37 position from
>> GP field in the
>> INFO column of a VCF file. Only applies when VCF
>> is the input
>> format and human is the species [default: off]
>> --convert Convert the input file to the output format
>> [ensembl|vcf|pileup] specified. Converted output is written to the file
>> specified in --output_file. No consequence
>> calculation is carried out when doing file
>> conversion. [default: off]
>>
>> --refseq Use the otherfeatures database to retrieve
>> transcripts - this
>> database contains RefSeq transcripts (as well as
>> CCDS and
>> Ensembl EST alignments) [default: off]
>> --host Manually define database host [default:
>> "ensembldb.ensembl.org"]
>> -u | --user Database username [default: "anonymous"]
>> --port Database port [default: 5306]
>> --password Database password [default: no password]
>> --genomes Sets DB connection params for Ensembl Genomes
>> [default: off]
>> --registry Registry file to use defines DB connections
>> [default: off]
>> Defining a registry file overrides above
>> connection settings.
>> --db_version=[number] Force script to load DBs from a specific Ensembl
>> version. Not
>> advised due to likely incompatibilities between
>> API and DB
>>
>> --no_whole_genome Run in old-style, non-whole genome mode [default:
>> off]
>> --buffer_size Sets the number of variants sent in each batch
>> [default: 5000]
>> Increasing buffer size can retrieve results more
>> quickly
>> but requires more memory. Only applies to whole
>> genome mode.
>> --cache Enables read-only use of cache [default: off]
>> --dir [directory] Specify the base cache directory to use [default:
>> "$HOME/.vep/"]
>> --write_cache Enable writing to cache [default: off]
>> --build [all|list] Build a complete cache for the selected species.
>> Build for all
>> chromosomes with --build all, or a list of
>> chromosomes (see
>> --chr). DO NOT USE WHEN CONNECTED TO PUBLIC DB
>> SERVERS AS THIS
>> VIOLATES OUR FAIR USAGE POLICY [default: off]
>> --compress Specify utility to decompress cache files - may
>> be "gzcat" or
>> "gzip -dc" Only use if default does not work
>> [default: zcat]
>> --skip_db_check ADVANCED! Force the script to use a cache built
>> from a different
>> database than specified with --host. Only use
>> this if you are
>> sure the hosts are compatible (e.g.
>> ensembldb.ensembl.org and useastdb.ensembl.org) [default: off]
>> --cache_region_size ADVANCED! The size in base-pairs of the region
>> covered by one
>> file in the cache. [default: 1MB]
>>
>>> Also, please remind me of (point me to) the plug-in you're using so
>>> I can test that.
>>
>>
>> We are using a single plugin that returns the Condel scores. The Condel
>> plugin can be found on GitHub here:
>> https://github.com/ensembl-variation/VEP_plugins
>>
>> Again, thanks a lot for your support. Please let me know if there is
>> anything I can do to help, e.g. with testing the package.
>>
>> Best,
>> Thomas
>>
>
>
--
Valerie Obenchain
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: vobencha at fhcrc.org
Phone: (206) 667-3158
Fax: (206) 667-1319