[BioC] ensemblVEP, variant_effect_predictor versions and release schedule
Valerie Obenchain
vobencha at fhcrc.org
Mon Jan 13 19:41:58 CET 2014
OK, this is ready to go. The changes are checked into version 1.3.6.
The default constructor creates a VEPParam compatible with the current API.
    > param73 <- VEPParam()
    > param73
    class: VEPParam73
    identifier(0):
    colocatedVariants(0):
    dataformat(0):
    basic(0):
    input(1): species
    cache(3): dir, dir_cache, dir_plugins
    output(1): terms
    filterqc(0):
    database(2): host, database
    advanced(1): buffer_size
    version(2): 73, 74
    scriptPath(0):
To create a VEPParam for an archived version supply the version to the
constructor.
    > param67 <- VEPParam(67)
    > param67
    class: VEPParam67
    basic(0):
    input(1): species
    cache(1): dir
    output(1): terms
    filterqc(0):
    database(1): host
    advanced(1): buffer_size
    version(1): 67
    scriptPath(0):
supportedVEP() lists all classes and supported versions. The idea is to
only create a new subclass when a substantial change is made to the API.
You can see that VEPParam73 supports both 73 and 74. I'll keep adding
versions to this class until the interface requires a major change.
    > supportedVEP()
    $VEPParam67
    [1] 67

    $VEPParam73
    [1] 73 74
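To make the version-to-class idea concrete, here is a minimal sketch in base R. It does not call ensemblVEP at all: `supported` is a hand-written list that copies the supportedVEP() output shown above, and `classForVersion()` is a hypothetical helper for illustration, not a function in the package.

```r
# Hand-copied from the supportedVEP() output above (an assumption: the real
# list will grow as versions are added to the package).
supported <- list(VEPParam67 = 67, VEPParam73 = c(73, 74))

# Hypothetical helper, not part of ensemblVEP: return the name of the
# VEPParam subclass whose supported versions include `version`.
classForVersion <- function(version, supported) {
    hit <- vapply(supported, function(v) version %in% v, logical(1))
    if (!any(hit))
        stop("VEP version ", version, " is not supported")
    names(supported)[hit][1]
}

classForVersion(74, supported)   # "VEPParam73"
classForVersion(67, supported)   # "VEPParam67"
```

A version that appears in none of the entries (70, say) stops with an error rather than silently falling back to another class.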
To specify a non-standard location of your .pl script use the
scriptPath<- setter. This was added to handle the case where multiple
versions are installed locally.
    > scriptPath(param67) <- "fullPathToScript/variant_effect_predictor.pl"
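If several versions are installed side by side, one way to keep track of them is a small lookup from version to script location. The sketch below is base R only; the paths under /opt/vep and the helper `scriptForVersion()` are made up for illustration and are not part of ensemblVEP.

```r
# Hypothetical install locations; replace with the paths on your system.
vepScripts <- c("67" = "/opt/vep/67/variant_effect_predictor.pl",
                "73" = "/opt/vep/73/variant_effect_predictor.pl")

# Hypothetical helper: look up the script registered for a VEP version.
scriptForVersion <- function(version, scripts = vepScripts) {
    path <- scripts[as.character(version)]
    if (is.na(path))
        stop("no local VEP script registered for version ", version)
    unname(path)
}

scriptForVersion(67)

# With ensemblVEP loaded, the result would be handed to the setter:
# scriptPath(param67) <- scriptForVersion(67)
```

Asking for a version with no registered script fails early, which is easier to debug than a wrong script being picked up from the PATH.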
These examples and more are on the ?VEPParam help page. Let me know how it goes.
Valerie
On 01/07/2014 02:01 AM, Thomas Sandmann wrote:
> Hi Valerie,
>
> Thanks a lot for supporting legacy versions of the Ensembl database /
> variant_effect_predictor.pl script.
>
> > I assume you're still using version 67 and have the data cached.
>
>
> Yes, that's right. We use Ensembl release 67 together with the
> corresponding variant_effect_predictor.pl script, version 2.5.
>
> > How are you calling the script right now?
>
>
> As a temporary fix, I am using the ensemblVEP method from ensemblVEP
> version 1.1.3 (BioC svn revision r76970). I think this is the last
> version that worked with ensembl release 67 for me.
>
> I modified the default parameters in the VEPParam object by creating a
> temporary "gVEPParam" class for use with our in-house ensembl release
> 67. This object is passed to the ensemblVEP method with the default
> parameters listed below. (Please note that our installation of
> variant_effect_predictor.pl by default connects to our in-house
> database.)
>
> Formal class 'gVEPParam' with 6 slots
> ..@ basic :List of 5
> .. ..$ verbose : logi FALSE
> .. ..$ quiet : logi FALSE
> .. ..$ no_progress: logi TRUE
> .. ..$ config : chr(0)
> .. ..$ everything : logi FALSE
> ..@ input :List of 4
> .. ..$ species : chr "homo_sapiens"
> .. ..$ format : chr(0)
> .. ..$ output_file : chr(0)
> .. ..$ force_overwrite: logi FALSE
> ..@ output :List of 24
> .. ..$ terms : chr "so"
> .. ..$ sift : chr "b"
> .. ..$ polyphen : chr "b"
> .. ..$ regulatory : logi FALSE
> .. ..$ cell_type : chr(0)
> .. ..$ hgvs : logi TRUE
> .. ..$ hgnc : logi TRUE
> .. ..$ gene : logi TRUE
> .. ..$ protein : logi TRUE
> .. ..$ ccds : logi TRUE
> .. ..$ canonical : logi TRUE
> .. ..$ xref_refseq: logi FALSE
> .. ..$ numbers : logi TRUE
> .. ..$ domains : logi TRUE
> .. ..$ most_severe: logi FALSE
> .. ..$ summary : logi FALSE
> .. ..$ per_gene : logi FALSE
> .. ..$ convert : chr(0)
> .. ..$ fields : chr(0)
> .. ..$ vcf : logi FALSE
> .. ..$ gvf : logi FALSE
> .. ..$ original : logi FALSE
> .. ..$ custom : chr(0)
> .. ..$ plugin : chr "GNECondel,/Plugins/config/Condel/config"
> ..@ filterqc:List of 17
> .. ..$ check_ref : logi FALSE
> .. ..$ coding_only : logi FALSE
> .. ..$ check_existing : logi TRUE
> .. ..$ check_alleles : logi FALSE
> .. ..$ check_svs : logi FALSE
> .. ..$ individual : chr(0)
> .. ..$ chr : chr(0)
> .. ..$ no_intergenic : logi FALSE
> .. ..$ filter_common : logi FALSE
> .. ..$ check_frequency : logi FALSE
> .. ..$ freq_pop : chr(0)
> .. ..$ freq_freq : logi FALSE
> .. ..$ freq_gt_lt : chr(0)
> .. ..$ freq_filter : chr(0)
> .. ..$ filter : chr(0)
> .. ..$ failed : logi FALSE
> .. ..$ allow_non_variant: logi FALSE
> ..@ database:List of 9
> .. ..$ database : logi FALSE
> .. ..$ host : chr "useastdb.ensembl.org"
> .. ..$ user : chr(0)
> .. ..$ password : chr(0)
> .. ..$ port : num(0)
> .. ..$ genomes : logi FALSE
> .. ..$ refseq : logi FALSE
> .. ..$ db_version: num(0)
> .. ..$ registry : chr(0)
> ..@ advanced:List of 4
> .. ..$ no_whole_genome: logi FALSE
> .. ..$ buffer_size : num 5000
> .. ..$ compress : chr(0)
> .. ..$ skip_db_check : logi FALSE
>
> > Do you use the --cache flag or --offline flag?
>
>
> I am not using the --cache flag right now, because version 2.5 of the
> variant_effect_predictor.pl script does not allow me to specify the
> plugin directory and the cache directory separately. (This was only
> introduced in a later version of the Perl script.)
>
> The --offline flag does not seem to be available in
> variant_effect_predictor.pl version 2.5; at least I cannot find it in
> the listed arguments (provided below for reference).
>
> version 2.5
>
> Options
> =======
>
> --help Display this message and quit
> --verbose Display verbose output as the script runs
> [default: off]
> --quiet Suppress status and warning messages [default: off]
> --no_progress Suppress progress bars [default: off]
>
> --config Load configuration from file. Any command line
> options
> specified overwrite those in the file [default: off]
> --everything Shortcut switch to turn on commonly used options.
> See web
> documentation for details [default: off]
>
> -i | --input_file Input file - if not specified, reads from STDIN.
> Files
> may be gzip compressed.
> --format Specify input file format - one of "ensembl",
> "pileup",
> "vcf", "hgvs", "id" or "guess" to try and work
> out format.
> -o | --output_file Output file. Write to STDOUT by specifying -o
> STDOUT - this
> will force --quiet [default:
> "variant_effect_output.txt"]
> --force_overwrite Force overwriting of output file [default: quit
> if file
> exists]
> --original Writes output as it was in input - must be used
> with --filter
> since no consequence data is added [default: off]
> --vcf Write output as VCF [default: off]
> --gvf Write output as GVF [default: off]
> --fields [field list] Define a custom output format by specifying a
> comma-separated
> list of field names. Field names normally
> present in the
> "Extra" field may also be specified, including
> those added by
> plugin modules. Can also be used to configure
> VCF output
> columns [default: off]
> --species [species] Species to use [default: "human"]
>
> -t | --terms Type of consequence terms to output - one of
> "ensembl", "SO",
> "NCBI" [default: ensembl]
> --sift=[p|s|b] Add SIFT [p]rediction, [s]core or [b]oth
> [default: off]
> --polyphen=[p|s|b] Add PolyPhen [p]rediction, [s]core or [b]oth
> [default: off]
> --regulatory Look for overlaps with regulatory regions. The
> script can
> also call if a variant falls in a high
> information position
> within a transcription factor binding site.
> Output lines have
> a Feature type of RegulatoryFeature or MotifFeature
> [default: off]
> --cell_type [types] Report only regulatory regions that are found in
> the given cell
> type(s). Can be a single cell type or a
> comma-separated list.
> The functional type in each cell type is
> reported under
> CELL_TYPE in the output. To retrieve a list of
> cell types, use
> "--cell_type list" [default: off]
> --custom [file list] Add custom annotations from tabix-indexed files. See
> documentation for full details [default: off]
> --plugin [plugin_name] Use named plugin module [default: off]
> --hgnc Add HGNC gene identifiers to output [default: off]
> --hgvs Output HGVS identifiers (coding and protein).
> Requires database
> connection [default: off]
> --ccds Output CCDS transcript identifiers [default: off]
> --xref_refseq Output aligned RefSeq mRNA identifier for
> transcript. NB: the
> RefSeq and Ensembl transcripts aligned in this
> way MAY NOT, AND
> FREQUENTLY WILL NOT, match exactly in sequence,
> exon structure
> and protein product [default: off]
> --protein Output Ensembl protein identifer [default: off]
> --gene Force output of Ensembl gene identifer - disabled
> by default
> unless using --cache or --no_whole_genome
> [default: off]
> --canonical Indicate if the transcript for this consequence
> is the canonical
> transcript for this gene [default: off]
> --domains Include details of any overlapping protein
> domains [default: off]
> --numbers Include exon & intron numbers [default: off]
>
> --no_intergenic Excludes intergenic consequences from the output
> [default: off]
> --coding_only Only return consequences that fall in the coding
> region of
> transcripts [default: off]
> --most_severe Ouptut only the most severe consequence per
> variation.
> Transcript-specific columns will be left blank.
> [default: off]
> --summary Output only a comma-separated list of all
> consequences per
> variation. Transcript-specific columns will be
> left blank.
> [default: off]
> --per_gene Output only the most severe consequence per gene.
> Where more
> than one transcript has the same consequence,
> the transcript
> chosen is arbitrary. [default: off]
> --check_ref If specified, checks supplied reference allele
> against stored
> entry in Ensembl Core database [default: off]
> --check_existing If specified, checks for existing co-located
> variations in the
> Ensembl Variation database [default: off]
> --failed [0|1] Include (1) or exclude (0) variants that have
> been flagged as
> failed by Ensembl when checking for existing
> variants.
> [default: exclude]
> --check_alleles If specified, the alleles of existing co-located
> variations
> are compared to the input; an existing variation
> will only
> be reported if no novel allele is in the input
> (strand is
> accounted for) [default: off]
> --check_svs Report overlapping structural variants [default: off]
>
> --filter [filters] Filter output by consequence type. Use this to
> output only
> variants that have at least one consequence type
> matching the
> filter. Multiple filters can be used separated
> by ",". By
> combining this with --original it is possible to
> run the VEP
> iteratively to progressively filter a set of
> variants. See
> documentation for full details [default: off]
>
> --check_frequency Turns on frequency filtering. Use this to include
> or exclude
> variants based on the frequency of co-located
> existing
> variants in the Ensembl Variation database. You
> must also
> specify all of the following --freq flags
> [default: off]
> --freq_pop [pop] Name of the population to use e.g. hapmap_ceu for
> CEU HapMap,
> 1kg_yri for YRI 1000 genomes. See documentation
> for more
> details
> --freq_freq [freq] Frequency to use in filter. Must be a number
> between 0 and 0.5
> --freq_gt_lt [gt|lt] Specify whether the frequency should be greater
> than (gt) or
> less than (lt) --freq_freq
> --freq_filter Specify whether variants that pass the above
> should be included
> [exclude|include] or excluded from analysis
> --individual [id] Consider only alternate alleles present in the
> genotypes of the
> specified individual(s). May be a single
> individual, a comma-
> separated list or "all" to assess all
> individuals separately.
> Each individual and variant combination is given
> on a separate
> line of output. Only works with VCF files
> containing individual
> genotype data; individual IDs are taken from
> column headers.
> --allow_non_variant Prints out non-variant lines when using VCF input
> --chr [list] Select a subset of chromosomes to analyse from
> your file. Any
> data not on this chromosome in the input will be
> skipped. The
> list can be comma separated, with "-" characters
> representing
> a range e.g. 1-5,8,15,X [default: off]
> --gp If specified, tries to read GRCh37 position from
> GP field in the
> INFO column of a VCF file. Only applies when VCF
> is the input
> format and human is the species [default: off]
> --convert Convert the input file to the output format
> specified.
> [ensembl|vcf|pileup] Converted output is written to the file specified in
> --output_file. No consequence calculation is
> carried out when
> doing file conversion. [default: off]
>
> --refseq Use the otherfeatures database to retrieve
> transcripts - this
> database contains RefSeq transcripts (as well as
> CCDS and
> Ensembl EST alignments) [default: off]
> --host Manually define database host [default:
> "ensembldb.ensembl.org"]
> -u | --user Database username [default: "anonymous"]
> --port Database port [default: 5306]
> --password Database password [default: no password]
> --genomes Sets DB connection params for Ensembl Genomes
> [default: off]
> --registry Registry file to use defines DB connections
> [default: off]
> Defining a registry file overrides above
> connection settings.
> --db_version=[number] Force script to load DBs from a specific Ensembl
> version. Not
> advised due to likely incompatibilities between
> API and DB
>
> --no_whole_genome Run in old-style, non-whole genome mode [default:
> off]
> --buffer_size Sets the number of variants sent in each batch
> [default: 5000]
> Increasing buffer size can retrieve results more
> quickly
> but requires more memory. Only applies to whole
> genome mode.
> --cache Enables read-only use of cache [default: off]
> --dir [directory] Specify the base cache directory to use [default:
> "$HOME/.vep/"]
> --write_cache Enable writing to cache [default: off]
> --build [all|list] Build a complete cache for the selected species.
> Build for all
> chromosomes with --build all, or a list of
> chromosomes (see
> --chr). DO NOT USE WHEN CONNECTED TO PUBLIC DB
> SERVERS AS THIS
> VIOLATES OUR FAIR USAGE POLICY [default: off]
> --compress Specify utility to decompress cache files - may
> be "gzcat" or
> "gzip -dc" Only use if default does not work
> [default: zcat]
> --skip_db_check ADVANCED! Force the script to use a cache built
> from a different
> database than specified with --host. Only use
> this if you are
> sure the hosts are compatible (e.g.
> ensembldb.ensembl.org and useastdb.ensembl.org) [default: off]
> --cache_region_size ADVANCED! The size in base-pairs of the region
> covered by one
> file in the cache. [default: 1MB]
>
> > Also, please remind me of (point me to) the plug-in you're using so
> > I can test that.
>
>
> We are using a single plugin that returns the Condel scores. The Condel
> plugin can be found on GitHub here:
> https://github.com/ensembl-variation/VEP_plugins
>
> Again, thanks a lot for your support. Please let me know if there is
> anything I can do to help, e.g. with testing the package.
>
> Best,
> Thomas
>
--
Valerie Obenchain
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: vobencha at fhcrc.org
Phone: (206) 667-3158
Fax: (206) 667-1319