Variant Effect Predictor script

What's new

New in version 2.5 (May 2012)

  • SIFT and PolyPhen predictions now available for RefSeq transcripts
  • retrieve cell type-specific regulatory consequences
  • consequences can be retrieved based on a single individual's genotype in a VCF input file
  • find overlapping structural variants
  • Condel support removed from main script and moved to a plugin

New in version 2.4 (February 2012)

  • offline mode and new installer script make it easy to use the VEP without the usual dependencies
  • output columns configurable using the --fields flag
  • VCF output support expanded, can now carry all fields
  • output affected exon and intron numbers with --numbers
  • output overlapping protein domains using --domains
  • enhanced support for LRGs
  • plugins now work on variants called as intergenic

New in version 2.3 (December 2011)

  • add custom annotations from tabix-indexed files (BED, GFF, GTF, VCF, bigWig)
  • add new functionality to the VEP with user-written plugins
  • filter input on consequence type

New in version 2.2 (September 2011)

  • SIFT, PolyPhen and Condel predictions and regulatory features now accessible from the cache
  • support for calling consequences against RefSeq transcripts
  • variant identifiers (e.g. dbSNP rsIDs) and HGVS notations supported as input format
  • variants can now be filtered by frequency in HapMap and 1000 genomes populations
  • script can be used to convert files between formats (Ensembl/VCF/Pileup/HGVS to Ensembl/VCF/Pileup)
  • large amount of code moved to API modules to ensure consistency between web and script VEP
  • memory usage optimisations
  • VEP script moved to ensembl-tools CVS module
  • Added --canonical, --per_gene and --no_intergenic options

New in version 2.1 (June 2011)

  • ability to use local file cache in place of or alongside connecting to an Ensembl database
  • significant improvements to speed of script
  • whole-genome mode now default (no disadvantage for smaller datasets)
  • improved status output with progress bars
  • regulatory region consequences now reinstated and improved
  • modification to output file - Transcript column is now Feature, and is followed by a Feature_type column

New in version 2.0 (April 2011)

  • support for SIFT, PolyPhen and Condel non-synonymous predictions in human
  • per-allele and compound consequence types
  • support for Sequence Ontology (SO) and NCBI consequence terms
  • modified output format
    • support for new output fields in Extra column
    • header section contains information on database and software versions
    • codon change shown in output
    • CDS position shown in output
    • option to output Ensembl protein identifiers
    • option to output HGVS nomenclature for variants
  • support for gzipped input files
  • enhanced configuration options, including the ability to read configuration from a file
  • verbose output now much more useful
  • whole-genome mode now more stable
  • finding existing co-located variations now ~5x faster

Requirements

Version 2.5 of the script requires at least version 67 of the Ensembl Core and Variation APIs, together with their relevant dependencies. A minimal subset of these can be installed using the supplied installer script.

To perform a full install of the API, see the instructions for details. To analyse regulatory features, the Ensembl Regulation API should also be installed.

To use the cache, the gzip and zcat utilities are required.

Download

The Variant Effect Predictor script can be downloaded as a tarball from the Ensembl CVS server:

Download version 2.5 (latest)

It is also included as part of the ensembl-tools module of the Ensembl API - you can find it in the ensembl-tools/scripts/variant_effect_predictor/ directory.

Previous versions

Help

For any questions not covered here or on the FAQ page, please send an email to the Ensembl developers' mailing list, dev@ensembl.org


The INSTALL.pl installer script

The VEP installer script makes it easy to set up your environment for using the VEP. It will download and configure a minimal set of the Ensembl API for use by the VEP, and can also download and configure cache files for use by the VEP.

Users who already have the latest version of the API installed do not need to run the script, although they may find it useful for getting an up-to-date API install (with post-release patches applied), and for retrieving cache files. The API set installed by the script is local to the VEP, and will not affect any other Ensembl API installations.

The installer script is also useful for users whose systems do not have all the modules required by the VEP, specifically DBI and DBD::mysql. After configuration using the installer, users can then use the VEP in offline mode with a cache, eliminating dependency on an Ensembl database (with limitations).

Running the installer

The installer script is run on the command line as follows:

 perl INSTALL.pl [options] 

Users then follow on-screen prompts. Please heed any warnings, as when the script says it will delete/overwrite something, it really will!

Most users should not need to add any options, but configuration of the installer is possible with the following flags:

Flag Alternate Description
--DESTDIR [dir]
-d
By default the script will install the API modules in a subdirectory of the current directory named "Bio". Using this option users may configure where the Bio directory is created. If something other than the default is used, this directory must either be added to your PERL5LIB environment variable when running the VEP, or included using perl's -I flag:
perl -I [dir] variant_effect_predictor.pl
--API_VERSION [version]
-v
By default the script will install the latest version of the Ensembl API (currently 67). Users can force the script to install a different version at their own risk
--CACHE_DIR [dir]
-c
By default the script will install the cache files in the ".vep" subdirectory of the user's home area. Using this option users can configure where cache files are installed. The --dir flag must be passed when running the VEP if a non-default directory is given:
perl variant_effect_predictor.pl --dir [dir]



Running the script

The VEP script is run on the command line as follows:

 perl variant_effect_predictor.pl [options] 

where [options] represent a set of flags and options to the script. These can be listed using the flag --help:

 perl variant_effect_predictor.pl --help 

By default the script connects to the public Ensembl database server at ensembldb.ensembl.org; other connection options are available.

Most users will need only a few of the options described below; the following command is usually enough to get started:

 perl variant_effect_predictor.pl -i input.txt -o output.txt 

where input.txt contains data in one of the compatible input formats, and output.txt is the output file created by the script. See Data Formats for more detail on input and output formats.

Options can be passed as the full string (e.g. --format), or as the shortest unique string among the options (e.g. --form for --format, since there is another option --force_overwrite).

Options can also be read from a configuration file - either passively stored as $HOME/.vep/vep.ini, or actively using --config.

NOTE Whole-genome mode is now the default run mode for the VEP script. In the rare case that you would prefer to run the script in the old per-variant mode, you can force this with --no_whole_genome

Performance

In optimal conditions, the VEP script is capable of processing ~500,000 variants in 1 hour on a single processor core. Run time is dependent on various factors, and is especially affected by the chromosomal distribution of variants. Variants in exonic regions naturally take longer to process than those in intronic or intergenic regions. Due to the way transcript data is cached in memory, the VEP will, for example, process a file containing 100 variants that fall in one gene faster than it would a file containing 100 variants in 100 different genes.

The VEP is also optimised to run on input files that are sorted in chromosomal order. Unsorted files will still work, albeit more slowly.

For very large files (for example those from whole-genome sequencing), the VEP process can be easily parallelised by dividing your file into chunks (e.g. by chromosome). The VEP will also work with tabix-indexed, bgzipped VCF files, and so the tabix utility could be used to divide the input file:

 tabix -h variants.vcf.gz 12:1000000-20000000 | perl variant_effect_predictor.pl -vcf 
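
As a rough sketch of the per-chromosome approach, the shell function below prints one command per chromosome rather than running anything. The function name vep_chunk_commands, the input file variants.vcf.gz and the output names vep_chr*.vcf are hypothetical; the only things assumed to exist are tabix -h and the VEP flags described in this document.

```shell
# Print one VEP command per chromosome (dry run); nothing is executed here.
# Usage: vep_chunk_commands <bgzipped, tabix-indexed VCF> <chromosome>...
vep_chunk_commands() {
    vcf=$1
    shift
    for chr in "$@"; do
        # tabix -h keeps the VCF header, so each chunk is a valid VCF stream;
        # the VEP reads the chunk from STDIN because no -i is given.
        printf 'tabix -h %s %s | perl variant_effect_predictor.pl -format vcf -vcf -o %s\n' \
            "$vcf" "$chr" "vep_chr${chr}.vcf"
    done
}

# Example: print the commands for three chromosomes of a hypothetical file.
vep_chunk_commands variants.vcf.gz 1 2 X
```

The printed lines can be reviewed and then piped to sh, or handed to whatever job scheduler is available, one line per job.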


Basic options

Flag Alternate Description
--help
  Display help message and quit
--verbose
-v
Output longer status messages as the script runs. This option can be used to generate the basis of a configuration file - see --config below. Not used by default
--quiet
-q
Suppress status and warning messages. Not used by default
--no_progress
  Don't show progress bars. Progress bars shown by default
--config [filename]
  Load configuration options from a config file. The config file should consist of whitespace-separated pairs of option names and settings e.g.:
output_file   my_output.txt
species       mus_musculus
format        vcf
host          useastdb.ensembl.org
A config file can also be read implicitly: save the file as $HOME/.vep/vep.ini (or the equivalent directory if using --dir). Any options in this file will be overridden by those specified in a config file using --config, and in turn by any options manually specified on the command line. To create a config file quickly, set the flags as normal and run the script in verbose (-v) mode; this outputs lines that can be copied to a config file and loaded on the next run using --config. Not used by default
--everything
  Shortcut flag to switch on all of the following:
  • --sift b
  • --polyphen b
  • --ccds
  • --hgvs
  • --hgnc
  • --numbers
  • --domains
  • --regulatory
  • --cell_type
  • --canonical
  • --protein
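
The precedence described under --config (vep.ini first, then a file given with --config, then the command line) can be illustrated with a small sketch. The file names, the scratch directory and the helper effective_option are hypothetical, and the awk lookup only emulates the "last value wins" rule; it is not how the VEP itself reads these files.

```shell
# Scratch directory so that nothing under $HOME/.vep is touched.
workdir=$(mktemp -d)

# Stand-in for $HOME/.vep/vep.ini: defaults that apply to every run.
cat > "$workdir/vep.ini" <<'EOF'
species       homo_sapiens
terms         so
host          useastdb.ensembl.org
EOF

# Per-project file that would be passed with --config; it overrides vep.ini.
cat > "$workdir/project.ini" <<'EOF'
species       mus_musculus
format        vcf
EOF

# Emulate the precedence for one option: files are read in order and the
# last value seen wins. This is an illustration, not the VEP's own parser.
effective_option() {
    awk -v key="$1" '$1 == key { val = $2 } END { if (val != "") print val }' \
        "$workdir/vep.ini" "$workdir/project.ini"
}

effective_option species   # taken from project.ini
effective_option host      # only set in vep.ini, so that value is kept
```

A flag given on the command line, for example --species rattus_norvegicus, would in turn override both files.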


Input options

Flag Alternate Description
--species [species]
  Species for your data. This can be the latin name e.g. "homo_sapiens" or any Ensembl alias e.g. "mouse". Specifying the latin name can speed up initial database connection as the registry does not have to load all available database aliases on the server. Default = "homo_sapiens"
--input_file [filename]
-i
Input file name. If not specified, the script will attempt to read from STDIN.
--format [ensembl|vcf|pileup|hgvs|id|vep]
  Input file format. By default, the script auto-detects the input file format. Using this option you can force the script to read the input file as Ensembl, VCF, pileup or HGVS format, a list of variant identifiers (e.g. rsIDs from dbSNP), or the output from the VEP (e.g. to add custom annotation to an existing results file using --custom). Auto-detects format by default
--output_file [filename]
-o
Output file name. The script can write to STDOUT by specifying STDOUT as the output file name - this will force quiet mode. Default = "variant_effect_output.txt"
--force_overwrite
--force
By default, the script will fail with an error if the output file already exists. You can force the overwrite of the existing file by using this flag. Not used by default


Database options

Flag Alternate Description
--host [hostname]
  Manually define the database host to connect to. Users in the US may find connection and transfer speeds quicker using our East coast mirror, useastdb.ensembl.org. Default = "ensembldb.ensembl.org"
--user [username]
-u
Manually define the database username. Default = "anonymous"
--password [password]
--pass
Manually define the database password. Not used by default
--port [number]
  Manually define the database port. Default = 5306
--genomes
  Override the default connection settings with those for the Ensembl Genomes public MySQL server. Required when using any of the Ensembl Genomes species. Not used by default
--refseq
  Instead of using the core database, use the otherfeatures database to retrieve transcripts. This database contains transcript objects corresponding to RefSeq transcripts, along with CCDS and Ensembl ESTs. Consequence output will be given relative to these transcripts in place of the default Ensembl transcripts. See here for more details on the contents of the otherfeatures database. The otherfeatures database currently only exists for human and mouse. Not used by default
--db_version [number]
--db
Force the script to connect to a specific version of the Ensembl databases. Not recommended as there will usually be conflicts between software and database versions. Not used by default
--registry [filename]
  Defining a registry file overwrites other connection settings and uses those found in the specified registry file to connect. Not used by default


Output options

Flag Alternate Description
--terms [ensembl|so|ncbi]
-t
The type of consequence terms to output. The Ensembl terms are described here. The Sequence Ontology is a joint effort by genome annotation centres to standardise descriptions of biological sequences. The NCBI terms are those used by dbSNP, and are the least complete set - where no NCBI term is available, the script will output the Ensembl term. Default = "ensembl"
--sift [p|s|b]
  Human only SIFT predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids. The VEP can output the prediction term, score or both. Not used by default
--polyphen [p|s|b]
--poly
Human only PolyPhen is a tool which predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. The VEP can output the prediction term, score or both. Not used by default
--regulatory
  Look for overlaps with regulatory regions. The script can also call if a variant falls in a high information position within a transcription factor binding site. Output lines have a Feature type of RegulatoryFeature or MotifFeature. Not used by default
--cell_type
  Report only regulatory regions that are found in the given cell type(s). Can be a single cell type or a comma-separated list. The functional type in each cell type is reported under CELL_TYPE in the output. To retrieve a list of cell types, use "--cell_type list". Not used by default
--hgvs
  Add HGVS nomenclature based on Ensembl stable identifiers to the output. Both coding and protein sequence names are added where appropriate. Currently it is not possible to generate HGVS identifiers from the cache; a database connection must be made. Not used by default
--gene
  Force the gene column to be populated. This is disabled by default unless using --cache. Gene column not populated by default
--protein
  Add the Ensembl protein identifier to the output where appropriate. Not used by default
--hgnc
  Adds the HGNC gene identifier (where available) to the output. Not used by default
--ccds
  Adds the CCDS transcript identifier (where available) to the output. Not used by default
--canonical
  Adds a flag indicating if the transcript is the canonical transcript for the gene. Not used by default
--xref_refseq
  Output aligned RefSeq mRNA identifier for transcript. NB: the RefSeq and Ensembl transcripts aligned in this way MAY NOT, AND FREQUENTLY WILL NOT, match exactly in sequence, exon structure and protein product. Not used by default
--numbers
  Adds affected exon and intron numbering to the output. Format is Number/Total. Not used by default
--domains
  Adds names of overlapping protein domains to output. Not used by default
--most_severe
  Output only the most severe consequence per variation. Transcript-specific columns will be left blank. Not used by default
--summary
  Output only a comma-separated list of all observed consequences per variation. Transcript-specific columns will be left blank. Not used by default
--per_gene
  Output only the most severe consequence per gene. The transcript selected is arbitrary if more than one has the same predicted consequence. Not used by default
--convert [ensembl|vcf|pileup]
  Converts the input file to the specified format. See below for more details. Converted output is written to the file specified with --output_file. No consequence prediction is carried out. Not used by default
--fields [list]
  Configure the output format using a comma separated list of fields. Fields may be those present in the default output columns, or any of those that appear in the Extra column (including those added by plugins or custom annotations). Output remains tab-delimited. Not used by default
--vcf
  Writes output in VCF format. Consequences are added in the INFO field of the VCF file, using the key "CSQ". Data fields are encoded separated by "|"; the order of fields is written in the VCF header. Output fields can be selected by using --fields.

If the input format was VCF, the file will remain unchanged save for the addition of the CSQ field (unless using any filtering).

Custom data added with --custom are added as separate fields, using the key specified for each data file.

Commas in fields are replaced with ampersands (&) to preserve VCF format.

Not used by default
--gvf
  Writes output in GVF format. Not used by default
--original
  Writes output as a filtered set of the input. Must be used with --filter. Input lines are unchanged - consequences are calculated but not written to the output. Not used by default
--custom [filename[,short_name,format,type]]
  Add custom annotation to the output. Files must be tabix indexed or in the bigWig format. Multiple files can be specified by supplying the --custom flag multiple times. See here for full details. Not used by default
--plugin [plugin name]
  Use named plugin. Plugin modules should be installed in the Plugins subdirectory of the VEP cache directory (defaults to $HOME/.vep/). Multiple plugins can be used by supplying the --plugin flag multiple times. For details on how to write a plugin, see here. Not used by default
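
To make the --vcf encoding more concrete, here is a minimal sketch of pulling the CSQ key out of an INFO string and splitting it into fields. It assumes that several consequences for one variant are comma-separated inside CSQ. The INFO value, the identifiers and the field order used here are invented for illustration; the real field order has to be read from the header of your own output file.

```shell
# Split the CSQ key of a VCF INFO string into one line per consequence.
# Assumption: multiple consequences are comma-separated inside CSQ.
csq_entries() {
    printf '%s\n' "$1" | tr ';' '\n' | sed -n 's/^CSQ=//p' | tr ',' '\n'
}

# Hypothetical INFO value with a made-up field order:
# allele | gene | feature | consequence
info='DP=14;CSQ=A|ENSG00000000001|ENST00000000001|missense_variant,A|ENSG00000000001|ENST00000000002|intron_variant&nc_transcript_variant'

# Print the feature and consequence fields of each entry, turning "&"
# back into "," inside the consequence field.
csq_entries "$info" | awk -F'|' '{ gsub(/&/, ",", $4); print $3 "\t" $4 }'
```

Splitting on "," before restoring "&" matters: the ampersand substitution exists precisely so that commas can be used safely as the separator.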


Filtering and QC options

Flag Alternate Description
--check_ref
  Force the script to check the supplied reference allele against the sequence stored in the Ensembl Core database. Lines that do not match are skipped. Not used by default
--coding_only
  Only return consequences that fall in the coding regions of transcripts. Not used by default
--check_existing
  Checks for the existence of variants that are co-located with your input. By default the alleles are not compared - to do so, use --check_alleles. Not used by default
--check_alleles
  When checking for existing variants, only report a co-located variant if none of the alleles supplied are novel. For example, if the user input has alleles A/G, and an existing co-located variant has alleles A/C, the co-located variant will not be reported.

Strand is also taken into account - in the same example, if the user input has alleles T/G but on the negative strand, then the co-located variant will be reported since its alleles match the reverse complement of user input. Not used by default
--check_svs
  Checks for the existence of structural variants that overlap your input. Currently requires database access. Not used by default
--individual [all|ind list]
  Consider only alternate alleles present in the genotypes of the specified individual(s). May be a single individual, a comma-separated list or "all" to assess all individuals separately. Individual variant combinations homozygous for the given reference allele will not be reported. Each individual and variant combination is given on a separate line of output. Only works with VCF files containing individual genotype data; individual IDs are taken from column headers. Not used by default
--chr [list]
  Select a subset of chromosomes to analyse from your file. Any data not on these chromosomes in the input will be skipped. The list can be comma separated, with "-" characters representing an interval. For example, to include chromosomes 1, 2, 3, 10 and X you could use
--chr 1-3,10,X
Not used by default
--no_intergenic
  Do not include intergenic consequences in the output. Not used by default
--check_frequency
  Turns on frequency filtering. Use this to include or exclude variants based on the frequency of co-located existing variants in the Ensembl Variation database. You must also specify all of the --freq flags below. Using this option requires a database connection - while it can be used with --cache, the database will still be accessed to retrieve frequency data. Frequencies used in filtering are added to the output under the FREQS key in the Extra field. Not used by default
--freq_pop [pop]
  Name of the population to use in frequency filter. This can be the name of the population as it appears on the Ensembl website (suitable for most species), or in the following short form for human. 1000 genomes populations are currently pilot 1 (low coverage).

Example value for --freq_pop    Description
1kg_chb                         1000 genomes CHB population
hapmap_yri                      HapMap YRI population
1kg                             Any 1000 genomes pilot 1 population
ceu                             Any of HapMap or 1000 genomes CEU populations
any                             Any HapMap or 1000 genomes population

--freq_freq [freq]
  Minor allele frequency to use for filtering. Must be a float value between 0 and 0.5
--freq_gt_lt [gt|lt]
  Specify whether the frequency of the co-located variant must be greater than (gt) or less than (lt) the value specified with --freq_freq
--freq_filter [exclude|include]
  Specify whether to exclude or include only variants that pass the frequency filter
--filter [filters]
  Filter the output on consequence type. Multiple allowed types can be specified, separated by commas. SO terms should ideally be used, although Ensembl and NCBI types are also allowed. Consequence types can be excluded by adding "no_" to the start of the filter name. Shortcuts to common groupings are available:

Shortcut name    Description
upstream         Any upstream variant
downstream       Any downstream variant
utr              Any UTR variant
splice           Any splicing region variants
coding           Any variant that falls in the coding region of a transcript
coding_change    Any variant that causes a coding change in the transcript
regulatory       Any variant that falls in a regulatory or binding motif feature

To reproduce a filtered version of the input file, add the flag --original
Not used by default
--failed
  When checking for co-located variants, by default the script will exclude variants that have been flagged as failed. Set this flag to include such variants. Exclude by default
--allow_non_variant
  When using VCF format as input and output, by default the VEP will skip non-variant lines of input (where the ALT allele is null). With this option enabled, these lines are printed in the VCF output with no consequence data added.
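
Because --check_frequency needs all four --freq flags, a small wrapper that checks the values before building the command line can save a failed run. The sketch below is one possible way to do that; the function name vep_freq_command and the input file name are hypothetical, and the command is only printed, not executed.

```shell
# Build a frequency-filter command line after checking the --freq values.
# Prints the command on success; prints an error and returns 1 otherwise.
vep_freq_command() {
    input=$1
    pop=$2
    freq=$3
    gt_lt=$4
    action=$5

    case $gt_lt in
        gt|lt) ;;
        *) echo "freq_gt_lt must be gt or lt" >&2; return 1 ;;
    esac
    case $action in
        exclude|include) ;;
        *) echo "freq_filter must be exclude or include" >&2; return 1 ;;
    esac
    # Reject anything that is not made of digits and dots, then check the
    # documented range of 0 to 0.5.
    case $freq in
        ''|*[!0-9.]*) echo "freq_freq must be a number" >&2; return 1 ;;
    esac
    awk -v f="$freq" 'BEGIN { exit !(f + 0 >= 0 && f + 0 <= 0.5) }' || {
        echo "freq_freq must be between 0 and 0.5" >&2
        return 1
    }

    printf 'perl variant_effect_predictor.pl -i %s --check_frequency --freq_pop %s --freq_freq %s --freq_gt_lt %s --freq_filter %s\n' \
        "$input" "$pop" "$freq" "$gt_lt" "$action"
}

# Example: exclude variants whose co-located variant has a frequency
# greater than 0.05 in the 1000 genomes CHB population.
vep_freq_command variants.vcf 1kg_chb 0.05 gt exclude
```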


Caching and advanced options

Flag Alternate Description
--no_whole_genome
  Force the script to run in non-whole-genome mode. This was the original default mode for the VEP script, but has now been superseded by whole-genome mode, which is the default. In this mode, variants are analysed one at a time, with no caching of transcript data. Not used by default
--cache
  Enables use of the cache. By default the VEP will only read from the cache - use --write_cache to enable writing. Not used by default
--dir [directory]
  Specify the base cache directory to use. This should be on a filesystem with around 600MB free (for human, other species may vary). Default = "$HOME/.vep/"
--offline
  Enable offline mode. No database connections will be made, and only a complete cache (either downloaded or built using --build) can be used for this mode. Not used by default
--buffer [number]
  Sets the internal buffer size, corresponding to the number of variations that are read in to memory simultaneously. Set this lower to use less memory at the expense of longer run time, and higher to use more memory with a faster run time. Default = 5000
--write_cache
  Enable writing to the cache. Not used by default
--build [all|list]
  Build a complete cache for the selected species from the database. Either specify a list of chromosomes (see --chr for how to do this), or use
--build all
to build for all top-level chromosomes. WARNING: Do not use this flag when connected to one of the public databases - please instead download a pre-built cache or build against a local database. Not used by default
--compress [command]
  By default the VEP uses the utility zcat to decompress cached files. On some systems zcat may not be installed or may misbehave; by specifying one of
--compress gzcat
or
--compress "gzip -dc"
you may be able to bypass these problems. Not used by default
--skip_db_check
  ADVANCED Force the script to use a cache built from a different host than specified with --host. Only use this if you are sure the two hosts are compatible (e.g. ensembldb.ensembl.org can be considered compatible with useastdb.ensembl.org as the data is mirrored between the two). Not used by default
--cache_region_size [size]
  ADVANCED The size in base-pairs of the region covered by one file in the cache. By default this is 1MB, which produces a maximum of approximately 500 files per sub-directory in human. Reducing this can reduce memory usage and run time when you use a cache built this way. Note that you must specify the same --cache_region_size when both building/writing to the cache and reading from it. Not used by default


Examples

  • Read input from STDIN, output to STDOUT
    perl variant_effect_predictor.pl -o stdout
  • Add regulatory region consequences
    perl variant_effect_predictor.pl -i variants.txt -regulatory
  • Input file variants.vcf.txt, input file format VCF, add HGNC gene identifiers, output SO consequence terms
    perl variant_effect_predictor.pl -i variants.vcf.txt -format vcf -hgnc -t so
  • Force overwrite of output file variants_output.txt, check for existing co-located variants, output only coding sequence consequences, output HGVS names
    perl variant_effect_predictor.pl -i variants.txt -o variants_output.txt -force -check_existing -coding_only -hgvs
  • Specify DB connection parameters in registry file ensembl.registry, add SIFT score and prediction, PolyPhen prediction
    perl variant_effect_predictor.pl -i variants.txt -registry ensembl.registry -sift b -polyphen p
  • Connect to Ensembl Genomes db server for A.thaliana, run with buffer size of 10000
    perl variant_effect_predictor.pl -i variants.txt -genomes -species arabidopsis_thaliana -b 10000
  • Load config from ini file, run in quiet mode
    perl variant_effect_predictor.pl -config vep.ini -i variants.txt -q
  • Use cache in /home/vep/mycache/, use gzcat instead of zcat
    perl variant_effect_predictor.pl -cache -dir /home/vep/mycache/ -i variants.txt -compress gzcat
  • Convert RefSeq-based HGVS notations to genomic coordinates in VCF format
    perl variant_effect_predictor.pl -i hgvs.txt -o hgvs.vcf -refseq -convert vcf
  • Filter input file on consequence type to include only variants that cause a coding sequence change, write output in original input format
    perl variant_effect_predictor.pl -i variants.vcf -o variants_filtered.vcf -filter coding_change -original
  • Add custom position-based phenotype annotation from remote BED file
    perl variant_effect_predictor.pl -i variants.vcf -custom ftp://ftp.myhost.org/data/phenotypes.bed.gz,phenotype
  • Use the plugin named MyPlugin, output only the variation name, feature, consequence type and MyPluginOutput fields
    perl variant_effect_predictor.pl -i variants.vcf -plugin MyPlugin -fields Uploaded_variation,Feature,Consequence,MyPluginOutput



Databases and caching

The VEP script can use a variety of data sources to retrieve transcript information that is used to predict consequence types. Which one you choose to use should depend on your requirements and available resources.

Public database servers

By default, the script is configured to connect to Ensembl's public MySQL instance at ensembldb.ensembl.org. For users in the US (or for any user geographically closer to the East coast of the USA than to Ensembl's data centre in Cambridge, UK), a mirror server is available at useastdb.ensembl.org. To use the mirror, use the flag --host useastdb.ensembl.org

Users of Ensembl Genomes species (e.g. plants, fungi, microbes) should use their public MySQL instance; the connection parameters for this can be automatically loaded by using the flag --genomes

Users with small data sets (100s of variants) should find using the default connection settings adequate. Those with larger data sets, or those who wish to use the script in a batch manner, should consider one of the alternatives below.

Using the cache

From version 2.1 onwards, the VEP is able to use cached data on disk in place of reading from the database. Using the cache is probably the fastest and most efficient way to use the VEP script, as in most cases only a single initial network connection is made and most data is read from local disk. The diagrams below illustrate the model of caching that is used.

[Diagrams: Normal mode, Cache mode, Build mode]

It is possible to use any combination of cache and database; when using the cache, the cache will take precedence, with the database being used when the relevant data is not found in the cache.

Cache files are compressed using the gzip utility. By default zcat is used to decompress the files, although gzcat or gzip itself can be used to decompress also - you must have one of these utilities installed in your path to use the cache.
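
One way to find out which of these utilities works on a given system is to try each on a small gzip file, as sketched below. The function name pick_vep_decompressor is hypothetical, and the check assumes that gzip itself is installed in order to create the sample file.

```shell
# Print the first decompression command that can actually read a gzip file,
# in a form suitable for passing to --compress.
pick_vep_decompressor() {
    dir=$(mktemp -d) || return 1
    sample="$dir/sample.gz"
    printf 'ok\n' | gzip -c > "$sample"
    for cmd in zcat gzcat "gzip -dc"; do
        # $cmd is deliberately unquoted so that "gzip -dc" splits into a
        # command and its flag.
        if [ "$($cmd "$sample" 2>/dev/null)" = "ok" ]; then
            rm -rf "$dir"
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    rm -rf "$dir"
    return 1
}

# Print the option to add to the VEP command line, if any candidate works.
decompress=$(pick_vep_decompressor) && echo "--compress \"$decompress\""
```

The sample file is given a .gz name and passed as an argument rather than on STDIN, so that a zcat which only accepts other suffixes is detected as unusable.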

Pre-built caches

The easiest solution is to download a pre-built cache for your species; this eliminates the need to connect to the database while the script is running (except when using certain options). Cache files can either be downloaded and unpacked as described here, or automatically downloaded and configured using the installer script.

  1. Download the archive file for your species:

    Human (Homo sapiens) Download (with SIFT and PolyPhen) Download (without SIFT and PolyPhen)
    Mouse (Mus musculus) Download  
    Rat (Rattus norvegicus) Download  
    Zebrafish (Danio rerio) Download  
    Cow (Bos taurus) Download  

  2. Extract the archive in your cache directory. By default the VEP uses $HOME/.vep/ as the cache directory, where $HOME is your UNIX home directory.
    mv homo_sapiens_vep_67.tar.gz ~/.vep/
    cd ~/.vep/
    tar xfz homo_sapiens_vep_67.tar.gz
  3. Run the VEP with the --cache option

Caches for several species, and indeed different Ensembl releases of the same species, can be stored in the same cache base directory. The files are stored in the following directory hierarchy: $HOME -> .vep -> species -> version -> chromosome
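
Given that hierarchy, the species and versions already installed can be listed with a short loop. This is a sketch only: the function name list_vep_caches is hypothetical, and it skips a top-level Plugins directory on the assumption that plugins, not caches, live there.

```shell
# List species/version pairs found under a VEP cache base directory.
# Defaults to $HOME/.vep, the default cache directory.
list_vep_caches() {
    base=${1:-$HOME/.vep}
    for version_dir in "$base"/*/*/; do
        [ -d "$version_dir" ] || continue      # the glob matched nothing
        version=$(basename "$version_dir")
        species=$(basename "$(dirname "$version_dir")")
        case $species in
            Plugins) continue ;;               # not a species directory
        esac
        printf '%s\t%s\n' "$species" "$version"
    done
}

list_vep_caches
```

Passing a directory as the first argument, for example the one given to --dir, lists the caches stored there instead.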

If a pre-built cache does not exist for your species, please contact us at dev@ensembl.org and we will endeavour to add your species to the list of downloads.

Building your own cache

It is possible to build your own cache using the VEP script. You should NOT use this command when connected to the public MySQL instances - the process takes a long time, meaning the connection can break unexpectedly and you will be violating Ensembl's reasonable use policy on the public servers. You should either download one of the pre-built caches, or create a local copy of your database of interest to build the cache from.

You may wish to build a full cache if you have a custom Ensembl database with data not found on the public servers, or you may wish to create a minimal cache covering only a certain set of chromosome regions. Cache files are compressed using the gzip utility; this must be installed in your path to write cache files.

To build a cache "on-the-fly", use the --cache and --write_cache flags when you run the VEP with your input. Only cache files overlapping your input variants will be created; the next time you run the script with this cache, the data will be read from the cache instead of the database. Any data not found in the cache will be read from the database (and then written to the cache if --write_cache is enabled). If your data covers a relatively small proportion of your genome of interest (for example, a few genes of interest), it can be OK to use the public MySQL servers when building a partial cache.

perl variant_effect_predictor.pl -cache -dir /my/cache/dir/ -write_cache -i input.txt

To build a cache from scratch, use the flag

--build all
or e.g.
--build 1-5,X
to build just a subset of chromosomes. You do not need to specify any of the usual input options when building a cache:

perl variant_effect_predictor.pl -host dbhost -user username -pass password -port 3306 -build 21 -dir /my/cache/dir/

Limitations of the cache

The cache stores the following information:

  • Transcript location, sequence, exons and other attributes
  • Gene, protein and HGNC identifiers for each transcript (where applicable)
  • Location and alleles of existing variations
  • Regulatory regions
  • Predictions and scores for SIFT, PolyPhen

It does not store any information pertaining to, and therefore cannot be used for, the following:

  • Frequency filtering of input (--check_frequency)
  • HGVS names - due to a limitation in the API, it is currently not possible to create HGVS identifiers without using the database (--hgvs)
  • Using HGVS notation as input (--format hgvs)
  • Using variant identifiers as input (--format id)
  • Finding overlapping structural variants (--check_sv)

Enabling one of these options with --cache will cause the script to warn you in its status output with something like the following:

 2011-06-16 16:24:51 - INFO: Database will be accessed when using --hgvs 
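
The list above can be turned into a simple pre-flight check. The following is a minimal sketch, assuming the options are written in the double-dash form used in this section; the function name needs_database is hypothetical and is not part of the VEP.

```shell
# Sketch only: return success (0) if the given VEP arguments include an
# option that, per the list above, cannot be served from the cache alone.
# Only the double-dash spellings shown in this document are recognised.
needs_database() {
    prev=""
    for arg in "$@"; do
        case "$arg" in
            --check_frequency|--hgvs|--check_sv)
                return 0 ;;
            hgvs|id)
                # "--format hgvs" and "--format id" need the database
                if [ "$prev" = "--format" ]; then return 0; fi ;;
        esac
        prev=$arg
    done
    return 1
}

if needs_database --cache --hgvs -i input.txt; then
    echo "database access required; do not combine with --offline"
fi
```

A check like this may be useful in wrapper scripts that handle private data, where an unexpected database connection should be caught before the VEP is started.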

Data privacy

When using the public database servers, the VEP script requests transcript and variation data that overlap the loci in your input file. As such, these coordinates are transmitted over the network to a public server, which may not be suitable for those with sensitive or private data. Users should note that only the coordinates are transmitted to the server; no other information is sent.

By using a full downloaded cache (preferably in offline mode) or a local database, it is possible to avoid completely any network connections to public servers, thus preserving absolutely the privacy of your data.

Offline mode

It is possible to run the VEP in an offline mode that does not use the database and does not require a standard installation of the Ensembl API. This means users require only perl (version 5.8 or greater) and one of the zcat, gzcat or gzip utilities. To enable this mode, use the flag --offline.

The simplest way to set up your system is to use the installer script, INSTALL.pl. This will download the required dependencies to your system, and download and set up any cache files that you require.

The limitations described above apply absolutely when using offline mode. For example, if you specify --offline and --hgvs, the script will report an error and refuse to run.

All other features, including the ability to use custom annotations and plugins, are accessible in offline mode.

Using a local database

It is possible to set up a local MySQL mirror with the databases for your species of interest installed. For instructions on installing a local mirror, see here. You will need a MySQL server that you can connect to from the machine where you will run the script (this can be the same machine). For most of the functionality of the VEP, you will only need the Core database (e.g. homo_sapiens_core_67_37) installed. In order to find co-located variations or to use SIFT or PolyPhen, it is also necessary to install the relevant variation database (e.g. homo_sapiens_variation_67_37).

Note that unless you have custom data to insert in the database, in most cases it will be much more efficient to use a pre-built cache in place of a local database.

To connect to your mirror, you can either set the connection parameters using --host, --port, --user and --password, or use a registry file. Registry files contain all the connection parameters for your database, as well as any species aliases you wish to set up:

use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "core",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_core_67_37'
);

Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "variation",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_variation_67_37'
);

Bio::EnsEMBL::Registry->add_alias("Homo_sapiens","human");

For more information on the registry and registry files, see here.


Custom annotations

The VEP script can integrate custom annotation from standard format files into your results by using the --custom flag. These files may be hosted locally or remotely, with no limit to the number or size of the files. The files must be indexed using the tabix utility (BED, GFF, GTF, VCF); bigWig files contain their own indices. Users should note that the VEP will only look for overlaps (both exact and inexact) with these annotations; for example, any sequence in a GTF file will not be taken into account.

Annotations appear as key=value pairs in the Extra column of the VEP output; they will also appear in the INFO column if using VCF format output. The value for a particular annotation is defined as the identifier for each feature; if not available, an identifier derived from the coordinates of the annotation is used. Annotations will appear in each line of output for the variant where multiple lines exist.
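
To pull one custom annotation back out of a results file, the key=value pairs can be split with standard tools. The sketch below makes several assumptions that should be checked against your own output: that the file is tab-delimited, that header lines begin with "#", that the Extra column is the last column, and that pairs within it are separated by semicolons. The function name extract_extra is hypothetical.

```shell
# Sketch: read default-format VEP output on standard input and print the
# first column plus the value of one key from the Extra column.
# Assumptions: tab-delimited, "#" header lines, Extra is the last column,
# key=value pairs separated by ";".
extract_extra() {
    key=$1
    awk -F'\t' -v key="$key" '
        /^#/ { next }                        # skip header lines
        {
            n = split($NF, pairs, ";")
            for (i = 1; i <= n; i++) {
                eq = index(pairs[i], "=")    # position of the first "="
                if (eq > 0 && substr(pairs[i], 1, eq - 1) == key)
                    print $1 "\t" substr(pairs[i], eq + 1)
            }
        }'
}

# Example with two made-up result lines; only the first carries the key
printf 'var1\tmyFeatures=Feature1;HGNC=ABC\nvar2\t-\n' | extract_extra myFeatures
```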

Data formats

The VEP supports the following formats:

  • BED : a simple tab-delimited format containing 3-12 columns of data. The first 3 columns contain the coordinates of the feature. If available, the VEP will use the 4th column of the file as the identifier of the feature.
  • GFF : a format for describing genes and other features. If available, the VEP will use the "ID" field as the identifier of this feature.
  • GTF : treated in an identical manner to GFF.
  • VCF : a format used to describe genomic variants. The VEP will use the 3rd column of the file as the identifier.
  • bigWig : a format for storage of dense continuous data. The VEP uses the value for the given position as the "identifier". Note that bigWig files contain their own indices, and do not need to be indexed by tabix.

Any other files can be easily converted to be compatible with the VEP; the easiest format to produce is a BED-like file containing coordinates and an (optional) identifier:

chr1    10000    11000    Feature1
chr3    25000    26000    Feature2
chrX    99000    99001    Feature3

Chromosomes can be denoted by either e.g. "chr7" or "7", "chrX" or "X".

Preparing files

Custom annotation files must be prepared in a particular way in order to work with tabix and therefore with the VEP. Files must be sorted in chromosome and position order, compressed using bgzip and finally indexed using tabix. Here is an example of that process for a BED file:

sort -k1,1 -k2,2n -k3,3n myData.bed | bgzip > myData.bed.gz
tabix -p bed myData.bed.gz

The tabix utility has several preset filetypes that it can process, and it can also process any arbitrary filetype containing at least a chromosome and position column. See the documentation for details.

If you are going to use the file remotely (i.e. over HTTP or FTP protocol), you should ensure the file is world-readable on your server.
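
Before compressing and indexing, it can be worth confirming that a file is already in the required order. The following is a minimal sketch that reuses the sort keys from the example above with the standard -c (check only) option of sort; the function name is_sorted_bed is hypothetical.

```shell
# Sketch: succeed only if the file is already ordered by the same keys used
# in the sort command above (chromosome, then start, then end).
is_sorted_bed() {
    sort -c -k1,1 -k2,2n -k3,3n "$1" 2>/dev/null
}

# Example usage with the three-line BED-like file shown earlier
tmp=$(mktemp)
printf 'chr1\t10000\t11000\tFeature1\nchr3\t25000\t26000\tFeature2\nchrX\t99000\t99001\tFeature3\n' > "$tmp"
if is_sorted_bed "$tmp"; then
    echo "sorted: ready for bgzip and tabix"
else
    echo "not sorted: run the sort command first"
fi
rm -f "$tmp"
```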

Options for custom annotation

Each custom file can be configured individually. Beyond the filepath, there are 4 further options; all of the values are specified together in a single comma-separated list, for example:

perl variant_effect_predictor.pl -custom myFeatures.gff.gz,myFeatures,gff,overlap,0
perl variant_effect_predictor.pl -custom frequencies.bw,Frequency,bigwig,exact,0
perl variant_effect_predictor.pl -custom http://www.myserver.com/data/myPhenotypes.bed.gz,Phenotype,bed,exact,1

The options are as follows:

  • Filename : The path to the file. For tabix indexed files, the VEP will check that both the file and the corresponding .tbi file exist. For remote files, the VEP will check that the tabix index is accessible on startup.
  • Short name : A name for the annotation that will appear as the key in the key=value pairs in the results. If not defined, this will default to e.g. "Custom1" for the first set of annotation added.
  • File type : One of "bed", "gff", "gtf", "vcf", "bigwig". If not specified, the VEP assumes the file is BED format.
  • Annotation type : One of "exact", "overlap". When using "exact" only annotations whose coordinates match exactly those of the variant will be reported. This would be suitable for position specific information such as conservation scores, allele frequencies or phenotype information. Using "overlap", any annotation that overlaps the variant by even 1bp will be reported.
  • Force report coordinates : One of "0" or "1" (if left blank assumed to be "0") - if set to "1", this forces the VEP to output the coordinates of an overlapping custom feature instead of any found identifier (or value in the case of bigWig) field. If set to "0" (the default), the VEP will output the identifier field if one is found; if none is found, then the coordinates are used instead.

All options (apart from the filename) are optional and their absence will invoke the default behaviour.
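
The way the comma-separated list maps onto these options can be sketched in shell. This is only an illustration of the rules described above, not the VEP's own parsing code. Two details are assumptions: that the default annotation type is "overlap" (this section does not state the default), and that the default short name for later files follows the pattern "Custom2", "Custom3" and so on. The function name parse_custom is hypothetical.

```shell
# Sketch: split a --custom value into its five parts and fill in defaults.
# Runs in a subshell (parentheses) so the IFS and globbing changes stay local.
parse_custom() (
    spec=$1
    index=${2:-1}        # position of this file among the --custom flags
    set -f               # do not expand wildcard characters in the spec
    IFS=,
    set -- $spec         # split on commas into $1..$5
    # Defaults: short name CustomN, type bed, force coordinates 0.
    # The "overlap" default is an assumption, see the note above.
    printf 'file=%s\nname=%s\ntype=%s\nmatch=%s\nforce_coords=%s\n' \
        "$1" "${2:-Custom$index}" "${3:-bed}" "${4:-overlap}" "${5:-0}"
)

# All five values given explicitly
parse_custom frequencies.bw,Frequency,bigwig,exact,0

# Only the filename given, as the second custom file on the command line
parse_custom myData.bed.gz 2
```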

Using remote files

The tabix utility makes it possible to read annotation files from remote locations, for example over HTTP or FTP protocols. In order to do this, the .tbi index file is downloaded locally (to the current working directory) when the VEP is run. From this point on, only the portions of data requested by the script (i.e. those overlapping the variants in your input file) are downloaded. Users should be aware, however, that it is still possible to cause problems with network traffic in this manner by requesting data for a large number of variants. Users with large amounts of data should download the annotation file locally rather than risk causing any issues!

bigWig files can also be used remotely in the same way as tabix-indexed files, although less stringent checks are carried out on VEP startup. Furthermore, when using bigWig files, the VEP generates temporary files that by default are written to the /tmp/ directory - to override this, use the "-tmpdir /my/tmp/dir" flag.

Adding custom annotations to existing VEP results

It is possible to add custom annotation to existing VEP results files. To do this, you need to specify the --no_consequence option, and provide your VEP output file as the input file for the script. The script should auto-detect the format of the file; if it does not, you can force it to read the file as VEP output using "--format vep".
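
Putting these options together, a command of this kind might look like the one printed by the sketch below. The file names are placeholders and the function add_custom_cmd is a hypothetical helper; it prints the command rather than running it, so that the combination of flags can be inspected first.

```shell
# Sketch: print (rather than run) a command that adds one custom annotation
# to an existing VEP results file. File names here are placeholders.
add_custom_cmd() {
    results=$1     # existing VEP output file, used as the input
    custom=$2      # value for --custom: file,short name,type,...
    printf 'perl variant_effect_predictor.pl -i %s --format vep --no_consequence --custom %s\n' \
        "$results" "$custom"
}

add_custom_cmd vep_output.txt myData.bed.gz,myFeatures,bed,overlap,0
```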


Plugins

ADVANCED The VEP can use plugin modules written in Perl to add functionality to the script. Plugins are a powerful way to extend, filter and manipulate the output of the VEP.

How it works

Plugins are run once the VEP has finished its analysis for each line of the output, but before anything is printed to the output file. When each plugin is called (using the 'run' method) it is passed two data structures to use in its analysis; the first is a data structure containing all the data for the current line, and the second is a reference to a variation API object that represents the combination of a variant allele and an overlapping or nearby genomic feature (such as a transcript or regulatory region). This object provides access to all the relevant API objects that may be useful for further analysis by the plugin (such as the current VariationFeature and Transcript); please refer to the variation API documentation for more details.

Functionality

We expect that most plugins will simply add information to the last column of the output file, the "Extra" column, and the plugin system assumes this in various places, but plugins are also free to alter the output line as desired.

The only hard requirement for a plugin to work with the VEP is that it implements a number of required methods (such as 'new' which should create and return an instance of this plugin, 'get_header_info' which should return descriptions of the type of data this plugin produces to be included in the VEP output's header, and 'run' which should actually perform the logic of the plugin). To make development of plugins easier, we suggest that users use the Bio::EnsEMBL::Variation::Utils::BaseVepPlugin module as their base class, which provides default implementations of all the necessary methods which can be overridden as required. Please refer to the documentation in this module for details of all required methods and for a simple example of a plugin implementation.

Filtering using plugins

A common use for plugins will be to filter the output in some way (for example to limit output lines to non-synonymous variants), so we provide a simple mechanism to support this. The 'run' method of a plugin is assumed to return a reference to a hash containing information to be included in the output; if a plugin does not want to add any data to a particular line, it should return an empty hashref. If a plugin instead wants to filter a line and exclude it from the output, it should return 'undef' from its 'run' method; this also means that no further plugins will be run on the line. If you are developing a filter plugin, we suggest that you use the Bio::EnsEMBL::Variation::Utils::BaseVepFilterPlugin as your base class; you then need only override the 'include_line' method to return true if you want to include this line, and false otherwise. Again, please refer to the documentation in this module for more details and an example implementation of a non-synonymous filter.

Using plugins

In order to run a plugin you need to include the plugin module in Perl's library path somehow; by default the VEP includes the '~/.vep/Plugins' directory in the path, so this is a convenient place to store plugins, but you are also able to include modules by any other means (e.g. using the $PERL5LIB environment variable in Unix-like systems). You can then run a plugin using the '--plugin' command line option, passing the name of the plugin module as the argument. For example, if your plugin is in a module called MyPlugin.pm, stored in ~/.vep/Plugins, you can run it with a command line like:

perl variant_effect_predictor.pl -i input.vcf --plugin MyPlugin

You can pass arguments to the plugin's 'new' method by including them after the plugin name on the command line, separated by commas, e.g.:

perl variant_effect_predictor.pl -i input.vcf --plugin MyPlugin,1,FOO

If your plugin inherits from BaseVepPlugin, you can then retrieve these parameters as a list from the 'params' method.

You can run multiple plugins by supplying multiple --plugin arguments. Plugins are run serially in the order in which they are specified on the command line, so they can be run as a pipeline, with, for example, a later plugin filtering output based on the results from an earlier plugin. Note though that the first plugin to filter a line 'wins', and any later plugins won't get run on a filtered line.
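
If a plugin does not seem to be picked up, one thing to check is whether Perl can load the module from the directory where it is stored. The following is a minimal sketch using perl's standard -I and -M switches; the function name plugin_loadable is hypothetical, and MyPlugin is the example module name used above.

```shell
# Sketch: check that Perl can load a plugin module from a given directory.
# A plugin that inherits from BaseVepPlugin also needs the Ensembl API
# modules to be on Perl's library path for this check to pass.
plugin_loadable() {
    dir=$1       # directory holding the module, e.g. "$HOME/.vep/Plugins"
    module=$2    # module name without ".pm", e.g. MyPlugin
    perl -I"$dir" -M"$module" -e 1 2>/dev/null
}

if plugin_loadable "$HOME/.vep/Plugins" MyPlugin; then
    echo "MyPlugin can be loaded"
else
    echo "MyPlugin not found or failed to compile"
fi
```

Removing the redirection of error output shows Perl's own message, which usually names the missing module or the line that failed to compile.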

Examples

We have written several example plugins that implement experimental functionality that we do not (yet) include in the variation API, and these are stored in a public github repository:

https://github.com/ensembl-variation/VEP_plugins

We hope that these will serve as useful examples for users implementing new plugins. If you have any questions about the system, or suggestions for enhancements, please let us know on the developers' mailing list: dev@ensembl.org. We also encourage users to share any plugins they develop, and we intend to create a central portal for VEP plugins and other scripts written using Ensembl resources in the near future. In the meantime, please contact the developers' mailing list if you want to share your plugin.


Other information

HGVS notation

The VEP script supports using HGVS notations as input. This feature is currently under development, and not all HGVS notation types are supported. Specifically, only notations relative to genomic (g) or coding (c) sequences are currently supported; protein (p) notations are not currently supported due to the complexity involved in determining the multiple possible underlying genomic sequence changes that could produce a single protein change. The script will warn the user if it fails to parse a particular notation.

By default the VEP script uses Ensembl transcripts as its reference for determining consequences, and hence also for HGVS notations. However, it is possible to parse HGVS notations that use RefSeq transcripts as the reference sequence by using the --refseq flag when running the script. Such notations must include the version number of the transcript e.g.

NM_080794.3:c.1001C>T

where ".3" denotes that this is version 3 of the transcript NM_080794. See below for more details on how the VEP can use RefSeq transcripts.
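
Input files can be screened for notations that lack the version number before they are passed to the script. The sketch below is one way to do this with grep, assuming one notation per line and RefSeq-style accessions made of two capital letters, an underscore and digits (such as NM_080794); the function name missing_refseq_version is hypothetical.

```shell
# Sketch: print HGVS notations whose RefSeq-style accession (two capital
# letters, an underscore, digits) is followed directly by ":" with no
# ".version" suffix. Reads one notation per line from standard input.
missing_refseq_version() {
    grep -E '^[A-Z]{2}_[0-9]+:'
}

printf 'NM_080794.3:c.1001C>T\nNM_080794:c.1001C>T\n' | missing_refseq_version
# prints: NM_080794:c.1001C>T
```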

RefSeq transcripts

Ensembl produces Core schema databases containing alignments of RefSeq transcript objects to the reference genome. This is the otherfeatures database, and is produced for human and mouse. The database also contains alignments of CCDS transcripts and Ensembl EST sequences. By passing the --refseq flag when running the VEP script, these alternative transcripts will be used as the reference for predicting variant consequences. Gene IDs given in the output when using this option are generally NCBI GeneIDs.

Users should note that RefSeq sequences may disagree with the reference sequence to which they are aligned, hence results generated when using this option should be interpreted with a degree of caution. A much more complex and stringent process is used to produce the main Ensembl Core database, and this should be used in preference to the RefSeq transcripts.

SIFT and PolyPhen predictions and scores are now calculated and referred to internally using the translated sequence, so predictions are available using the --refseq flag where the RefSeq translation matches the Ensembl translation (they will match in the vast majority of cases - most differences between Ensembl and RefSeq transcripts occur in non-coding regions).

File conversion

The VEP script can be used to convert files between the various formats that it parses. This may be useful for a user with, for example, a number of variants given in HGVS notation against RefSeq transcript identifiers. The conversion process allows these notations to be converted into genomic reference coordinates, and then used to predict consequences in the VEP against Ensembl transcripts.
