
Variant Effect Predictor data formats


Input

Both the web and script versions of the VEP accept the same input formats. The VEP script can auto-detect the format, but the format must be selected manually when using the web interface.

Default

The default format is a simple whitespace-separated format (columns may be separated by space or tab characters), containing five required columns plus an optional identifier column:

  1. chromosome - just the name or number, with no 'chr' prefix
  2. start
  3. end
  4. allele - pair of alleles separated by a '/', with the reference allele first
  5. strand - defined as + (forward) or - (reverse).
  6. identifier - this identifier will be used in the VEP's output. If not provided, the VEP will construct an identifier from the given coordinates and alleles.
1   881907    881906    -/C   +
5   140532    140532    T/C   +
12  1017956   1017956   T/A   +
2   946507    946507    G/C   +
14  19584687  19584687  C/T   -
19  66520     66520     G/A   +    var1
8   150029    150029    A/T   +    var2
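
As an illustration of the column layout above, the following is a minimal Python sketch of a reader for one line of the default format. It is not part of the VEP; the names DefaultRecord and parse_default_line are hypothetical, and the checks on the strand and allele columns are assumptions based only on the description above.

```python
from typing import NamedTuple, Optional


class DefaultRecord(NamedTuple):
    """One row of the default input format (hypothetical container)."""
    chromosome: str
    start: int
    end: int
    alleles: str               # e.g. "T/C", reference allele first
    strand: str                # "+" (forward) or "-" (reverse)
    identifier: Optional[str]  # None when the sixth column is absent


def parse_default_line(line: str) -> DefaultRecord:
    # str.split() with no argument splits on any run of spaces or tabs.
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    chromosome, start, end, alleles, strand = fields[:5]
    if "/" not in alleles:
        raise ValueError(f"alleles must be separated by '/': {alleles!r}")
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-': {strand!r}")
    identifier = fields[5] if len(fields) == 6 else None
    return DefaultRecord(chromosome, int(start), int(end), alleles, strand, identifier)


record = parse_default_line("19  66520     66520     G/A   +    var1")
print(record.chromosome, record.start, record.alleles, record.identifier)
```

Rows without a sixth column are returned with identifier set to None, leaving it to the caller to construct one.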

An insertion (of any size) is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand of chromosome 8 is indicated as follows:

8   12601     12600     -/C   +

A deletion is indicated by the exact coordinates of the deleted nucleotides. For example, a three base pair deletion of nucleotides 12600, 12601, and 12602 on the reverse strand of chromosome 8 will be:

8   12600     12602     CGT/- -
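
The coordinate conventions for insertions and deletions can be sketched in Python as follows. This is an illustrative helper only; the function name classify_variant and the three labels it returns are hypothetical and not taken from the VEP.

```python
def classify_variant(start: int, end: int, alleles: str) -> str:
    """Label a default-format row as 'insertion', 'deletion' or 'substitution'."""
    # Alleles are written as reference/variant, e.g. "-/C" or "CGT/-".
    ref, alt = alleles.split("/", 1)
    if start == end + 1:
        # Insertion: start is one greater than end, whatever the inserted length.
        return "insertion"
    if alt == "-":
        # Deletion: start..end cover exactly the deleted bases.
        return "deletion"
    return "substitution"


print(classify_variant(12601, 12600, "-/C"))
print(classify_variant(12600, 12602, "CGT/-"))
```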

VCF

The VEP also supports VCF (Variant Call Format) version 4.0. This is a common format used by the 1000 Genomes Project, and many variant calling tools can produce it as output.

Users of VCF should note a difference in how Ensembl and VCF describe unbalanced variations. For any unbalanced variant (i.e. insertion, deletion or unbalanced substitution), the VCF specification requires that the base immediately before the variant be included in both the reference and variant alleles. This also affects the reported position, i.e. the reported position will be one base before the actual site of the variant.

In order to parse this correctly, the VEP needs to convert such variants into Ensembl-type coordinates, and it does this by removing the additional base and adjusting the coordinates accordingly. This means that if an identifier is not supplied for a variant (in the 3rd column of the VCF), then the identifier constructed and the position reported in the VEP's output file will differ from the input. This problem can be overcome by ensuring each variant has a unique identifier specified in the 3rd column of the VCF.

The following examples illustrate how VCF describes a variant and how it is handled internally by the VEP. Consider the following aligned sequences (assumed, for the purposes of discussion, to be on chromosome 20):

Ref: a t C g a // C is the reference base
 1 : a t G g a // C base is a G in individual 1
 2 : a t - g a // C base is deleted w.r.t. the reference in individual 2
 3 : a t CAg a // A base is inserted w.r.t. the reference sequence in individual 3

Individual 1

The first individual shows a simple balanced substitution of G for C at base 3. This is described in a compatible manner in VCF and Ensembl styles. Firstly, in VCF:

20   3   .   C   G   .   PASS   .

And in Ensembl format:

 20   3   3   C/G   +

Individual 2

The second individual has the 3rd base deleted relative to the reference. In VCF, both the reference and variant allele columns must include the preceding base (T) and the reported position is that of the preceding base:

20   2   .   TC   T   .   PASS   .

In Ensembl format, the preceding base is not included, and the start/end coordinates represent the region of the sequence deleted. A "-" character is used to indicate that the base is deleted in the variant sequence:

20   3   3   C/-   +

The upshot of this is that while in the VCF input file the position of the variant is reported as 2, in the output file from the VEP the position will be reported as 3. If no identifier is provided in the third column of the VCF, then the constructed identifier will be:

20_3_C/-

Individual 3

The third individual has an "A" inserted between the 3rd and 4th bases of the sequence relative to the reference. In VCF, as for the deletion, the base before the insertion is included in both the reference and variant allele columns, and the reported position is that of the preceding base:

20   3   .   C   CA   .   PASS   .

In Ensembl format, again the preceding base is not included, and the start/end positions are "swapped" to indicate that this is an insertion. Similarly to a deletion, a "-" is used to indicate no sequence in the reference:

 20   4   3   -/A   +

Again, the output will appear different, and the constructed identifier may not be what is expected:

20_3_-/A

The solution is to always add a unique identifier for each of your variants to the VCF file.
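
The conversion described in this section can be sketched in Python as follows. This is an illustration of the rule (remove the shared preceding base and shift the position by one), not the VEP's own code. The function name vcf_to_ensembl is hypothetical, and the sketch assumes a single variant allele per record and the forward strand.

```python
from typing import Tuple


def vcf_to_ensembl(chrom: str, pos: int, ref: str, alt: str) -> Tuple[str, int, int, str, str]:
    """Convert one VCF record (single ALT assumed) to Ensembl-style columns."""
    if len(ref) != len(alt) and ref[:1] == alt[:1]:
        # Unbalanced variant: drop the preceding base that VCF includes in
        # both alleles, and move the position to the actual variant site.
        ref, alt = ref[1:], alt[1:]
        pos += 1
    start = pos
    # For an insertion ref is now empty, so end = start - 1 ("swapped" coordinates).
    end = pos + len(ref) - 1
    alleles = f"{ref or '-'}/{alt or '-'}"
    return chrom, start, end, alleles, "+"


for record in [("20", 3, "C", "G"), ("20", 2, "TC", "T"), ("20", 3, "C", "CA")]:
    print(*vcf_to_ensembl(*record))
```

Applied to the three individuals above, the sketch yields 20 3 3 C/G +, 20 3 3 C/- + and 20 4 3 -/A +, matching the Ensembl-format lines shown for each.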

Pileup

The pileup format can also be used as input for the VEP. This is the output of the ssaha pileup package.

HGVS identifiers

See http://www.hgvs.org/mutnomen/ for details. These must be relative to genomic or Ensembl transcript coordinates. It is possible, although less reliable, to use RefSeq transcripts in both the web interface and the VEP script (see script documentation).

Examples:

ENST00000207771.3:c.344+626A>T
ENST00000471631.1:c.28_33delTCGCGG
ENST00000285667.3:c.1047_1048insC
5:g.140532T>C

Examples using RefSeq identifiers (use --refseq in the VEP script, or select the otherfeatures transcript database on the web interface):

NM_153681.2:c.7C>T
NM_005239.4:c.190G>A
NM_001025204.1:c.336G>A
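
The examples above share a common shape: a reference sequence name, a colon, a one-letter coordinate type (c for coding, g for genomic), a dot, and the change itself. A minimal Python sketch that splits a notation into these parts is given below. The function name split_hgvs is hypothetical, and the sketch only separates the parts; it does not validate the full HGVS nomenclature.

```python
from typing import Tuple


def split_hgvs(notation: str) -> Tuple[str, str, str]:
    """Split e.g. '5:g.140532T>C' into (reference, coordinate type, change)."""
    reference, colon, description = notation.partition(":")
    # The description must look like "<letter>.<change>", e.g. "c.7C>T".
    if not colon or len(description) < 3 or description[1] != ".":
        raise ValueError(f"not in reference:type.change form: {notation!r}")
    return reference, description[0], description[2:]


print(split_hgvs("ENST00000207771.3:c.344+626A>T"))
print(split_hgvs("5:g.140532T>C"))
```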

Variant identifiers

These should be, for example, dbSNP rsIDs, or any synonym for a variant present in the Ensembl Variation database. A list of identifier sources in Ensembl is available in the Ensembl documentation.



Output

The output format from the web and script VEP is the same. The output columns are:

  1. Uploaded variation - as chromosome_start_alleles
  2. Location - in standard coordinate format (chr:start or chr:start-end)
  3. Allele - the variant allele used to calculate the consequence
  4. Gene - Ensembl stable ID of affected gene
  5. Feature - Ensembl stable ID of feature
  6. Feature type - type of feature. Currently one of Transcript, RegulatoryFeature, MotifFeature.
  7. Consequence - consequence type of this variation
  8. Relative position in cDNA - base pair position in cDNA sequence
  9. Relative position in CDS - base pair position in coding sequence
  10. Relative position in protein - amino acid position in protein
  11. Amino acid change - only given if the variation affects the protein-coding sequence
  12. Codons - the alternate codons with the variant base highlighted as bold (HTML) or upper case (text)
  13. Corresponding variation - identifier of existing variation
  14. Extra - this column contains extra information as key=value pairs separated by ";". The keys are as follows:
    • HGNC - the HGNC gene identifier
    • ENSP - the Ensembl protein identifier of the affected transcript
    • HGVSc - the HGVS coding sequence name
    • HGVSp - the HGVS protein sequence name
    • SIFT - the SIFT prediction and/or score, with both given as prediction(score)
    • PolyPhen - the PolyPhen prediction and/or score
    • MOTIF_NAME - the source and identifier of a transcription factor binding profile (TFBP) aligned at this position
    • MOTIF_POS - the relative position of the variation in the aligned TFBP
    • HIGH_INF_POS - a flag indicating if the variant falls in a high information position of the TFBP
    • MOTIF_SCORE_CHANGE - the difference in motif score of the reference and variant sequences for the TFBP
    • CELL_TYPE - list of cell types and classifications for regulatory feature
    • CANONICAL - a flag indicating if the transcript is denoted as the canonical transcript for this gene
    • CCDS - the CCDS identifier for this transcript, where applicable
    • INTRON - the intron number (out of total number)
    • EXON - the exon number (out of total number)
    • DOMAINS - the source and identifier of any overlapping protein domains
    • IND - individual name
    • SV - IDs of overlapping structural variants
    • FREQS - frequencies of overlapping variants used in filtering

Empty values are denoted by '-'. Further fields in the Extra column can be added by plugins or using custom annotations in the VEP script. Output fields can be configured using the --fields flag when running the VEP script.

11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000525319  Transcript         NON_SYNONYMOUS_CODING   742  716  239  T/N  aCc/aAc  -  SIFT=deleterious(0);PolyPhen=unknown(0)
11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000534381  Transcript         5_PRIME_UTR             -    -    -    -    -        -  -
11_224088_C/A    11:224088   A  ENSG00000142082  ENST00000529055  Transcript         DOWNSTREAM              -    -    -    -    -        -  -
11_224585_G/A    11:224585   A  ENSG00000142082  ENST00000529937  Transcript         INTRONIC,NMD_TRANSCRIPT -    -    -    -    -        -  HGVSc=ENST00000529937.1:c.136-346G>A
22_16084370_G/A  22:16084370 A  -                ENSR00000615113  RegulatoryFeature  REGULATORY_REGION       -    -    -    -    -        -  -
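
A line of output such as those above can be read into a dictionary with a short Python sketch. The key names in COLUMNS are hypothetical labels derived from the column list in this section, not names defined by the VEP. The sketch assumes the default 14 columns, that no value contains whitespace, and that '-' marks an empty value.

```python
from typing import Dict

# Hypothetical key names, one per output column, in the documented order.
COLUMNS = [
    "uploaded_variation", "location", "allele", "gene", "feature",
    "feature_type", "consequence", "cdna_position", "cds_position",
    "protein_position", "amino_acid_change", "codons",
    "corresponding_variation", "extra",
]


def parse_extra(extra: str) -> Dict[str, str]:
    """Split the Extra column into its key=value pairs."""
    if extra == "-":
        return {}
    pairs = {}
    for item in extra.split(";"):
        if item:
            # partition() tolerates an item with no '=' (value is then empty).
            key, _, value = item.partition("=")
            pairs[key] = value
    return pairs


def parse_output_line(line: str) -> dict:
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    record = dict(zip(COLUMNS, values))
    record["extra"] = parse_extra(record["extra"])
    return record


line = ("11_224088_C/A 11:224088 A ENSG00000142082 ENST00000525319 Transcript "
        "NON_SYNONYMOUS_CODING 742 716 239 T/N aCc/aAc - "
        "SIFT=deleterious(0);PolyPhen=unknown(0)")
record = parse_output_line(line)
print(record["consequence"], record["extra"])
```

If the columns have been changed with --fields, COLUMNS would need to be adjusted to match.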

The VEP script will also add a header to the output file. This contains information about the databases connected to, and also a key describing the key/value pairs used in the extra column.

## ENSEMBL VARIANT EFFECT PREDICTOR v2.5
## Output produced at 2012-05-20 16:09:38
## Connected to homo_sapiens_core_67_37 on ensembldb.ensembl.org
## Using API version 67, DB version 67
## Extra column keys:
## CANONICAL    : Indicates if transcript is canonical for this gene
## CCDS         : Indicates if transcript is a CCDS transcript
## HGNC         : HGNC gene identifier
## ENSP         : Ensembl protein identifer
## HGVSc        : HGVS coding sequence name
## HGVSp        : HGVS protein sequence name
## SIFT         : SIFT prediction
## PolyPhen     : PolyPhen prediction
## EXON         : Exon number
## INTRON       : Intron number
## DOMAINS      : The source and identifer of any overlapping protein domains
## MOTIF_NAME   : The source and identifier of a transcription factor binding profile (TFBP) aligned at this position
## MOTIF_POS    : The relative position of the variation in the aligned TFBP
## HIGH_INF_POS : A flag indicating if the variant falls in a high information position of the TFBP
## MOTIF_SCORE_CHANGE : The difference in motif score of the reference and variant sequences for the TFBP
## CELL_TYPE    : List of cell types and classifications for regulatory feature
## IND          : Individual name
## SV           : IDs of overlapping structural variants
## FREQS        : Frequencies of overlapping variants used in filtering
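
The key descriptions in a header like the one above can be collected with a short Python sketch. The function name parse_header_keys is hypothetical. The sketch assumes that header lines start with '##' and that the key list follows a line reading 'Extra column keys:', as in the example.

```python
from typing import Dict, Iterable


def parse_header_keys(lines: Iterable[str]) -> Dict[str, str]:
    """Map each Extra-column key in the header to its description."""
    keys: Dict[str, str] = {}
    in_key_section = False
    for line in lines:
        if not line.startswith("##"):
            break  # the header has ended; data lines follow
        text = line[2:].strip()
        if text == "Extra column keys:":
            in_key_section = True
            continue
        if in_key_section and ":" in text:
            # Split on the first colon only, so descriptions may contain colons.
            name, _, description = text.partition(":")
            keys[name.strip()] = description.strip()
    return keys


header = [
    "## ENSEMBL VARIANT EFFECT PREDICTOR v2.5",
    "## Output produced at 2012-05-20 16:09:38",
    "## Extra column keys:",
    "## HGNC         : HGNC gene identifier",
    "## SIFT         : SIFT prediction",
]
print(parse_header_keys(header))
```

Lines before the key list, such as the timestamp, are skipped even though they contain colons.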
