
Ensembl Variation - Data description

This page describes the data stored in the Ensembl Variation databases. For several species in Ensembl, we import variation data (SNPs, CNVs, allele frequencies, genotypes, etc.) from a variety of sources (e.g. dbSNP).
We classify the variants into different classes and calculate the predicted consequence(s) of the variant. In human, we calculate the linkage disequilibrium for each variant, by population. We have also created sets to help people retrieve a specific group of variants from a particular dataset.


Variation species and data sources

Ensembl stores variation data for the following species, but note that users can still use the Variant Effect Predictor on species for which we do not currently have a variation database.

Bos taurus
Canis familiaris
Danio rerio
Drosophila melanogaster
Equus caballus
Felis catus
Gallus gallus
Homo sapiens
Macaca mulatta (new)
Monodelphis domestica
Mus musculus
Ornithorhynchus anatinus
Pan troglodytes
Pongo abelii
Rattus norvegicus
Saccharomyces cerevisiae
Sus scrofa
Taeniopygia guttata
Tetraodon nigroviridis

The majority of variants are imported from NCBI dbSNP. The data is imported when it is released by dbSNP and incorporated into the next Ensembl release. If dbSNP releases the data on a different assembly, Ensembl will remap the variant positions onto the current assembly. Data from projects like the HapMap Project and 1000 Genomes Project is imported once it has been submitted to dbSNP.

Ensembl also includes data from other sources. To view data from these sources in the browser go to a species Location page (e.g. for human), and click on the 'Configure this page' link on the left-hand side. The 'Germline variation' and 'Somatic mutations' sections contain a track list of all sources of variation data for that species.


Variation displays

Variation data can be viewed in the browser through pages such as:

  • Gene: Variation Table and Variation Image (for all variations in a gene), e.g. for all variants in KCNE2; Structural Variation (for all structural variants overlapping the gene).
  • Transcript: Population comparison, Comparison image (for comparing variants in a transcript across different individual or strain sequences) e.g. compare Tmco4 in different mouse strains
  • Transcript: Sequence, protein: list of the coding variants in protein coordinates.
  • Location: Region in Detail (Variations can be drawn using "Configure this page" at the left. The menu allows display of information in Ensembl databases along with external sources in DAS format such as DGV loci.)

Clicking on any variation on an Ensembl page will open a Variation tab with information about the flanking sequence and source for the selected variation. Links to linkage disequilibrium (LD) plots, to phenotype information (for human) from EGA, OMIM and NHGRI, and to the Ensembl genes and transcripts that include the variation can be found at the left of this tab. You may also view multiple genome alignments of various species, with the variation highlighted. Ancestral sequences are included in this display.

Variation information can also be accessed using BioMart (gene or variation database), and the Perl API (variation API).


Data types

The Ensembl Variation database stores data imported from external sources and also data calculated on site.

  • Data imported from external sources (dbSNP, Sanger, DGVa, ...):
    • Variations (SNPs, indels, insertions, deletions, ...)
    • Structural variations (copy number variations, tandem duplications, inversions, ...)
    • Probes for copy number variations
    • Locations for variations and structural variations
    • Alleles
    • Populations
    • Genotypes
    • Phenotypes
  • Calculated data: see the Predicted data page.

Variation classes

We call the class of a variation according to its component alleles and its mapping to the reference genome, and then display this information on the website. Internally we use Sequence Ontology (SO) terms, but we map these to our own 'display' terms where common usage differs from the SO definition (e.g. our term SNP is closer to the SO term SNV). All the classes we call, along with their equivalent SO terms, are shown in the table below. We also differentiate somatic mutations from germline variations in the display term by prefixing the term with 'somatic'. API users can fetch either the SO term or the display term.


| Ensembl term (germline / somatic) | SO term | SO description | SO accession | Called for |
|---|---|---|---|---|
| SNP / somatic_SNV | SNV | SNVs are single nucleotide positions in genomic DNA at which different sequence alternatives exist. | SO:0001483 | Variation |
| indel / somatic_indel | indel | A sequence alteration which included an insertion and a deletion, affecting 2 or more bases. | SO:1000032 | Variation |
| substitution / somatic_substitution | substitution | A sequence alteration where the length of the change in the variant is the same as that of the reference. | SO:1000002 | Variation |
| tandem_repeat / somatic_tandem_repeat | tandem_repeat | Two or more adjacent copies of a region (of length greater than 1). | SO:0000705 | Variation |
| Complex / somatic_Complex | complex_structural_alteration | A structural sequence alteration or rearrangement encompassing one or more genome fragments. | SO:0001784 | Structural variation |
| Gain / somatic_Gain | copy_number_gain | A sequence alteration whereby the copy number of a given region is greater than the reference sequence. | SO:0001742 | Structural variation |
| Loss / somatic_Loss | copy_number_loss | A sequence alteration whereby the copy number of a given region is less than the reference sequence. | SO:0001743 | Structural variation |
| CNV / somatic_CNV | copy_number_variation | A variation that increases or decreases the copy number of a given region. | SO:0001019 | Structural variation |
| Interchromosomal breakpoint / somatic_Interchromosomal breakpoint | interchromosomal_breakpoint | A rearrangement breakpoint between two different chromosomes. | SO:0001873 | Structural variation |
| Intrachromosomal breakpoint / somatic_Intrachromosomal breakpoint | intrachromosomal_breakpoint | A rearrangement breakpoint within the same chromosome. | SO:0001874 | Structural variation |
| inversion / somatic_inversion | inversion | A continuous nucleotide sequence is inverted in the same position. | SO:1000036 | Structural variation |
| Tandem duplication / somatic_Tandem duplication | tandem_duplication | A duplication consisting of 2 identical adjacent regions. | SO:1000173 | Structural variation |
| translocation / somatic_translocation | translocation | A region of nucleotide sequence that has translocated to a new position. | SO:0000199 | Structural variation |
| deletion / somatic_deletion | deletion | The point at which one or more contiguous nucleotides were excised. | SO:0000159 | Variation, Structural variation |
| insertion / somatic_insertion | insertion | The sequence of one or more nucleotides added between two adjacent nucleotides in the sequence. | SO:0000667 | Variation, Structural variation |
| sequence_alteration / somatic_sequence_alteration | sequence_alteration | A sequence_alteration is a sequence_feature whose extent is the deviation from another sequence. | SO:0001059 | Variation, Structural variation |

In the genome browser, structural variation classes (and only those) are colour-coded. The colours are based on the dbVar displays.
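
The relationship between SO terms and display terms can be sketched in a few lines of Python. This is only an illustration built from a handful of rows of the class table; the names `SO_TO_DISPLAY`, `SOMATIC_OVERRIDES` and `display_term` are hypothetical and are not part of the Ensembl API.

```python
# Illustrative subset of the class table: SO term -> Ensembl display term.
# Hypothetical names; this is not how the Ensembl API stores the mapping.
SO_TO_DISPLAY = {
    "SNV": "SNP",
    "indel": "indel",
    "copy_number_gain": "Gain",
    "copy_number_loss": "Loss",
    "copy_number_variation": "CNV",
    "deletion": "deletion",
    "insertion": "insertion",
}

# In the table, the somatic form of SNP is listed as 'somatic_SNV';
# every other somatic term is the display term prefixed with 'somatic_'.
SOMATIC_OVERRIDES = {"SNP": "somatic_SNV"}


def display_term(so_term, somatic=False):
    """Return the display term for an SO term, optionally in somatic form."""
    germline = SO_TO_DISPLAY[so_term]
    if not somatic:
        return germline
    return SOMATIC_OVERRIDES.get(germline, "somatic_" + germline)


print(display_term("SNV"))                             # SNP
print(display_term("SNV", somatic=True))               # somatic_SNV
print(display_term("copy_number_gain", somatic=True))  # somatic_Gain
```

Keeping the SO term as the key mirrors the statement that SO terms are used internally, with display terms derived from them for presentation.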


[Figure: Variation classes distribution]

Insertion and Deletion coordinates

In Ensembl, an insertion is indicated by start coordinate = end coordinate + 1. For example, an insertion of 'C' between nucleotides 12600 and 12601 on the forward strand is indicated with start and end coordinates as follows:

   start: 12601     end: 12600

A deletion is indicated by the exact nucleotide coordinates. For example, a three base pair deletion of nucleotides 12600, 12601 and 12602 on the reverse strand will have the following start and end coordinates:

   start: 12600     end: 12602
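
The coordinate convention above can be expressed as a short Python sketch. The function names are hypothetical and chosen for illustration only; they do not correspond to Ensembl API methods.

```python
from typing import Tuple


def insertion_coords(left_base: int) -> Tuple[int, int]:
    """Coordinates of an insertion between left_base and left_base + 1.

    By the convention described above, start = end + 1.
    """
    return (left_base + 1, left_base)


def deletion_coords(first_deleted: int, length: int) -> Tuple[int, int]:
    """Coordinates of a deletion covering `length` bases from first_deleted."""
    return (first_deleted, first_deleted + length - 1)


def is_insertion(start: int, end: int) -> bool:
    """An insertion is recognisable because start is one greater than end."""
    return start == end + 1


print(insertion_coords(12600))     # (12601, 12600)
print(deletion_coords(12600, 3))   # (12600, 12602)
print(is_insertion(12601, 12600))  # True
```

A practical consequence is that code computing feature length as end - start + 1 yields 0 for an insertion, which is a convenient sanity check.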

Variation sets

We use the concept of variation sets to group together variations that share some property. For example, we have grouped the variations identified in the three different 1000 Genomes pilot studies into separate variation sets. Sets can be arranged into supersets and subsets to reflect hierarchical relationships between them. In the case of the 1000 Genomes pilot sets, these are divided into subsets based on population. For example, the set representing variations identified in the first 1000 Genomes pilot study is named '1000 Genomes - Low coverage' and has three subsets: '1000 Genomes - Low coverage - CEU', '1000 Genomes - Low coverage - CHB+JPT' and '1000 Genomes - Low coverage - YRI'. The variation sets can be displayed as separate tracks on the location view. This behaviour is controlled from the 'Germline variations' section of the configuration panel, which is accessed by clicking the 'Configure this page' link in the left-hand side navigation.
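
As a rough illustration of the superset/subset hierarchy, the '1000 Genomes - Low coverage' example can be modelled as a small tree. The `VariationSet` class below is a hypothetical sketch written for this page; it is not the Ensembl API class and says nothing about how the sets are stored in the database.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class VariationSet:
    """Hypothetical model of a variation set with optional subsets."""
    name: str
    subsets: List["VariationSet"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (depth, name) for this set and all nested subsets."""
        yield depth, self.name
        for subset in self.subsets:
            yield from subset.walk(depth + 1)


# Set names taken from the example in the text above.
low_coverage = VariationSet(
    "1000 Genomes - Low coverage",
    [
        VariationSet("1000 Genomes - Low coverage - CEU"),
        VariationSet("1000 Genomes - Low coverage - CHB+JPT"),
        VariationSet("1000 Genomes - Low coverage - YRI"),
    ],
)

# Print the hierarchy, marking subsets with a bullet.
for depth, name in low_coverage.walk():
    print("  " * depth + ("• " if depth else "") + name)
```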

The sets are constructed during production and are stored in the database. The table below lists the available variation sets in the Ensembl variation database (subsets are indicated by bullet points).

Variation sets common to all the species

| Name | Short name | Description |
|---|---|---|
| All failed variations | fail_all | Variations that have failed the Ensembl QC checks |

Variation sets specific to Human

| Name | Short name | Description |
|---|---|---|
| 1000 Genomes - All | 1kg | Variants genotyped by the 1000 Genomes project (phase 1) |
| • 1000 Genomes - AFR | 1kg_afr | Variants genotyped in African individuals by the 1000 Genomes project (phase 1) |
| • 1000 Genomes - AFR - common | 1kg_afr_com | Variants genotyped in African individuals by the 1000 Genomes project (phase 1) with frequency of at least 1% |
| • 1000 Genomes - AMR | 1kg_amr | Variants genotyped in admixed American individuals by the 1000 Genomes project (phase 1) |
| • 1000 Genomes - AMR - common | 1kg_amr_com | Variants genotyped in admixed American individuals by the 1000 Genomes project (phase 1) with frequency of at least 1% |
| • 1000 Genomes - ASN | 1kg_asn | Variants genotyped in East Asian individuals by the 1000 Genomes project (phase 1) |
| • 1000 Genomes - ASN - common | 1kg_asn_com | Variants genotyped in East Asian individuals by the 1000 Genomes project (phase 1) with frequency of at least 1% |
| • 1000 Genomes - All - common | 1kg_com | Variants genotyped by the 1000 Genomes project (phase 1) with frequency of at least 1% |
| • 1000 Genomes - EUR | 1kg_eur | Variants genotyped in European individuals by the 1000 Genomes project (phase 1) |
| • 1000 Genomes - EUR - common | 1kg_eur_com | Variants genotyped in European individuals by the 1000 Genomes project (phase 1) with frequency of at least 1% |
| 1000 Genomes - High coverage - Trios | 1kg_hct | Variations called by the 1000 Genomes project on high coverage sequence data from two family trios (Pilot 2) |
| 1000 Genomes - Low coverage | 1kg_lc | Variations called by the 1000 Genomes project on low coverage sequence data from 179 unrelated individuals (Pilot 1) |
| All phenotype-associated variants | ph_variants | Variants that have been associated with a phenotype |
| • COSMIC phenotype variants | ph_cosmic | Phenotype annotations of somatic mutations found in human cancers from the COSMIC project |
| • HGMD-PUBLIC phenotype variants | ph_hgmd_pub | Variants with phenotypes annotated by HGMD |
| • Johnson & O'Donnell phenotype variants | ph_johnson_et_al | Johnson & O'Donnell 'An Open Access Database of Genome-wide Association Results' PMID:19161620 |
| • NHGRI catalog phenotype variants | ph_nhgri | Variants associated with phenotype data from the NHGRI GWAS catalog [http://www.genome.gov/gwastudies/] |
| • OMIM phenotype variants | ph_omim | Variations linked to entries in the Online Mendelian Inheritance in Man (OMIM) database |
| • Uniprot phenotype variants | ph_uniprot | Variations with phenotype annotations provided by Uniprot |
| Anonymous Irish Male | ind_irish | Variants genotyped in an anonymous Irish Male |
| Anonymous Korean | ind_ak1 | Variants genotyped in an anonymous Korean individual |
| Clinical/LSDB variations from dbSNP | precious | Variations that belong to a reserved or "precious" set of clinically associated SNPs from dbSNP [http://www.ncbi.nlm.nih.gov/projects/SNP/] |
| ENSEMBL:Venter | ind_venter | Variants genotyped in Craig Venter |
| ENSEMBL:Watson | ind_watson | Variants genotyped in James Watson |
| HapMap | hapmap | Variations which have been assayed by The International HapMap Project [http://hapmap.ncbi.nlm.nih.gov/] |
| • HapMap - CEU | hapmap_ceu | Variations which have been assayed by The International HapMap Project from CEU individuals |
| • HapMap - HCB | hapmap_hcb | Variations which have been assayed by The International HapMap Project from HCB individuals |
| • HapMap - JPT | hapmap_jpt | Variations which have been assayed by The International HapMap Project from JPT individuals |
| • HapMap - YRI | hapmap_yri | Variations which have been assayed by The International HapMap Project from YRI individuals |
| Henry Louis Gates Jr | ind_gates_jr | Variants genotyped in Henry Louis Gates Jr |
| Henry Louis Gates Sr | ind_gates_sr | Variants genotyped in Henry Louis Gates Sr |
| Marjolein Kriek | ind_kriek | Variants genotyped in Marjolein Kriek |
| Misha Angrist | ind_angrist | Variants genotyped in Misha Angrist |
| Rosalynn Gill | ind_gill | Variants genotyped in Rosalynn Gill |
| Saqqaq | ind_saqqaq | Variants genotyped in a Palaeo-Eskimo Saqqaq individual |
| Saqqaq HC | ind_saqqaq_hc | Variants genotyped in a Palaeo-Eskimo Saqqaq individual (high confidence SNPs) |
| Seong-Jin Kim | ind_sjk | Variants genotyped in Seong-Jin Kim |
| Stephen Quake | ind_quake | Variants genotyped in Stephen Quake |
| YanHang | ind_yh | Variants genotyped in a Han Chinese individual (YanHuang Project) |