
Ensembl Core Schema Documentation

Introduction

This document gives a high-level description of the tables that make up the EnsEMBL core schema. Tables are grouped into logical groups, and the purpose of each table is explained. It is intended to allow people to familiarise themselves with the schema when encountering it for the first time, or when they need to use some tables that they've not used before. Note that while some of the more important columns in some of the tables are discussed, this document makes no attempt to enumerate all of the names, types and contents of every single table. Some concepts which are referred to in the table descriptions are given at the end of this document; these are linked to from the table description where appropriate.

Different tables are populated throughout the gene build process:

Step  Process
0     Create empty schema, populate meta table
1     Load DNA - populates dna, clone, contig, chromosome, assembly tables
2     Analyze DNA (raw computes) - populates genomic feature/analysis tables
3     Build genes - populates exon, transcript and other gene-related tables
4a    Analyze genes - populates protein_feature, xref tables, interpro
4b    ID mapping

This document refers to version 67 of the EnsEMBL core schema.


List of the tables:

Fundamental Tables

Features and Analyses

ID Mapping

External References

Miscellaneous



Fundamental Tables

A PDF document of the schema is available here.


assembly

The assembly table states which parts of seq_regions are exactly equal. It enables coordinates to be transformed between seq_regions. Typically this contains how chromosomes are made of contigs, clones out of contigs, and chromosomes out of supercontigs. It allows you to artificially chunk chromosome sequence into smaller parts. The data in this table defines the "static golden path", i.e. the best effort draft full genome sequence as determined by the UCSC or NCBI (depending on which assembly you are using). Each row represents a component, e.g. a contig, (cmp_seq_region_id, FK from seq_region table) at least part of which is present in the golden path. The part of the component that is in the path is delimited by fields cmp_start and cmp_end (start < end), and the absolute position within the golden path chromosome (or other appropriate assembled structure) (asm_seq_region_id) is given by asm_start and asm_end.
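As a rough illustration of how one row of this table can be used to transform coordinates, the Python sketch below maps a position on a component onto the assembled region. AssemblyRow and component_to_assembled are hypothetical names, and the ori orientation flag is an assumption that is not described in the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyRow:
    """One assembly row, reduced to the coordinate columns discussed above."""
    asm_start: int
    asm_end: int
    cmp_start: int
    cmp_end: int
    ori: int = 1  # assumed orientation flag: 1 = same strand, -1 = reversed


def component_to_assembled(row: AssemblyRow, cmp_pos: int) -> int:
    """Map a 1-based position on the component onto the assembled region."""
    if not row.cmp_start <= cmp_pos <= row.cmp_end:
        raise ValueError("position lies outside the part used in the path")
    offset = cmp_pos - row.cmp_start
    if row.ori == 1:
        return row.asm_start + offset
    return row.asm_end - offset


row = AssemblyRow(asm_start=1001, asm_end=1500, cmp_start=101, cmp_end=600)
print(component_to_assembled(row, 101))
print(component_to_assembled(row, 600))
```

The two calls map the first and last used bases of the component, which should land on asm_start and asm_end respectively when the orientation is forward.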



assembly_exception

Allows multiple sequence regions to point to the same sequence, analogous to a symbolic link in a filesystem pointing to the actual file. This mechanism has been implemented specifically to support haplotypes and PARs, but may be useful for other similar structures in the future.


attrib_type

Provides codes, names and descriptions of attribute types.


coord_system

Stores information about the available co-ordinate systems for the species identified through the species_id field. Note that for each species, there must be one co-ordinate system that has the attribute "top_level" and one that has the attribute "sequence_level".


dna

Contains DNA sequence. This table has a 1:1 relationship with the contig table.


dnac

Contains equivalent data to dna table, but 4 letters of DNA code are represented by a single binary character, based on 2 bit encoding.
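To make the idea of 2-bit packing concrete, here is a minimal Python sketch that stores four bases per byte. The base-to-code mapping, the bit order and the function names are assumptions chosen for illustration; the actual encoding used by the dnac table may differ, and ambiguity codes such as N are not handled.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, for illustration only
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (2 bits each)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]  # KeyError for non-ACGT letters
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first n_bases bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

Because the last byte may be only partly filled, the sequence length has to be kept alongside the packed data, which is why unpack takes n_bases.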



exon

Stores data about exons. Associated with transcripts via exon_transcript. Exon positions are given relative to seq_regions (contigs, for example). Note that seq_region_start is always less than seq_region_end, i.e. when the exon is on the other strand, seq_region_start specifies the 3' end of the exon.
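The strand convention can be sketched in a few lines of Python. The function name exon_ends is hypothetical, and the encoding of the strand as 1 (forward) or -1 (reverse) is an assumption not stated in the description above.

```python
def exon_ends(seq_region_start, seq_region_end, seq_region_strand):
    """Return the (five_prime, three_prime) positions of an exon.

    Assumes the strand is encoded as 1 (forward) or -1 (reverse).
    """
    if seq_region_start > seq_region_end:
        raise ValueError("seq_region_start must not exceed seq_region_end")
    if seq_region_strand == 1:
        return seq_region_start, seq_region_end
    if seq_region_strand == -1:
        # On the reverse strand the stored start is the 3' end.
        return seq_region_end, seq_region_start
    raise ValueError("strand must be 1 or -1")


print(exon_ends(100, 200, 1))
print(exon_ends(100, 200, -1))
```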



exon_transcript

Relationship table linking exons with transcripts. The rank column indicates the 5' to 3' position of the exon within the transcript, i.e. a rank of 1 means the exon is the 5' most within this transcript.
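The following sketch shows how the rank column might be used to list the exons of a transcript in 5' to 3' order. It uses an in-memory SQLite database with a heavily reduced, assumed subset of columns and made-up rows; it is not the real table definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exon (
    exon_id INTEGER PRIMARY KEY,
    seq_region_start INTEGER,
    seq_region_end INTEGER,
    seq_region_strand INTEGER
);
CREATE TABLE exon_transcript (
    exon_id INTEGER,
    transcript_id INTEGER,
    "rank" INTEGER
);
-- Three reverse-strand exons: the 5'-most exon has the highest coordinates.
INSERT INTO exon VALUES (10, 100, 200, -1), (11, 300, 400, -1), (12, 500, 600, -1);
INSERT INTO exon_transcript VALUES (12, 1, 1), (11, 1, 2), (10, 1, 3);
""")


def exons_in_transcript_order(conn, transcript_id):
    """Return (exon_id, start, end) rows ordered 5' to 3' within the transcript."""
    return conn.execute(
        'SELECT e.exon_id, e.seq_region_start, e.seq_region_end '
        'FROM exon_transcript et JOIN exon e ON e.exon_id = et.exon_id '
        'WHERE et.transcript_id = ? ORDER BY et."rank"',
        (transcript_id,),
    ).fetchall()


for row in exons_in_transcript_order(conn, 1):
    print(row)
```

Ordering by rank rather than by seq_region_start keeps the result correct on either strand.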



gene

Allows transcripts to be related to genes.


gene_attrib

Enables storage of attributes that relate to genes.


intron_supporting_evidence

Provides the evidence which we have used to declare an intronic region.


karyotype

Describes bands that can be stained on the chromosome.


meta

Stores data about the data in the current schema. Taxonomy information, version information and the default value for the type column in the assembly table are stored here. Unlike other tables, data in the meta table is stored as key-value pairs. The table also stores (via assembly.mapping keys) the relationships between co-ordinate systems in the assembly table. The species_id field of the meta table is used in multi-species databases and makes it possible to have species-specific meta key-value pairs. Species-specific meta key-value pairs need to be repeated for each species_id. Entries in the meta table that are not specific to any one species, such as the schema_version key and any other schema-related information, must have their species_id field set to NULL. The default species_id, and the only species_id value allowed in single-species databases, is 1.
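A minimal sketch of a key-value lookup that respects the species_id rules described above. The column names meta_key and meta_value, the example keys other than schema_version, and the helper name meta_values are assumptions for illustration; the table is modelled in an in-memory SQLite database rather than the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meta (
    meta_id    INTEGER PRIMARY KEY,
    species_id INTEGER,            -- NULL for entries not tied to one species
    meta_key   TEXT NOT NULL,      -- assumed column name
    meta_value TEXT NOT NULL       -- assumed column name
);
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
    (NULL, 'schema_version', '67'),
    (1, 'example.species_key', 'value for species 1'),
    (2, 'example.species_key', 'value for species 2');
""")


def meta_values(conn, key, species_id=1):
    """Return values for key that are schema-wide or belong to species_id."""
    rows = conn.execute(
        "SELECT meta_value FROM meta "
        "WHERE meta_key = ? AND (species_id = ? OR species_id IS NULL) "
        "ORDER BY meta_id",
        (key, species_id),
    )
    return [value for (value,) in rows]


print(meta_values(conn, "schema_version"))
print(meta_values(conn, "example.species_key", species_id=2))
```

Note the explicit IS NULL test: in SQL a comparison such as species_id = NULL never matches, so schema-wide entries have to be selected separately.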



meta_coord

Describes which co-ordinate systems the different feature tables use.


operon

Allows one or more polycistronic transcripts to be grouped together.


operon_transcript

Represents polycistronic transcripts which belong to operons and encode more than one gene.


operon_transcript_gene

Allows association of genes with polycistronic transcripts.


seq_region

Stores information about sequence regions. The primary key is used as a pointer into the dna table so that actual sequence can be obtained, and the coord_system_id allows sequence regions of multiple types to be stored. Clones, contigs and chromosomes are all now stored in the seq_region table. Contigs are stored with the co-ordinate system 'contig'. The relationships between contigs and clones, between contigs and chromosomes, and between contigs and supercontigs are all stored in the assembly table.


seq_region_attrib

Allows "attributes" to be defined for certain seq_regions. Provides a way of storing extra information about particular seq_regions without adding extra columns to the seq_region table.


transcript

Stores information about transcripts. Has seq_region_start, seq_region_end and seq_region_strand for faster retrieval and to allow storage independently of genes and exons. Note that a transcript is usually associated with a translation, but may not be, e.g. in the case of pseudogenes and RNA genes (those that code for RNA molecules).


transcript_attrib

Enables storage of attributes that relate to transcripts.


translation

Describes which parts of which exons are used in translation. The seq_start and seq_end columns are 1-based offsets into the relative coordinate system of start_exon_id and end_exon_id, i.e. if the translation starts at the first base of the exon, seq_start would be 1. Transcripts are related to translations by the transcript_id key in this table.
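The arithmetic implied by the 1-based offsets can be sketched as follows. This is an illustration under assumptions: exons are identified by their rank within the transcript rather than by start_exon_id and end_exon_id, exon lengths are supplied directly, and the function name coding_length is hypothetical.

```python
def coding_length(exon_lengths, start_rank, seq_start, end_rank, seq_end):
    """Number of translated bases between the two offsets, inclusive.

    exon_lengths lists exon lengths in rank order (rank 1 first);
    seq_start and seq_end are 1-based offsets into the start and end exons.
    """
    if start_rank > end_rank:
        raise ValueError("start exon must not come after end exon")
    if start_rank == end_rank:
        return seq_end - seq_start + 1
    first = exon_lengths[start_rank - 1] - seq_start + 1  # rest of start exon
    middle = sum(exon_lengths[start_rank:end_rank - 1])   # whole exons between
    return first + middle + seq_end                       # plus part of end exon


# Translation starts at base 11 of exon 1 and ends at base 20 of exon 3.
print(coding_length([100, 50, 80], start_rank=1, seq_start=11,
                    end_rank=3, seq_end=20))
```

With seq_start equal to 1 the whole of the start exon is counted, matching the example given in the description.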



translation_attrib

Enables storage of attributes that relate to translations.


unconventional_transcript_association

Describes transcripts that do not link to a single gene in the normal way.



Features and Analyses

A PDF document of the schema is available here.


alt_allele

Stores information about genes on haplotypes that may be orthologous.


analysis

Usually describes a program and some database that together are used to create a feature on a piece of sequence. Each feature is marked with an analysis_id. The most important column is logic_name, which is used by the webteam to render a feature correctly on contigview (or even retrieve the right feature). Logic_name is also used in the pipeline to identify the analysis which has to run in a given status of the pipeline. The module column tells the pipeline which Perl module does the whole analysis, typically a RunnableDB module.
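As an illustration of how features are tied to an analysis, the sketch below selects features by logic_name through their analysis_id. The tables are modelled in an in-memory SQLite database with an assumed, minimal set of columns; the logic names, labels and the helper features_by_logic_name are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE analysis (
    analysis_id INTEGER PRIMARY KEY,
    logic_name  TEXT UNIQUE
);
CREATE TABLE simple_feature (
    simple_feature_id INTEGER PRIMARY KEY,
    analysis_id       INTEGER REFERENCES analysis (analysis_id),
    display_label     TEXT               -- assumed column name
);
INSERT INTO analysis VALUES (1, 'example_analysis_a'), (2, 'example_analysis_b');
INSERT INTO simple_feature VALUES (1, 1, 'f1'), (2, 2, 'f2'), (3, 1, 'f3');
""")


def features_by_logic_name(conn, logic_name):
    """Return the labels of all features produced by the named analysis."""
    rows = conn.execute(
        "SELECT sf.display_label FROM simple_feature sf "
        "JOIN analysis a ON a.analysis_id = sf.analysis_id "
        "WHERE a.logic_name = ? ORDER BY sf.simple_feature_id",
        (logic_name,),
    )
    return [label for (label,) in rows]


print(features_by_logic_name(conn, "example_analysis_a"))
```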



analysis_description

Allows the storage of a textual description of the analysis, as well as a "display label", primarily for the EnsEMBL web site.


density_feature

Describes features representing a density, or percentage coverage etc. in a given region.


density_type

Describes the type of density, percentage coverage etc. that a density feature represents in a given region.


ditag

Represents a ditag object in the EnsEMBL database. Corresponds to original tag containing the full sequence. This can be a single piece of sequence like CAGE tags or a ditag with concatenated sequence from 5' and 3' end like GIS or GSC tags. This data is available as a DAS track in ContigView on the EnsEMBL web site.



ditag_feature

Describes where ditags hit on the genome. Represents a mapped ditag object in the EnsEMBL database. These are the original tags separated into start ("L") and end ("R") parts if applicable, successfully aligned to the genome. Two DitagFeatures usually relate to one parent Ditag. Alternatively, some tags, e.g. CAGE tags, have only a 5' tag ("F").


dna_align_feature

Stores DNA sequence alignments generated from Blast (or Blast-like) comparisons.


map

Stores the names of different genetic or radiation hybrid maps, for which there is marker map information.


marker

Stores data about the marker itself. A marker in Ensembl consists of a pair of primer sequences, an expected product size and a set of associated identifiers known as synonyms.


marker_feature

Used to describe positions of markers on the assembly. Markers are placed on the genome electronically using an analysis program.


marker_map_location

Stores map locations (genetic, radiation hybrid and in situ hybridization) for markers obtained from experimental evidence.


marker_synonym

Stores alternative names for markers, as well as their sources.


misc_attrib

Stores arbitrary attributes about the features in the misc_feature table.


misc_feature

Allows for storage of arbitrary features.


misc_feature_misc_set

This table classifies features into distinct sets.


misc_set

Defines "sets" that the features held in the misc_feature table can be grouped into.


prediction_exon

Stores exons that are predicted by ab initio gene finder programs. Unlike EnsEMBL exons they are not supported by any evidence.


prediction_transcript

Stores transcripts that are predicted by ab initio gene finder programs (e.g. genscan, SNAP). Unlike EnsEMBL transcripts they are not supported by any evidence.


protein_align_feature

Stores translation alignments generated from Blast (or Blast-like) comparisons.


protein_feature

Describes features on the translations (as opposed to the DNA sequence itself), i.e. parts of the peptide. In peptide co-ordinates rather than contig co-ordinates.


qtl

Describes the markers (of which there may be up to three) which define Quantitative Trait Loci. Note that QTL is a statistical technique used to find links between certain expressed traits and regions in a genetic map. A QTL is defined by three markers, two flanking and one peak (optional) marker. It is a region (or more often a group of regions) which is likely to affect the phenotype (trait) described in this QTL.


qtl_feature

Describes Quantitative Trait Loci (QTL) positions as obtained from inbreeding experiments. Note the values in this table are in chromosomal co-ordinates. Also, this table is not populated for all schemas.


qtl_synonym

Describes alternative names for Quantitative Trait Loci (QTLs).


repeat_consensus

Stores consensus sequences obtained from analysing repeat features.


repeat_feature

Describes sequence repeat regions.


simple_feature

Describes general genomic features that don't fit into any of the more specific feature tables.


splicing_event

The splicing event table contains alternative splicing events and constitutive splicing events as reported by the AltSpliceFinder program. Multiple alternative splicing events can be observed on a gene. The location of the splicing event on the seq_region is reported. The type of event is stored in the attrib_type table.


splicing_event_feature

Represents alternative splicing event features. If the event is a constitutive exon, the constitutive exon and the transcript it belongs to are reported in this table. If the event is a cassette exon, the cassette exon and the transcript it belongs to are represented in this table. The transcript association field associates a sequence number with a transcript id. Thus, several exons skipped in an event can be attached to the same transcript. The features are ordered according to their genomic location and this is reflected in the feature order field value.


splicing_transcript_pair

Describes a pair of spliced transcripts in a splicing event. A splicing event is an observation of a change of splice sites between two isoforms. To avoid redundancy, some events, like a skipped exon observed between different pairs of transcripts, are reported only once. The splicing transcript pair table contains a list of all the combinations of 2 isoforms relating to the same event.


supporting_feature

Describes the exon prediction process by linking exons to DNA or protein alignment features. As in several other tables, the feature_id column is a foreign key; the feature_type column specifies which table feature_id refers to.
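Because feature_id can point into more than one table, a query has to pick the target table from feature_type before it can follow the key. The sketch below shows one way this might be done against an in-memory SQLite model. The feature_type values, the primary key column names, the hit_name column and the helper evidence_for_exon are all assumptions for illustration.

```python
import sqlite3

# Whitelist of tables that feature_type may name, with their key columns.
FEATURE_TABLES = {
    "dna_align_feature": "dna_align_feature_id",
    "protein_align_feature": "protein_align_feature_id",
}

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dna_align_feature (
    dna_align_feature_id INTEGER PRIMARY KEY, hit_name TEXT);
CREATE TABLE protein_align_feature (
    protein_align_feature_id INTEGER PRIMARY KEY, hit_name TEXT);
CREATE TABLE supporting_feature (
    exon_id INTEGER, feature_type TEXT, feature_id INTEGER);
INSERT INTO dna_align_feature VALUES (5, 'example_cdna');
INSERT INTO protein_align_feature VALUES (5, 'example_protein');
INSERT INTO supporting_feature VALUES
    (1, 'dna_align_feature', 5),
    (1, 'protein_align_feature', 5);
""")


def evidence_for_exon(conn, exon_id):
    """Return (feature_type, hit_name) pairs supporting the given exon."""
    links = conn.execute(
        "SELECT feature_type, feature_id FROM supporting_feature "
        "WHERE exon_id = ? ORDER BY feature_type, feature_id",
        (exon_id,),
    ).fetchall()
    evidence = []
    for feature_type, feature_id in links:
        key_column = FEATURE_TABLES[feature_type]  # KeyError on unknown types
        row = conn.execute(
            f"SELECT hit_name FROM {feature_type} WHERE {key_column} = ?",
            (feature_id,),
        ).fetchone()
        if row is not None:
            evidence.append((feature_type, row[0]))
    return evidence


print(evidence_for_exon(conn, 1))
```

Note that the same feature_id value (5 here) refers to two unrelated rows, which is why feature_type must always be consulted. Table names are only interpolated into the SQL after being checked against the whitelist.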



transcript_supporting_feature

Describes the exon prediction process by linking transcripts to DNA or protein alignment features. As in several other tables, the feature_id column is a foreign key; the feature_type column specifies which table feature_id refers to.



ID Mapping

A PDF document of the schema is available here.


gene_archive

Contains a snapshot of the stable IDs associated with genes deleted or changed between releases. Includes gene, transcript and translation stable IDs.


mapping_session

Stores details of ID mapping sessions - a mapping session represents the session when stable IDs were mapped from one database to another. Details of the "old" and "new" databases are stored.


mapping_set

Table structure for seq_region mapping between releases.


peptide_archive

Contains the peptides for deleted or changed translations.


seq_region_mapping

Describes how the core seq_region_id values have changed from release to release.


stable_id_event

Represents what happened to all gene, transcript and translation stable IDs during a mapping session. This includes which IDs were deleted, created and related to each other. Each event is represented by one or more rows in the table.



External References

A PDF document of the schema is available here.


dependent_xref

Describes dependent external references which can't be directly mapped to Ensembl entities. They are linked to primary external references instead.


external_db

Stores data about the external databases in which the objects described in the xref table are stored.


external_synonym

Some xref objects can be referred to by more than one name. This table relates names to xref IDs.


identity_xref

Describes how well a particular xref object matches the EnsEMBL object.


object_xref

Describes links between EnsEMBL objects and objects held in external databases. The EnsEMBL object can be one of several types; the type is held in the ensembl_object_type column. The ID of the particular EnsEMBL object (a gene or translation, for example) is given in the ensembl_id column. The xref_id points to the entry in the xref table that holds data about the external object. Each EnsEMBL object can be associated with zero or more xrefs. An xref object can be associated with one or more EnsEMBL objects.
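The chain from an EnsEMBL object to its external references can be sketched as a pair of joins. The example below models object_xref, xref and external_db in an in-memory SQLite database with an assumed, reduced set of columns (external_db_id, db_name and display_label are assumptions, as are the object type strings, the sample rows and the helper name xrefs_for).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE external_db (
    external_db_id INTEGER PRIMARY KEY, db_name TEXT);
CREATE TABLE xref (
    xref_id INTEGER PRIMARY KEY, external_db_id INTEGER, display_label TEXT);
CREATE TABLE object_xref (
    object_xref_id INTEGER PRIMARY KEY,
    ensembl_id INTEGER,
    ensembl_object_type TEXT,
    xref_id INTEGER);
INSERT INTO external_db VALUES (1, 'ExampleDB');
INSERT INTO xref VALUES (100, 1, 'EX0001'), (101, 1, 'EX0002');
INSERT INTO object_xref VALUES
    (1, 7, 'Gene', 100),
    (2, 7, 'Gene', 101),
    (3, 7, 'Translation', 100);
""")


def xrefs_for(conn, object_type, ensembl_id):
    """Return (db_name, display_label) pairs linked to one EnsEMBL object."""
    return conn.execute(
        "SELECT d.db_name, x.display_label "
        "FROM object_xref ox "
        "JOIN xref x ON x.xref_id = ox.xref_id "
        "JOIN external_db d ON d.external_db_id = x.external_db_id "
        "WHERE ox.ensembl_object_type = ? AND ox.ensembl_id = ? "
        "ORDER BY x.xref_id",
        (object_type, ensembl_id),
    ).fetchall()


print(xrefs_for(conn, "Gene", 7))
print(xrefs_for(conn, "Translation", 7))
```

Filtering on both ensembl_object_type and ensembl_id matters here: gene 7 and translation 7 are different objects that happen to share a numeric ID.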



ontology_xref

This table associates evidence tags with the relationship between EnsEMBL objects and ontology accessions (primarily GO accessions). The relationship to GO that is stored in the database is actually derived through the relationship of EnsEMBL peptides to SwissProt peptides, i.e. the relationship is derived like this:

ENSP -> SWISSPROT -> GO

The evidence tag describes the relationship between the SwissProt peptide and the GO entry. In reality, however, we store this in the database like this:

ENSP -> SWISSPROT
ENSP -> GO

and the evidence tag hangs off the relationship between the ENSP and the GO identifier. Some ENSPs are associated with multiple closely related SwissProt entries, which may all be associated with the same GO identifier but with different evidence tags. For this reason a single EnsEMBL - external db object relationship in the object_xref table can be associated with multiple evidence tags in the ontology_xref table.


seq_region_synonym

Allows for storing multiple names for sequence regions.


unmapped_object

Describes why a particular external entity was not mapped to an EnsEMBL one.


unmapped_reason

Describes the reason why a mapping failed.


xref

Holds data about objects which are external to EnsEMBL, but need to be associated with EnsEMBL objects. Information about the database that the external object is stored in is held in the external_db table entry referred to by the external_db column.



Miscellaneous

Other tables


data_file

Allows the storage of flat file locations used to store large quantities of data currently unsuitable in a traditional database table.


interpro

InterPro - The InterPro website


Concepts

co-ordinates

There are several different co-ordinate systems used in the EnsEMBL database and API. For every co-ordinate system, the fundamental unit is one base. The differences between co-ordinate systems lie in where a particular numbered base lies, and the start position it is relative to. CONTIG co-ordinates, also called 'raw contig' co-ordinates or 'clone fragments' are relative to the first base of the first contig of a clone. Note that the numbering is from 1, i.e. the very first base of the first contig of a clone is numbered 1, not 0. In CHROMOSOMAL co-ordinates, the co-ordinates are relative to the first base of the chromosome. Again, numbering is from 1. The seq_region table can store sequence regions in any of the co-ordinate systems defined in the coord_system table.
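Since numbering starts at 1 and both ends of a region are included, lengths and substring extraction differ slightly from the 0-based, end-exclusive conventions of many programming languages. A small Python sketch of the conversion (the function names are hypothetical):

```python
def region_length(start, end):
    """Length of a region given 1-based, inclusive start and end."""
    return end - start + 1


def subsequence(sequence, start, end):
    """Extract bases start..end (1-based, inclusive) from a sequence string."""
    if start < 1 or end < start or end > len(sequence):
        raise ValueError("co-ordinates fall outside the sequence")
    # Python slices are 0-based and exclude the end index.
    return sequence[start - 1:end]


print(region_length(101, 200))
print(subsequence("ACGTACGT", 3, 6))
```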

supercontigs

A supercontig is made up of a group of adjacent or overlapping contigs.

sticky_rank

The sticky_rank differentiates between fragments of the same exon; i.e. for exons that span multiple contigs, all the fragments would have the same ID, but different sticky_rank values.

stable_id

Gene predictions have changed over the various releases of the EnsEMBL databases. To allow the user to track particular gene predictions over changing co-ordinates, each gene-related prediction is given a 'stable identifier'. If a prediction looks similar between two releases, we try to give it the same name, even though it may have changed position and/or had some sequence changes.

cigar_line

This allows the compact storage of gapped alignments by storing the maximum extent of the matches and then a text string which encodes the placement of gaps inside the alignment. Colloquially inside EnsEMBL this is called a cigar line, and its adoption has shrunk the number of rows in the feature table around 4-fold.
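As a rough sketch of how such a string might be decoded, the Python below parses a cigar line into (count, operation) pairs. The exact grammar is an assumption made for illustration: an optional count followed by one of the letters M (match), I (insertion) or D (deletion), with a missing count meaning 1. The function names are hypothetical.

```python
import re


def parse_cigar(cigar_line):
    """Split a cigar line such as '10M2D5M' into (count, operation) pairs."""
    if not re.fullmatch(r"(?:\d*[MID])+", cigar_line):
        raise ValueError(f"unrecognised cigar line: {cigar_line!r}")
    return [
        (int(count) if count else 1, op)
        for count, op in re.findall(r"(\d*)([MID])", cigar_line)
    ]


def matched_length(cigar_line):
    """Total number of positions covered by match (M) operations."""
    return sum(count for count, op in parse_cigar(cigar_line) if op == "M")


print(parse_cigar("10M2D5M"))
print(matched_length("10M2D5M"))
```

One alignment with gaps is thus held in a single row with a short string, rather than as one row per ungapped block.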