Regulatory Build

The regulatory build provides a single 'best guess' set of regulatory features. These features are based on the information contained within the Ensembl regulation database. This document details the methodology used in this process. The complete list of sources of data for the current release is described here.

Regulatory Feature Construction

The 'Regulatory Build' is performed by overlap analysis of annotations from data sets in a two-stage, cell-type-aware manner.

In stage one, core regions are identified across all available cell types using 'focus' features, which are chosen to define a set of potential binding sites. These tend to be broad-coverage, narrowly focused marks which are likely candidates for different types of regulatory elements or motifs. Focus feature types include DNase1, which is known to mark accessible chromatin, TFBSs, and CTCF, which characterises 'insulator/enhancer' elements. As such, the core regions of regulatory features are likely to be positioned on or around any potential regulatory motif. Core regions are extended only in the case of direct overlap with another focus feature. To maintain resolution and to avoid chaining of regulatory features across regions of dense regulatory elements, a 2 kb cut-off is imposed. Exceeding this cut-off causes the offending focus feature to be treated as an attribute feature (see below), so it does not extend the core region.
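The stage-one merging can be sketched as follows. This is a minimal illustration and not the Ensembl implementation: it assumes 1-based inclusive coordinates on a single chromosome, and it assumes the 2 kb cut-off applies to the length of the merged core region (a focus feature longer than the cut-off on its own is also set aside). The function and variable names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive coordinates

MAX_CORE_LENGTH = 2000  # the 2 kb cut-off described in the text


def build_core_regions(focus: List[Interval]) -> Tuple[List[Interval], List[Interval]]:
    """Merge directly overlapping focus features into core regions.

    Returns (cores, demoted). 'demoted' holds focus features that would push
    a core region past the cut-off; they would be handled as attribute
    features in stage two instead of extending the core.
    """
    cores: List[Interval] = []
    demoted: List[Interval] = []
    for start, end in sorted(focus):
        # Assumption: a feature longer than the cut-off cannot seed a core.
        if end - start + 1 > MAX_CORE_LENGTH:
            demoted.append((start, end))
            continue
        if cores and start <= cores[-1][1]:
            core_start, core_end = cores[-1]
            new_end = max(core_end, end)
            if new_end - core_start + 1 > MAX_CORE_LENGTH:
                # Extension would exceed the cut-off: do not extend.
                demoted.append((start, end))
            else:
                cores[-1] = (core_start, new_end)
        else:
            cores.append((start, end))
    return cores, demoted


cores, demoted = build_core_regions(
    [(100, 600), (500, 1200), (1100, 2500), (5000, 5300)]
)
```

In this example the third feature overlaps the growing core but would stretch it to 2401 bp, so it is set aside and the core stays at (100, 1200).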

Stage two extends the structure in a cell-type-specific manner, using 'attribute' features. Attribute features do not define a binding site and are sometimes longer-ranging feature types which are useful for classification, such as histone modifications. If core data exists for a given cell type, a Regulatory Feature is seeded using the core region defined in stage one. The arms or bounds are defined by overlap of attribute features with respect to the core region. Directly overlapping attribute features are said to have one degree of separation. Attributes with two degrees of separation are only included if they are entirely contained within another longer associated attribute feature. This is done to capture information adjacent to and indirectly associated with the core region, whilst avoiding longer-range and potentially anomalous associations.
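A rough sketch of the stage-two extension for one cell type is given below. It assumes 1-based inclusive intervals on one chromosome and treats 'associated' attributes as those directly overlapping the core; the function name and return shape are hypothetical, not part of the Ensembl code.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive coordinates


def extend_core(core: Interval, attributes: List[Interval]) -> Tuple[Interval, List[Interval]]:
    """Return the regulatory feature bounds and the attributes supporting it."""
    core_start, core_end = core
    # One degree of separation: attribute directly overlaps the core region.
    first = [a for a in attributes if a[0] <= core_end and a[1] >= core_start]
    # Two degrees of separation: attribute does not touch the core, but is
    # entirely contained within a directly overlapping attribute.
    second = [
        a for a in attributes
        if a not in first and any(f[0] <= a[0] and a[1] <= f[1] for f in first)
    ]
    included = first + second
    start = min([core_start] + [a[0] for a in included])
    end = max([core_end] + [a[1] for a in included])
    return (start, end), included


bounds, included = extend_core(
    (1000, 1200),
    [(900, 1500), (1300, 1400), (1450, 1700), (3000, 3100)],
)
```

Here (1300, 1400) is kept because it lies wholly inside (900, 1500), whereas (1450, 1700) only partially overlaps that attribute and is left out, which stops the feature from chaining further along the chromosome.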

For some cell lines where there is no core data available, but there is substantial other attribute data present, a projection build method is employed. This involves projecting the core region defined by the other cell lines to the 'sparse' cell line. The attribute extension detailed above is then carried out using the projected core region.

These two stages give rise to regulatory FeatureSets for the core 'MultiCell' features and for each available cell type.

Regulatory Feature Annotation

Regulatory Features (regfeats) are classified by considering their position on the genome in relation to other classes of feature on the genome (e.g. genes, repeats, intergenic regions), together with the combination of regulatory attributes they possess as coded in their binary_string. In the binary string, each position corresponds to a particular focus or attribute feature, and a value of 1 indicates that the regulatory feature overlaps this particular type of focus or attribute feature. A set of randomly distributed features (mockfeats) corresponding to the regfeats in terms of length and chromosome is also generated so that we can judge whether the placement of regfeats in relation to the genomic features is non-random.
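As an illustration of the encoding, the sketch below builds a binary string from the set of feature types a regfeat overlaps. The list of feature types and its ordering are assumptions made for the example; the real ordering is defined by the build for each release.

```python
from typing import List, Set

# Hypothetical ordering of focus and attribute feature types.
FEATURE_TYPES: List[str] = ["DNase1", "CTCF", "H3K4me3", "H3K36me3"]


def to_binary_string(overlapped: Set[str]) -> str:
    """Encode overlapped feature types as one bit per position."""
    return "".join("1" if ft in overlapped else "0" for ft in FEATURE_TYPES)


example = to_binary_string({"DNase1", "H3K4me3"})
```

With this ordering, a regfeat overlapping DNase1 and H3K4me3 only is encoded as 1010.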

The first step in the procedure is to determine which genomic features (genfeats) each regfeat overlaps. A single common basepair is sufficient to consider two features overlapping. We do the same with the mockfeats. (Strictly speaking this is not the first step: we know from experience that certain regulatory features are most probably artefacts and that others contain no useable information, so these are filtered out before the procedure begins, and the mockfeats correspond only to the filtered set of regfeats.)
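The overlap rule and the generation of mockfeats might look like the sketch below. It assumes 1-based inclusive coordinates, uniform random placement along the same chromosome, and a caller-supplied table of chromosome lengths; these details and all names are illustrative assumptions, since the text only states that mockfeats match regfeats in length and chromosome.

```python
import random
from typing import Dict, List, Tuple

Feature = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if the two intervals share at least one basepair."""
    return a_start <= b_end and b_start <= a_end


def make_mockfeats(regfeats: List[Feature], chrom_lengths: Dict[str, int],
                   seed: int = 0) -> List[Feature]:
    """Place one random feature per regfeat, same chromosome and length."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    mock: List[Feature] = []
    for chrom, start, end in regfeats:
        length = end - start + 1
        new_start = rng.randint(1, chrom_lengths[chrom] - length + 1)
        mock.append((chrom, new_start, new_start + length - 1))
    return mock


lengths = {"1": 1000, "2": 500}
mock = make_mockfeats([("1", 100, 299), ("2", 50, 59)], lengths, seed=1)
```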

Next we create a set of patterns of attributes we wish to evaluate. Currently this is all the patterns which occur in the display labels more than once, plus all the patterns which can be created by re-setting one bit of the existing patterns from 1 to 0.
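A minimal sketch of this pattern enumeration is shown below. It assumes that the 'existing patterns' from which bits are reset are the ones occurring more than once; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def candidate_patterns(binary_strings: Iterable[str]) -> Set[str]:
    """Patterns seen more than once, plus each of them with one 1 reset to 0."""
    counts = Counter(binary_strings)
    base = {p for p, n in counts.items() if n > 1}
    derived = set()
    for p in base:
        for i, bit in enumerate(p):
            if bit == "1":
                derived.add(p[:i] + "0" + p[i + 1:])
    return base | derived


patterns = candidate_patterns(["110", "110", "011", "101", "101"])
```

In this example "011" occurs only once and so contributes nothing, while "110" and "101" each contribute themselves and their one-bit relaxations.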

For each pattern, we look at all the regulatory features which have the same or more bits set. If there are more than 100 such regfeats, we count the number of times these features overlap each class of genfeat. We do the same count with the set of mockfeats which correspond to the regfeats. If >50% of the regfeats overlap a particular class of genfeat and the chi-squared statistic calculated using the mockfeat count as the 'expected' value is >8.0 (P<0.005), we record that this pattern is associated with this class of genfeat.
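The decision rule for one pattern and one class of genfeat can be sketched as follows. The text does not give the exact form of the chi-squared statistic, so this example assumes a two-category goodness-of-fit test (overlapping versus not overlapping, one degree of freedom) with the mockfeat counts as expected values. That choice, and all names used, are assumptions for illustration.

```python
MIN_REGFEATS = 100   # more than 100 matching regfeats are required
MIN_FRACTION = 0.5   # more than 50% must overlap the genfeat class
MIN_CHI2 = 8.0       # chi-squared threshold quoted in the text


def matches(pattern: str, bits: str) -> bool:
    """True if 'bits' has every bit of 'pattern' set (same or more bits)."""
    p = int(pattern, 2)
    return (int(bits, 2) & p) == p


def chi_squared(observed: int, expected: int, total: int) -> float:
    """Two-category goodness of fit: overlapping vs. non-overlapping."""
    chi2 = 0.0
    for o, e in ((observed, expected), (total - observed, total - expected)):
        if e == 0:
            if o != 0:
                return float("inf")
            continue
        chi2 += (o - e) ** 2 / e
    return chi2


def is_associated(n_regfeats: int, reg_overlaps: int, mock_overlaps: int) -> bool:
    """Apply the count, fraction and chi-squared criteria in turn."""
    if n_regfeats <= MIN_REGFEATS:
        return False
    if reg_overlaps / n_regfeats <= MIN_FRACTION:
        return False
    return chi_squared(reg_overlaps, mock_overlaps, n_regfeats) > MIN_CHI2
```

For example, `is_associated(200, 150, 60)` passes all three criteria, whereas `is_associated(200, 110, 100)` fails because the regfeat and mockfeat counts are too similar.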

If the pattern is associated with a genfeat, we collect a second set of patterns which have this pattern's bits plus any other bits set. For each of these patterns we look at all the regulatory features which have the same or more bits set, and we count the number of times these features overlap each class of genfeat. If less than 50% of the regfeats overlap, we record that this second pattern is not associated with the class of genfeat involved.

Having determined all the associated and non-associated patterns for each class of genfeat, we look at all the regfeats and use the 'associated' and then 'not-associated' patterns to set or unset a flag indicating whether the particular regfeat is associated with a particular class of genfeat. During this process it is possible for a given regfeat to be associated with more than one class of genfeat and some of these can be contradictory. This is particularly the case where all or nearly all the bits are set.
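The second pass and the final flagging can be sketched together. The example assumes the flag for one class of genfeat is first set by any matching 'associated' pattern and then unset by any matching 'not-associated' pattern, in that order; the helper names are hypothetical.

```python
from typing import Iterable, List


def matches(pattern: str, bits: str) -> bool:
    """True if 'bits' has every bit of 'pattern' set (same or more bits)."""
    p = int(pattern, 2)
    return (int(bits, 2) & p) == p


def stricter_patterns(pattern: str, candidates: Iterable[str]) -> List[str]:
    """Candidates with all of this pattern's bits plus at least one more."""
    return sorted(c for c in candidates if c != pattern and matches(pattern, c))


def flag_regfeat(bits: str, associated: Iterable[str],
                 not_associated: Iterable[str]) -> bool:
    """Set the flag from associated patterns, then unset from not-associated."""
    flag = any(matches(p, bits) for p in associated)
    if flag and any(matches(p, bits) for p in not_associated):
        flag = False
    return flag


second_set = stricter_patterns("100", ["100", "110", "101", "011"])
flags = [flag_regfeat(b, {"100"}, {"110"}) for b in ("101", "110", "111", "010")]
```

Here a regfeat with bits 101 stays flagged, while 110 and 111 are unflagged because they also match the not-associated pattern 110. This function handles one class of genfeat at a time; conflicts between classes are left to the rules described next.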

Finally, for the purposes of the regulatory build, there is a set of rules which 1. resolve conflicts amongst the above flags and 2. assign a regulatory feature_type to the regfeat. The following types are currently in use:

Classification: Description
Promoter Associated: Patterns over-represented in the region of the transcription start site plus or minus 2500 bp upstream of protein coding genes, but not in the downstream gene body. Likely to be a 5' proximal promoter.
Gene Associated: Patterns over-represented in gene bodies. Often represent gene's transcriptional activity (expressed/repressed).
Non-gene Associated: Patterns over-represented in non-gene regions. Likely to correspond to a distal regulatory element such as an insulator or enhancer.
Polymerase III Associated: Patterns over-represented in regions 2500 bp upstream of PolIII transcribed regions e.g. tRNAs. Likely to correspond to a proximal regulatory element specifically associated to Polymerase III transcription.
Unclassified: Patterns which are currently unclassifiable.

At present only cell-type-specific regulatory features are classified, as different cell types may give conflicting signals reflecting their unique combination of regulatory and transcriptional states.

These data sets can be displayed along the chromosome in 'Region in Detail', displayed for a gene in the 'Regulation View', or mined from the functional genomics database.

Transcription Factor Binding Site Annotation

For each transcription factor (TF) which has both a ChIP-seq data set in the functional genomics database and a publicly available position weight matrix (PWM) we have annotated the position of putative TF binding sites within the peaks called using the ChIP-seq reads.

Initially, PWMs (currently taken from JASPAR, Bryne et al, 2008) are mapped to the genome using the find_pssm_dna program from the MOODS software (Korhonen et al, 2009) with the -f flag set and a permissive threshold of 0.001. We then filter these mappings using a log odds score threshold. The threshold is derived per PWM by considering the occurrence of mappings in a sample of randomly positioned 'background' sequences matched in terms of size and chromosome to the ChIP-seq peaks for this TF. We select the threshold such that the proportion of these background sequences containing a mapping is approximately 5%. Only mappings which overlap the corresponding ChIP-seq peaks are included in the functional genomics database.
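One way to derive such a per-PWM threshold is sketched below. It assumes that the best log odds score of any mapping has already been computed for each background sequence (with negative infinity where no mapping was found), and it picks the score at which roughly 5% of background sequences still contain a mapping. This is an illustrative procedure with hypothetical names, not necessarily the one used in the build.

```python
from typing import List


def select_threshold(best_scores: List[float], target: float = 0.05) -> float:
    """Score cut-off leaving about 'target' of background sequences with a hit.

    best_scores holds the best mapping score per background sequence;
    use float('-inf') for sequences without any mapping.
    """
    ranked = sorted(best_scores, reverse=True)
    k = max(1, round(target * len(ranked)))  # number of sequences allowed to pass
    return ranked[k - 1]


background = [float(s) for s in range(100)]  # toy scores 0.0 .. 99.0
threshold = select_threshold(background)
passing = sum(s >= threshold for s in background) / len(background)
```

Mappings scoring at or above the returned threshold would then be kept, provided they also overlap a ChIP-seq peak for the same TF. Ties at the threshold score can push the passing fraction slightly above the target.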

PWM features (or MotifFeatures) are visualized in the browser as black boxes within regulatory features and TF evidence peaks. Clicking on the black box will highlight specific information in the popup menu, such as the matching score (relative to the optimal site). More information on the TFBS annotation process can be seen here. Information regarding the PWMs used and their correspondence to transcription factors can be seen in the data source documentation.