
Large File Formats

Ensembl supports a number of large (multi-gigabyte) file formats for datasets that are too large to upload directly to the Ensembl servers. The file remains on the remote server and is accessed via its URL, with only small chunks of data being requested by the Ensembl webcode at any one time.
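
The chunked access described above can be illustrated with an HTTP byte-range request. This is a minimal sketch, not Ensembl's own web code: the helper names `range_header` and `fetch_chunk` are hypothetical, and it assumes the remote server honours the standard `Range` header over HTTP.

```python
import urllib.request


def range_header(start: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes beginning at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # Byte ranges are inclusive at both ends, hence the -1.
    return {"Range": f"bytes={start}-{start + length - 1}"}


def fetch_chunk(url: str, start: int, length: int) -> bytes:
    """Request one slice of a remote file instead of the whole file.

    Assumption: the server supports range requests. A server that
    ignores the header would send the complete file instead.
    """
    request = urllib.request.Request(url, headers=range_header(start, length))
    with urllib.request.urlopen(request) as response:
        return response.read()
```

An index file is what tells a client which byte offsets to ask for, so that only the data overlapping the region on display has to be transferred.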

All of these formats can be viewed in Ensembl by creating a custom data track with the 'Attach Remote File' function, found in the left-hand menu that appears when you click "Manage Your Data".


BAM format

BAM is the compressed binary version of the SAM (Sequence Alignment/Map) format. BAM uses an index file to give fast access to small sections of the file.

Additional information about SAM/BAM is available at the SAMtools development site.

BigWig format

The BigWig format is designed for dense, continuous data that is intended to be displayed as a graph. Files can be created from WIG or BedGraph files using the appropriate utility program.
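
The page does not name the utility programs. The sketch below assumes the UCSC command-line tools `wigToBigWig` and `bedGraphToBigWig`, which both take an input file, a chromosome sizes file and an output file name. The helper names and the choice of a `.bw` output extension are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumed converters: the UCSC utilities, selected by input file extension.
CONVERTERS = {
    ".wig": "wigToBigWig",
    ".bedgraph": "bedGraphToBigWig",
}


def bigwig_command(source: str, chrom_sizes: str) -> list:
    """Return the converter command line for a WIG or BedGraph file."""
    path = Path(source)
    tool = CONVERTERS.get(path.suffix.lower())
    if tool is None:
        raise ValueError(f"unsupported input type: {path.suffix!r}")
    # Argument order: input file, chromosome sizes file, output file.
    return [tool, str(path), chrom_sizes, str(path.with_suffix(".bw"))]


def convert_to_bigwig(source: str, chrom_sizes: str) -> None:
    """Run the conversion; requires the chosen tool to be on the PATH."""
    subprocess.run(bigwig_command(source, chrom_sizes), check=True)
```

The chromosome sizes file is a two-column, tab-separated list of sequence names and lengths for the assembly the data was mapped to.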

Ensembl currently allows the following configuration options to be set when you attach the file:

  • Track colour (10 choices)

VCF format

The VCF format is a tab-delimited format for storing variant calls and individual genotypes. It can store all types of variant call, from single-nucleotide variants to large-scale insertions and deletions.

More information on this format, which is still under development, can be obtained from the 1000 Genomes Wiki.

Please note the following:

  • You will need an index file with the extension .vcf.gz.tbi in the same directory as your data file, and with the same name (for example, my_file.vcf.gz and my_file.vcf.gz.tbi)
  • Ensembl does not support the https protocol; please use an http:// or ftp:// URL.
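
The two requirements above can be checked before attaching a file. This is a minimal sketch: the function name `expected_index_url` is hypothetical, and it assumes the data file name ends in .vcf.gz and that the index sits beside it on the same server.

```python
from urllib.parse import urlparse

# Protocols this page says are accepted; https is not supported.
ALLOWED_SCHEMES = {"http", "ftp"}


def expected_index_url(vcf_url: str) -> str:
    """Validate a remote VCF URL and return where its index should be."""
    parsed = urlparse(vcf_url)
    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported protocol: {parsed.scheme!r}")
    if not parsed.path.endswith(".vcf.gz"):
        raise ValueError("expected a bgzip-compressed file ending in .vcf.gz")
    # Same directory, same name, with .tbi appended.
    return parsed._replace(path=parsed.path + ".tbi").geturl()
```

For instance, `expected_index_url("http://example.org/data/my_file.vcf.gz")` points at `my_file.vcf.gz.tbi` in the same directory, while an https:// URL raises `ValueError`.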

To produce the indexed VCF file with the .gz.tbi extension, follow these steps:

  • Compress your VCF file using bgzip
  • Index the vcf.gz file using tabix. You will need to pass the option -p vcf to tabix, for example "/usr/bin/tabix -p vcf my_file.vcf.gz"
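
The two steps can be scripted as follows. This is a sketch under the assumption that bgzip and tabix are installed and on the PATH. The helper names are hypothetical, and only the `-p vcf` option comes from the instructions above.

```python
import subprocess


def indexing_commands(vcf_path: str) -> list:
    """Return the bgzip and tabix command lines for an uncompressed VCF."""
    compressed = vcf_path + ".gz"
    return [
        # Step 1: compress; by default bgzip replaces the input with <name>.gz.
        ["bgzip", vcf_path],
        # Step 2: index with the VCF preset; tabix writes <name>.gz.tbi.
        ["tabix", "-p", "vcf", compressed],
    ]


def index_vcf(vcf_path: str) -> None:
    """Run both steps in order, stopping if either command fails."""
    for command in indexing_commands(vcf_path):
        subprocess.run(command, check=True)
```

Calling `index_vcf("my_file.vcf")` should leave my_file.vcf.gz and my_file.vcf.gz.tbi side by side, ready to be placed in the same directory on the remote server.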